Initial commit

Marius 1 day ago
Current commit
b754e816d2
5 files changed, including 633 insertions and 0 deletions
  1. 1 0
      .env.example
  2. 333 0
      README.md
  3. 2 0
      requirements.txt
  4. 297 0
      services_uptime_monitor.py
  5. Binary
      uptime_history.db

+ 1 - 0
.env.example

@@ -0,0 +1 @@
+NOTION_TOKEN=secret_your_notion_integration_token_here

+ 333 - 0
README.md

@@ -0,0 +1,333 @@
+# Notion Uptime Monitor
+
+Monitors the availability of self-hosted services on `smallmountains.de` and publishes live status to a Notion dashboard. A Python script runs every 5 minutes via cron, checks each service, stores history locally in SQLite, and writes the current status + rolling uptime percentages to Notion.
+
+**Notion Dashboard:** https://app.notion.com/p/38210a5f51bd807bae1edb699d9591e8
+
+---
+
+## Table of Contents
+
+1. [User Guide](#user-guide)
+2. [Architecture](#architecture)
+3. [Services](#services)
+4. [Technical Reference](#technical-reference)
+5. [Deployment (NAS / Docker)](#deployment-nas--docker)
+6. [Local Development](#local-development)
+7. [Adding or Removing Services](#adding-or-removing-services)
+8. [Troubleshooting](#troubleshooting)
+
+---
+
+## User Guide
+
+### Reading the Dashboard
+
+The Notion page has two sections:
+
+**Last Updated callout** — shows the timestamp of the most recent check run. If this is stale by more than 10 minutes, the monitoring script may have stopped.
+
+**Service Status table** — one row per service with these columns:
+
+| Column | What it means |
+|--------|---------------|
+| **Service** | Name of the monitored service |
+| **Status** | `Online` (green), `Offline` (red), or `Unknown` (gray, no data yet) |
+| **Response Time (ms)** | How long the most recent HTTP request took when a response was received (including 5xx responses). Empty for UDP services (Factorio) and for requests that failed without a response. |
+| **Last Checked** | Exact timestamp of the last check for that service |
+| **Uptime 24h %** | Percentage of checks in the last 24 hours where the service was reachable |
+| **Uptime 7d %** | Same, over the last 7 days |
+| **Uptime 30d %** | Same, over the last 30 days |
+
+Uptime percentages are blank only until a service's first check has been recorded. Early values rest on very few samples and become meaningful as history accumulates.
+
+### What counts as "Online"?
+
+- **HTTP services:** The service returns an HTTP response with status code below 500 (2xx, 3xx, 4xx all count as online — a 4xx means the server is up but rejected the request, which is fine for monitoring purposes). A network error or 5xx response counts as offline.
+- **Factorio (UDP):** A probe packet is sent to `localhost:34197`. If the OS returns an immediate port-unreachable error, Factorio is offline. If the probe times out (the server is running but ignores unknown packets), the host is checked via ICMP ping as a tiebreaker.
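The HTTP rule above reduces to a single comparison on the status code. A minimal sketch, where the function name `is_online_status` is hypothetical (it does not appear in the script) and a network error with no response is modeled as `None`:

```python
from typing import Optional


def is_online_status(status_code: Optional[int]) -> bool:
    """Classify an HTTP result: any response below 500 counts as online.

    `None` stands in for a network error where no response arrived.
    """
    if status_code is None:
        return False
    return status_code < 500


for code in (200, 302, 404, 500, 503, None):
    print(code, "online" if is_online_status(code) else "offline")
```

A 404 is treated as online here because the server answered; only 5xx responses and missing responses count as offline.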
+
+### Typical causes of false "Offline" readings
+
+- A service restarted mid-check (transient)
+- SSL certificate expired on a service
+- The NAS itself was rebooting (all services offline simultaneously)
+- Gitea is intentionally down for maintenance
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────┐
+│  Ugreen NAS (Docker)                    │
+│                                         │
+│  ┌─────────────┐   every 5 min (cron)   │
+│  │ monitor.py  │──────────────────────┐ │
+│  └──────┬──────┘                      │ │
+│         │ checks                      │ │
+│  ┌──────▼──────────────────────────┐  │ │
+│  │ Services                        │  │ │
+│  │  • HTTP GET → status code       │  │ │
+│  │  • UDP probe → port reachable?  │  │ │
+│  └──────┬──────────────────────────┘  │ │
+│         │ results                     │ │
+│  ┌──────▼──────┐                      │ │
+│  │  SQLite DB  │ uptime_history.db    │ │
+│  │  (35d ring) │                      │ │
+│  └──────┬──────┘                      │ │
+│         │ computes uptime %           │ │
+└─────────┼─────────────────────────────┘ │
+          │                               │
+          │ Notion API (HTTPS)            │
+          ▼                               │
+┌─────────────────────────────────────────┤
+│  Notion — Services database             │
+│  (one row per service, live data)       │
+└─────────────────────────────────────────┘
+```
+
+**Data flow per run:**
+
+1. Script starts and reads the current service list from the Notion Services database
+2. Opens (or creates) `uptime_history.db` and prunes records older than 35 days
+3. For each service: runs the appropriate check and records the result in SQLite
+4. Queries SQLite to compute the 24h / 7d / 30d uptime percentages from stored history
+5. Updates the corresponding row in the Notion database via the Notion API
+6. Scans the Uptime Tracker page blocks to find the "Last Updated" callout and refreshes its timestamp
+7. Exits; the next run is triggered by cron
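The per-run data flow above can be sketched with the check and the Notion update replaced by stand-ins. This is an illustration only: `run_once`, `check`, and `publish` are hypothetical names, the database is in-memory, and pruning and the uptime query are left out for brevity.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable, Optional


def run_once(
    conn: sqlite3.Connection,
    services: list[str],
    check: Callable[[str], tuple[bool, Optional[float]]],
    publish: Callable[[str, bool, Optional[float]], None],
) -> None:
    """One monitoring pass: check, record in SQLite, then publish each result."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, service_name TEXT NOT NULL, "
        "checked_at TEXT NOT NULL, is_online INTEGER NOT NULL, response_time_ms REAL)"
    )
    now_iso = datetime.now(timezone.utc).isoformat()
    for name in services:
        is_online, response_ms = check(name)
        conn.execute(
            "INSERT INTO checks (service_name, checked_at, is_online, response_time_ms) "
            "VALUES (?, ?, ?, ?)",
            (name, now_iso, int(is_online), response_ms),
        )
        conn.commit()
        publish(name, is_online, response_ms)


# Stand-ins for the real check and the Notion update (both hypothetical).
published: list[tuple[str, bool, Optional[float]]] = []
conn = sqlite3.connect(":memory:")
run_once(
    conn,
    ["Plex", "Factorio"],
    check=lambda name: (name != "Factorio", 183.0 if name != "Factorio" else None),
    publish=lambda name, ok, ms: published.append((name, ok, ms)),
)
print(published)
```

Recording the result before publishing means a failed Notion update does not lose the local history entry.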
+
+---
+
+## Services
+
+Services are **configured entirely in the Notion database** — the script reads them fresh on every run. No code changes are needed to add, rename, or reconfigure a service.
+
+Each row in the Services database has a **Check Type** column (`HTTP` or `UDP`) that controls how the service is checked:
+
+| Service | Check Type | URL / Endpoint |
+|---------|-----------|----------------|
+| Plex | HTTP | `https://plex.smallmountains.de` |
+| Gitea | HTTP | `https://git.smallmountains.de` |
+| Audiobookshelf | HTTP | `https://audiobooks.smallmountains.de` |
+| Sounds | HTTP | `https://sounds.smallmountains.de` |
+| Cloud (NAS) | HTTP | `https://cloud.smallmountains.de` |
+| Kitchenowl | HTTP | `https://home.smallmountains.de` |
+| Factorio | UDP | `localhost:34197` (see note below) |
+
+**Factorio UDP note:** External UDP checks are unreliable for Factorio because:
+- The server ignores unrecognized UDP packets (no standard "ping" response)
+- ICMP is blocked on the host (common for game servers)
+
+The URL is set to `localhost:34197` so the check runs locally on the NAS, where the OS returns an immediate port-unreachable error when the container is down. To use a different host, just edit the URL field of the Factorio row in Notion.
+
+---
+
+## Technical Reference
+
+### Files
+
+```
+.
+├── services_uptime_monitor.py   # Main monitoring script
+├── requirements.txt             # Python dependencies
+├── uptime_history.db            # SQLite database (created on first run)
+└── README.md                    # This file
+```
+
+### Configuration
+
+Three constants are hardcoded at the top of `services_uptime_monitor.py`:
+
+| Variable | Value | Description |
+|----------|-------|-------------|
+| `NOTION_TOKEN` | `secret_b7Pi…` | Notion integration API token for "PyBot" |
+| `NOTION_DATA_SOURCE_ID` | `22174dd2…` | Collection ID of the Services database — the script queries this on every run to get the current service list |
+| `UPTIME_PAGE_ID` | `38210a5f…` | Notion page ID of the Uptime Tracker |
+
+All service configuration (name, URL, check type) lives in the Notion database and is read dynamically — see [Adding or Removing Services](#adding-or-removing-services).
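The commit also ships a `.env.example` with a `NOTION_TOKEN` entry, although the script in this commit hardcodes the token. As a sketch of how the token could instead be read from the environment: the helper name `load_notion_token` is hypothetical, and nothing here loads the `.env` file itself, so the variable is assumed to be exported already.

```python
import os
from typing import Mapping, Optional


def load_notion_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return NOTION_TOKEN from the environment, failing loudly if it is unset."""
    env = os.environ if env is None else env
    token = env.get("NOTION_TOKEN", "").strip()
    if not token:
        raise RuntimeError("NOTION_TOKEN is not set; see .env.example")
    return token


# Example with an explicit mapping instead of the real process environment.
print(load_notion_token({"NOTION_TOKEN": "secret_example"}))
```

Keeping the token out of the source file would also keep it out of the repository history.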
+
+### Dependencies
+
+```
+notion-client>=2.2.1   # Official Notion Python SDK
+requests>=2.31.0       # HTTP checks
+```
+
+### SQLite Schema
+
+The `uptime_history.db` file contains one table:
+
+```sql
+CREATE TABLE checks (
+    id               INTEGER PRIMARY KEY AUTOINCREMENT,
+    service_name     TEXT    NOT NULL,
+    checked_at       TEXT    NOT NULL,   -- ISO-8601 UTC datetime
+    is_online        INTEGER NOT NULL,   -- 1 = online, 0 = offline
+    response_time_ms REAL                -- NULL for UDP checks
+);
+
+CREATE INDEX idx_service_time ON checks (service_name, checked_at);
+```
+
+Records older than 35 days are pruned on each run. At 5-minute intervals across 7 services, steady-state size is roughly **2 MB**.
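The steady-state row count behind that estimate follows directly from the retention window and the check interval. A quick calculation (it derives only the number of rows; the bytes per row depend on SQLite's storage details and are not computed here, and the constant names are illustrative):

```python
RETENTION_DAYS = 35      # records older than this are pruned
INTERVAL_MINUTES = 5     # cron schedule
SERVICE_COUNT = 7        # services listed in the README table

runs_per_day = 24 * 60 // INTERVAL_MINUTES
rows = RETENTION_DAYS * runs_per_day * SERVICE_COUNT
print(runs_per_day, rows)  # 288 70560
```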
+
+### Notion Integration
+
+The script uses the Notion API v1 via `notion-client`. It accesses two resources:
+
+**Services database** (ID: `774cb57bfa2c43058d400ed8ce3165d5`)  
+One row per service. The script calls `pages.update()` with the page ID of each row to set:
+
+```python
+{
+    "Status":             {"select": {"name": "Online" | "Offline"}},
+    "Last Checked":       {"date": {"start": "<ISO-8601 UTC>"}},
+    "Response Time (ms)": {"number": <float | null>},
+    "Uptime 24h %":       {"number": <0.0–1.0 | null>},   # Notion stores % as fraction
+    "Uptime 7d %":        {"number": <0.0–1.0 | null>},
+    "Uptime 30d %":       {"number": <0.0–1.0 | null>},
+}
+```
+
+**Uptime Tracker page** (ID: `38210a5f51bd807bae1edb699d9591e8`)  
+The script scans top-level blocks, finds the one containing "Last Updated", and calls `blocks.update()` to replace its rich text with the current timestamp.
+
+### Check Logic
+
+**HTTP check (`check_http`)**
+```
+GET <url> with 10s timeout, follow redirects
+→ status < 500   : online  (returns True, response_time_ms)
+→ status >= 500  : offline (returns False, response_time_ms)
+→ exception      : offline (returns False, None)
+```
+
+**UDP check (`check_udp`)**
+```
+UDP socket → connect to host:port → send 4-byte probe
+→ response received        : online
+→ ConnectionRefusedError   : offline (ICMP port-unreachable)
+→ socket.timeout           : fallback to ICMP ping
+    → ping succeeds        : online  (host up, Factorio ignoring probe)
+    → ping fails           : offline
+→ socket.gaierror / OSError: offline (DNS failure or network error)
+```
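The three outcomes of the UDP probe can be observed with a connected datagram socket. The sketch below is an assumption-laden demo rather than the script's code: `probe_udp` is a hypothetical name, it reports the raw outcome instead of falling back to ping, and it is exercised against a throwaway local responder rather than a Factorio server.

```python
import socket
import threading


def probe_udp(host: str, port: int, timeout: float = 5.0) -> str:
    """Send a 4-byte probe and report 'response', 'refused', or 'timeout'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((host, port))
        sock.send(b"\x00\x00\x00\x00")
        try:
            sock.recv(1024)
            return "response"
        except ConnectionRefusedError:
            return "refused"   # OS reported ICMP port-unreachable
        except socket.timeout:
            return "timeout"   # the script would now fall back to ping


# Demo against a throwaway local responder bound to an OS-chosen port.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(5.0)
port = server.getsockname()[1]


def answer_once() -> None:
    """Echo a single datagram back to its sender."""
    data, addr = server.recvfrom(1024)
    server.sendto(data, addr)


thread = threading.Thread(target=answer_once, daemon=True)
thread.start()
result = probe_udp("127.0.0.1", port)
thread.join()
server.close()
print(result)
```

Connecting the UDP socket is what lets the operating system deliver a port-unreachable error to the caller; an unconnected socket would usually just time out.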
+
+**Uptime calculation**
+```
+uptime(hours) = (checks where is_online=1 in last <hours>h)
+              / (total checks in last <hours>h)
+```
+Returns `None` (shown as blank in Notion) when the window contains no checks.
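A worked example of the uptime calculation against an in-memory database. The query matches the one in the script; the extra `now` parameter and the simplified three-column table are assumptions added here so the example is deterministic.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_uptime(conn, service_name: str, hours: int, now: datetime) -> Optional[float]:
    """Fraction of online checks within the last `hours`, or None without data."""
    cutoff = (now - timedelta(hours=hours)).isoformat()
    total, online = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(is_online), 0) FROM checks "
        "WHERE service_name = ? AND checked_at >= ?",
        (service_name, cutoff),
    ).fetchone()
    return online / total if total else None


now = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)  # fixed clock for the example
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checks (service_name TEXT, checked_at TEXT, is_online INTEGER)")
samples = [(1, 1), (2, 1), (3, 0), (30, 0)]  # (hours ago, is_online)
conn.executemany(
    "INSERT INTO checks VALUES ('Plex', ?, ?)",
    [((now - timedelta(hours=h)).isoformat(), ok) for h, ok in samples],
)

uptime_24h = compute_uptime(conn, "Plex", 24, now)       # 2 of 3 checks online
uptime_7d = compute_uptime(conn, "Plex", 7 * 24, now)    # 2 of 4 checks online
uptime_none = compute_uptime(conn, "Gitea", 24, now)     # no rows, so None
print(uptime_24h, uptime_7d, uptime_none)
```

The string comparison on `checked_at` works because every timestamp is written in the same ISO-8601 UTC format, so lexicographic order matches chronological order.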
+
+---
+
+## Deployment (NAS / Docker)
+
+The script is designed to run inside a Docker container on the Ugreen NAS. The NAS has a watched folder — any `.py` files placed there are executed by a pre-existing Python runner container.
+
+### Steps
+
+1. **Copy files to the watched folder:**
+   ```
+   services_uptime_monitor.py
+   requirements.txt
+   ```
+
+2. **Install dependencies** (once, inside the container):
+   ```bash
+   pip3 install -r requirements.txt
+   ```
+
+3. **Set up cron** to run every 5 minutes:
+   ```cron
+   */5 * * * * /usr/local/bin/python3 /path/to/services_uptime_monitor.py >> /path/to/monitor.log 2>&1
+   ```
+   Adjust the Python path with `which python3` if needed.
+
+4. **Verify Factorio access:** By default the script checks `localhost:34197`. If the Factorio container is on a different Docker network, change the URL field of the Factorio row in Notion to the container's hostname or IP, keeping the `:34197` port.
+
+5. **Check the first run:**
+   ```bash
+   python3 services_uptime_monitor.py
+   ```
+   Expected output:
+   ```
+   Loaded 7 service(s) from Notion.
+   [2026-06-17 21:37 UTC] Running uptime checks...
+     Plex                 ONLINE   183ms
+     Gitea                ONLINE   145ms
+     ...
+     Factorio             ONLINE   —
+   Done.
+   ```
+
+### Persistence
+
+The SQLite database (`uptime_history.db`) must survive container restarts to preserve uptime history. Mount the folder containing the script as a persistent volume, or move `DB_PATH` to a dedicated data directory:
+
+```python
+DB_PATH = Path("/data/uptime_history.db")  # example override
+```
+
+---
+
+## Local Development
+
+To test the script on a Mac:
+
+```bash
+# Install dependencies
+pip3 install -r requirements.txt
+
+# Run once
+python3 services_uptime_monitor.py
+```
+
+**Expected on Mac:** All HTTP services show their real status. Factorio shows **Offline**, which is correct because nothing is listening on `localhost:34197` locally. To test against the real server, temporarily change the URL field of the Factorio row in Notion from `localhost:34197` to `game.smallmountains.de:34197`. Because the service list lives in Notion, this also changes what the NAS checks until you revert it, and the probe will likely still time out due to ICMP/firewall restrictions on the server side.
+
+---
+
+## Adding or Removing Services
+
+### Adding an HTTP service
+
+1. Open the [Services database](https://app.notion.com/p/774cb57bfa2c43058d400ed8ce3165d5) in Notion
+2. Add a new row and fill in:
+   - **Service** — display name
+   - **URL** — the full HTTPS URL to check (e.g. `https://my-service.smallmountains.de`)
+   - **Check Type** — select `HTTP`
+3. The next script run picks it up automatically — no code changes needed
+
+### Adding a UDP service
+
+Same as above, but:
+- **URL** — set to `host:port` format (e.g. `localhost:1234` for a co-located service)
+- **Check Type** — select `UDP`
+
+See the [Factorio note](#services) for why `localhost` is recommended for services running on the same machine.
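How a `host:port` value is split can be sketched as follows. The logic mirrors the parsing inside `fetch_services`, including its fallback to port 34197; the names `parse_endpoint` and `DEFAULT_UDP_PORT` are hypothetical.

```python
DEFAULT_UDP_PORT = 34197  # Factorio's port, used when no valid port is given


def parse_endpoint(url: str) -> tuple[str, int]:
    """Split a 'host:port' string; fall back to the default port if absent."""
    host, sep, port_str = url.rpartition(":")
    host = host if sep else url
    port = int(port_str) if port_str.isdigit() else DEFAULT_UDP_PORT
    return host, port


print(parse_endpoint("localhost:1234"))   # ('localhost', 1234)
print(parse_endpoint("localhost"))        # ('localhost', 34197)
```

Note that a row without an explicit port is silently checked on 34197, so non-Factorio UDP services should always include the port.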
+
+### Removing a service
+
+Delete or archive the row in the Notion database. The script will no longer check it from the next run. The SQLite history for that service name is retained but no longer updated — delete it manually with:
+```sql
+DELETE FROM checks WHERE service_name = 'ServiceName';
+```
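The same cleanup can be run from Python with a parameterized query, which avoids pasting the service name into the SQL text. A minimal sketch, assuming a hypothetical helper `delete_history` and using an in-memory table in place of `uptime_history.db`:

```python
import sqlite3


def delete_history(conn: sqlite3.Connection, service_name: str) -> int:
    """Delete all stored checks for one service and return the number removed."""
    cursor = conn.execute("DELETE FROM checks WHERE service_name = ?", (service_name,))
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")  # stand-in for uptime_history.db
conn.execute("CREATE TABLE checks (service_name TEXT, is_online INTEGER)")
conn.executemany("INSERT INTO checks VALUES (?, ?)", [("Old", 1), ("Old", 0), ("Plex", 1)])
removed = delete_history(conn, "Old")
remaining = conn.execute("SELECT COUNT(*) FROM checks").fetchone()[0]
print(removed, remaining)
```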
+
+---
+
+## Troubleshooting
+
+**"Could not find page" error in Notion**  
+The "PyBot" integration has lost access to the page. Go to the Uptime Tracker page in Notion → `...` menu → Connections → re-add PyBot.
+
+**Uptime % is blank after many runs**  
+The window has no data yet (fresh install) or all checks in the window failed to write to SQLite. Check that `uptime_history.db` is writable by the script's user.
+
+**"Last Updated" timestamp is not refreshing**  
+The callout block on the Notion page was deleted or its text no longer contains "Last Updated". Re-add a callout to the page containing that phrase — the script finds it by scanning block content, not by a hardcoded block ID.
+
+**Factorio shows Offline on the NAS**  
+Check that the Factorio container and the monitoring container share a Docker network, or that `localhost` resolves to the NAS host (host network mode). If not, change the URL field of the Factorio row in Notion to the Factorio container's hostname or internal IP, keeping the `:34197` port.
+
+**All services Offline simultaneously**  
+Usually means a network outage, NAS reboot, or the script lost its internet route. Check `monitor.log` for exception messages rather than `OFFLINE` status lines.

+ 2 - 0
requirements.txt

@@ -0,0 +1,2 @@
+notion-client>=2.2.1
+requests>=2.31.0

+ 297 - 0
services_uptime_monitor.py

@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+"""
+Uptime Monitor for smallmountains.de
+Checks service availability and updates the Notion dashboard.
+
+Run via cron every 5 minutes:
+    */5 * * * * /usr/local/bin/python3 /path/to/services_uptime_monitor.py >> /path/to/monitor.log 2>&1
+
+Services are configured entirely in the Notion database — no code changes needed
+to add, remove, or reconfigure a service.
+"""
+from __future__ import annotations
+
+import sys
+import sqlite3
+import socket
+import subprocess
+import time
+from datetime import datetime, timezone, timedelta
+from pathlib import Path
+
+import requests
+from notion_client import Client
+from notion_client.errors import APIResponseError
+
+# ── Configuration ─────────────────────────────────────────────────────────────
+NOTION_TOKEN           = "secret_b7PiPL2FqC9QEikqkAEWOht7LmzPMIJMWTzUPWwbw4H"
+NOTION_DATA_SOURCE_ID  = "22174dd2-e6fc-4dc9-ac86-a5d614c995bd"  # Services data source
+UPTIME_PAGE_ID         = "38210a5f51bd807bae1edb699d9591e8"       # Uptime Tracker page
+
+DB_PATH      = Path(__file__).parent / "uptime_history.db"
+HTTP_TIMEOUT = 10   # seconds
+UDP_TIMEOUT  = 5    # seconds
+
+
+# ── Fetch services from Notion ─────────────────────────────────────────────────
+def fetch_services(notion: Client) -> list[dict]:
+    """
+    Read the Services database and return a list of service dicts.
+
+    Each HTTP service dict:  {"name", "notion_page_id", "type": "http", "url"}
+    Each UDP service dict:   {"name", "notion_page_id", "type": "udp", "host", "port"}
+
+    Rows missing a name or URL are skipped with a warning.
+    Check Type defaults to HTTP when the field is left blank.
+    """
+    response = notion.data_sources.query(NOTION_DATA_SOURCE_ID)
+    services = []
+
+    for page in response["results"]:
+        props = page["properties"]
+
+        # Service name
+        title_arr = props.get("Service", {}).get("title", [])
+        name = title_arr[0]["plain_text"].strip() if title_arr else ""
+        if not name:
+            continue
+
+        # URL / endpoint
+        url = (props.get("URL") or {}).get("url") or ""
+        if not url:
+            print(f"  Skipping '{name}': no URL configured in Notion")
+            continue
+
+        # Check Type (default to HTTP when blank)
+        select = ((props.get("Check Type") or {}).get("select")) or {}
+        check_type = select.get("name", "HTTP").upper()
+
+        service: dict = {
+            "name": name,
+            "notion_page_id": page["id"],
+            "type": check_type.lower(),
+        }
+
+        if check_type == "UDP":
+            # URL field stores "host:port"
+            host, sep, port_str = url.rpartition(":")
+            service["host"] = host if sep else url
+            service["port"] = int(port_str) if port_str.isdigit() else 34197
+        else:
+            service["url"] = url
+
+        services.append(service)
+
+    if not services:
+        raise RuntimeError(
+            "No services found in Notion database. "
+            "Check that NOTION_DATA_SOURCE_ID is correct and PyBot has access."
+        )
+
+    return services
+
+
+# ── SQLite ─────────────────────────────────────────────────────────────────────
+def init_db(db_path: Path) -> sqlite3.Connection:
+    conn = sqlite3.connect(db_path)
+    conn.execute("""
+        CREATE TABLE IF NOT EXISTS checks (
+            id               INTEGER PRIMARY KEY AUTOINCREMENT,
+            service_name     TEXT    NOT NULL,
+            checked_at       TEXT    NOT NULL,
+            is_online        INTEGER NOT NULL,
+            response_time_ms REAL
+        )
+    """)
+    conn.execute(
+        "CREATE INDEX IF NOT EXISTS idx_service_time ON checks (service_name, checked_at)"
+    )
+    conn.commit()
+    return conn
+
+
+def prune_old_records(conn: sqlite3.Connection):
+    cutoff = (datetime.now(timezone.utc) - timedelta(days=35)).isoformat()
+    conn.execute("DELETE FROM checks WHERE checked_at < ?", (cutoff,))
+    conn.commit()
+
+
+def compute_uptime(conn: sqlite3.Connection, service_name: str, hours: int) -> float | None:
+    """Return fraction 0.0–1.0 for Notion's percent format, or None if no data."""
+    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+    row = conn.execute(
+        "SELECT COUNT(*), COALESCE(SUM(is_online), 0) FROM checks "
+        "WHERE service_name = ? AND checked_at >= ?",
+        (service_name, cutoff),
+    ).fetchone()
+    total, online = row
+    if total == 0:
+        return None
+    return online / total
+
+
+# ── Service Checks ─────────────────────────────────────────────────────────────
+def check_http(url: str) -> tuple[bool, float | None]:
+    try:
+        start = time.monotonic()
+        resp = requests.get(url, timeout=HTTP_TIMEOUT, allow_redirects=True)
+        elapsed_ms = (time.monotonic() - start) * 1000
+        return resp.status_code < 500, round(elapsed_ms, 1)
+    except requests.RequestException:
+        return False, None
+
+
+def _ping(host: str) -> bool:
+    # -W timeout unit differs: milliseconds on macOS, seconds on Linux
+    w_arg = "3000" if sys.platform == "darwin" else "3"
+    try:
+        result = subprocess.run(
+            ["ping", "-c", "1", "-W", w_arg, host],
+            capture_output=True,
+            timeout=6,
+        )
+        return result.returncode == 0
+    except Exception:
+        return False
+
+
+def check_udp(host: str, port: int) -> tuple[bool, None]:
+    """
+    Send a probe UDP packet.
+    - Response received      → online
+    - ConnectionRefusedError → offline (ICMP port-unreachable)
+    - Timeout                → fall back to ICMP ping
+    """
+    try:
+        # The context manager closes the socket on every exit path.
+        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
+            sock.settimeout(UDP_TIMEOUT)
+            sock.connect((host, port))
+            sock.send(b"\x00\x00\x00\x00")
+            try:
+                sock.recv(1024)
+                return True, None
+            except socket.timeout:
+                return _ping(host), None
+            except ConnectionRefusedError:
+                return False, None
+    except OSError:
+        # Covers socket.gaierror (DNS failure) and other network errors.
+        return False, None
+
+
+def check_service(service: dict) -> tuple[bool, float | None]:
+    if service["type"] == "http":
+        return check_http(service["url"])
+    if service["type"] == "udp":
+        return check_udp(service["host"], service["port"])
+    return False, None
+
+
+# ── Notion ─────────────────────────────────────────────────────────────────────
+def update_notion_service(
+    notion: Client,
+    service: dict,
+    is_online: bool,
+    response_ms: float | None,
+    uptime_24h: float | None,
+    uptime_7d: float | None,
+    uptime_30d: float | None,
+):
+    now_iso = datetime.now(timezone.utc).isoformat()
+    notion.pages.update(
+        page_id=service["notion_page_id"],
+        properties={
+            "Status":             {"select": {"name": "Online" if is_online else "Offline"}},
+            "Last Checked":       {"date": {"start": now_iso}},
+            "Response Time (ms)": {"number": response_ms},
+            "Uptime 24h %":       {"number": uptime_24h},
+            "Uptime 7d %":        {"number": uptime_7d},
+            "Uptime 30d %":       {"number": uptime_30d},
+        },
+    )
+
+
+def update_last_updated_block(notion: Client, page_id: str, timestamp_str: str):
+    """Find the callout/paragraph containing 'Last Updated' and refresh its text."""
+    try:
+        result = notion.blocks.children.list(block_id=page_id)
+        for block in result.get("results", []):
+            btype = block.get("type")
+            if btype not in ("callout", "paragraph", "quote"):
+                continue
+            rich_text = block.get(btype, {}).get("rich_text", [])
+            plain = "".join(rt.get("plain_text", "") for rt in rich_text)
+            if "Last Updated" not in plain:
+                continue
+            notion.blocks.update(
+                block_id=block["id"],
+                **{
+                    btype: {
+                        "rich_text": [
+                            {
+                                "type": "text",
+                                "text": {"content": f"🔄 Last Updated: {timestamp_str}"},
+                                "annotations": {"bold": False},
+                            }
+                        ]
+                    }
+                },
+            )
+            return
+    except APIResponseError as e:
+        print(f"  Warning: could not update Last Updated block: {e}")
+
+
+# ── Main ───────────────────────────────────────────────────────────────────────
+def main():
+    notion = Client(auth=NOTION_TOKEN)
+
+    services = fetch_services(notion)
+    print(f"Loaded {len(services)} service(s) from Notion.")
+
+    conn = init_db(DB_PATH)
+    prune_old_records(conn)
+
+    now_utc     = datetime.now(timezone.utc)
+    now_iso     = now_utc.isoformat()
+    now_display = now_utc.strftime("%Y-%m-%d %H:%M UTC")
+
+    print(f"[{now_display}] Running uptime checks...")
+
+    for service in services:
+        is_online, response_ms = check_service(service)
+        status_str = "ONLINE " if is_online else "OFFLINE"
+        rt_str     = f"{response_ms:.0f}ms" if response_ms is not None else "—"
+        print(f"  {service['name']:<20} {status_str}  {rt_str}")
+
+        conn.execute(
+            "INSERT INTO checks (service_name, checked_at, is_online, response_time_ms) "
+            "VALUES (?, ?, ?, ?)",
+            (service["name"], now_iso, int(is_online), response_ms),
+        )
+        conn.commit()
+
+        uptime_24h = compute_uptime(conn, service["name"], 24)
+        uptime_7d  = compute_uptime(conn, service["name"], 7 * 24)
+        uptime_30d = compute_uptime(conn, service["name"], 30 * 24)
+
+        try:
+            update_notion_service(
+                notion, service, is_online, response_ms,
+                uptime_24h, uptime_7d, uptime_30d,
+            )
+        except APIResponseError as e:
+            print(f"  Warning: Notion update failed for {service['name']}: {e}")
+
+    update_last_updated_block(notion, UPTIME_PAGE_ID, now_display)
+
+    conn.close()
+    print("Done.")
+
+
+if __name__ == "__main__":
+    main()

Binary
uptime_history.db