# infra
R&D infrastructure stacks for Waterschap Brabantse Delta. Hub-and-spoke deployment: one **cloud** central hub + per-plant **edge** sites.
## Layout
```
infra/
├── stacks/ # reusable, runnable stack defs (kebab-case)
├── cloud/ # the single central hub
├── sites/ # per-plant edge deployments
└── docs/ # architecture + conventions
```
Each stack is pulled into the cloud and site Compose files via the Compose Spec `include:` directive. Each stack can also be run standalone for testing.
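As a rough illustration of the `include:` mechanism, a hub-level Compose file might look like the fragment below. This is a sketch, not the repo's actual file: the file location (`cloud/compose.yml`) and the relative paths are assumptions based on the layout and naming conventions in this README, and only two of the stacks are shown.

```yaml
# cloud/compose.yml (hypothetical sketch; the real file may differ)
# Each entry pulls in a stack definition, resolved relative to this file.
include:
  - ../stacks/node-red/compose.yml
  - ../stacks/influxdb/compose.yml
```

Under the same path assumptions, a single stack could be started on its own with something like `docker compose -f stacks/node-red/compose.yml up -d`. The `include:` directive needs a reasonably recent Docker Compose v2 release.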
## Quick start
```bash
# Cloud hub (run on the central server)
cd cloud
cp .env.example .env # fill in real secrets
./deploy.sh # one-shot bring-up + Let's Encrypt + smoke test
# A plant edge (run on the edge gateway at the plant)
cd sites/<plant>
cp .env.example .env
docker compose up -d
```
After `deploy.sh` finishes, see [`cloud/README.md`](cloud/README.md) for the one-time Keycloak realm bootstrap that wires every app to Keycloak SSO.
## Stacks
| Stack | Purpose | Cloud | Edge |
|---|---|:---:|:---:|
| node-red | Flow-based automation | ✓ | ✓ |
| influxdb | Time-series database | ✓ | ✓ |
| grafana | Dashboards / SCADA | ✓ | ✓ |
| keycloak | Identity / SSO | ✓ | ✓ |
| portainer | Container management UI | ✓ | ✓ |
| nginx-proxy | Stock nginx + certbot sidecar | ✓ | ✓ |
| rabbitmq | General-purpose broker (AMQP + MQTT plugin) | ✓ | ✓ |
| postfix | Outbound mail relay | ✓ | ✓ |
| wireguard-server | VPN server | ✓ | — |
| wireguard-client | VPN client | — | ✓ |
| gitea | Git server (HTTPS-only) | ✓ | — |
| jenkins | CI/CD | ✓ | — |
| sql | Config DB (postgres 16) | ✓ | — |
| mlflow | ML experiment tracking + registry | ✓ | — |
| jupyterhub | Multi-user notebook server | ✓ | — |
| frost | OGC SensorThings API (postgis + dedicated bus) | ✓ | — |
| oauth2-proxy | Keycloak SSO gate (auth_request sidecar) for apps without native OIDC | ✓ | — |
## Sites
| Site | Status |
|---|---|
| gemaal1 | Scaffolded — awaiting hardware provisioning |
## Design
See [`docs/architecture.md`](docs/architecture.md) for the hub-and-spoke topology, 4-network model, ingress table, and the reasoning behind each choice.
## Conventions
- kebab-case folder names
- `compose.yml` (Compose Spec), not `docker-compose.yml`
- Stack composes pulled into cloud/site via `include:`
- Secrets in `.env` files (gitignored); `.env.example` committed with placeholders
- OT layer (OPC UA, PLCs) is **out of scope** for this repo
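
The naming conventions above can be checked mechanically. The function below is a minimal sketch of such a check, not a script that ships with this repo: `check_stacks` is a hypothetical name, and it assumes every stack lives in its own direct subfolder of the directory passed as the first argument.

```bash
#!/usr/bin/env bash
# Hypothetical convention check; not part of the repo.
# Usage: check_stacks <stacks-dir>
# Returns non-zero if any direct subfolder breaks a naming convention.
check_stacks() {
  local root="$1" status=0 dir name
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue          # skip when the glob matches nothing
    name="$(basename "$dir")"
    # kebab-case: lowercase alphanumeric words joined by single hyphens
    if ! printf '%s\n' "$name" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'; then
      echo "not kebab-case: $name"
      status=1
    fi
    # the Compose Spec file name is expected
    if [ ! -f "${dir}compose.yml" ]; then
      echo "missing compose.yml: $name"
      status=1
    fi
    # the legacy file name should not be present
    if [ -f "${dir}docker-compose.yml" ]; then
      echo "legacy docker-compose.yml present: $name"
      status=1
    fi
  done
  return "$status"
}
```

After sourcing the function, running `check_stacks stacks` from the repo root prints one line per violation and returns non-zero if any folder fails, so it could be wired into a CI job or a pre-commit hook.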