docs: add production deployment guide

Operational runbook for rolling Mimic to RT infrastructure. Scope is the
application repo only; the Ansible playbook (D-010) and Caddy reverse proxy
(D-007) are referenced as out-of-scope dependencies.

Sections:
- Host prerequisites (Podman 5, rootless, linger, PostgreSQL 16 reach).
- Filesystem layout: blobs + evidence pools at 0750 under the deploy user (D-012), log directory, Quadlet directory.
- Environment variables: split into "required in prod" (MIMIC_SECRET_KEY, MIMIC_FERNET_KEY, MIMIC_DATABASE_URL, MIMIC_DATABASE_AUDIT_URL, MIMIC_ENV) and "required with safe defaults" (cookie flags, log format, CORS origins, blob/evidence roots). Explicit note that the two database DSNs must point to two different Postgres roles to preserve the audit append-only contract (NF-AUDIT, code-reviewer N5).
- Secrets management: dedicated section addressing PR3 code-reviewer M2. File-based generation under ~/secrets with 0700 perms, systemd EnvironmentFile or future MIMIC_*_FILE indirection, vault back-up, Fernet key rotation requires re-encryption pass.
- Container images: pin policy `:X.Y.Z` (cross-references F-D1), exposed ports per layer (backend 5000 as uid 1001, frontend 8080 as uid 101).
- PostgreSQL setup: bootstrap of mimic_audit_writer role with the SQL the Ansible playbook runs, plus the fail-loud rationale if the role is missing. Alembic upgrade head invocation.
- Quadlet units: backend example with PublishPort 127.0.0.1:5000 (the external surface is Caddy, not the backend), EnvironmentFile, blob+evidence bind-mounts with `:Z` SELinux relabel.
- Smoke validation: three curl checks (Caddy-fronted /healthz, direct backend /healthz, audit DSN presence) with explicit "do not announce the release" gate on failure.
- Upgrade procedure: 5-step rolling restart anchored on Quadlet image tag edits + alembic upgrade as part of the entrypoint.
- Rollback procedure: image-only (additive schema) vs schema-affecting, with alembic downgrade against an explicit revision.
- Open items: explicit pointers to FERNET-KEY, F-D1, F-D2, F-D3 trackers in tasks/todo.md so future operators see them.

No other file touched; no application code changed.

274
docs/deploy.md
Normal file
@@ -0,0 +1,274 @@
# Mimic — production deployment

Operational guide for rolling Mimic out on the RT infrastructure. Scope is
the **application repo only**: the Ansible playbook that automates the
host preparation lives in the separate RT infra repository (D-010), and
the Caddy reverse proxy is owned by the RT platform (D-007). This document
references both without duplicating them.

For CI/runner setup, see [`docs/podman-runner-setup.md`](./podman-runner-setup.md).
For architectural context, see [`docs/architecture.md`](./architecture.md).

## Audience

Whoever pushes a new Mimic version to production. Assumes familiarity with
Podman rootless, systemd user units, and PostgreSQL DSN syntax.

## Host prerequisites

| Component | Version | Notes |
| --- | --- | --- |
| OS | Linux x86_64 | Tested on Debian 12 and Fedora 41. SELinux-aware. |
| Podman | ≥ 5.0 | Rootless mode mandatory. Verify that `podman info --format '{{.Host.Security.Rootless}}'` returns `true`. |
| systemd | user mode | Run `loginctl enable-linger <mimic-user>` so user services survive logout. |
| PostgreSQL | 16 | Reachable from the Mimic container. Local socket fine; networked instance fine. |
| Reverse proxy | Caddy (out-of-Mimic) | Provides TLS, IP allowlist, and SOC session token plumbing. Configured in the RT infra repo. |

The deployment user (referred to as `<mimic-user>` below) is typically a
dedicated `mimic` system account. Reusing the `gitea` user is acceptable
for single-tenant hosts but not recommended in multi-app scenarios.

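The two host checks from the table can be bundled into one preflight function. This is a minimal sketch, not part of the repository: the function name `preflight` is hypothetical, it reuses the `podman info` format string from the table, and it assumes `loginctl show-user` prints `Linger=yes` for an account with linger enabled. Run it as `<mimic-user>`, because the rootless flag is reported for the calling user.

```bash
# Hypothetical helper: verify rootless Podman and linger for user $1.
# Returns non-zero and prints the reason on the first failed check.
preflight() {
    user=$1
    [ "$(podman info --format '{{.Host.Security.Rootless}}')" = "true" ] \
        || { echo "podman is not running rootless" >&2; return 1; }
    [ "$(loginctl show-user "$user" --property=Linger)" = "Linger=yes" ] \
        || { echo "linger is not enabled for $user" >&2; return 1; }
}

# Usage on the deploy host:
#   preflight mimic && echo "host prerequisites look fine"
```
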
## Filesystem layout

| Path | Owner | Mode | Purpose |
| --- | --- | --- | --- |
| `/var/lib/mimic/blobs` | `<mimic-user>:<mimic-user>` | `0750` | Content-addressed C2 output blobs (D-012). Default for `MIMIC_BLOB_ROOT`. |
| `/var/lib/mimic/evidence` | `<mimic-user>:<mimic-user>` | `0750` | User-uploaded evidence (F8). Default for `MIMIC_EVIDENCE_ROOT`. |
| `/var/log/mimic` | `<mimic-user>:<mimic-user>` | `0750` | Application logs if file-logging is enabled. JSON to stdout by default. |
| `~<mimic-user>/.config/containers/systemd/` | `<mimic-user>` | `0700` | Quadlet units for the backend + frontend containers. |

The Ansible playbook in the RT infra repo creates these paths with the
correct permissions. Manual provisioning equivalent:

```bash
sudo install -d -o <mimic-user> -g <mimic-user> -m 0750 \
    /var/lib/mimic/blobs /var/lib/mimic/evidence /var/log/mimic
```

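After manual provisioning, the modes in the table can be verified with a short check. The sketch below is illustrative: the helper name `dir_mode_is` is hypothetical, and it assumes GNU `stat` (the `-c '%a'` form), which is what Debian and Fedora ship.

```bash
# Hypothetical helper: succeed when $1 is a directory whose octal mode is $2.
# Assumes GNU coreutils stat; BSD stat uses a different flag syntax.
dir_mode_is() {
    [ -d "$1" ] && [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Usage on the deploy host:
#   for d in /var/lib/mimic/blobs /var/lib/mimic/evidence /var/log/mimic; do
#       dir_mode_is "$d" 750 || echo "unexpected mode on $d" >&2
#   done
```
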
## Environment variables

Loaded from the systemd unit `Environment=` directives or a separate
`.env` file mounted into the container. All variables are prefixed
`MIMIC_` (Pydantic Settings convention, see `backend/src/mimic/config.py`).

### Required in production

| Variable | Example | Effect |
| --- | --- | --- |
| `MIMIC_ENV` | `production` | Switches default cookie / log behaviour. |
| `MIMIC_SECRET_KEY` | `$(python -c 'import secrets; print(secrets.token_urlsafe(32))')` | Flask session cookie HMAC. Rotating it invalidates every live session, so schedule a maintenance window. |
| `MIMIC_FERNET_KEY` | `$(python -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')` | Symmetric key encrypting `c2_credential.config_json_fernet`. **Required** in prod. `Fernet(b"")` would crash on first credential decrypt; the empty default in `config.py` exists only so tests can boot. |
| `MIMIC_DATABASE_URL` | `postgresql+psycopg://mimic_app:<pw>@postgres:5432/mimic` | Main app DSN. The role behind it must NOT have `INSERT` on `audit_log` (NF-AUDIT append-only contract). |
| `MIMIC_DATABASE_AUDIT_URL` | `postgresql+psycopg://mimic_audit_writer:<pw>@postgres:5432/mimic` | Write-only DSN used by the audit writer. The role has `INSERT` on `audit_log` and nothing else. See [Bootstrap the audit role](#bootstrap-the-audit-role). |

### Required with safe defaults

| Variable | Default | Comment |
| --- | --- | --- |
| `MIMIC_BLOB_ROOT` | `/var/lib/mimic/blobs` | Override only if the data partition lives elsewhere. |
| `MIMIC_EVIDENCE_ROOT` | `/var/lib/mimic/evidence` | Same. |
| `MIMIC_SESSION_COOKIE_SECURE` | `true` | Must stay `true` behind Caddy/TLS. Set `false` only for the dev compose. |
| `MIMIC_SESSION_COOKIE_SAMESITE` | `Lax` | Use `Strict` if the SOC console is on the same eTLD+1 as Mimic. |
| `MIMIC_LOG_LEVEL` | `INFO` | `DEBUG` is verbose; do not enable it in prod without a reason. |
| `MIMIC_LOG_JSON` | `true` | Required for log shipping. Disable only for human debugging. |
| `MIMIC_CORS_ORIGINS` | `[]` (none) | Set to the public Mimic URL if frontend and backend are served from different origins. |

### Never set in production

Never set `MIMIC_DATABASE_URL` and `MIMIC_DATABASE_AUDIT_URL` to the same
role: the two DSNs must point to two different roles. Pointing them at the
same role defeats the audit append-only guarantee. This was caught by code
review N5 (see `tasks/todo.md` § CI follow-ups).

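The two-role rule can be checked mechanically before a deploy. The sketch below is an assumption-laden illustration, not something the application ships: `dsn_role` and `dsn_roles_differ` are hypothetical names, and the parsing assumes DSNs shaped like the examples above (`scheme://role:password@host:port/db`, with any `@` in the password URL-encoded).

```bash
# Hypothetical helper: print the role (user) part of a DSN such as
# postgresql+psycopg://mimic_app:<pw>@postgres:5432/mimic
dsn_role() {
    rest=${1#*://}       # drop the scheme, e.g. "postgresql+psycopg://"
    creds=${rest%%@*}    # keep "role:password"
    printf '%s\n' "${creds%%:*}"
}

# Hypothetical helper: succeed only when the two DSNs use different roles.
dsn_roles_differ() {
    [ "$(dsn_role "$1")" != "$(dsn_role "$2")" ]
}

# Usage, with both variables loaded from the environment file:
#   dsn_roles_differ "$MIMIC_DATABASE_URL" "$MIMIC_DATABASE_AUDIT_URL" \
#       || echo "both DSNs use the same role" >&2
```

The check compares role names only; it does not prove that the grants behind each role are correct.
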
## Secrets management

Three secrets must never appear in container images, git history, or
agent transcripts: `MIMIC_SECRET_KEY`, `MIMIC_FERNET_KEY`, and the
PostgreSQL password embedded in the two DSNs.

Recommended flow (matches the team-wide "secrets via file, not chat"
convention):

1. Generate secrets once per environment on the deploy host:

   ```bash
   umask 077
   install -d -m 0700 ~/secrets
   python -c 'import secrets; print(secrets.token_urlsafe(32))' > ~/secrets/SECRET_KEY
   python -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())' > ~/secrets/FERNET_KEY
   ```

2. Reference the files from the systemd unit via `EnvironmentFile=` (one
   `KEY=VALUE` per line) **or** mount them as in-container files and
   read them through a `MIMIC_FERNET_KEY_FILE`-style indirection. Today
   the app reads `MIMIC_FERNET_KEY` directly; the file-based path is
   tracked as a follow-up.

3. Back up the secret material to the RT password vault, not anywhere
   else. Losing `FERNET_KEY` after C2 credentials are persisted means
   the data is permanently unreadable (no recovery key by design).

4. Rotating `MIMIC_FERNET_KEY` requires a re-encryption pass over
   `c2_credential.config_json_fernet`. The Ansible playbook ships a
   maintenance task for it; it is not exposed in the application CLI.

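One way to bridge step 1 and the `EnvironmentFile=` option in step 2 is to render the generated files into `KEY=VALUE` lines. This is a sketch under assumptions: the helper name `render_secret_env` is hypothetical, and it expects the file names `SECRET_KEY` and `FERNET_KEY` written by the commands in step 1.

```bash
# Hypothetical helper: print MIMIC_SECRET_KEY / MIMIC_FERNET_KEY lines
# from the files SECRET_KEY and FERNET_KEY found in directory $1.
render_secret_env() {
    for name in SECRET_KEY FERNET_KEY; do
        [ -r "$1/$name" ] || { echo "missing $1/$name" >&2; return 1; }
        printf 'MIMIC_%s=%s\n' "$name" "$(cat "$1/$name")"
    done
}

# Usage (MIMIC_ENV and the two DSNs still have to be added to the file):
#   umask 077
#   render_secret_env ~/secrets > ~/secrets/mimic-backend.env
```

Keeping `umask 077` in place before the redirect means the rendered file is created readable by the deploy user only.
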
## Container images

| Component | Image | Tag policy |
| --- | --- | --- |
| Backend | `backend/Dockerfile`, built and pushed by CI | Pin `:X.Y.Z` per release. Never `:latest` in prod (follow-up F-D1). |
| Frontend | `frontend/Dockerfile`, built and pushed by CI | Same policy. Served by `nginxinc/nginx-unprivileged:alpine` listening on 8080. |
| PostgreSQL | `postgres:16-alpine` | Pin a minor tag (`16.4-alpine`) in production compose. |

The backend image listens on **5000** as user `mimic` (uid 1001). The
frontend image listens on **8080** as user `nginx` (uid 101).

## PostgreSQL setup

The application user (`mimic_app`) is created by the Ansible playbook
with `LOGIN` and ownership over the application database. It does **not**
get `INSERT` on `audit_log`; that grant goes to a separate role, see
below.

### Bootstrap the audit role

`mimic_audit_writer` exists to enforce the NF-AUDIT append-only contract.
The Alembic baseline migration grants `INSERT ON audit_log` to this role
if it exists, idempotently. Create the role before running migrations
(the Ansible playbook does this; manual equivalent):

```sql
-- run as a Postgres superuser, against the mimic database
CREATE ROLE mimic_audit_writer LOGIN PASSWORD '<paste-from-vault>';
```

Then expose its DSN as `MIMIC_DATABASE_AUDIT_URL`. The application boots
even if the role is missing (the grant block is a no-op), but every
audit write will fail at runtime: fail-loud is preferred over silent data
loss.

### Apply migrations

The backend container runs Alembic at startup via its entrypoint, against
the `MIMIC_DATABASE_URL` DSN. To apply manually:

```bash
podman exec -it mimic-backend alembic upgrade head
```

A schema downgrade (rollback procedure below) uses the same surface in
reverse.

## Quadlet units

Both containers run under the `<mimic-user>` systemd user instance via
Quadlet. Example backend unit
(`~<mimic-user>/.config/containers/systemd/mimic-backend.container`):

```ini
[Unit]
Description=Mimic backend
After=network-online.target

[Container]
Image=registry.try2get.in/mimic-backend:X.Y.Z
ContainerName=mimic-backend
PublishPort=127.0.0.1:5000:5000
EnvironmentFile=%h/secrets/mimic-backend.env
Volume=/var/lib/mimic/blobs:/var/lib/mimic/blobs:Z
Volume=/var/lib/mimic/evidence:/var/lib/mimic/evidence:Z

[Service]
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

Frontend unit is structurally identical, listening on `127.0.0.1:8080`.
Caddy fronts both. Activation:

```bash
systemctl --user daemon-reload
systemctl --user enable --now mimic-backend.service mimic-frontend.service
```

The reverse proxy configuration on Caddy (out-of-Mimic) terminates TLS
and forwards `https://<mimic-domain>/api/*` → `127.0.0.1:5000`, every
other path → `127.0.0.1:8080`.

## Smoke validation

Once the stack is up:

```bash
# From the deploy host, behind Caddy
curl -fsS https://<mimic-domain>/healthz
# → "ok"

# Direct to the backend (should not be reachable externally — sanity)
curl -fsS http://127.0.0.1:5000/healthz
# → "ok"

# Verify audit role is wired
podman exec -it mimic-backend python -c 'from mimic.config import get_settings; \
    print(get_settings().database_audit_url is not None)'
# → True
```

If any of these fail, do **not** announce the release. Investigate via
`journalctl --user -u mimic-backend.service -e`.

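The two HTTP checks can be wrapped so that the release gate becomes a single exit status. This is a minimal sketch, not a shipped script: the function name `smoke_http` is hypothetical, it relies only on the exit status of `curl -fsS` (it does not inspect the response body), and the domain is passed in as an argument.

```bash
# Hypothetical helper: probe the Caddy-fronted and the direct backend
# health endpoints. Prints PASS/FAIL per URL and returns non-zero if any
# probe failed, which maps to "do not announce the release".
smoke_http() {
    failed=0
    for url in "https://$1/healthz" "http://127.0.0.1:5000/healthz"; do
        if curl -fsS "$url" >/dev/null; then
            echo "PASS $url"
        else
            echo "FAIL $url"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on the deploy host:
#   smoke_http <mimic-domain> || journalctl --user -u mimic-backend.service -e
```
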
## Upgrade procedure

Steady-state release flow:

1. CI builds `mimic-backend:X.Y.Z` and `mimic-frontend:X.Y.Z` and pushes
   them to `registry.try2get.in`. The tag policy is the same as the
   sprint 0 follow-up F-D1.
2. Update the Quadlet `.container` files on the deploy host to point at
   the new tags (single line each).
3. `systemctl --user daemon-reload`.
4. `systemctl --user restart mimic-backend.service mimic-frontend.service`.
   Quadlet pulls the new image automatically.
5. Run smoke validation. Tail logs for one minute.

If the release ships schema changes, Alembic runs `upgrade head` on
container start: the migration is the **first** thing the entrypoint
does. A failed migration prevents the new container from accepting
traffic and leaves the failed container's exit code visible in
`journalctl`.

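Step 2 is a one-line edit per unit, so it can be scripted. The sketch below is illustrative only: the helper name `set_image_tag` is hypothetical, and it assumes the unit contains a single `Image=` line that already ends in `:<tag>`, as in the backend example in this guide.

```bash
# Hypothetical helper: point the Image= line of Quadlet unit $1 at tag $2.
# Registry and image name are kept; only the text after the last ":" changes.
set_image_tag() {
    unit=$1
    tag=$2
    tmp=$(mktemp) || return 1
    if sed "s|^Image=\(.*\):[^:/]*\$|Image=\1:${tag}|" "$unit" > "$tmp"; then
        cat "$tmp" > "$unit"   # overwrite in place, keeping owner and mode
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Usage on the deploy host, followed by steps 3 to 5:
#   set_image_tag ~/.config/containers/systemd/mimic-backend.container X.Y.Z
#   set_image_tag ~/.config/containers/systemd/mimic-frontend.container X.Y.Z
```
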
## Rollback procedure

A rollback always covers the image and may also cover the schema. The
schema rollback is only required when the new release includes a
non-additive migration.

```bash
# Image-level rollback only (additive schema, no data shape change)
sed -i 's|Image=.*mimic-backend:.*|Image=registry.try2get.in/mimic-backend:<previous>|' \
    ~/.config/containers/systemd/mimic-backend.container
systemctl --user daemon-reload
systemctl --user restart mimic-backend.service

# Schema-affecting rollback
podman exec -it mimic-backend alembic downgrade <previous-revision>
# then image rollback as above
```

Always confirm the target Alembic revision matches the previous image's
shipped revision before downgrading: there is no enforcement, and a
mismatch is recoverable but unpleasant.

## Open items captured in `tasks/todo.md`

- `FERNET-KEY` (CI follow-ups): provision `FERNET_KEY_TEST` Gitea secret
  for CI so integration tests can exercise the encrypted-credential path.
- `F-D1` (Frontend follow-ups): pin every production image by minor +
  digest. This document already mandates the policy; F-D1 is the
  implementation step.
- `F-D2` (Frontend follow-ups): decide whether Caddy or the in-image
  `HEALTHCHECK` owns liveness probing. Currently neither is wired.
- `F-D3`: security response headers ownership (Caddy vs nginx.conf).