knacky/mimic-big

knacky a8c5400f97

docs: add production deployment guide

Operational runbook for rolling Mimic to RT infrastructure. Scope is
the application repo only; the Ansible playbook (D-010) and Caddy
reverse proxy (D-007) are referenced as out-of-scope dependencies.

Sections:

- Host prerequisites (Podman 5, rootless, linger, PostgreSQL 16 reach).
- Filesystem layout: blobs + evidence pools at 0750 under the deploy
  user (D-012), log directory, Quadlet directory.
- Environment variables: split into "required in prod" (MIMIC_SECRET_KEY,
  MIMIC_FERNET_KEY, MIMIC_DATABASE_URL, MIMIC_DATABASE_AUDIT_URL,
  MIMIC_ENV) and "required with safe defaults" (cookie flags, log
  format, CORS origins, blob/evidence roots). Explicit note that the
  two database DSNs must point to two different Postgres roles to
  preserve the audit append-only contract (NF-AUDIT, code-reviewer N5).
- Secrets management: dedicated section addressing PR3 code-reviewer M2.
  File-based generation under ~/secrets with 0700 perms, systemd
  EnvironmentFile or future MIMIC_*_FILE indirection, vault back-up,
  Fernet key rotation requires re-encryption pass.
- Container images: pin policy `:X.Y.Z` (cross-references F-D1), exposed
  ports per layer (backend 5000 as uid 1001, frontend 8080 as uid 101).
- PostgreSQL setup: bootstrap of mimic_audit_writer role with the SQL
  the Ansible playbook runs, plus the fail-loud rationale if the role
  is missing. Alembic upgrade head invocation.
- Quadlet units: backend example with PublishPort 127.0.0.1:5000 (the
  external surface is Caddy, not the backend), EnvironmentFile,
  blob+evidence bind-mounts with `:Z` SELinux relabel.
- Smoke validation: three curl checks (Caddy-fronted /healthz, direct
  backend /healthz, audit DSN presence) with explicit "do not announce
  the release" gate on failure.
- Upgrade procedure: 5-step rolling restart anchored on Quadlet image
  tag edits + alembic upgrade as part of the entrypoint.
- Rollback procedure: image-only (additive schema) vs schema-affecting,
  with alembic downgrade against an explicit revision.
- Open items: explicit pointers to FERNET-KEY, F-D1, F-D2, F-D3
  trackers in tasks/todo.md so future operators see them.

No other file touched; no application code changed.

2026-05-23 03:15:46 +02:00


Mimic — production deployment

Operational guide for rolling Mimic out on the RT infrastructure. Scope is the application repo only — the Ansible playbook that automates the host preparation lives in the separate RT infra repository (D-010), and the Caddy reverse proxy is owned by the RT platform (D-007). This document references both without duplicating them.

For CI/runner setup, see docs/podman-runner-setup.md. For architectural context, see docs/architecture.md.

Audience

Whoever pushes a new Mimic version to production. Assumes familiarity with Podman rootless, systemd user units, and PostgreSQL DSN syntax.

Host prerequisites

Component	Version	Notes
OS	Linux x86_64	Tested on Debian 12 and Fedora 41. SELinux-aware.
Podman	≥ 5.0	Rootless mode mandatory. Verify that `podman info --format '{{.Host.Security.Rootless}}'` returns `true`.
systemd	user mode	`loginctl enable-linger <mimic-user>` so user services survive logout.
PostgreSQL	16	Reachable from the Mimic container. Local socket fine; networked instance fine.
Reverse proxy	Caddy (out-of-Mimic)	Provides TLS, IP allowlist, and SOC session token plumbing. Configured in the RT infra repo.

The deployment user (referred to as <mimic-user> below) is typically a dedicated mimic system account. Reusing the gitea user is acceptable for single-tenant hosts but not recommended in multi-app scenarios.

Filesystem layout

Path	Owner	Mode	Purpose
`/var/lib/mimic/blobs`	`<mimic-user>:<mimic-user>`	`0750`	Content-addressed C2 output blobs (D-012). Default for `MIMIC_BLOB_ROOT`.
`/var/lib/mimic/evidence`	`<mimic-user>:<mimic-user>`	`0750`	User-uploaded evidence (F8). Default for `MIMIC_EVIDENCE_ROOT`.
`/var/log/mimic`	`<mimic-user>:<mimic-user>`	`0750`	Application logs if file-logging is enabled. JSON to stdout by default.
`~<mimic-user>/.config/containers/systemd/`	`<mimic-user>`	`0700`	Quadlet units for the backend + frontend containers.

The Ansible playbook in the RT infra repo creates these paths with the correct permissions. Manual provisioning equivalent:

sudo install -d -o <mimic-user> -g <mimic-user> -m 0750 \
  /var/lib/mimic/blobs /var/lib/mimic/evidence /var/log/mimic

Environment variables

Loaded from the systemd unit Environment= directives or a separate .env file mounted into the container. All variables are prefixed MIMIC_ (Pydantic Settings convention, see backend/src/mimic/config.py).

Required in production

Variable	Example	Effect
`MIMIC_ENV`	`production`	Switches default cookie / log behaviour.
`MIMIC_SECRET_KEY`	`$(python -c 'import secrets; print(secrets.token_urlsafe(32))')`	Flask session cookie HMAC. Rotating it invalidates every live session — schedule a maintenance window.
`MIMIC_FERNET_KEY`	`$(python -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')`	Symmetric key encrypting `c2_credential.config_json_fernet`. Required in prod. `Fernet(b"")` would crash on first credential decrypt; the empty default in `config.py` exists only so tests can boot.
`MIMIC_DATABASE_URL`	`postgresql+psycopg://mimic_app:<pw>@postgres:5432/mimic`	Main app DSN. The role behind it must NOT have `INSERT` on `audit_log` (NF-AUDIT append-only contract).
`MIMIC_DATABASE_AUDIT_URL`	`postgresql+psycopg://mimic_audit_writer:<pw>@postgres:5432/mimic`	Write-only DSN used by the audit writer. The role has `INSERT` on `audit_log` and nothing else. See Bootstrap the audit role.
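
The required variables above lend themselves to a pre-start check. The following is a minimal sketch, not part of the application: the `preflight` function and its messages are hypothetical, and the only outside fact it relies on is that a Fernet key is 32 bytes encoded as url-safe base64.

```python
import base64
import binascii

# The five variables listed as required in production.
REQUIRED = (
    "MIMIC_ENV",
    "MIMIC_SECRET_KEY",
    "MIMIC_FERNET_KEY",
    "MIMIC_DATABASE_URL",
    "MIMIC_DATABASE_AUDIT_URL",
)


def preflight(env: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means the check passed."""
    problems = [f"{name} is missing or empty" for name in REQUIRED if not env.get(name)]

    key = env.get("MIMIC_FERNET_KEY", "")
    if key:
        try:
            raw = base64.urlsafe_b64decode(key)
        except (binascii.Error, ValueError):
            raw = b""
        # A Fernet key decodes to exactly 32 bytes.
        if len(raw) != 32:
            problems.append("MIMIC_FERNET_KEY is not 32 url-safe base64-encoded bytes")
    return problems
```

It could be run as `preflight(dict(os.environ))` inside the container before the application starts. It never prints the values themselves, only the variable names.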

Required with safe defaults

Variable	Default	Comment
`MIMIC_BLOB_ROOT`	`/var/lib/mimic/blobs`	Override only if the data partition lives elsewhere.
`MIMIC_EVIDENCE_ROOT`	`/var/lib/mimic/evidence`	Same.
`MIMIC_SESSION_COOKIE_SECURE`	`true`	Must stay `true` behind Caddy/TLS. Set `false` only for the dev compose.
`MIMIC_SESSION_COOKIE_SAMESITE`	`Lax`	`Strict` if SOC console is on the same eTLD+1 as Mimic.
`MIMIC_LOG_LEVEL`	`INFO`	`DEBUG` is verbose, do not enable in prod without a reason.
`MIMIC_LOG_JSON`	`true`	Required for log shipping. Disable only for human debugging.
`MIMIC_CORS_ORIGINS`	`[]` (none)	Set to the public Mimic URL if frontend and backend are served from different origins.

Never set in production

Never set MIMIC_DATABASE_URL and MIMIC_DATABASE_AUDIT_URL to the same Postgres role: the two DSNs must point to two different roles. Pointing them at the same role defeats the audit append-only guarantee, an issue caught by code review N5 (see tasks/todo.md § CI follow-ups).
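
The two-role rule can be checked mechanically before a deploy. This is a hedged sketch using only the standard library: `dsn_role` and `assert_distinct_roles` are hypothetical helper names, and the check only compares the user part of each DSN, so it cannot tell whether the grants behind the roles are correct.

```python
from urllib.parse import urlsplit


def dsn_role(dsn: str) -> str:
    """Extract the role (user) name from a SQLAlchemy-style DSN."""
    role = urlsplit(dsn).username
    if not role:
        # Deliberately do not echo the DSN: it embeds a password.
        raise ValueError("DSN carries no role name")
    return role


def assert_distinct_roles(app_dsn: str, audit_dsn: str) -> None:
    """Raise if both DSNs authenticate as the same Postgres role."""
    if dsn_role(app_dsn) == dsn_role(audit_dsn):
        raise ValueError(
            "MIMIC_DATABASE_URL and MIMIC_DATABASE_AUDIT_URL use the same role"
        )
```

With the example DSNs from the table above, `dsn_role` yields `mimic_app` and `mimic_audit_writer`, so `assert_distinct_roles` returns without raising.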

Secrets management

Three secrets must never appear in container images, git history, or agent transcripts: MIMIC_SECRET_KEY, MIMIC_FERNET_KEY, and the PostgreSQL password embedded in the two DSNs.

Recommended flow (matches the team-wide "secrets via file, not chat" convention):

1. Generate secrets once per environment on the deploy host:

umask 077
install -d -m 0700 ~/secrets
python -c 'import secrets; print(secrets.token_urlsafe(32))' > ~/secrets/SECRET_KEY
python -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())' > ~/secrets/FERNET_KEY

2. Reference the secrets from the systemd unit via EnvironmentFile= (one KEY=VALUE per line), or mount them as in-container files and read them through a MIMIC_FERNET_KEY_FILE-style indirection. Today the app reads MIMIC_FERNET_KEY directly; the file-based path is tracked as a follow-up.
3. Back up the secret material to the RT password vault, not anywhere else. Losing FERNET_KEY after C2 credentials are persisted means the data is permanently unreadable (no recovery key by design).
4. Rotating MIMIC_FERNET_KEY requires a re-encryption pass over c2_credential.config_json_fernet. The Ansible playbook ships a maintenance task for it; it is not exposed in the application CLI.
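
The generation commands above write bare values, while EnvironmentFile= expects KEY=VALUE lines. One way to bridge the two is sketched below. It is an illustration only: `render_env` and `write_private` are hypothetical helpers, and the mapping covers just the two generated files (the DSNs would have to be added the same way, which this guide does not specify).

```python
import os
from pathlib import Path

# Variable name -> file name under ~/secrets, matching the generation step.
SECRET_FILES = {
    "MIMIC_SECRET_KEY": "SECRET_KEY",
    "MIMIC_FERNET_KEY": "FERNET_KEY",
}


def render_env(secrets_dir: Path, files: dict[str, str] = SECRET_FILES) -> str:
    """Build KEY=VALUE lines from one-value-per-file secrets."""
    lines = []
    for var, filename in files.items():
        value = (secrets_dir / filename).read_text().strip()
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{filename}: expected a single non-empty token")
        lines.append(f"{var}={value}")
    return "\n".join(lines) + "\n"


def write_private(path: Path, content: str) -> None:
    """Write content to path, readable and writable by the owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(content)
    os.chmod(path, 0o600)  # also tightens a file that already existed
```

Calling `write_private(Path.home() / "secrets" / "mimic-backend.env", render_env(Path.home() / "secrets"))` would produce a file at the path used by the EnvironmentFile= line in the Quadlet example in this guide.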

Container images

Component	Image	Tag policy
Backend	`backend/Dockerfile`, built and pushed by CI	Pin `:X.Y.Z` per release. Never `:latest` in prod (follow-up F-D1).
Frontend	`frontend/Dockerfile`, built and pushed by CI	Same policy. Served by `nginxinc/nginx-unprivileged:alpine` listening on 8080.
PostgreSQL	`postgres:16-alpine`	Pin a minor tag (`16.4-alpine`) in production compose.

The backend image listens on 5000 as user mimic (uid 1001). The frontend image listens on 8080 as user nginx (uid 101).

PostgreSQL setup

The application user (mimic_app) is created by the Ansible playbook with LOGIN and ownership over the application database. It does not get INSERT on audit_log — that grant goes to a separate role, see below.

Bootstrap the audit role

mimic_audit_writer exists to enforce the NF-AUDIT append-only contract. The Alembic baseline migration grants INSERT ON audit_log to this role if it exists, idempotently. Create the role before running migrations (the Ansible playbook does this; manual equivalent):

-- run as a Postgres superuser, against the mimic database
CREATE ROLE mimic_audit_writer LOGIN PASSWORD '<paste-from-vault>';

Then expose its DSN as MIMIC_DATABASE_AUDIT_URL. The application boots even if the role is missing (the grant block is a no-op), but every audit write then fails at runtime. This is deliberate: a loud failure is preferred over silent data loss.

Apply migrations

The backend container runs Alembic at startup via its entrypoint, against the MIMIC_DATABASE_URL DSN. To apply manually:

podman exec -it mimic-backend alembic upgrade head

A schema downgrade (see Rollback procedure below) uses the same mechanism, with alembic downgrade in place of alembic upgrade.

Quadlet units

Both containers run under the <mimic-user> systemd user instance via Quadlet. Example backend unit (~<mimic-user>/.config/containers/systemd/mimic-backend.container):

[Unit]
Description=Mimic backend
After=network-online.target

[Container]
Image=registry.try2get.in/mimic-backend:X.Y.Z
ContainerName=mimic-backend
PublishPort=127.0.0.1:5000:5000
EnvironmentFile=%h/secrets/mimic-backend.env
Volume=/var/lib/mimic/blobs:/var/lib/mimic/blobs:Z
Volume=/var/lib/mimic/evidence:/var/lib/mimic/evidence:Z

[Service]
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target

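Frontend unit is structurally identical, listening on 127.0.0.1:8080. Caddy fronts both. Quadlet units are generated by systemd at daemon-reload time, so they cannot be enabled with systemctl enable; the [Install] section of the .container file already attaches them to default.target. Activation:

systemctl --user daemon-reload
systemctl --user start mimic-backend.service mimic-frontend.service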

The reverse proxy configuration on Caddy (out-of-Mimic) terminates TLS and forwards https://<mimic-domain>/api/* → 127.0.0.1:5000, every other path → 127.0.0.1:8080.

Smoke validation

Once the stack is up:

# From the deploy host, behind Caddy
curl -fsS https://<mimic-domain>/healthz
# → "ok"

# Direct to the backend (should not be reachable externally — sanity)
curl -fsS http://127.0.0.1:5000/healthz
# → "ok"

# Verify audit role is wired
podman exec -it mimic-backend python -c 'from mimic.config import get_settings; \
    print(get_settings().database_audit_url is not None)'
# → True

If any of these fail, do not announce the release. Investigate via journalctl --user -u mimic-backend.service -e.
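
The two HTTP checks above can also be scripted so that the release gate is a single exit status. The sketch below is one possible shape, using only the standard library; `healthz_ok` and `failed_checks` are hypothetical names, and matching on a body that contains "ok" is an assumption based on the expected output shown in the curl examples.

```python
import urllib.error
import urllib.request


def healthz_ok(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers HTTP 200 with a body containing 'ok'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(64).decode("utf-8", errors="replace")
            return resp.status == 200 and "ok" in body.lower()
    except (urllib.error.URLError, OSError):
        # HTTP error statuses, refused connections and timeouts all count as failure.
        return False


def failed_checks(urls: list[str]) -> list[str]:
    """Return the subset of URLs whose health check did not pass."""
    return [url for url in urls if not healthz_ok(url)]
```

A wrapper could call `failed_checks` with the Caddy-fronted URL and `http://127.0.0.1:5000/healthz`, and exit non-zero when the returned list is not empty.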

Upgrade procedure

Steady-state release flow:

1. CI builds mimic-backend:X.Y.Z and mimic-frontend:X.Y.Z and pushes them to registry.try2get.in. The tag policy is the same as the sprint 0 follow-up F-D1.
2. Update the Quadlet .container files on the deploy host to point at the new tags (a single Image= line each).
3. Run systemctl --user daemon-reload.
4. Run systemctl --user restart mimic-backend.service mimic-frontend.service. Podman pulls the new tag on start because it is not yet present locally.
5. Run smoke validation. Tail logs for one minute.

If the release ships schema changes, Alembic runs upgrade head on container start: the migration is the first thing the entrypoint does. A failed migration prevents the new container from accepting traffic and leaves the failed container's exit code visible in journalctl.
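
The tag edit on the Quadlet files is a one-line change, and it can be scripted with a guard against touching anything else. The following is a sketch under stated assumptions: `set_image_tag` is a hypothetical helper, it expects exactly one tagged Image= line per unit file, and the version numbers in the usage note are made up.

```python
import re

# Matches "Image=<anything>:<tag>" where the tag holds no colon, slash or space,
# so a registry port such as "registry:5000/name" is never mistaken for a tag.
_IMAGE_LINE = re.compile(r"^(Image=.+):[^:/\s]+$", re.MULTILINE)


def set_image_tag(unit_text: str, new_tag: str) -> str:
    """Return unit_text with the tag of its single Image= line replaced."""
    updated, count = _IMAGE_LINE.subn(
        lambda match: f"{match.group(1)}:{new_tag}", unit_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one tagged Image= line, found {count}")
    return updated
```

For a unit containing `Image=registry.try2get.in/mimic-backend:1.2.3`, calling `set_image_tag(text, "1.2.4")` changes only that line and leaves the rest of the file untouched. Raising on zero or multiple matches is a deliberate choice here: a silent no-op would let a restart run the old image while the operator believes the upgrade happened.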

Rollback procedure

A rollback always covers the image. A schema rollback is additionally required only when the new release includes a non-additive migration.

# Image-level rollback only (additive schema, no data shape change)
sed -i 's|Image=.*mimic-backend:.*|Image=registry.try2get.in/mimic-backend:<previous>|' \
  ~/.config/containers/systemd/mimic-backend.container
systemctl --user daemon-reload
systemctl --user restart mimic-backend.service

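# Schema-affecting rollback: downgrade first, while the new image (which ships
# the migration being reverted) is still running
podman exec -it mimic-backend alembic downgrade <previous-revision>
# then image rollback as above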

Always confirm the target Alembic revision matches the previous image's shipped revision before downgrading — there is no enforcement and a mismatch is recoverable but unpleasant.

Open items captured in `tasks/todo.md`

- FERNET-KEY (CI follow-ups): provision FERNET_KEY_TEST Gitea secret for CI so integration tests can exercise the encrypted-credential path.
- F-D1 (Frontend follow-ups): pin every production image by minor + digest. This document already mandates the policy; F-D1 is the implementation step.
- F-D2 (Frontend follow-ups): decide whether Caddy or the in-image HEALTHCHECK owns liveness probing. Currently neither is wired.
- F-D3: security response headers ownership (Caddy vs nginx.conf).
