Development Environment SSOT¶

SSOT Key: development

Source of Truth for local development, testing, CI, and deployment.

Source Files¶

File	Purpose
`moon.yml`	Root workspace tasks
`apps/*/moon.yml`	Per-project tasks
`scripts/test_lifecycle.py`	Database lifecycle (Python Context Manager)
`scripts/smoke_test.sh`	Unified smoke tests
`docker-compose.yml`	Development service containers
`.github/workflows/ci.yml`	GitHub Actions CI
`.github/workflows/staging-deploy.yml`	Staging Build & Deploy
`.github/workflows/production-release.yml`	Production Release

Prerequisites¶

Node.js: v20+ (Managed by system, not moon)
pnpm/npm: Required for frontend dependencies
Python: v3.12+ (Managed by uv)

Moon Commands (Primary Interface)¶

# Development
moon run :dev -- --backend        # Full Stack (App + DB + Redis + MinIO)
moon run :dev -- --frontend       # Next.js on :3000

# Local CI / Verification (Recommended)
moon run :lint && moon run :test                # One-button check (Lint + Format + Test + Check)
                            # Matches GitHub CI exactly.

# Testing
moon run :test                    # All tests (default, 90% backend coverage)
moon run :test -- --fast         # TDD mode (no coverage, fastest)
moon run :test -- --smart        # Coverage on changed files only
moon run :test -- --e2e          # E2E tests (Playwright)
moon run :test -- tests/accounting/  # Run specific module
moon run :test -- tests/accounting/test_journal_service.py  # Run specific file

# Environment Verification
# (See docs/ssot/env_smoke_test.md for full details)
uv run python -m src.boot --mode full  # Full Stack Check (Gate 3)

# Code Quality
moon run :lint              # Lint all
moon run :lint -- --fix     # Format Python (auto-fix)

# Build
moon run :build             # Build all

Documentation¶

The project uses MkDocs with Material theme for documentation.

Build & Serve Docs¶

# Install dependencies
pip install -r docs/requirements.txt

# Serve docs locally with live reload
mkdocs serve
# → Open http://127.0.0.1:8000

# Build static site
mkdocs build
# → Output: site/ directory

Documentation Structure¶

Path	Content
`docs/`	Source markdown files
`mkdocs.yml`	MkDocs configuration
`site/`	Generated static site (gitignored)

The live documentation is hosted at wangzitian0.github.io/finance_report.

Six Environments (SSOT)¶

Core Principle: "One Codebase, Multiple Environments" - Local uses containers + namespace isolation, CI emphasizes consistency, Production uses image deployment.

Environment Overview¶

#	Environment	URL	Trigger	Code Runtime	Infrastructure	Database	Isolation
1	Local Dev	`localhost:3000`	Manual `moon run :dev -- --backend`	Source (Host) uvicorn/next dev	Shared Containers (Podman/Docker)	`finance_report`	Container name suffix
2	Local CI	`localhost:3000`	Manual `moon run :lint && moon run :test`	Source (Host) pytest	Shared Containers (Podman/Docker)	`finance_report_test_{namespace}`	DB/bucket name
3	GitHub CI	-	Push/PR `ci.yml`	Source (Runner) pytest	GitHub Services (Ephemeral)	`finance_report_test`	Job isolation
4	PR Preview	`report-pr-123.zitian.party`	PR opened `pr-test.yml`	Docker Images (GHCR)	Dedicated Containers (Per PR)	Dedicated DB/Redis/MinIO	Container suffix `-pr-123`
5	Staging	`report-staging.zitian.party`	Push to main `staging-deploy.yml`	Docker Images (GHCR)	Dedicated infra2 + Shared Platform	Dedicated DB/Redis	Bucket name `-staging`
6	Production	`report.zitian.party`	Manual release `production-release.yml`	Docker Images (GHCR)	Dedicated infra2 + Shared Platform	Dedicated DB/Redis	Bucket name

Key Differences¶

Local Environments (Dev + CI)¶

Local Dev - One shared set of containers, isolated by different database names:
- Uses docker-compose.yml (Profile: infra)
- Persistent: Manually started, data preserved across runs
- Isolation: Multiple repo copies use namespace-aware DB names (finance_report, finance_report_dev_branch_a, etc.)
- S3: Shared local MinIO with namespace-aware buckets (statements, statements-branch-a)
- Command: moon run :dev -- --backend (or moon run :dev -- --infra + manual uvicorn)

Local CI - Reuses Local Dev containers, creates temporary test databases:
- Uses the same docker-compose.yml (Profile: infra)
- Ephemeral data: Test DB reset before each run, worker DBs auto-cleaned
- Isolation: finance_report_test_{namespace} + worker DBs (_gw0, _gw1, etc.)
- Command: moon run :lint && moon run :test (identical to the GitHub CI command)

GitHub Environments¶

GitHub CI - Temporary services, runs same commands as Local CI:
- Uses GitHub Actions services: (ephemeral Postgres container)
- Completely ephemeral: Destroyed after job finishes
- Command: moon run :lint && moon run :test (identical to Local CI)
- Database: finance_report_test (no namespace needed, job-isolated)

PR Preview - Full deployment with code changes:
- Builds Docker images from PR branch
- Deploys to Dokploy with unique URLs (report-pr-123.zitian.party)
- Ephemeral: Destroyed when PR closes
- Database/Redis/MinIO: Dedicated per-PR instances
- Isolation: Container name suffix -pr-123

Production Environments (Staging + Production)¶

Staging - Tracks latest main branch:
- Image deployment: Built from latest main commit after merge
- Deployed to Dokploy automatically on push to main
- Persistent data, stable environment for QA
- Uses dedicated DB/Redis + shared Platform (SigNoz, MinIO with bucket isolation)

Production - Manual release process:
- Image deployment: Built from version tags (v1.2.3)
- Manual trigger after Staging validation
- Most stable environment, persistent data
- Uses dedicated DB/Redis + shared Platform

Container/Database Naming Patterns¶

Environment	Backend Container	Frontend Container	Database	S3 Bucket
Local Dev	`finance-report-backend`	`finance-report-frontend`	`finance_report`	`statements`
Local CI	(uses Local Dev containers)	(uses Local Dev containers)	`finance_report_test_{namespace}`	`statements-{namespace}`
GitHub CI	(GitHub Services)	(N/A)	`finance_report_test`	`statements` (mock)
PR Preview	`finance_report-backend-pr-123`	`finance_report-frontend-pr-123`	`finance_report_postgres-pr-123`	(dedicated MinIO)
Staging	`finance_report-backend-staging`	`finance_report-frontend-staging`	`finance_report-postgres-staging`	`finance-report-staging`
Production	`finance_report-backend`	`finance_report-frontend`	`finance_report-postgres`	`finance-report-production`

See AGENTS.md for debugging container names.

Workflow Files Reference¶

Workflow File	Environment	Trigger	Actions
`.github/workflows/ci.yml`	GitHub CI	Push/PR to main	Run `moon run :lint && moon run :test`, upload coverage
`.github/workflows/pr-test.yml`	PR Preview	PR opened/sync	Build images, deploy to Dokploy, cleanup on close
`.github/workflows/staging-deploy.yml`	Staging	Push to main	Build images (`:staging` tag), deploy
`.github/workflows/production-release.yml`	Production	Tag `v*.*.*` or manual	Build release images, deploy on manual trigger

Shared Platform Resources¶

The production Platform layer (SigNoz, MinIO, Traefik) runs as Singleton services. Staging and PR environments use logical isolation:

Service	Scope	Isolation Method	Example
SigNoz	Singleton	`deployment.environment` tag	`staging`, `production`, `pr-47`
MinIO (Prod)	Singleton	Separate buckets	`finance-report-staging`, `finance-report-production`
Postgres	Dedicated	Separate containers/instances	One per environment
Redis	Dedicated	Separate containers/instances	One per environment

Note: PR Previews have dedicated MinIO/DB/Redis to allow destructive testing, but send logs to shared SigNoz.

Test Strategy by Environment¶

Environment	Tests Run	Purpose	Duration
Local Dev	None (manual testing)	Fast iteration	-
Local CI	Unit + Integration (90% backend, 96% unified)	Pre-push validation	~30s
GitHub CI	Unit + Integration (90% backend, 96% unified)	Quality gate	~2min
PR Preview	Health check only	Deployment validation	~30s
Staging	Smoke + Performance	Full validation	~5min
Production	Health check only	Availability check	~10s

Coverage Requirements¶

Backend line coverage: >= 90% (enforced by pytest-cov); 96% unified (enforced by calculate_unified_coverage.py)
Branch coverage: Required (via --cov-branch)
See TDD Transformation Plan for details

No-Regression Coverage Gate¶

The CI workflow enforces a no-regression policy for test coverage, preventing silent coverage drops between main branch commits.

How It Works¶

Baseline Storage: The coverage baseline is stored in unified-coverage.json at the repository root.
- Created automatically on the first successful CI run on main branch
- Updated automatically on each subsequent main branch push (if coverage changes)
- Committed by GitHub Actions bot with [skip ci] flag to prevent infinite loops
Comparison Logic:
- Before calculating final coverage, the unified-coverage job reads unified-coverage.json if it exists
- Compares current coverage against baseline for all components: unified, backend, frontend, scripts
- Uses round(x, 2) for floating-point comparison (same precision as JSON output)
- Zero tolerance: If ANY component is below baseline (current < baseline), CI fails immediately
- If baseline file doesn't exist, skips comparison and falls through to COVERAGE_THRESHOLD check (safety net)
Fail Conditions:
- Unified coverage drops below baseline: ❌ Unified coverage {current:.2f}% is below baseline {baseline:.2f}%
- Backend component drops below baseline: ❌ backend coverage {current:.2f}% is below baseline {baseline:.2f}%
- Frontend component drops below baseline: ❌ frontend coverage {current:.2f}% is below baseline {baseline:.2f}%
- Scripts component drops below baseline: ❌ scripts coverage {current:.2f}% is below baseline {baseline:.2f}%
- All components at or above baseline: ✅ No regression: all coverage at or above baseline
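
A minimal sketch of the comparison logic above. It is not the actual calculate_unified_coverage.py: the function name find_regressions and the flat {component: percent} dictionary layout are assumptions made for illustration.

```python
COMPONENTS = ("unified", "backend", "frontend", "scripts")


def find_regressions(current, baseline):
    """Return one error message per component that dropped below baseline."""
    errors = []
    for name in COMPONENTS:
        if name not in current or name not in baseline:
            continue  # assumption: components missing on either side are skipped
        cur = round(current[name], 2)  # same precision as the JSON output
        base = round(baseline[name], 2)
        if cur < base:  # zero tolerance: any drop fails
            label = "Unified" if name == "unified" else name
            errors.append(
                f"❌ {label} coverage {cur:.2f}% is below baseline {base:.2f}%"
            )
    return errors


# Hypothetical numbers: unified is unchanged after rounding, backend dropped.
current = {"unified": 87.314, "backend": 93.5, "frontend": 80.0, "scripts": 75.0}
baseline = {"unified": 87.31, "backend": 94.0, "frontend": 80.0, "scripts": 75.0}

problems = find_regressions(current, baseline)
for message in problems:
    print(message)
if not problems:
    print("✅ No regression: all coverage at or above baseline")
```

Rounding both sides before comparing keeps a value such as 87.314 from failing against a stored baseline of 87.31.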

Manual Baseline Reset¶

If you need to manually reset the baseline (e.g., after major refactoring):

# Option 1: Update baseline to current state
git pull origin main
# Make your changes, then:
git add unified-coverage.json && git commit -m "chore: manually reset coverage baseline" && git push

# Option 2: Remove baseline temporarily
git rm unified-coverage.json && git commit -m "chore: remove coverage baseline for testing" && git push

Warning: Removing the baseline disables the no-regression gate until the next main branch push recreates it.

Environment Variables¶

BASELINE_FILE: Path to baseline JSON file (default: unified-coverage.json)
COVERAGE_THRESHOLD: Safety net threshold (default: 0, no minimum enforced; baseline comparison is the primary no-regression gate)

Test Coverage¶

Unit tests in scripts/tests/test_calculate_unified_coverage.py::TestBaselineComparison verify:
- Equal coverage passes (no regression)
- Coverage drops fail with clear error messages
- Component-level drop detection (unified ok but individual component drops)
- Baseline file path configurable via BASELINE_FILE

CI Job Structure¶

The GitHub Actions workflow (.github/workflows/ci.yml) follows this job dependency order:

lint → backend shards (frontend runs in parallel) → unified-coverage → finish

Job Details¶

Job	Purpose	Depends On
lint	Static analysis (ruff check + format check)	None (first job)
backend (Shards 1-4)	Backend unit + integration tests	`needs: [lint]`
frontend	Frontend build + tests	None (runs in parallel with backend)
unified-coverage	Calculate unified coverage, compare to baseline, update Coveralls	`needs: [backend, frontend]`
finish	Aggregate all job results, fail if any job failed	`needs: [backend, frontend, lint, unified-coverage]`

Key Changes (CI Coverage Improvements)¶

Standalone Lint Job: Previously embedded in backend shard 1, now runs independently
- Fast failure: Lint failures surface in ~1 min instead of waiting for backend shard 1 to complete (~10 min)
- All backend shards depend on lint: needs: [lint]
Coveralls Upload Fixes: All three Coveralls upload steps now have github-token authentication
- Prevents silent upload failures (badge stays current)
- continue-on-error: true preserved (Coveralls downtime ≠ CI failure)
Baseline Auto-Update: On main branch pushes, the unified-coverage job automatically commits unified-coverage.json
- Only runs on github.ref == 'refs/heads/main' && github.event_name == 'push'
- Uses BASELINE_UPDATE_PAT secret for authentication
- Conditional commit: Only commits if baseline file changed (if ! git diff --staged --quiet)
- Commit message: [skip ci] prevents infinite loops
Coverage Threshold Update: Raised from 40% to 80% (closer to actual unified coverage of ~87%)
- Baseline comparison is primary gate; threshold remains as safety net

Common Commands¶

# Local Development
moon run :dev -- --infra                    # Start containers (Postgres/Redis/MinIO)
moon run :dev -- --backend               # Start backend dev server
moon run :dev -- --frontend              # Start frontend dev server

# Local CI (matches GitHub CI exactly)
moon run :lint && moon run :test                       # Lint + Format + Test + Build

# Isolated testing (multiple repo copies)
BRANCH_NAME=feature-auth moon run :test
BRANCH_NAME=feature-auth WORKSPACE_ID=alice moon run :test

Database Lifecycle¶

Database Management (Python Context Manager)¶

The scripts/test_lifecycle.py script uses a Python Context Manager (@contextmanager) to robustly handle the database lifecycle:

Setup: Checks for the container runtime (Podman/Docker), starts the postgres service via Docker Compose, and ensures the database is ready.
Isolation: Creates a dedicated finance_report_test database and runs migrations.
Teardown: Automatically stops the database container after tests complete, ensuring resources are freed.
Signal Handling: Catches SIGINT (Ctrl+C) and SIGTERM to perform cleanup even if the test run is interrupted.
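
The setup/teardown/signal pattern described above can be sketched as follows. This is an illustration of the technique, not the contents of scripts/test_lifecycle.py: managed_database and its start/stop callables are hypothetical stand-ins for the real container commands.

```python
import signal
from contextlib import contextmanager


@contextmanager
def managed_database(start, stop):
    """Run start(), yield to the tests, and always run stop() afterwards."""

    def _raise_exit(signum, frame):
        # Turn the signal into an exception so the finally blocks still run.
        raise SystemExit(128 + signum)

    # Install handlers and remember the previous ones so they can be restored.
    previous = {
        sig: signal.signal(sig, _raise_exit)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        start()
        try:
            yield
        finally:
            stop()  # teardown runs on success, failure, or interruption
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)


events = []
with managed_database(lambda: events.append("start"), lambda: events.append("stop")):
    events.append("tests")
print(events)  # ['start', 'tests', 'stop']
```

Converting SIGTERM into SystemExit is what lets one finally block cover both normal completion and an interrupted run. Note that signal.signal can only be called from the main thread.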

Local Test Isolation (Namespace-Based)¶

Purpose: Enable multiple repo copies (or branches) to run tests in parallel without conflicts.

How It Works:

Namespace Generation (priority order):
BRANCH_NAME (explicit) + WORKSPACE_ID (optional) → e.g., feature_auth_abc123
Git branch + repo path hash → e.g., feature_payments_beeba6ed
"default" (with warning if neither is set)
Isolated Resources:
Test Database: finance_report_test_{namespace}
Worker Databases: finance_report_test_{namespace}_gw0, gw1, etc. (pytest-xdist)
S3 Buckets: statements-{namespace}

Usage Examples:

# Explicit namespace (recommended for parallel development)
BRANCH_NAME=feature-auth moon run :test

# With workspace ID (multiple copies of same branch)
BRANCH_NAME=feature-auth WORKSPACE_ID=alice moon run :test
BRANCH_NAME=feature-auth WORKSPACE_ID=bob moon run :test

# Auto-detect from git branch (adds repo path hash)
moon run :test  # Uses current git branch

Automatic Cleanup:
Worker databases (_gw0, _gw1, etc.) are automatically cleaned up after test runs
Prevents database pollution from parallel test execution
See scripts/test_lifecycle.py → cleanup_worker_databases()

Implementation Details:
- Shared Podman containers (no port conflicts)
- Namespace-aware database and bucket names only
- See scripts/isolation_utils.py for namespace logic
- Integration tests: apps/backend/tests/infra/test_isolation.py

Key Features¶

Auto-detect runtime: podman compose / docker compose
Lock file: ~/.cache/finance_report/db.lock
Auto-cleanup: Last runner stops container
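
One way a lock file can support "last runner stops container" is a reference count guarded by an advisory file lock. The sketch below assumes that design; the real format of db.lock is not documented here, and register_runner / unregister_runner are hypothetical names. It relies on fcntl, so it is Unix-only.

```python
import fcntl
import os
import tempfile


def _change_count(lock_path, delta):
    """Atomically add delta to the counter stored in lock_path."""
    directory = os.path.dirname(lock_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks while another runner holds it
        try:
            count = max(int(handle.read().strip() or 0) + delta, 0)
            handle.seek(0)
            handle.truncate()
            handle.write(str(count))
            handle.flush()
            return count
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def register_runner(lock_path):
    """Call before tests. True means this runner is the first one."""
    return _change_count(lock_path, +1) == 1


def unregister_runner(lock_path):
    """Call after tests. True means this was the last runner."""
    return _change_count(lock_path, -1) == 0


# Demo against a throwaway path rather than ~/.cache/finance_report/db.lock.
demo_lock = os.path.join(tempfile.mkdtemp(), "cache", "db.lock")
first_started = register_runner(demo_lock)     # first runner
second_started = register_runner(demo_lock)    # second runner joins
second_last = unregister_runner(demo_lock)     # one runner still active
first_last = unregister_runner(demo_lock)      # last runner: stop the container
print(first_started, second_started, second_last, first_last)
```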

Isolation Utilities (`scripts/isolation_utils.py`)¶

Purpose: Support parallel test execution across multiple repo copies without resource conflicts.

Namespace Generation¶

The get_namespace() function generates a unique identifier for test resources based on:

# Priority 1: Explicit environment variables
BRANCH_NAME=feature-auth           # → "feature_auth"
BRANCH_NAME=feature-auth WORKSPACE_ID=alice  # → "feature_auth_alice"

# Priority 2: Git branch + repo path hash (auto-detect)
# On branch "feature-payments" at /path/to/repo
# → "feature_payments_beeba6ed"

# Priority 3: Fallback (with warning)
# → "default_abc12345"  # Includes repo path hash for isolation

Resource Naming Functions¶

Function	Input	Output	Purpose
`get_test_db_name(namespace)`	`"feature_auth"`	`"finance_report_test_feature_auth"`	Test database name
`get_s3_bucket(namespace)`	`"feature_auth"`	`"statements-feature_auth"`	S3 bucket name
`get_env_suffix(namespace)`	`"feature_auth"`	`"-feature_auth"`	Docker Compose suffix (future use)
`sanitize_namespace(name)`	`"feature/auth-v2"`	`"feature_auth_v2"`	Convert branch names to safe identifiers
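
The priority order and the naming functions in the table can be sketched together. This mirrors the documented inputs and outputs only; the real scripts/isolation_utils.py may differ. The explicit env / git_branch / repo_path parameters and the use of SHA-256 for the 8-character path hash are assumptions.

```python
import hashlib
import re
from typing import Mapping, Optional


def sanitize_namespace(name: str) -> str:
    """Collapse each run of non-alphanumeric characters into one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


def get_namespace(env: Mapping[str, str], git_branch: Optional[str], repo_path: str) -> str:
    # Priority 1: explicit BRANCH_NAME, optionally combined with WORKSPACE_ID
    branch = env.get("BRANCH_NAME")
    if branch:
        parts = [branch]
        if env.get("WORKSPACE_ID"):
            parts.append(env["WORKSPACE_ID"])
        return sanitize_namespace("_".join(parts))
    # Assumption: 8 hex characters of a SHA-256 over the repo path
    path_hash = hashlib.sha256(repo_path.encode()).hexdigest()[:8]
    # Priority 2: detected git branch plus repo path hash
    if git_branch:
        return f"{sanitize_namespace(git_branch)}_{path_hash}"
    # Priority 3: fallback (the real script also emits a warning)
    return f"default_{path_hash}"


def get_test_db_name(namespace: str) -> str:
    return f"finance_report_test_{namespace}"


def get_s3_bucket(namespace: str) -> str:
    return f"statements-{namespace}"


def get_env_suffix(namespace: str) -> str:
    return f"-{namespace}"


ns = get_namespace({"BRANCH_NAME": "feature-auth", "WORKSPACE_ID": "alice"}, None, "/path/to/repo")
print(ns)                    # feature_auth_alice
print(get_test_db_name(ns))  # finance_report_test_feature_auth_alice
```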

Integration Points¶

scripts/test_lifecycle.py:
Calls get_namespace() at test start
Sets TEST_NAMESPACE environment variable
Creates namespace-specific test database
Overrides S3_BUCKET with namespace-aware bucket
Cleans up worker databases (_gw0, _gw1, etc.) after tests
apps/backend/tests/conftest.py:
Reads TEST_NAMESPACE from environment
Generates worker-specific database URLs:
- Master: finance_report_test_{namespace}
- Worker 0: finance_report_test_{namespace}_gw0
- Worker 1: finance_report_test_{namespace}_gw1
- etc.
Contract Tests (apps/backend/tests/infra/test_isolation.py):
15 tests verifying isolation behavior
Tests namespace generation, database naming, S3 buckets
Verifies conftest integration
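
A sketch of how a conftest could derive the per-worker database name from TEST_NAMESPACE. pytest-xdist exposes the worker id through the PYTEST_XDIST_WORKER environment variable (gw0, gw1, ...) and reports "master" when tests are not distributed; the helper name worker_db_name and the "default" fallback are assumptions.

```python
import os


def worker_db_name(namespace: str, worker_id: str) -> str:
    """Master uses the base database; each xdist worker gets its own suffix."""
    base = f"finance_report_test_{namespace}"
    if worker_id in ("", "master"):
        return base
    return f"{base}_{worker_id}"


namespace = os.environ.get("TEST_NAMESPACE", "default")
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "master")
print(worker_db_name(namespace, worker_id))
```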

Practical Examples¶

Scenario 1: Single developer, multiple feature branches

# Terminal 1 (feature-auth branch)
cd ~/repos/finance_report
BRANCH_NAME=feature-auth moon run :test
# Uses: finance_report_test_feature_auth

# Terminal 2 (feature-payments branch)
cd ~/repos/finance_report
BRANCH_NAME=feature-payments moon run :test
# Uses: finance_report_test_feature_payments

Scenario 2: Multiple developers, same branch

# Alice's terminal
cd ~/repos/finance_report_alice
BRANCH_NAME=feature-auth WORKSPACE_ID=alice moon run :test
# Uses: finance_report_test_feature_auth_alice

# Bob's terminal
cd ~/repos/finance_report_bob
BRANCH_NAME=feature-auth WORKSPACE_ID=bob moon run :test
# Uses: finance_report_test_feature_auth_bob

Scenario 3: Auto-detection from git

cd ~/repos/finance_report
git checkout feature-payments
moon run :test
# Auto-detects: finance_report_test_feature_payments_<hash>
# Hash prevents collisions across different repo copies

Test Optimization¶

Test Modes¶

Mode	Command	Relative Run Time	Coverage	Use Case
Smart	`backend:test-smart`	~40%	Changed files 99%	Daily dev (recommended)
Ultra-fast	`backend:test-no-cov`	~30%	None	TDD red-green
Full	`backend:test`	100%	All files 94%	CI/pre-commit

Implementation¶

Scripts:
- scripts/get_changed_files.py - Detects changed Python files via git diff
- scripts/smart_test.py - Runs all tests, coverage on changed files only
- scripts/fast_test.py - Runs all tests, no coverage
- scripts/test_lifecycle.py - DB lifecycle (accepts coverage flags from callers)

Key fixes from PR #260:
- Removed hardcoded coverage flags from test_lifecycle.py
- Aggregate all git changes (branch diff + uncommitted + staged)
- Exclude deleted files, verify file existence
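
The aggregation described above (branch diff + uncommitted + staged, deleted files excluded, existence verified) could look roughly like this. It is a sketch, not scripts/get_changed_files.py: the function names and the origin/main comparison base are assumptions.

```python
import os
import subprocess


def git_lines(*args):
    """Run a git command and return its non-empty output lines."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=False
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def collect_changed_python_files(base="origin/main", run=git_lines, exists=os.path.isfile):
    """Union of branch, working-tree, and staged changes, limited to existing .py files."""
    name_only = ("diff", "--name-only", "--diff-filter=d")  # lowercase d: drop deletions
    candidates = set()
    candidates.update(run(*name_only, f"{base}...HEAD"))  # committed on this branch
    candidates.update(run(*name_only))                    # uncommitted edits
    candidates.update(run(*name_only, "--cached"))        # staged edits
    return sorted(p for p in candidates if p.endswith(".py") and exists(p))
```

The run and exists parameters are injectable so the merge logic can be unit-tested without a git repository.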

CI Optimization (.github/workflows/ci.yml):
- 4-way parallel test sharding via pytest-split
- Each shard: pytest --splits 4 --group N
- Coverage reports merged post-run

[!IMPORTANT] Local CI vs GitHub CI Parallelism

Environment	Parallelism	Test Scope	Resource Usage
GitHub CI	`-n auto` + `--splits 4`	~25% tests per shard	Low (ephemeral runners)
Local CI	`-n 4` (fixed)	100% tests	Controlled (shared machine)

GitHub CI uses -n auto because each shard only runs ~25% of tests on ephemeral runners. Local CI uses -n 4 to prevent resource exhaustion when running the full test suite. This is intentional design, not inconsistency.

Resource Lifecycle Management¶

All resources are bound to either dev server lifecycle (Ctrl+C) or test lifecycle (start/end).

Dev Server Lifecycle (`scripts/dev_*.py`)¶

┌─────────────────────────────────────────────────────────────────┐
│ User runs: moon run :dev -- --backend                                 │
│ ┌─────────┐    ┌─────────┐    ┌─────────┐                      │
│ │ Start   │ -> │ Server  │ -> │ Ctrl+C  │                      │
│ │ Stack   │    │ Runs    │    │ Cleanup │                      │
│ └─────────┘    └─────────┘    └─────────┘                      │
│      │                               │                          │
│  (DB,Redis,MinIO)              Stops: uvicorn (PID)             │
│                                       (Containers persist)      │
└─────────────────────────────────────────────────────────────────┘

Key safety feature: Scripts track processes by PID, only kill what THEY started. Safe for multi-window development - won't kill other sessions' processes.

Resources managed by dev scripts:

Script	Resources Started	Cleaned up on Ctrl+C
`dev_backend.py`	uvicorn (PID), Full Stack Containers	✓ uvicorn only (Containers stay for speed)
`dev_frontend.py`	Next.js (PID tracked)	✓ Only ours
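
The "only kill what we started" rule can be sketched by keeping the Popen handles of child processes and terminating exactly those on exit. This is an illustration of the technique, not the code in scripts/dev_*.py; launch and cleanup are hypothetical names, and a sleeping Python child stands in for uvicorn.

```python
import subprocess
import sys

started = []  # handles for processes launched by THIS script only


def launch(cmd):
    """Start a child process and remember its handle (and therefore its PID)."""
    proc = subprocess.Popen(cmd)
    started.append(proc)
    return proc


def cleanup():
    """Terminate only the processes recorded in started."""
    for proc in started:
        if proc.poll() is None:  # still running
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if it ignores SIGTERM
                proc.wait()
    started.clear()


# Stand-in for a dev server; a real script would call cleanup() on Ctrl+C.
child = launch([sys.executable, "-c", "import time; time.sleep(60)"])
cleanup()
print(child.poll() is not None)  # True
```

Because cleanup never searches by process name, a server started from another terminal window is left untouched.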

Test Lifecycle (`scripts/test_lifecycle.py`)¶

┌─────────────────────────────────────────────────────────────────┐
│ User runs: moon run :test                                │
│ ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│ │ Start   │ -> │ Create  │ -> │ pytest  │ -> │ Cleanup │      │
│ │ DB      │    │ test DB │    │ runs    │    │ (trap)  │      │
│ └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                                   │             │
│                          Stops: DB (if refcount=0), playwright  │
└─────────────────────────────────────────────────────────────────┘

Resources managed by test script:

Resource	Start	Stop
Test DB container	Before tests	After last test runner exits
Playwright driver	By pytest	Cleanup on test end
Child processes	By pytest	`pkill -P $$` on exit

Smoke Tests (`scripts/smoke_test.sh`)¶

Usage¶

# Local (after starting servers)
bash scripts/smoke_test.sh

# Against staging/prod
BASE_URL=https://report.zitian.party bash scripts/smoke_test.sh

Endpoints Tested¶

Endpoint	Check
`/`	Homepage loads
`/api/health`	Returns "healthy"
`/api/docs`	Swagger UI loads
`/ping-pong`	Demo page loads
`/reconciliation`	Workbench loads
`/api/ping`	Ping API responds
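
For readers who prefer Python, the endpoint checks in the table can be approximated as below. The real checks live in scripts/smoke_test.sh; this sketch only verifies HTTP 200 plus the "healthy" substring for /api/health, and the function names are assumptions.

```python
from urllib.request import urlopen

# (path, substring expected in the body or None)
CHECKS = [
    ("/", None),
    ("/api/health", "healthy"),
    ("/api/docs", None),
    ("/ping-pong", None),
    ("/reconciliation", None),
    ("/api/ping", None),
]


def http_fetch(url, timeout=10):
    """Return (status, body); urlopen raises an OSError subclass on HTTP or network errors."""
    with urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", "replace")


def run_smoke(base_url, fetcher=http_fetch):
    """Return a list of failure descriptions; an empty list means all checks passed."""
    failures = []
    for path, expected in CHECKS:
        try:
            status, body = fetcher(base_url.rstrip("/") + path)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            failures.append(f"{path}: {exc}")
            continue
        if status != 200:
            failures.append(f"{path}: status {status}")
        elif expected and expected not in body:
            failures.append(f"{path}: missing {expected!r}")
    return failures
```

Calling run_smoke with the same BASE_URL value used for the shell script returns the list of failing endpoints.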

Deployment Architecture¶

Dual-Repository Model¶

Finance Report uses two git repositories for configuration:

Environment	Configuration Source	Purpose
Local/CI/PR	`/docker-compose.yml`	Development + PR previews
Staging/Production	`/repo/finance_report/.../compose.yaml`	Production with Vault secrets

The /repo/ directory is a git submodule pointing to infra2 (infrastructure repo).

Key implications:
- Workflows build images and trigger deployments
- Actual deployment config managed in infra2
- Env vars for staging/prod stored in HashiCorp Vault
- Container names include env suffix (e.g., -staging)

Secret Injection Flow¶

Production deployments use Vault sidecar pattern:

1. Dokploy pulls compose.yaml from infra2
2. vault-agent sidecar starts → renders /secrets/.env
3. Backend waits for secrets (CHECKPOINT-1)
4. Alembic runs migrations (CHECKPOINT-2)
5. Uvicorn starts application (CHECKPOINT-3)

Health check timeout (6min) accounts for this entire flow.
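
Step 3 (waiting for secrets) amounts to polling for the rendered file. The production entrypoint is a shell script in infra2; the Python sketch below only illustrates the idea, and the non-empty check, the 60-second default, and the polling interval are assumptions.

```python
import os
import tempfile
import time


def wait_for_secrets(path="/secrets/.env", timeout=60.0, interval=2.0):
    """Poll until path exists and is non-empty; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo in a temporary directory instead of /secrets.
with tempfile.TemporaryDirectory() as tmp:
    env_file = os.path.join(tmp, ".env")
    before = wait_for_secrets(env_file, timeout=0.1, interval=0.02)
    with open(env_file, "w") as handle:
        handle.write("DATABASE_URL=placeholder\n")
    after = wait_for_secrets(env_file, timeout=0.1, interval=0.02)
print(before, after)  # False True
```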

Container Naming¶

Environment	Backend	Database
Local/CI	`finance-report-backend`	`finance-report-db`
PR #47	`finance-report-backend-pr-47`	`finance-report-db-pr-47`
Staging	`finance_report-backend-staging`	`finance_report-postgres-staging`
Production	`finance_report-backend`	`finance_report-postgres`

Note: Local uses hyphens (Compose), prod uses underscores (Dokploy).

CI Workflows¶

ci.yml (PR/push)¶

Trigger: PR or push to main
Steps:  install → lint → test
DB:     GitHub services (ephemeral)
Smoke:  ❌ Not run (unit tests only)
Note:   Uses moon tasks for install/lint/build (uv/npm invoked via moon)

Deployment Workflows¶

Helper scripts: scripts/dokploy_deploy.sh, scripts/health_check.sh

staging-deploy.yml¶

Trigger: Push to main (apps/** changed)
Flow: Build (commit SHA) → Deploy → Health (6min) → E2E tests
URL: https://report-staging.zitian.party

production-release.yml¶

Triggers:
  - Tag push (v*.*.*): Build release images
  - Manual dispatch: Deploy to production

Build job: Tag → Build backend + frontend → Push to GHCR
Deploy job: Verify images → Deploy → Health (4min) → Smoke test

URL: https://report.zitian.party

Version Release Workflow¶

Manual control for stable releases and cherry-picks:

# Create release tag
git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3
# → Triggers production-release.yml (build job)
# → Images: ghcr.io/.../finance_report-{backend,frontend}:v1.2.3

# Deploy to production (manual)
# → Actions → Production Release → Run workflow → Select v1.2.3

Hotfix flow:

git checkout -b hotfix/bug v1.2.3
git cherry-pick abc123
git tag -a v1.2.4 -m "Hotfix: critical bug"
git push origin v1.2.4
# → Build automatically, deploy manually

Deployment Failures¶

Symptom	Cause	Resolution
Stuck "Waiting for secrets"	Vault token expired	`invoke vault.setup-tokens --project=finance_report`
6min timeout	Migration failed	Check SigNoz for CHECKPOINT-2 errors
"Image not found"	Tag not built	`git push origin v1.2.3` to trigger build
502 Bad Gateway	Backend crashed	Check CHECKPOINT-3 in SigNoz logs

Vault Token Lifecycle¶

Staging and production deployments use HashiCorp Vault for secrets management. The vault-agent sidecar renders secrets to /secrets/.env using an app token.

Token Properties¶

Property	Value
Token TTL	768 hours (~32 days)
Secrets file path	`/secrets/.env`
Staleness threshold	1 hour (bootloader warning)

Check Token Status¶

# SSH into VPS
ssh root@$VPS_HOST

# Check vault-agent logs for token issues
docker logs finance_report-vault-agent-staging 2>&1 | tail -20

# Check if secrets file exists and when it was last modified
docker exec finance_report-backend-staging ls -la /secrets/.env

Regenerate Tokens¶

When a token expires, the vault-agent cannot refresh secrets, causing the backend to hang at "Waiting for secrets".

# From local machine with infra2 repo
cd /path/to/infra2

# Regenerate tokens (requires Vault root access)
invoke vault.setup-tokens --project=finance_report

# Restart vault-agent to pick up new token
ssh root@$VPS_HOST "docker restart finance_report-vault-agent-staging"

Monitoring (Bootloader Check)¶

The bootloader includes a _check_vault_secrets() method that runs in FULL mode:

Missing secrets file: Warning with regeneration instructions
Stale secrets file (>1 hour old): Warning that vault-agent may have stopped
Fresh secrets file: OK status with last modified time

This check runs during smoke tests (bash scripts/smoke_test.sh) and provides early warning of token issues.
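
The three outcomes listed above map onto a small file-age check. This is a sketch of the idea rather than the bootloader's _check_vault_secrets(); the return shape (status, message) and the now parameter are assumptions added to make the example testable.

```python
import os
import tempfile
import time

STALENESS_SECONDS = 3600  # 1 hour, per the Token Properties table


def check_vault_secrets(path="/secrets/.env", now=None):
    """Return ("ok" | "warning", message) for the rendered secrets file."""
    if not os.path.exists(path):
        return "warning", f"{path} is missing; regenerate tokens with invoke vault.setup-tokens"
    modified = os.path.getmtime(path)
    now = time.time() if now is None else now
    age = now - modified
    if age > STALENESS_SECONDS:
        return "warning", f"{path} is {age / 3600:.1f}h old; vault-agent may have stopped"
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(modified))
    return "ok", f"{path} last modified {stamp}"


with tempfile.NamedTemporaryFile(suffix=".env") as demo:
    fresh = check_vault_secrets(demo.name)
    two_hours_later = os.path.getmtime(demo.name) + 2 * STALENESS_SECONDS
    stale = check_vault_secrets(demo.name, now=two_hours_later)
missing = check_vault_secrets(demo.name)  # file is deleted once the block exits
print(fresh[0], stale[0], missing[0])  # ok warning warning
```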

Deployment Architecture¶

Environment Flow¶

┌─────────────────────────────────────────────────────────────────┐
│ Development Flow                                                 │
│                                                                  │
│   Local Dev        PR/Branch           Staging         Prod     │
│   ┌─────────┐      ┌─────────┐      ┌─────────┐    ┌─────────┐ │
│   │ docker  │  →   │ CI test │  →   │ Auto on │ →  │ Manual  │ │
│   │ compose │      │ + PR    │      │ main    │    │ tag +   │ │
│   │         │      │ preview │      │ merge   │    │ dispatch│ │
│   └─────────┘      └─────────┘      └─────────┘    └─────────┘ │
│                                                                  │
│   docker-compose   pr-test.yml      staging-       production-  │
│   .yml             ci.yml           deploy.yml     release.yml  │
└─────────────────────────────────────────────────────────────────┘

Database Migrations Testing Strategy¶

Migrations are tested at multiple stages:

Local Development: Manual testing with alembic upgrade head before committing
GitHub CI: pytest validates model definitions and constraints
Staging Deployment: First automated test of migrations via entrypoint
Production Deployment: Only after staging validation

Before deploying schema changes:
- Test locally: cd apps/backend && alembic upgrade head
- Ensure backward-compatible migrations (for rollback)
- Consider: existing data, indexes, constraints

Staging Deployment (Automatic)¶

Staging deploys automatically when:
1. Push to main branch
2. Changes in apps/backend/** or apps/frontend/**

The workflow (staging-deploy.yml):
1. Builds images with commit SHA tag
2. Pushes to GHCR
3. Deploys to Dokploy staging
4. Runs health check + E2E tests

Production Deployment (Manual)¶

Production deployment is a two-step process:

Build: Create a git tag (triggers production-release.yml)

git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3
# → Builds images: ghcr.io/.../finance_report-{backend,frontend}:v1.2.3

Deploy: Manual workflow dispatch

# Via GitHub Actions UI:
# Actions → "Production Release" → Run workflow → Select tag

# Or via gh CLI:
gh workflow run production-release.yml

The deploy job:
1. Verifies images exist in GHCR
2. Deploys to Dokploy production
3. Runs health check + smoke tests

Database Migrations¶

Migrations run automatically on container startup via the entrypoint:

# In infra2 compose.yaml
entrypoint:
  - sh
  - -c
  - |
    cd /app && export PYTHONPATH=/app
    # Wait for secrets...
    alembic upgrade head  # ← Runs migrations
    exec uvicorn src.main:app --host 0.0.0.0 --port 8000

Important: Before deploying schema changes:
1. Test migration locally with docker-compose.yml
2. Ensure migration is backward-compatible (for rollback)
3. Consider: existing data, indexes, constraints

Environment Variables¶

Scenario	DATABASE_URL	Hostname Strategy
Local Dev	`postgresql+asyncpg://...`	`localhost` or `postgres`
Local Test	`postgresql+asyncpg://...`	`localhost:5432`
PR Test	`postgresql+asyncpg://...`	Unique: `finance-report-db-pr-XX`
CI	Same as Local Test	`localhost:5433` (services)
Staging/Prod	External PostgreSQL	Dokploy Managed

Verification¶

# Verify moon commands work
moon run :test

# Test smoke tests locally
nohup moon run :dev -- --backend > /dev/null 2>&1 &
sleep 10
export BASE_URL="http://localhost:8000"
bash scripts/smoke_test.sh

# Check no orphan containers after tests
podman ps | grep -E 'finance[-_]report'  # local containers use hyphens, deployed ones underscores

Resource Cleanup¶

Automatic Cleanup (Recommended)¶

Install the post-push hook to automatically clean orphaned test databases after every git push:

./scripts/install_git_hooks.sh

This hook safely removes:
- ✅ Test databases from interrupted test runs (e.g., Ctrl+C)
- ✅ Worker databases left by pytest-xdist crashes
- ❌ Does NOT touch development data or running tests

Manual Cleanup¶

Clean Orphaned Test Databases¶

After interrupted test runs (Ctrl+C, SIGKILL, OOM):

# Preview what would be deleted
python scripts/cleanup_orphaned_dbs.py --dry-run

# Clean orphaned databases only (safe)
python scripts/cleanup_orphaned_dbs.py

# Clean ALL test databases (use with caution)
python scripts/cleanup_orphaned_dbs.py --all

Clean All Development Resources¶

⚠️ WARNING: The --all option deletes ALL local data, including volumes and MinIO data!

# Clean containers and locks only (safe)
./scripts/cleanup_dev_resources.sh

# Clean EVERYTHING including volumes and MinIO data (data loss!)
./scripts/cleanup_dev_resources.sh --all

Monitor Resource Leaks¶

Run weekly to detect accumulated leaks across all 6 environments:

# Quick check
./scripts/check_resource_leaks.sh

# Detailed report with listings
./scripts/check_resource_leaks.sh --verbose

# Include VPS PR volume check (requires SSH access)
VPS_HOST=cloud.zitian.party ./scripts/check_resource_leaks.sh

The monitoring script checks:
1. Local Worker Databases - Orphaned _gw* databases
2. Local Docker Volumes - Total count and size
3. Local MinIO Data - Storage usage
4. VPS PR Volumes - Orphaned -pr-* volumes (requires SSH)
5. GHCR PR Images - Orphaned pr-* tags (requires gh CLI)
6. Cache Files - Active namespace tracker state

PR Preview Cleanup (Automated)¶

When a PR is closed, GitHub Actions automatically cleans:
- ✅ Dokploy stack on VPS
- ✅ Docker volumes (postgres_data, redis_data, minio_data)
- ✅ GHCR container images (backend:pr-{number}, frontend:pr-{number})

To enable VPS volume cleanup, add the VPS_SSH_KEY secret to your GitHub repository:

# Generate SSH key pair (if not exists)
ssh-keygen -t ed25519 -f ~/.ssh/finance_report_vps -N ""

# Add public key to VPS
ssh-copy-id -i ~/.ssh/finance_report_vps.pub root@cloud.zitian.party

# Add private key to GitHub Secrets
# Settings → Secrets → Actions → New repository secret
# Name: VPS_SSH_KEY
# Value: (paste content of ~/.ssh/finance_report_vps)

Engineering Standards¶

Environment Variable Lifecycle¶

Variables follow a strict "Bake vs. Runtime" flow:

flowchart TD
    Start[I need a new Env Var] --> Type{Is it for?}

    Type -->|Frontend| Front[Next.js Public]
    Type -->|Backend| Back[FastAPI Runtime]
    Type -->|Secret| Secret[Production Secret]

    Front --> F1[Add to .env.example]
    F1 --> F2[Add to Dockerfile ARG]
    F2 --> F3[Add to docker-compose.yml args]
    F3 --> F4[Use NEXT_PUBLIC_ prefix]

    Back --> B1[Add to .env.example]
    B1 --> B2[Add to apps/backend/src/config.py]
    B2 --> B3[Set default value in config.py]

    Secret --> S1[Add to secrets.ctmpl]
    S1 --> S2[Add to config.py]
    S2 --> S3[Add to .env.example]

    style Start fill:#f9f,stroke:#333,stroke-width:2px
    style Front fill:#e1f5fe
    style Back fill:#e8f5e9
    style Secret fill:#ffebee

Frontend (Next.js):
- Variables prefixed with NEXT_PUBLIC_ are "baked" into the static JS bundle during npm run build.
- Requirement: These must be defined as ARG in apps/frontend/Dockerfile. See: apps/frontend/Dockerfile
- Requirement: They must also be passed in docker-compose.yml under args.
Backend (FastAPI):
- Variables are loaded at runtime via Pydantic Settings.
- Requirement: All variables must have a type and default in apps/backend/src/config.py. See: apps/backend/src/config.py
- Requirement: Must be documented in .env.example. See: .env.example
Production (Vault):
- Secrets are stored in Vault and rendered by vault-agent using secrets.ctmpl.
- Consistency: CI runs scripts/check_env_keys.py to ensure secrets.ctmpl, config.py, and .env.example are aligned.
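
The consistency rule can be pictured as a set comparison. The sketch below is not `scripts/check_env_keys.py`. It assumes that every key should appear in all three sources and that the key names for `config.py` and `secrets.ctmpl` have already been extracted. `DATABASE_URL` and `S3_BUCKET` are placeholder keys.

```python
def env_example_keys(text):
    """Collect KEY names from KEY=value lines, skipping blanks and comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, _ = line.partition("=")
        if sep:
            keys.add(key.strip())
    return keys


def missing_keys(sources):
    """Map each source name to the keys it lacks relative to the union of all sources."""
    all_keys = set().union(*sources.values())
    return {
        name: sorted(all_keys - keys)
        for name, keys in sources.items()
        if all_keys - keys
    }


# Placeholder inputs: secrets.ctmpl lacks S3_BUCKET.
sources = {
    ".env.example": env_example_keys("DATABASE_URL=postgres://x\n# comment\nS3_BUCKET=dev\n"),
    "config.py": {"DATABASE_URL", "S3_BUCKET"},
    "secrets.ctmpl": {"DATABASE_URL"},
}
print(missing_keys(sources))
```

An empty result means the sources agree. A CI check built this way would fail on any non-empty result.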

Cross-Repo Synchronization¶

The repo/ directory is a submodule pointing to infra2.

Logic: Main Repo (finance_report).
Infrastructure: Submodule (infra2).
Workflow (when a change requires a new environment variable, or changes to docker-compose.yml labels/configs for production):
1. Create a branch in repo/.
2. Commit changes to repo/finance_report/finance_report/10.app/.
3. Push and create a PR in infra2.
4. Once merged, update the submodule pointer in the Main Repo PR.

CI Performance & Test Strategy¶

Current Metrics (2025-02-02)¶

CI Pipeline:
- Total duration: 6m 24s (Backend: 5m 52s, Frontend: 1m 30s)
- Test execution: 893 tests in 4m 47s (320ms avg - excellent)
- Caching: UV ✅ (2.9s), Next.js ✅, venv ✅

Test Coverage:
- Overall: 94.51% (exceeds 94% requirement)
- Critical gap: Service layer 28% (high production risk)
  - reporting.py: 7%, reconciliation.py: 9%, review_queue.py: 9%
  - Root cause: Tests focus on happy paths via routers, not direct service calls

Test Organization:
- SSOT-aligned structure (73 files across domains)
- pytest-xdist with 2 workers (1.7x speedup)
- Minimal slow tests (only 1 marked)

Backend Test Parallelization¶

Current setup (pytest-xdist):

# In pyproject.toml
[tool.pytest.ini_options]
addopts = "-n 2"  # 2 workers for parallel execution

Further parallelization options:

Increase worker count (limited by CPU cores)

pytest -n auto  # Auto-detect CPU count
pytest -n 4     # 4 workers (if 4+ cores available)

- Current: 2 workers (~1.7x speedup)
- Potential: 4 workers on 4-core CI runners (~2.5-3x speedup)
- Diminishing returns beyond CPU count

Split tests in CI (GitHub Actions parallel jobs)

backend-unit:  # Fast unit tests (~2m)
  run: pytest tests/infra/ tests/api/ tests/auth/

backend-integration:  # Slower integration tests (~3m)
  run: pytest tests/accounting/ tests/reconciliation/

- Benefit: True parallel execution (not limited by single runner)
- Drawback: Need separate DB instances per job

Current bottlenecks:
- DB setup: ~35s (unavoidable for integration tests)
- Large test files: Some tests >1s (acceptable for integration)
- Coverage calculation: ~3s (minimal impact)

Recommendation:
- Quick win: Increase to -n 4 workers (~30s savings)
- Long-term: Split into 2-3 parallel CI jobs (~1-2m savings)
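
The worker-count estimates above (~1.7x at 2 workers, ~2.5-3x at 4) can be sanity-checked with Amdahl's law. Amdahl's law is a modelling assumption added here, not something measured on this pipeline. The sketch derives the parallelizable fraction from the observed 2-worker speedup and extrapolates to 4 workers.

```python
def parallel_fraction(speedup, workers):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / workers)


def amdahl_speedup(p, workers):
    """Predicted speedup for parallel fraction p on the given number of workers."""
    return 1 / ((1 - p) + p / workers)


# Observed in this document: ~1.7x with 2 workers.
p = parallel_fraction(1.7, 2)
print(f"parallel fraction: {p:.2f}")
print(f"predicted speedup at 4 workers: {amdahl_speedup(p, 4):.2f}x")
```

Under this model roughly 82% of the run parallelizes, and 4 workers give about 2.6x, which sits inside the ~2.5-3x range quoted above. The remaining serial share (for example, DB setup) is what caps further gains.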

Priority Actions¶

1. Address service coverage gaps 🔴 (highest production risk)
   - Add error path tests for core services
   - Target: 80% service coverage (from 28%)
2. Increase pytest workers ⚡ (quick win)
   - Change -n 2 to -n 4 in pyproject.toml
3. Split CI jobs 🟡 (medium effort, 1-2m savings)
   - Separate unit vs integration tests in GitHub Actions

Environment	Parallelism	Test Scope	Resource Usage
GitHub CI	`-n auto` + `--splits 4`	~25% tests per shard	Low (ephemeral runners)
Local CI	`-n 4` (fixed)	100% tests	Controlled (shared machine)
