Skip to content

Development Environment SSOT

SSOT Key: development Source of Truth for local development, testing, CI, and deployment.

Source Files

Prerequisites

  • Node.js: v20+ (Managed by system, not moon)
  • pnpm/npm: Required for frontend dependencies
  • Python: v3.12+ (Managed by uv)
File Purpose
moon.yml Root workspace tasks
apps/*/moon.yml Per-project tasks
scripts/test_lifecycle.py Database lifecycle (Python Context Manager)
scripts/smoke_test.sh Unified smoke tests
docker-compose.yml Development service containers
.github/workflows/ci.yml GitHub Actions CI
.github/workflows/staging-deploy.yml Staging Build & Deploy
.github/workflows/production-release.yml Production Release

Moon Commands (Primary Interface)

# Development
moon run :dev -- --backend        # Full Stack (App + DB + Redis + MinIO)
moon run :dev -- --frontend       # Next.js on :3000

# Local CI / Verification (Recommended)
moon run :lint && moon run :test                # One-button check (Lint + Format + Test + Check)
                            # Matches GitHub CI exactly.

# Testing
moon run :test                    # All tests (default, 90% backend coverage)
moon run :test -- --fast         # TDD mode (no coverage, fastest)
moon run :test -- --smart        # Coverage on changed files only
moon run :test -- --e2e          # E2E tests (Playwright)
moon run :test -- tests/accounting/  # Run specific module
moon run :test -- tests/accounting/test_journal_service.py  # Run specific file

# Environment Verification
# (See docs/ssot/env_smoke_test.md for full details)
uv run python -m src.boot --mode full  # Full Stack Check (Gate 3)

# Code Quality
moon run :lint              # Lint all
moon run :lint -- --fix     # Format Python (auto-fix)
moon run :lint -- --fix     # Format Python

# Build
moon run :build             # Build all

Documentation

The project uses MkDocs with Material theme for documentation.

Build & Serve Docs

# Install dependencies
pip install -r docs/requirements.txt

# Serve docs locally with live reload
mkdocs serve
# → Open http://127.0.0.1:8000

# Build static site
mkdocs build
# → Output: site/ directory

Documentation Structure

Path Content
docs/ Source markdown files
mkdocs.yml MkDocs configuration
site/ Generated static site (gitignored)

The live documentation is hosted at wangzitian0.github.io/finance_report.


Six Environments (SSOT)

Core Principle: "One Codebase, Multiple Environments" - Local uses containers + namespace isolation, CI emphasizes consistency, Production uses image deployment.

Environment Overview

# Environment URL Trigger Code Runtime Infrastructure Database Isolation
1 Local Dev localhost:3000 Manual
moon run :dev -- --backend
Source (Host)
uvicorn/next dev
Shared Containers
(Podman/Docker)
finance_report Container name suffix
2 Local CI localhost:3000 Manual
moon run :lint && moon run :test
Source (Host)
pytest
Shared Containers
(Podman/Docker)
finance_report_test_{namespace} DB/bucket name
3 GitHub CI - Push/PR
ci.yml
Source (Runner)
pytest
GitHub Services
(Ephemeral)
finance_report_test Job isolation
4 PR Preview report-pr-123.zitian.party PR opened
pr-test.yml
Docker Images
(GHCR)
Dedicated Containers
(Per PR)
Dedicated DB/Redis/MinIO Container suffix
-pr-123
5 Staging report-staging.zitian.party Push to main
staging-deploy.yml
Docker Images
(GHCR)
Dedicated infra2
+ Shared Platform
Dedicated DB/Redis Bucket name
-staging
6 Production report.zitian.party Manual release
production-release.yml
Docker Images
(GHCR)
Dedicated infra2
+ Shared Platform
Dedicated DB/Redis Bucket name

Key Differences

Local Environments (Dev + CI)

Local Dev - One shared set of containers, isolated by different database names: - Uses docker-compose.yml (Profile: infra) - Persistent: Manually started, data preserved across runs - Isolation: Multiple repo copies use namespace-aware DB names (finance_report, finance_report_dev_branch_a, etc.) - S3: Shared local MinIO with namespace-aware buckets (statements, statements-branch-a) - Command: moon run :dev -- --backend (or moon run :dev -- --infra + manual uvicorn)

Local CI - Reuses Local Dev containers, creates temporary test databases: - Uses same docker-compose.yml (Profile: infra) - Ephemeral data: Test DB reset before each run, worker DBs auto-cleaned - Isolation: finance_report_test_{namespace} + worker DBs (_gw0, _gw1, etc.) - Command: moon run :lint && moon run :test (includes moon run :test) - Matches GitHub CI command exactly (moon run :lint && moon run :test)

GitHub Environments

GitHub CI - Temporary services, runs same commands as Local CI: - Uses GitHub Actions services: (ephemeral Postgres container) - Completely ephemeral: Destroyed after job finishes - Command: moon run :lint && moon run :test (identical to Local CI) - Database: finance_report_test (no namespace needed, job-isolated)

PR Preview - Full deployment with code changes: - Builds Docker images from PR branch - Deploys to Dokploy with unique URLs (report-pr-123.zitian.party) - Ephemeral: Destroyed when PR closes - Database/Redis/MinIO: Dedicated per-PR instances - Isolation: Container name suffix -pr-123

Production Environments (Staging + Production)

Staging - Tracks latest main branch: - Image deployment: Built from latest main commit after merge - Deployed to Dokploy automatically on push to main - Persistent data, stable environment for QA - Uses dedicated DB/Redis + shared Platform (SigNoz, MinIO with bucket isolation)

Production - Manual release process: - Image deployment: Built from version tags (v1.2.3) - Manual trigger after Staging validation - Most stable environment, persistent data - Uses dedicated DB/Redis + shared Platform

Container/Database Naming Patterns

Environment Backend Container Frontend Container Database S3 Bucket
Local Dev finance-report-backend finance-report-frontend finance_report statements
Local CI (uses Local Dev containers) (uses Local Dev containers) finance_report_test_{namespace} statements-{namespace}
GitHub CI (GitHub Services) (N/A) finance_report_test statements (mock)
PR Preview finance_report-backend-pr-123 finance_report-frontend-pr-123 finance_report_postgres-pr-123 (dedicated MinIO)
Staging finance_report-backend-staging finance_report-frontend-staging finance_report-postgres-staging finance-report-staging
Production finance_report-backend finance_report-frontend finance_report-postgres finance-report-production

See AGENTS.md for debugging container names.

Workflow Files Reference

Workflow File Environment Trigger Actions
.github/workflows/ci.yml GitHub CI Push/PR to main Run moon run :lint && moon run :test, upload coverage
.github/workflows/pr-test.yml PR Preview PR opened/sync Build images, deploy to Dokploy, cleanup on close
.github/workflows/staging-deploy.yml Staging Push to main Build images (:staging tag), deploy
.github/workflows/production-release.yml Production Tag v*.*.* or manual Build release images, deploy on manual trigger

Shared Platform Resources

The production Platform layer (SigNoz, MinIO, Traefik) runs as Singleton services. Staging and PR environments use logical isolation:

Service Scope Isolation Method Example
SigNoz Singleton deployment.environment tag staging, production, pr-47
MinIO (Prod) Singleton Separate buckets finance-report-staging, finance-report-production
Postgres Dedicated Separate containers/instances One per environment
Redis Dedicated Separate containers/instances One per environment

Note: PR Previews have dedicated MinIO/DB/Redis to allow destructive testing, but send logs to shared SigNoz.

Test Strategy by Environment

Environment Tests Run Purpose Duration
Local Dev None (manual testing) Fast iteration -
Local CI Unit + Integration (90% backend, 96% unified) Pre-push validation ~30s
GitHub CI Unit + Integration (90% backend, 96% unified) Quality gate ~2min
PR Preview Health check only Deployment validation ~30s
Staging Smoke + Performance Full validation ~5min
Production Health check only Availability check ~10s

Coverage Requirements

  • Backend line coverage: >= 90% (enforced by pytest-cov); 96% unified (enforced by calculate_unified_coverage.py)
  • Branch coverage: Required (via --cov-branch)
  • See TDD Transformation Plan for details

No-Regression Coverage Gate

The CI workflow enforces a no-regression policy for test coverage, preventing silent coverage drops between main branch commits.

How It Works

  1. Baseline Storage: The coverage baseline is stored in unified-coverage.json at the repository root.
  2. Created automatically on the first successful CI run on main branch
  3. Updated automatically on each subsequent main branch push (if coverage changes)
  4. Committed by GitHub Actions bot with [skip ci] flag to prevent infinite loops

  5. Comparison Logic:

  6. Before calculating final coverage, the unified-coverage job reads unified-coverage.json if it exists
  7. Compares current coverage against baseline for all components: unified, backend, frontend, scripts
  8. Uses round(x, 2) for floating-point comparison (same precision as JSON output)
  9. Zero tolerance: If ANY component is below baseline (current < baseline), CI fails immediately
  10. If baseline file doesn't exist, skips comparison and falls through to COVERAGE_THRESHOLD check (safety net)

  11. Fail Conditions:

  12. Unified coverage drops below baseline: ❌ Unified coverage {current:.2f}% is below baseline {baseline:.2f}%
  13. Backend component drops below baseline: ❌ backend coverage {current:.2f}% is below baseline {baseline:.2f}%
  14. Frontend component drops below baseline: ❌ frontend coverage {current:.2f}% is below baseline {baseline:.2f}%
  15. Scripts component drops below baseline: ❌ scripts coverage {current:.2f}% is below baseline {baseline:.2f}%
  16. All components at or above baseline: ✅ No regression: all coverage at or above baseline

Manual Baseline Reset

If you need to manually reset the baseline (e.g., after major refactoring):

# Option 1: Update baseline to current state
git pull origin main
# Make your changes, then:
git add unified-coverage.json && git commit -m "chore: manually reset coverage baseline" && git push

# Option 2: Remove baseline temporarily
git rm unified-coverage.json && git commit -m "chore: remove coverage baseline for testing" && git push

Warning: Removing the baseline disables the no-regression gate until the next main branch push recreates it.

Environment Variables

  • BASELINE_FILE: Path to baseline JSON file (default: unified-coverage.json)
  • COVERAGE_THRESHOLD: Safety net threshold (default: 0, no minimum enforced; baseline comparison is the primary no-regression gate)

Test Coverage

Unit tests in scripts/tests/test_calculate_unified_coverage.py::TestBaselineComparison verify: - Equal coverage passes (no regression) - Coverage drops fail with clear error messages - Component-level drop detection (unified ok but individual component drops) - Baseline file path configurable via BASELINE_FILE

CI Job Structure

The GitHub Actions workflow (.github/workflows/ci.yml) follows this job dependency order:

lint → backend shards → frontend → unified-coverage → finish

Job Details

Job Purpose Runs On Dependencies
lint Static analysis (ruff check + format check) None (first job)
backend (Shards 1-4) Backend unit + integration tests needs: [lint]
frontend Frontend build + tests None (runs in parallel with backend)
unified-coverage Calculate unified coverage, compare to baseline, update Coveralls needs: [backend, frontend]
finish Aggregate all job results, fail if any job failed needs: [backend, frontend, lint, unified-coverage]

Key Changes (CI Coverage Improvements)

  1. Standalone Lint Job: Previously embedded in backend shard 1, now runs independently
  2. Fast failure: Lint failures surface in ~1 min instead of waiting for backend shard 1 to complete (~10 min)
  3. All backend shards depend on lint: needs: [lint]

  4. Coveralls Upload Fixes: All three Coveralls upload steps now have github-token authentication

  5. Prevents silent upload failures (badge stays current)
  6. continue-on-error: true preserved (Coveralls downtime ≠ CI failure)

  7. Baseline Auto-Update: On main branch pushes, the unified-coverage job automatically commits unified-coverage.json

  8. Only runs on github.ref == 'refs/heads/main' && github.event_name == 'push'
  9. Uses BASELINE_UPDATE_PAT secret for authentication
  10. Conditional commit: Only commits if baseline file changed (if ! git diff --staged --quiet)
  11. Commit message: [skip ci] prevents infinite loops

  12. Coverage Threshold Update: Raised from 40% to 80% (closer to actual unified coverage of ~87%)

  13. Baseline comparison is primary gate; threshold remains as safety net

Common Commands

# Local Development
moon run :dev -- --infra                    # Start containers (Postgres/Redis/MinIO)
moon run :dev -- --backend               # Start backend dev server
moon run :dev -- --frontend              # Start frontend dev server

# Local CI (matches GitHub CI exactly)
moon run :lint && moon run :test                       # Lint + Format + Test + Build

# Isolated testing (multiple repo copies)
BRANCH_NAME=feature-auth moon run :test
BRANCH_NAME=feature-auth WORKSPACE_ID=alice moon run :test

Database Lifecycle

Database Management (Python Context Manager)

The scripts/test_lifecycle.py script uses a Python Context Manager (@contextmanager) to robustly handle the database lifecycle:

  1. Setup: Checks for the container runtime (Podman/Docker), starts the postgres service via Docker Compose, and ensures the database is ready.
  2. Isolation: Creates a dedicated finance_report_test database and runs migrations.
  3. Teardown: Automatically stops the database container after tests complete, ensuring resources are freed.
  4. Signal Handling: Catches SIGINT (Ctrl+C) and SIGTERM to perform cleanup even if the test run is interrupted.

Local Test Isolation (Namespace-Based)

Purpose: Enable multiple repo copies (or branches) to run tests in parallel without conflicts.

How It Works:

  1. Namespace Generation (priority order):
  2. BRANCH_NAME (explicit) + WORKSPACE_ID (optional) → e.g., feature_auth_abc123
  3. Git branch + repo path hash → e.g., feature_payments_beeba6ed
  4. "default" (with warning if neither is set)

  5. Isolated Resources:

  6. Test Database: finance_report_test_{namespace}
  7. Worker Databases: finance_report_test_{namespace}_gw0, gw1, etc. (pytest-xdist)
  8. S3 Buckets: statements-{namespace}

  9. Usage Examples:

    # Explicit namespace (recommended for parallel development)
    BRANCH_NAME=feature-auth moon run :test
    
    # With workspace ID (multiple copies of same branch)
    BRANCH_NAME=feature-auth WORKSPACE_ID=alice moon run :test
    BRANCH_NAME=feature-auth WORKSPACE_ID=bob moon run :test
    
    # Auto-detect from git branch (adds repo path hash)
    moon run :test  # Uses current git branch
    

  10. Automatic Cleanup:

  11. Worker databases (_gw0, _gw1, etc.) are automatically cleaned up after test runs
  12. Prevents database pollution from parallel test execution
  13. See scripts/test_lifecycle.pycleanup_worker_databases()

Implementation Details: - Shared Podman containers (no port conflicts) - Namespace-aware database and bucket names only - See scripts/isolation_utils.py for namespace logic - Integration tests: apps/backend/tests/infra/test_isolation.py

Key Features

  1. Auto-detect runtime: podman compose / docker compose
  2. Lock file: ~/.cache/finance_report/db.lock
  3. Auto-cleanup: Last runner stops container

Isolation Utilities (scripts/isolation_utils.py)

Purpose: Support parallel test execution across multiple repo copies without resource conflicts.

Namespace Generation

The get_namespace() function generates a unique identifier for test resources based on:

# Priority 1: Explicit environment variables
BRANCH_NAME=feature-auth           # → "feature_auth"
BRANCH_NAME=feature-auth WORKSPACE_ID=alice  # → "feature_auth_alice"

# Priority 2: Git branch + repo path hash (auto-detect)
# On branch "feature-payments" at /path/to/repo
# → "feature_payments_beeba6ed"

# Priority 3: Fallback (with warning)
# → "default_abc12345"  # Includes repo path hash for isolation

Resource Naming Functions

Function Input Output Purpose
get_test_db_name(namespace) "feature_auth" "finance_report_test_feature_auth" Test database name
get_s3_bucket(namespace) "feature_auth" "statements-feature_auth" S3 bucket name
get_env_suffix(namespace) "feature_auth" "-feature_auth" Docker Compose suffix (future use)
sanitize_namespace(name) "feature/auth-v2" "feature_auth_v2" Convert branch names to safe identifiers

Integration Points

  1. scripts/test_lifecycle.py:
  2. Calls get_namespace() at test start
  3. Sets TEST_NAMESPACE environment variable
  4. Creates namespace-specific test database
  5. Overrides S3_BUCKET with namespace-aware bucket
  6. Cleans up worker databases (_gw0, _gw1, etc.) after tests

  7. apps/backend/tests/conftest.py:

  8. Reads TEST_NAMESPACE from environment
  9. Generates worker-specific database URLs:

    • Master: finance_report_test_{namespace}
    • Worker 0: finance_report_test_{namespace}_gw0
    • Worker 1: finance_report_test_{namespace}_gw1
    • etc.
  10. Contract Tests (apps/backend/tests/infra/test_isolation.py):

  11. 15 tests verifying isolation behavior
  12. Tests namespace generation, database naming, S3 buckets
  13. Verifies conftest integration

Practical Examples

Scenario 1: Single developer, multiple feature branches

# Terminal 1 (feature-auth branch)
cd ~/repos/finance_report
BRANCH_NAME=feature-auth moon run :test
# Uses: finance_report_test_feature_auth

# Terminal 2 (feature-payments branch)
cd ~/repos/finance_report
BRANCH_NAME=feature-payments moon run :test
# Uses: finance_report_test_feature_payments

Scenario 2: Multiple developers, same branch

# Alice's terminal
cd ~/repos/finance_report_alice
BRANCH_NAME=feature-auth WORKSPACE_ID=alice moon run :test
# Uses: finance_report_test_feature_auth_alice

# Bob's terminal
cd ~/repos/finance_report_bob
BRANCH_NAME=feature-auth WORKSPACE_ID=bob moon run :test
# Uses: finance_report_test_feature_auth_bob

Scenario 3: Auto-detection from git

cd ~/repos/finance_report
git checkout feature-payments
moon run :test
# Auto-detects: finance_report_test_feature_payments_<hash>
# Hash prevents collisions across different repo copies


Test Optimization

Test Modes

Mode Command Speed Coverage Use Case
Smart backend:test-smart ~40% Changed files 99% Daily dev (recommended)
Ultra-fast backend:test-no-cov ~30% None TDD red-green
Full backend:test 100% All files 94% CI/pre-commit

Implementation

Scripts: - scripts/get_changed_files.py - Detects changed Python files via git diff - scripts/smart_test.py - Runs all tests, coverage on changed files only - scripts/fast_test.py - Runs all tests, no coverage - scripts/test_lifecycle.py - DB lifecycle (accepts coverage flags from callers)

Key fixes from PR #260: - Removed hardcoded coverage flags from test_lifecycle.py - Aggregate all git changes (branch diff + uncommitted + staged) - Exclude deleted files, verify file existence

CI Optimization (.github/workflows/ci.yml): - 4-way parallel test sharding via pytest-split - Each shard: pytest --splits 4 --group N - Coverage reports merged post-run

[!IMPORTANT] Local CI vs GitHub CI Parallelism

Environment Parallelism Test Scope Resource Usage
GitHub CI -n auto + --splits 4 ~25% tests per shard Low (ephemeral runners)
Local CI -n 4 (fixed) 100% tests Controlled (shared machine)

GitHub CI uses -n auto because each shard only runs ~25% of tests on ephemeral runners. Local CI uses -n 4 to prevent resource exhaustion when running the full test suite. This is intentional design, not inconsistency.


Resource Lifecycle Management

All resources are bound to either dev server lifecycle (Ctrl+C) or test lifecycle (start/end).

Dev Server Lifecycle (scripts/dev_*.py)

┌─────────────────────────────────────────────────────────────────┐
│ User runs: moon run :dev -- --backend                                 │
│ ┌─────────┐    ┌─────────┐    ┌─────────┐                      │
│ │ Start   │ -> │ Server  │ -> │ Ctrl+C  │                      │
│ │ Stack   │    │ Runs    │    │ Cleanup │                      │
│ └─────────┘    └─────────┘    └─────────┘                      │
│      │                               │                          │
│  (DB,Redis,MinIO)              Stops: uvicorn (PID)             │
│                                       (Containers persist)      │
└─────────────────────────────────────────────────────────────────┘

Key safety feature: Scripts track processes by PID, only kill what THEY started. Safe for multi-window development - won't kill other sessions' processes.

Resources managed by dev scripts: | Script | Resources Started | Cleaned up on Ctrl+C | |--------|-------------------|---------------------| | dev_backend.py | uvicorn (PID), Full Stack Containers | ✓ uvicorn only (Containers stay for speed) | | dev_frontend.py | Next.js (PID tracked) | ✓ Only ours |

Test Lifecycle (scripts/test_lifecycle.py)

┌─────────────────────────────────────────────────────────────────┐
│ User runs: moon run :test                                │
│ ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│ │ Start   │ -> │ Create  │ -> │ pytest  │ -> │ Cleanup │      │
│ │ DB      │    │ test DB │    │ runs    │    │ (trap)  │      │
│ └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                                   │             │
│                          Stops: DB (if refcount=0), playwright  │
└─────────────────────────────────────────────────────────────────┘

Resources managed by test script: | Resource | Start | Stop | |----------|-------|------| | Test DB container | Before tests | After last test runner exits | | Playwright driver | By pytest | Cleanup on test end | | Child processes | By pytest | pkill -P $$ on exit |

Smoke Tests (scripts/smoke_test.sh)

Usage

# Local (after starting servers)
bash scripts/smoke_test.sh

# Against staging/prod
BASE_URL=https://report.zitian.party bash scripts/smoke_test.sh

Endpoints Tested

Endpoint Check
/ Homepage loads
/api/health Returns "healthy"
/api/docs Swagger UI loads
/ping-pong Demo page loads
/reconciliation Workbench loads
/api/ping Ping API responds

Deployment Architecture

Dual-Repository Model

Finance Report uses two git repositories for configuration:

Environment Configuration Source Purpose
Local/CI/PR /docker-compose.yml Development + PR previews
Staging/Production /repo/finance_report/.../compose.yaml Production with Vault secrets

The /repo/ directory is a git submodule pointing to infra2 (infrastructure repo).

Key implications: - Workflows build images and trigger deployments - Actual deployment config managed in infra2 - Env vars for staging/prod stored in HashiCorp Vault - Container names include env suffix (e.g., -staging)

Secret Injection Flow

Production deployments use Vault sidecar pattern:

1. Dokploy pulls compose.yaml from infra2
2. vault-agent sidecar starts → renders /secrets/.env
3. Backend waits for secrets (CHECKPOINT-1)
4. Alembic runs migrations (CHECKPOINT-2)
5. Uvicorn starts application (CHECKPOINT-3)

Health check timeout (6min) accounts for this entire flow.

Container Naming

Environment Backend Database
Local/CI finance-report-backend finance-report-db
PR #47 finance-report-backend-pr-47 finance-report-db-pr-47
Staging finance_report-backend-staging finance_report-postgres-staging
Production finance_report-backend finance_report-postgres

Note: Local uses hyphens (Compose), prod uses underscores (Dokploy).


CI Workflows

ci.yml (PR/push)

Trigger: PR or push to main
Steps:  install → lint → test
DB:     GitHub services (ephemeral)
Smoke:  ❌ Not run (unit tests only)
Note:   Uses moon tasks for install/lint/build (uv/npm invoked via moon)

Deployment Workflows

Helper scripts: scripts/dokploy_deploy.sh, scripts/health_check.sh

staging-deploy.yml

Trigger: Push to main (apps/** changed)
Flow: Build (commit SHA) → Deploy → Health (6min) → E2E tests
URL: https://report-staging.zitian.party

production-release.yml

Triggers:
  - Tag push (v*.*.*): Build release images
  - Manual dispatch: Deploy to production

Build job: Tag → Build backend + frontend → Push to GHCR
Deploy job: Verify images → Deploy → Health (4min) → Smoke test

URL: https://report.zitian.party

Version Release Workflow

Manual control for stable releases and cherry-picks:

# Create release tag
git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3
# → Triggers production-release.yml (build job)
# → Images: ghcr.io/.../finance_report-{backend,frontend}:v1.2.3

# Deploy to production (manual)
# → Actions → Production Release → Run workflow → Select v1.2.3

Hotfix flow:

git checkout -b hotfix/bug v1.2.3
git cherry-pick abc123
git tag -a v1.2.4 -m "Hotfix: critical bug"
git push origin v1.2.4
# → Build automatically, deploy manually

Deployment Failures

Symptom Cause Resolution
Stuck "Waiting for secrets" Vault token expired invoke vault.setup-tokens --project=finance_report
6min timeout Migration failed Check SigNoz for CHECKPOINT-2 errors
"Image not found" Tag not built git push origin v1.2.3 to trigger build
502 Bad Gateway Backend crashed Check CHECKPOINT-3 in SigNoz logs

Vault Token Lifecycle

Staging and production deployments use HashiCorp Vault for secrets management. The vault-agent sidecar renders secrets to /secrets/.env using an app token.

Token Properties

Property Value
Token TTL 768 hours (~32 days)
Secrets file path /secrets/.env
Staleness threshold 1 hour (bootloader warning)

Check Token Status

# SSH into VPS
ssh root@$VPS_HOST

# Check vault-agent logs for token issues
docker logs finance_report-vault-agent-staging 2>&1 | tail -20

# Check if secrets file exists and when it was last modified
docker exec finance_report-backend-staging ls -la /secrets/.env

Regenerate Tokens

When a token expires, the vault-agent cannot refresh secrets, causing the backend to hang at "Waiting for secrets".

# From local machine with infra2 repo
cd /path/to/infra2

# Regenerate tokens (requires Vault root access)
invoke vault.setup-tokens --project=finance_report

# Restart vault-agent to pick up new token
ssh root@$VPS_HOST "docker restart finance_report-vault-agent-staging"

Monitoring (Bootloader Check)

The bootloader includes a _check_vault_secrets() method that runs in FULL mode:

  1. Missing secrets file: Warning with regeneration instructions
  2. Stale secrets file (>1 hour old): Warning that vault-agent may have stopped
  3. Fresh secrets file: OK status with last modified time

This check runs during smoke tests (bash scripts/smoke_test.sh) and provides early warning of token issues.


Deployment Architecture

Environment Flow

┌─────────────────────────────────────────────────────────────────┐
│ Development Flow                                                 │
│                                                                  │
│   Local Dev        PR/Branch           Staging         Prod     │
│   ┌─────────┐      ┌─────────┐      ┌─────────┐    ┌─────────┐ │
│   │ docker  │  →   │ CI test │  →   │ Auto on │ →  │ Manual  │ │
│   │ compose │      │ + PR    │      │ main    │    │ tag +   │ │
│   │         │      │ preview │      │ merge   │    │ dispatch│ │
│   └─────────┘      └─────────┘      └─────────┘    └─────────┘ │
│                                                                  │
│   docker-compose   pr-test.yml      staging-       production-  │
│   .yml             ci.yml           deploy.yml     release.yml  │
└─────────────────────────────────────────────────────────────────┘

Database Migrations Testing Strategy

Migrations are tested at multiple stages:

  1. Local Development: Manual testing with alembic upgrade head before committing
  2. GitHub CI: pytest validates model definitions and constraints
  3. Staging Deployment: First automated test of migrations via entrypoint
  4. Production Deployment: Only after staging validation

Before deploying schema changes: - Test locally: cd apps/backend && alembic upgrade head - Ensure backward-compatible migrations (for rollback) - Consider: existing data, indexes, constraints

Staging Deployment (Automatic)

Staging deploys automatically when: 1. Push to main branch 2. Changes in apps/backend/** or apps/frontend/**

The workflow (staging-deploy.yml): 1. Builds images with commit SHA tag 2. Pushes to GHCR 3. Deploys to Dokploy staging 4. Runs health check + E2E tests

Production Deployment (Manual)

Production deployment is a two-step process:

  1. Build: Create a git tag (triggers production-release.yml)

    git tag -a v1.2.3 -m "Release v1.2.3"
    git push origin v1.2.3
    # → Builds images: ghcr.io/.../finance_report-{backend,frontend}:v1.2.3
    

  2. Deploy: Manual workflow dispatch

    # Via GitHub Actions UI:
    # Actions → "Production Release" → Run workflow → Select tag
    
    # Or via gh CLI:
    gh workflow run production-release.yml
    

The deploy job: 1. Verifies images exist in GHCR 2. Deploys to Dokploy production 3. Runs health check + smoke tests

Database Migrations

Migrations run automatically on container startup via the entrypoint:

# In infra2 compose.yaml
entrypoint:
  - sh
  - -c
  - |
    cd /app && export PYTHONPATH=/app
    # Wait for secrets...
    alembic upgrade head  # ← Runs migrations
    exec uvicorn src.main:app --host 0.0.0.0 --port 8000

Important: Before deploying schema changes: 1. Test migration locally with docker-compose.yml 2. Ensure migration is backward-compatible (for rollback) 3. Consider: existing data, indexes, constraints


Environment Variables

Scenario DATABASE_URL Hostname Strategy
Local Dev postgresql+asyncpg://... localhost or postgres
Local Test postgresql+asyncpg://... localhost:5432
PR Test postgresql+asyncpg://... Unique: finance-report-db-pr-XX
CI Same as Local Test localhost:5433 (services)
Staging/Prod External PostgreSQL Dokploy Managed

Verification

# Verify moon commands work
moon run :test

# Test smoke tests locally
nohup moon run :dev -- --backend > /dev/null 2>&1 &
sleep 10
export BASE_URL="http://localhost:8000"
bash scripts/smoke_test.sh

# Check no orphan containers after tests
podman ps | grep finance_report

Resource Cleanup

Install the post-push hook to automatically clean orphaned test databases after every git push:

./scripts/install_git_hooks.sh

This hook safely removes: - ✅ Test databases from interrupted test runs (e.g., Ctrl+C) - ✅ Worker databases left by pytest-xdist crashes - ❌ Does NOT touch development data or running tests

Manual Cleanup

Clean Orphaned Test Databases

After interrupted test runs (Ctrl+C, SIGKILL, OOM):

# Preview what would be deleted
python scripts/cleanup_orphaned_dbs.py --dry-run

# Clean orphaned databases only (safe)
python scripts/cleanup_orphaned_dbs.py

# Clean ALL test databases (use with caution)
python scripts/cleanup_orphaned_dbs.py --all

Clean All Development Resources

⚠️ WARNING: This deletes ALL local data!

# Clean containers and locks only (safe)
./scripts/cleanup_dev_resources.sh

# Clean EVERYTHING including volumes and MinIO data (data loss!)
./scripts/cleanup_dev_resources.sh --all

Monitor Resource Leaks

Run weekly to detect accumulated leaks across all 6 environments:

# Quick check
./scripts/check_resource_leaks.sh

# Detailed report with listings
./scripts/check_resource_leaks.sh --verbose

# Include VPS PR volume check (requires SSH access)
VPS_HOST=cloud.zitian.party ./scripts/check_resource_leaks.sh

The monitoring script checks: 1. Local Worker Databases - Orphaned _gw* databases 2. Local Docker Volumes - Total count and size 3. Local MinIO Data - Storage usage 4. VPS PR Volumes - Orphaned -pr-* volumes (requires SSH) 5. GHCR PR Images - Orphaned pr-* tags (requires gh CLI) 6. Cache Files - Active namespace tracker state

PR Preview Cleanup (Automated)

When a PR is closed, GitHub Actions automatically cleans: - ✅ Dokploy stack on VPS - ✅ Docker volumes (postgres_data, redis_data, minio_data) - ✅ GHCR container images (backend:pr-{number}, frontend:pr-{number})

To enable VPS volume cleanup, add the VPS_SSH_KEY secret to your GitHub repository:

# Generate SSH key pair (if not exists)
ssh-keygen -t ed25519 -f ~/.ssh/finance_report_vps -N ""

# Add public key to VPS
ssh-copy-id -i ~/.ssh/finance_report_vps.pub root@cloud.zitian.party

# Add private key to GitHub Secrets
# Settings → Secrets → Actions → New repository secret
# Name: VPS_SSH_KEY
# Value: (paste content of ~/.ssh/finance_report_vps)


Engineering Standards

Environment Variable Lifecycle

Variables follow a strict "Bake vs. Runtime" flow:

flowchart TD
    Start[I need a new Env Var] --> Type{Is it for?}

    Type -->|Frontend| Front[Next.js Public]
    Type -->|Backend| Back[FastAPI Runtime]
    Type -->|Secret| Secret[Production Secret]

    Front --> F1[Add to .env.example]
    F1 --> F2[Add to Dockerfile ARG]
    F2 --> F3[Add to docker-compose.yml args]
    F3 --> F4[Use NEXT_PUBLIC_ prefix]

    Back --> B1[Add to .env.example]
    B1 --> B2[Add to apps/backend/src/config.py]
    B2 --> B3[Set default value in config.py]

    Secret --> S1[Add to secrets.ctmpl]
    S1 --> S2[Add to config.py]
    S2 --> S3[Add to .env.example]

    style Start fill:#f9f,stroke:#333,stroke-width:2px
    style Front fill:#e1f5fe
    style Back fill:#e8f5e9
    style Secret fill:#ffebee
  1. Frontend (Next.js):
    • Variables prefixed with NEXT_PUBLIC_ are "baked" into the static JS bundle during npm run build.
    • Requirement: These must be defined as ARG in apps/frontend/Dockerfile. See: apps/frontend/Dockerfile
    • Requirement: They must also be passed in docker-compose.yml under args.
  2. Backend (FastAPI):
    • Variables are loaded at runtime via Pydantic Settings.
    • Requirement: All variables must have a type and default in apps/backend/src/config.py. See: apps/backend/src/config.py
    • Requirement: Must be documented in .env.example. See: .env.example
  3. Production (Vault):
    • Secrets are stored in Vault and rendered by vault-agent using secrets.ctmpl.
    • Consistency: CI runs scripts/check_env_keys.py to ensure secrets.ctmpl, config.py, and .env.example are aligned.

Cross-Repo Synchronization

The repo/ directory is a submodule pointing to infra2.

  • Logic: Main Repo (finance_report).
  • Infrastructure: Submodule (infra2).
  • Workflow:
    1. If a change requires a new environment variable or a change to docker-compose.yml labels/configs for production:
    2. Create a branch in repo/.
    3. Commit changes to repo/finance_report/finance_report/10.app/.
    4. Push and create a PR in infra2.
    5. Once merged, update the submodule pointer in the Main Repo PR.

CI Performance & Test Strategy

Current Metrics (2025-02-02)

CI Pipeline: - Total duration: 6m 24s (Backend: 5m 52s, Frontend: 1m 30s) - Test execution: 893 tests in 4m 47s (320ms avg - excellent) - Caching: UV ✅ (2.9s), Next.js ✅, venv ✅

Test Coverage: - Overall: 94.51% (exceeds 94% requirement) - Critical gap: Service layer 28% (high production risk) - reporting.py: 7%, reconciliation.py: 9%, review_queue.py: 9% - Root cause: Tests focus on happy paths via routers, not direct service calls

Test Organization: - SSOT-aligned structure (73 files across domains) - pytest-xdist with 2 workers (1.7x speedup) - Minimal slow tests (only 1 marked)

Backend Test Parallelization

Current setup (pytest-xdist):

# In pyproject.toml
[tool.pytest.ini_options]
addopts = "-n 2"  # 2 workers for parallel execution

Further parallelization options:

  1. Increase worker count (limited by CPU cores)
    pytest -n auto  # Auto-detect CPU count
    pytest -n 4     # 4 workers (if 4+ cores available)
    
  2. Current: 2 workers (~1.7x speedup)
  3. Potential: 4 workers on 4-core CI runners (~2.5-3x speedup)
  4. Diminishing returns beyond CPU count

  5. Split tests in CI (GitHub Actions parallel jobs)

    backend-unit:  # Fast unit tests (~2m)
      run: pytest tests/infra/ tests/api/ tests/auth/
    
    backend-integration:  # Slower integration tests (~3m)
      run: pytest tests/accounting/ tests/reconciliation/
    

  6. Benefit: True parallel execution (not limited by single runner)
  7. Drawback: Need separate DB instances per job

  8. Current bottlenecks:

  9. DB setup: ~35s (unavoidable for integration tests)
  10. Large test files: Some tests >1s (acceptable for integration)
  11. Coverage calculation: ~3s (minimal impact)

Recommendation: - Quick win: Increase to -n 4 workers (~30s savings) - Long-term: Split into 2-3 parallel CI jobs (~1-2m savings)

Priority Actions

  1. Address service coverage gaps 🔴 (highest production risk)
  2. Add error path tests for core services
  3. Target: 80% service coverage (from 28%)

  4. Increase pytest workers ⚡ (quick win)

  5. Change -n 2 to -n 4 in pyproject.toml

  6. Split CI jobs 🟡 (medium effort, 1-2m savings)

  7. Separate unit vs integration tests in GitHub Actions