EPIC-012: Foundation Libraries Enhancement
Goal: Strengthen core infrastructure libraries to production-grade quality with proper observability, transaction management, and developer experience.
Status: 🟡 In Progress
Vision Anchor: decision-7-tech-stack
Priority: P1 (Infrastructure Debt)
Estimated Duration: 2-3 weeks
Dependencies: None (cross-cutting infrastructure)
Overview
This EPIC addresses technical debt in the foundational libraries that all modules depend on. A comprehensive audit identified gaps in:
- Observability - Tracing, log-trace correlation, metrics
- Database - Transaction boundaries, connection pooling
- Error Handling - Unified exception hierarchy
- Rate Limiting - API-wide protection
- Developer Experience - Debugging tools, schema consistency
🎯 Success Criteria
Must Have (P0)
- [x] Distributed tracing with `trace_id` in all logs
- [ ] Service layer uses `flush()`, router layer owns `commit()`
  - See: `docs/ssot/accounting.md#async-tx-boundary`
- [x] Connection pool size configurable via environment
Should Have (P1)
- [x] Unified `BaseAppException` with error IDs
- [x] API-wide rate limiting (not just auth endpoints)
- [~] Metrics endpoint: deferred, because the project uses SigNoz OTLP rather than Prometheus pull scraping (see EPIC-010)
Nice to Have (P2)
- [ ] UUID auto-serialization structlog processor
Affected Components
| Component | File(s) | Changes |
|---|---|---|
| Logging | `src/logger.py` | Add tracing, trace_id processor |
| Database | `src/database.py`, `src/config.py` | Pool config, transaction patterns |
| Exceptions | `src/utils/exceptions.py` | BaseAppException class |
| Rate Limiting | `src/rate_limit.py` | Global API limiter |
| Debugging | `scripts/debug.py` | SigNoz API integration |
| Schemas | `src/schemas/*.py` | Consistent BaseResponse inheritance |
🔴 High Priority Issues
H1: Distributed Tracing Missing
Problem: No opentelemetry-instrumentation-* packages installed. Logs lack trace_id/span_id, making it impossible to correlate logs with traces in SigNoz.
Solution:
1. Add OTEL instrumentation packages to pyproject.toml
2. Initialize TracerProvider in logger.py
3. Add structlog processor to inject trace context
4. Auto-instrument FastAPI, SQLAlchemy, and HTTPX
Status: ✅ Complete (PR pending)
Tracking: #181
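Step 3 of the solution above (a structlog processor that injects trace context) could look roughly like the sketch below. The function name `add_trace_context` is inferred from the test names in this EPIC and may not match the real implementation in `src/logger.py`. The sketch assumes the standard structlog processor signature and the OpenTelemetry Python API, and it degrades to a no-op when `opentelemetry` is not installed.

```python
from typing import Any, MutableMapping


def add_trace_context(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """structlog-style processor: attach trace_id/span_id of the active span."""
    try:
        # Imported lazily so logging still works without the OTEL packages.
        from opentelemetry import trace
    except ImportError:
        return event_dict

    span_context = trace.get_current_span().get_span_context()
    if span_context.is_valid:
        # Hex widths follow the W3C trace context format:
        # 32 hex characters for trace_id, 16 for span_id.
        event_dict["trace_id"] = format(span_context.trace_id, "032x")
        event_dict["span_id"] = format(span_context.span_id, "016x")
    return event_dict
```

Outside an active span the span context is invalid, so the event dict passes through unchanged. This is the behaviour AC12.11.2 describes.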
🧪 Test Cases
Test Organization: Tests are organized by feature block using AC12.x.y numbering. Coverage: see `apps/backend/tests/infra/`.
AC12.1: Logging - OTEL Endpoint Configuration
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.1.1 | OTEL logs endpoint adds suffix /v1/logs | test_build_otlp_logs_endpoint_adds_suffix() | `infra/test_logger.py` | P1 |
| AC12.1.2 | OTEL logs endpoint preserves logs path with /v1/logs | test_build_otlp_logs_endpoint_preserves_logs_path() | `infra/test_logger.py` | P1 |
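The two cases above can be illustrated with a small helper. This is a hypothetical sketch, not the code in `src/logger.py`: the function name is inferred from the test names, and the trailing-slash handling and example host are assumptions.

```python
def build_otlp_logs_endpoint(base_url: str, suffix: str = "/v1/logs") -> str:
    """Return an OTLP/HTTP logs URL, appending the suffix only when missing."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith(suffix):
        # Already a full logs endpoint: keep the path as given (AC12.1.2).
        return trimmed
    # Bare collector URL: add the logs path (AC12.1.1).
    return f"{trimmed}{suffix}"
```

The same pattern would apply to the `/v1/traces` path covered by AC12.12.3.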
AC12.2: Logging - Renderer Selection
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.2.1 | Debug mode uses ConsoleRenderer | test_select_renderer_uses_console_in_debug() | `infra/test_logger.py` | P0 |
| AC12.2.2 | Production mode uses JSONRenderer | test_select_renderer_uses_json_in_production() | `infra/test_logger.py` | P0 |
AC12.3: Logging - OTEL Missing Dependency / No Endpoint
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.3.1 | OTEL logging not available logs warning | test_configure_otel_logging_missing_dependency_warns() | `infra/test_logger.py` | P0 |
| AC12.3.2 | OTEL tracing not available logs warning | test_configure_otel_tracing_missing_dependency_warns() | `infra/test_logger.py` | P0 |
| AC12.3.3 | OTEL logging with no endpoint skips setup | test_configure_otel_logging_no_endpoint() | `infra/test_logger.py` | P0 |
AC12.4: Logging - OTEL with Fake Exporter
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.4.1 | OTEL configuration sets up TracerProvider correctly | test_configure_otel_tracing_with_fake_exporter() | `infra/test_logger.py` | P0 |
AC12.5: Logging - OTEL Resource Configuration
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.5.1 | OTEL resource created with correct attributes | test_build_otel_resource() | `infra/test_logger.py` | P0 |
AC12.6: Logging - Timing Utilities
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.6.1 | Sync log_timing logs operation with timing | test_log_timing_basic() | `infra/test_logger.py` | P0 |
| AC12.6.2 | Sync log_timing includes additional context | test_log_timing_with_context() | `infra/test_logger.py` | P0 |
| AC12.6.3 | log_timing yields mutable dict | test_log_timing_yields_mutable_dict() | `infra/test_logger.py` | P0 |
| AC12.6.4 | log_timing with custom level | test_log_timing_with_custom_level() | `infra/test_logger.py` | P0 |
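As an illustration of what these cases exercise, a timing context manager that yields a mutable dict might be sketched as follows. This is a stdlib-only approximation: the real `log_timing` uses structlog, and the parameter names, the `duration_ms` field, and the logger name here are assumptions.

```python
import logging
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator

logger = logging.getLogger("timing_sketch")  # hypothetical logger name


@contextmanager
def log_timing(
    operation: str, level: int = logging.INFO, **context: Any
) -> Iterator[Dict[str, Any]]:
    """Time a block and log its duration along with caller-supplied context."""
    fields: Dict[str, Any] = dict(context)
    start = time.perf_counter()
    try:
        # The caller may add fields (e.g. row counts) while the block runs.
        yield fields
    finally:
        fields["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.log(level, "%s finished", operation, extra={"timing": fields})
```

Yielding the dict lets a caller record results discovered during the operation, for example `with log_timing("import", source="csv") as t: t["rows"] = 3`.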
AC12.7: Logging - External API Logging
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.7.1 | Sync external API call logs success | test_log_external_api_sync_success() | `infra/test_logger.py` | P0 |
| AC12.7.2 | Sync external API call logs failure | test_log_external_api_sync_failure() | `infra/test_logger.py` | P0 |
| AC12.7.3 | Async external API call logs success | test_log_external_api_async_success() | `infra/test_logger.py` | P0 |
| AC12.7.4 | Async external API call logs failure | test_log_external_api_async_failure() | `infra/test_logger.py` | P0 |
| AC12.7.5 | Sync external API with log_args=True logs args count | test_log_external_api_with_log_args() | `infra/test_logger.py` | P0 |
AC12.8: Logging - Exception Logging
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.8.1 | Log exception logs error with context | test_log_exception_basic() | `infra/test_logger.py` | P0 |
| AC12.8.2 | Log exception includes extra context fields | test_log_exception_with_extra_context() | `infra/test_logger.py` | P0 |
| AC12.8.3 | Log exception without traceback | test_log_exception_without_traceback() | `infra/test_logger.py` | P0 |
| AC12.8.4 | Log exception with custom level | test_log_exception_custom_level() | `infra/test_logger.py` | P0 |
AC12.10: Logging - Build Processors
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.10.1 | Build processors returns list | test_build_processors_returns_list() | `infra/test_logger.py` | P0 |
AC12.11: Logging - Trace Context
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.11.1 | Trace context injects trace_id and span_id when span is valid | test_add_trace_context_with_valid_span() | `infra/test_logger.py` | P0 |
| AC12.11.2 | Trace context skips injection when span context is invalid | test_add_trace_context_with_invalid_span() | `infra/test_logger.py` | P0 |
| AC12.11.3 | Trace context handles missing opentelemetry gracefully | test_add_trace_context_handles_import_error() | `infra/test_logger.py` | P0 |
AC12.12: Logging - OTEL Tracing Configuration
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.12.1 | OTEL tracing skips setup when no endpoint configured | test_configure_otel_tracing_no_endpoint() | `infra/test_logger.py` | P0 |
| AC12.12.2 | TracerProvider created and resource attributes set | test_configure_otel_tracing_with_fake_exporter() | `infra/test_logger.py` | P0 |
| AC12.12.3 | Traces path appends /v1/traces | test_configure_otel_tracing_appends_traces_path() | `infra/test_logger.py` | P0 |
AC12.15: Logging - Configuration Basics
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.15.1 | Configure logging in debug mode | test_configure_logging_basic() | `infra/test_logger.py` | P0 |
| AC12.15.2 | Configure logging in production mode | test_configure_logging_production_mode() | `infra/test_logger.py` | P0 |
AC12.16: Logging - Async Timing
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.16.1 | Async log_timing logs operation with timing | test_async_log_timing_basic() | `infra/test_logger.py` | P0 |
| AC12.16.2 | Async log_timing includes additional context | test_async_log_timing_with_context() | `infra/test_logger.py` | P0 |
AC12.17: Logging - External API Async with Args
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.17.1 | External API async with log_args=True logs args count | test_log_external_api_async_with_log_args() | `infra/test_logger.py` | P0 |
| AC12.17.2 | External API async failure with log_args=True logs args | test_log_external_api_async_failure_with_log_args() | `infra/test_logger.py` | P0 |
AC12.18: Logging - Configuration - Environment Variables
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.18.1 | Ensure PRIMARY_MODEL follows expected pattern | test_primary_model_format() | `infra/test_config_contract.py` | P0 |
| AC12.18.2 | Ensure config.py default matches .env.example documentation | test_config_sync_with_env_example() | `infra/test_config_contract.py` | P0 |
| AC12.18.3 | Ensure BASE_CURRENCY is valid ISO 4217 currency code | test_base_currency_format() | `infra/test_config_contract.py` | P0 |
| AC12.18.4 | Ensure S3_BUCKET follows naming conventions | test_s3_bucket_format() | `infra/test_config_contract.py` | P0 |
| AC12.18.5 | Ensure JWT_ALGORITHM is secure algorithm | test_jwt_algorithm_allowed() | `infra/test_config_contract.py` | P0 |
| AC12.18.6 | Ensure DATABASE_URL follows expected format | test_database_url_format() | `infra/test_config_contract.py` | P0 |
| AC12.18.7 | stub | - | - | - |
AC12.19: Infrastructure - Epic 001 Contracts
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.19.1 | Moon workspace configuration files exist | test_epic_001_moon_workspace_configs_exist() | `infra/test_epic_001_contracts.py` | P0 |
AC12.20: Database - Connection Pool Configuration
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.20.1 | DB_POOL_SIZE config field exists with default | test_db_pool_size_config_default() | `infra/test_config_contract.py` | P1 |
| AC12.20.2 | DB_MAX_OVERFLOW config field exists with default | test_db_max_overflow_config_default() | `infra/test_config_contract.py` | P1 |
| AC12.20.3 | Pool config is positive integer | test_db_pool_config_positive_integer() | `infra/test_config_contract.py` | P1 |
AC12.21: Exceptions - BaseAppException
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.21.1 | BaseAppException has error_id attribute | test_base_app_exception_has_error_id() | `infra/test_exceptions.py` | P1 |
| AC12.21.2 | BaseAppException has status_code attribute | test_base_app_exception_has_status_code() | `infra/test_exceptions.py` | P1 |
| AC12.21.3 | BaseAppException is subclass of Exception | test_base_app_exception_is_exception() | `infra/test_exceptions.py` | P1 |
| AC12.21.4 | BaseAppException can be raised and caught | test_base_app_exception_raise_and_catch() | `infra/test_exceptions.py` | P1 |
Test Coverage Summary:
- Total AC IDs: 49
- Requirements converted to AC IDs: 100% (EPIC-012 infrastructure work)
- Requirements with test references: 100%
- Test files: 4 (test_logger.py, test_config_contract.py, test_epic_001_contracts.py, test_exceptions.py)
- Overall coverage: Logging, config infrastructure, pool config, and exception hierarchy verified
Acceptance Criteria
ℹ️ Non-contiguous AC numbering: Gaps in `AC12.x.y` numbers within `docs/infra_registry.yaml` reflect deprecated or merged ACs preserved for historical traceability (e.g., AC12.24.1-3 retained as ~~strikethrough~~). Do not renumber. New ACs append to the next available index in the relevant feature block.
H2: Transaction Boundaries
Problem: Services call db.commit() directly, making it impossible to compose multiple service calls into a single atomic transaction.
Solution:
1. Change services to use db.flush() for getting IDs
2. Move commit() responsibility to routers
3. Consider @transactional decorator for complex cases
Tracking: #182
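The flush/commit split in the solution above can be sketched without a database. `RecordingSession` below is a hypothetical stand-in that only records which session methods were called. The service and endpoint function names are illustrative and not taken from the codebase.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RecordingSession:
    """Stand-in for a DB session: records calls instead of running SQL."""

    calls: List[str] = field(default_factory=list)

    def add(self, obj: Any) -> None:
        self.calls.append("add")

    def flush(self) -> None:
        self.calls.append("flush")

    def commit(self) -> None:
        self.calls.append("commit")

    def rollback(self) -> None:
        self.calls.append("rollback")


def create_account(db: RecordingSession, name: str) -> Dict[str, Any]:
    """Service: flush so generated IDs are available, but never commit."""
    account = {"name": name}
    db.add(account)
    db.flush()
    return account


def create_opening_entry(
    db: RecordingSession, account: Dict[str, Any], amount: int
) -> Dict[str, Any]:
    """Service: same rule, so it composes with create_account."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    entry = {"account": account["name"], "amount": amount}
    db.add(entry)
    db.flush()
    return entry


def open_account_endpoint(db: RecordingSession, name: str, amount: int) -> Dict[str, Any]:
    """Router: owns the transaction and commits once, or rolls back on error."""
    try:
        account = create_account(db, name)
        create_opening_entry(db, account, amount)
        db.commit()
        return account
    except Exception:
        db.rollback()
        raise
```

Because neither service commits, a failure in the second call leaves nothing persisted from the first. Direct `db.commit()` calls inside services cannot give that guarantee.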
🟡 Medium Priority Issues
M1: Connection Pool Configuration
Problem: The engine relies on SQLAlchemy's default connection pool settings (`pool_size=5`, `max_overflow=10`), which production workloads may need to tune.
Solution: Add DB_POOL_SIZE and DB_MAX_OVERFLOW to config.py
Tracking: #184
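A minimal sketch of environment-driven pool settings follows. The project uses Pydantic Settings in `src/config.py`. This version uses only the standard library, and `PoolSettings` and `load_pool_settings` are hypothetical names. The defaults mirror SQLAlchemy's own `QueuePool` defaults, and the positive-integer check follows AC12.20.3.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PoolSettings:
    pool_size: int = 5
    max_overflow: int = 10


def _positive_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer setting, rejecting zero, negatives, and non-numbers."""
    raw = env.get(name)
    if raw is None:
        return default
    value = int(raw)  # raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {raw!r}")
    return value


def load_pool_settings(env: Optional[Mapping[str, str]] = None) -> PoolSettings:
    """Build pool settings from DB_POOL_SIZE and DB_MAX_OVERFLOW."""
    source = os.environ if env is None else env
    return PoolSettings(
        pool_size=_positive_int(source, "DB_POOL_SIZE", 5),
        max_overflow=_positive_int(source, "DB_MAX_OVERFLOW", 10),
    )
```

The resulting values would be passed to SQLAlchemy's `create_engine` through its `pool_size` and `max_overflow` keyword arguments.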
M2: Unified Exception Hierarchy
Problem: Each service defines its own exception class. No unified BaseAppException with error IDs for frontend consumption.
Solution: Create base exception with error_id field, migrate services
Tracking: #185
M3: API-Wide Rate Limiting
Problem: Rate limiting only protects /auth/* endpoints; all other endpoints are unprotected.
Solution: Add configurable global rate limiter middleware
Tracking: #186
M4: Metrics Endpoint
Problem: ~~No /metrics endpoint for Prometheus.~~ The architecture uses SigNoz OTLP, not Prometheus pull.
Solution: ~~Add prometheus-fastapi-instrumentator~~ Deferred (SigNoz OTLP is the observability path)
Tracking: #187
⚠️ Deferred: This project uses SigNoz via OTLP (see EPIC-010) for observability. A Prometheus pull-based `/metrics` endpoint has zero consumers in this architecture. Metrics via OTLP to SigNoz is a future task tracked separately.
🟢 Low Priority Issues
L3: UUID Auto-Serialization
Problem: Must manually wrap UUIDs with str() in logger calls.
Solution: Add structlog processor to auto-convert UUIDs
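A processor along these lines might look like the sketch below. It follows the standard structlog processor signature, but the name `stringify_uuids` is hypothetical, and it converts only top-level values in the event dict.

```python
from typing import Any, MutableMapping
from uuid import UUID


def stringify_uuids(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """structlog-style processor: replace UUID values with their string form."""
    # Iterate over a snapshot so the mapping can be updated safely.
    for key, value in list(event_dict.items()):
        if isinstance(value, UUID):
            event_dict[key] = str(value)
    return event_dict
```

Placed before the renderer in the processor chain, this would let callers pass `user_id=user.id` directly instead of `user_id=str(user.id)`.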
L4: Schema Inheritance Consistency
Problem: Not all response schemas inherit from BaseResponse.
Solution: Audit and fix schema inheritance
AC12.22: Schemas - Move Inline Schemas to Dedicated Modules
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.22.1 | Move 6 inline schemas from statements router to review module | N/A (mechanical) | N/A | P0 |
| AC12.22.2 | Extract background task schemas from inline/background definitions into dedicated modules | N/A (mechanical) | N/A | P0 |
AC12.23: Rate Limiting - Global API Middleware (M3)
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.23.1 | Global rate limit middleware exempts /health | test_global_rate_limit_middleware_exempts_health() | `infra/test_rate_limit.py` | P1 |
| AC12.23.2 | Global rate limit middleware returns 429 after limit exceeded | test_global_rate_limit_middleware_blocks_after_limit() | `infra/test_rate_limit.py` | P1 |
| AC12.23.3 | Global rate limit middleware allows normal requests | test_global_rate_limit_middleware_allows_normal_requests() | `infra/test_rate_limit.py` | P1 |
| AC12.23.4 | Global rate limit middleware exempts /docs | test_global_rate_limit_middleware_exempts_docs() | `infra/test_rate_limit.py` | P1 |
AC12.24: Metrics - Prometheus Endpoint (M4)
| ID | Test Case | Test Function | File | Priority |
|---|---|---|---|---|
| AC12.24.1 | ~~/metrics endpoint returns 200 OK~~ | Removed | Deferred: SigNoz OTLP path, no Prometheus scrape config | P1 |
| AC12.24.2 | ~~/metrics endpoint returns text/plain~~ | Removed | Deferred: SigNoz OTLP path | P1 |
| AC12.24.3 | ~~/metrics response contains Prometheus data~~ | Removed | Deferred: SigNoz OTLP path | P1 |
Progress Tracking
| Phase | Task | Status | PR |
|---|---|---|---|
| 0 | Audit & Documentation | ✅ Complete | This EPIC |
| 1 | Distributed Tracing (H1) | ✅ Complete | Pending |
| 2 | Transaction Boundaries (H2) | ⏳ Pending | - |
| 3 | Connection Pool Config (M1) | ✅ Complete | This PR |
| 4 | Exception Hierarchy (M2) | ✅ Complete | This PR |
| 5 | Rate Limiting (M3) | ✅ Complete | This PR |
| 6 | Metrics Endpoint (M4) | Deferred | Removed: SigNoz OTLP used instead of Prometheus pull |
Related Documents
Audit Summary
Current Foundation Libraries
| Library | File | Status |
|---|---|---|
| Logging | `src/logger.py` | ✅ Structlog + OTEL export |
| Config | `src/config.py` | ✅ Pydantic Settings |
| Database | `src/database.py` | ⚠️ Needs pool config |
| Storage | `src/services/storage.py` | ✅ S3/MinIO abstraction |
| Rate Limit | `src/rate_limit.py` | ⚠️ Auth-only |
| Dependencies | `src/deps.py` | ✅ DbSession, CurrentUserId |
| Boot | `src/boot.py` | ✅ Health checks |
| Debug | `scripts/debug.py` | ⚠️ Needs SigNoz API |
| Error IDs | `src/constants/error_ids.py` | ✅ Centralized constants |
Frontend Foundation
| Library | File | Status |
|---|---|---|
| API Client | `lib/api.ts` | ✅ Unified fetch wrapper |
| Auth | `lib/auth.ts` | ✅ Token management |
| Currency | `lib/currency.ts` | ✅ Decimal.js |
| Workspace | `hooks/useWorkspace.tsx` | ✅ Tab/sidebar state |
Created: January 2026
Last Updated: January 2026