EPIC-018: AI-Driven Data Pipeline

Status: 🟡 In Progress
Vision Anchor: decision-2-event-middle-layer
Phase: 4 (AI Enhancement)
Duration: 4-7 weeks
Priority: P1 (High Priority - Parallel with EPIC-016)
Dependencies: EPIC-003 (Statement Parsing), EPIC-004 (Reconciliation Engine), EPIC-006 (AI Advisor), EPIC-013 (Statement Parsing V2)


🎯 Objective

Maximize AI utilization across the entire data pipeline, from statement upload to financial reports. Currently, AI is used in only 2 of 7 pipeline stages (extraction and chat advisor). This EPIC extends AI into classification, reconciliation, journal entry creation, and feedback learning, moving the pipeline from "AI extracts, human does everything else" to "AI handles what it can do confidently, human reviews what it can't."

Core Principle (from vision.md): AI is a parsing and explanation layer, not a source of record. Confidence thresholds determine auto-accept vs. human review.

Current Pipeline (Before):

Upload → [AI Vision] → BankStatement → [Rules Only] → Classification
  → [Hardcoded Uncategorized] → JournalEntry → [Bypass Layer 3] → Reports
  → [Read-Only AI] → Chat Insights

Target Pipeline (After):

Upload → [AI Vision + Category] → BankStatement → [AI + Rules Hybrid] → Classification
  → [AI-Suggested Accounts] → JournalEntry → [Layer 3 Aware] → Reports
  → [Learning AI] → Chat Insights + Feedback Loop

Success Criteria:

  • AI suggests transaction categories during extraction (≥70% accuracy)
  • Classification uses AI when rules fail (ML_MODEL rule type implemented)
  • Journal entries use classified categories instead of "Uncategorized"
  • User corrections feed back into AI prompts (few-shot learning)
  • Reports read Layer 3 classification results


👥 Multi-Role Review

| Role | Focus | Review Opinion |
| --- | --- | --- |
| 🏗️ Architect | Pipeline Design | AI adds fields to the extraction prompt, not new services. Classification becomes an AI + rules hybrid. Feedback loop via the CorrectionLog table. |
| 📊 Accountant | Data Integrity | AI suggestions are NEVER auto-posted. Must pass through review queue. Confidence thresholds: ≥85 auto-accept, 60-84 review, <60 flag. See: docs/ssot/reconciliation.md#thresholds |
| 💻 Developer | Implementation | Extend existing extraction.py prompt, implement RuleType.ML_MODEL in classification.py, modify create_entry_from_txn in review_queue.py. |
| 🧪 Tester | Validation | Test: AI category accuracy, fallback to Uncategorized when AI fails, feedback loop persistence, Layer 3→4 data flow. |
| 📋 PM | User Experience | Reduces manual categorization work by 70%+. User sees AI suggestions and corrects only mistakes. Corrections make future suggestions better. |
| 🤖 AI/ML | Model Strategy | No custom model training needed. Uses the existing OpenRouter vision model with prompt engineering + few-shot examples from user corrections. |
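
The three confidence bands quoted in the Accountant row can be sketched as a small routing function. This is an illustration only: the `Route` enum and `route_by_confidence` name are hypothetical, and treating fractional scores between 84 and 85 as "review" is an assumption, since the source only gives integer bands.

```python
from enum import Enum


class Route(Enum):
    """Hypothetical handling routes for a scored suggestion."""
    AUTO_ACCEPT = "auto_accept"
    REVIEW = "review"
    FLAG = "flag"


def route_by_confidence(score: float) -> Route:
    """Map a 0-100 confidence score to a route.

    Bands follow the document: >=85 auto-accept, 60-84 review, <60 flag.
    Assumption: scores in (84, 85) fall into the review band.
    """
    if score >= 85:
        return Route.AUTO_ACCEPT
    if score >= 60:
        return Route.REVIEW
    return Route.FLAG
```

Keeping the bands in one function would make it easier to stay aligned with docs/ssot/reconciliation.md#thresholds if the cut-offs change.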

🔗 Relationship to Other EPICs

| EPIC | Relationship |
| --- | --- |
| EPIC-003 (Statement Parsing) | Extends extraction prompt with category fields |
| EPIC-004 (Reconciliation) | Adds AI semantic scoring for 60-84 confidence matches |
| EPIC-006 (AI Advisor) | Shares OpenRouter infrastructure; advisor gains write-suggest capability |
| EPIC-013 (Statement Parsing V2) | Builds on V2's confidence scoring framework |
| EPIC-016 (Two-Stage Review) | Complementary: AI automates what it can; EPIC-016 handles human review for what AI can't confidently classify |
| EPIC-017 (Portfolio) | Independent: no direct dependency |

✅ Task Checklist

Phase 1: AI-Powered Classification (1-2 weeks, Highest ROI)

1.1 Extraction Prompt Enhancement

  • [x] Add suggested_category and category_confidence fields to extraction prompt
  • File: apps/backend/src/prompts/statement.py
  • Categories: Food & Dining, Transport, Shopping, Utilities, Salary, Transfer, Investment, Insurance, Rent, Healthcare, Entertainment, Education, Subscriptions, Other
  • Confidence: 0.0-1.0 float returned by AI
  • [x] Add suggested_category VARCHAR(100) and category_confidence DECIMAL(3,2) columns to BankStatementTransaction
  • File: apps/backend/src/models/statement.py
  • Migration: Alembic migration with nullable columns (backward compatible)
  • [x] Update extraction service to parse and persist AI-returned category fields
  • File: apps/backend/src/services/extraction.py
  • Graceful fallback: if AI omits category, set suggested_category=NULL, category_confidence=0.0
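
A minimal sketch of the graceful fallback described above, assuming the AI response for one transaction arrives as a plain dict. The function name `parse_category_fields` is hypothetical. Clamping the confidence to 0.00-1.00 (so it fits DECIMAL(3,2)) and keeping a category whose confidence is unparsable (with confidence 0.00) are assumptions, not behavior stated in the source.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

ZERO = Decimal("0.00")
ONE = Decimal("1.00")


def parse_category_fields(raw: dict[str, Any]) -> tuple[Optional[str], Decimal]:
    """Return (suggested_category, category_confidence) for one AI-extracted row."""
    category = raw.get("suggested_category")
    if not isinstance(category, str) or not category.strip():
        # AI omitted the category: NULL category, 0.0 confidence.
        return None, ZERO
    category = category.strip()
    try:
        confidence = Decimal(str(raw.get("category_confidence")))
    except InvalidOperation:
        # Assumption: keep the category but never trust it automatically.
        return category, ZERO
    if not confidence.is_finite():
        return category, ZERO
    # Clamp and round so the value fits a DECIMAL(3,2) column.
    confidence = min(max(confidence, ZERO), ONE)
    return category, confidence.quantize(Decimal("0.01"))
```

Because the columns are nullable, rows extracted before this change simply carry NULL and are handled by the same fallback path.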

1.2 Classification Service: Implement ML_MODEL Rule Type

  • [x] Implement RuleType.ML_MODEL match logic in ClassificationService.evaluate_rule()
  • File: apps/backend/src/services/classification.py
  • Logic: Read suggested_category from BankStatementTransaction → apply confidence threshold
  • Threshold: category_confidence ≥ 0.7 → accept AI suggestion
  • The service is currently 91 lines and its ML_MODEL case always returns False; make it functional
  • [x] Add classify_with_ai() method that queries extraction results before falling back to rules
  • Priority: KEYWORD_MATCH โ†’ REGEX_MATCH โ†’ ML_MODEL (AI suggestion) โ†’ Uncategorized
  • This preserves existing user-defined rules as highest priority
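
The priority order above can be sketched as follows. The `Rule` and `Txn` dataclasses, `match_rule`, and `classify` are hypothetical stand-ins rather than the real models or the actual `ClassificationService.evaluate_rule()` signature; case-insensitive matching is also an assumption. Only the ordering and the 0.7 threshold come from the source.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RuleType(Enum):
    KEYWORD_MATCH = "keyword_match"
    REGEX_MATCH = "regex_match"
    ML_MODEL = "ml_model"


@dataclass
class Rule:
    rule_type: RuleType
    pattern: str = ""   # keyword or regex; unused for ML_MODEL
    category: str = ""  # unused for ML_MODEL


@dataclass
class Txn:
    description: str
    suggested_category: Optional[str] = None
    category_confidence: float = 0.0


AI_CONFIDENCE_THRESHOLD = 0.7
PRIORITY = [RuleType.KEYWORD_MATCH, RuleType.REGEX_MATCH, RuleType.ML_MODEL]


def match_rule(rule: Rule, txn: Txn) -> Optional[str]:
    """Return the category this rule assigns, or None if it does not match."""
    if rule.rule_type is RuleType.KEYWORD_MATCH:
        hit = rule.pattern.lower() in txn.description.lower()
        return rule.category if hit else None
    if rule.rule_type is RuleType.REGEX_MATCH:
        hit = re.search(rule.pattern, txn.description, re.IGNORECASE)
        return rule.category if hit else None
    # ML_MODEL: accept the AI suggestion only at or above the threshold.
    if txn.suggested_category and txn.category_confidence >= AI_CONFIDENCE_THRESHOLD:
        return txn.suggested_category
    return None


def classify(txn: Txn, rules: list[Rule]) -> str:
    """Try rule types in priority order; fall back to Uncategorized."""
    for rule_type in PRIORITY:
        for rule in rules:
            if rule.rule_type is rule_type:
                category = match_rule(rule, txn)
                if category is not None:
                    return category
    return "Uncategorized"
```

Iterating by rule type first, rather than by rule order, is what keeps user-defined keyword and regex rules ahead of the AI suggestion regardless of how rules are stored.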

1.3 Journal Entry: Read Classification Before Uncategorized Fallback

  • [x] Modify create_entry_from_txn() to check classification results before defaulting to Uncategorized
  • File: apps/backend/src/services/review_queue.py (lines 264-359)
  • Current: get_or_create_account(db, name="Income - Uncategorized") / "Expense - Uncategorized"
  • Target: Check TransactionClassification for the transaction → use classified account if exists → fallback to Uncategorized
  • Account naming: "Income - {category}" or "Expense - {category}" (e.g., "Expense - Food & Dining")
  • [x] Ensure get_or_create_account() creates accounts on-demand for new AI-suggested categories
  • Auto-created accounts must be: user-scoped, correct type (Income/Expense), correct currency
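
To illustrate the naming rule and the on-demand, user-scoped creation, here is a small in-memory sketch. It is not the real `get_or_create_account(db, name=...)`: the `Account` dataclass, the `resolve_category_account` helper, and the dict standing in for the database are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Account:
    user_id: int
    name: str
    type: str      # "Income" or "Expense"
    currency: str


# Stand-in for the accounts table, keyed so accounts are never shared across users.
_accounts: dict[tuple[int, str, str], Account] = {}


def resolve_category_account(
    user_id: int, account_type: str, category: Optional[str], currency: str
) -> Account:
    """Return the account for a classified category, creating it on demand.

    Falls back to "<type> - Uncategorized" when no classification exists.
    """
    if account_type not in ("Income", "Expense"):
        raise ValueError(f"unsupported account type: {account_type!r}")
    name = f"{account_type} - {category or 'Uncategorized'}"
    key = (user_id, name, currency)
    if key not in _accounts:
        _accounts[key] = Account(user_id, name, account_type, currency)
    return _accounts[key]
```

For example, `resolve_category_account(1, "Expense", "Food & Dining", "SGD")` yields an account named "Expense - Food & Dining", and a second call with the same arguments returns the same object instead of creating a duplicate.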

1.4 Tests for Phase 1

  • [x] Test: AI extraction includes suggested_category and category_confidence in response
  • [x] Test: Missing AI category fields gracefully default to NULL/0.0
  • [x] Test: ML_MODEL rule type returns True when confidence ≥ 0.7
  • [x] Test: ML_MODEL rule type returns False when confidence < 0.7
  • [x] Test: Classification priority: KEYWORD > REGEX > ML_MODEL > Uncategorized
  • [x] Test: create_entry_from_txn uses classified category when available
  • [x] Test: create_entry_from_txn falls back to Uncategorized when no classification exists
  • [x] Test: Auto-created category accounts are correct type and user-scoped

Phase 2: Feedback Learning Loop (1 week)

2.1 Correction Log Model

  • [x] Create CorrectionLog model
  • File: apps/backend/src/models/correction.py (new)
  • Fields: id, user_id, transaction_id, original_category, corrected_category, original_account_id, corrected_account_id, created_at
  • Links to: BankStatementTransaction, Account, User
  • Purpose: Track every user correction for few-shot learning
  • [x] Alembic migration for correction_log table

2.2 Correction Recording API

  • [x] POST /api/corrections: Record a user correction
  • Input: transaction_id, corrected_category, corrected_account_id
  • Auto-fills original_category from transaction's current classification
  • Returns: correction record
  • [x] GET /api/corrections/stats: Get correction statistics
  • Return: top N corrected categories, accuracy rate per category, total corrections
  • Use for monitoring AI quality over time
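
One way the stats aggregation could look, as a sketch over plain dicts rather than CorrectionLog rows. The function name, the `classified_counts` input (classifications per category), and the choice to define per-category accuracy as 1 - corrected / classified are assumptions made for illustration.

```python
from collections import Counter


def correction_stats(
    corrections: list[dict], classified_counts: dict[str, int], top_n: int = 5
) -> dict:
    """Aggregate corrections into the figures the stats endpoint returns.

    corrections: dicts with at least an "original_category" key.
    classified_counts: hypothetical map of category -> total classifications.
    """
    corrected = Counter(c["original_category"] for c in corrections)
    accuracy = {
        category: 1 - corrected.get(category, 0) / total
        for category, total in classified_counts.items()
        if total > 0  # skip categories with nothing classified yet
    }
    return {
        "total_corrections": len(corrections),
        "top_corrected_categories": corrected.most_common(top_n),
        "accuracy_by_category": accuracy,
    }
```

A category with many corrections relative to its classifications shows up with low accuracy, which is the signal to watch when monitoring AI quality over time.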

2.3 Few-Shot Prompt Injection

  • [x] Query CorrectionLog for user's recent corrections (last 50)
  • Group by original_category → corrected_category pattern
  • Inject as few-shot examples into extraction prompt
  • Format: "Previously, transactions like '{description}' were categorized as '{corrected_category}'"
  • [x] Update apps/backend/src/prompts/statement.py to accept correction examples
  • Add correction_examples: list[dict] parameter to prompt builder
  • Inject up to 10 most relevant corrections as few-shot context
  • [x] Add cache for correction examples (per user, 1-hour TTL)
  • Avoid querying correction log on every extraction call
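
The injection and caching steps above might be sketched like this. The names `get_correction_examples` and `build_prompt`, the in-process dict cache, and the `fetch` callback standing in for the CorrectionLog query are hypothetical. The example dicts are assumed to carry a `description` (joined from the transaction) and a `corrected_category`.

```python
import time
from typing import Callable

CACHE_TTL_SECONDS = 3600  # 1-hour TTL per user
MAX_EXAMPLES = 10         # at most 10 corrections injected as few-shot context

# user_id -> (time cached, examples)
_cache: dict[int, tuple[float, list[dict]]] = {}


def get_correction_examples(
    user_id: int,
    fetch: Callable[[int], list[dict]],
    now: Callable[[], float] = time.monotonic,
) -> list[dict]:
    """Return cached examples, re-querying only after the TTL has elapsed."""
    cached = _cache.get(user_id)
    if cached is not None and now() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    examples = fetch(user_id)[:MAX_EXAMPLES]
    _cache[user_id] = (now(), examples)
    return examples


def build_prompt(base_prompt: str, correction_examples: list[dict]) -> str:
    """Append few-shot lines; an empty list leaves the standard prompt unchanged."""
    if not correction_examples:
        return base_prompt
    lines = [
        f"Previously, transactions like '{ex['description']}' "
        f"were categorized as '{ex['corrected_category']}'"
        for ex in correction_examples[:MAX_EXAMPLES]
    ]
    return base_prompt + "\n\n" + "\n".join(lines)
```

Passing the clock in as a parameter keeps the TTL test deterministic. A shared cache would be needed if the backend runs as multiple processes.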

2.4 Tests for Phase 2

  • [x] Test: Correction log records original and corrected categories
  • [x] Test: Correction stats aggregate correctly
  • [x] Test: Few-shot examples injected into extraction prompt
  • [x] Test: Prompt with corrections produces different output than without (mock test)
  • [x] Test: Correction cache invalidates after TTL
  • [x] Test: Empty correction log produces standard prompt (no few-shot)

Phase 3: AI-Assisted Reconciliation (1-2 weeks)

3.1 AI Semantic Scoring

  • [x] Add ai_semantic_score() method to reconciliation service
  • File: apps/backend/src/services/reconciliation.py
  • Trigger: Only for candidates scoring 60-84 (review queue range)
  • Input: Transaction description pair (bank statement + journal entry memo)
  • Output: Semantic similarity score (0-100) from AI
  • Cost control: Only called for review-queue candidates, not all matches
  • [x] Create apps/backend/src/prompts/reconciliation.py (new)
  • Prompt: "Given these two transaction descriptions, rate their semantic similarity (0-100)"
  • Include context: date proximity, amount match, account info
  • Response format: JSON with similarity_score and reasoning
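
A sketch of the prompt and of defensive parsing for the JSON response described above. Both function names are hypothetical, the exact prompt wording beyond the quoted sentence is illustrative, and returning None for malformed or out-of-range responses (so the caller can keep the algorithmic score) is an assumption.

```python
import json
from typing import Optional


def build_similarity_prompt(statement_description: str, entry_memo: str) -> str:
    """Assemble the prompt text; the context fields are omitted for brevity."""
    return (
        "Given these two transaction descriptions, rate their semantic "
        "similarity (0-100).\n"
        f"Bank statement: {statement_description}\n"
        f"Journal entry memo: {entry_memo}\n"
        'Respond with JSON: {"similarity_score": <0-100>, "reasoning": "<text>"}'
    )


def parse_similarity_response(raw: str) -> Optional[float]:
    """Extract similarity_score, or None if the response is unusable."""
    try:
        data = json.loads(raw)
        score = float(data["similarity_score"])
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, missing key, or a non-numeric / non-dict payload.
        return None
    if not 0 <= score <= 100:
        return None
    return score
```

Treating a bad response as "no score" rather than as zero avoids penalizing a candidate match just because the model returned malformed output.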

3.2 Hybrid Scoring Integration

  • [x] Modify calculate_match_score() to incorporate AI semantic score
  • Current: Pure algorithmic (date, amount, description fuzzy match)
  • New: final_score = 0.7 * algorithmic_score + 0.3 * ai_semantic_score
  • Only applies when algorithmic score is in 60-84 range
  • Scores outside that range remain unchanged (≥85 auto-accept, <60 unmatched)
  • [x] Add feature flag: enable_ai_reconciliation in config.py
  • Default: False (opt-in to avoid unexpected API costs)
  • When disabled: existing pure-algorithmic behavior unchanged
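
The weighting and gating rules above can be expressed in a few lines. This is a sketch, not the actual `calculate_match_score()`: the function name, the lazy `ai_scorer` callback, and falling back to the algorithmic score when the AI returns nothing are assumptions.

```python
from typing import Callable, Optional

ALGORITHMIC_WEIGHT = 0.7
AI_WEIGHT = 0.3


def hybrid_match_score(
    algorithmic_score: float,
    ai_scorer: Callable[[], Optional[float]],
    enable_ai_reconciliation: bool = False,
) -> float:
    """Blend in the AI semantic score only for review-band candidates.

    ai_scorer is invoked lazily, so no API call (and no cost) is incurred
    when the flag is off or the score is outside the 60-84 band.
    """
    in_review_band = 60 <= algorithmic_score < 85
    if not enable_ai_reconciliation or not in_review_band:
        return algorithmic_score
    ai_score = ai_scorer()
    if ai_score is None:
        # Assumption: an unusable AI response leaves the score unchanged.
        return algorithmic_score
    return ALGORITHMIC_WEIGHT * algorithmic_score + AI_WEIGHT * ai_score
```

For an algorithmic score of 70 and an AI score of 100, the blend is 0.7 × 70 + 0.3 × 100 = 79. Note that a blended score can leave the review band in either direction: 84 with an AI score of 100 becomes 88.8, which crosses the auto-accept threshold.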

3.3 Tests for Phase 3

  • [x] Test: ai_semantic_score() returns score for matching descriptions
  • [x] Test: ai_semantic_score() returns low score for unrelated descriptions
  • [x] Test: Hybrid scoring only triggers for 60-84 range candidates
  • [x] Test: Feature flag disables AI scoring when False
  • [x] Test: Algorithmic scores ≥85 and <60 bypass AI scoring entirely
  • [x] Test: Final score correctly weights algorithmic (0.7) and AI (0.3)

Phase 4: Pipeline Integration & Report Fix (1-2 weeks)

4.1 Reports Read Layer 3 Classification

  • [x] Modify reporting.py to read TransactionClassification (Layer 3) instead of raw JournalLine
  • File: apps/backend/src/services/reporting.py
  • Current: Reports read JournalEntry → JournalLine directly, ignoring Layer 3
  • Target: Reports query TransactionClassification for category breakdowns
  • Fallback: If transaction has no classification, use account name as category (backward compatible)
  • [x] Add category breakdown to Income Statement
  • Group expenses/income by classified category
  • Show: Category, Amount, % of Total
  • Use TransactionClassification.assigned_category field
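
A sketch of the category breakdown with the account-name fallback, working on plain tuples instead of ORM rows. The function name and the `(assigned_category, account_name, amount)` row shape are hypothetical; rounding the percentage to one decimal place is also an assumption.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Iterable, Optional

Row = tuple[Optional[str], str, Decimal]  # (assigned_category, account_name, amount)


def category_breakdown(rows: Iterable[Row]) -> list[dict]:
    """Group amounts by category and compute each category's share of the total."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for assigned_category, account_name, amount in rows:
        # Fallback: unclassified transactions are grouped by account name.
        totals[assigned_category or account_name] += amount
    grand_total = sum(totals.values(), Decimal("0"))
    breakdown = []
    for category, amount in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        if grand_total:
            percent = (amount / grand_total * 100).quantize(Decimal("0.1"))
        else:
            percent = Decimal("0.0")
        breakdown.append(
            {"category": category, "amount": amount, "percent_of_total": percent}
        )
    return breakdown
```

Using Decimal throughout avoids float rounding in monetary totals; the output columns match the Category, Amount, % of Total layout listed above.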

4.2 ReportSnapshot (Layer 4) Utilization

  • [x] Implement ReportSnapshot generation
  • File: apps/backend/src/models/layer4.py (model exists but unused)
  • Generate snapshots after report computation
  • Store: report type, date range, computed data (JSONB), generated_at
  • Enable historical comparison: "This month vs last month" reports
  • [x] Add GET /api/reports/{type}/snapshots endpoint
  • List available snapshots for a report type
  • Enable time-series trend analysis

4.3 CSV Parsing via AI (Remove Hardcoding)

  • [x] Add AI-powered CSV parsing as fallback for unknown institutions
  • Current: CSV parsing is hardcoded per institution (DBS, Wise, etc.)
  • New: When institution is unknown, send CSV header + sample rows to AI
  • AI returns: column mapping (date, description, amount, balance)
  • Preserve existing hardcoded parsers for known institutions (they're faster and free)
  • [x] Create apps/backend/src/prompts/csv_mapping.py (new)
  • Prompt: "Given this CSV header and sample data, identify which columns are date, description, amount, balance"
  • Response: JSON column mapping
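
Since the AI-returned mapping drives all later parsing, it seems worth validating it against the real CSV header before use. The sketch below assumes the response is a JSON object of field name to header column; the function name is hypothetical, and treating `balance` as optional is an assumption.

```python
import json

REQUIRED_FIELDS = ("date", "description", "amount")
OPTIONAL_FIELDS = ("balance",)  # assumption: not every export has a balance column


def validate_column_mapping(raw_response: str, header: list[str]) -> dict[str, str]:
    """Parse the AI's JSON mapping and check it against the CSV header.

    Raises ValueError (json.JSONDecodeError is a subclass) on any problem,
    so the caller can fall back to manual mapping.
    """
    mapping = json.loads(raw_response)
    if not isinstance(mapping, dict):
        raise ValueError("column mapping must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in mapping]
    if missing:
        raise ValueError(f"mapping is missing required fields: {missing}")
    result = {}
    for field in (*REQUIRED_FIELDS, *OPTIONAL_FIELDS):
        if field not in mapping:
            continue
        column = mapping[field]
        if column not in header:
            raise ValueError(f"{field!r} maps to unknown column {column!r}")
        result[field] = column
    return result
```

Rejecting columns that are not in the header guards against the model inventing a column name that merely sounds plausible.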

4.4 Tests for Phase 4

  • [x] Test: Reports include category breakdown from Layer 3 classification
  • [x] Test: Reports fallback to account name when no classification exists
  • [x] Test: ReportSnapshot generated and stored after report computation
  • [x] Test: ReportSnapshot endpoint returns historical snapshots
  • [x] Test: AI CSV parsing returns valid column mapping for unknown institutions
  • [x] Test: Known institution CSV parsing still uses hardcoded parsers (no AI call)

📊 Acceptance Criteria Summary

| AC ID | Phase | Description |
| --- | --- | --- |
| AC18.1.1 | 1 | Extraction prompt returns suggested_category and category_confidence |
| AC18.1.2 | 1 | BankStatementTransaction has suggested_category and category_confidence columns |
| AC18.1.3 | 1 | RuleType.ML_MODEL evaluates AI suggestion with confidence threshold ≥ 0.7 |
| AC18.1.4 | 1 | Classification priority: KEYWORD > REGEX > ML_MODEL > Uncategorized |
| AC18.1.5 | 1 | create_entry_from_txn reads classification before defaulting to Uncategorized |
| AC18.1.6 | 1 | Auto-created category accounts are user-scoped and correctly typed |
| AC18.2.1 | 2 | CorrectionLog model records original and corrected categories |
| AC18.2.2 | 2 | Corrections API records and retrieves correction stats |
| AC18.2.3 | 2 | Few-shot examples from corrections injected into extraction prompt |
| AC18.2.4 | 2 | Correction cache with 1-hour TTL |
| AC18.3.1 | 3 | ai_semantic_score() returns similarity for transaction description pairs |
| AC18.3.2 | 3 | Hybrid scoring: 0.7 * algorithmic + 0.3 * AI for 60-84 range only |
| AC18.3.3 | 3 | Feature flag enable_ai_reconciliation controls AI scoring |
| AC18.4.1 | 4 | Reports read Layer 3 TransactionClassification for category breakdowns |
| AC18.4.2 | 4 | ReportSnapshot (Layer 4) generated and queryable via API |
| AC18.4.3 | 4 | AI CSV parsing handles unknown institutions as fallback |

🚫 Out of Scope (v1)

  • Custom ML model training (use prompt engineering + few-shot only)
  • Real-time model fine-tuning
  • Automated rule generation from corrections
  • AI-powered anomaly detection (separate EPIC if needed)
  • Multi-model A/B testing

⚠️ Risk Assessment

| Risk | Impact | Mitigation |
| --- | --- | --- |
| AI category accuracy < 70% | Users lose trust, more corrections needed | Start with broad categories (14), not fine-grained. Measure accuracy before expanding. |
| OpenRouter API costs increase | Budget overrun | AI reconciliation behind feature flag. AI classification adds ~1 field to existing call (minimal cost). |
| Few-shot examples degrade quality | Worse suggestions over time | Limit to 10 most recent corrections. Monitor accuracy metrics. Reset mechanism available. |
| Layer 3→4 migration breaks reports | Existing reports break | Fallback: if no classification, use account name. Backward compatible. |

Metrics & Monitoring

| Metric | Target | Measurement |
| --- | --- | --- |
| AI category accuracy | ≥ 70% (Phase 1) | 1 - (corrections / total_classifications) |
| Uncategorized reduction | ≥ 50% decrease | Count of "Uncategorized" journal entries before/after |
| AI reconciliation improvement | +5% match rate in 60-84 range | Compare match rates with flag on/off |
| Feedback loop effectiveness | Accuracy improves 5%+ after 50 corrections | Track accuracy over time per user |
| API cost per extraction | < $0.01 increase | Monitor OpenRouter billing (category field adds ~50 tokens) |

Last updated: March 2026


🆕 Phase 5 - UI Gap Audit (April 2026): Confidence Hierarchy, AI Suggestion Review Queue, Feature-Flag UI & Audit Trail

Origin: UI gap audit against vision.md and docs/ssot/source-type-priority.md / docs/ssot/confirmation-workflow.md. The backend confidence hierarchy and the enable_ai_reconciliation flag exist but are invisible to users: there is no badge, no review queue for AI suggestions, no in-product flag toggle, and no audit trail.

Acceptance Criteria โ€” Phase 5 (Confidence & AI Suggestion UI)

  • [x] AC18.5.1 <ConfidenceBadge /> component renders TRUSTED / HIGH / MEDIUM / LOW pill with consistent color tokens (green / blue / amber / gray) and tooltip explaining source-type priority
  • [x] AC18.5.2 ConfidenceBadge mounted on every transaction row in Stage 1 review, Stage 2 listing, and processing-account listing; reads confidence_tier from API response
  • [x] AC18.5.3 AI Suggestion Review Queue page /review/ai-suggestions lists pending AI classifications and AI reconciliation matches in score band 60-84 with {transaction, suggested_category_or_match, ai_score, ai_reasoning}
  • [x] AC18.5.4 Queue actions: Accept, Reject, Edit-then-Accept; each action calls POST /api/ai/feedback with {suggestion_id, action, corrected_value?} to feed the feedback loop
  • [x] AC18.5.5 Settings page /settings/ai exposes toggles for enable_ai_reconciliation, enable_ai_classification, persisted via PATCH /api/users/me/settings; toggle reflects backend feature-flag state on load
  • [x] AC18.5.6 Audit Trail panel on transaction detail page lists chronological {timestamp, actor, action, old_value, new_value} from GET /api/transactions/{id}/audit, including AI-applied changes labeled with actor ai
  • [x] AC18.5.7 Frontend tests: mount ConfidenceBadge for each tier; mount AI Suggestion Queue and assert Accept/Reject buttons render; mount Settings AI toggles and assert default state matches API

Priority: P0. Confidence visibility is a vision-critical trust signal, and the AI review queue is the human-in-the-loop hinge for the entire AI pipeline.

Estimated effort (frontend; assumes the Phase 1-4 backend endpoints from this EPIC have landed):

  • ConfidenceBadge + integration: 2 days
  • AI Suggestion Queue: 4-5 days
  • Settings AI toggles: 2 days
  • Audit Trail panel: 2-3 days
  • Frontend tests: 1-2 days
  • Total: ~11-14 days