EPIC-018: AI-Driven Data Pipeline
Status: 🟡 In Progress
Vision Anchor: `decision-2-event-middle-layer`
Phase: 4 (AI Enhancement)
Duration: 4-7 weeks
Priority: P1 (High Priority - Parallel with EPIC-016)
Dependencies: EPIC-003 (Statement Parsing), EPIC-004 (Reconciliation Engine), EPIC-006 (AI Advisor), EPIC-013 (Statement Parsing V2)
🎯 Objective
Maximize AI utilization across the entire data pipeline, from statement upload to financial reports. Currently, AI is used in only 2 of 7 pipeline stages (extraction and chat advisor). This EPIC extends AI into classification, reconciliation, journal entry creation, and feedback learning, transforming the pipeline from "AI extracts, human does everything else" to "AI handles what it can confidently, human reviews what it can't."
Core Principle (from vision.md): AI is a parsing and explanation layer, not a source of record. Confidence thresholds determine auto-accept vs. human review.
Current Pipeline (Before):
Upload → [AI Vision] → BankStatement → [Rules Only] → Classification
       → [Hardcoded Uncategorized] → JournalEntry → [Bypass Layer 3] → Reports
       → [Read-Only AI] → Chat Insights
Target Pipeline (After):
Upload → [AI Vision + Category] → BankStatement → [AI + Rules Hybrid] → Classification
       → [AI-Suggested Accounts] → JournalEntry → [Layer 3 Aware] → Reports
       → [Learning AI] → Chat Insights + Feedback Loop
Success Criteria:
- AI suggests transaction categories during extraction (≥70% accuracy)
- Classification uses AI when rules fail (ML_MODEL rule type implemented)
- Journal entries use classified categories instead of "Uncategorized"
- User corrections feed back into AI prompts (few-shot learning)
- Reports read Layer 3 classification results
👥 Multi-Role Review
| Role | Focus | Review Opinion |
|---|---|---|
| 🏗️ Architect | Pipeline Design | AI adds fields to extraction prompt, not new services. Classification becomes an AI + rules hybrid. Feedback loop via CorrectionLog table. |
| 📒 Accountant | Data Integrity | AI suggestions are NEVER auto-posted. Must pass through review queue. Confidence thresholds: ≥85 auto-accept, 60-84 review, <60 flag. See: docs/ssot/reconciliation.md#thresholds |
| 💻 Developer | Implementation | Extend existing extraction.py prompt, implement RuleType.ML_MODEL in classification.py, modify create_entry_from_txn in review_queue.py. |
| 🧪 Tester | Validation | Test: AI category accuracy, fallback to Uncategorized when AI fails, feedback loop persistence, Layer 3→4 data flow. |
| 📋 PM | User Experience | Reduces manual categorization work by 70%+. Users see AI suggestions and correct only the mistakes. Corrections make future suggestions better. |
| 🤖 AI/ML | Model Strategy | No custom model training needed. Uses the existing OpenRouter vision model with prompt engineering + few-shot examples from user corrections. |
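The accountant's thresholds above can be sketched as a small routing helper (illustrative only; the function name and return labels are not from the codebase):

```python
def route_by_confidence(score: float) -> str:
    """Route an AI suggestion by confidence score (0-100 scale).

    >= 85 -> auto-accept, 60-84 -> human review queue, < 60 -> flag.
    Per the core principle, even "auto-accept" means a reviewable
    record is created; AI suggestions are never posted directly.
    """
    if score >= 85:
        return "auto_accept"
    if score >= 60:
        return "review"
    return "flag"
```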
🔗 Relationship to Other EPICs
| EPIC | Relationship |
|---|---|
| EPIC-003 (Statement Parsing) | Extends extraction prompt with category fields |
| EPIC-004 (Reconciliation) | Adds AI semantic scoring for 60-84 confidence matches |
| EPIC-006 (AI Advisor) | Shares OpenRouter infrastructure; advisor gains write-suggest capability |
| EPIC-013 (Statement Parsing V2) | Builds on V2's confidence scoring framework |
| EPIC-016 (Two-Stage Review) | Complementary – AI automates what it can, EPIC-016 handles human review for what AI can't confidently classify |
| EPIC-017 (Portfolio) | Independent โ no direct dependency |
✅ Task Checklist
Phase 1: AI-Powered Classification – 1-2 weeks (Highest ROI)
1.1 Extraction Prompt Enhancement
- [x] Add `suggested_category` and `category_confidence` fields to extraction prompt
  - File: `apps/backend/src/prompts/statement.py`
  - Categories: Food & Dining, Transport, Shopping, Utilities, Salary, Transfer, Investment, Insurance, Rent, Healthcare, Entertainment, Education, Subscriptions, Other
  - Confidence: 0.0-1.0 float returned by AI
- [x] Add `suggested_category` VARCHAR(100) and `category_confidence` DECIMAL(3,2) columns to `BankStatementTransaction`
  - File: `apps/backend/src/models/statement.py`
  - Migration: Alembic migration with nullable columns (backward compatible)
- [x] Update extraction service to parse and persist AI-returned category fields
  - File: `apps/backend/src/services/extraction.py`
  - Graceful fallback: if AI omits category, set `suggested_category=NULL`, `category_confidence=0.0`
1.2 Classification Service: Implement ML_MODEL Rule Type
- [x] Implement `RuleType.ML_MODEL` match logic in `ClassificationService.evaluate_rule()`
  - File: `apps/backend/src/services/classification.py`
  - Logic: Read `suggested_category` from `BankStatementTransaction` → apply confidence threshold
  - Threshold: `category_confidence ≥ 0.7` → accept AI suggestion
  - Currently 91 lines; the `ML_MODEL` case returns `False` → make it functional
- [x] Add `classify_with_ai()` method that queries extraction results before falling back to rules
  - Priority: KEYWORD_MATCH → REGEX_MATCH → ML_MODEL (AI suggestion) → Uncategorized
  - This preserves existing user-defined rules as highest priority
1.3 Journal Entry: Read Classification Before Uncategorized Fallback
- [x] Modify `create_entry_from_txn()` to check classification results before defaulting to Uncategorized
  - File: `apps/backend/src/services/review_queue.py` (lines 264-359)
  - Current: `get_or_create_account(db, name="Income - Uncategorized")` / `"Expense - Uncategorized"`
  - Target: Check `TransactionClassification` for the transaction → use the classified account if it exists → fall back to Uncategorized
  - Account naming: `"Income - {category}"` or `"Expense - {category}"` (e.g., `"Expense - Food & Dining"`)
- [x] Ensure `get_or_create_account()` creates accounts on demand for new AI-suggested categories
  - Auto-created accounts must be: user-scoped, correct type (Income/Expense), correct currency
1.4 Tests for Phase 1
- [x] Test: AI extraction includes `suggested_category` and `category_confidence` in response
- [x] Test: Missing AI category fields gracefully default to NULL/0.0
- [x] Test: `ML_MODEL` rule type returns True when confidence ≥ 0.7
- [x] Test: `ML_MODEL` rule type returns False when confidence < 0.7
- [x] Test: Classification priority: KEYWORD > REGEX > ML_MODEL > Uncategorized
- [x] Test: `create_entry_from_txn` uses classified category when available
- [x] Test: `create_entry_from_txn` falls back to Uncategorized when no classification exists
- [x] Test: Auto-created category accounts are correct type and user-scoped
Phase 2: Feedback Learning Loop – 1 week
2.1 Correction Log Model
- [x] Create `CorrectionLog` model
  - File: `apps/backend/src/models/correction.py` (new)
  - Fields: `id`, `user_id`, `transaction_id`, `original_category`, `corrected_category`, `original_account_id`, `corrected_account_id`, `created_at`
  - Links to: `BankStatementTransaction`, `Account`, `User`
  - Purpose: Track every user correction for few-shot learning
- [x] Alembic migration for `correction_log` table
2.2 Correction Recording API
- [x] `POST /api/corrections` – Record a user correction
  - Input: `transaction_id`, `corrected_category`, `corrected_account_id`
  - Auto-fills `original_category` from the transaction's current classification
  - Returns: correction record
- [x] `GET /api/corrections/stats` – Get correction statistics
  - Returns: top N corrected categories, accuracy rate per category, total corrections
  - Use for monitoring AI quality over time
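The aggregation behind `GET /api/corrections/stats` could be sketched as follows (illustrative; the field names mirror the `CorrectionLog` model above, but the function itself is hypothetical):

```python
from collections import Counter
from typing import Any


def correction_stats(corrections: list[dict[str, Any]], top_n: int = 5) -> dict[str, Any]:
    """Aggregate correction records into the stats shape sketched in 2.2.

    Each record is assumed to carry 'original_category' and
    'corrected_category' (per the CorrectionLog fields).
    """
    by_original = Counter(c["original_category"] for c in corrections)
    return {
        "total_corrections": len(corrections),
        # Categories the AI most often gets wrong, for quality monitoring.
        "top_corrected_categories": by_original.most_common(top_n),
    }
```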
2.3 Few-Shot Prompt Injection
- [x] Query `CorrectionLog` for the user's recent corrections (last 50)
  - Group by `original_category → corrected_category` pattern
  - Inject as few-shot examples into the extraction prompt
  - Format: "Previously, transactions like '{description}' were categorized as '{corrected_category}'"
- [x] Update `apps/backend/src/prompts/statement.py` to accept correction examples
  - Add `correction_examples: list[dict]` parameter to the prompt builder
  - Inject up to 10 most relevant corrections as few-shot context
- [x] Add cache for correction examples (per user, 1-hour TTL)
  - Avoid querying the correction log on every extraction call
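A sketch of the few-shot formatting step, using the sentence template from 2.3 (the helper and its input shape are assumptions; the real prompt builder lives in `statement.py`):

```python
from typing import Any


def build_fewshot_examples(corrections: list[dict[str, Any]], limit: int = 10) -> list[str]:
    """Format recent corrections as few-shot lines for the prompt (2.3).

    Uses the sentence template from the checklist; input is assumed
    newest-first and is capped at `limit` examples to bound prompt size.
    """
    lines = []
    for c in corrections[:limit]:
        lines.append(
            "Previously, transactions like "
            f"'{c['description']}' were categorized as '{c['corrected_category']}'"
        )
    return lines
```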
2.4 Tests for Phase 2
- [x] Test: Correction log records original and corrected categories
- [x] Test: Correction stats aggregate correctly
- [x] Test: Few-shot examples injected into extraction prompt
- [x] Test: Prompt with corrections produces different output than without (mock test)
- [x] Test: Correction cache invalidates after TTL
- [x] Test: Empty correction log produces standard prompt (no few-shot)
Phase 3: AI-Assisted Reconciliation – 1-2 weeks
3.1 AI Semantic Scoring
- [x] Add `ai_semantic_score()` method to reconciliation service
  - File: `apps/backend/src/services/reconciliation.py`
  - Trigger: Only for candidates scoring 60-84 (review queue range)
  - Input: Transaction description pair (bank statement + journal entry memo)
  - Output: Semantic similarity score (0-100) from AI
  - Cost control: Only called for review-queue candidates, not all matches
- [x] Create `apps/backend/src/prompts/reconciliation.py` (new)
  - Prompt: "Given these two transaction descriptions, rate their semantic similarity (0-100)"
  - Include context: date proximity, amount match, account info
  - Response format: JSON with `similarity_score` and `reasoning`
3.2 Hybrid Scoring Integration
- [x] Modify `calculate_match_score()` to incorporate the AI semantic score
  - Current: Pure algorithmic (date, amount, description fuzzy match)
  - New: `final_score = 0.7 * algorithmic_score + 0.3 * ai_semantic_score`
  - Only applies when the algorithmic score is in the 60-84 range
  - Scores outside that range remain unchanged (≥85 auto-accept, <60 unmatched)
- [x] Add feature flag: `enable_ai_reconciliation` in `config.py`
  - Default: `False` (opt-in to avoid unexpected API costs)
  - When disabled: existing pure-algorithmic behavior unchanged
3.3 Tests for Phase 3
- [x] Test: `ai_semantic_score()` returns a score for matching descriptions
- [x] Test: `ai_semantic_score()` returns a low score for unrelated descriptions
ai_semantic_score()returns low score for unrelated descriptions - [x] Test: Hybrid scoring only triggers for 60-84 range candidates
- [x] Test: Feature flag disables AI scoring when False
- [x] Test: Algorithmic scores ≥85 and <60 bypass AI scoring entirely
- [x] Test: Final score correctly weights algorithmic (0.7) and AI (0.3)
Phase 4: Pipeline Integration & Report Fix – 1-2 weeks
4.1 Reports Read Layer 3 Classification
- [x] Modify `reporting.py` to read `TransactionClassification` (Layer 3) instead of raw `JournalLine`
  - File: `apps/backend/src/services/reporting.py`
  - Current: Reports read `JournalEntry` → `JournalLine` directly, ignoring Layer 3
  - Target: Reports query `TransactionClassification` for category breakdowns
  - Fallback: If a transaction has no classification, use the account name as the category (backward compatible)
- [x] Add category breakdown to Income Statement
  - Group expenses/income by classified category
  - Show: Category, Amount, % of Total
  - Use `TransactionClassification.assigned_category` field
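The Layer 3 breakdown with account-name fallback from 4.1 might reduce to something like this (a sketch over plain dicts; the real version would query `TransactionClassification` via the ORM):

```python
from typing import Any


def category_breakdown(lines: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group report lines by Layer 3 category per 4.1.

    Each line is assumed to carry 'amount', 'account_name', and an
    optional 'classification' (the Layer 3 category). Unclassified
    lines fall back to the account name, keeping old reports working.
    """
    totals: dict[str, float] = {}
    for line in lines:
        category = line.get("classification") or line["account_name"]
        totals[category] = totals.get(category, 0.0) + line["amount"]
    grand_total = sum(totals.values()) or 1.0  # avoid divide-by-zero
    # Category, Amount, % of Total — largest first.
    return [
        {"category": cat, "amount": amt, "pct": round(100 * amt / grand_total, 1)}
        for cat, amt in sorted(totals.items(), key=lambda kv: -kv[1])
    ]
```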
4.2 ReportSnapshot (Layer 4) Utilization
- [x] Implement `ReportSnapshot` generation
  - File: `apps/backend/src/models/layer4.py` (model exists but unused)
  - Generate snapshots after report computation
  - Store: report type, date range, computed data (JSONB), generated_at
  - Enable historical comparison: "This month vs last month" reports
- [x] Add `GET /api/reports/{type}/snapshots` endpoint
  - List available snapshots for a report type
  - Enable time-series trend analysis
4.3 CSV Parsing via AI (Remove Hardcoding)
- [x] Add AI-powered CSV parsing as a fallback for unknown institutions
  - Current: CSV parsing is hardcoded per institution (DBS, Wise, etc.)
  - New: When the institution is unknown, send the CSV header + sample rows to AI
  - AI returns: column mapping (date, description, amount, balance)
  - Preserve existing hardcoded parsers for known institutions (they're faster and free)
- [x] Create `apps/backend/src/prompts/csv_mapping.py` (new)
  - Prompt: "Given this CSV header and sample data, identify which columns are date, description, amount, balance"
  - Response: JSON column mapping
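Whatever prompt shape is used, the AI-returned column mapping should be validated before any statement is parsed with it; a minimal sketch (the function name and error handling are assumptions, not the planned implementation):

```python
import json

REQUIRED_FIELDS = {"date", "description", "amount"}  # balance treated as optional


def validate_csv_mapping(ai_response: str, header: list[str]) -> dict[str, str]:
    """Validate the AI-returned column mapping from 4.3 before use.

    Raises ValueError if required fields are missing or point at
    columns that are not in the CSV header, so a bad AI answer can
    never silently misparse a statement.
    """
    mapping = json.loads(ai_response)
    missing = REQUIRED_FIELDS - mapping.keys()
    if missing:
        raise ValueError(f"AI mapping missing fields: {sorted(missing)}")
    unknown = [col for col in mapping.values() if col not in header]
    if unknown:
        raise ValueError(f"AI mapping references unknown columns: {unknown}")
    return mapping
```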
4.4 Tests for Phase 4
- [x] Test: Reports include category breakdown from Layer 3 classification
- [x] Test: Reports fallback to account name when no classification exists
- [x] Test: ReportSnapshot generated and stored after report computation
- [x] Test: ReportSnapshot endpoint returns historical snapshots
- [x] Test: AI CSV parsing returns valid column mapping for unknown institutions
- [x] Test: Known institution CSV parsing still uses hardcoded parsers (no AI call)
📋 Acceptance Criteria Summary
| AC ID | Phase | Description |
|---|---|---|
| AC18.1.1 | 1 | Extraction prompt returns suggested_category and category_confidence |
| AC18.1.2 | 1 | BankStatementTransaction has suggested_category and category_confidence columns |
| AC18.1.3 | 1 | RuleType.ML_MODEL evaluates AI suggestion with confidence threshold ≥ 0.7 |
| AC18.1.4 | 1 | Classification priority: KEYWORD > REGEX > ML_MODEL > Uncategorized |
| AC18.1.5 | 1 | create_entry_from_txn reads classification before defaulting to Uncategorized |
| AC18.1.6 | 1 | Auto-created category accounts are user-scoped and correctly typed |
| AC18.2.1 | 2 | CorrectionLog model records original and corrected categories |
| AC18.2.2 | 2 | Corrections API records and retrieves correction stats |
| AC18.2.3 | 2 | Few-shot examples from corrections injected into extraction prompt |
| AC18.2.4 | 2 | Correction cache with 1-hour TTL |
| AC18.3.1 | 3 | ai_semantic_score() returns similarity for transaction description pairs |
| AC18.3.2 | 3 | Hybrid scoring: 0.7 * algorithmic + 0.3 * AI for 60-84 range only |
| AC18.3.3 | 3 | Feature flag enable_ai_reconciliation controls AI scoring |
| AC18.4.1 | 4 | Reports read Layer 3 TransactionClassification for category breakdowns |
| AC18.4.2 | 4 | ReportSnapshot (Layer 4) generated and queryable via API |
| AC18.4.3 | 4 | AI CSV parsing handles unknown institutions as fallback |
🚫 Out of Scope (v1)
- Custom ML model training (use prompt engineering + few-shot only)
- Real-time model fine-tuning
- Automated rule generation from corrections
- AI-powered anomaly detection (separate EPIC if needed)
- Multi-model A/B testing
⚠️ Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| AI category accuracy < 70% | Users lose trust, more corrections needed | Start with broad categories (14), not fine-grained. Measure accuracy before expanding. |
| OpenRouter API costs increase | Budget overrun | AI reconciliation behind feature flag. AI classification adds ~1 field to existing call (minimal cost). |
| Few-shot examples degrade quality | Worse suggestions over time | Limit to 10 most recent corrections. Monitor accuracy metrics. Reset mechanism available. |
| Layer 3→4 migration breaks reports | Existing reports break | Fallback: if no classification, use account name. Backward compatible. |
📈 Metrics & Monitoring
| Metric | Target | Measurement |
|---|---|---|
| AI category accuracy | ≥ 70% (Phase 1) | 1 - (corrections / total_classifications) |
| Uncategorized reduction | โฅ 50% decrease | Count of "Uncategorized" journal entries before/after |
| AI reconciliation improvement | +5% match rate in 60-84 range | Compare match rates with flag on/off |
| Feedback loop effectiveness | Accuracy improves 5%+ after 50 corrections | Track accuracy over time per user |
| API cost per extraction | < $0.01 increase | Monitor OpenRouter billing (category field adds ~50 tokens) |
Last updated: March 2026
🔍 Phase 5 – UI Gap Audit (April 2026): Confidence Hierarchy, AI Suggestion Review Queue, Feature-Flag UI & Audit Trail
Origin: UI gap audit against vision.md and docs/ssot/source-type-priority.md / docs/ssot/confirmation-workflow.md. The backend confidence hierarchy and `enable_ai_reconciliation` flag exist but are invisible to users – no badge, no review queue for AI suggestions, no in-product flag toggle, no audit trail.
Acceptance Criteria – Phase 5 (Confidence & AI Suggestion UI)
- [x] AC18.5.1 `<ConfidenceBadge />` component renders `TRUSTED`/`HIGH`/`MEDIUM`/`LOW` pill with consistent color tokens (green / blue / amber / gray) and a tooltip explaining source-type priority
- [x] AC18.5.2 ConfidenceBadge mounted on every transaction row in Stage 1 review, Stage 2 listing, and processing-account listing; reads `confidence_tier` from the API response
- [x] AC18.5.3 AI Suggestion Review Queue page `/review/ai-suggestions` lists pending AI classifications and AI reconciliation matches in score band 60-84 with `{transaction, suggested_category_or_match, ai_score, ai_reasoning}`
- [x] AC18.5.4 Queue actions: `Accept`, `Reject`, `Edit-then-Accept`; each action calls `POST /api/ai/feedback` with `{suggestion_id, action, corrected_value?}` to feed the feedback loop
- [x] AC18.5.5 Settings page `/settings/ai` exposes toggles for `enable_ai_reconciliation`, `enable_ai_classification`, persisted via `PATCH /api/users/me/settings`; toggles reflect backend feature-flag state on load
- [x] AC18.5.6 Audit Trail panel on the transaction detail page lists chronological `{timestamp, actor, action, old_value, new_value}` from `GET /api/transactions/{id}/audit`, including AI-applied changes labeled with actor `ai`
- [x] AC18.5.7 Frontend tests: mount ConfidenceBadge for each tier; mount AI Suggestion Queue and assert Accept/Reject buttons render; mount Settings AI toggles and assert default state matches API
Priority: P0 – confidence visibility is a vision-critical trust signal, and the AI review queue is the human-in-the-loop hinge for the entire AI pipeline. Estimated effort: 2 days ConfidenceBadge + integration • 4-5 days AI Suggestion Queue • 2 days Settings AI toggles • 2-3 days Audit Trail panel • 1-2 days frontend tests. Total ~11-14 days frontend; assumes the Phase 1-4 backend endpoints from this EPIC have landed.