RAG Financial Analysis System
LLM-powered Q&A over Uber, Lyft, and United 10-K filings using FAISS and GPT-4o
Overview
Built a production-grade Retrieval-Augmented Generation (RAG) pipeline over SEC 10-K filings for Uber, Lyft, and United Airlines. The system uses dual FAISS vector indices (text + tables) with Azure OpenAI embeddings to retrieve relevant passages, then passes them to GPT-4o for grounded, citation-aware financial analysis — enabling natural language Q&A over thousands of pages of structured financial data.
Methods
- Document ingestion: parsed Uber, Lyft, and United 10-K PDFs into 2,626 text chunks and 712 table chunks
- Dual-index architecture: separate FAISS indices for narrative text and financial tables to preserve table structure during retrieval
- Embedding: Azure OpenAI text-embedding-3-small for both indexing and query encoding
- Retrieval: top-k=15 nearest neighbor search across both indices, combined into a unified context window
- Generation: GPT-4o (Azure-hosted) with a structured prompt enforcing comparison, trend analysis, and actionable insights
- Frontend: Streamlit interface for interactive query submission and response rendering
Key Findings
- Dual-index retrieval significantly improves coverage — table index captures numerical data that text search misses
- GPT-4o accurately compares multi-company financials: Uber 2023 revenue $37.28B vs Lyft $4.40B (8.5× gap), with correct growth rates (17% vs 7.5%)
- System correctly identifies revenue trends, cost structures, and year-over-year changes from raw financial tables
- Structured prompting (compare + trend + implication + next steps) produces analyst-quality responses rather than raw retrieval
Results
2,626 text chunks + 712 table chunks indexed across 3 company 10-K filings
Accurate cross-company financial comparison: Uber revenue 8.5× Lyft with 2.3× higher growth rate
Sub-second retrieval latency from FAISS approximate nearest neighbor search
Full Streamlit UI for interactive natural language queries over financial documents