A real-time, job-tracked quantitative pipeline that processes equity data at scale — with live WebSocket updates, formal QA validation, and seamless demo/live modes.
Most financial data projects stop at "I got results." That approach leaves four real gaps.
01
Scalability
Data doesn't stay small. Pandas struggles once data outgrows a single machine's memory — PySpark handles millions of rows across tickers without breaking a sweat.
02
Correctness
Rolling metrics like volatility and drawdown are easy to get wrong. Window functions need care — especially drawdown, whose running peak requires an unbounded cumulative max.
03
Validation
No one validates whether the numbers are actually right. This project does — with a formal QA layer that scores alignment against a trusted benchmark.
04
Observability
Most pipelines are black boxes. This one streams live status via WebSocket so you can see exactly what’s happening at every step.
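The correctness gap above is concrete. As a minimal pandas sketch of the same logic (column names and the sample prices are illustrative; the actual pipeline expresses this with PySpark `Window` specs, noted in the comments), 20-day rolling volatility and drawdown look like this — note that the running peak is an *unbounded* cumulative max, not a fixed window:

```python
import numpy as np
import pandas as pd

def rolling_vol_and_drawdown(close: pd.Series, window: int = 20) -> pd.DataFrame:
    """Benchmark-style rolling metrics for a single ticker's close prices."""
    returns = close.pct_change()
    # Rolling standard deviation of returns over a fixed window.
    # PySpark equivalent: Window.partitionBy("ticker").orderBy("date").rowsBetween(-(window - 1), 0)
    vol = returns.rolling(window).std()
    # Running peak: an UNBOUNDED cumulative max from the first row to the current row.
    # PySpark equivalent: rowsBetween(Window.unboundedPreceding, Window.currentRow)
    peak = close.cummax()
    drawdown = close / peak - 1.0      # <= 0 by construction
    max_drawdown = drawdown.cummin()   # worst peak-to-trough decline so far
    return pd.DataFrame({"volatility": vol, "drawdown": drawdown,
                         "max_drawdown": max_drawdown})

prices = pd.Series([100.0, 110.0, 99.0, 104.5, 93.0])
metrics = rolling_vol_and_drawdown(prices, window=3)
```

Getting `rowsBetween` bounds wrong (e.g. a fixed window for the peak) silently understates drawdown — exactly the class of bug the QA layer is built to catch.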
03 — Architecture Overview
Four clean layers, one cohesive real-time system.
Layer 01
UI Layer
React dashboard for visualization and pipeline control
Layer 02
API Layer
FastAPI backend that decouples the frontend from processing logic
Normalized price change series for cross-asset comparison.
〜
Rolling Volatility
20-day standard deviation of returns for risk measurement.
⚡
Momentum
Rate of price change signal for trend detection.
↘
Maximum Drawdown
Peak-to-trough decline tracking for downside risk.
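The first two metrics above reduce to short transforms. A minimal pandas sketch (the lookback length and sample prices are illustrative choices, not the project's exact parameters):

```python
import pandas as pd

def normalized_series(close: pd.Series) -> pd.Series:
    # Rebase prices to 1.0 at the first observation so different tickers
    # can be compared on a single axis.
    return close / close.iloc[0]

def momentum(close: pd.Series, lookback: int = 10) -> pd.Series:
    # Rate of price change over the lookback window: P_t / P_{t-k} - 1.
    return close.pct_change(periods=lookback)

prices = pd.Series([50.0, 55.0, 52.0, 60.0])
norm = normalized_series(prices)    # starts at 1.0
mom = momentum(prices, lookback=2)  # NaN for the first `lookback` rows
```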
🔴
Real-time Pipeline Timeline
WebSocket-driven live status updates with job_id tracking — watch the pipeline breathe in real time.
🔄
Seamless Demo ↔ Live Mode
Same UI, zero code change. Works instantly on Vercel (demo) or connected to real Spark backend (live).
Key Differentiator
✓
QA Validation System
Every Spark output benchmarked against Pandas/SciPy with relative error thresholds and an alignment score (>95% target).
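The QA idea can be sketched as: compute the same metric independently in the benchmark library, then score element-wise agreement under a relative-error threshold. A minimal NumPy sketch (the tolerance, the NaN handling, and the sample values are illustrative assumptions, not the project's exact settings):

```python
import numpy as np

def alignment_score(spark_out: np.ndarray, benchmark: np.ndarray,
                    rel_tol: float = 1e-6) -> float:
    """Fraction of finite, comparable values whose relative error is within rel_tol."""
    mask = np.isfinite(spark_out) & np.isfinite(benchmark)
    rel_err = (np.abs(spark_out[mask] - benchmark[mask])
               / np.maximum(np.abs(benchmark[mask]), 1e-12))
    return float((rel_err <= rel_tol).mean())

spark_vals = np.array([0.0200001, 0.015, np.nan, 0.031])
bench_vals = np.array([0.0200000, 0.015, np.nan, 0.030])
score = alignment_score(spark_vals, bench_vals, rel_tol=1e-3)
# A run passes QA when score exceeds the target, e.g. score > 0.95
```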
05 — Tech Stack
Every tool chosen with purpose.
Layer
Technology
Role
Frontend
React, Recharts
Dashboard & visualization
Backend
FastAPI, Python
API layer & pipeline orchestration
Processing
PySpark
Distributed feature engineering
QA
Pandas, NumPy
Benchmark validation
Storage
Parquet
Columnar data storage
Data Source
yfinance
Equity market data ingestion
Infrastructure
AWS EC2
Single-instance deployment
Job Management
FastAPI Background Tasks + UUID
job_id tracking & multi-user safety
Real-time
FastAPI WebSockets
Live pipeline timeline updates
Demo Layer
Static JSON + env switch
Vercel-ready zero-backend preview
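The Job Management and Real-time rows above describe a common pattern: mint a UUID per run, track its status in a registry, and fan status events out to subscribers. A framework-free asyncio sketch of that pattern (`JobRegistry` and the stage names are illustrative; the real implementation uses FastAPI's `BackgroundTasks` and WebSocket endpoints):

```python
import asyncio
import uuid

class JobRegistry:
    """Per-job status plus a queue of live events (stand-in for a WebSocket feed)."""
    def __init__(self):
        self.jobs: dict[str, str] = {}
        self.events: dict[str, asyncio.Queue] = {}

    def create(self) -> str:
        job_id = str(uuid.uuid4())  # unique id keeps concurrent users isolated
        self.jobs[job_id] = "queued"
        self.events[job_id] = asyncio.Queue()
        return job_id

    async def update(self, job_id: str, status: str):
        self.jobs[job_id] = status
        await self.events[job_id].put(status)  # a WebSocket handler would forward this

async def run_pipeline(reg: JobRegistry, job_id: str):
    # Stand-in for the Spark stages; each step pushes a live status update.
    for step in ("ingest", "features", "qa", "done"):
        await reg.update(job_id, step)

async def main() -> list[str]:
    reg = JobRegistry()
    job_id = reg.create()
    asyncio.create_task(run_pipeline(reg, job_id))  # like a FastAPI background task
    seen = []
    while (status := await reg.events[job_id].get()) != "done":
        seen.append(status)
    return seen + ["done"]

statuses = asyncio.run(main())
```

Keying everything on the `job_id` is what makes the pipeline multi-user safe: two concurrent runs never share a status record or an event queue.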
06 — Why This Stands Out
Built for two audiences.
Most portfolio projects compute results and stop. This one questions them in real time. Together, the QA validation layer, live WebSocket timeline, and job_id tracking reflect the kind of rigor you see in production financial systems. The architecture mirrors real quant research and risk teams.
07 — AI Full-Stack Finance Expertise
I own the entire stack — AI, Full-Stack & Quant Finance.
This project is deliberately architected as a complete AI-powered quantitative platform. Here’s my depth across the three core pillars.
🧠
AI & Predictive Intelligence
Extending the PySpark pipeline with Spark MLlib and, in future, PyTorch models for predictive signals and anomaly detection.
→ Built modular feature store ready for ML training
→ QA layer designed to benchmark both classical quant metrics AND ML model outputs
→ Momentum & volatility features engineered to feed directly into LSTM/Transformer forecasters
→ Anomaly detection hook using isolation forest (planned Phase 6)