LensDB: Compressed Learned Index for Traffic Video Analytics

View Project Report | View Code

Overview

LensDB is a compressed learned index for surveillance and traffic video that decouples ingestion from interactive querying. Instead of running object detection on every frame, it stores a sparse set of representative frames as CLIP embeddings and answers exploratory queries directly in the latent space, avoiding repeated video decoding from network-attached storage.

Key Achievements

- Up to 99.991% storage reduction with sub-second query latency vs. exhaustive YOLO baselines.
- Compresses a 1.4 GB video into ~1.5 MB of embeddings while supporting approximate car-count queries.
- Keyframe filtering removes 90–99% of redundant frames before embedding using FrameDiff, SSIM, MOG2, and optical flow.

Technical Implementation

- Ingestion: 1 FPS sampling → heuristic keyframe selection → CLIP (ViT-B/32) image embeddings → FAISS index + timestamp metadata map.
- Query: the CLIP text encoder retrieves top-k frames via FAISS, then an MLP predicts object count for threshold filters (Count ≥ T), with temporal expansion via the metadata map (both paths are sketched below).

Results

Achieves F1 = 0.963 for event detection at T ≥ 1, with expected precision trade-offs at higher count thresholds.
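The ingestion and query paths above reduce to a handful of library calls. Below is a minimal sketch under stated assumptions: it uses the open-source `clip` and `faiss` packages, and `load_keyframes` is a hypothetical helper standing in for the sampling/keyframe stage; none of these names are LensDB's actual interface.

```python
import clip   # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import faiss  # pip install faiss-cpu
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_frames(frames):
    """Encode PIL images into L2-normalized CLIP embeddings (inner product = cosine)."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames]).to(device)
        emb = model.encode_image(batch).float()
        emb /= emb.norm(dim=-1, keepdim=True)
    return emb.cpu().numpy()

# Ingestion: index keyframe embeddings, keep a parallel timestamp map.
keyframes, timestamps = load_keyframes("traffic.mp4")  # hypothetical helper
index = faiss.IndexFlatIP(512)                         # ViT-B/32 emits 512-d vectors
index.add(embed_frames(keyframes))

# Query: encode the text prompt, retrieve top-k frames, map IDs back to timestamps.
def query(text, k=10):
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([text]).to(device)).float()
        q /= q.norm(dim=-1, keepdim=True)
    scores, ids = index.search(q.cpu().numpy(), k)
    return [(timestamps[i], float(s)) for i, s in zip(ids[0], scores[0])]

print(query("a traffic jam with many cars"))
```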


TokenSmith: Agentic RAG System on a Local LLM

View Code

Overview

TokenSmith is a retrieval-augmented QA pipeline for technical textbooks, designed to reduce "semantic miss" failures by combining sparse keyword matching with dense embedding retrieval, then re-ranking candidates with an ensemble ranker.

Key Achievements

- Hybrid retrieval (dense + sparse) to improve coverage on exact terminology and equation-heavy text.
- EnsembleRanker supports Reciprocal Rank Fusion (RRF) and weighted score fusion for configurable ranking behavior (see the sketch below).
- Stable evaluation harness with structured logging to debug retrieval failures and measure end-to-end answer quality.

Technical Implementation

- Retrieval: FAISS for dense similarity + BM25 for keyword recall, merged via fusion strategies.
- Ranking: Config-driven EnsembleRanker stage to combine multiple retrievers and optionally add lightweight feature-based scoring.
- Inference: Pluggable LLM backend (local via llama.cpp or hosted), with document chunking and provenance attached to retrieved passages.

Impact

This RAG pipeline enables efficient querying of large document collections, making educational content more accessible and searchable through natural language interfaces.
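To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion, one of the strategies the EnsembleRanker supports. It implements the standard RRF formula (each document scores Σ 1/(k + rank) across the lists that retrieved it, with k = 60 from the original RRF paper); the function is a generic illustration, not TokenSmith's actual API.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # reciprocal-rank contribution
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]   # e.g., FAISS nearest neighbors
sparse = ["doc1", "doc9", "doc3"]  # e.g., BM25 top hits
print(rrf_fuse([dense, sparse]))   # docs found by both retrievers rise to the top
```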


CardEst: Cardinality Estimation Benchmark & Optimization

View Project Report | View Code

Overview

CardEst is a research project focused on analyzing and improving database query optimization through better cardinality estimation. The project implements a custom Python framework to benchmark traditional estimators (Histogram, Sampling) against the Join Order Benchmark (JOB) on the IMDB dataset, and introduces a feedback-driven refinement loop to reduce estimation errors.

Key Achievements

- Built a robust benchmarking framework supporting schema definition, data loading, and Q-error analysis for the JOB benchmark (70 queries).
- Implemented core estimators: histogram-based (with bucket alignment) and sample-based strategies.
- Developed a feedback-driven refinement loop that reduced median Q-error by ~7.5% (7.96 to 7.37) and max Q-error by ~26% (2498 to 1852).
- Integrated advanced techniques such as correlation-aware selectivity estimation, and explored HyperLogLog for distinct-count estimation.

Technical Implementation

- Framework: Custom Python engine for query parsing and statistical structure creation.
- Estimators: Histogram (equi-width/equi-depth buckets for selectivity estimation) and Sampling (dynamic sampling strategies for result cardinality approximation).
- Metrics: Q-error (multiplicative error) analysis to quantify estimator accuracy (see the sketch below).
- Optimization: Feedback loop analyzing stage-level errors to suggest histogram splits and join-selectivity adjustments.

Results

- Baseline: Median Q-error of 7.964, with significant outliers in join-heavy queries.
- With feedback: Improved robustness and reduced tail error (99th-percentile Q-error dropped from 1354 to 1123).
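Since Q-error drives every result above, a short sketch of the metric helps. Q-error is the multiplicative factor by which an estimate misses the true cardinality, penalizing under- and over-estimation symmetrically; the helper below follows that conventional definition and is illustrative, not CardEst's actual code.

```python
def q_error(estimated, actual):
    """Multiplicative estimation error; a perfect estimate scores 1.0."""
    est = max(float(estimated), 1.0)  # clamp to avoid division by zero
    act = max(float(actual), 1.0)
    return max(est / act, act / est)

assert q_error(100, 100) == 1.0
assert q_error(10, 1000) == 100.0   # 100x underestimate
assert q_error(1000, 10) == 100.0   # 100x overestimate, same penalty
```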
