LensDB: Compressed Learned Index for Traffic Video Analytics

View Project Report | View Code

Overview

LensDB is a compressed learned index for surveillance and traffic video that decouples ingestion from interactive querying. Instead of running object detection on every frame, it stores a sparse set of representative frames as CLIP embeddings and answers exploratory queries directly in the latent space, avoiding repeated video decoding from network-attached storage.

Key Achievements

- Up to 99.991% storage reduction with sub-second query latency vs. exhaustive YOLO baselines.
- Compresses a 1.4 GB video into ~1.5 MB of embeddings while supporting approximate car-count queries.
- Keyframe filtering removes 90–99% of redundant frames before embedding using FrameDiff, SSIM, MOG2, and optical flow.

Technical Implementation

- Ingestion: 1 FPS sampling → heuristic keyframe selection → CLIP (ViT-B/32) image embeddings → FAISS index + timestamp metadata map.
- Query: the CLIP text encoder retrieves top-k frames via FAISS, then an MLP predicts object count for threshold filters (Count ≥ T), with temporal expansion via the metadata map (both paths are sketched below).

Results

Achieves F1 = 0.963 for event detection at T ≥ 1, with expected precision trade-offs at higher count thresholds.
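The ingestion and query paths above reduce to a handful of library calls. Below is a minimal sketch under stated assumptions: it uses the open-source `clip` and `faiss` packages, and `load_keyframes` is a hypothetical helper standing in for the sampling/keyframe stage; none of these names are LensDB's actual interface.

```python
import clip   # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import faiss  # pip install faiss-cpu
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_frames(frames):
    """Encode PIL images into L2-normalized CLIP embeddings (inner product = cosine)."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames]).to(device)
        emb = model.encode_image(batch).float()
        emb /= emb.norm(dim=-1, keepdim=True)
    return emb.cpu().numpy()

# Ingestion: index keyframe embeddings, keep a parallel timestamp map.
keyframes, timestamps = load_keyframes("traffic.mp4")  # hypothetical helper
index = faiss.IndexFlatIP(512)                         # ViT-B/32 emits 512-d vectors
index.add(embed_frames(keyframes))

# Query: encode the text prompt, retrieve top-k frames, map IDs back to timestamps.
def query(text, k=10):
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([text]).to(device)).float()
        q /= q.norm(dim=-1, keepdim=True)
    scores, ids = index.search(q.cpu().numpy(), k)
    return [(timestamps[i], float(s)) for i, s in zip(ids[0], scores[0])]

print(query("a traffic jam with many cars"))
```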


TokenSmith: Agentic RAG System on a Local LLM

View Code

Overview

TokenSmith is a retrieval-augmented QA pipeline for technical textbooks, designed to reduce "semantic miss" failures by combining sparse keyword matching with dense embedding retrieval, then re-ranking candidates with an ensemble ranker.

Key Achievements

- Hybrid retrieval (dense + sparse) to improve coverage on exact terminology and equation-heavy text.
- EnsembleRanker supports Reciprocal Rank Fusion (RRF) and weighted score fusion for configurable ranking behavior (see the sketch below).
- Stable evaluation harness with structured logging to debug retrieval failures and measure end-to-end answer quality.

Technical Implementation

- Retrieval: FAISS for dense similarity + BM25 for keyword recall, merged via fusion strategies.
- Ranking: Config-driven EnsembleRanker stage to combine multiple retrievers and optionally add lightweight feature-based scoring.
- Inference: Pluggable LLM backend (local via llama.cpp or hosted), with document chunking and provenance attached to retrieved passages.

Impact

This RAG pipeline enables efficient querying of large document collections, making educational content more accessible and searchable through natural language interfaces.
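To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion, one of the strategies the EnsembleRanker supports. It implements the standard RRF formula (each document scores Σ 1/(k + rank) across the lists that retrieved it, with k = 60 from the original RRF paper); the function is a generic illustration, not TokenSmith's actual API.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # reciprocal-rank contribution
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]   # e.g., FAISS nearest neighbors
sparse = ["doc1", "doc9", "doc3"]  # e.g., BM25 top hits
print(rrf_fuse([dense, sparse]))   # docs found by both retrievers rise to the top
```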


CardEst: Cardinality Estimation Benchmark & Optimization

View Project Report | View Code

Overview

CardEst is a research project focused on analyzing and improving database query optimization through better cardinality estimation. The project implements a custom Python framework to benchmark traditional estimators (Histogram, Sampling) against the Join Order Benchmark (JOB) on the IMDB dataset, and introduces a feedback-driven refinement loop to reduce estimation errors.

Key Achievements

- Built a robust benchmarking framework supporting schema definition, data loading, and Q-error analysis for the JOB benchmark (70 queries).
- Implemented core estimators: histogram-based (with bucket alignment) and sample-based strategies.
- Developed a feedback-driven refinement loop that reduced median Q-error by ~7.5% (7.96 to 7.37) and max Q-error by ~26% (2498 to 1852).
- Integrated advanced techniques such as correlation-aware selectivity estimation, and explored HyperLogLog for distinct-count estimation.

Technical Implementation

- Framework: Custom Python engine for query parsing and statistical structure creation.
- Estimators: Histogram (equi-width/equi-depth buckets for selectivity estimation) and Sampling (dynamic sampling strategies for result cardinality approximation).
- Metrics: Q-error (multiplicative error) analysis to quantify estimator accuracy (see the sketch below).
- Optimization: Feedback loop analyzing stage-level errors to suggest histogram splits and join-selectivity adjustments.

Results

- Baseline: Median Q-error of 7.964, with significant outliers in join-heavy queries.
- With feedback: Improved robustness and reduced tail error (99th-percentile Q-error dropped from 1354 to 1123).
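Since Q-error drives every result above, a short sketch of the metric helps. Q-error is the multiplicative factor by which an estimate misses the true cardinality, penalizing under- and over-estimation symmetrically; the helper below follows that conventional definition and is illustrative, not CardEst's actual code.

```python
def q_error(estimated, actual):
    """Multiplicative estimation error; a perfect estimate scores 1.0."""
    est = max(float(estimated), 1.0)  # clamp to avoid division by zero
    act = max(float(actual), 1.0)
    return max(est / act, act / est)

assert q_error(100, 100) == 1.0
assert q_error(10, 1000) == 100.0   # 100x underestimate
assert q_error(1000, 10) == 100.0   # 100x overestimate, same penalty
```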
