CardEst: Cardinality Estimation Benchmark & Optimization
View Project Report | View Code Overview CardEst is a research project focused on analyzing and improving database query optimization through better cardinality estimation. The project implements a custom Python framework to benchmark traditional estimators (Histogram, Sampling) against the Join Order Benchmark (JOB) on the IMDB dataset, and introduces a feedback-driven refinement loop to reduce estimation errors. Key Achievements Built a robust benchmarking framework supporting schema definition, data loading, and Q-error analysis for the JOB benchmark (70 queries). Implemented core estimators: Histogram-based (with bucket alignment) and Sample-based strategies. Developed a feedback-driven refinement loop that reduced median Q-error by ~7.5% (7.96 to 7.37) and max Q-error by ~26% (2498 to 1852). Integrated advanced techniques like correlation-aware selectivity estimation and explored HyperLogLog for distinct count estimation. Technical Implementation Framework: Custom Python engine for query parsing and statistical structure creation. Estimators: Histogram: Equi-width/Equi-depth buckets for selectivity estimation. Sampling: Dynamic sampling strategies for result cardinality approximation. Metrics: Q-error (multiplicative error) analysis to quantify estimator accuracy. Optimization: Feedback loop analyzing stage-level errors to suggest histogram splits and join-selectivity adjustments. Results Baseline: Median Q-error of 7.964 with significant outliers in join-heavy queries. With Feedback: Improved robustness with reduced tail latency (99th percentile Q-error dropped from 1354 to 1123).