View Project Report | View Code
Overview
CardEst is a research project focused on analyzing and improving database query optimization through better cardinality estimation. The project implements a custom Python framework to benchmark traditional estimators (Histogram, Sampling) against the Join Order Benchmark (JOB) on the IMDB dataset, and introduces a feedback-driven refinement loop to reduce estimation errors.
Key Achievements
- Built a robust benchmarking framework supporting schema definition, data loading, and Q-error analysis for the JOB benchmark (70 queries).
- Implemented core estimators: Histogram-based (with bucket alignment) and Sample-based strategies.
- Developed a feedback-driven refinement loop that reduced median Q-error by ~7.5% (7.96 to 7.37) and max Q-error by ~26% (2498 to 1852).
- Integrated advanced techniques like correlation-aware selectivity estimation and explored HyperLogLog for distinct count estimation.
Technical Implementation
- Framework: Custom Python engine for query parsing and statistical structure creation.
- Estimators:
- Histogram: Equi-width/Equi-depth buckets for selectivity estimation.
- Sampling: Dynamic sampling strategies for result cardinality approximation.
- Metrics: Q-error (multiplicative error) analysis to quantify estimator accuracy.
- Optimization: Feedback loop analyzing stage-level errors to suggest histogram splits and join-selectivity adjustments.
Results
- Baseline: Median Q-error of 7.964 with significant outliers in join-heavy queries.
- With Feedback: Improved robustness with reduced tail latency (99th percentile Q-error dropped from 1354 to 1123).