CardEst: Cardinality Estimation Benchmark & Optimization

Overview

CardEst is a research project focused on analyzing and improving database query optimization through better cardinality estimation. The project implements a custom Python framework to benchmark traditional estimators (Histogram, Sampling) against the Join Order Benchmark (JOB) on the IMDB dataset, and introduces a feedback-driven refinement loop to reduce estimation errors.

Key Achievements

Built a robust benchmarking framework supporting schema definition, data loading, and Q-error analysis for the JOB benchmark (70 queries).
Implemented core estimators: Histogram-based (with bucket alignment) and Sample-based strategies.
Developed a feedback-driven refinement loop that reduced median Q-error by ~7.5% (7.96 to 7.37) and max Q-error by ~26% (2498 to 1852).
Integrated advanced techniques like correlation-aware selectivity estimation and explored HyperLogLog for distinct count estimation.

Technical Implementation

Framework: Custom Python engine for query parsing and statistical structure creation.
Estimators:
- Histogram: Equi-width/Equi-depth buckets for selectivity estimation.
- Sampling: Dynamic sampling strategies for result cardinality approximation.
Metrics: Q-error (multiplicative error) analysis to quantify estimator accuracy.
Optimization: Feedback loop analyzing stage-level errors to suggest histogram splits and join-selectivity adjustments.

Results

Baseline: Median Q-error of 7.964 with significant outliers in join-heavy queries.
With Feedback: Improved robustness with reduced tail latency (99th percentile Q-error dropped from 1354 to 1123).

Overview#

Key Achievements#

Technical Implementation#

Results#

Overview

Key Achievements

Technical Implementation

Results