View Project Report | View Code

Overview

CardEst is a research project focused on analyzing and improving database query optimization through better cardinality estimation. The project implements a custom Python framework to benchmark traditional estimators (Histogram, Sampling) against the Join Order Benchmark (JOB) on the IMDB dataset, and introduces a feedback-driven refinement loop to reduce estimation errors.

Key Achievements

  • Built a robust benchmarking framework supporting schema definition, data loading, and Q-error analysis for the JOB benchmark (70 queries).
  • Implemented core estimators: Histogram-based (with bucket alignment) and Sample-based strategies.
  • Developed a feedback-driven refinement loop that reduced median Q-error by ~7.5% (7.96 to 7.37) and max Q-error by ~26% (2498 to 1852).
  • Integrated advanced techniques like correlation-aware selectivity estimation and explored HyperLogLog for distinct count estimation.

Technical Implementation

  • Framework: Custom Python engine for query parsing and statistical structure creation.
  • Estimators:
    • Histogram: Equi-width/Equi-depth buckets for selectivity estimation.
    • Sampling: Dynamic sampling strategies for result cardinality approximation.
  • Metrics: Q-error (multiplicative error) analysis to quantify estimator accuracy.
  • Optimization: Feedback loop analyzing stage-level errors to suggest histogram splits and join-selectivity adjustments.

Results

  • Baseline: Median Q-error of 7.964 with significant outliers in join-heavy queries.
  • With Feedback: Improved robustness with reduced tail latency (99th percentile Q-error dropped from 1354 to 1123).