Reproducing Paper Results

This guide explains how to reproduce the benchmark results from the ShortKit-ML paper.

Prerequisites

  • Python 3.10, 3.11, or 3.12
  • pip or uv package manager
  • (Optional) Docker for fully isolated runs
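
The interpreter requirement above can be checked before installing anything. The following is a minimal sketch, not part of ShortKit-ML; the helper name `is_supported` is hypothetical, and the version set is copied from the list above.

```python
import sys

# Supported versions, copied from the prerequisites list (not read from the package).
SUPPORTED_MINORS = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is 3.10, 3.11, or 3.12."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    status = "within" if is_supported() else "outside"
    print(f"Python {current} is {status} the listed range")
```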

Quick Start

1. Install the package

git clone https://github.com/criticaldata/ShortKit-ML.git
cd ShortKit-ML
pip install -e ".[dev]"

2. Run the reproducibility script

# Quick sanity check
./scripts/reproduce_paper.sh smoke

# Moderate run (recommended first attempt)
./scripts/reproduce_paper.sh default

# Full paper reproduction
./scripts/reproduce_paper.sh full

Profiles

Profile | Grid size                              | Seeds | Expected runtime
--------|----------------------------------------|-------|-----------------
smoke   | 2 effect sizes, 1 sample size, 1 dim   | 2     | 2-5 minutes
default | 4 effect sizes, 2 sample sizes, 2 dims | 3+    | 15-30 minutes
full    | 5 effect sizes, 3 sample sizes, 3 dims | 10    | 2-4 hours

All profiles use random_seed: 42 for deterministic results.

Using Docker

For a fully isolated, reproducible environment:

# Build the image
docker build -t shortcut-detect .

# Run with a specific profile (output is mounted to the host).
# The mount path is quoted so working directories containing spaces still work.
docker run --rm -v "$(pwd)/output:/app/output" shortcut-detect smoke
docker run --rm -v "$(pwd)/output:/app/output" shortcut-detect default
docker run --rm -v "$(pwd)/output:/app/output" shortcut-detect full

Configuration

The reproducible config is at examples/paper_benchmark_config_reproducible.json. It explicitly specifies every parameter so there is no ambiguity:

  • random_seed: 42 (fixed for all runs)
  • methods: hbac, probe, statistical, geometric
  • effect_sizes: 0.2, 0.5, 0.8, 1.2, 2.0 (Cohen's d)
  • sample_sizes: 200, 1000, 5000
  • imbalance_ratios: 0.5 (balanced), 0.9 (imbalanced)
  • embedding_dims: 128, 256, 512
  • shortcut_dims: 5
  • seeds: 10 independent seeds per configuration
  • alpha: 0.05
  • corrections: bonferroni, fdr_bh
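
For a rough sense of scale, the parameter lists above can be expanded into a grid. This sketch assumes the runner evaluates the full Cartesian product of the listed values and repeats each combination once per seed. That is an assumption, not something this guide states, so treat the count as an upper-bound estimate. The helper name `count_runs` is hypothetical.

```python
from itertools import product

# Values copied from the parameter list above.
grid = {
    "effect_sizes": [0.2, 0.5, 0.8, 1.2, 2.0],
    "sample_sizes": [200, 1000, 5000],
    "imbalance_ratios": [0.5, 0.9],
    "embedding_dims": [128, 256, 512],
}
n_seeds = 10


def count_runs(grid: dict, n_seeds: int) -> int:
    """Count (configuration, seed) pairs, ASSUMING a full Cartesian product."""
    n_configs = sum(1 for _ in product(*grid.values()))
    return n_configs * n_seeds


if __name__ == "__main__":
    print(count_runs(grid, n_seeds))  # 5 * 3 * 2 * 3 * 10 = 900
```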

For custom runs, copy the config and modify as needed, then call:

python -m shortcut_detect.benchmark.paper_runner --config your_config.json
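
The copy-and-modify step can also be scripted. The sketch below assumes the parameters listed above are top-level keys in the JSON file, which has not been verified against the actual config. The helper `write_custom_config` is hypothetical and only replaces keys that already exist in the base file, so a mismatch with the real layout raises an error instead of silently writing an ignored key.

```python
import json
from pathlib import Path


def write_custom_config(base_path, out_path, overrides: dict) -> dict:
    """Load a JSON config, apply top-level overrides, and write the result."""
    config = json.loads(Path(base_path).read_text(encoding="utf-8"))
    unknown = set(overrides) - set(config)
    if unknown:
        # Guard against typos: only keys already present in the base are replaced.
        raise KeyError(f"keys not in base config: {sorted(unknown)}")
    config.update(overrides)
    Path(out_path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Example overrides; the key names are ASSUMED to match the real config.
    write_custom_config(
        "examples/paper_benchmark_config_reproducible.json",
        "your_config.json",
        {"sample_sizes": [200], "effect_sizes": [0.5, 0.8]},
    )
```

The resulting your_config.json can then be passed to the --config option shown above.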

Expected Output Files

After a successful run, the timestamped output directory contains:

File                         | Description
-----------------------------|----------------------------------------------
synthetic_runs.csv           | Raw results for every synthetic configuration
synthetic_power_recall.csv   | Power/recall analysis across effect sizes
synthetic_false_positive.csv | False positive rates under the null
correction_comparison.csv    | Multiple testing correction comparison
benchmark_meta.json          | Run metadata (config, timing, environment)
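
A quick way to confirm a run completed is to check that every file listed above exists and is non-empty. This is a minimal sketch with a hypothetical helper, `missing_outputs`. The path of the timestamped run directory is supplied by the caller because this guide does not specify its naming scheme.

```python
import sys
from pathlib import Path

# File names copied from the list of expected outputs.
EXPECTED_FILES = [
    "synthetic_runs.csv",
    "synthetic_power_recall.csv",
    "synthetic_false_positive.csv",
    "correction_comparison.csv",
    "benchmark_meta.json",
]


def missing_outputs(run_dir) -> list[str]:
    """Return the expected file names that are absent or empty in run_dir."""
    run_dir = Path(run_dir)
    return [
        name
        for name in EXPECTED_FILES
        if not (run_dir / name).is_file() or (run_dir / name).stat().st_size == 0
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    missing = missing_outputs(sys.argv[1])
    print("all expected files present" if not missing else f"missing: {missing}")
```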

CheXpert (Dataset 2)

The CheXpert real-data benchmark requires access to the CheXpert dataset. Set chexpert.enabled: true and provide chexpert.manifest_path in the config. See scripts/extract_chexpert_embeddings.py for embedding extraction.
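
The two settings above could be applied to a loaded config as follows. This sketch assumes the dotted names chexpert.enabled and chexpert.manifest_path correspond to a nested "chexpert" object in the JSON file, which is an assumption about the config layout. The helper `set_dotted` and the manifest path are hypothetical placeholders.

```python
import json


def set_dotted(config: dict, dotted_key: str, value) -> None:
    """Set config["a"]["b"] = value for dotted_key "a.b", creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


# Stand-in for the loaded reproducible config; the real file has more keys.
config = {"random_seed": 42}
set_dotted(config, "chexpert.enabled", True)
# Placeholder path: replace with the location of your own manifest.
set_dotted(config, "chexpert.manifest_path", "/path/to/chexpert_manifest")
print(json.dumps(config, indent=2))
```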

Troubleshooting

  • Import errors: Make sure the package is installed with pip install -e ".[dev]".
  • PyTorch/CUDA: For GPU support, install PyTorch separately, following the instructions at pytorch.org.
  • PDF export errors: Install system libraries for weasyprint: brew install pango gdk-pixbuf libffi (macOS) or apt-get install libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev (Debian/Ubuntu).
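
To narrow down an import error, it can help to ask Python whether it can locate the modules without importing them. This is a minimal sketch. `shortcut_detect` is the module name used in the runner command above, `torch` is PyTorch's import name, and the helper `is_importable` is hypothetical. The check only confirms that a module can be found; it does not detect missing system libraries such as those weasyprint needs.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("shortcut_detect", "torch"):
        state = "found" if is_importable(name) else "NOT found"
        print(f"{name}: {state}")
```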