CAV (Concept Activation Vectors)

Concept Activation Vectors (Kim et al., 2018) test whether a model's internal representations encode human-interpretable concepts. Here they are used to probe for shortcut concepts.

This implementation operates on precomputed activations and, optionally, directional derivatives (for TCAV scoring), which fits the library's loader-driven XAI workflow.

What It Detects

  • Whether a concept is linearly separable from a random baseline in activation space.
  • Whether model sensitivity aligns with that concept direction (TCAV score).
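The two checks above can be sketched in plain NumPy. This is an illustrative sketch, not the library's internal code: it assumes a logistic-regression probe trained by gradient descent (Kim et al. use a linear classifier, but the exact probe used here is not stated), it measures AUC in-sample for brevity, and it treats each row of the derivative array as the gradient of the target logit with respect to the layer activations. The names `fit_cav`, `auc`, and `tcav_score` are hypothetical.

```python
import numpy as np


def fit_cav(concept, random, steps=500, lr=0.1):
    """Fit a logistic-regression probe (concept=1, random=0) by gradient descent.

    Returns the unit-norm weight vector, which serves as the CAV.
    """
    X = np.vstack([concept, random])
    y = np.concatenate([np.ones(len(concept)), np.zeros(len(random))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept)
        err = p - y                              # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w / max(np.linalg.norm(w), 1e-12)


def auc(scores_pos, scores_neg):
    """Probability that a concept example outscores a random one (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def tcav_score(gradients, cav):
    """Fraction of target examples whose derivative along the CAV is positive."""
    return float(((gradients @ cav) > 0).mean())


# Synthetic data: the concept shifts activations along the first dimension.
rng = np.random.default_rng(0)
d = 8
direction = np.zeros(d)
direction[0] = 1.0
concept = rng.normal(size=(200, d)) + 3.0 * direction
random_set = rng.normal(size=(200, d))

cav = fit_cav(concept, random_set)
# In-sample AUC; a held-out split would give a less optimistic estimate.
quality = auc(concept @ cav, random_set @ cav)

# Assumed: rows are gradients of the target logit w.r.t. the activations.
grads = rng.normal(size=(100, d)) + 2.0 * direction
score = tcav_score(grads, cav)
print(f"AUC={quality:.3f}  TCAV={score:.3f}")
```

An AUC near 0.5 means the concept is not linearly separable from the random baseline, in which case the TCAV score along that direction carries little meaning.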

Required Inputs

  • concept_sets: dict[str, np.ndarray], where each array has shape (n, d)
  • random_set: np.ndarray of shape (n, d), or a list of such arrays

Optional:

  • target_activations: np.ndarray of shape (m, d)
  • target_directional_derivatives: np.ndarray of shape (m, d)

If directional derivatives are omitted, TCAV risk is reported as unknown.
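Before passing a bundle to the detector, it can help to confirm that every array is 2-D and shares the same feature dimension d. The helper below is a hypothetical pre-flight check written for this page, not a function provided by shortcut_detect. It assumes only the key names and shapes listed above.

```python
import numpy as np


def check_cav_bundle(bundle):
    """Hypothetical helper: validate shapes and return the shared feature dimension d."""
    random_set = bundle["random_set"]
    # random_set may be a single array or a list of arrays.
    random_sets = random_set if isinstance(random_set, list) else [random_set]

    arrays = {f"concept_sets[{name!r}]": a for name, a in bundle["concept_sets"].items()}
    arrays.update({f"random_set[{i}]": a for i, a in enumerate(random_sets)})
    for key in ("target_activations", "target_directional_derivatives"):
        if bundle.get(key) is not None:  # optional inputs
            arrays[key] = bundle[key]

    dims = set()
    for label, arr in arrays.items():
        arr = np.asarray(arr)
        if arr.ndim != 2:
            raise ValueError(f"{label} must be 2-D, got shape {arr.shape}")
        dims.add(arr.shape[1])
    if len(dims) != 1:
        raise ValueError(f"feature dimensions disagree: {sorted(dims)}")
    return dims.pop()


rng = np.random.default_rng(0)
example_bundle = {
    "concept_sets": {"hospital_token": rng.normal(size=(50, 16))},
    "random_set": [rng.normal(size=(50, 16)), rng.normal(size=(40, 16))],
    "target_activations": rng.normal(size=(30, 16)),
}
print(check_cav_bundle(example_bundle))
```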

Unified API Example

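from shortcut_detect import ShortcutDetector

# concept_arr, artifact_arr, random_arr, target_acts, and target_dd are
# precomputed NumPy arrays (see Required Inputs for shapes).
bundle = {
    "concept_sets": {
        "hospital_token": concept_arr,  # shape (n, d)
        "device_artifact": artifact_arr,  # shape (n, d)
    },
    "random_set": random_arr,  # shape (n, d), or a list of such arrays
    "target_activations": target_acts,  # optional, shape (m, d)
    "target_directional_derivatives": target_dd,  # optional, shape (m, d)
}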

detector = ShortcutDetector(
    methods=["cav"],
    cav_quality_threshold=0.7,
    cav_shortcut_threshold=0.6,
)
detector.fit_from_loaders({"cav": bundle})

result = detector.get_results()["cav"]
print(result["metrics"])
print(result["report"]["per_concept"])

Interpretation

  • High concept quality (AUC) together with a high TCAV score indicates likely shortcut reliance.
  • A concept is flagged when its AUC exceeds cav_quality_threshold and its TCAV score exceeds cav_shortcut_threshold.
  • Risk levels:
      • high: very strong flagged TCAV signal
      • moderate: flagged concepts, weaker maximum signal
      • low: tested with derivatives, no flagged concepts
      • unknown: derivatives not provided
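The flagging rule and risk levels can be sketched as a small pure-Python function. This is a sketch under stated assumptions, not the library's logic: the function name `assess_risk`, the per-concept keys `"auc"` and `"tcav"`, and the `high_cutoff` separating high from moderate are all hypothetical, since the page does not state the actual cutoff or result schema. The default thresholds simply mirror the values used in the example above.

```python
def assess_risk(per_concept, quality_threshold=0.7, shortcut_threshold=0.6, high_cutoff=0.8):
    """Return (risk_level, flagged_concept_names) from per-concept AUC and TCAV scores.

    per_concept maps a concept name to {"auc": float, "tcav": float or None}.
    high_cutoff is an assumed value; the real boundary between high and
    moderate is not documented here.
    """
    flagged = {}
    tested = False
    for name, scores in per_concept.items():
        tcav = scores.get("tcav")
        if tcav is None:  # no derivatives were supplied for this concept
            continue
        tested = True
        # Flag only when both thresholds are exceeded.
        if scores["auc"] > quality_threshold and tcav > shortcut_threshold:
            flagged[name] = tcav

    if not tested:
        return "unknown", []
    if not flagged:
        return "low", []
    level = "high" if max(flagged.values()) >= high_cutoff else "moderate"
    return level, sorted(flagged)


example_scores = {
    "hospital_token": {"auc": 0.92, "tcav": 0.85},   # separable and aligned
    "device_artifact": {"auc": 0.65, "tcav": 0.90},  # fails the quality threshold
}
print(assess_risk(example_scores))
```

Requiring both conditions keeps a poorly separated concept from being flagged on the strength of its TCAV score alone, since a CAV fitted to a weak probe points in a largely arbitrary direction.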

Reference

Kim, Been, et al. "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)." ICML 2018.