🏆 Eval-Unlearn Leaderboard

About eval-learn

eval-learn is an open-source Python library providing a unified, reproducible benchmarking framework for concept unlearning in Stable Diffusion models.

The proliferation of concept unlearning techniques has produced a fragmented evaluation landscape: methods are assessed under heterogeneous experimental conditions with different datasets, metrics, and hyperparameters, making principled cross-method comparison difficult. eval-learn is designed to fill this gap.

📦 Install: pip install eval-learn | 💻 GitHub | 📖 Docs


Key Features

  • Unified pipeline supporting fine-tuning, closed-form model editing, and inference-time intervention techniques on a common Stable Diffusion base
  • 9 standardised evaluation metrics covering erasure efficacy, adversarial robustness, generative quality, and concept retention
  • Plugin architecture via Python entry points: third-party techniques and metrics self-register upon installation without modifying the core framework
  • GPU-efficient execution with FP16 inference, batch streaming, and proactive VRAM management
  • CLI + YAML/JSON config for streamlined experiment management
  • HuggingFace Hub integration via eval-learn push
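
The entry-point mechanism behind the plugin architecture can be sketched with the standard library alone. This is a minimal illustration, not eval-learn's actual loader: the group name `eval_learn.techniques` is a hypothetical placeholder, and the real group names should be taken from the project docs.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {plugin_name: entry_point} for every installed plugin in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports select() filtering by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a plain dict mapping group name -> list of entry points.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# "eval_learn.techniques" is an assumed group name used only for illustration.
plugins = discover_plugins("eval_learn.techniques")
for name, ep in plugins.items():
    # ep.load() would import and return the registered object.
    print(f"{name} -> {ep.value}")
```

Because discovery reads installed package metadata, a third-party package only has to declare an entry point in its own packaging configuration; the host framework needs no code change.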

BenchScore

Results are aggregated into a composite BenchScore that balances a safety score S against a quality score Q:

BenchScore(α) = α · S + (1 − α) · Q

  • BenchScore-S (α = 0.6): safety-prioritised
  • BenchScore-Q (α = 0.4): quality-prioritised
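
The weighting can be written out as a small function. This is a sketch of the formula only: how the safety and quality scores are derived from the individual metrics is not described here, so the inputs below are illustrative values assumed to lie in [0, 1].

```python
def bench_score(safety, quality, alpha):
    """Weighted blend: alpha * S + (1 - alpha) * Q."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * safety + (1.0 - alpha) * quality


# Illustrative inputs only; these are not leaderboard values.
s, q = 0.9, 0.5
bench_s = bench_score(s, q, alpha=0.6)  # safety-prioritised variant
bench_q = bench_score(s, q, alpha=0.4)  # quality-prioritised variant
print(round(bench_s, 4), round(bench_q, 4))
```

Since the two weights are mirror images (0.6 and 0.4), the mean of the two variants always equals (S + Q) / 2.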

Quick Start

pip install eval-learn
eval-learn run --config config.yaml

Example config.yaml:

output_dir: ./results/nudity/esd
seed: 42
technique:
  name: esd
  config: {erase_concept: nudity, train_method: noxattn}
metrics:
  - name: asr_i2p
    config: {concept: nudity, detector: nudenet}
  - name: clip_score
    config: {device: cuda}
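
Since the CLI is described as accepting YAML or JSON configs, the same experiment can be generated programmatically. The sketch below assumes the JSON form mirrors the YAML keys one-to-one, which is not confirmed here; the output file name mentioned in the comment is likewise hypothetical.

```python
import json

# Same settings as the YAML example, expressed as a Python dict.
config = {
    "output_dir": "./results/nudity/esd",
    "seed": 42,
    "technique": {
        "name": "esd",
        "config": {"erase_concept": "nudity", "train_method": "noxattn"},
    },
    "metrics": [
        {"name": "asr_i2p", "config": {"concept": "nudity", "detector": "nudenet"}},
        {"name": "clip_score", "config": {"device": "cuda"}},
    ],
}

config_json = json.dumps(config, indent=2)
print(config_json)
# Writing config_json to a file such as config.json (hypothetical name)
# would give a config to pass via --config.
```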

Implemented Techniques (v0.1.6)

Technique         Category
ESD               Fine-Tuning
CA                Fine-Tuning
CoGFD             Fine-Tuning
AdvUnlearn        Fine-Tuning
SSD               Fine-Tuning / Closed-Form
UCE               Closed-Form
MACE              Closed-Form
SLD               Inference-Time
SAFREE            Inference-Time
TraSCE            Inference-Time
Concept Steerers  Inference-Time
SAeUron           Inference-Time

Evaluation Metrics

Metric                 Type
ASR (I2P) ↓            Erasure efficacy
ASR (Ring-A-Bell) ↓    Adversarial robustness
ASR (MMA-Diffusion) ↓  Adversarial robustness
ASR (P4D) ↓            Adversarial robustness
FID ↓                  Image quality
CLIP Score ↑           Text-image fidelity
TIFA ↑                 Compositional fidelity
ERR                    Erasure-retention
UA-IRA ↑               Unlearning / retention
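
The four ASR rows report an attack success rate. A minimal sketch, assuming the usual definition (the fraction of prompts whose generated image is still flagged by the concept detector); eval-learn's exact implementation is not described here, and the detector decisions below are made up for illustration.

```python
def attack_success_rate(flags):
    """Fraction of generations in which the detector still finds the concept."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one detector decision")
    return sum(bool(f) for f in flags) / len(flags)


# Hypothetical detector decisions for eight prompts (True = concept detected).
decisions = [False, False, True, False, False, False, True, False]
asr = attack_success_rate(decisions)
print(asr)
```

Under this definition a lower ASR means fewer prompts slip past the unlearning method, which is why these rows are marked as lower-is-better.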

Results

Comparison of unlearning techniques across all benchmarks. ↓ = lower is better, ↑ = higher is better. Rows are sorted by average BenchScore, highest first.

#   Technique         ASR I2P ↓  ASR RingABell ↓  ASR MMA ↓  FID ↓   CLIP Score ↑  UA-IRA ↑  TIFA ↑  BenchScore-S ↑  BenchScore-Q ↑  Avg BenchScore
1   SSD               0          0                0.08       157.9   25.48         0.6       0.13    0.8235          0.7918          0.8076
2   MACE              0.04       0                0.08       133.17  26.35         0.7       0.2     0.7962          0.8116          0.8039
3   SLD               0.02       0                0          133.37  25.8          0.75      0.07    0.7765          0.7666          0.7716
4   UCE               0.06       0.1              0.08       129.66  27.16         0.65      0.2     0.7377          0.7809          0.7593
5   AdvUnlearn        0.06       0                0.25       126.15  25.83         0.6       0.07    0.7423          0.7492          0.7457
6   SAFREE            0.16       0.1              0.33       133.61  26.27         0.55      0.4     0.6694          0.7678          0.7186
7   SAeUron           0.1        0                0.17       196.11  24.8          0.5       0.07    0.7108          0.6704          0.6906
8   CoGFD             0.16       0.1              0.5        129.93  25.48         0.5       0.13    0.5705          0.6432          0.6068
9   ESD               0          0                0          242.05  21.36         0.55      0.13    0.6474          0.5425          0.595
10  CA                0.2        0.2              0.5        126.83  26.13         0.55      0.07    0.4358          0.5465          0.4912
11  TraSCE            0          0                0          230.96  16.69         0.3       0       0.5759          0.3996          0.4878
12  Concept Steerers  0          0                0          236.58  15.62         0.6       0.07    0.488           0.3571          0.4226

Technique references:

  • SSD (Foster et al., 2024): https://doi.org/10.1609/aaai.v38i11.29092
  • MACE (Lu et al., 2024): https://doi.org/10.1109/CVPR52733.2024.00615
  • SLD (Schramowski et al., 2023): https://doi.org/10.1109/CVPR52729.2023.02157
  • UCE (Gandikota et al., 2024): https://doi.org/10.1109/WACV57701.2024.00503
  • AdvUnlearn (Zhang et al., 2024): https://openreview.net/forum?id=dkpmfIydrF
  • SAFREE (Yoon et al., 2025): https://openreview.net/forum?id=hgTFotBRKl
  • SAeUron (Cywinski et al., 2025): https://openreview.net/forum?id=HFCaWGWEzi
  • CoGFD (Nie et al., 2025): https://openreview.net/forum?id=OBjF5I4PWg
  • ESD (Gandikota et al., 2023): https://doi.org/10.1109/ICCV51070.2023.00230
  • CA (Kumari et al., 2023): https://doi.org/10.1109/ICCV51070.2023.02074
  • TraSCE (Jain et al., 2024): https://arxiv.org/abs/2412.07658
  • Concept Steerers (Kim et al., 2025): https://doi.org/10.48550/arXiv.2501.19066
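
The ranking can be reproduced from the two BenchScore columns. The sketch below assumes that Avg BenchScore is the arithmetic mean of BenchScore-S and BenchScore-Q; the page does not state this, but the published numbers agree with it to within rounding. Only the three top rows are included, deliberately out of order.

```python
# (technique, BenchScore-S, BenchScore-Q, published Avg BenchScore) from the results
rows = [
    ("SLD", 0.7765, 0.7666, 0.7716),
    ("SSD", 0.8235, 0.7918, 0.8076),
    ("MACE", 0.7962, 0.8116, 0.8039),
]


def average(row):
    """Mean of the safety- and quality-prioritised scores (assumed definition)."""
    _, bench_s, bench_q, _ = row
    return (bench_s + bench_q) / 2


# Sort by the recomputed average, highest first.
ranked = sorted(rows, key=average, reverse=True)
for rank, row in enumerate(ranked, start=1):
    name, _, _, published = row
    # The recomputed mean should agree with the published column up to rounding.
    assert abs(average(row) - published) < 1e-3
    print(rank, name, round(average(row), 4))
```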

🚀 Add Your Technique to the Leaderboard

Want to see your unlearning method on this leaderboard? Here's how:

  1. Follow the contribution guidelines at eval-unlearn.readthedocs.io/en/latest/contributing to integrate your technique into the eval-learn framework
  2. Submit a Pull Request to the Eval-Unlearn GitHub repository
  3. Once your PR is reviewed and merged, we will evaluate your method using eval-learn under the same standardised conditions as all other techniques
  4. Your results will be published on this leaderboard

We welcome contributions from the community: the more techniques we evaluate under unified conditions, the more useful this benchmark becomes for the field.