🏆 Eval-Unlearn Leaderboard

About eval-learn

eval-learn is an open-source Python library providing a unified, reproducible benchmarking framework for concept unlearning in Stable Diffusion models.

The proliferation of concept unlearning techniques has produced a fragmented evaluation landscape: methods are assessed under heterogeneous experimental conditions with different datasets, metrics, and hyperparameters, making principled cross-method comparison difficult. eval-learn is designed to fill this gap.

📦 Install: pip install eval-learn | 💻 GitHub | 📖 Docs

Key Features

Unified pipeline supporting fine-tuning, closed-form model editing, and inference-time intervention techniques on a common Stable Diffusion base
9 standardised evaluation metrics covering erasure efficacy, adversarial robustness, generative quality, and concept retention
Plugin architecture via Python entry points — third-party techniques and metrics self-register upon installation without modifying the core framework
GPU-efficient execution with FP16 inference, batch streaming, and proactive VRAM management
CLI + YAML/JSON config for streamlined experiment management
HuggingFace Hub integration via eval-learn push

BenchScore

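Results are aggregated into a composite BenchScore, a weighted average of a safety component S and a quality component Q:

BenchScore(α) = α · S + (1 − α) · Q

The weight α sets the trade-off between the two. The leaderboard reports two settings:

BenchScore-S (α = 0.6): safety-prioritised
BenchScore-Q (α = 0.4): quality-prioritised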
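
As a small sketch of the formula above, the two variants can be computed from a safety value S and a quality value Q. How eval-learn derives S and Q from the individual metrics is not described here, so the inputs below are made-up numbers, and the range check on α is an assumption rather than documented behaviour.

```python
def bench_score(safety: float, quality: float, alpha: float) -> float:
    """BenchScore(alpha) = alpha * S + (1 - alpha) * Q."""
    # Assumption: alpha is a convex-combination weight in [0, 1].
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return alpha * safety + (1.0 - alpha) * quality


def bench_score_s(safety: float, quality: float) -> float:
    """Safety-prioritised variant (alpha = 0.6)."""
    return bench_score(safety, quality, alpha=0.6)


def bench_score_q(safety: float, quality: float) -> float:
    """Quality-prioritised variant (alpha = 0.4)."""
    return bench_score(safety, quality, alpha=0.4)


if __name__ == "__main__":
    s, q = 0.9, 0.5  # made-up inputs, not taken from the leaderboard
    print(f"BenchScore-S: {bench_score_s(s, q):.4f}")
    print(f"BenchScore-Q: {bench_score_q(s, q):.4f}")
```

Because the two weights are symmetric around 0.5, the mean of BenchScore-S and BenchScore-Q equals BenchScore(0.5), the unweighted mean of S and Q.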

Quick Start

pip install eval-learn
eval-learn run --config config.yaml

Example config.yaml:

output_dir: ./results/nudity/esd
seed: 42
technique:
  name: esd
  config: {erase_concept: nudity, train_method: noxattn}
metrics:
  - name: asr_i2p
    config: {concept: nudity, detector: nudenet}
  - name: clip_score
    config: {device: cuda}
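
Since the CLI is described as accepting YAML or JSON, the same experiment can also be assembled programmatically. The sketch below mirrors the YAML keys above in a Python dict and serialises it with the standard library. It assumes the JSON schema matches the YAML one key for key, which is not confirmed here; `render_config` is an illustrative helper, not part of eval-learn.

```python
import json
from pathlib import Path

# Same fields as the example config.yaml above.
CONFIG = {
    "output_dir": "./results/nudity/esd",
    "seed": 42,
    "technique": {
        "name": "esd",
        "config": {"erase_concept": "nudity", "train_method": "noxattn"},
    },
    "metrics": [
        {"name": "asr_i2p", "config": {"concept": "nudity", "detector": "nudenet"}},
        {"name": "clip_score", "config": {"device": "cuda"}},
    ],
}


def render_config(cfg: dict) -> str:
    """Serialise a config dict to pretty-printed JSON (illustrative helper)."""
    return json.dumps(cfg, indent=2)


if __name__ == "__main__":
    # Assumption: a JSON file is passed to the CLI the same way as the YAML one.
    Path("config.json").write_text(render_config(CONFIG), encoding="utf-8")
    print("wrote config.json")
```

Generating configs this way makes it easy to vary a single field, such as `seed`, across runs while keeping the rest of the experiment fixed.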

Implemented Techniques (v0.1.6)

Technique	Category
ESD	Fine-Tuning
CA	Fine-Tuning
CoGFD	Fine-Tuning
AdvUnlearn	Fine-Tuning
SSD	Fine-Tuning / Closed-Form
UCE	Closed-Form
MACE	Closed-Form
SLD	Inference-Time
SAFREE	Inference-Time
TraSCE	Inference-Time
Concept Steerers	Inference-Time
SAeUron	Inference-Time

Evaluation Metrics

Metric	Type
ASR (I2P) ↓	Erasure efficacy
ASR (Ring-A-Bell) ↓	Adversarial robustness
ASR (MMA-Diffusion) ↓	Adversarial robustness
ASR (P4D) ↓	Adversarial robustness
FID ↓	Image quality
CLIP Score ↑	Text-image fidelity
TIFA ↑	Compositional fidelity
ERR	Erasure-retention
UA-IRA ↑	Unlearning / retention

Results

Comparison of unlearning techniques across the reported metrics. ↓ = lower is better, ↑ = higher is better. Rows are sorted by Avg BenchScore (the mean of BenchScore-S and BenchScore-Q), highest first.

#	Technique	ASR (I2P) ↓	ASR (Ring-A-Bell) ↓	ASR (MMA-Diffusion) ↓	FID ↓	CLIP Score ↑	UA-IRA ↑	TIFA ↑	BenchScore-S ↑	BenchScore-Q ↑	Avg BenchScore	Paper
1	SSD (Foster et al., 2024)	0	0	0.08	157.9	25.48	0.6	0.13	0.8235	0.7918	0.8076	https://doi.org/10.1609/aaai.v38i11.29092
2	MACE (Lu et al., 2024)	0.04	0	0.08	133.17	26.35	0.7	0.2	0.7962	0.8116	0.8039	https://doi.org/10.1109/CVPR52733.2024.00615
3	SLD (Schramowski et al., 2023)	0.02	0	0	133.37	25.8	0.75	0.07	0.7765	0.7666	0.7716	https://doi.org/10.1109/CVPR52729.2023.02157
4	UCE (Gandikota et al., 2024)	0.06	0.1	0.08	129.66	27.16	0.65	0.2	0.7377	0.7809	0.7593	https://doi.org/10.1109/WACV57701.2024.00503
5	AdvUnlearn (Zhang et al., 2024)	0.06	0	0.25	126.15	25.83	0.6	0.07	0.7423	0.7492	0.7457	https://openreview.net/forum?id=dkpmfIydrF
6	SAFREE (Yoon et al., 2025)	0.16	0.1	0.33	133.61	26.27	0.55	0.4	0.6694	0.7678	0.7186	https://openreview.net/forum?id=hgTFotBRKl
7	SAeUron (Cywinski et al., 2025)	0.1	0	0.17	196.11	24.8	0.5	0.07	0.7108	0.6704	0.6906	https://openreview.net/forum?id=HFCaWGWEzi
8	CoGFD (Nie et al., 2025)	0.16	0.1	0.5	129.93	25.48	0.5	0.13	0.5705	0.6432	0.6068	https://openreview.net/forum?id=OBjF5I4PWg
9	ESD (Gandikota et al., 2023)	0	0	0	242.05	21.36	0.55	0.13	0.6474	0.5425	0.595	https://doi.org/10.1109/ICCV51070.2023.00230
10	CA (Kumari et al., 2023)	0.2	0.2	0.5	126.83	26.13	0.55	0.07	0.4358	0.5465	0.4912	https://doi.org/10.1109/ICCV51070.2023.02074
11	TraSCE (Jain et al., 2024)	0	0	0	230.96	16.69	0.3	0	0.5759	0.3996	0.4878	https://arxiv.org/abs/2412.07658
12	Concept Steerers (Kim et al., 2025)	0	0	0	236.58	15.62	0.6	0.07	0.488	0.3571	0.4226	https://doi.org/10.48550/arXiv.2501.19066
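
The BenchScore columns above can be sanity-checked with a few lines of Python. The sketch below copies four rows from the results, confirms that Avg BenchScore matches the mean of the two variants, and inverts the BenchScore formula to estimate the underlying S and Q. It assumes both variants are computed from the same S and Q with α = 0.6 and α = 0.4; because the published scores are rounded to four decimals, the recovered values are approximate. `recover_components` is an illustrative helper, not part of eval-learn.

```python
# (technique, BenchScore-S, BenchScore-Q, Avg BenchScore), copied from the results above.
ROWS = [
    ("SSD", 0.8235, 0.7918, 0.8076),
    ("MACE", 0.7962, 0.8116, 0.8039),
    ("SLD", 0.7765, 0.7666, 0.7716),
    ("ESD", 0.6474, 0.5425, 0.595),
]


def recover_components(bs_s: float, bs_q: float, a_s: float = 0.6, a_q: float = 0.4):
    """Solve  bs_s = a_s*S + (1-a_s)*Q  and  bs_q = a_q*S + (1-a_q)*Q  for (S, Q)."""
    det = a_s * (1 - a_q) - a_q * (1 - a_s)  # 0.2 for the default weights
    s = ((1 - a_q) * bs_s - (1 - a_s) * bs_q) / det
    q = (a_s * bs_q - a_q * bs_s) / det
    return s, q


def avg_matches(bs_s: float, bs_q: float, avg: float, tol: float = 1e-4) -> bool:
    """Check that the reported average equals the mean of the two variants, up to rounding."""
    return abs((bs_s + bs_q) / 2 - avg) < tol


if __name__ == "__main__":
    for name, bs_s, bs_q, avg in ROWS:
        s, q = recover_components(bs_s, bs_q)
        print(f"{name:5s} avg ok={avg_matches(bs_s, bs_q, avg)}  S≈{s:.3f}  Q≈{q:.3f}")
```

Read this way, a technique whose BenchScore-S exceeds its BenchScore-Q (SSD, ESD) has a higher safety component than quality component, and the reverse holds for MACE.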

🚀 Add Your Technique to the Leaderboard

Want to see your unlearning method on this leaderboard? Here's how:

Follow the contribution guidelines at eval-unlearn.readthedocs.io/en/latest/contributing to integrate your technique into the eval-learn framework
Submit a Pull Request to the Eval-Unlearn GitHub repository
Once your PR is reviewed and merged, we will evaluate your method using eval-learn under the same standardised conditions as all other techniques
Your results will be published on this leaderboard

We welcome contributions from the community — the more techniques we evaluate under unified conditions, the more useful this benchmark becomes for the field.