
# PII Redaction
This example shows how to fine-tune GPT-2 to automatically detect and redact personally identifiable information (names, emails, phone numbers, addresses) from text, and then uses a full-factorial experiment to find the best configuration across prompt design, model capacity, and learning rate.
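To make the task's input/output contract concrete, here is a naive regex baseline sketch. The mask tokens and patterns are illustrative assumptions; names and addresses are exactly the entities regexes miss, which is why the example fine-tunes a model instead.

```python
import re

# Illustrative mask tokens and patterns; the label scheme used in training
# is an assumption, and this baseline only covers emails and phone numbers.
PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
}

def redact(text):
    # Replace each matched PII span with its mask token.
    for mask, pattern in PATTERNS.items():
        text = pattern.sub(mask, text)
    return text

print(redact("Email jane@example.com or call 555-123-4567."))
# -> Email [EMAIL] or call [PHONE].
```

Names like "Jane Doe" or free-form street addresses have no reliable surface pattern, which motivates learning the redaction mapping with a fine-tuned model.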
## Agent
A fine-tuned GPT-2 (124M parameters) trained with SFT plus LoRA (via the PEFT library), sweeping 8 configurations across 3 dimensions in a 2×2×2 grid.
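A minimal sketch of the LoRA setup, assuming the Hugging Face `transformers` and `peft` libraries. The `lora_alpha`, dropout, and target-module choices below are illustrative assumptions, not the exact values used in this example.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the 124M-parameter GPT-2 base model.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA adapter config: r is the rank knob swept in this experiment (8 vs 32).
# lora_alpha=64 and targeting GPT-2's fused attention projection ("c_attn")
# are illustrative assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the low-rank adapter matrices train, each of the 8 sweep runs touches a small fraction of the 124M base weights, which is what makes the sweep cheap enough for free Colab hardware.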
## Objectives
This example can serve as a starting point for learning how to rapidly experiment across prompt design, model capacity, and learning rate.
## Key takeaways
- **LoRA rank is the biggest lever:** r=32 averaged 22.7% lower eval loss than r=8. The higher capacity was essential for capturing the diversity of PII entity patterns.
- **One-shot examples matter:** Prompt B (instruction plus one hardcoded example) beat Prompt A (minimal instruction) by 15.2% in average eval loss. The example helped the model learn the PII→mask-token mapping pattern.
- **An aggressive LR worked on small data:** 5e-4 consistently outperformed 2e-4 across all configs. With only 64 training examples, the faster learning rate converged without instability.
- **Measure and sanity-check:** The best config hit 77.1% token accuracy and 1.0465 eval loss, but exact match was 0%; even one token off counts as a failure. Human review of outputs is essential in small-data regimes.
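The gap between token accuracy and exact match can be made concrete with a small sketch. The metric definitions are the standard ones; the token sequences are invented for illustration.

```python
def token_accuracy(pred, ref):
    # Fraction of reference positions where the predicted token matches.
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(ref), 1)

def exact_match(pred, ref):
    # All-or-nothing: a single wrong token means failure.
    return pred == ref

ref  = ["Contact", "[NAME]", "at", "[EMAIL]", "."]
pred = ["Contact", "[NAME]", "at", "[PHONE]", "."]

print(token_accuracy(pred, ref))  # 0.8: four of five tokens correct
print(exact_match(pred, ref))     # False: one wrong mask sinks exact match
```

A redaction that swaps `[EMAIL]` for `[PHONE]` still leaks nothing, yet scores 0 on exact match, which is why the takeaway above recommends human review rather than trusting either metric alone.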
*Full experiment dashboard showing training loss, evaluation loss, token accuracy, and per-knob hyperparameter comparisons across all 8 configurations.*
## Experiment Design
Full factorial 2×2×2 = 8 configurations across three knobs:
| Knob | Values | Why |
|---|---|---|
| Prompt scheme | A (minimal) vs B (one-shot example) | Does an example improve PII pattern recognition? |
| LoRA rank | r=8 vs r=32 | Capacity vs overfitting risk on small data |
| Learning rate | 2e-4 vs 5e-4 | Convergence speed vs stability |
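For concreteness, the two prompt schemes might look like the following. The wording and mask tokens are illustrative assumptions; the exact prompts used in this example are not shown here.

```python
# Scheme A: minimal instruction only.
PROMPT_A = "Redact all PII in the text below.\nInput: {text}\nOutput:"

# Scheme B: the same instruction plus one hardcoded input/output example,
# which demonstrates the PII -> mask-token mapping pattern.
PROMPT_B = (
    "Redact all PII in the text below.\n"
    "Input: Call John Smith at 555-0100.\n"
    "Output: Call [NAME] at [PHONE].\n"
    "Input: {text}\n"
    "Output:"
)

print(PROMPT_B.format(text="Email jane@example.com by Friday."))
```

The only difference between the schemes is the hardcoded example, so any eval-loss gap between A and B isolates the value of in-prompt demonstration.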
Base model: GPT-2 (124M) · Split: 64 train / 10 eval · Metrics: Eval loss (primary), token accuracy (secondary)
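The 2×2×2 grid is just the Cartesian product of the three knobs. A minimal sketch (the dict keys are illustrative):

```python
from itertools import product

# The three knobs from the table above.
prompt_schemes = ["A", "B"]
lora_ranks = [8, 32]
learning_rates = [2e-4, 5e-4]

# Full factorial: every combination of every knob value.
configs = [
    {"prompt": p, "lora_r": r, "lr": lr}
    for p, r, lr in product(prompt_schemes, lora_ranks, learning_rates)
]

print(len(configs))  # 2 * 2 * 2 = 8 configurations
```

A full factorial at this scale is cheap (8 runs) and, unlike one-knob-at-a-time sweeps, also exposes interactions such as rank × learning rate.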
## Results
| Config | Key change(s) | BERT-F1 | ROUGE-L | Eval mean token accuracy | Runtime | Notes |
|---|---|---|---|---|---|---|
| Baseline | – | 0.8074 | 0.1003 | 0.7271 | 3.27 | shortest runtime |
| A | LR: 1e-4 | 0.7733 | 0.0931 | 0.7817 | 3.40 | comparable performance, longer runtime |
| B (best) | modules: all linear; LoRA rank: 32 | 0.8086 | 0.1172 | 0.7893 | 3.28 | better performance with slightly increased runtime |
| C | LR: 1e-4 + all linear (LoRA rank: 32) | 0.8134 | 0.1062 | 0.7946 | 3.46 | better performance, but much longer runtime |
All 8 configs ran in ~7 minutes on free Colab hardware using hyperparallel execution.
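Conceptually, the sweep fans the 8 configs out to concurrent workers. A generic sketch using Python's standard library (this is not the actual hyperparallel API; `run_config` is a hypothetical stand-in for one training run):

```python
from concurrent.futures import ThreadPoolExecutor

def run_config(cfg):
    # Hypothetical stand-in for one SFT + LoRA training run;
    # a real run would return the trained adapter and its metrics.
    return {"config": cfg, "eval_loss": None}

configs = [{"id": i} for i in range(8)]

# Launch all 8 runs concurrently instead of sequentially.
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(run_config, configs))

print(len(results))  # one result per configuration
```

Running the grid concurrently is what turns eight ~3-minute runs into roughly 7 minutes of wall-clock time on a single free Colab instance.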









