PII Redaction

This example shows how to fine-tune GPT-2 to automatically detect and redact personally identifiable information (names, emails, phone numbers, addresses) from text—then use a full-factorial experiment to find the best configuration across prompt design, model capacity, and learning rate.

Agent

A fine-tuned GPT-2 (124M params) trained via SFT + LoRA (PEFT), sweeping 8 configurations across 3 dimensions in a 2×2×2 grid.
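The SFT + LoRA setup above can be sketched as a configuration with the Hugging Face PEFT library. The specific values below (alpha, dropout, target modules) are illustrative assumptions, not the exact settings used in the experiment; `r` is the capacity knob swept between 8 and 32.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA adapter config for GPT-2 (values are assumptions;
# r=8 vs r=32 is the capacity dimension of the 2x2x2 sweep).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,                       # low-rank dimension
    lora_alpha=64,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
)

# Wrap the 124M-parameter base model with the adapter.
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights train; the frozen base model keeps the memory footprint small enough for free-tier hardware.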

Objectives

This example can serve as a starting point to understand how to rapidly experiment to:

Replace PII with appropriate mask tokens ([NAME], [EMAIL], [PHONE], etc.) while preserving document structure and readability.

Find the right balance of prompt design, LoRA capacity, and learning rate for a small-data privacy task.
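To make the target behavior concrete, here is a rule-based sketch of the input→output mapping the model is fine-tuned to produce. This regex approach is only an illustration of the mask-token format (the patterns are simplified assumptions), not the method used in the experiment, which learns the mapping from data.

```python
import re

# Simplified patterns illustrating the PII -> mask-token mapping
# the fine-tuned model learns to produce.
MASKS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def mask_pii(text: str) -> str:
    """Replace matched PII spans with mask tokens, keeping structure intact."""
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return text

print(mask_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# -> Contact [EMAIL] or [PHONE].
```

Note how the surrounding sentence survives untouched; that is the "preserving document structure and readability" requirement in miniature.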

Key takeaways

LoRA rank is the biggest lever:

r=32 averaged 22.7% lower eval loss than r=8. Higher capacity was essential for capturing the diversity of PII entity patterns.

One-shot examples matter:

Prompt B (instruction + one hardcoded example) beat Prompt A (minimal instruction) by 15.2% on average eval loss. The example helped the model learn the PII→mask token mapping pattern.

Aggressive LR worked on small data:

5e-4 consistently outperformed 2e-4 across all configs. With only 64 training examples, faster learning converged without instability.

Measure + sanity-check:

The best config hit 77.1% token accuracy and 1.0465 eval loss, but exact match was 0%—even one token off counts as failure. Human review of outputs is essential in small-data regimes.
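The gap between token accuracy and exact match is easy to reproduce. The sketch below (hypothetical helper functions, not the experiment's evaluation code) shows how a single wrong mask token keeps per-token accuracy high while failing the strict exact-match criterion.

```python
def token_accuracy(pred: list[str], ref: list[str]) -> float:
    """Fraction of positions where the predicted token matches the reference."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(ref), 1)

def exact_match(pred: list[str], ref: list[str]) -> bool:
    """Strict criterion: every token must match, including sequence length."""
    return pred == ref

ref = "Call [NAME] at [PHONE] today".split()
pred = "Call [NAME] at [EMAIL] today".split()  # one wrong mask token

print(token_accuracy(pred, ref))  # 0.8 -- four of five tokens correct
print(exact_match(pred, ref))     # False -- the whole document fails
```

For redaction, the strict metric is the one that matters: a single leaked or mislabeled entity is a privacy failure, which is why human review remains essential.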


Full experiment dashboard showing training loss, evaluation loss, token accuracy, and per-knob hyperparameter comparisons across all 8 configurations.

Experiment Design

Full factorial 2×2×2 = 8 configurations across three knobs:

| Knob | Values | Why |
| --- | --- | --- |
| Prompt scheme | A (minimal) vs B (one-shot example) | Does an example improve PII pattern recognition? |
| LoRA rank | r=8 vs r=32 | Capacity vs overfitting risk on small data |
| Learning rate | 2e-4 vs 5e-4 | Convergence speed vs stability |

Base model: GPT-2 (124M) · Split: 64 train / 10 eval · Metrics: Eval loss (primary), token accuracy (secondary)
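Enumerating a full factorial grid like this is a one-liner with `itertools.product`. The config dictionary keys below are illustrative names, not the experiment's actual API.

```python
from itertools import product

# The three knobs from the table above; 2 x 2 x 2 = 8 configurations.
prompts = ["A_minimal", "B_one_shot"]
lora_ranks = [8, 32]
learning_rates = [2e-4, 5e-4]

configs = [
    {"prompt": p, "lora_r": r, "lr": lr}
    for p, r, lr in product(prompts, lora_ranks, learning_rates)
]

print(len(configs))  # 8
```

Each dictionary then parameterizes one training run, which is what makes launching all eight in parallel straightforward.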

Results

| Config | Key change(s) | BERT-F1 | ROUGE-L | Eval mean token accuracy | Runtime | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | – | 0.8074 | 0.1003 | 0.7271 | 3.27 | shortest runtime |
| A | LR: 1e-4 | 0.7733 | 0.0931 | 0.7817 | 3.40 | comparable performance, longer runtime |
| B (best) | Modules: all linear; LoRA rank: 32 | 0.8086 | 0.1172 | 0.7893 | 3.28 | better performance with slightly increased runtime |
| C | LR: 1e-4 + all linear (LoRA rank: 32) | 0.8134 | 0.1062 | 0.7946 | 3.46 | better performance, but much longer runtime |

All 8 configs ran in ~7 minutes on free Colab hardware using hyperparallel execution.

How to apply this to your domain

Use this workflow as a template for your own redaction task:

Structure your data as text-to-text pairs (source with PII → masked output with tokens). Works for support logs, legal docs, medical records.

Include at least one example in your prompt—minimal instructions alone underperformed significantly in this experiment.

Start with a higher LoRA rank (r=32+) when your task involves recognizing diverse entity patterns, and sweep learning rate.

Scale up training data for production—64 examples proved the concept, but nested PII and edge cases need more coverage.
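The first step, structuring data as text-to-text pairs, can be as simple as writing JSONL. The filename and field names below are assumptions for illustration; adapt them to whatever your SFT data loader expects.

```python
import json

# Hypothetical training pairs: raw text in, masked text out.
pairs = [
    {
        "input": "Email Maria Lopez at maria@acme.io.",
        "output": "Email [NAME] at [EMAIL].",
    },
    {
        "input": "Invoice sent to 44 Elm St, Springfield.",
        "output": "Invoice sent to [ADDRESS].",
    },
]

# One JSON object per line, the common format for SFT datasets.
with open("pii_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

Keeping the source and masked versions aligned pair-by-pair is what lets the model learn the entity→token mapping rather than memorizing specific strings.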
