
Age-Aware Chatbot: Fine-Tuned Responses for Children's Education
This example shows how to fine-tune TinyLlama-1.1B to generate age-appropriate educational responses for children across different developmental stages—then use a structured experiment to find the best LoRA configuration for balancing performance and efficiency.
Dataset
yxpan/children_sft_dataset (evolved via Gemini-2.0-flash-lite API into 1,200 samples across 5 age groups and 4 intent types)
Agent
A fine-tuned TinyLlama-1.1B trained via SFT + LoRA (PEFT), with experiments across learning rate, target modules, and LoRA rank.
Objectives
This example can serve as a starting point for learning how to rapidly experiment with fine-tuning knobs (learning rate, LoRA target modules, LoRA rank) and pick the configuration that best balances quality against training cost.
Key takeaways
Module coverage + rank matter most:
Expanding LoRA targets to all linear modules with rank 32 (Config B) delivered the best efficiency-adjusted results: +16.9% ROUGE-L and +6.2% token accuracy over baseline with only 0.3% more runtime.
Higher LR didn't help:
Raising the learning rate to 1e-4 (Config A) caused training instability and hurt BERT-F1, dropping it from 0.8074 to 0.7733.
Combining knobs has diminishing returns:
Config C (high LR + all modules) achieved the highest raw metrics but at a disproportionate runtime cost, making Config B the practical winner.
Age-adaptation is hard to measure:
ROUGE-L underperforms on open-ended tasks like storytelling. The model also shows readability gaps exceeding 11 grade levels in edge cases and occasional "tone leakage" where it lapses into childish language mid-response for older age groups.
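The readability-gap takeaway above can be quantified with a Flesch-Kincaid grade-level check on model responses. The sketch below implements the standard FK formula with a rough vowel-group syllable heuristic (an assumption; a dictionary-based syllable counter would be more accurate), and the two sample responses are illustrative, not taken from the dataset:

```python
# Minimal Flesch-Kincaid grade-level scorer for spotting age/readability gaps.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count contiguous vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level formula.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59

child_style = "The sun is a big star. It gives us light."
teen_style = ("Photosynthesis converts solar radiation into chemical "
              "energy stored in glucose.")
print(fk_grade(child_style))  # low grade level
print(fk_grade(teen_style))   # much higher grade level
```

Comparing `fk_grade` of a response against the target age group's expected grade band makes "tone leakage" visible as a numeric gap rather than a subjective impression.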
Figure 1. Comparison of ROUGE-L scores across experimental configurations
Experiment Design
Full factorial 2×2×2 = 8 configurations across three knobs: learning rate, LoRA target modules, and LoRA rank. The baseline and the three most informative configurations are summarized below:
| Config | Key change(s) | BERT-F1 | ROUGE-L | Eval mean token accuracy | Runtime | Notes |
|---|---|---|---|---|---|---|
| Baseline | (defaults) | 0.8074 | 0.1003 | 0.7271 | 3.27 | shortest runtime |
| A | LR: 1e-4 | 0.7733 | 0.0931 | 0.7817 | 3.40 | lower BERT-F1 at a longer runtime |
| B (Best) | Modules: all linear; LoRA rank: 32 | 0.8086 | 0.1172 | 0.7893 | 3.28 | best overall trade-off; near-baseline runtime |
| C | LR: 1e-4 + all linear (LoRA rank: 32) | 0.8134 | 0.1062 | 0.7946 | 3.46 | highest raw metrics, but the largest runtime increase |
Performance summary of fine-tuning experiments, identifying Config B as the optimal balance between high ROUGE-L scores and minimal runtime overhead for educational content generation.
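The 2×2×2 grid described above can be enumerated mechanically with `itertools.product`. Only the varied values (LR 1e-4, all linear modules, rank 32) are stated in this document, so the baseline values used here (LR 2e-5, attention-only projections, rank 8) are assumptions for illustration:

```python
# Enumerate the full factorial 2x2x2 experiment grid over the three knobs.
# Baseline knob values are assumed; only the "high" values appear in the table.
from itertools import product

learning_rates = [2e-5, 1e-4]                        # assumed baseline LR vs. Config A LR
module_sets = [["q_proj", "v_proj"], "all-linear"]   # assumed baseline set vs. all linear
lora_ranks = [8, 32]                                 # assumed baseline rank vs. rank 32

grid = list(product(learning_rates, module_sets, lora_ranks))
assert len(grid) == 8  # full factorial: 2 x 2 x 2

for lr, modules, rank in grid:
    print(f"lr={lr}, modules={modules}, rank={rank}")
```

Each tuple would then parameterize one training run, with the table above reporting the baseline plus the three variants that moved the metrics most.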