
Age-Aware Chatbot: Fine-Tuned Responses for Children's Education
This example shows how to fine-tune TinyLlama-1.1B to generate age-appropriate educational responses for children across different developmental stages—then use a structured experiment to find the best LoRA configuration for balancing performance and efficiency.
Dataset
yxpan/children_sft_dataset (evolved via Gemini-2.0-flash-lite API into 1,200 samples across 5 age groups and 4 intent types)
Agent
A fine-tuned TinyLlama-1.1B trained via SFT + LoRA (PEFT), with experiments across learning rate, target modules, and LoRA rank.
Objectives
This example can serve as a starting point for learning how to rapidly experiment with fine-tuning knobs (learning rate, LoRA target modules, LoRA rank) and pick the configuration that best balances quality against training cost.
Key takeaways
Module coverage + rank matter most:
Expanding LoRA targets to all linear modules with rank 32 (Config B) delivered the best efficiency-adjusted results: +16.9% ROUGE-L and +6.2% token accuracy over baseline with only 0.3% more runtime.
Higher LR didn't help:
Raising the learning rate to 1e-4 (Config A) caused training instability and hurt BERT-F1, dropping it from 0.8074 to 0.7733.
Combining knobs has diminishing returns:
Config C (high LR + all modules) achieved the highest raw metrics but at a disproportionate runtime cost, making Config B the practical winner.
Age-adaptation is hard to measure:
ROUGE-L underperforms on open-ended tasks like storytelling. The model also shows readability gaps exceeding 11 grade levels in edge cases and occasional "tone leakage" where it lapses into childish language mid-response for older age groups.
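The readability-gap takeaway above can be quantified with a Flesch-Kincaid grade-level check on model responses. The sketch below implements the standard FK formula with a rough vowel-group syllable heuristic (an assumption; a dictionary-based syllable counter would be more accurate), and the two sample responses are illustrative, not taken from the dataset:

```python
# Minimal Flesch-Kincaid grade-level scorer for spotting age/readability gaps.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count contiguous vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level formula.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59

child_style = "The sun is a big star. It gives us light."
teen_style = ("Photosynthesis converts solar radiation into chemical "
              "energy stored in glucose.")
print(fk_grade(child_style))  # low grade level
print(fk_grade(teen_style))   # much higher grade level
```

Comparing `fk_grade` of a response against the target age group's expected grade band makes "tone leakage" visible as a numeric gap rather than a subjective impression.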
Figure 1. Comparison of ROUGE-L scores across experimental configurations
Experiment Design
Full factorial 2×2×2 = 8 configurations across three knobs: learning rate, LoRA target modules, and LoRA rank. The baseline and the three most informative configurations are summarized below:
| Config | Key change(s) | BERT-F1 | ROUGE-L | Eval mean token accuracy | Runtime | Notes |
|---|---|---|---|---|---|---|
| Baseline | (defaults) | 0.8074 | 0.1003 | 0.7271 | 3.27 | shortest runtime |
| A | LR: 1e-4 | 0.7733 | 0.0931 | 0.7817 | 3.40 | lower BERT-F1 at a longer runtime |
| B (Best) | Modules: all linear; LoRA rank: 32 | 0.8086 | 0.1172 | 0.7893 | 3.28 | best overall trade-off; near-baseline runtime |
| C | LR: 1e-4 + all linear (LoRA rank: 32) | 0.8134 | 0.1062 | 0.7946 | 3.46 | highest raw metrics, but the largest runtime increase |
Performance summary of fine-tuning experiments, identifying Config B as the optimal balance between high ROUGE-L scores and minimal runtime overhead for educational content generation.
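The 2×2×2 grid described above can be enumerated mechanically with `itertools.product`. Only the varied values (LR 1e-4, all linear modules, rank 32) are stated in this document, so the baseline values used here (LR 2e-5, attention-only projections, rank 8) are assumptions for illustration:

```python
# Enumerate the full factorial 2x2x2 experiment grid over the three knobs.
# Baseline knob values are assumed; only the "high" values appear in the table.
from itertools import product

learning_rates = [2e-5, 1e-4]                        # assumed baseline LR vs. Config A LR
module_sets = [["q_proj", "v_proj"], "all-linear"]   # assumed baseline set vs. all linear
lora_ranks = [8, 32]                                 # assumed baseline rank vs. rank 32

grid = list(product(learning_rates, module_sets, lora_ranks))
assert len(grid) == 8  # full factorial: 2 x 2 x 2

for lr, modules, rank in grid:
    print(f"lr={lr}, modules={modules}, rank={rank}")
```

Each tuple would then parameterize one training run, with the table above reporting the baseline plus the three variants that moved the metrics most.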