Age-Aware Chatbot: Fine-Tuned Responses for Children's Education

This example shows how to fine-tune TinyLlama-1.1B to generate age-appropriate educational responses for children across different developmental stages—then use a structured experiment to find the best LoRA configuration for balancing performance and efficiency.


Dataset

yxpan/children_sft_dataset (evolved via Gemini-2.0-flash-lite API into 1,200 samples across 5 age groups and 4 intent types)

Agent

A fine-tuned TinyLlama-1.1B trained via SFT + LoRA (PEFT), with experiments across learning rate, target modules, and LoRA rank.
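As a sketch of what that setup looks like in Hugging Face PEFT, the winning configuration (all linear modules, LoRA rank 32, reported below as Config B) might be declared as follows. The `lora_alpha` and dropout values are assumptions; the post does not report them.

```python
from peft import LoraConfig

# Config B from the experiment table: adapt all linear modules at rank 32.
# lora_alpha and lora_dropout are assumed values, not reported in the post.
config_b = LoraConfig(
    r=32,                         # LoRA rank (Config B)
    lora_alpha=64,                # assumed; a common choice is 2 * r
    target_modules="all-linear",  # Config B: every linear layer, not just attention
    lora_dropout=0.05,            # assumed
    task_type="CAUSAL_LM",
)
# The adapter is then attached with peft.get_peft_model(model, config_b)
# before running SFT as usual.
```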

Objectives

This example serves as a starting point for rapid experimentation, showing how to:

Generate pedagogically sound, age-stratified responses that match the cognitive level of the target age group.

Find the right trade-off between LoRA capacity, module coverage, and training cost on constrained hardware (T4 GPU).

Key takeaways

Module coverage + rank matter most:

Expanding LoRA targets to all linear modules at rank 32 (Config B) delivered the best efficiency-adjusted results: +16.9% ROUGE-L (relative) and +6.2 points of token accuracy over baseline, with only 0.3% more runtime.

Higher LR didn't help:

Raising the learning rate to 1e-4 (Config A) caused training instability and actually hurt BERT-F1, dropping it from 0.8074 to 0.7733.

Combining knobs has diminishing returns:

Config C (high LR + all modules) achieved the highest raw metrics but at a disproportionate runtime cost, making Config B the practical winner.

Age-adaptation is hard to measure:

ROUGE-L underperforms on open-ended tasks like storytelling. The model also shows readability gaps exceeding 11 grade levels in edge cases and occasional "tone leakage" where it lapses into childish language mid-response for older age groups.
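Readability gaps like those described above can be screened for automatically. Below is a rough sketch using the Flesch-Kincaid grade formula with a naive vowel-group syllable counter; the helper names are hypothetical, and a production audit would use a library such as textstat.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups; real tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    # Flesch-Kincaid grade level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

def readability_gap(response: str, target_grade: float) -> float:
    # Flag responses whose reading level drifts from the age group's target.
    return abs(fk_grade(response) - target_grade)
```

A check like this can run over every generated response and flag outliers, such as the 11-grade-level gaps the experiment surfaced.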


Figure 1. Comparison of ROUGE-L scores across experimental configurations
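ROUGE-L, the metric compared in Figure 1, scores the longest common subsequence shared between a generated response and its reference, which helps explain why it underperforms on open-ended storytelling: two equally valid answers can share little surface overlap. A minimal stdlib sketch follows; whitespace tokenization is a simplification, and the post's actual scorer is not specified.

```python
# Minimal ROUGE-L F1 sketch over whitespace tokens. Real evaluations
# typically use a library such as rouge-score with proper tokenization.
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Two plausible answers to the same storytelling prompt score near zero:
print(rouge_l_f1("the dragon found a shiny red kite",
                 "a brave dragon flew over the castle"))
```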

Experiment Design

Full factorial 2×2×2 = 8 configurations across three knobs (learning rate, LoRA target modules, LoRA rank); the four most informative are reported below:

| Config | Key change(s) | BERT-F1 | ROUGE-L | Eval mean token accuracy | Runtime | Notes |
|---|---|---|---|---|---|---|
| Baseline | (defaults) | 0.8074 | 0.1003 | 0.7271 | 3.27 | shortest runtime |
| A | LR: 1e-4 | 0.7733 | 0.0931 | 0.7817 | 3.40 | comparable performance, longer runtime |
| B (best) | Modules: all linear; LoRA rank: 32 | 0.8086 | 0.1172 | 0.7893 | 3.28 | better performance with slightly increased runtime |
| C | LR: 1e-4 + all linear modules (LoRA rank: 32) | 0.8134 | 0.1062 | 0.7946 | 3.46 | better performance, but much longer runtime |

Table 1. Performance summary of fine-tuning experiments, identifying Config B as the optimal balance between high ROUGE-L scores and minimal runtime overhead for educational content generation.
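The 2×2×2 grid behind this experiment can be enumerated programmatically. A minimal sketch follows; the baseline knob values are labeled generically because the post only reports the raised settings (LR 1e-4, all-linear targets, rank 32).

```python
from itertools import product

# The three experiment knobs. Raised settings come from the post;
# baseline values are placeholders, since the post does not report them.
knobs = {
    "learning_rate": ["baseline", "1e-4"],
    "target_modules": ["baseline", "all-linear"],
    "lora_rank": ["baseline", "32"],
}

# Full factorial: one run per combination of knob settings.
configs = [dict(zip(knobs, values)) for values in product(*knobs.values())]
print(len(configs))  # 2 x 2 x 2 = 8 runs
```

Each resulting dict can then seed one training run, with only the reported subset (Baseline, A, B, C) surfaced in the summary table.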

How to apply this to your domain

Use this workflow as a template for your own chatbot: