Research collection · LLM systems

LLM Training Knowledge Base

A cleaned, public-facing synthesis of handwritten notes on LLM training. This page turns scattered raw ideas into a structured map covering post-training, evaluation, reasoning, instruction following, tool use, multilingual quality, long-context behavior, data curation, and serving-time reliability.

Topics: Post-training · Evaluation · Reasoning · Function calling · Data quality · Serving
Source basis: five handwritten note images. This page is a structured synthesis, not a literal transcript.

Overview

The source notes point toward a practical view of LLM development: strong product behavior is often shaped more by post-training and evaluation discipline than by raw pretraining scale alone. The important question is not just whether the model knows more, but whether it is consistently useful, controllable, benchmark-honest, tool-capable, and robust under real inference conditions.

This knowledge base organizes the notes around capability buckets so they can be reused as a training or post-training checklist rather than staying trapped as unstructured scratch paper.

Post-training priorities

The notes strongly suggest that post-training should be treated as a multi-objective optimization problem. A useful model needs reasoning quality, instruction following, response quality, tool correctness, and broad capability retention at the same time.

Primary goals

  • Improve reasoning on hard tasks
  • Preserve or improve instruction following
  • Support accurate function calling and tool use
  • Maintain strong conversational output quality
  • Avoid multilingual collapse

Common failure modes

  • Benchmark chasing that hurts real-world usability
  • RL gains that degrade style, formatting, or latency
  • Tool-call regressions after reasoning-focused tuning
  • Template/runtime mismatches at deployment time
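One lightweight guard against these failure modes is a per-bucket regression gate that compares evaluation scores before and after each post-training stage and flags any bucket that dropped beyond a tolerance. A minimal sketch; the bucket names, scores, and tolerance below are illustrative, not taken from the notes:

```python
# Sketch of a per-bucket regression gate for post-training runs.
# Bucket names, scores, and the max_drop tolerance are illustrative.

def regression_gate(before: dict, after: dict, max_drop: float = 0.02) -> list:
    """Return (bucket, old, new) tuples whose score dropped more than max_drop."""
    regressions = []
    for bucket, old_score in before.items():
        new_score = after.get(bucket, 0.0)
        if old_score - new_score > max_drop:
            regressions.append((bucket, old_score, new_score))
    return regressions

before = {"reasoning": 0.61, "tools": 0.88, "multilingual": 0.74, "chat": 0.79}
after  = {"reasoning": 0.70, "tools": 0.81, "multilingual": 0.73, "chat": 0.80}

# Reasoning improved, but tool use regressed past tolerance and is flagged.
print(regression_gate(before, after))  # [('tools', 0.88, 0.81)]
```

Running this kind of gate on every candidate checkpoint makes "RL gains that degrade something else" visible immediately instead of after deployment.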

Evaluation stack

Several benchmark names or fragments appear in the notes, including arena-style comparisons, hard knowledge or reasoning evaluations, math-heavy tasks, and pass@k-style evaluation for small sample sets. The clear takeaway is that no single eval should dominate the training narrative.

  • Arena-Hard or related pairwise preference evaluations for broad quality comparison
  • GPQA / hard reasoning-style benchmarks for difficult scientific or knowledge-heavy tasks
  • MATH or similar structured reasoning evaluations
  • Pass@k framing for stochastic or small-sample task settings
Best practice: split metrics by capability bucket instead of collapsing everything into one composite score.
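For the pass@k framing, the commonly used unbiased estimator (draw n samples per task, count c correct, report the probability that a budget of k samples contains at least one success) can be computed directly:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c correct, budget k.
    Equals 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than the budget: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 samples per task, 5 correct:
print(pass_at_k(20, 5, 1))   # 0.25
print(pass_at_k(20, 5, 10))  # ~0.98
```

Naively reporting c/n per task and averaging is biased for k > 1, which is exactly why the estimator matters on small sample sets.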

Reasoning and RL

The notes appear to connect reasoning gains with reinforcement-style post-training, verifier- or reward-guided optimization, and DeepSeek-R1-style distinctions between pure-RL "zero" training and a supervised cold start.

  • Cold-start supervised data can stabilize later RL
  • Pure RL-style reasoning gains may produce awkward or unstable behavior if left unchecked
  • Reward design matters: optimizing only final correctness can create brittle outputs
  • Reasoning quality should be measured alongside answer readability and utility
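To make the reward-design point concrete, a hypothetical shaped reward might combine final correctness with readability terms rather than optimizing correctness alone. Everything here (the weights, the repetition check, the length bonus) is an illustrative assumption, not the notes' actual reward design:

```python
import re

def composite_reward(answer: str, target: str) -> float:
    """Hypothetical shaped reward: final-answer correctness plus
    readability terms. All weights are illustrative assumptions."""
    lines = answer.strip().splitlines() or [""]
    reward = 1.0 if lines[-1].endswith(target) else 0.0
    # Penalize degenerate repetition, a known artifact of unchecked pure RL.
    if re.search(r"\b(\w+)(\s+\1){4,}\b", answer):
        reward -= 0.5
    # Small bonus for a bounded-length, readable response.
    if len(answer) < 2000:
        reward += 0.1
    return reward

print(composite_reward("Work through the steps.\nAnswer: 42", "42"))  # 1.1
print(composite_reward("yes yes yes yes yes", "no"))                  # -0.4
```

The point is not these particular terms but the shape: a reward that only checks the final answer leaves the optimizer free to produce brittle or unreadable traces.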

Instruction following and response quality

A strong model should not only solve hard tasks but also obey constraints, maintain structure, produce the expected format, and communicate cleanly. The notes seem to explicitly treat instruction following as a core target, which is the right framing for assistant-style systems.

  • Follow explicit constraints without drifting into over-refusal
  • Preserve formatting, schema, and style requirements
  • Improve naturalness of responses without sacrificing precision
  • Track response quality separately from benchmark accuracy
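Constraint adherence of this kind is cheap to check programmatically. A minimal sketch of a per-response checker; the constraint set (JSON validity, required keys, word limit) is an illustrative assumption:

```python
import json

def check_format_constraints(response: str, require_json: bool = False,
                             required_keys: tuple = (), max_words=None) -> dict:
    """Hypothetical per-response constraint checker; the constraint
    set is illustrative, not a fixed standard."""
    results = {}
    if require_json:
        try:
            obj = json.loads(response)
            results["valid_json"] = True
            results["has_keys"] = (isinstance(obj, dict)
                                   and all(k in obj for k in required_keys))
        except json.JSONDecodeError:
            results["valid_json"] = False
            results["has_keys"] = False
    if max_words is not None:
        results["within_length"] = len(response.split()) <= max_words
    return results

print(check_format_constraints('{"answer": "Paris"}', require_json=True,
                               required_keys=("answer",)))
# {'valid_json': True, 'has_keys': True}
```

Aggregating these checks over an eval set gives an instruction-following score that is tracked separately from task accuracy, matching the split above.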

Tool use and function calling

Tool use appears as an explicit category in the notes. That matters because function calling often breaks when teams optimize only for free-form reasoning or preference wins.

  • Train structured tool selection and argument formatting directly
  • Evaluate exact schema adherence, not just rough intent
  • Test multi-turn tool use under ambiguity and partial context
  • Protect tool reliability through later post-training stages
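Exact schema adherence can be scored with a strict validator that rejects wrong names, missing required arguments, wrong types, and invented arguments. A self-contained sketch; the tool spec and its format are hypothetical (real deployments would validate against the runtime's own schema format, e.g. JSON Schema):

```python
import json

# Illustrative tool spec; a real stack would use its own schema format.
TOOL_SPEC = {
    "name": "get_weather",
    "required": {"city": str},
    "optional": {"unit": str},
}

def validate_tool_call(raw_call: str, spec: dict = TOOL_SPEC) -> bool:
    """Exact-adherence check: correct name, required args present with
    correct types, and no hallucinated arguments."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("name") != spec["name"]:
        return False
    args = call.get("arguments", {})
    allowed = {**spec["required"], **spec["optional"]}
    if any(k not in allowed for k in args):
        return False  # invented argument name
    return all(k in args and isinstance(args[k], t)
               for k, t in spec["required"].items())

print(validate_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
# True
print(validate_tool_call('{"name": "get_weather", "arguments": {"location": "Oslo"}}'))
# False: "location" is not in the schema
```

Scoring on this binary signal, rather than "roughly the right intent", is what catches the silent tool-call regressions described above.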

Multilingual quality

Multilingual behavior is likely called out because post-training can easily overfit to English-centric assistant behavior. If multilingual quality matters, it needs explicit representation in both the data mixture and the eval suite.

  • Keep multilingual supervised examples in the post-training mixture
  • Evaluate across representative languages, not just translated prompts
  • Check whether preference signals are implicitly biased toward English outputs
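The last bullet can be audited directly by slicing pairwise preference results by response language and comparing win rates. A minimal sketch; the record format and data are illustrative:

```python
from collections import defaultdict

# Each record: (response_language, model_won) from pairwise preference evals.
# Field layout and values are illustrative.
preferences = [
    ("en", True), ("en", True), ("en", False),
    ("de", False), ("de", True),
    ("ja", False), ("ja", False),
]

def win_rate_by_language(records):
    """Per-language win rate; a large gap suggests the preference
    signal is implicitly rewarding English-style outputs."""
    wins, totals = defaultdict(int), defaultdict(int)
    for lang, won in records:
        totals[lang] += 1
        wins[lang] += int(won)
    return {lang: wins[lang] / totals[lang] for lang in totals}

print(win_rate_by_language(preferences))
# e.g. English clearly ahead of German and Japanese in this toy sample
```

A persistent per-language gap is a signal to rebalance the preference data or reward model, not just the supervised mixture.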

Long-context and recall

The notes also point to long-context performance, likely around retrieval fidelity, instruction persistence, and quality degradation in longer windows. Maximum context length alone is not a meaningful success metric.

  • Measure retrieval accuracy inside long contexts
  • Check instruction persistence after large distractor spans
  • Separate recall tasks from true synthesis across long documents
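Retrieval accuracy inside long contexts is typically probed with needle-in-a-haystack prompts: a known fact is planted at a controlled depth inside distractor text, and the model is scored on recovering it. A construction sketch; all names and the filler text are illustrative:

```python
import random

def build_needle_prompt(needle: str, filler_sentences: list,
                        total_sentences: int, depth: float, seed: int = 0) -> str:
    """Place a 'needle' fact at a relative depth inside distractor text;
    depth=0.0 is the start of the context, 1.0 the end."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    position = int(depth * len(haystack))
    haystack.insert(position, needle)
    question = "What is the magic number mentioned in the text?"
    return " ".join(haystack) + "\n\n" + question

filler = ["The committee reviewed the quarterly report.",
          "Rainfall was slightly above the seasonal average."]
prompt = build_needle_prompt("The magic number is 7481.", filler,
                             total_sentences=200, depth=0.5)
# Score retrieval as: does "7481" appear in the model's answer?
```

Sweeping `depth` and `total_sentences` produces the familiar accuracy-by-position grid, which separates genuine long-range recall from lucky near-end placement.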

Data mixing, decontamination, and smaller sets

Some of the most operationally important notes concern the training mixture itself: decontamination, smaller but high-quality datasets, supervised warm starts, and preserving the right balance across capability areas.

  • Decontaminate aggressively for benchmark integrity
  • Track mixture weights across reasoning, tools, multilingual, and chat data
  • Use smaller curated datasets carefully and evaluate them with the right sampling logic
  • Prefer explicit dataset taxonomies over a single undifferentiated blob
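A common decontamination baseline is word-level n-gram overlap between training examples and benchmark items. A minimal sketch; n = 8 is a commonly used choice, and real pipelines add text normalization and fuzzier matching on top:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lowercased text, for overlap checks."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_example: str, benchmark_items: list, n: int = 8) -> bool:
    """Flag a training example that shares any n-gram with a benchmark item.
    The choice of n is illustrative; production pipelines also normalize text."""
    train_grams = ngrams(train_example, n)
    return any(train_grams & ngrams(item, n) for item in benchmark_items)

bench = ["what is the capital of france and when was it founded exactly"]
dirty = "q: what is the capital of france and when was it founded exactly a: paris"
clean = "summarize the history of paris in two sentences"
print(is_contaminated(dirty, bench))  # True
print(is_contaminated(clean, bench))  # False
```

Running this over each candidate dataset before it enters the mixture is what keeps "aggressive decontamination" an operational step rather than an aspiration.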

Serving-time details

The notes mention chat templates and vLLM compatibility, which is exactly the kind of implementation detail that causes silent quality regressions. Training-time and inference-time formatting assumptions need to match.

  • Align training and serving chat templates
  • Validate tool formatting in the actual runtime stack
  • Document BOS/EOS and special-token assumptions
  • Test the model in the target inference engine before calling a run successful
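One way to enforce the template-alignment bullet is a parity check: render the same conversation with the training-time formatter and the serving-time template and require byte-identical output. A self-contained sketch with illustrative token strings and template; in practice the serving render would come from the actual runtime stack (e.g. the chat template configured in vLLM), not be duplicated in test code:

```python
# Template-parity sketch: both renderers below are illustrative stand-ins
# for the real training formatter and serving-side chat template.

BOS, EOS = "<s>", "</s>"

def render_training(messages):
    """Training-time formatter (illustrative)."""
    out = BOS
    for m in messages:
        out += f"<|{m['role']}|>\n{m['content']}{EOS}"
    return out

def render_serving(messages):
    """Serving-time render; in practice produced by the inference engine's
    configured chat template, duplicated here only for the sketch."""
    return BOS + "".join(f"<|{m['role']}|>\n{m['content']}{EOS}" for m in messages)

conversation = [{"role": "user", "content": "hi"},
                {"role": "assistant", "content": "hello"}]

assert render_training(conversation) == render_serving(conversation), \
    "chat template mismatch between training and serving"
print("templates match byte-for-byte")
```

Byte-level comparison matters because the usual culprits (a doubled BOS, a missing trailing newline, a whitespace difference around role tags) are invisible in casual inspection but shift every token position the model sees.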

Suggested roadmap

  1. Define capability buckets: reasoning, chat, tools, multilingual, long-context, safety.
  2. Build the evaluation suite first so every training run is interpretable.
  3. Create a supervised post-training seed with high-quality formatting and instruction-following behavior.
  4. Add reasoning optimization while monitoring regressions outside reasoning benchmarks.
  5. Decontaminate and rebalance the data mixture whenever gains look narrow or suspicious.
  6. Validate the model using the exact chat template and inference stack intended for deployment.

In short: don’t optimize for a single impressive score. Optimize for a reliable system.