Primary goals
- Improve reasoning on hard tasks
- Preserve or improve instruction following
- Support accurate function calling and tool use
- Maintain strong conversational output quality
- Avoid multilingual collapse
A cleaned, public-facing synthesis of handwritten notes on LLM training. This page turns scattered raw ideas into a structured map covering post-training, evaluation, reasoning, instruction following, tool use, multilingual quality, long-context behavior, data curation, and serving-time reliability.
The source notes point toward a practical view of LLM development: strong product behavior is often shaped more by post-training and evaluation discipline than by raw pretraining scale alone. The important question is not just whether the model knows more, but whether it is consistently useful, controllable, benchmark-honest, tool-capable, and robust under real inference conditions.
This knowledge base organizes the notes around capability buckets so they can be reused as a training or post-training checklist rather than staying trapped as unstructured scratch paper.
The notes strongly suggest that post-training should be treated as a multi-objective optimization problem. A useful model needs reasoning quality, instruction following, response quality, tool correctness, and broad capability retention at the same time.
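One way to make the multi-objective framing operational is a release gate that requires every capability bucket to clear its own floor, rather than shipping on a headline average. The bucket names and thresholds below are illustrative, not from the notes:

```python
# Hypothetical release gate: a checkpoint ships only if every tracked
# capability clears its floor, not just the blended average.
CAPABILITY_FLOORS = {
    "reasoning": 0.60,
    "instruction_following": 0.80,
    "response_quality": 0.70,
    "tool_correctness": 0.85,
    "multilingual": 0.65,
}

def passes_release_gate(scores: dict[str, float]) -> bool:
    """True only when all tracked capabilities meet their floor."""
    return all(scores.get(cap, 0.0) >= floor
               for cap, floor in CAPABILITY_FLOORS.items())

checkpoint = {"reasoning": 0.72, "instruction_following": 0.83,
              "response_quality": 0.74, "tool_correctness": 0.88,
              "multilingual": 0.61}
print(passes_release_gate(checkpoint))  # False: multilingual regressed
```

A per-bucket gate like this catches the common failure mode where a strong reasoning run quietly trades away a secondary capability.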
Several benchmark names or fragments appear in the notes, including arena-style comparisons, hard knowledge and reasoning evaluations, math-heavy tasks, and pass@k-style scoring for small sample sets. The clear takeaway is that no single eval should dominate the training narrative.
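For the pass@k-style scoring mentioned above, the standard unbiased estimator (popularized by the Codex evaluation) avoids the high variance of naively sampling k generations:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        # Fewer incorrect samples than k: every k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 5))  # estimate from 10 generations, 3 correct
```

Generating n > k samples per problem and computing this estimator is especially important on small eval sets, where per-problem noise otherwise dominates.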
The notes appear to connect reasoning gains with reinforcement-style post-training, verifier- or reward-guided optimization, and DeepSeek-R1-style distinctions between pure "zero"-style RL from the base model and RL preceded by a supervised cold start.
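A minimal sketch of a verifier-style reward for math tasks, assuming (as a convention not stated in the notes) that the final answer is wrapped in `\boxed{...}`; the partial format credit is likewise an illustrative choice:

```python
import re

def verifier_reward(response: str, gold: str) -> float:
    """Toy verifiable reward: full credit for a correct final answer,
    a small format credit for any extractable answer, else zero.
    The \\boxed{...} convention is an assumption for this sketch."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no extractable final answer
    return 1.0 if match.group(1).strip() == gold.strip() else 0.1
```

Rewards that can be checked programmatically like this are what make RL-style post-training scale without a learned reward model in the loop.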
A strong model should not only solve hard tasks but also obey constraints, maintain structure, produce the expected format, and communicate cleanly. The notes seem to explicitly treat instruction following as a core target, which is the right framing for assistant-style systems.
Tool use appears as an explicit category in the notes. That matters because function calling often breaks when teams optimize only for free-form reasoning or preference wins.
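Function-calling regressions are cheap to catch with strict validation of emitted calls. A sketch against a single made-up tool schema (real systems would typically validate against JSON Schema):

```python
import json

# Hypothetical tool definition, not from the notes.
TOOL = {"name": "get_weather",
        "required": {"city"},
        "types": {"city": str, "unit": str}}

def valid_tool_call(raw: str) -> bool:
    """Reject calls that fail to parse, name the wrong tool, omit
    required arguments, or pass unknown or mistyped ones."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if call.get("name") != TOOL["name"]:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    if not TOOL["required"] <= set(args):
        return False
    return all(key in TOOL["types"] and isinstance(val, TOOL["types"][key])
               for key, val in args.items())
```

Running a validator like this over held-out tool-use prompts after every post-training stage makes format drift visible before it reaches production.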
Multilingual behavior is likely called out because post-training can easily overfit to English-centric assistant behavior. If multilingual quality matters, it needs explicit representation in both the data mixture and the eval suite.
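Explicit representation in the data mixture can be audited directly. A sketch, assuming a hypothetical per-example `lang` annotation and caller-declared minimum shares:

```python
from collections import Counter

def language_shares(examples: list[dict]) -> dict[str, float]:
    """Share of each language in a mixture; `lang` is an assumed
    per-example annotation field."""
    counts = Counter(ex["lang"] for ex in examples)
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}

def underrepresented(examples: list[dict],
                     floors: dict[str, float]) -> list[str]:
    """Languages whose mixture share falls below its declared floor."""
    shares = language_shares(examples)
    return [lang for lang, floor in floors.items()
            if shares.get(lang, 0.0) < floor]
```

Pairing a mixture audit like this with per-language eval scores is what keeps "avoid multilingual collapse" from being a goal with no measurement.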
The notes also point to long-context performance, likely around retrieval fidelity, instruction persistence, and degradation in longer windows. Max context length alone is not a meaningful success metric.
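One common way to measure retrieval fidelity across the window is a needle-in-a-haystack sweep: place a known fact at varying depths and context lengths, then score recall per cell. A sketch of the prompt construction (the filler/needle strings are placeholders):

```python
def needle_prompt(filler: str, needle: str, depth: float, length: int) -> str:
    """Build a haystack of ~`length` characters with a retrieval
    target inserted at relative `depth` (0.0 = start, 1.0 = end),
    so recall can be scored per depth and per context length."""
    haystack = (filler * (length // len(filler) + 1))[:length]
    cut = int(length * depth)
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]
```

Sweeping `depth` and `length` produces the familiar recall heatmap, which is far more informative than the advertised maximum context length.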
Some of the most operationally important notes concern the training mixture itself: decontamination, smaller but high-quality datasets, supervised warm starts, and preserving the right balance across capability areas.
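Decontamination is often implemented as n-gram overlap between training documents and eval sets. A minimal sketch; the choice of n = 8 tokens is a common but arbitrary default here:

```python
def ngrams(text: str, n: int) -> set:
    """Whitespace-tokenized, lowercased n-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlaps_eval(train_doc: str, eval_doc: str, n: int = 8) -> bool:
    """Flag a training document that shares any n-token span with an
    eval document, a standard (if blunt) decontamination signal."""
    return bool(ngrams(train_doc, n) & ngrams(eval_doc, n))
```

In practice the eval-side n-grams are precomputed into one set so each training document is checked in a single pass.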
The notes mention chat templates and vLLM compatibility, which is exactly the kind of implementation detail that causes silent quality regressions. Training-time and inference-time formatting assumptions need to match.
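The cheapest guard is a byte-exact comparison between the training-time and serving-time renderings of the same conversation. The template strings below are invented to show the shape of the check, not any real stack's format:

```python
# Hypothetical templates: the training pipeline and the serving stack
# each render the same conversation; any byte-level drift is a bug.
def render_train(messages: list[dict]) -> str:
    return "".join(f"<|{m['role']}|>\n{m['content']}<|end|>"
                   for m in messages)

def render_serve(messages: list[dict]) -> str:
    # e.g. a serving config that silently appends a newline per turn
    return "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n"
                   for m in messages)

def templates_match(messages: list[dict]) -> bool:
    """True iff both stacks produce identical bytes for the input."""
    return render_train(messages) == render_serve(messages)
```

Running this comparison in CI against a handful of representative conversations (system prompt, multi-turn, tool calls) catches exactly the silent regressions the notes warn about.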
In short: don’t optimize for a single impressive score. Optimize for a reliable system.