Shekkizh Knowledge Base
Execution plan

Single-Node 8-GPU Nemotron Plan

A practical plan for understanding LLM pretraining with NVIDIA’s stack on one machine with eight GPUs. The emphasis here is not on reproducing a frontier model exactly, but on building a setup that teaches the right abstractions clearly.

1B–3B dense · Megatron-Bridge · Nemotron tokenizer · torchrun · BF16

01

Main idea

The right way to learn this stack is to keep the architecture simple while keeping the training mechanics real. That means a small dense model, real distributed launch, real tokenized data, real checkpoints, and a real evaluation loop.

02

What to optimize for

  • Clarity over feature count
  • Fast iteration over maximal scale
  • Fewer moving parts over architectural novelty
  • Understanding loss, throughput, and memory over copying a flagship recipe verbatim

03

The recommended stack

Nemotron

Use as the reference blueprint for how the full set of training stages fits together.

Megatron-Bridge

Use as the actual pretraining implementation.

Megatron-LM

Use as the internals and parallelism guide.

04

The first model to train

The best first run is a ~1B dense decoder-only model. A ~3B dense model is the best second run. Moving directly into a sparse or hybrid Nemotron-style model makes it harder to isolate what is happening during training.
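
To make the ~1B and ~3B targets concrete, here is a minimal sketch of a parameter-count estimate for a dense decoder-only transformer. It assumes standard multi-head attention and a non-gated MLP with 4x expansion, which gives roughly 12·h² weights per layer; gated MLPs or grouped-query attention would shift the numbers. The layer counts, hidden sizes, and the vocabulary size are hypothetical placeholders, not values taken from any Nemotron recipe or tokenizer.

```python
def dense_decoder_params(num_layers: int, hidden_size: int, vocab_size: int,
                         tied_embeddings: bool = True) -> int:
    """Rough parameter count for a dense decoder-only transformer.

    Assumes 4*h^2 attention weights and 8*h^2 MLP weights per layer
    (non-gated MLP, 4x expansion). Biases and norm parameters are ignored.
    """
    per_layer = 12 * hidden_size ** 2
    embedding = vocab_size * hidden_size
    if not tied_embeddings:
        embedding *= 2  # separate input embedding and output projection
    return num_layers * per_layer + embedding


# Hypothetical shapes, chosen only to land near the 1B and 3B targets.
# vocab_size is a placeholder, not the real tokenizer's vocabulary size.
small = dense_decoder_params(num_layers=16, hidden_size=2048, vocab_size=128_000)
large = dense_decoder_params(num_layers=24, hidden_size=3072, vocab_size=128_000)
print(f"first run: {small / 1e9:.2f}B params, second run: {large / 1e9:.2f}B params")
```

Running the same function against the actual recipe's shapes is a quick sanity check that the configured model is the size you think it is.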

05

Tokenizer choice

Reusing the Nemotron Nano tokenizer is a sensible choice. It reduces setup complexity and keeps the early experiments focused on the trainer, the data pipeline, and the distributed configuration rather than tokenizer design.

06

Data strategy

Use a clean, modest public corpus for the first run. This should be large enough to show genuine training dynamics, but small enough that iteration stays fast and failure cases stay interpretable.

  • Avoid ambitious web-scale curation at the beginning.
  • Avoid mixing too many dataset types in the first pass.
  • Add Curator later if the goal shifts to studying data quality effects.

07

First parallelism plan

  • TP=2, PP=1, DP=4 — best first default
  • TP=4, PP=1, DP=2 — useful if memory pressure rises
  • PP>1 only if model size forces it

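TP, PP, and DP are the tensor-, pipeline-, and data-parallel degrees; on a single 8-GPU node their product must equal 8. Pipeline parallelism is powerful, but it adds conceptual overhead, so there is no need to pay that cost unless memory makes it necessary.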
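
The layouts above follow from simple arithmetic, sketched below. The data-parallel degree is whatever is left of the 8 GPUs after tensor and pipeline parallelism take their share. The memory helper is only a rule of thumb: it assumes about 16 bytes of weights, gradients, and Adam state per parameter for mixed-precision training, splits that evenly across TP and PP ranks, and ignores activations and any optimizer-state sharding. Both function names are illustrative, not part of Megatron-Bridge.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Return the DP degree implied by a world size and TP/PP degrees."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP={model_parallel}"
        )
    return world_size // model_parallel


def model_state_gib_per_gpu(num_params: float, tp: int, pp: int,
                            bytes_per_param: int = 16) -> float:
    """Rough per-GPU memory for weights, grads, and optimizer state.

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision Adam.
    Activations and optimizer-state sharding are not modelled.
    """
    return num_params * bytes_per_param / (tp * pp) / 2**30


for tp, pp in [(2, 1), (4, 1)]:
    dp = data_parallel_size(world_size=8, tp=tp, pp=pp)
    gib = model_state_gib_per_gpu(1e9, tp, pp)
    print(f"TP={tp} PP={pp} DP={dp}: ~{gib:.1f} GiB of model state per GPU for 1B params")
```

Doubling TP halves the estimated model state per GPU but also halves DP, which is the trade-off behind the second layout.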

08

Launch philosophy

Start with direct torchrun, not an orchestration layer. Orchestration is valuable later, but the first milestone should be a fully understood direct launch that can be modified by hand.
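
As a sketch of what a direct launch amounts to, the snippet below builds a single-node torchrun command line. The `--standalone`, `--nnodes`, and `--nproc_per_node` flags are standard torchrun options; the script name `pretrain.py` and its `--config my_recipe.yaml` argument are hypothetical stand-ins for whatever entry point and arguments the chosen recipe actually uses.

```python
import shlex


def torchrun_command(script: str, script_args: list[str],
                     gpus_per_node: int = 8) -> list[str]:
    """Build a single-node torchrun invocation as an argv list."""
    return [
        "torchrun",
        "--standalone",                       # single-node rendezvous, no external store
        "--nnodes=1",
        f"--nproc_per_node={gpus_per_node}",  # one process per GPU
        script,
        *script_args,
    ]


# Hypothetical script and arguments, for illustration only.
cmd = torchrun_command("pretrain.py", ["--config", "my_recipe.yaml"])
print(shlex.join(cmd))
```

Keeping the launch this small means every flag is something you chose and can change by hand, which is the point of deferring orchestration.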

09

Phase 0: smoke test

  1. Start from a Megatron-Bridge quickstart pretraining example.
  2. Run on mock data.
  3. Verify that all 8 processes launch correctly.
  4. Check that loss decreases and checkpoints write successfully.
  5. Keep this run intentionally short.
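
For the "loss decreases" check in the steps above, a small helper like the following can be run over the logged loss values. It is a minimal sketch that compares the mean of the first and last few steps; how the losses are extracted from the logs is left out, and the window size and threshold are assumptions to tune.

```python
from statistics import mean


def loss_decreased(losses: list[float], window: int = 10,
                   min_drop: float = 0.0) -> bool:
    """Return True if the mean loss over the last `window` steps is lower
    than the mean over the first `window` steps by more than `min_drop`."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    return mean(losses[:window]) - mean(losses[-window:]) > min_drop


# Synthetic values for illustration, not real training output.
falling = [10.0 - 0.05 * step for step in range(100)]
flat = [5.0] * 100
print(loss_decreased(falling), loss_decreased(flat))
```

Averaging over a window rather than comparing two single steps keeps the check from being fooled by step-to-step noise.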

10

Phase 1: short real run

  1. Swap in the real tokenizer.
  2. Point the recipe at a modest real dataset.
  3. Run a few thousand steps.
  4. Track loss, throughput, memory, checkpoint size, and startup overhead.

This is where the training loop starts becoming real rather than hypothetical.
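
Throughput is easiest to compare across runs when it is expressed in tokens per second rather than steps per second. The sketch below derives it from global batch size, sequence length, and measured step time, and adds a rough achieved-FLOP/s figure using the common approximation of 6 FLOPs per parameter per token for a dense model. All input numbers are made up for illustration.

```python
def tokens_per_second(global_batch_size: int, seq_length: int,
                      step_time_s: float) -> float:
    """Tokens processed per second across the whole job."""
    return global_batch_size * seq_length / step_time_s


def approx_tflops_per_gpu(num_params: float, tokens_per_s: float,
                          num_gpus: int = 8) -> float:
    """Rough achieved TFLOP/s per GPU, assuming ~6 FLOPs per parameter
    per token (forward + backward) for a dense model."""
    return 6 * num_params * tokens_per_s / num_gpus / 1e12


# Made-up measurements: 256 sequences of 4096 tokens per step, 4 s per step.
tps = tokens_per_second(global_batch_size=256, seq_length=4096, step_time_s=4.0)
print(f"{tps:,.0f} tokens/s total, {tps / 8:,.0f} tokens/s per GPU")
print(f"~{approx_tflops_per_gpu(1e9, tps):.1f} TFLOP/s per GPU for a 1B model")
```

Recording these two numbers next to peak memory for every run gives the ablations later on a common baseline.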

11

Phase 2: ablations

Change one variable at a time and record the result:

  • micro batch size
  • global batch size
  • sequence length
  • activation checkpointing
  • tensor parallel degree
  • optimizer and scheduler settings

This is where understanding deepens. The point is not just to run training, but to see how each lever changes behavior.
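
One way to keep the ablations honest is to generate each run from a fixed baseline and change exactly one key at a time, as in the sketch below. The config keys and values are hypothetical names for illustration, not Megatron-Bridge fields. The second helper uses the usual relationship global batch = micro batch × DP × gradient-accumulation steps, which shows why changing the micro batch size or TP degree changes the accumulation count when the global batch is held fixed.

```python
# Hypothetical baseline; key names are illustrative only.
BASELINE = {
    "micro_batch_size": 2,
    "global_batch_size": 256,
    "seq_length": 4096,
    "tensor_parallel": 2,
    "activation_checkpointing": False,
}
SWEEPS = {
    "micro_batch_size": [1, 4],
    "seq_length": [2048, 8192],
    "tensor_parallel": [4],
}


def one_at_a_time(baseline: dict, sweeps: dict) -> list[tuple[str, dict]]:
    """Return (name, config) pairs that each differ from baseline in one key."""
    runs = [("baseline", dict(baseline))]
    for key, values in sweeps.items():
        for value in values:
            if value == baseline[key]:
                continue  # skip values that would duplicate the baseline
            config = dict(baseline)
            config[key] = value
            runs.append((f"{key}={value}", config))
    return runs


def grad_accum_steps(config: dict, world_size: int = 8, pp: int = 1) -> int:
    """Gradient-accumulation steps implied by the batch sizes and layout."""
    dp = world_size // (config["tensor_parallel"] * pp)
    per_step = config["micro_batch_size"] * dp
    if config["global_batch_size"] % per_step != 0:
        raise ValueError("global batch must be divisible by micro batch * DP")
    return config["global_batch_size"] // per_step


runs = one_at_a_time(BASELINE, SWEEPS)
for name, config in runs:
    print(f"{name}: grad accumulation steps = {grad_accum_steps(config)}")
```

Naming each run after the single key it changes makes the results table self-explanatory when it is read weeks later.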

12

What not to do first

  • Do not start with MoE.
  • Do not start with hybrid Mamba/Transformer architectures.
  • Do not start with long-context specialization.
  • Do not start with RL stages.
  • Do not start with elaborate orchestration unless you already understand the direct launch path.

13

What the final result should look like

By the end of this project, the useful output is not just a checkpoint. It is a documented understanding of how the stack works:

  • a reproducible 8-GPU pretraining recipe
  • a clean record of the model, tokenizer, data, and launch choices
  • notes on memory, throughput, and scaling behavior
  • a clear path from the simple dense run back toward the richer Nemotron recipes