Single-Node 8-GPU Nemotron Plan · Shekkizh Knowledge Base

Main idea

The right way to learn this stack is to keep the architecture simple while keeping the training mechanics real. That means a small dense model, real distributed launch, real tokenized data, real checkpoints, and a real evaluation loop.

What to optimize for

Clarity over feature count
Fast iteration over maximal scale
Fewer moving parts over architectural novelty
Understanding loss, throughput, and memory over copying a flagship recipe verbatim

The recommended stack

Nemotron

Use it as the reference blueprint for the full stage structure.

Megatron-Bridge

Use it as the actual pretraining implementation.

Megatron-LM

Use it as the guide to internals and parallelism.

The first model to train

The best first run is a ~1B dense decoder-only model. A ~3B dense model is the best second run. Moving directly into a sparse or hybrid Nemotron-style model makes it harder to isolate what is happening during training.
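To make the ~1B and ~3B targets concrete, here is a rough parameter-count sketch. It assumes a standard dense decoder block (multi-head attention plus a 4x-wide non-gated MLP, tied embeddings) and uses hypothetical layer counts, hidden sizes, and vocabulary size that were picked only to land near those targets. None of these shapes come from a Nemotron or Megatron-Bridge recipe, and gated MLPs or grouped-query attention would shift the totals.

```python
def estimate_dense_params(num_layers: int, hidden_size: int, vocab_size: int,
                          tie_embeddings: bool = True) -> int:
    """Rough parameter count for a standard dense decoder-only transformer.

    Assumption: attention contributes about 4*h^2 weights per layer and a
    4x-wide non-gated MLP about 8*h^2. Biases and layer norms are ignored.
    """
    per_layer = 12 * hidden_size ** 2
    embedding = vocab_size * hidden_size
    output_head = 0 if tie_embeddings else vocab_size * hidden_size
    return num_layers * per_layer + embedding + output_head


# Hypothetical shapes, chosen only to land near the ~1B and ~3B targets.
small = estimate_dense_params(num_layers=16, hidden_size=2048, vocab_size=131072)
larger = estimate_dense_params(num_layers=32, hidden_size=2560, vocab_size=131072)
print(f"first run: {small / 1e9:.2f}B params, second run: {larger / 1e9:.2f}B params")
```

With a large vocabulary, the embedding table is a noticeable share of a ~1B model, which is worth keeping in mind when comparing against published parameter counts.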

Tokenizer choice

Reusing the Nemotron Nano tokenizer is a sensible choice. It reduces setup complexity and keeps the early experiments focused on the trainer, the data pipeline, and the distributed configuration rather than tokenizer design.

Data strategy

Use a clean, modest public corpus for the first run. The corpus should be large enough to show genuine training dynamics but small enough that iteration stays fast and failures stay interpretable.

Avoid ambitious web-scale curation at the beginning.
Avoid mixing too many dataset types in the first pass.
Add Curator later if the goal shifts to studying data quality effects.

First parallelism plan

TP=2, PP=1, DP=4 — best first default
TP=4, PP=1, DP=2 — useful if memory pressure rises
PP>1 only if model size forces it

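In each layout, TP × PP × DP equals the 8 available GPUs. Pipeline parallelism is powerful, but it adds conceptual overhead (stage partitioning, microbatch scheduling, and pipeline bubbles). There is no need to pay that cost unless memory makes it necessary.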
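The layouts above follow from one constraint: the data-parallel degree is whatever remains after tensor and pipeline parallelism have divided the GPUs. A minimal sketch of that arithmetic, with a hypothetical helper name that is not part of Megatron-LM or Megatron-Bridge:

```python
def data_parallel_size(world_size: int, tp: int, pp: int = 1) -> int:
    """Derive DP from the GPU count and the model-parallel degrees."""
    model_parallel = tp * pp
    if model_parallel <= 0 or world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP={model_parallel} does not evenly divide world size {world_size}"
        )
    return world_size // model_parallel


# The two layouts suggested for a single 8-GPU node.
for tp, pp in [(2, 1), (4, 1)]:
    dp = data_parallel_size(world_size=8, tp=tp, pp=pp)
    print(f"TP={tp}, PP={pp}, DP={dp}")
```

Checking this before launch catches layouts such as TP=3 on 8 GPUs, which cannot be partitioned evenly.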

Launch philosophy

Start with direct torchrun, not an orchestration layer. Orchestration is valuable later, but the first milestone should be a fully understood direct launch that can be modified by hand.
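As an illustration of how little a direct launch needs, the sketch below assembles a single-node torchrun command. The `--standalone` and `--nproc_per_node` options are standard torchrun flags; the script name and its arguments are placeholders and do not refer to a real Megatron-Bridge entry point.

```python
import shlex


def build_torchrun_command(script: str, gpus: int = 8, script_args=()) -> list:
    """Assemble a single-node torchrun invocation as an argv list."""
    return [
        "torchrun",
        "--standalone",               # single node, no external rendezvous server
        f"--nproc_per_node={gpus}",   # one worker process per GPU
        script,
        *script_args,
    ]


# "pretrain_small_dense.py" and "--config" are hypothetical placeholders.
cmd = build_torchrun_command("pretrain_small_dense.py",
                             script_args=["--config", "small_dense.yaml"])
print(shlex.join(cmd))
```

Printing the command rather than hiding it behind a wrapper keeps every flag visible and easy to edit by hand, which is the point of this stage.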

Phase 0: smoke test

Use a Megatron-Bridge quickstart pretrain example.
Run on mock data.
Verify that all 8 processes launch correctly.
Check that loss decreases and checkpoints write successfully.
Keep this run intentionally short.
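Two of the checks above are easy to automate. The sketch below reads the RANK, LOCAL_RANK, and WORLD_SIZE variables that torchrun sets for each worker, and compares early and late loss values. The helper names, the window size, and the sample loss values are illustrative assumptions, not output from a real run.

```python
import os


def read_torchrun_env(env=None, expected_world_size: int = 8):
    """Return (rank, local_rank, world_size) and fail fast on a bad launch."""
    env = os.environ if env is None else env
    rank = int(env["RANK"])
    local_rank = int(env["LOCAL_RANK"])
    world_size = int(env["WORLD_SIZE"])
    if world_size != expected_world_size:
        raise RuntimeError(f"expected {expected_world_size} processes, got {world_size}")
    if not 0 <= rank < world_size:
        raise RuntimeError(f"rank {rank} is outside [0, {world_size})")
    return rank, local_rank, world_size


def loss_decreased(losses, window: int = 10) -> bool:
    """Compare the mean of the first and last `window` logged losses."""
    if len(losses) < 2 * window:
        raise ValueError("not enough logged steps to compare")
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    return end < start


# Made-up values, only to show the call shape.
fake_env = {"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"}
print(read_torchrun_env(fake_env))
print(loss_decreased([10.8, 10.5, 9.9, 9.1, 8.4, 8.0], window=2))
```

On mock data the loss curve carries little meaning beyond "it goes down", so a coarse comparison like this is enough for the smoke test.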

Phase 1: short real run

Swap in the real tokenizer.
Point the recipe at a modest real dataset.
Run a few thousand steps.
Track loss, throughput, memory, checkpoint size, and startup overhead.

This is where the training loop starts becoming real rather than hypothetical.
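Throughput is easiest to compare across runs when it is expressed in tokens per second rather than steps per second, because step size changes with batch size and sequence length. A minimal sketch of the conversion, using made-up inputs rather than measured numbers:

```python
def throughput(global_batch_size: int, seq_length: int, step_seconds: float,
               num_gpus: int = 8) -> dict:
    """Convert a measured step time into token throughput."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    tokens_per_step = global_batch_size * seq_length
    tokens_per_second = tokens_per_step / step_seconds
    return {
        "tokens_per_step": tokens_per_step,
        "tokens_per_second": tokens_per_second,
        "tokens_per_second_per_gpu": tokens_per_second / num_gpus,
    }


# Illustrative inputs only; substitute the step time reported by the trainer.
stats = throughput(global_batch_size=256, seq_length=4096, step_seconds=4.0)
print(stats)
```

Logging the per-GPU figure alongside peak memory for each run makes the later ablations directly comparable.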

Phase 2: ablations

Change one variable at a time and record the result:

micro batch size
global batch size
sequence length
activation checkpointing
tensor parallel degree
optimizer and scheduler settings

This is where understanding deepens. The point is not just to run training, but to see how each lever changes behavior.
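One way to enforce the one-variable-at-a-time rule is to generate every ablation from a single frozen baseline. The sketch below does that with a dataclass. The field names and default values are hypothetical and do not mirror Megatron-Bridge configuration keys.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical baseline values; replace with the settings of the Phase 1 run.
    micro_batch_size: int = 2
    global_batch_size: int = 256
    seq_length: int = 4096
    activation_checkpointing: bool = False
    tensor_parallel: int = 2
    learning_rate: float = 3e-4


def one_at_a_time(baseline: RunConfig, sweeps: dict) -> list:
    """Build (name, config) pairs that each differ from the baseline in one field."""
    runs = []
    for field_name, values in sweeps.items():
        for value in values:
            if value == getattr(baseline, field_name):
                continue  # the baseline itself already covers this value
            runs.append((f"{field_name}={value}", replace(baseline, **{field_name: value})))
    return runs


def changed_fields(baseline: RunConfig, other: RunConfig) -> list:
    """List the field names whose values differ between two configs."""
    return [f.name for f in fields(baseline)
            if getattr(baseline, f.name) != getattr(other, f.name)]


baseline = RunConfig()
runs = one_at_a_time(baseline, {
    "micro_batch_size": [1, 2, 4],
    "seq_length": [2048, 4096, 8192],
    "tensor_parallel": [2, 4],
})
for name, cfg in runs:
    print(name, changed_fields(baseline, cfg))
```

Because the baseline is frozen and each variant is created with `replace`, a run cannot silently drift in a second setting, and the run name doubles as the record of what changed.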

What not to do first

Do not start with MoE.
Do not start with hybrid Mamba/Transformer architectures.
Do not start with long-context specialization.
Do not start with RL stages.
Do not start with elaborate orchestration unless you already understand the direct launch path.

What the final result should look like

By the end of this project, the useful output is not just a checkpoint. It is a documented understanding of how the stack works:

a reproducible 8-GPU pretraining recipe
a clean record of the model, tokenizer, data, and launch choices
notes on memory, throughput, and scaling behavior
a clear path from the simple dense run back toward the richer Nemotron recipes
