Research Problems in Pretraining

June 3, 2026 · 2 min · Shurui Liu

Speaker Sheng Zha's notes are available online.

What is pretraining: next token prediction

Next-token prediction is prequential coding, an online compression algorithm that upper bounds the Kolmogorov complexity of data.

learning pipeline
prequential code length
learned prediction model
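
As a rough illustration of prequential coding, the sketch below encodes a sequence online: each symbol is charged $-\log_2 p$ bits under a model fitted only on the symbols before it, and the model is updated afterwards. The add-alpha frequency model and the names `prequential_code_length`, `alphabet_size`, and `alpha` are assumptions chosen for brevity; in pretraining the predictor would be the language model itself.

```python
import math
from collections import Counter


def prequential_code_length(symbols, alphabet_size, alpha=1.0):
    """Return the total bits needed to encode `symbols` online.

    Each symbol is coded with a model fitted on the prefix only,
    then the model is updated with that symbol (predict, then learn).
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for s in symbols:
        # Predictive probability from the prefix (add-alpha smoothing).
        p = (counts[s] + alpha) / (seen + alpha * alphabet_size)
        total_bits += -math.log2(p)
        # Update the model only after paying the coding cost.
        counts[s] += 1
        seen += 1
    return total_bits


# Toy inputs (illustrative only): a highly regular string and a more varied one.
repetitive = "a" * 64
mixed = "abcd" * 16
bits_repetitive = prequential_code_length(repetitive, alphabet_size=4)
bits_mixed = prequential_code_length(mixed, alphabet_size=4)
print(f"{bits_repetitive:.1f} bits vs {bits_mixed:.1f} bits")
```

The more predictable string gets the shorter code, which is the sense in which a better next-token predictor is a better compressor.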

Loss spikes lower the efficiency of pretraining. Read the log for the datasets used just before the spike (microwave gun) and check whether the spike comes from a dataset quality issue.
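
One way to make that log check concrete is sketched below. It assumes a hypothetical log with one record per step holding `step`, `loss`, and `dataset` fields; the values are made up for illustration, and the median-based threshold (`window`, `ratio`) is an assumed heuristic, not a method from the talk.

```python
from collections import Counter
from statistics import median


def find_spikes(losses, window=5, ratio=1.5):
    """Return indices whose loss exceeds `ratio` times the median of the previous `window` steps."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = median(losses[i - window:i])
        if losses[i] > ratio * baseline:
            spikes.append(i)
    return spikes


def datasets_before(log, spike_index, window=3):
    """Count which datasets fed the `window` steps leading up to a spike."""
    start = max(0, spike_index - window)
    return Counter(entry["dataset"] for entry in log[start:spike_index])


# Hypothetical training log: losses and dataset names are invented.
losses = [3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 2.5, 6.0, 2.6, 2.5]
sources = ["web", "web", "code", "web", "code", "forum", "forum", "forum", "web", "code"]
log = [
    {"step": i, "loss": loss, "dataset": name}
    for i, (loss, name) in enumerate(zip(losses, sources))
]

spikes = find_spikes([entry["loss"] for entry in log])
for idx in spikes:
    print(log[idx]["step"], datasets_before(log, idx).most_common())
```

A dataset that keeps showing up right before spikes is a candidate for a closer quality review; a single co-occurrence proves nothing on its own.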

Ingredients:

data: Allen-Zhu et al.: junk data significantly harms an LLM's knowledge capacity on good data (sometimes by $20\times$); when the domain name is added to the front of every pretraining data paragraph, the LLM can automatically detect domains rich in high-quality knowledge and prioritize them.
model architecture: there are many architecture designs.
optimizer: AdamW, Muon.
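
The domain-name trick from the data item can be sketched as a small preprocessing step. The function name `tag_with_domain`, the `domain paragraph` prefix format, and the example URL are assumptions for illustration; the notes do not specify the exact formatting.

```python
from urllib.parse import urlparse


def tag_with_domain(url, text):
    """Prefix every non-empty paragraph of `text` with the domain of `url`."""
    domain = urlparse(url).netloc.lower()
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Assumed format: "<domain> <paragraph>", one prefix per paragraph.
    return "\n\n".join(f"{domain} {p}" for p in paragraphs)


# Hypothetical document from a placeholder site.
doc = "First paragraph.\n\nSecond paragraph."
tagged = tag_with_domain("https://example.org/articles/1", doc)
print(tagged)
```

With the source visible in the context, the model has a signal it can use to weight high-quality domains differently from junk ones.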

Foundation:

Scaling laws: empirical laws that describe how model performance reliably improves as you increase compute.
Principled scaling: [Greg Yang et al] a spectral condition for feature learning and muP variants.
Systems
What gives rise to scaling laws?
What leads to emergent properties?
What leads to hyperparameter transfer?
What model architecture compresses the best?
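
To make the scaling-law item concrete, here is a minimal sketch that fits a pure power law $L \approx a \cdot C^{-b}$ to (compute, loss) pairs by least squares in log-log space. The data points are synthetic, generated from assumed constants `true_a` and `true_b`, and the functional form omits any irreducible-loss offset, which is a simplifying assumption.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by ordinary least squares on the logs."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # log L = log a - b * log C, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic points from an assumed law; not measurements from any real run.
true_a, true_b = 50.0, 0.05
compute = [10.0 ** k for k in range(18, 23)]
loss = [true_a * c ** (-true_b) for c in compute]

a, b = fit_power_law(compute, loss)
extrapolated = a * (1e23) ** (-b)  # predicted loss at a larger budget
print(f"a={a:.3f}, b={b:.4f}, loss at 1e23 FLOPs ~ {extrapolated:.3f}")
```

On real runs the points are noisy, and the reliability of the extrapolation is exactly what the open questions above are asking about.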
