Research Problems in Pretraining

Shurui Liu

Speaker Sheng Zha's notes are available online.

What is pretraining: next token prediction

Next-token prediction is prequential coding: an online compression algorithm whose code length upper-bounds the Kolmogorov complexity of the data.

  1. learning pipeline
  2. prequential code length
  3. learned prediction model
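To make the prequential code length concrete, here is a minimal sketch that encodes a sequence online: each symbol is predicted before it is seen, its cost is $-\log_2 p$ bits, and only then is the model updated. The Laplace-smoothed count model and the name `prequential_code_length` are illustrative assumptions standing in for the learned prediction model; they are not from the talk.

```python
import math
from collections import Counter


def prequential_code_length(sequence, alphabet):
    """Bits needed to encode `sequence` online with an add-one-smoothed count model.

    Assumption: a unigram count model stands in for the learned predictor.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    k = len(alphabet)
    for symbol in sequence:
        # Predict first, using only the symbols encoded so far.
        p = (counts[symbol] + 1) / (seen + k)
        total_bits += -math.log2(p)
        # Then update the model on the symbol just encoded.
        counts[symbol] += 1
        seen += 1
    return total_bits


if __name__ == "__main__":
    # A fixed uniform code over {a, b} would need 4 bits for "aaaa";
    # the adaptive model needs log2(5), about 2.32 bits.
    print(prequential_code_length("aaaa", "ab"))
```

The better the model predicts upcoming symbols, the shorter the total code, which is why lower next-token loss corresponds to better compression.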

Loss spikes lower the efficiency of pretraining. When one occurs, read the log to see which datasets were used just before the spike (microwave gun), and check whether it is a dataset quality issue.
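One way to act on this is sketched below: flag steps whose loss jumps well above a recent running average, then list the data sources that fed the preceding steps. The log format (one dataset name per step), the function names, and the threshold values are all hypothetical choices for illustration.

```python
def find_spikes(losses, window=5, threshold=1.5):
    """Return step indices whose loss exceeds `threshold` times the mean
    of the previous `window` steps (a simple heuristic, not a standard)."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = sum(losses[step - window:step]) / window
        if losses[step] > threshold * baseline:
            spikes.append(step)
    return spikes


def datasets_before(spike_step, batch_log, lookback=3):
    """Dataset names for the spike step and the `lookback` steps before it.

    Assumption: batch_log[i] names the dataset that supplied step i.
    """
    start = max(0, spike_step - lookback)
    return batch_log[start:spike_step + 1]


if __name__ == "__main__":
    losses = [2.0, 1.9, 1.8, 1.7, 1.6, 4.0, 1.6]
    batch_log = ["web", "web", "code", "web", "forum", "forum", "web"]
    for step in find_spikes(losses):
        print(step, datasets_before(step, batch_log))
```

If the same source keeps appearing right before spikes, that points toward a data quality problem rather than an optimization one.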

Ingredients:

  1. data: Allen-Zhu et al.: junk data significantly harms an LLM's knowledge capacity on good data (sometimes by $20\times$); if the domain name is added at the front of all pretraining data paragraphs, the LLM can automatically detect domains rich in high-quality knowledge and prioritize them.

  2. model architecture: there are many architecture designs.

  3. optimizer: AdamW, Muon.
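The domain-name trick from the data item can be sketched in a few lines. The exact tagging format is an assumption here (the domain on its own line ahead of the paragraph), as is the helper name `prepend_domain`; only `urllib.parse.urlparse` is a real library call.

```python
from urllib.parse import urlparse


def prepend_domain(paragraph, source_url):
    """Put the source's domain name in front of a pretraining paragraph.

    Assumption: the domain goes on its own line; the notes do not
    specify the exact format.
    """
    domain = urlparse(source_url).netloc
    return f"{domain}\n{paragraph}"


if __name__ == "__main__":
    text = "The Kolmogorov complexity of a string is the length of its shortest description."
    print(prepend_domain(text, "https://en.wikipedia.org/wiki/Kolmogorov_complexity"))
```

With the source visible in context, the model has a signal it can use to weight knowledge from some domains more heavily than others.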

Foundation:

  1. Scaling laws: empirical laws that describe how model performance reliably improves as you increase compute.

  2. Principled scaling: [Greg Yang et al.] a spectral condition for feature learning and muP variants.

  3. Systems

  4. What gives rise to scaling laws?

  5. What leads to emergent properties?

  6. What leads to hyperparameter transfer?

  7. What model architecture compresses the best?
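As an illustration of the scaling-laws item above, the sketch below fits a pure power law $L = a\,C^{-b}$ to (compute, loss) pairs by least squares in log-log space. The functional form is a simplifying assumption (fits in practice often add an irreducible-loss term), the function name is hypothetical, and the data points are synthetic, generated from known parameters only to show that the fit recovers them.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by linear regression on logs.

    Returns (a, b). Assumes a pure power law with no irreducible-loss term.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic data from a = 10, b = 0.05 (not real measurements).
    compute = [1e18, 1e19, 1e20, 1e21]
    loss = [10.0 * c ** -0.05 for c in compute]
    a, b = fit_power_law(compute, loss)
    print(a, b)
```

A straight line in log-log space is what makes extrapolating from small runs to large ones plausible, and why deviations from it are worth investigating.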