Research Problems in Pretraining
Speaker Sheng Zha's notes are available online.
What is pretraining: next token prediction
Next-token prediction is prequential coding, an online compression algorithm whose code length upper-bounds the Kolmogorov complexity of the data.
- learning pipeline
- prequential code length
- learned prediction model
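The prequential code length can be illustrated with a minimal sketch. It assumes a toy stand-in for the learned prediction model: an adaptive, Laplace-smoothed symbol-frequency model, not an LLM. The function name `prequential_code_length` and the example sequences are hypothetical. Each symbol is charged $-\log_2 p(x_t \mid x_{<t})$ bits under the current model, and only then is the model updated on it.

```python
import math
from collections import Counter

def prequential_code_length(seq, alphabet):
    """Bits needed to encode seq online with an adaptive Laplace-smoothed model."""
    counts = Counter()
    total_bits = 0.0
    for t, symbol in enumerate(seq):
        # Predict first: the probability uses only the symbols seen so far.
        p = (counts[symbol] + 1) / (t + len(alphabet))
        total_bits -= math.log2(p)
        # Learn second: update the model after the symbol has been paid for.
        counts[symbol] += 1
    return total_bits

regular_bits = prequential_code_length("a" * 64, "ab")    # highly regular data
balanced_bits = prequential_code_length("ab" * 32, "ab")  # this toy model cannot see the alternation
uniform_bits = 64 * math.log2(2)                          # cost with no learning at all

print(f"regular: {regular_bits:.2f} bits, balanced: {balanced_bits:.2f} bits, "
      f"uniform baseline: {uniform_bits:.2f} bits")
```

A model that learns the regularity quickly pays far fewer bits than the uniform baseline. A model that cannot capture the structure, as with the alternating sequence here, pays slightly more than the baseline. That difference is the sense in which a better learner is a better compressor.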
Loss spikes lower the efficiency of pretraining. Read the log for the datasets used just before the spike (microwave gun), and check whether the spike is a dataset quality issue.
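The log-reading step might be sketched as follows. The log format (a list of `(step, loss, dataset)` records), the dataset names, the loss values, and the thresholds are all hypothetical; a real training log will look different. The heuristic flags a step whose loss exceeds a multiple of the median of the preceding window, then counts which datasets fed the steps leading up to it.

```python
from collections import Counter
from statistics import median

def find_spikes(losses, window=5, ratio=1.5):
    """Indices whose loss exceeds ratio x the median of the previous `window` losses."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = median(losses[i - window:i])
        if losses[i] > ratio * baseline:
            spikes.append(i)
    return spikes

def datasets_before(log, spike_idx, lookback=3):
    """Count dataset names in the `lookback` records before the spike, plus the spike itself."""
    start = max(0, spike_idx - lookback)
    return Counter(dataset for _, _, dataset in log[start:spike_idx + 1])

# Hypothetical log records: (step, loss, dataset name).
log = [
    (0, 2.50, "web"), (1, 2.45, "web"), (2, 2.40, "code"), (3, 2.38, "web"),
    (4, 2.35, "books"), (5, 2.33, "web"), (6, 2.31, "forum_dump"),
    (7, 2.30, "forum_dump"), (8, 4.90, "forum_dump"), (9, 2.60, "web"),
    (10, 2.35, "web"),
]

spikes = find_spikes([loss for _, loss, _ in log])
suspects = datasets_before(log, spikes[0]) if spikes else Counter()
print(spikes, suspects.most_common())
```

A dataset that dominates the steps before a spike is only a lead, not a diagnosis; the same pattern can arise by chance or from optimizer instability.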
Ingredients:
- data: Allen-Zhu et al.: junk data significantly harms an LLM's knowledge capacity on good data (sometimes by $20\times$); when the domain name is added at the front of all pretraining data paragraphs, the LLM can automatically detect domains rich in high-quality knowledge and prioritize them.
- model architecture: there are many architecture designs.
- optimizer: AdamW, Muon.
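The domain-name trick from the data item can be sketched as a preprocessing step. This is a minimal illustration, assuming each document comes with a source URL and that paragraphs are separated by blank lines; the function name `tag_with_domain`, the example URL, and the exact tag format (domain, a space, then the paragraph) are assumptions, not the format used in the cited work.

```python
from urllib.parse import urlparse

def tag_with_domain(text, source_url):
    """Prepend the source domain to every paragraph of a document."""
    domain = urlparse(source_url).netloc
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Hypothetical tag format: "<domain> <paragraph>".
    return "\n\n".join(f"{domain} {p}" for p in paragraphs)

doc = "Entropy measures uncertainty.\n\nIt is expressed in bits."
tagged = tag_with_domain(doc, "https://en.wikipedia.org/wiki/Entropy")
print(tagged)
```

Because the tag sits at the front of each paragraph, every training context contains the source signal, which is what would let the model associate a domain with the quality of the text that follows.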
Foundation:
- Scaling laws: empirical laws that describe how model performance reliably improves as you increase compute.
- Principled scaling: [Greg Yang et al.] a spectral condition for feature learning, and muP variants.
- Systems
- What gives rise to scaling laws?
- What leads to emergent properties?
- What leads to hyperparameter transfer?
- What model architecture compresses the best?
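The scaling-law item in the list above can be made concrete with a small fit. This sketch assumes a pure power law $L(C) = a \, C^{-b}$ with no irreducible-loss term, which is a simplification. The data points are synthetic, generated from known constants rather than measured, and the name `fit_power_law` is hypothetical. The fit is ordinary least squares in log-log space, where a power law becomes a straight line.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares on (log compute, log loss)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic data from a known law (a = 5.0, b = 0.05), not real measurements.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** (-0.05) for c in compute]

a, b = fit_power_law(compute, loss)
predicted = a * (1e22) ** (-b)  # extrapolate one order of magnitude further
print(f"a = {a:.3f}, b = {b:.3f}, predicted loss at 1e22 FLOPs = {predicted:.3f}")
```

The practical use of such a fit is extrapolation: small runs estimate `a` and `b`, and the fitted line predicts the loss of a larger run before committing compute to it.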