| Date | Title | Description |
|---|---|---|
| April 2nd | Overview of Transformers [In-Person] Speakers: Instructors | Brief intro and overview of the history of ML/NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and current challenges. Link to slides. Papers discussed: Feng et al., Baby Scale: Investigating Models Trained on Individual Children's Language Input (arXiv:2603.29522); Zeng et al., Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models (arXiv:2603.29552); Singh et al., To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining (arXiv:2604.00715); Singh et al., Curriculum-Guided Layer Scaling for Language Model Pretraining (arXiv:2506.11389); Singh et al., Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning (arXiv:2603.00786); Liu et al., A Unified Definition of Hallucination: It's The World Model, Stupid! (arXiv:2512.21577). |
| April 9th | From Representation Learning to World Modeling through Joint Embedding Predictive Architectures [In-Person] Speakers: Heejeong (Hazel) Nam (Brown University) & Lucas Maes (Mila / University of Montreal) | World models are increasingly moving away from reconstruction and toward prediction in latent space. In this talk, we will present two recent JEPA-based approaches that illustrate this shift from complementary angles. Causal-JEPA induces an object-level relational bias to promote representations that capture entities and interactions, leading to stronger reasoning and more efficient planning. LeWorldModel shows that such predictive world models can also be trained stably end-to-end from raw pixels using a minimal objective and a clean architectural recipe, while remaining competitive on control tasks. Taken together, these works argue for a unified view of world modeling: predictive latent learning becomes most powerful when combined with both structural bias and architectural simplicity. This perspective suggests a promising path toward robust world models that support abstraction, reasoning, and control. (A minimal JEPA-style training sketch follows the schedule below.) Heejeong (Hazel) Nam is a Master's student at Brown University, working on representation learning, causality, and self-supervised learning. Lucas Maes is a PhD student at Mila and the University of Montreal, working on JEPA and planning. |
| April 16th | State Space Models (SSMs) [In-Person] Speaker: Albert Gu (CMU) | |
| April 23rd | The Ultra-Scale Talk: Scaling Training to Thousands of GPUs [In-Person] Speaker: Nouamane Tazi (Hugging Face) | Training large language models demands more than raw compute; it requires a deep understanding of parallelism and architecture choices. This talk dives into the practicalities of ultra-scale training: how 5D parallelism makes it possible to stretch a single run across massive GPU clusters, how Mixture-of-Experts architectures introduce new scaling dimensions and stability challenges, and the performance tuning and communication patterns that drive throughput. Drawing from the Ultra-Scale Playbook and real-world scaling efforts, we'll cover benchmarks, hard-earned lessons, and hands-on recommendations for engineers and researchers ready to train state-of-the-art models efficiently. (A toy sketch of how the five parallelism dimensions factor a cluster follows the schedule below.) Nouamane Tazi is a Machine Learning Engineer at Hugging Face, focused on large-scale distributed LLM training. He is the lead author of the Ultra-Scale Playbook and a core developer of Nanotron, Hugging Face's open-source distributed training library. His work spans projects such as StarCoder2, SmolLM3, and several Mixture-of-Experts scaling initiatives. He is passionate about making large-scale training practical and accessible. |
| April 30th | TBA | |
| May 7th | Speaker: Andrew Lampinen (Anthropic) | |
| May 14th | Speaker: Vivek Natarajan (DeepMind) | |
| May 21st | TBA | |
| May 28th | Speaker: Charles Frye (Modal) | |
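
As a companion to the April 9th talk, here is a minimal JEPA-style training sketch: a context encoder and a predictor regress onto latents produced by an EMA-updated target encoder, so learning happens in latent space rather than by reconstruction. This is an illustration under assumed module sizes, hyperparameters, and random data, not the Causal-JEPA or LeWorldModel implementations discussed in the talk.

```python
# Minimal JEPA-style training step (illustrative sketch only; module sizes,
# the EMA rate, and the random data are assumptions, not the talk's code).
import copy
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Toy encoder mapping an observation vector to a latent vector."""
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )

    def forward(self, x):
        return self.net(x)


class Predictor(nn.Module):
    """Predicts the next latent from the current latent and an action."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))


def jepa_step(context_enc, target_enc, predictor, opt, obs, action, next_obs, ema=0.99):
    """Predict the next observation's latent and regress onto the EMA target latent."""
    z_pred = predictor(context_enc(obs), action)
    with torch.no_grad():                      # target latents carry no gradient
        z_tgt = target_enc(next_obs)
    loss = nn.functional.mse_loss(z_pred, z_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                      # EMA update of the target encoder
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()


if __name__ == "__main__":
    ctx, pred = Encoder(), Predictor()
    tgt = copy.deepcopy(ctx)                   # target encoder starts as a copy
    opt = torch.optim.Adam(list(ctx.parameters()) + list(pred.parameters()), lr=1e-3)
    obs, act, nxt = torch.randn(8, 64), torch.randn(8, 4), torch.randn(8, 64)
    print(jepa_step(ctx, tgt, pred, opt, obs, act, nxt))
```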
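
As a companion to the April 23rd talk, the sketch below shows the basic arithmetic behind "5D parallelism": the cluster size is the product of the per-dimension group sizes, and each global rank maps to one coordinate along each axis. The dimension names follow the Ultra-Scale Playbook's terminology, but the example sizes and the rank-ordering convention are assumptions for illustration, not Nanotron's actual layout.

```python
# Toy arithmetic for "5D parallelism": world size = product of group sizes,
# and a flat global rank decomposes into one coordinate per parallelism axis.
from math import prod

DIMS = ("data", "tensor", "pipeline", "context", "expert")


def mesh_coords(rank, sizes):
    """Map a flat global rank to its coordinate along each parallelism dimension."""
    coords = {}
    for dim in DIMS:                     # first dimension varies fastest (assumed convention)
        coords[dim] = rank % sizes[dim]
        rank //= sizes[dim]
    return coords


if __name__ == "__main__":
    sizes = {"data": 4, "tensor": 8, "pipeline": 4, "context": 2, "expert": 2}
    world_size = prod(sizes.values())    # 4 * 8 * 4 * 2 * 2 = 512 GPUs in this config
    print(f"world size: {world_size}")
    print(mesh_coords(137, sizes))       # rank 137 -> one coordinate per axis
```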



