Math 276: Syllabus

Two review papers cover part of the material: first and second.

Here is a rough syllabus (the precise schedule will depend on progress in class; suggestions and feedback are welcome).

April 2, 4

Uniform convergence theory

  1. Bartlett, P.L., 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), pp.525-536.

  2. Bach, F., 2017. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1), pp.629-681.

  3. Bartlett, P.L., Foster, D.J. and Telgarsky, M., 2017, December. Spectrally-normalized margin bounds for neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 6241-6250).

April 9, 11

Overparametrized models and implicit bias

  1. Gunasekar, S., Lee, J., Soudry, D. and Srebro, N., 2018, July. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning (pp. 1832-1841). PMLR.

  2. Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S. and Srebro, N., 2018. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1), pp.2822-2878.

  3. Chizat, L., Oyallon, E. and Bach, F., 2019. On Lazy Training in Differentiable Programming. Advances in Neural Information Processing Systems, 32, pp.2937-2947.

  4. Oymak, S. and Soltanolkotabi, M., 2020. Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks. IEEE Journal on Selected Areas in Information Theory, 1(1), pp.84-105.

April 16, 18

Benign overfitting in ridge regression and kernel methods

  1. Tsigler, A. and Bartlett, P.L., 2020. Benign overfitting in ridge regression. arXiv:2009.14286.

  2. Mei, S., Misiakiewicz, T. and Montanari, A., 2022. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59, pp.3-84.

  3. Rakhlin, A. and Zhai, X., 2019. Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory. PMLR.

Complements:

  1. Ghorbani, B., Mei, S., Misiakiewicz, T. and Montanari, A., 2019. Linearized two-layers neural networks in high dimension. Annals of Statistics (to appear).

  2. Liang, T. and Rakhlin, A., 2020. Just interpolate: Kernel “ridgeless” regression can generalize. Annals of Statistics, 48(3), pp.1329-1347.

April 23, 25

Benign overfitting in random features and neural tangent models

  1. Mei, S. and Montanari, A., 2022. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4), pp.667-766.

  2. Montanari, A. and Zhong, Y., 2022. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. The Annals of Statistics, 50(5), pp.2816-2847.

Complements:

  1. Cheng, C. and Montanari, A., 2022. Dimension free ridge regression. arXiv:2210.08571.

April 30, May 2

Feature learning: Mean field theory

  1. Mei, S., Montanari, A. and Nguyen, P.M., 2018. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33), pp.E7665-E7671.

  2. Chizat, L. and Bach, F., 2018. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31.

  3. Chizat, L., 2022. Mean-field Langevin dynamics: Exponential convergence and annealing. arXiv:2202.01009.

Complements:

  1. Chizat, L., Colombo, M., Fernández-Real, X. and Figalli, A., 2022. Infinite-width limit of deep linear neural networks. arXiv:2211.16980.

  2. TBD

May 7, 9

Feature learning: Multi-index models

  1. Ba, J., Erdogdu, M.A., Suzuki, T., Wang, Z., Wu, D. and Yang, G., 2022. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems, 35, pp.37932-37946.

  2. Bietti, A., Bruna, J. and Pillaud-Vivien, L., 2023. On learning Gaussian multi-index models with gradient flow. arXiv:2310.19793.

  3. Damian, A., Lee, J. and Soltanolkotabi, M., 2022, June. Neural networks can learn representations with gradient descent. In Conference on Learning Theory (pp. 5413-5452). PMLR.

Complements:

  1. Bietti, A., Bruna, J., Sanford, C. and Song, M.J., 2022. Learning single-index models with shallow neural networks. Advances in Neural Information Processing Systems, 35, pp.9768-9783.

  2. Berthier, R., Montanari, A. and Zhou, K., 2023. Learning time-scales in two-layers neural networks. arXiv:2303.00055.

  3. Glasgow, M., 2023. SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem. arXiv:2309.15111.

  4. Damian, A., Nichani, E., Ge, R. and Lee, J.D., 2024. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. Advances in Neural Information Processing Systems, 36.

  5. Arnaboldi, L., Stephan, L., Krzakala, F. and Loureiro, B., 2023, July. From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. In Conference on Learning Theory (pp. 1199-1227). PMLR.

May 14, 16

Feature learning: Staircase functions, general parametrizations, etc.

  1. Abbe, E., Adsera, E.B. and Misiakiewicz, T., 2022, June. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory (pp. 4782-4887). PMLR.

  2. Yang, G. and Hu, E.J., 2020. Feature learning in infinite-width neural networks. arXiv:2011.14522.

Complements:

  1. TBD

May 21, 23

Diffusion models

  1. Montanari, A., 2023. Sampling, diffusions, and stochastic localization. arXiv:2305.10690.

  2. Chen, S., Chewi, S., Li, J., Li, Y., Salim, A. and Zhang, A.R., 2022. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv:2209.11215.

  3. Shah, K., Chen, S. and Klivans, A., 2024. Learning mixtures of Gaussians using the DDPM objective. Advances in Neural Information Processing Systems, 36.

  4. Li, G., Wei, Y., Chen, Y. and Chi, Y., 2023. Towards faster non-asymptotic convergence for diffusion-based generative models. arXiv:2306.09251.

Complements:

  1. Montanari, A. and Wu, Y., 2023. Posterior sampling from the spiked models via diffusion processes. arXiv:2304.11449.

  2. Chen, S., Chewi, S., Lee, H., Li, Y., Lu, J. and Salim, A., 2024. The probability flow ODE is provably fast. Advances in Neural Information Processing Systems, 36.

  3. Benton, J., De Bortoli, V., Doucet, A. and Deligiannidis, G., 2023. Linear convergence bounds for diffusion models via stochastic localization. arXiv:2308.03686.

  4. Ghio, D., Dandi, Y., Krzakala, F. and Zdeborová, L., 2023. Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective. arXiv:2308.14085.

May 28, 30; June 4

Attention and transformers

  1. Bai, Y., Chen, F., Wang, H., Xiong, C. and Mei, S., 2024. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in Neural Information Processing Systems, 36.

  2. Zhang, R., Wu, J. and Bartlett, P.L., 2024. In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization. arXiv:2402.14951.

  3. Geshkovski, B., Letrouit, C., Polyanskiy, Y. and Rigollet, P., 2023. A mathematical perspective on Transformers. arXiv:2312.10794.

Complements:

  1. TBD