CS 124: From Languages to Information

Dan Jurafsky
Winter 2024, Tu/Th 3:00-4:20 in Hewlett 200

The online world has a vast array of unstructured information in the form of language and social networks. Learn how to make sense of it using neural networks and other machine learning tools, and how to interact with humans via language, from answering questions to giving advice, from regular expressions to information retrieval to large language models!


Course Staff

    Dan Jurafsky
    Professor
    Deveshi Buch
    Head TA
    Veronica Rivera
    Embedded Ethics Postdoc
    Mo Akintan
    TA
    Pranav Gurusankar
    TA
    Natasha Kacharia
    TA
    Hanna Lee
    TA
    Amelia Leon
    TA
    Jasper McAvity
    TA
    Anwesha Mukherjee
    TA
    Fahad Nabi
    TA
    Tolu Oyeniyi
    TA
    Uma Phatak
    Embedded Ethics TA
    Michael Ryan
    TA
    Francis Santiago
    TA
    Jingwen Wu
    TA
    Jack Xiao
    TA

Schedule

Week Date Homework Quiz In-class Video Lectures and Readings (to be done by the Monday of the week unless I specify another date)
1 Jan 9, 11

PA 0: Setup and Tutorial
[starter code]

Due Fri Jan 12, 5:00pm (We'll also go over this in Thursday Jan 11's in-person tutorial)

-
  • Tue Jan 9: In-person Lecture: Intro (not recorded)

    [slides pptx slides pdf]

  • Thurs Jan 11: In-person tutorial: Jupyter notebooks
Watch before Thursday:
2 Jan 16 and 18

PA 1: Regular Expressions
[starter code]

Due Fri Jan 19, 5:00pm

Quiz 1: Text Processing/Edit Distance [quiz 1 on gradescope]

Due Tue Jan 16, 11:59pm

Basic Text Processing Canvas Videos (watch videos before Mon Jan 15) [canvas slides pptx] [canvas slides pdf]
Edit Distance Canvas Videos (watch videos before Mon Jan 15) [canvas slides pptx] [canvas slides pdf]
  • Just for historical reference: Ken Church's original tutorial Unix for Poets, slides/pages 1-19
3 Jan 23 and 25

PA 2: Naive Bayes and Sentiment Analysis!
[starter code]

Due Fri Jan 26, 5:00pm

Quiz 2: Language Modeling/Naive Bayes [gradescope]

Due Tuesday Jan 23, 11:59pm

    Tue Jan 23: Lab #2: Naive Bayes, Classification, and Data Ethics
    (watch NB videos beforehand)
    (don't look at the solution until you've completed all the questions!)
    [Lab 2] [Lab 2 Solutions]


    Thu Jan 25: No class: extra in-person TA office hours during class time in classroom


Language Modeling Canvas Videos (watch before Monday Jan 22) [canvas slides pptx] [canvas slides pdf]
Naive Bayes and Text Classification Canvas Videos (watch before Monday Jan 22) [canvas slides pptx] [canvas slides pdf]
  • J+M (3ed) Chapter 4, "Naive Bayes and Sentiment Classification" pages 1-14 plus page 18, sections 4.1 through 4.8 and 4.10.
    4 Jan 30 and Feb 1

    PA 3: Logistic Regression!
    [starter code]

    Due Fri Feb 2, 5:00pm

    Quiz 3: Logistic Regression [gradescope]

    Due Tuesday Jan 30, 11:59pm

      Tuesday: Lecture from Dan (required and not recorded): NLP for Public Good: Computational Social Science [canvas slides pdf]


      Thursday: No class: extra in-person TA office hours during class time in Hewlett 200


    5 Feb 6 and 8

    PA 4: Information Retrieval
    [starter code]

    Due Fri Feb 9, 5:00pm

    Quiz 4: Information Retrieval [gradescope]

    Due Tuesday Feb 6, 11:59pm

    • Tuesday: Lab #3: Information Retrieval
      [lab 3] [solutions]

      • Thursday: No class: extra in-person TA office hours during class time in Hewlett 200
    Chris Manning Canvas Video: Information Retrieval (I) (watch/read before Monday Feb 5) [canvas slides pptx] [canvas slides pdf]
    • MR+S Chapter 1: Boolean Retrieval (pages 1-17)
    • MR+S Chapter 2: Term vocabulary and postings lists (only pages 33-42)
    Chris Manning Canvas Video: Information Retrieval (II) (watch/read before Monday Feb 5) [canvas slides pptx] [canvas slides pdf]
    6 Feb 13 and 15

    PA 5: Embeddings and Vector Semantics [starter code]

    Due Fri Feb 16, 5:00pm.

    Quiz 5: Vector Semantics and Sequence Labelling [gradescope]

    Due Tue Feb 13, 11:59pm

    Tuesday: Guest Lecture (required and not recorded): Dora Demszky, Graduate School of Education
    [slides pdf]

    Thursday: No class: extra in-person TA office hours during class time in Hewlett 200




    7 Feb 20 and 22

    PA 6: Neural Networks [starter code]

    Due Fri Feb 23, 5:00pm

    Quiz 6: Neural Networks [gradescope]

    Due Tue Feb 20, 11:59pm.

    Tuesday Lab #4: Large Language Models and ChatGPT
    [Lab 4] [Lab 4 solutions]

    Thursday: No class: extra in-person TA office hours during class time in Hewlett 200


    8 Feb 27 and Feb 29

    Tuesday: Review for Midterm (online)

    Thursday: Midterm (online)


    9 Mar 5 and 7

    PA 7: Chatbot

    Due Fri Mar 15, 5:00pm

    Quiz 7: Recommendation Systems

    Due Tues Mar 5, 11:59pm

    Tuesday: Lab #5: PA7 and Git

    Thursday: No class: extra in-person TA office hours during class time in Hewlett 200

    Recommender systems and Collaborative Filtering Canvas videos (watch by Monday Mar 4) [slides pptx] [slides pdf]
    10 Mar 12 and 14

    Reminder: PA 7: Chatbot due Fri Mar 15, 5:00pm

    Quiz 8: Pagerank and Networks

    Due Tues Mar 12, 11:59pm


    Tuesday: No class (and no extra office hours)


    Thursday: Final Lecture from Dan (required and not recorded): "The Current State of NLP"
    Web graphs, Links, and PageRank (watch by Mon Mar 11)
    • MR+S Chapter 21: Link Analysis, just pages 421-433 (Skip section 21.3 and 21.4)
    Social Networks Canvas Videos (watch by Mon Mar 11)

    Logistics

    Instructor
    Dan Jurafsky (jurafsky@stanford.edu)
    Office: Margaret Jacks 117
    Office Hours: Tuesdays 4:30-5:50 after class except Jan 9. I'm going to try an experiment with individual one-at-a-time in-person office hours, where we have a 10 minute slot and possibly even take a walk outside. We can walk over from class together or you can just come to my office, which is Margaret Jacks 117!

    TA Office Hours
    • [Office hours begin on Saturday Jan 13]
    • Mondays: 7:00-9:45pm at Lathrop 299 (except on MLK Day and Presidents' Day)
    • Tuesdays: 6:00-8:45pm virtual (details here)
    • Wednesdays: 4:30-7:15pm in McMurtry Building ART350
    • Thursdays:
      • Weeks 3,5,6,9: 3:00-5:45pm Hewlett 200
      • Weeks 4,7: 3:00-4:30pm Hewlett 200 + 4:30-5:45pm Hewlett 103
      • Weeks 2,10: 4:30-7:15pm Hewlett 200
      • Week 8: No regular office hours
    • Fridays: 9:00am-11:45am at Gates B12 (none in Week 8)
    • Saturdays: 1:00-3:45pm virtual (details here)
    Class Time

    Tuesday and Thursday 3:00-4:20

    Attendance

    We require you to come to 6 classes: the 4 live lectures, lab #1, and lab #5. We strongly, strongly recommend the other 3 labs and the 2 tutorials; you will learn more from doing them with other people. (I won't require attendance at labs 2/3/4, but I will give extra credit for attending them.) You must still do any lab you miss at home yourself. The course can be taken asynchronously only if you have permission from Dan due to a required conflict or medical issue. Also: different people learn better from different combinations of videos/lectures, reading the chapters, coming to the labs, and coming to office hours. But I will say that students who do all four tend to do the best on quizzes and exams and in the course in general.

    Email

    Alas, we can't reply to email sent to individual staff members. If you have a question that is not confidential or personal, post it on the Ed Discussion forum! Responses are quicker and you'll also be helping others with the same question! To contact the teaching staff directly, come see us in office hours!

    If that is not possible, you can also email non-technical questions to the course staff list, cs124_requests@lists.stanford.edu. For urgent requests: we check the staff email list very frequently, but please don't worry if you don't hear from us right away. We will do our best to get back to you within a day or so. Just make sure to send an email as soon as you have the request so it's timestamped!

    If you have a matter to be discussed privately, come to office hours or use cs124_requests@lists.stanford.edu to make an appointment. For grading questions, please talk to us after class or during office hours.

    Class announcements will be on Ed Discussion (although we will occasionally try Canvas and mailing lists). We will assume that everyone reads all announcements.

    Honor Code

    Since we occasionally reuse homeworks from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions. This applies both to the official solutions and to solutions that you or someone else may have written up in a previous year. It is also an honor code violation to find some way to look at the test set, to interfere in any way with programming assignment scoring, or to tamper with the submit script. It's also an honor code violation to use ChatGPT or any automatic coding system to write your code for you.

    Since quizzes are a form of assessment, students are not allowed to collaborate on completing quizzes. It is an honor code violation to discuss quiz questions with other students.

    CS124 follows the general Stanford policy on generative AI which is that "use of or consultation with generative AI shall be treated analogously to assistance from another person. In particular, using generative AI tools to substantially complete an assignment or exam (e.g. by entering exam or assignment questions) is not permitted", just as having someone do your homework or exams for you is not permitted.

    Textbooks
    Course Description

    Extracting meaning, information, and structure from human language text, speech, web pages, social networks. Introducing methods (string algorithms, edit distance, language modeling, machine learning, logistic regression, neural networks, neural embeddings, inverted indices, collaborative filtering, PageRank), applications (chatbots, sentiment analysis, information retrieval, text classification, social networks, recommender systems), and ethical issues.

    Prerequisites

    CS106B, Python (at the level of CS106A), CS109 (or equivalent background in probability), and programming maturity and knowledge of UNIX equivalent to CS107 (or taking CS107 or CS1U concurrently).

    Required Work

    From Languages to Information is a flipped class with much of the material online. All the lectures (except 4 live lectures) have been prerecorded, and you can watch them at home. The weekly quizzes and programming homeworks will be automatically uploaded and graded. Lectures are available in the Modules section on Canvas. Quizzes and homeworks are on Gradescope and GitHub, but you can find them all on this webpage!!
    Prerecorded Video Lectures

    Most weeks, we will ask you to watch a set of video lectures (2 to 2.5 hours total). Most videos will have some in-video questions embedded in them, which you should answer. You are required to watch the videos but the embedded quizzes are not counted toward the final grade.

    In-class Lectures

    4 lectures will be live and required, and their material is fair game on the midterm.

    Labs

    There are 5 in-class labs in which we do group problem-solving activities. The labs are required and will be tested on the quizzes and midterm, meaning that if you can't make a particular in-person lab, you must still do the exercises at home. Labs 1 and 5, however, must be attended in class; the other 3 you can do at home.

    Automated Review Quizzes

    After watching a week's video lectures, we will ask you to answer an open-notes, open-book review quiz (about 5 questions) on the content that you just learned. These quizzes are not timed and may be attempted an infinite number of times. The questions, as well as the options for each question, are randomly selected from a larger pool each time you take a quiz. You will not see your quiz grade or the correct answers until after the due date, and the system will take the score from the last of your infinitely-allowed submissions for the quiz. So if you worry you might have got something wrong, just submit another one! Review quizzes for each week are due at 11:59pm on Tuesday of the following week. There are no late days for review quizzes. Because of the strict no-late-day policy, we will drop your lowest-scoring quiz (i.e., we will only count your best 7 of the 8 quizzes in your final grade).
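
The drop-lowest rule can be sketched in a few lines of Python. This is only an illustration, not the staff's grading script: the function name `quiz_component` is hypothetical, and it assumes that all 8 quizzes are weighted equally and that each score is the fraction (0.0 to 1.0) earned on the last submission for that quiz.

```python
def quiz_component(last_submission_scores, drop_lowest=1):
    """Average the quiz scores after dropping the lowest one(s).

    Assumption (not stated in the syllabus): every quiz carries equal weight.
    Each score is the fraction earned on the LAST submission for that quiz.
    """
    if len(last_submission_scores) <= drop_lowest:
        raise ValueError("need more scores than the number being dropped")
    # Sort ascending and discard the lowest `drop_lowest` scores.
    kept = sorted(last_submission_scores)[drop_lowest:]
    return sum(kept) / len(kept)


# A missed quiz (score 0.0) is the one that gets dropped here.
scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(quiz_component(scores))
```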

    Midterm

    There is one online midterm! Details will be posted in week 6.

    Class Participation

    You have to watch all lectures, and attendance for the 4 live lectures is required. The labs are required and we will test material from them on the midterm, and labs 1 and 5 must be attended in person. However, attendance for labs 2, 3, and 4 is only strongly recommended; you may do them yourself at home if you really cannot come to class. You can get extra credit for class participation and other things by: coming to the 5 labs, giving helpful answers on the class forum, helping out other students in office hours or labs, being the first person to find typos in the textbook (not counting bugs in figure or chapter numbering), and speaking up in the labs. Plus there will be extra credit problems on some of the labs and the midterm.

    Programming Assignments

    7 Python programming assignments. All are due Fridays at 5pm.

    Programming Assignment Collaboration for PA 1-6: You may talk to anybody you want about the assignments and bounce ideas off each other. And if you want, you can also choose a partner and do pair programming for PA 1-6. Pair programming has many advantages for learning!!! You and your pair-partner can discuss code, but it's important that each of you work on each part of the assignment so that you're comfortable with the whole assignment, since assignments build on each other (and we will test concepts from the assignments on the midterm). If you choose to pair-program, you should specify in the submission who your partner is. We will use the normal automatic checks for overlap between your code and the code of any student who is not your pair partner.

    Programming Assignment Collaboration for PA 7: PA 7 is a group homework. You will work together with your group and write code together. Groups must be of size 3 or 4. To work in a group of size 2, you must get special permission from the staff. You cannot work by yourself on PA 7, because part of the goal of this homework is to learn to work on group projects. You must describe in your writeup in detail exactly who in your group did what, and who worked on which parts of the assignment/code.

    Late homeworks

    You have 4 free late (calendar) days to use on programming assignments 1-6. If you are pair programming, late days are still individual (i.e., if one of you has used up your late days and the other has not, and you submit a homework one day late, only the student without remaining late days will be penalized). You cannot use late days on PA 7. Once late days are exhausted, any PA turned in late will be penalized 20% per late day. Each 24 hours or part thereof that a homework is late uses up one full late day. However, no assignment will be accepted more than four days after its due date.
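
To make the late-day arithmetic concrete, here is a minimal Python sketch of the policy for PA 1-6. The function names are hypothetical, and the sketch assumes that "penalized 20% per late day" means a 20% deduction for each late day not covered by a free late day; the syllabus does not say whether that is 20% of the maximum score or of the earned score, so the function only returns the penalty fraction.

```python
import math
from datetime import datetime, timedelta

FREE_LATE_DAYS = 4      # free late days across PA 1-6
PENALTY_PER_DAY = 0.20  # per late day once free days run out
MAX_DAYS_LATE = 4       # nothing accepted more than four days late


def days_late(due, submitted):
    """Each 24 hours or part thereof counts as one full late day."""
    if submitted <= due:
        return 0
    return math.ceil((submitted - due) / timedelta(days=1))


def apply_late_policy(due, submitted, free_days_left):
    """Return (accepted, penalty_fraction, free_days_left_after).

    Assumption: free late days are spent first, and only the uncovered
    late days are penalized.
    """
    late = days_late(due, submitted)
    if late > MAX_DAYS_LATE:
        # Not accepted at all; no late days are consumed.
        return False, 1.0, free_days_left
    covered = min(late, free_days_left)
    penalized_days = late - covered
    return True, penalized_days * PENALTY_PER_DAY, free_days_left - covered


# PA 1 was due Fri Jan 19 at 5:00pm; a submission 25 hours later is 2 days late.
due = datetime(2024, 1, 19, 17, 0)
print(apply_late_policy(due, due + timedelta(hours=25), FREE_LATE_DAYS))
```

For a pair submission, the function would be called once per student with that student's own remaining late days, since late days are tracked individually.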

    Readings

    This class has a significant amount of textbook reading. Most weeks have around 25 textbook pages. The homeworks, quizzes, and midterm will be based heavily on the readings.

    Final grade computation
    • 65% homeworks (PAs 1-6 are each worth the same, 8%; ignore the different point values for each homework. PA7 is worth 16%, double the others, and PA0 is worth 1%.)
    • 20% Midterm
    • 15% weekly review quizzes
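
As a sanity check on the weights above, the following Python sketch computes a weighted total. The dictionary keys and the function name `weighted_total` are illustrative choices, not anything the course defines; each input is assumed to be the fraction (0.0 to 1.0) earned on that component, and a missing component is treated as 0.

```python
# Weights in percentage points, taken from the breakdown above.
WEIGHTS = {
    "PA0": 1.0,
    **{f"PA{i}": 8.0 for i in range(1, 7)},  # PA1..PA6, 8% each
    "PA7": 16.0,
    "midterm": 20.0,
    "quizzes": 15.0,
}


def weighted_total(fractions):
    """Return the course total in percentage points (before extra credit).

    `fractions` maps a component name to the fraction earned on it;
    components that are absent count as 0 (an assumption of this sketch).
    """
    return sum(weight * fractions.get(name, 0.0)
               for name, weight in WEIGHTS.items())


# Homeworks: 1 + 6 * 8 + 16 = 65; with midterm and quizzes the weights sum to 100.
print(sum(WEIGHTS.values()))
```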
    Final letter grades
    • Some sort of A: 90% and above of the total points
      • (the numerator will include your extra credit, the denominator does not include possible extra credit (otherwise it wouldn't be extra credit))
    • A+: 100.000% and above strictly no rounding (i.e., not 99.99% or below)
    • Some sort of B: 80% and above
    • Some sort of C: 70% and above
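
The cutoffs above can be written as a small Python sketch. The function names are hypothetical. Following the note about extra credit, the numerator includes extra credit and the denominator does not, and no rounding is applied. The syllabus does not say what happens below 70%, so the sketch returns `None` there rather than guessing.

```python
def total_percent(earned, extra_credit, possible):
    """Extra credit counts in the numerator but not in the denominator."""
    return 100.0 * (earned + extra_credit) / possible


def letter_band(percent):
    """Map a total percentage to a letter band, with strictly no rounding."""
    if percent >= 100.0:
        return "A+"
    if percent >= 90.0:
        return "some sort of A"
    if percent >= 80.0:
        return "some sort of B"
    if percent >= 70.0:
        return "some sort of C"
    return None  # below 70% is not specified in the list above


print(letter_band(99.99))                       # stays in the A band, not A+
print(letter_band(total_percent(95, 6, 100)))   # extra credit can push past 100%
```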