Course Description

This course is aimed at non-CS undergraduate and graduate students who want to learn the basics of big data tools and techniques and apply that knowledge in their areas of study. Many of the world's biggest discoveries and decisions in science, technology, business, medicine, politics, and society as a whole are now being made on the basis of analyzing massive data sets. At the same time, it is surprisingly easy to make errors or come to false conclusions from data analysis alone. This course provides a broad and practical introduction to big data: data analysis techniques including databases, data mining, and machine learning; data analysis tools including spreadsheets, relational databases and SQL, Python, and R; data visualization techniques and tools; pitfalls in data collection and analysis; historical context, privacy, and other ethical issues. Tools and techniques are covered hands-on but at a cursory level, providing a basis for future exploration and application. Prerequisites: comfort with basic logic and mathematical concepts, along with high school AP computer science, CS106A, or other equivalent programming experience.


Time: Tuesdays & Thursdays 1:30-2:50 PM
Location: Hewlett 201

Office Hours

The CAs hold 20 hours of office hours a week, Monday-Friday, in reserved areas in the Engineering Quad. Times and places are given in the course calendar.

Professor Widom holds office hours on Wednesdays 4:00-5:00 PM in the Dean's Office #227 on the 2nd floor of the Huang building. Updates to her office hours will be posted on the course calendar.


There will be 5 assignments, 2 projects, a midterm exam, and a final exam. Grades for the course will be weighted equally across the composite scores for homework assignments, projects, and exams. For example, the 5 homework assignments together will carry the same weight as the 2 exams. See the syllabus below for dates and times. There will be no alternate exams, so please make sure you will be available for the midterm on May 10 and the final exam on June 8.
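As a rough illustration of the equal weighting described above, here is a minimal Python sketch. It is not an official grade calculator: the function name `course_grade`, the 0-100 scale, the sample scores, and the assumption that each composite score is a plain average of its items are all hypothetical, since this page does not say how composites are computed.

```python
def average(scores):
    """Plain arithmetic mean of a non-empty list of scores."""
    return sum(scores) / len(scores)


def course_grade(homework, projects, exams):
    """Combine three composite scores with equal weight.

    Assumption (not stated on the course page): each composite is the
    plain average of the items in that category, on a 0-100 scale.
    """
    composites = [average(homework), average(projects), average(exams)]
    # Equal weighting: each category counts for one third of the grade.
    return average(composites)


# Hypothetical scores: 5 assignments, 2 projects, midterm and final.
example = course_grade(
    homework=[90, 85, 100, 95, 80],
    projects=[88, 92],
    exams=[75, 85],
)
print(round(example, 2))
```

Under these assumptions, a single assignment counts for less than a single exam, because five assignments share the same one-third weight that two exams share.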


Please use Piazza for all questions related to the course. We will be using Piazza as our primary portal for course-related announcements, so make sure to sign up! For all Piazza posts, we guarantee that we will respond within 24 hours. DO NOT post assignment code on Piazza for debugging; we will not respond to posts containing assignment code. Also check out the list of frequently asked questions.

Course Staff

Professor: Jennifer Widom
Course Assistant: Arjun Kunna
Course Assistant: Lucy Wang
Course Assistant: Steven Chen
Course Assistant: Alex Haigh
Course Assistant: Jesse Min

Date | Topic and Assignments | Readings/References | Notes
Tue Apr 3 | Introductions, course logistics, Big Data Overview (start) | Introductory Readings | Course Information; Big Data Overview
Thu Apr 5 | Big Data Overview (finish); Data Analysis & Visualization Using Spreadsheets (Part 1) | Google Spreadsheets References | Data Analysis Using Spreadsheets Slides; Spreadsheet Analysis Notes
Mon Apr 9 | Assignment 1 released: Spreadsheets; Project 1 released: Personal Data Analysis
Tue Apr 10 | Data Analysis & Visualization Using Spreadsheets (Part 2) | Common Visualization Mistakes | Data Visualization Using Spreadsheets Slides; Spreadsheet Visualization Notes
Thu Apr 12 | Advanced Data Visualization Using Tableau | Tableau References | Advanced Data Visualization Using Tableau Slides; Tableau Notes
Mon Apr 16 | Assignment 1 due; Assignment 2 released: Tableau, SQL
Tue Apr 17 | Relational Databases and Basic SQL | SQL References; Project Jupyter home page | Relational Databases and SQL Slides; Basic SQL Notes
Thu Apr 19 | Advanced SQL | | Advanced SQL Slides; Advanced SQL Notes
Mon Apr 23 | Project 1 proposal due
Tue Apr 24 | Introduction to Python (optional if familiar with Python including lists and dictionaries) | Python References; SQL vs Python Comparison | Python Slides; Basic Python Notes
Thu Apr 26 | Python for Data Analysis & Visualization (Part 1)
Thu Apr 26 | Assignment 2 due; Assignment 3 released: Python
Tue May 1 | Python for Data Analysis & Visualization (Part 2) | Pandas intro
Thu May 3 | Guest Lecture: Google's Big Data Platforms and Services - Zoltan Fern
Mon May 7 | Assignment 3 due
Tue May 8 | Machine Learning - Regression | ML References
Thu May 10 | Midterm Exam - Education Building, Cubberley Auditorium (during class hours)
Mon May 14 | Project 1 due; Assignment 4 released: Machine Learning; Project 2 released: Movie-Rating Predictions | The Netflix Prize
Tue May 15 | Machine Learning - Classification and Clustering | ML References - Classification and Clustering
Thu May 17 | Using Python for Machine Learning | ML References - Python
Tue May 22 | Data Mining Algorithms | Data Mining References
Thu May 24 | Data Mining Using SQL and Python
Thu May 24 | Assignment 4 due; Assignment 5 released: Data Mining, R, Network Analysis
Tue May 29 | The R Language - Data Analysis, Visualization, and Machine Learning | R Tutorial; Quick-R: accessing the power of R; Python vs. R for Data Visualization
Thu May 31 | Network Analysis | Network References
Thu May 31 | Project 2 due
Tue June 5 | Project 2 results and discussion; Text mining and image analysis; Follow-on courses and pathways
Wed June 6 | Assignment 5 due (no lates)
Fri June 8 | Final Exam 12:15-3:15 PM; Location: Lathrop Library - Bishop Auditorium

Students with Documented Disabilities

Students who may need an academic accommodation based on the impact of a disability must initiate the request with the Office of Accessible Education (OAE). Professional staff will evaluate the request with required documentation, recommend reasonable accommodations, and prepare an Accommodation Letter for faculty dated in the current quarter in which the request is being made. Students should contact the OAE as soon as possible since timely notice is needed to coordinate accommodations. For CS102 we require accommodation requests and letters to be submitted a minimum of two weeks before the requested accommodation. The OAE is located at 563 Salvatierra Walk (phone: 723-1066, URL: