CS345 - Topics in Data Warehousing
Autumn 2004

Assignments

There will be five homework assignments: one problem set and four small programming assignments. The goal of the assignments is for students to gain hands-on experience with the ideas convered in the class.

Students who find their course projects to be particularly interesting have the option to expand the scope of their project instead of completing one or two of the assignments. Make arrangements with the instructor if you are interested in electing this option.

Late Policy
Homework assignments will be given out on Thursdays and will be due 12 days later (one week from the following Tuesday).

Late assignments will be accepted without penalty, provided that they are submitted no more than three days late (that is, by 11:59 pm on the Friday following the due date). No submissions will be accepted that are more than three days late.

Students are strongly encouraged to complete the assignments by the nominal due date, rather than relying on the automatic three-day extension. Requests for additional extensions due to exceptional circumstances will be considered on a case-by-case basis, provided they are received before the nominal due date of the assignment.

Programming Environment
All of the assignments after the first involve writing programs that interact with a database. The database management system we are using for this class is Oracle. Each student in the class will have an Oracle account created. A web page with useful tips about interacting with Oracle is here.

You may use the programming language of your choice for the programming assignments.

Assignment #1: Data Modeling Trade-Offs
Due Date: Tuesday, October 19

This assignment will be a written problem set. It will consist of various data modeling scenarios. In each scenario, you will be asked to compare alternative schema designs, listing the pros and cons of each.

Assignment #1 (Postscript format)

Assignment #2: Extraction, Transformation, and Load
Due Date: Tuesday, November 2

In this programming assignment, you will extract data from three different source systems--a web site, a relational database, and a flat file--and load it into a data warehouse. The assignment will include both an initial load and an incremental load.

Assignment #2 (Postscript format)

Logging into Oracle using sqlplus Also describes how to change your password.

CS 145 Oracle Tips Page Lots of helpful info.

Using the Oracle bulk loader (sqlldr)

Using TOAD / TOra on your Windows or Linux machine

"Hello World" in JDBC

Differences between Players source system tables and Players dimension table

Using the extract.pl script

Test queries to validate your data warehouse

Assignment #3: Aggregate Selection and Navigation
Due Date: Friday, November 19
(Late assignments will be accepted without penalty until Monday, November 22, 11:59 PM.)

This programming assignment has three parts. In the first part, you will implement an algorithm for selecting a good set of precomputed aggregate tables (materialized views) to create for a given database. In the second part, you will generate the SQL to create the aggregate tables that were selected in the first part. In the third part, you will implement a query re-writer that takes an aggregation query that is written to run against the fact table, selects the most efficient aggregate table to use for that query, and re-writes the query to execute against the aggregate table.

Assignment #3 (Postscript format)

Metadata Tables Describes the format of the metadata tables in which configuration information is stored.

Scripts that are provided

SQL query format

Star schema for testing

Specifics of the aggregate selection algorithm

Some tips for the aggregate creation phase of the assignment

Results on test data Results of the aggregate selection algorithm on the two test data sets. Use these to verify that your algorithm is working correctly.

Assignment #4: Decision Tree Learning
Due Date: Thursday, December 2
(Late assignments will be accepted without penalty until Sunday, December 5, 11:59 PM.)

In this programming assignment, you will implement the ID3 algorithm to learn a decision tree classifier, along with pruning using a holdout set. The data to be classified will be stored in the database, and computation of the statistics needed by the classifier will be performed using SQL.

Assignment #4 (Postscript format)

Scripts that are provided

Results on test data You can compare the decision tree learned by your program with these test results.