CS345 - Topics in Data Warehousing
Autumn 2004

Supplementary Readings

This page lists supplementary readings for the course.
None of these readings is required; however, they may help broaden your understanding of data warehousing or suggest project ideas.
More supplemental readings will be added to this page as the course progresses.

General

The Data Warehousing Information Center
Contains a set of well-written, practical articles about data warehousing. Recommended.

Data Warehousing and OLAP: A Research-Oriented Bibliography
Links to many research papers about data warehousing and OLAP, organized by category. Recommended.

Data Warehousing Resource Site
An overview of data warehousing from a practical, project-planning perspective.

The OLAP Report
A resource for information about vendors of OLAP tools.

Week 1 - Data Warehousing Basics

The Case For Data Warehousing
Lists some of the reasons that people build data warehouses.

The PANDA Project
The "overview" tab has a good introduction to OLAP and decision support.

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, by J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, Data Mining and Knowledge Discovery 1:1, 1997. (A PDF version.)

An Overview of Data Warehousing and OLAP Technology, by Surajit Chaudhuri and Umesh Dayal, ACM SIGMOD Record 26:1, 1997.

Research Problems in Data Warehousing, by Jennifer Widom, Int'l Conference on Information and Knowledge Management (CIKM), 1995.

Weeks 2-3 - Dimensional Modeling

Data Warehousing Articles from Ralph Kimball
Articles about a variety of practical data warehousing issues, organized by topic. Recommended.

Design Tips from Ralph Kimball
Presents best practices for a variety of data modeling problems.

Data Cleaning

Data Integration Course Web Page
Web page for a data integration course at another university. Includes bibliography of research papers.

Record Linkage: Current Practice and Future Directions by L. Gu, R. Baxter, D. Vickers, and C. Rainsford
Survey of approaches to the merge/purge problem.

Data Cleaning: Problems and Current Approaches by Erhard Rahm and Hong Hai Do (2000)
A survey of data cleaning research literature as of the year 2000.

Robust and Efficient Fuzzy Match for Online Data Cleaning by S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani (2003)
Proposes a sophisticated string-matching function for fuzzy matching. [Link only works from Stanford computers]

A Comparison of String-Distance Metrics for Name-Matching Tasks by W. Cohen, P. Ravikumar, and S. Fienberg (2003)
Experimentally compares distance functions based on edit distance and TF-IDF.
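
For orientation before reading, the edit-distance family of metrics builds on the classic Levenshtein distance. The following is a minimal sketch of that textbook definition, not code from the paper; the function name `levenshtein` is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]
```

For name matching, a small distance such as `levenshtein("smith", "smyth")` suggests a likely spelling variant, although a raw count is usually normalized by string length before it is used as a similarity score.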

Eliminating Fuzzy Duplicates in Data Warehouses by R. Ananthakrishna, S. Chaudhuri, and V. Ganti (2002)
Uses relational structure as well as string-matching.

An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records by A. Monge and C. Elkan (1996)
An algorithm for merging duplicate records based on edit distance.

The Merge/Purge Problem for Large Databases by M. Hernandez and S. Stolfo (1995)
Merge/purge using a sorted neighborhood approach.
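
The general idea of a sorted neighborhood pass can be sketched as follows: sort the records on a key, then compare each record only with its neighbors inside a small sliding window. This is a simplified illustration rather than the authors' implementation; `sorted_neighborhood`, `key`, `is_match`, and `window` are hypothetical names chosen for this sketch.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

R = TypeVar("R")

def sorted_neighborhood(
    records: Sequence[R],
    key: Callable[[R], str],
    is_match: Callable[[R, R], bool],
    window: int = 3,
) -> List[Tuple[int, int]]:
    """Return index pairs of records judged to be duplicates.

    Only records that fall within `window` positions of each other
    in the sorted order are compared, which avoids the quadratic
    cost of comparing every pair.
    """
    # Sort record positions by the chosen key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

The result depends heavily on the choice of sort key: duplicates whose keys sort far apart are never compared, which is why several passes with different keys are commonly combined.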

Weeks 4-5 - Query Processing

SQL Server Query Processor Overview
High-level documentation describing the functioning of the query processor in Microsoft SQL Server.

Star Queries in Oracle8
Technical white paper describing specialized query processing strategies for star schema queries in Oracle.

The Value of Merge Join and Hash Join in SQL Server by Graefe (1999)
Demonstrates that merge join and hash join can outperform nested loop join for OLAP workloads.

Hash Joins and Hash Teams in Microsoft SQL Server by Graefe, Bunker, and Cooper (1998)
Describes in detail how an "industrial-strength" hash join operator is implemented.
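
As background, the basic in-memory hash join has a build phase and a probe phase. The sketch below shows only that textbook core and assumes both inputs fit in memory; it omits the features a production operator needs, such as handling inputs larger than memory. The function and parameter names are illustrative and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, object]

def hash_join(build_rows: Iterable[Row], probe_rows: Iterable[Row],
              build_key: str, probe_key: str) -> List[Row]:
    """Equi-join two row collections on build_key = probe_key."""
    # Build phase: hash the (ideally smaller) input on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each probe row and emit the combined rows.
    result = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            result.append({**match, **row})
    return result
```

In a star schema the small dimension table is the natural build input and the large fact table is the probe input, so the fact table is scanned only once.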

Bitmap Indexes and Compression

Improved Query Performance with Variant Indexes by O'Neil and Quass (1997)
Describes bitmap indexes, projection indexes, and bit-sliced indexes.
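
To make the idea of a bitmap index concrete, here is a minimal sketch that stores one bitmap per distinct column value, using a Python integer as the bit vector (bit i is set when row i has that value). It is an uncompressed toy version for illustration only; `build_bitmap_index` and `rows_of` are hypothetical helper names.

```python
from typing import Dict, Hashable, List, Sequence

def build_bitmap_index(values: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct value to an integer whose bit i is 1
    exactly when values[i] equals that value."""
    index: Dict[Hashable, int] = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap: int) -> List[int]:
    """List the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two columns of a four-row table.
color_index = build_bitmap_index(["red", "blue", "red", "green"])
size_index = build_bitmap_index(["S", "S", "L", "S"])

# WHERE color = 'red' AND size = 'S' becomes a bitwise AND.
red_and_small = color_index["red"] & size_index["S"]
# WHERE color = 'red' OR color = 'green' becomes a bitwise OR.
red_or_green = color_index["red"] | color_index["green"]
```

Because predicates combine with bitwise operations, a multi-column selection can be evaluated on the indexes alone before any table rows are fetched.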

Performance Measurements of Compressed Bitmap Indices by Johnson (1999)
Empirical comparison of the performance of various bitmap compression schemes.

Compressing Bitmap Indexes for Faster Search Performance by Wu, Otoo, and Shoshani (2002)
Describes the word-aligned hybrid code for bitmap index compression.