CS345 Assignment #2

CS345 - Topics in Data Warehousing
Autumn 2004

Assignment #2 Details

This page contains detailed information about Assignment #2.

Information about players is stored in the players_v1 and players_v2 tables. Think of these tables as representing the state of the players table in the source system at two different points in time. The V1 table is current as of the first (July) extraction, and the V2 table is current as of the second (August) extraction.

These tables are stored in the same Oracle database where you are building your data warehouse. However, they are owned by the user "babcock", so you will need to refer to them as babcock.players_v1 and babcock.players_v2.

For the most part, there is a simple mapping between columns in the source system and columns in the Player dimension table in your data warehouse: most columns have the same names and data types. However, there are three exceptions:

There is no Player_key column in the source system. This column should be populated by a surrogate key that you assign. The PlayerId column is the natural key from the source system; using the value of PlayerId as the surrogate key will not work because you will run into problems as the values of dimension attributes change over time.
The Height column has a string data type in the source system and an integer data type in the data warehouse. Store the player's height in inches in the data warehouse. For example, '5-11' should be stored as 71.
The Weight column has a string data type in the source system and an integer data type in the data warehoues. Store the player's weight as a number of pounds. For example, '220 lbs.' should be stored as 220. This transformation is very easy -- simply drop the 'lbs.' and convert to a numeric type.

There are a variety of approaches you could take in performing the necessary transformations. These include:

Perform the transformations in Oracle using SQL. Use Oracle's string functions (such as SUBSTR) and data type conversion functions (such as TO_CHAR and TO_NUMBER). Generate surrogate keys using an Oracle SEQUENCE.
Write a program that loads the data into memory using JDBC or OCI, performs the necessary transformations, and writes the data to the warehouse or staging area using JDBC or OCI.
Export the data as a text file using the sqlldr program, transform the text file using a scripting language such as Perl, sed, or awk, then load it into the warehouse or staging area using sqlldr.

Batting Fact

The fact information for your warehouse will be extracted from the ESPN web site. You will extract data from 2 weeks of major league baseball games. The first extraction will load the data from the week of July 25, 2004, and the second extraction will load the data from the week of August 1, 2004.

A Perl script named extract.pl is provided for you in the /usr/class/cs345/HW2 directory. The script expects one command-line argument, which should be "1" to load the first week's data and "2" to load the second week's data. The output of the script is a number of text files, one file per baseball game. Each file is named <gameid>.txt, where gameid represents the unique identifier assigned to that game by the ESPN web site. It is recommended that you create separate directories to hold the output of each run of the extract.pl script, because this will simplify your file management.

The format of each data file produced by extract.pl is as follows: The first line of the file contains three tab-delimited fields (the game id, the visiting team, and the home team). The second line of the file is blank. The visiting team's batting statistics are listed next, with one player per line. Then follows a blank line, and finally the home team's batting statistics, again with one player per line. Each player's batting statistics are recorded as a tab-delimited record with the following fields (in this order): playerid, name, at bats, runs, hits, RBIs, walks, strikeouts, left on base, doubles, triples, home runs. The player ID is an integer identifier used by the web site to uniquely identify players. The same player IDs are used for the PlayerId column of the players_v1 and players_v2 database tables. The player name extracted by extract.pl is not stored in the data warehouse, and the remaining fields represent measurements that should be stored in the corresponding columns in the Batting fact table.

If you wish, you may modify the extract.pl script to format its output differently.

Here is how the dimensions and measurement columns for each Batting fact row are determined:

The AtBats, Hits, Doubles, Triples, HomeRuns, Runs, RBIs, Walks, Strikeouts, and LeftOnBase attributes should be copied from a row in an extract.pl output file.
The GamesPlayed attribute should always be equal to 1. This fact column is simply used to sum up the total number of games in which each player played.
The Singles and TotalBases attributes are derived from the other measurement columns using simple arithmetic formulas. Singles = Hits - Doubles - Triples - HomeRuns. TotalBases = Singles + (2 * Doubles) + (3 * Triples) + (4 * HomeRuns).
The Game_id degenerate dimension should hold the unique identifier for the game from which the batting statistics are drawn (e.g. 240725120).
The Date_key indicates the date of the game, which can be determined from the game ID. The fourth digit of the game id indicates the month, and the fifth and sixth digits indicate the day. For example, the game with ID 240725120 took place on July 25 because digits 4-6 are 7/25.
The Player_key indicates the player whose statistics are recorded in this fact for. The source systems for the players dimension and batting fact use the same player IDs.
The Team_key and Opponent_key indicate the team for which the player was playing and the team against which the player was playing. These can be determined from the header information in the first line of the files produced by extract.pl. The ESPN web site (and thus the extract.pl output files) usually uses refers to teams by the city where they play; however, for cities that have more than one team, a portion of the team name is also included. For example, PHILADELPLHIA represents the Philadelphia Phillies, and CHICAGO SOX represents the Chicago White Sox. The teams.csv file from which the Team dimension information is drawn does not use the same naming scheme, so one of the data integration problems that you must resolve is mapping from the team identifiers used by the web site to dimension rows in the Team dimension table. (NOTE: Both the Team_key and Opponent_key columns are foreign keys to the Team dimension table; thus this dimension table has two "dimension roles" in the Batting fact.

Test Queries

These test queries will help you verify that your data warehouse load completed properly.

This query returns the best sluggers in the NL West division.

select p.firstname, p.lastname, t.teamname, sum(totalbases), sum(atbats), 
 round(sum(totalbases) / sum(atbats) * 1000)/1000 slugging
from batting b
join team t on b.team_key = t.team_key
join player p on b.player_key = p.player_key
where t.division = 'NL West'
group by p.firstname, p.lastname, t.teamname
having sum(atbats) > 15 and sum(totalbases)/sum(atbats) >= 0.750
order by slugging desc

After the first load, the answer should be:

Jeromy  Burnitz Rockies 22      18      1.222
Todd    Helton  Rockies 18      23      0.783
Phil    Nevin   Padres  19      25      0.76
Barry   Bonds   Giants  12      16      0.75

This query returns the teams that allowed their opponents to hit the most home runs.

select t.teamname, sum(b.homeruns)
from batting b
join team t on b.opponent_key = t.team_key
having sum(b.homeruns) > 10
group by t.teamname
order by sum(b.homeruns) desc

After the first load, the answer should be:

Mariners  13
Reds      12
Yankees   11

This query returns statistics for players named Alomar.

select p.firstname, t.teamname, p.weight,
sum(gamesplayed), sum(atbats), sum(hits) 
from batting b
join player p on b.player_key = p.player_key
join team t on b.team_key = t.team_key
where p.lastname = 'Alomar'
group by p.firstname, t.teamname, p.weight
order by p.firstname, t.teamname, p.weight

After the second load, the answer should be:

Roberto Diamondbacks   190     6     16    7
Roberto White Sox      190     2     7     2
Sandy   White Sox      235     2     5     0
Sandy   White Sox      240     1     3     0

This query returns the American League batting average leaders.

select p.firstname, p.lastname, t.teamname, sum(hits), sum(atbats), 
round(sum(hits) / sum(atbats) * 1000)/1000  batting_avg
from batting b
join team t on b.team_key = t.team_key
join player p on b.player_key = p.player_key
where t.league = 'American'
group by p.firstname, p.lastname, t.teamname
having sum(atbats) > 30 and sum(hits)/sum(atbats) >= 0.390
order by batting_avg desc

After the second load, the answer should be:

Ichiro  Suzuki  Mariners  33     66     0.5
Hideki  Matsui  Yankees   19     47     0.404
B.J.    Surhoff Orioles   13     33     0.394

CS345 - Topics in Data Warehousing Autumn 2004

CS345 - Topics in Data Warehousing
Autumn 2004