| Assignment #2 Details |
|
This page contains detailed information about Assignment #2.
|
| Player Dimension |
|
Information about players is stored in the players_v1 and players_v2 tables. Think of these tables as representing the state of the players table in the source system at two different points in time. The V1 table is current as of the first (July) extraction, and the V2 table is current as of the second (August) extraction. These tables are stored in the same Oracle database where you are building your data warehouse. However, they are owned by the user "babcock", so you will need to refer to them as babcock.players_v1 and babcock.players_v2. For the most part, there is a simple mapping between columns in the source system and columns in the Player dimension table in your data warehouse: most columns have the same names and data types. However, there are three exceptions:
|
| Batting Fact |
|
The fact information for your warehouse will be extracted from the ESPN web site. You will extract data from 2 weeks of major league baseball games. The first extraction will load the data from the week of July 25, 2004, and the second extraction will load the data from the week of August 1, 2004. A Perl script named extract.pl is provided for you in the /usr/class/cs345/HW2 directory. The script expects one command-line argument, which should be "1" to load the first week's data and "2" to load the second week's data. The output of the script is a number of text files, one file per baseball game. Each file is named <gameid>.txt, where gameid represents the unique identifier assigned to that game by the ESPN web site. It is recommended that you create separate directories to hold the output of each run of the extract.pl script, because this will simplify your file management. The format of each data file produced by extract.pl is as follows: The first line of the file contains three tab-delimited fields (the game id, the visiting team, and the home team). The second line of the file is blank. The visiting team's batting statistics are listed next, with one player per line. Then follows a blank line, and finally the home team's batting statistics, again with one player per line. Each player's batting statistics are recorded as a tab-delimited record with the following fields (in this order): playerid, name, at bats, runs, hits, RBIs, walks, strikeouts, left on base, doubles, triples, home runs. The player ID is an integer identifier used by the web site to uniquely identify players. The same player IDs are used for the PlayerId column of the players_v1 and players_v2 database tables. The player name extracted by extract.pl is not stored in the data warehouse, and the remaining fields represent measurements that should be stored in the corresponding columns in the Batting fact table. If you wish, you may modify the extract.pl script to format its output differently. Here is how the dimensions and measurement columns for each Batting fact row are determined:
|
| Test Queries |
|
These test queries will help you verify that your data warehouse load completed properly. This query returns the best sluggers in the NL West division. select p.firstname, p.lastname, t.teamname, sum(totalbases), sum(atbats), round(sum(totalbases) / sum(atbats) * 1000)/1000 slugging from batting b join team t on b.team_key = t.team_key join player p on b.player_key = p.player_key where t.division = 'NL West' group by p.firstname, p.lastname, t.teamname having sum(atbats) > 15 and sum(totalbases)/sum(atbats) >= 0.750 order by slugging descAfter the first load, the answer should be: Jeromy Burnitz Rockies 22 18 1.222 Todd Helton Rockies 18 23 0.783 Phil Nevin Padres 19 25 0.76 Barry Bonds Giants 12 16 0.75 This query returns the teams that allowed their opponents to hit the most home runs. select t.teamname, sum(b.homeruns) from batting b join team t on b.opponent_key = t.team_key having sum(b.homeruns) > 10 group by t.teamname order by sum(b.homeruns) descAfter the first load, the answer should be: Mariners 13 Reds 12 Yankees 11 This query returns statistics for players named Alomar. select p.firstname, t.teamname, p.weight, sum(gamesplayed), sum(atbats), sum(hits) from batting b join player p on b.player_key = p.player_key join team t on b.team_key = t.team_key where p.lastname = 'Alomar' group by p.firstname, t.teamname, p.weight order by p.firstname, t.teamname, p.weightAfter the second load, the answer should be: Roberto Diamondbacks 190 6 16 7 Roberto White Sox 190 2 7 2 Sandy White Sox 235 2 5 0 Sandy White Sox 240 1 3 0 This query returns the American League batting average leaders. select p.firstname, p.lastname, t.teamname, sum(hits), sum(atbats), round(sum(hits) / sum(atbats) * 1000)/1000 batting_avg from batting b join team t on b.team_key = t.team_key join player p on b.player_key = p.player_key where t.league = 'American' group by p.firstname, p.lastname, t.teamname having sum(atbats) > 30 and sum(hits)/sum(atbats) >= 0.390 order by batting_avg descAfter the second load, the answer should be: Ichiro Suzuki Mariners 33 66 0.5 Hideki Matsui Yankees 19 47 0.404 B.J. Surhoff Orioles 13 33 0.394 |