CS345 Assignment #3

CS345 - Topics in Data Warehousing
Autumn 2004

Assignment #3 Details

This page contains detailed information about Assignment #3.

Information about the structure star schema and the query workload to use when selecting aggregates will be provided through a set of metadata tables. These tables can be created using the createTables.sql script and populated using the insertRows.sql script. Both SQL scripts can be found in the directory \usr\class\cs345\HW3.

Star Schema Metadata

The structure of the star schema over which you will be building aggregates is described in three metadata tables: FACT_TBL, FACT_COL, and DIM_TBL.

The FACT_TBL table has one column: FACT_NAME. This table contains exactly one row. The FACT_NAME column holds the name of the fact table in the data warehouse.

The FACT_COL table has one column: COLUMN_NAME. This table holds one row per measurement column in the fact table. The COLUMN_NAME column holds the name of the measurement column.

The DIM_TBL table has four columns: DIM_ID, DIM_NAME, DIM_PK_NAME, and FACT_FK_NAME. This table holds one row per dimension table in the data warehouse. The DIM_ID column is an integer-valued primary key. The DIM_NAME column is the name of the dimension table. The DIM_PK_NAME column holds the name of the primary key column in the dimension table. The FACT_FK_NAME column holds the name of the foreign key column in the fact table that references the dimension table.

Aggregate Selection Metadata

The candidate column sets for the aggregate selection algorithm are described in two metadata tables: CANDIDATE and CANDIDATE_COL.

The CANDIDATE table has one column: CANDIDATE_ID. The CANDIDATE_ID column is an integer-valued primary key. This table holds one row per candidate column set.

The CANDIDATE_COL table has three columns: CANDIDATE_ID, DIM_ID, and COLUMN_NAME. This table has one row per column per candidate column set. The CANDIDATE_ID and DIM_ID columns are foreign keys to the CANDIDATE and DIM_TBL tables, respectively. The COLUMN_NAME column holds the name of an attribute from the dimension table.

The query workload is described in two metadata tables: QUERY and QUERY_COL.

The QUERY table has two columns: QUERY_ID and WEIGHT. This table holds one row per workload query. The QUERY_ID column is an integer-valued primary key. The WEIGHT column is an integer-valued weight that represents the frequency or importance of this query. When evaluating the total cost of the workload on a particular aggregate configuration, the cost of each query should be multiplied by its weight.

The QUERY_COL table has three colums: QUERY_ID, DIM_ID, and COLUMN_NAME. This table holds one row per column referenced in a query. The QUERY_ID and DIM_ID columns are foreign keys to the QUERY and DIM_TBL tables, respectively. The COLUMN_NAME column holds the name of the dimension column referenced in the query.

The constraint on the number of aggregates is expressed in the SETTINGS metadata table. This table has exactly one row and one column, NUM_AGGS. The value of the NUM_AGGS column specifies the number of fact aggregates to select.

Aggregate Navigation Metadata

The aggregate tables that have been constructed are described in four metadata tables: FACT_AGG, DIM_AGG, DIM_AGG_COL, and FACT_AGG_DEF.

The DIM_AGG table has two columns: DIM_AGG_ID and DIM_ID. This table holds one row per dimension aggregate table. The DIM_AGG_ID column is an integer-valued primary key. The DIM_ID column is a foreign key to the DIM_TBL table, indicating which dimension is being aggregated.

The DIM_AGG_COL table has two columns: DIM_AGG_ID and COLUMN_NAME. This table holds one row per dimension attribute in a dimension aggregate. Its purpose is to indicate which attributes are included in which aggregates. The DIM_AGG_ID column is a foreign key to the DIM_AGG table, and the COLUMN_NAME column holds the name of a dimension column that is included in a dimension aggregate. The primary key of this table is (DIM_AGG_ID, COLUMN_NAME).

The FACT_AGG table has two columns: FACT_AGG_ID and NUM_ROWS. This table holds one row per fact aggregate table. The FACT_AGG_ID column is an integer-valued primary key. The NUM_ROWS column holds the number of rows in the fact aggregate table.

The FACT_AGG_DEF table has four columns: FACT_AGG_ID, DIM_ID, DIM_AGG_ID, and DIM_STATUS. This table describes which fact aggregate tables join to which dimension aggregate tables. If there are $n$ dimension tables in total, then this table has $n$ rows per fact aggregate table. The FACT_AGG_ID and DIM_ID columns are foreign keys to the FACT_AGG and DIM_TBL tables, respectively. The primary key of this table is (FACT_AGG_ID, DIM_ID). The DIM_STATUS column holds one of the following three values: 0 means the dimension is omitted from the fact aggregate; 1 means the fact aggregate joins to the base dimension table; and 2 means the fact aggregate joins to a dimension aggregate. The DIM_AGG_ID column is a nullable foreign key to the DIM_AGG table. It has a NULL value whenever DIM_STATUS is 0 or 1.

Provided Scripts and Sample Code

The following SQL scripts and sample code fragments are provided in the directory /usr/class/cs345/HW3 on Stanford's UNIX systems.

createTables.sql Creates the metadata tables used to describe the data warehouse schema and candidate column sets.
insertRows1.sql Populates the metadata tables with a description of the baseball data warehouse that you created in HW#2. This configuration can be used to test your code.
insertRows2.sql Populates the metadata tables with a description of an orders data warehouse derived from the TPC-H benchmark. This configuration can be used to test your code.
insertRows3.sql An alternative configuration for the orders data warehouse. This configurations has fewer candidate column sets, so it runs a little faster than insertRows2.sql.
deleteRows.sql Deletes the configuration information inserted by insertRows1.sql or insertRows2.sql.
deleteAggRows.sql Deletes the contents of the FACT_AGG, FACT_AGG_DEF, DIM_AGG, and DIM_AGG_COL metadata tables.
dropTables.sql Drops the metadata tables.
createOrdersSchema.sql Creates a local copy of the orders data warehouse in your Oracle account.
parse.cpp C++ code for parsing SQL queries
Parse.java Java code for parsing SQL queries

SQL Query Format

In the third phase of Assignment #3, you implement a query re-writer that performs aggregate navigation. In order to do this, you will need to be able to parse a SQL query that is passed to your program on stdin. Your program only needs to be able to parse SQL queries that have a very specific format; you may assume that your program is only called with properly-formatted queries. Here is an example of the query format:

SELECT
DATEDIM.MONTH_NAME,PRODUCT.BRAND,SUM(SALES.QUANTITY),SUM(SALES.DOLLARSALES)
FROM
SALES,DATEDIM,PRODUCT,STORE
WHERE
SALES.DATE_KEY = DATEDIM.DATE_KEY
AND
SALES.PRODUCT_KEY = PRODUCT.PRODUCT_KEY
AND 
SALES.STORE_KEY = STORE.STORE_KEY
AND
STORE.REGION = 'Northern California'
AND
PRODUCT.DEPT = 'Produce'
GROUP BY
DATEDIM.MONTH_NAME, PRODUCT.BRAND

Some properties of the format include: keywords, table names, and column names are in all capital letters; keywords such as SELECT appear on their own line; all columns are fully referenced in table.column form; grouping columns come before measurements in the select list; all aggregates use SUM; the fact table comes first in the from list; and join conditions come before filters conditions in the where clause. You may use the parsing code provided in the files Parse.java or parse.c, or adapt the code to the language of your choice. The sample code files are available from the directory /usr/class/cs345/HW3.

Star Schema for Testing

In addition to testing your code on the Baseball data warehouse from Assignment #2, using the configuration rows created by the insertRows1.sql script, you can also test your code on a second data warehouse that is provided for this assignment. To create your own copy of the Orders data warehouse, run the script createOrdersSchema.sql. The script will create a fact table and four dimension tables and populate them with data. To load configuration rows for the Orders data warehouse into the metadata tables, run the script insertRows2.sql. If you have already run insertRows1.sql, then you will need to run deleteRows.sql before running insertRows2.sql. A description of the dimension tables and fact measurement columns in the Orders data warehouse follows.

orders_fact The orders fact table tracks orders of spare parts. The grain of the fact table is one row per order. An order is placed by a single customer, for a single part, from a single supplier. There are four dimensions: customer, part, supplier, and date. The fact table has four measurement columns: o_quantity, o_extendedprice, o_discount, o_tax.

customer_dim The customer dimension includes the following attributes: name, address, nation, region, phone, account balance, market segment, and comment.

part_dim The part dimension includes the following attributes: name, manufacturer, brand, type, size, container, retail price, and coment.

supplier_dim The supplier dimension includes the following attributes: name, address, nation, region, phone, account balance, and comment.

date_dim The date dimension includes the following attributes: date value (a SQL date column), month name, month number, year, day of week, and day number in month.

Aggregate Selection Algorithm

The first phase of the assignment involves implementing a greedy aggregate selection algorithm.

The pseudocode for the greedy selection algorithm is as follows:

Let S = { (the fact table) } .
Let C = the set of possible fact aggregate tables.
while |S| < (maximum number of aggregates + 1) do
   Let Best = S.
   Let BestCost = Infinity.
   for each fact aggregate F in C do
      Let S' = S union {F}.
      Let Cost = 0.
      for each query Q in the query workload do
         Let A = (smallest table in S' that can be used to answer Q).
         Cost = Cost + (number of rows in A).
      end for
      if (Cost < BestCost)
         Let Best = S'.
         Let BestCost = Cost.
      end if
   end for
   Let S = Best.
   C = C minus Best.
end while
return S.

A few steps of the algorithm deserve to be described in more detail.

What is "the set of possible fact aggregate tables"?
When can a fact aggregate table be used to answer a query?
How can you tell how many rows a fact aggregate table will have?

What is "the set of possible fact aggregate tables"?

A fact aggregate table can be specified by listing, for each dimension, the way in which that dimension is included in the fact aggregate. There are three choices for how a dimension D may be included in a fact aggregate table F: (1) D is included in its entirety, meaning that F includes a foreign key to the dimension table. (2) D is included in aggregated form, meaning that F includes a foreign key to an aggregated version of D. (3) D is omitted entirely, meaning that F has no foreign key to D, nor does it have a foreign key to any aggregated version of D.

What is meant by "an aggregated version of D"? A dimension aggregate table can be specified by listing the columns that are included in the table's schema (other than the surrogate key). This will always be a subset of the columns in the dimension table from which the aggregate is derived (not including the surrogate key). A dimension aggregate table has one row for each combination of those attributes. For example, the Team dimension table from Assignment #2 had 5 columns (Team_key, City, TeamName, League, and Division) and 30 rows. One possible dimension aggregate table for the Team dimension would include the columns League and Division (but not City or TeamName). Here is what the aggregate table would look like:

Team_key     League     Division
--------     ------     --------
1            American   AL East
2            American   AL Central
3            American   AL West
4            National   NL East
5            National   NL Central
6            National   NL West

There are only 6 distinct combinations of League and Division in the Team dimension, so this dimension aggregate table would only have 6 rows. The surrogate keys stored in the Team_key column of this aggregate table are NOT related to the surrogate keys from the Team dimension table in any way; in other words, there is no correspondence between the aggregate table row that has a key of 3 and the dimension table row that has a key of 3.

Notice that for a dimension table that has n dimension columns (other than the surrogate key), there are 2^n - 1 possible dimension aggregates. For dimension tables that have a lot of columns, this is simply too many aggregates to consider. Thus, part of the input to your aggregate selection program is a set of "candidate columns sets" that are used to limit the dimension aggregates being considered to a manageable set. Your algorithm should only consider dimension aggregates consisting of sets of columns that are unions of one or more candidate column sets from that dimension. In other words, consider a dimension that has 6 columns (A, B, C, D, E, and F) but only three candidate column sets (A, BC, and D). The dimension aggregates you should consider for this dimension would be those that group by the following sets of columns: A, BC, D, ABC, AD, BCD, and ABCD.

So how large is the set of fact aggregate tables considered by the algorithm? That depends on the number of dimensions and the number of candidate column sets for each dimension. A dimension that has m candidate column sets can be included in a fact aggregate in 2^m + 1 different ways: the fact aggregate table could have a foriegn key to any of 2^m - 1 different dimension aggregates, or it could have a foreign key to the original dimension table, or it could have no foreign key to that dimension or any of its aggregates.

As an example, consider a fact table that has three dimensions, and those dimensions have 2, 3, and 4 candidate column sets respectively. There are 5*9*17=765 possible fact aggregates that could be built for this fact table.

When can a fact aggregate table be used to answer a query?

A fact aggregate table can be used to answer a query when all columns referenced in that query are included either in the schema of the fact aggregate table itself, or else in one of the dimension tables or dimension aggregate tables to which the fact aggregate table has foreign keys. For the purposes of this assignment, we are assuming that all measurement columns from the original fact table will be included in all fact aggregate tables. Therefore, in determining whether a fact aggregate can be used to answer a query, we only need to be concerned with the dimension columns that are used as grouping or filtering attributes. For this reason, the metadata specification for the queries in the query workload only lists the dimension columns referenced by each query, and not the measurement columns referenced by the queries.

How can you tell how many rows a fact aggregate table will have?

The most direct way is to simply query the database to learn how many distinct combinations of the relevant dimension attributes are present in the fact table. For example, consider a potential fact aggregate built over the schema from Assignment #2 that includes a foreign key to the original Team dimension, has no foreign key to the Player dimension or any of its aggregates, and includes a foreign key to a Date dimension aggregate with the columns (Month_name, Month_num, Day_of_week, and Weekend_flag). The following SQL query will return the number of rows that would be in that fact aggregate:

SELECT COUNT(*) FROM 
(
   SELECT DISTINCT Batting.Team_key, Month_name, Month_num,
                   Day_of_week, Weekend_flag
   FROM Batting 
   JOIN Team on Batting.Team_key = Team.Team_key
   JOIN DateDim on Batting.Date_key = DateDim.Date_key
)

Creating the Aggregate Tables

In the second phase of the assignment, you will write a program to create and populate the aggregate tables that were selected in the first phase. Your program for the second phase will read the FACT_AGG, FACT_AGG_DEF, DIM_AGG, and DIM_AGG_COL metadata tables to learn which aggregate tables to create.

One way to organize your program is as follows: first create all the dimension aggregates based on the information from the DIM_AGG and DIM_AGG_COL tables. Then create all the fact aggregates using the information from the FACT_AGG and FACT_AGG_DEF tables. One reasonable way to name your aggregate tables is to use the name of the base dimension or fact table, with the aggregate ID afterwards. In other words, a dimension aggregate table for the Team dimension that had DIM_AGG_ID = 4 would be named Team4. However, you are welcome to use an alternative naming scheme if you prefer.

When creating the dimension aggregate tables, you'll need to be sure that the dimension attribute columns have the correct data types. There are two ways to do this:

Query the system catalog table Oracle stores information about the names and types of tables and columns in its system catalog. You can query the COLS table to learn the data type of each column, like this:
```
SELECT DATA_TYPE, CHAR_COL_DECL_LENGTH
FROM COLS
WHERE TABLE_NAME = 'MYDIM'
AND COLUMN_NAME = 'DIMCOL1'
```
The DATA_TYPE column returns the type of the column, and CHAR_COL_DECL_LENGTH returns the length in case the column is a CHAR or VARCHAR column.
Use the CREATE TABLE...AS SELECT statement You can use the "create table...as select..." statement in Oracle to create a table that is similar to another table without needing to specify the column data types. For example:
```
CREATE TABLE MyDimAgg AS
SELECT MyDimKey, DimCol1, DimCol2, DimCol5
FROM MyDim
WHERE 0=1
```
The above SQL statement will create a table named "MyDimAgg" that has columns named "MyDimKey", "DimCol1", "DimCol2", and "DimCol5". These columns will have the same data types as the corresponding columns in the "MyDim" table.

Results on Test Data

After running your aggregate selection algorithm on the baseball data warehouse using the metadata rows created by the insertRows1.sql script, the following should be the fact aggregates selected by your algorithm:

SIZE: 1202 TEAM:ALL DATEDIM:MONTH_NAME,MONTH_NUM PLAYER:ALL
SIZE: 299 DATEDIM:DAY_OF_WEEK,WEEKEND_FLAG PLAYER:POSITION,BATS,THROWS
SIZE: 116 TEAM:LEAGUE,DIVISION DATEDIM:MONTH_NAME,MONTH_NUM PLAYER:POSITION
SIZE: 684 TEAM:ALL PLAYER:ALL
SIZE: 256 TEAM:ALL DATEDIM:MONTH_NAME,MONTH_NUM PLAYER:BATS,THROWS

The file aggRows1.sql in the /usr/class/cs345/HW3 directory will populate the metadata tables FACT_AGG, FACT_AGG_DEF, DIM_AGG, and DIM_AGG_COL with the rows corresponding to the above solution. You can run the aggRows1.sql script if your Phase 1 program isn't working yet, but you want to work on Phase 2 or Phase 3 of the assignment. Here is another possibility for what your algorithm might have returned:

SIZE: 116 TEAM:LEAGUE,DIVISION DATEDIM:MONTH_NAME,MONTH_NUM PLAYER:POSITION
SIZE: 299 DATEDIM:DAY_OF_WEEK,WEEKEND_FLAG PLAYER:POSITION,BATS,THROWS
SIZE: 282 PLAYER:BATS,THROWS,COLLEGE
SIZE: 12 TEAM:LEAGUE,DIVISION DATEDIM:MONTH_NAME,MONTH_NUM
SIZE: 58 TEAM:DIVISION PLAYER:POSITION

If your algorithm returned the second result, that means that you are not considering fact aggregates that include one or more original dimensions in their entirety. The assignment handout was a little bit unclear on whether that type of fact aggregate should be included in your search. Therefore, if your implementation doesn't consider that class of fact aggregate, it is still acceptable.

Note: When you run the aggregate selection algorithm on the test data from the orders data warehouse, it will likely take a long time (like 3 hours or so) to compute the sizes of all the fact aggregates. If you prefer, you can test your code on the configuration from the insertRows3.sql script instead of the insertRows2.sql script. The insertRows3.sql script has fewer candidate column sets, so it doesn't take quite as long to run (about 1 hour instead of about 3 hours).

After running your aggregate selection algorithm on the test data produced by the insertRows2.sql script, you should get the following results:

SIZE: 7990 PART_DIM:P_BRAND,P_MFGR,P_CONTAINER,P_SIZE SUPPLIER_DIM:ALL
SIZE: 9865 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_BRAND,P_MFGR DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR
SIZE: 4817 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_CONTAINER SUPPLIER_DIM:S_NATION
SIZE: 13298 CUSTOMER_DIM:C_NATION,C_REGION DATE_DIM:ALL
SIZE: 15335 PART_DIM:P_BRAND,P_MFGR SUPPLIER_DIM:ALL DATE_DIM:D_YEAR

The file aggRows2.sql in the /usr/class/cs345/HW3 directory will populate the metadata tables FACT_AGG, FACT_AGG_DEF, DIM_AGG, and DIM_AGG_COL with the rows corresponding to the above solution. You can run the aggRows2.sql script if your Phase 1 program isn't working yet, but you want to work on Phase 2 or Phase 3 of the assignment. Alternatively, if your algorithm is not considering fact aggregates that use the original base dimension tables, you should get the following result:

SIZE: 9865 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_BRAND,P_MFGR DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR
SIZE: 27417 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_CONTAINER,P_SIZE SUPPLIER_DIM:S_NATION
SIZE: 31573 PART_DIM:P_BRAND,P_MFGR SUPPLIER_DIM:S_NATION DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR
SIZE: 1258 PART_DIM:P_CONTAINER,P_SIZE
SIZE: 4817 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_CONTAINER SUPPLIER_DIM:S_NATION

Here is the result for the insertRows3.sql script:

SIZE: 9865 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_MFGR,P_BRAND DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR
SIZE: 11494 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_MFGR,P_BRAND SUPPLIER_DIM:ALL
SIZE: 4304 PART_DIM:P_MFGR,P_BRAND SUPPLIER_DIM:S_NATION,S_REGION DATE_DIM:D_YEAR
SIZE: 13298 CUSTOMER_DIM:C_NATION,C_REGION DATE_DIM:ALL
SIZE: 31573 PART_DIM:P_BRAND SUPPLIER_DIM:S_NATION,S_REGION DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR

Here is the result when fact aggregates involving base dimension tables are excluded:

SIZE: 9865 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_MFGR,P_BRAND DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR
SIZE: 18305 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_BRAND SUPPLIER_DIM:S_NATION,S_REGION DATE_DIM:D_YEAR
SIZE: 31573 PART_DIM:P_BRAND SUPPLIER_DIM:S_NATION,S_REGION DATE_DIM:D_MONTHNAME,D_MONTHNUM,D_YEAR
SIZE: 3095 CUSTOMER_DIM:C_MKTSEGMENT PART_DIM:P_MFGR,P_BRAND SUPPLIER_DIM:S_NATION,S_REGION
SIZE: 4304 PART_DIM:P_BRAND SUPPLIER_DIM:S_NATION,S_REGION DATE_DIM:D_YEAR