CS206 Assignment #4

Due Tuesday, May 7, 2002, in class

Note: The honor-code rules are the same as for the previous assignments. Also, please show work for partial credit.

  1. (25 pts.) Four different "experts" have different tastes in music, and have each rated CD's as "like" (represented by 1) or "don't like" (represented by 0. Sally Student also knows what she likes, and would like to use the ratings of the four experts to decide whether it is probable she would like a CD. Here are the ratings of the four experts for CD's that Sally likes: 0101, 1111, 0011, 1101, 0100, 0111 (i.e., the ith bit indicates whether the ith expert liked the CD). Here are ratings by the same experts for some CD's that Sally doesn't like: 1010, 0000, 1100, 1011, 0010, 1101.

    If we wish to design a decision tree to make a decision for Sally, which attribute (i.e., opinion of which expert) should we put at the root? Show the entropies of each attribute for each side, and pick the attribute whose maximum entropy (on either side) is minimum.

    If we are willing to have a second level in the tree, which attribute should we use to make a decision at each child of the root? What outcomes (good or bad) do we place at the leaves? How many of the 12 given CD's are misclassified by this tree?

  2. (75 pts.; 15 each) We have now seen several rather different forms of data-mining: finding frequent itemsets, finding highly correlated pairs, clustering, and building decision trees. Different problems are best solved using one or another of these methods (or will yield to none of them). Below we give you five situations where you wish to extract some information from data. Suggest which method, if any, would be most appropriate for each situation. Note: there are no right or wrong answers for this problem. Just answers that will receive credit and answers that won't. Seriously --- we'll be generous with the grading, as long as you give some reasoning with your answer; i.e., don't just say "clustering" or something like that.

    The State Department has electronic records of which countries each US passport holder has traveled to in the past 10 years. The records of 1000 suspected terrorists and 1000 people believed not to be terrorists are extracted, and it is desired to find a way to tell whether a person is a terrorist by examining the set of countries they have traveled to.

    Orbitz.com has records of the locations traveled to by all their customers (orbitz is an on-line travel service owned by the airlines). They wish to determine, based on the places a customer has traveled, what sale or deal they would be most likely to respond positively to.

    Amazon.com wishes to offer tie-in deals (which they actually do): when you examine book A, they offer you "buy books A and B for the total of only X dollars" (which seems always to be the sum of the prices of the two books, but that's another matter). How could they select good tie-in deals?

    EBay wants to improve its classification of items for sale by examining its records of which items were bid on by the same customers and which items were examined (but not bid on) by the same customer.

    ETrade would like to find pairs of stocks such that when the first rises at some time, the second is unusually likely to rise within a short time.