CS206 Assignment #4
Due Tuesday, May 7, 2002, in class
Note: The honor-code rules are the same as for the previous
Also, please show work for partial credit.
- (25 pts.)
Four different "experts" have different tastes in music, and have each
rated CD's as "like" (represented by 1) or "don't like" (represented by
Sally Student also knows what she likes, and would like to use the
ratings of the four experts to decide whether it is probable she would
like a CD.
Here are the ratings of the four experts for CD's that Sally likes:
0101, 1111, 0011, 1101, 0100, 0111 (i.e., the ith bit indicates
whether the ith expert liked the CD).
Here are ratings by the same experts for some CD's that Sally doesn't
like: 1010, 0000, 1100, 1011, 0010, 1101.
- If we wish to design a decision tree to make a decision for
Sally, which attribute (i.e., opinion of which expert) should we put at
Show the entropies of each attribute for each side, and pick the
attribute whose maximum entropy (on either side) is minimum.
- If we are willing to have a second level in the tree, which
attribute should we use to make a decision at each child of the root?
What outcomes (good or bad) do we place at the leaves?
How many of the 12 given CD's are misclassified by this tree?
(75 pts.; 15 each)
We have now seen several rather different forms of data-mining: finding
itemsets, finding highly correlated pairs, clustering, and building
Different problems are best solved using one or another of these methods
(or will yield to none of them).
Below we give you five situations where you wish to extract some
information from data.
Suggest which method, if any, would be most appropriate for each
Note: there are no right or wrong answers for this problem.
Just answers that will receive credit and answers that won't.
Seriously --- we'll be generous with the grading, as long as you give some
reasoning with your answer; i.e., don't just say "clustering" or
something like that.
- The State Department has electronic records of which countries
each US passport holder has traveled to in the past 10 years.
The records of 1000 suspected terrorists and 1000 people believed not to
be terrorists are extracted, and it is desired to find a way to tell
whether a person is a terrorist by examining the set of countries they
have traveled to.
- Orbitz.com has records of the locations traveled to by all
their customers (orbitz is an on-line travel service owned by the
They wish to determine, based on the places a customer has traveled,
what sale or deal they would be most likely to respond positively to.
- Amazon.com wishes to offer tie-in deals (which they actually
do): when you examine book A, they offer you "buy books A
and B for the total of only X dollars" (which seems always
to be the sum of the prices of the two books, but that's another
How could they select good tie-in deals?
- EBay wants to improve its classification of items for sale by
examining its records of which items were bid on by the same customers
and which items were examined (but not bid on) by the same customer.
- ETrade would like to find pairs of stocks such that when the
first rises at some time, the second is unusually likely to rise within
a short time.