# CS206 Assignment #2

### Due Tuesday, April 23, 2002, in class

Note: We'll follow the standard CSD rules regarding the honor code. You are expected to do your own work, but if you discuss the homework with anyone and use their ideas in any way in your own homework, then you must mention that fact on the homework. If we find pairs of homeworks that look too similar, and neither acknowledges the other's contribution, we will assume there is an honor-code violation. Of course, we reserve the right to adjust grades if your acknowledgement indicates that you didn't really learn anything and merely copied someone else's work; but at least you won't have to face the HC commission.
1. (25 pts.) Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Answer the following:

a)
What is the support for the set of items {3,4}?

b)
If the support threshold is 10, what are all the frequent items?

c)
What triple of items has the highest support? (Think carefully --- your first instinct may not be right.)

d)
Are there any association rules that have 100% confidence, but do not have item 1 on the right? Justify your answer.

2. (25 pts.) Suppose items are numbered 0,1,...,n-1, where n is the number of items. We mentioned in our discussion of the a-priori algorithm that it was possible to create an array of counts for pairs, so that given pair {i,j}, it is possible to calculate easily the array entry for this pair. This technique allows us to devote only 4 bytes per pair in main memory to a count for that pair (assuming 4 bytes/integer is sufficient).

Give a method for assigning array elements to pairs, such that one can figure out the index of the element for a pair, from the two item numbers (integers) in the pair. Note that a pair must be associated with the same array element regardless of the order in which the pair's two integers are presented, and the array must not have wasted space; i.e., each array element is assigned to a unique pair.

3. (25 pts.) Suppose we wish to find frequent pairs of items using the a-priori algorithm under the following assumptions:

• s, the support threshold, is 10,000.
• There are one million items, which are represented by the integers 0,1,...,999999.
• There are 25,000 frequent items, that is, items that occur 10,000 times or more.
• There are one million pairs that occur 10,000 times or more.
• There are one billion pairs that occur exactly once. Half of these pairs consist of two frequent items, the other half each have at least one nonfrequent item.
• No other pairs occur at all.
• Integers are always represented by 4 bytes.

Estimate the minimum amount of main memory needed to run the a-priori algorithm successfully. Show your reasoning. You should begin with an estimate of the memory needed for the first pass. Then, consider how counts for pairs should be organized in the second pass. Do you want to renumber the frequent items and use the technique of Problem 2? If so, explain how the translation of numbers is stored, and how much space it takes. Or would you prefer to maintain a count for only those pairs of frequent items that actually occur in the data? If so, explain how you will do so, and how much space it takes.

4. (25 pts.) Repeat Question (3) for the PCY algorithm. You may assume that when you hash pairs on the first pass:

• Frequent pairs each hash to a different bucket.
• Infrequent pairs divide exactly evenly among buckets, including those that contain a frequent pair.
• Each infrequent pair (in particular, a pair that consists of two frequent items) is equally likely to belong to any bucket.

You should explain your reasoning. For example:

a)
How many buckets will you use on the first pass, and how much space do they take?
b)
How many candidate pairs will there be on the second pass?
c)
How do you store and lookup counts of candidate pairs on the second pass?
d)
How much space is needed to maintain those counts?
e)
What other space is needed on the second pass?

Note: you do not have to answer the above as if they were five parts to a question. However, you should think about each of these issues, and others, as part of your overall response to the question. Also, observe that the more buckets you use on the first pass, the more space you need, but the fewer candidates you have on the second pass. The minimum space utilzation will occur when the space needs on the two passes are equal.