CS206 Assignment #3
Due Tuesday, April 30, 2002, in class
Note: The honorcode rules are the same as for the previous
assignments.
Also, please show work for partial credit.
 (35 pts.)
Imagine a matrix with 1,000,000 columns.
Each column has exactly three 1's.
There are 1000 pairs of columns that are actually 50% similar (i.e.,
they share two rows), 1,000,000
pairs that are 20% similar (i.e., they share
one row), and the rest of the pairs have no row in
common.
Suppose we compute minhash signatures consisting of 8 independent
hashings, and then use localitysensitive hashing to determine candidate
pairs to be checked for similarity.
 a)

One LSH scheme uses four bands of two rows each to find candidate pairs
of signatures.
Give the probability that a pair will be a candidate if it is 50%, 20%,
or 0% similar.
How many candidate pairs of each similarity do we expect to find?
 b)

Repeat (a) if the LSH scheme is two bands of four rows each.
 c)

If our goal is to find all and only the 50%similar pairs, which LSH
scheme would you prefer?
 (25 pts.)
Suppose our points lie in a line (i.e., a 1dimensional space), at the
prime points 2, 3, 5, 7, 11, 13, 17, 19, 23, 29.
Distance is the usual Euclidean distance, i.e., the magnitude of the
difference between two numbers.
Suppose we cluster points hierarchically, using the centroid (i.e., the
average) to represent the location of a cluster.
Find the clusters when only three clusters remain.
Note: break ties for shortest distance in favor of the pair of clusters
to the left (i.e., low end of the line), and it is OK if centroids are
not at integer points.
 (40 pts.)
Here are five vectors representing purchases of books by 10 customers:
A = 0101010101, B = 0011001100, C = 1110010001, D
= 1001001001, and E = 0001111100.
 a)

Give the table of cosine distances between the vectors.
Note: It is sufficient to represent angles by their cosines.
The cosines themselves don't obey the triangle inequality, like the
angles themselves do.
However, we only need to compare angles, and between 0 and 90 degrees,
larger angles have larger cosines, so we can do the comparisons without
ever knowing the angles themselves.
Thus, you may compute the cosines of angles rather than the angles.
 b)

Which three vectors form a cluster with the best "cohesion," which we
define here to be the maximum distance of a node to the clustroid of the
cluster?
Recall that the clustroid is the node with the minimum maximum distance
to the other nodes.
 c)

Repeat (a) for the Jaccard distance (i.e., 1 minus the Jaccard
measure).
You need not repeat (b).