CS206 Assignment #3

Due Tuesday, April 30, 2002, in class

Note: The honor-code rules are the same as for the previous assignments. Also, please show work for partial credit.

  1. (35 pts.) Imagine a matrix with 1,000,000 columns. Each column has exactly three 1's. There are 1000 pairs of columns that are actually 50% similar (i.e., they share two rows), 1,000,000 pairs that are 20% similar (i.e., they share one row), and the rest of the pairs have no row in common. Suppose we compute min-hash signatures consisting of 8 independent hashings, and then use locality-sensitive hashing to determine candidate pairs to be checked for similarity.

    a)
    One LSH scheme uses four bands of two rows each to find candidate pairs of signatures. Give the probability that a pair will be a candidate if it is 50%, 20%, or 0% similar. How many candidate pairs of each similarity do we expect to find?

    b)
    Repeat (a) if the LSH scheme is two bands of four rows each.

    c)
    If our goal is to find all and only the 50%-similar pairs, which LSH scheme would you prefer?

  2. (25 pts.) Suppose our points lie in a line (i.e., a 1-dimensional space), at the prime points 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. Distance is the usual Euclidean distance, i.e., the magnitude of the difference between two numbers. Suppose we cluster points hierarchically, using the centroid (i.e., the average) to represent the location of a cluster. Find the clusters when only three clusters remain. Note: break ties for shortest distance in favor of the pair of clusters to the left (i.e., low end of the line), and it is OK if centroids are not at integer points.

  3. (40 pts.) Here are five vectors representing purchases of books by 10 customers: A = 0101010101, B = 0011001100, C = 1110010001, D = 1001001001, and E = 0001111100.

    a)
    Give the table of cosine distances between the vectors. Note: It is sufficient to represent angles by their cosines. The cosines themselves don't obey the triangle inequality, like the angles themselves do. However, we only need to compare angles, and between 0 and 90 degrees, larger angles have larger cosines, so we can do the comparisons without ever knowing the angles themselves. Thus, you may compute the cosines of angles rather than the angles.

    b)
    Which three vectors form a cluster with the best "cohesion," which we define here to be the maximum distance of a node to the clustroid of the cluster? Recall that the clustroid is the node with the minimum maximum distance to the other nodes.

    c)
    Repeat (a) for the Jaccard distance (i.e., 1 minus the Jaccard measure). You need not repeat (b).