CS 124 / LING 180 From Languages to Information, Dan Jurafsky, Winter 2019
Week 5: Group Exercises on IR

Part 1: Group Exercise

    1. An IR system returns eight relevant documents and ten non-relevant documents. There are a total of twenty relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
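
One way to check a hand calculation for exercise 1 is a minimal Python sketch of the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of all relevant documents that were retrieved. The function name `precision_recall` and its parameter names are illustrative assumptions, not part of the handout.

```python
def precision_recall(relevant_retrieved, nonrelevant_retrieved, total_relevant):
    """Return (precision, recall) for a single search."""
    # Everything the system returned, relevant or not.
    retrieved = relevant_retrieved + nonrelevant_retrieved
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall


# Counts taken from exercise 1: 8 relevant and 10 non-relevant documents
# returned, 20 relevant documents in the whole collection.
p, r = precision_recall(8, 10, 20)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```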

    2. Draw the inverted index that would be built for the following document collection.

          Doc 1: new home sales top forecasts
          Doc 2: home sales rise in july
          Doc 3: increase in home sales in july
          Doc 4: july new home sales rise
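
A drawing of the index for exercise 2 can be compared against a small Python sketch like the one below. It assumes whitespace tokenization, no stemming or stop-word removal, and postings lists that hold only sorted document IDs; the name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

# The four documents from exercise 2, keyed by document ID.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # assumption: split on whitespace only
            index[term].add(doc_id)
    # Sort the dictionary by term and each postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


index = build_inverted_index(docs)
for term, postings in index.items():
    print(f"{term:10s} -> {postings}")
```

Using a set per term means a word repeated within one document (such as "in" in Doc 3) contributes a single posting.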

    3. Compute cosines to find out which of Doc1, Doc2, or Doc3 will be ranked highest for the two-word query "Linus pumpkin", given these counts for the (only) 3 documents in the corpus:

        term     Doc1  Doc2  Doc3
        Linus      10     0     1
        Snoopy      1     4     0
        pumpkin     4   100    10

    Do this by computing the tf-idf cosine between the query and Doc1, the cosine between the query and Doc2, and the cosine between the query and Doc3, and choose the highest value. You should use the ltc.lnn weighting variation (remember that's ddd.qqq), using the following table:
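
For checking hand-computed scores in exercise 3, here is a minimal Python sketch of ltc.lnn scoring. It assumes base-10 logarithms, log tf = 1 + log10(tf) for tf > 0, idf = log10(N/df) with N = 3, and a query in which each word occurs once. Because the query side is lnn (no length normalization), the score is a dot product that differs from a true cosine by a constant factor, which does not change the ranking. All function names are illustrative.

```python
import math

# Raw term counts from exercise 3, one column per document (Doc1, Doc2, Doc3).
counts = {
    "Linus":   [10, 0, 1],
    "Snoopy":  [1, 4, 0],
    "pumpkin": [4, 100, 10],
}
query = {"Linus": 1, "pumpkin": 1}  # assumption: each query word occurs once
N = 3  # number of documents in the corpus


def log_tf(tf):
    """'l' weighting: 1 + log10(tf), or 0 when the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def ltc_doc_vector(doc_idx):
    """Document weights: log tf ('l') times idf ('t'), cosine-normalized ('c')."""
    weights = {}
    for term, tfs in counts.items():
        df = sum(1 for tf in tfs if tf > 0)  # document frequency
        idf = math.log10(N / df)
        weights[term] = log_tf(tfs[doc_idx]) * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / norm if norm > 0 else 0.0) for t, w in weights.items()}


def lnn_query_vector(query):
    """Query weights: log tf ('l'), no idf ('n'), no normalization ('n')."""
    return {t: log_tf(tf) for t, tf in query.items()}


def score(doc_idx):
    d = ltc_doc_vector(doc_idx)
    q = lnn_query_vector(query)
    return sum(q[t] * d.get(t, 0.0) for t in q)


scores = [score(i) for i in range(N)]
for i, s in enumerate(scores, start=1):
    print(f"Doc{i}: {s:.3f}")
```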

Part 2: Challenge Problems

  1. Do modern web search engines use stemming? If so, are all suffixes removed or just some of them? How do search engines deal with Boolean operators such as OR and AND? Do some experimenting with Google, Bing, DuckDuckGo, or your favorite search engine.

  2. Consider two documents A and B whose Euclidean distance is d and whose cosine similarity is c (both computed on raw term-frequency vectors, with no other weighting or normalization). If we create a new document A' by appending A to itself and another document B' by appending B to itself, then:

    1. What is the Euclidean distance between A' and B' (using raw term frequency)?

    2. What is the cosine similarity between A' and B' (using raw term frequency)?

    3. What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?
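
Challenge problem 2 can be explored numerically with a short Python sketch. Appending a document to itself doubles every raw term count, so the sketch doubles two term-frequency vectors and compares both measures before and after. The vectors `A` and `B` are made-up values chosen only for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical raw term-frequency vectors over a shared 4-term vocabulary.
A = [3, 0, 2, 1]
B = [1, 4, 0, 2]

# Appending a document to itself doubles each raw term count.
A2 = [2 * x for x in A]
B2 = [2 * x for x in B]

print(f"distance: before = {euclidean(A, B):.3f}, after = {euclidean(A2, B2):.3f}")
print(f"cosine:   before = {cosine(A, B):.3f}, after = {cosine(A2, B2):.3f}")
```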

  3. Is it important to remove stop words in a system that uses idf in its weighting scheme?
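
As a starting point for challenge problem 3, the sketch below computes idf for every term in a toy collection so the weights of very common words can be inspected directly. It reuses the four documents from Part 1, exercise 2, and assumes idf = log10(N/df) with whitespace tokenization; the name `idf_table` is hypothetical.

```python
import math

# Toy collection borrowed from Part 1, exercise 2.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}


def idf_table(docs):
    """Return {term: log10(N / df)} for every term in the collection."""
    n = len(docs)
    df = {}
    for text in docs.values():
        # set() so a term counts once per document, however often it repeats.
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(n / count) for term, count in sorted(df.items())}


idf = idf_table(docs)
for term, value in idf.items():
    print(f"{term:10s} idf = {value:.3f}")
```

Comparing the value for a word that occurs in every document with the value for a function word such as "in", which occurs in only some of them, is one way to ground the discussion.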