1. An IR system returns eight relevant documents and ten non-relevant documents. There are a total of twenty relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
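To check your arithmetic, the two definitions can be applied directly: precision divides by what was retrieved, recall by what is relevant in the whole collection. A minimal sketch in Python:

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Precision = relevant retrieved / total retrieved;
    recall = relevant retrieved / total relevant in the collection."""
    precision = retrieved_relevant / retrieved_total
    recall = retrieved_relevant / relevant_total
    return precision, recall

# The system retrieved 8 relevant + 10 non-relevant = 18 documents,
# out of 20 relevant documents in the collection.
p, r = precision_recall(8, 18, 20)
print(p, r)
```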
2. Draw the inverted index that would be built for the following document collection.
Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise
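One way to check your drawing is to build the index programmatically: map each term to the sorted list of docIDs containing it. A small sketch:

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Dictionary: term -> postings list (sorted docIDs containing the term)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

for term in sorted(index):
    print(f"{term} -> {sorted(index[term])}")
```

Your drawn index should have one postings list per distinct term, e.g. "home" and "sales" pointing to all four documents.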
3. Compute cosines to find out whether Doc1, Doc2, or Doc3 will be ranked highest for the two-word query "Linus pumpkin", given these counts for the (only) 3 documents in the corpus:

term      Doc1   Doc2   Doc3
----------------------------
Linus       10      0      1
Snoopy       1      4      0
pumpkin      4    100     10
Do this by computing the tf-idf cosine between the query and each of Doc1, Doc2, and Doc3, and choosing the highest value. Use the ltc.lnn weighting variation (remember that's ddd.qqq),
using the following table:
It might help to look at this useful handout.
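The weighting table itself isn't reproduced in this excerpt, but assuming the standard SMART definitions of ltc.lnn (documents: logarithmic tf, idf, cosine normalization; query: logarithmic tf, no idf, no normalization), a sketch for checking a hand computation might look like:

```python
import math

N = 3  # documents in the corpus
tf = {  # raw term counts per document, from the table above
    "Linus":   {1: 10, 2: 0,   3: 1},
    "Snoopy":  {1: 1,  2: 4,   3: 0},
    "pumpkin": {1: 4,  2: 100, 3: 10},
}
df = {t: sum(1 for c in tf[t].values() if c > 0) for t in tf}

def ltc(doc_id):
    # l: 1 + log10(tf) for tf > 0;  t: idf = log10(N/df);  c: cosine-normalize
    w = {t: (1 + math.log10(tf[t][doc_id])) * math.log10(N / df[t])
            if tf[t][doc_id] > 0 else 0.0
         for t in tf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

def lnn(query):
    # l: 1 + log10(tf);  n: no idf;  n: no normalization
    counts = {}
    for t in query.split():
        counts[t] = counts.get(t, 0) + 1
    return {t: 1 + math.log10(c) for t, c in counts.items()}

q = lnn("Linus pumpkin")
scores = {d: sum(q.get(t, 0.0) * ltc(d)[t] for t in tf) for d in (1, 2, 3)}
print(scores)
```

Note that "pumpkin" occurs in all three documents, so its idf is log10(3/3) = 0 and it contributes nothing to any document vector under the t weighting.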
Personalization is an important topic in information retrieval; after all, we'd like our search results to be relevant to us and our interests. However, as with many other tasks involving people's personal data, this has ethical implications. Do the following in your group:
(b) Now let's talk ethics:
What is the Euclidean distance between A' and B' (using raw term frequency)?
What is the cosine similarity between A' and B' (using raw term frequency)?
What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?
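A', B', and their term frequencies are defined earlier in the problem, so the numbers below are only illustrative. The key contrast is that Euclidean distance is sensitive to vector length while cosine similarity depends only on direction; a sketch with hypothetical vectors (B' chosen as a scaled copy of A', as when a long document repeats a short one's term proportions):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical raw term-frequency vectors: same direction, different length.
A = [1, 2, 0]
B = [10, 20, 0]
print(euclidean(A, B))  # large: the vectors are far apart in space
print(cosine(A, B))     # 1.0: the vectors point the same way
```

Substitute the actual A' and B' vectors from the problem; the qualitative point is why cosine similarity is usually preferred over Euclidean distance for comparing documents of different lengths.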