1. An IR system returns eight relevant documents and ten non-relevant documents. There are a total of twenty relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
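The arithmetic can be cross-checked with a short Python sketch of the standard definitions (precision = relevant retrieved / total retrieved; recall = relevant retrieved / total relevant in the collection):

```python
def precision_recall(rel_retrieved, nonrel_retrieved, total_relevant):
    # Precision: fraction of retrieved documents that are relevant
    # Recall: fraction of all relevant documents that were retrieved
    retrieved = rel_retrieved + nonrel_retrieved
    return rel_retrieved / retrieved, rel_retrieved / total_relevant

# 8 relevant and 10 non-relevant retrieved; 20 relevant in the collection
p, r = precision_recall(8, 10, 20)
print(p, r)  # precision = 8/18, recall = 8/20
```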
2. Draw the inverted index that would be built for the following document collection.
Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise
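As a sanity check on the drawn index, a minimal sketch of inverted-index construction for this collection (each term maps to its postings list of document IDs):

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# term -> set of docIDs containing the term (the postings list)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

for term in sorted(index):
    print(term, "->", sorted(index[term]))
```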
3. Compute cosines to find out whether Doc1, Doc2, or Doc3 will be ranked highest for the two-word query "Linus pumpkin", given these counts for the (only) 3 documents in the corpus:
term      Doc1   Doc2   Doc3
-----------------------------
Linus       10      0      1
Snoopy       1      4      0
pumpkin      4    100     10
Do this by computing the tf-idf cosine between the query and each of Doc1, Doc2, and Doc3, then choosing the document with the highest value. Use the ltc.lnn weighting variant (remember that's ddd.qqq), using the following table:
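The SMART weighting table is not reproduced here, but under the usual reading of ltc.lnn (documents: logarithmic tf, idf, cosine normalization; query: logarithmic tf, no idf, no normalization, with log base 10 assumed) the computation can be sketched as:

```python
import math

docs = {
    "Doc1": {"Linus": 10, "Snoopy": 1, "pumpkin": 4},
    "Doc2": {"Linus": 0, "Snoopy": 4, "pumpkin": 100},
    "Doc3": {"Linus": 1, "Snoopy": 0, "pumpkin": 10},
}
N = len(docs)
terms = ["Linus", "Snoopy", "pumpkin"]
df = {t: sum(1 for d in docs.values() if d[t] > 0) for t in terms}
idf = {t: math.log10(N / df[t]) for t in terms}

def ltc(counts):
    # l: 1 + log10(tf) for tf > 0; t: multiply by idf; c: length-normalize
    w = {t: (1 + math.log10(tf)) * idf[t] if tf > 0 else 0.0
         for t, tf in counts.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

# Query "Linus pumpkin" with lnn: 1 + log10(1) = 1 per term, no idf, no norm
query = {"Linus": 1.0, "Snoopy": 0.0, "pumpkin": 1.0}

scores = {name: sum(query[t] * ltc(c)[t] for t in terms)
          for name, c in docs.items()}
print(scores)
```

Note that "pumpkin" occurs in all three documents, so its idf is log10(3/3) = 0 and it contributes nothing to any score, regardless of Doc2's 100 occurrences.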
4. What is the Euclidean distance between A' and B' (using raw term frequency)?
What is the cosine similarity between A' and B' (using raw term frequency)?
What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?
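The vectors A' and B' are not reproduced here, but the general contrast can be illustrated with hypothetical term-frequency vectors: a document and a proportionally longer version of it point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is large.

```python
import math

a = [1.0, 2.0, 0.0]    # hypothetical term-frequency vector
b = [10.0, 20.0, 0.0]  # same term proportions, 10x the counts

euclid = math.dist(a, b)
cosine = (sum(x * y for x, y in zip(a, b))
          / (math.hypot(*a) * math.hypot(*b)))
print(euclid, cosine)  # distance is large, but cosine = 1.0
```

This is why cosine similarity is usually preferred in IR: it compares the direction of term-frequency vectors rather than their magnitude, so a long document is not penalized merely for being long.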