1. An IR system returns eight relevant documents and ten non-relevant documents. There are a total of twenty relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
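To check your arithmetic, the two definitions can be applied directly: precision divides by what was retrieved, recall by what is relevant in the whole collection. A minimal sketch in Python:

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Precision = relevant retrieved / total retrieved;
    recall = relevant retrieved / total relevant in the collection."""
    precision = retrieved_relevant / retrieved_total
    recall = retrieved_relevant / relevant_total
    return precision, recall

# The system retrieved 8 relevant + 10 non-relevant = 18 documents,
# out of 20 relevant documents in the collection.
p, r = precision_recall(8, 18, 20)
print(p, r)
```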
2. Draw the inverted index that would be built for the following document collection.
Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise
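One way to check your drawing is to build the index programmatically: map each term to the sorted list of docIDs containing it. A small sketch:

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Dictionary: term -> postings list (sorted docIDs containing the term)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

for term in sorted(index):
    print(f"{term} -> {sorted(index[term])}")
```

Your drawn index should have one postings list per distinct term, e.g. "home" and "sales" pointing to all four documents.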
3. Compute cosines to find out whether Doc1, Doc2, or Doc3 will be ranked highest for the two-word query "Linus pumpkin", given these counts for the (only) 3 documents in the corpus:

term      Doc1   Doc2   Doc3
----------------------------
Linus       10      0      1
Snoopy       1      4      0
pumpkin      4    100     10
Do this by computing the tf-idf cosine between the query and each of Doc1, Doc2, and Doc3, and choosing the highest value. Use the ltc.lnn weighting variation (remember that's ddd.qqq),
using the following table:
It might help to look at this useful handout.
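The weighting table itself isn't reproduced in this excerpt, but assuming the standard SMART definitions of ltc.lnn (documents: logarithmic tf, idf, cosine normalization; query: logarithmic tf, no idf, no normalization), a sketch for checking a hand computation might look like:

```python
import math

N = 3  # documents in the corpus
tf = {  # raw term counts per document, from the table above
    "Linus":   {1: 10, 2: 0,   3: 1},
    "Snoopy":  {1: 1,  2: 4,   3: 0},
    "pumpkin": {1: 4,  2: 100, 3: 10},
}
df = {t: sum(1 for c in tf[t].values() if c > 0) for t in tf}

def ltc(doc_id):
    # l: 1 + log10(tf) for tf > 0;  t: idf = log10(N/df);  c: cosine-normalize
    w = {t: (1 + math.log10(tf[t][doc_id])) * math.log10(N / df[t])
            if tf[t][doc_id] > 0 else 0.0
         for t in tf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

def lnn(query):
    # l: 1 + log10(tf);  n: no idf;  n: no normalization
    counts = {}
    for t in query.split():
        counts[t] = counts.get(t, 0) + 1
    return {t: 1 + math.log10(c) for t, c in counts.items()}

q = lnn("Linus pumpkin")
scores = {d: sum(q.get(t, 0.0) * ltc(d)[t] for t in tf) for d in (1, 2, 3)}
print(scores)
```

Note that "pumpkin" occurs in all three documents, so its idf is log10(3/3) = 0 and it contributes nothing to any document vector under the t weighting.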
Personalization is an important topic in information retrieval; after all, we'd like our search results to be relevant to us and our interests. However, as with many other tasks involving people's personal data, this has ethical implications. Do the following in your group:
(b) Now let's talk ethics:
What is the Euclidean distance between A' and B' (using raw term frequency)?
What is the cosine similarity between A' and B' (using raw term frequency)?
What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?
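A', B', and their term frequencies are defined earlier in the problem, so the numbers below are only illustrative. The key contrast is that Euclidean distance is sensitive to vector length while cosine similarity depends only on direction; a sketch with hypothetical vectors (B' chosen as a scaled copy of A', as when a long document repeats a short one's term proportions):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical raw term-frequency vectors: same direction, different length.
A = [1, 2, 0]
B = [10, 20, 0]
print(euclidean(A, B))  # large: the vectors are far apart in space
print(cosine(A, B))     # 1.0: the vectors point the same way
```

Substitute the actual A' and B' vectors from the problem; the qualitative point is why cosine similarity is usually preferred over Euclidean distance for comparing documents of different lengths.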