Data Analysis Demos


Demos and datasets from Keith Schwarz

Data Science Demos

The assignment starter files also include three demos that use your PQHeap and topK as tools to explore large data sets and display some neat data visualizations. Once you’ve completed those tasks, try these demos to see your code in action! To run the demos, select the option to run tests from the demos.cpp file and then follow the instructions on the GUI window.

  • Earthquakes (requires working topK implementation): The U.S. Geological Survey operates a global network of seismometers and publishes lists of earthquakes updated every hour. Where are these earthquakes? How big are they?
    • This demo reads a live data feed from the USGS, which is occasionally offline or may be inaccessible given your networking setup. Give this one a whirl, but move on if it's being fussy.
  • Child Mortality (requires working topK implementation): The United Nations Millennium Development Goals were a set of ambitious targets for improving health and welfare across the globe. Over twenty-five years, the UN kept records of child mortality data worldwide. How did those numbers change between 1990, when record-keeping began, and 2013, when the most recent public numbers were released?
  • Women’s 800m Freestyle (requires working topK and PQHeap implementation): The women’s 800m freestyle swim race was introduced as a competitive event in the 1960s. How have the fastest times in that event improved since then? A certain Stanford-affiliated athlete might make an appearance here.

There is no more code for this part of the assignment. Just sit back, enjoy, and celebrate having gotten everything working! Once you've taken a well-deserved break, finish up the assignment by reading through the short case study below and answering two final reflective ethics questions.

Embedded Ethics Case Study

As you've seen in the above demos, your new priority queue is useful for ranking a group of entities that each have an associated "priority" value. Generating a numerical priority value is easy when the ranking criterion is clear, like earthquake magnitude or time to complete a race, but becomes much more difficult when several criteria must be combined. For example, a review aggregation app like Yelp might want to display restaurants in a dynamic list sorted by highest priority, but generating the priority score might involve computing a single numerical value out of things like the restaurant's rating, type of cuisine, price point, your likes and interests, etc. When using priority queues to organize complex data, generation of the priority score is one of the most important parts of the process.
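To make the restaurant example concrete, here is a minimal sketch of how several attributes might be collapsed into one number. The Restaurant struct, its fields, and every weight are hypothetical choices made up for illustration; they are not taken from Yelp or from the assignment starter code.

```cpp
#include <cassert>
#include <string>

// Hypothetical record; the fields and weights below are illustrative only.
struct Restaurant {
    std::string name;
    double rating;          // average star rating, 0.0 to 5.0
    int priceLevel;         // 1 (cheap) to 4 (expensive)
    bool matchesInterests;  // does the cuisine match the user's stated likes?
};

// Collapse several attributes into one number. Larger means "show sooner".
double priorityScore(const Restaurant& r) {
    assert(r.rating >= 0.0 && r.rating <= 5.0);
    assert(r.priceLevel >= 1 && r.priceLevel <= 4);

    double score = 2.0 * r.rating;       // rating contributes up to 10 points
    score -= 0.5 * (r.priceLevel - 1);   // pricier places lose up to 1.5 points
    if (r.matchesInterests) {
        score += 3.0;                    // flat bonus for a cuisine match
    }
    return score;
}
```

With these made-up weights, a 4.0-star restaurant that matches your interests outranks a 5.0-star restaurant that does not. Changing any weight changes the ranking, so the person who chooses the weights is effectively choosing who appears first.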

For the ethics component of this week's assignment, we will analyze a real-life use of a priority queue in the Los Angeles County Coordinated Entry System, an electronic registry of unhoused persons looking to apply to a variety of housing support programs offered by the LA County government. This case study is summarized from the article "High Tech Homelessness" by Virginia Eubanks, published in American Scientist.

The coordinated entry process begins when a social service worker administers a survey to an unhoused person, collecting personal information including name, birth date, demographic information, immigrant and residency status, and respondent location at various times of the day. The survey also includes intimate questions, including ones about mental health, sexual activity, and substance use. Survey results expire after 6 months, which means that participants who remain unhoused must go through the survey process at least twice every year in order to be eligible for services.

All of this information is then entered into a centralized database where a numerical ranking between 1 (least vulnerable) and 17 (most vulnerable) is calculated for every survey respondent. Once respondents have all been ranked with numerical scores, it is possible to maintain a dynamic, updatable priority queue of people seeking resources, ranked in order of assessed vulnerability/need. The argument that the system designers make is that calculating this vulnerability ranking from all the available data is the best way to get resources to the people who need them.
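The queue described above can be sketched in a few lines. This is only an illustration of the data structure, not the county's actual implementation: the Entry struct, the opaque integer IDs, and the function names are hypothetical, and std::priority_queue stands in for the PQHeap you wrote.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical entry: an opaque ID plus the 1-17 score described above.
struct Entry {
    int id;
    int score;  // 1 (least vulnerable) to 17 (most vulnerable)
};

// Comparator that makes the highest score come out of the queue first.
struct ByScore {
    bool operator()(const Entry& a, const Entry& b) const {
        return a.score < b.score;
    }
};

using Registry = std::priority_queue<Entry, std::vector<Entry>, ByScore>;

// Add a newly scored respondent to the queue.
void enroll(Registry& registry, int id, int score) {
    assert(score >= 1 && score <= 17);
    registry.push(Entry{id, score});
}

// Remove and return the entry with the highest score.
Entry nextToServe(Registry& registry) {
    assert(!registry.empty());
    Entry top = registry.top();
    registry.pop();
    return top;
}
```

If respondents with scores 9, 16, and 3 are enrolled in that order, nextToServe returns the respondent scored 16 first, then the one scored 9. Notice that the structure only ever sees the score: everything else about a person has already been reduced to a single integer by the time they enter the queue.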

One of the criticisms of the coordinated entry system is that those who want to access homeless services have no choice but to go through this system. This raises concerns about privacy and autonomy. Storage of the personal information enables further surveillance and criminalization of the unhoused. In the long term, options exist to expunge data from the system, but they are complicated, hard to navigate, and incomplete. Furthermore, unhoused individuals are not able to choose what services would best serve their needs; the algorithm chooses for them.

Q16. If you were working on constructing a priority-based system like this, how would you weigh the tradeoff between collecting enough information to make informed decisions about how to allocate support resources to people in need and allowing members of vulnerable populations to maintain their privacy and autonomy?

Having discussed the potential privacy implications of this system, we now turn our focus towards thinking critically about the underlying ethical factors. The CES did improve matching between people and services, but didn’t increase the total number of apartments, vouchers, or shelter beds available, and therefore didn’t increase the number of people housed. Eubanks asks us to consider whether the $11 million cost of the coordinated entry system would have been better spent on giving each person helped by the system $1,140 to put towards a security deposit for an apartment.

When the County does not provide enough housing, it is often those in the middle of the vulnerability range who are most negatively affected. The article outlines the following in its discussion of how the ranking system ends up impacting the populations it was built to serve:

According to its designers, the county’s coordinated entry system matches the greatest need to the most appropriate resource. But there is another way to see the ranking function of the coordinated entry system: as a cost-benefit analysis. It is cheaper to provide the most vulnerable, chronically unhoused with permanent supportive housing than it is to leave them to emergency rooms, mental health facilities, and prisons. It is cheaper to provide the least vulnerable unhoused with the small, time-limited investments of rapid rehousing than to let them become chronically homeless. This social sorting works out well for those at the top and the bottom of the rankings. But if the cost of your survival exceeds potential taxpayer savings, your life is deprioritized.

Eubanks argues that building a technical system to allocate the available resources as effectively as possible is not enough: unless further action is taken, automated tools for classifying vulnerable populations will continue to exacerbate the underlying issues.

Q17. One of the key themes of the embedded ethics content in this course has been the importance of considering the societal and human impact of technical solutions. What is your main takeaway from reading this case study, specifically as it relates to the consequences of human ranking/prioritization algorithms? If you were working on a real-world project in the future that required you to reduce complex sources of data into a single numerical ranking or priority score, what factors might you want to consider as you implemented these algorithms? How would you decide when not to build a system?