A2-LeighHagestad

From cs448b-wiki

Exploratory Data Analysis

VERSION 1

Original Question:

Can you predict school pickups and get a sense of how many children are using the service for school pickups?

Data:

www.fivethirtyeight.com New York City Uber pickups, April 2014

Version 1 Image

Problem:

Way too many data points, and no visible clustering based solely on time/location.

New Approach:

But! This reminded me of another fun geographical/automobile dataset I had in my pocket: the SF 2012 parking ticket data.


VERSION 2

Modified Question:

Can we see if different types of tickets are written in notably different parts of the city?


New Data:

SF City Parking Data. This is publicly available data provided (upon request) by the San Francisco Municipal Transportation Agency. This dataset includes a sample of ticket data from 2012.


Version 2 Image


Problem:

As with the Uber data, the visual noise of all the data points makes it hard to capture any significant insights. Furthermore, it seems like most tickets are (unsurprisingly) densely clustered in high-population areas of the city, and absent in areas with less infrastructure (like GG Park, the Presidio, etc.). Within these high-volume ticket areas, the violation types seem pretty evenly distributed.


New Approach:

So, we learned that violations are generally evenly distributed, with certain areas of higher clustering. That's interesting, but not mindblowing. Seeing all the different types of tickets made me wonder about the cost of each different type of ticket...


VERSION 3

Modified Question:

Can we identify whether certain tickets are written on certain days or at certain times?


Version 3 Image


Problem:

So, it looks cool, and we can clearly see the heavy hitters for each time/day index. We also get some interesting information here that we didn't have before: for example, street sweeping tickets dominate the work week, meter tickets dominate Saturdays, and fewer tickets are given on Sundays and Mondays. This is good information! But I wonder if we can get more...
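One way to double-check the "heavy hitters" reading of the chart is to count tickets per weekday and pick the most frequent violation type for each day. This is only a minimal sketch: the records below are made-up placeholders (not rows from the SF dataset), and the `(violation, weekday)` layout and the `dominant_by_weekday` helper are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical placeholder records: (violation_type, weekday). Not real data.
sample_tickets = [
    ("STREET SWEEPING", "Tue"),
    ("STREET SWEEPING", "Tue"),
    ("METER", "Tue"),
    ("METER", "Sat"),
    ("METER", "Sat"),
    ("STREET SWEEPING", "Sat"),
]

def dominant_by_weekday(records):
    """Map each weekday to the violation type with the most tickets that day."""
    per_day = defaultdict(Counter)
    for violation, weekday in records:
        per_day[weekday][violation] += 1
    # most_common(1) returns a one-element list of (violation, count)
    return {day: counts.most_common(1)[0][0] for day, counts in per_day.items()}

dominant = dominant_by_weekday(sample_tickets)
```

With real data, ties between violation types on a given day would need an explicit tie-breaking rule; this sketch simply takes whichever type `Counter.most_common` lists first.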


New Approach:

Also, one thing I noticed was that the number of tickets given by hour is pretty uniform. We could probably cull this out and significantly cut down on the area of ink. While this visualization is a strong improvement, I think there's more to see and learn.


VERSION 4

Modified Question:

What is the relative breakdown of tickets by violation type? I.e., can we compare the relative number of tickets written for each type of violation? How are these distributed by weekday?
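The two parts of this question map onto two simple aggregations: each violation type's share of all tickets, and a count for every (violation type, weekday) pair. The sketch below assumes tickets are available as `(violation, weekday)` tuples; the sample values and function names are hypothetical and stand in for the real SF data.

```python
from collections import Counter

# Hypothetical placeholder records: (violation_type, weekday). Not real data.
sample_tickets = [
    ("STREET SWEEPING", "Tue"),
    ("STREET SWEEPING", "Wed"),
    ("STREET SWEEPING", "Tue"),
    ("METER", "Sat"),
    ("METER", "Sat"),
    ("RESIDENTIAL PERMIT", "Tue"),
]

def share_by_type(records):
    """Return each violation type's fraction of all tickets."""
    counts = Counter(violation for violation, _ in records)
    total = sum(counts.values())
    return {violation: n / total for violation, n in counts.items()}

def counts_by_type_and_weekday(records):
    """Count tickets for every (violation_type, weekday) pair."""
    return Counter(records)

shares = share_by_type(sample_tickets)
by_day = counts_by_type_and_weekday(sample_tickets)
```

The shares feed an overall comparison chart, while the pair counts are the kind of table a per-weekday breakdown would be drawn from.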


Version 4 Image


Problem:

Better! I like seeing these two images together: one gives a sense of the overall number of violations written, compared by type, whereas the other takes the 6 most written tickets and shows their distribution by weekday. I like where the visualizations stand at this point, although, given the intensely geographical side of this data, I want to find a way to use the lat/long information here too.


New Approach:

Cull out extraneous noise in the data set by visualizing only the 6 most frequently written types of citations on a map.
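The culling step could look roughly like this: count tickets per citation type, keep the top `n` types, and drop every record that belongs to any other type before plotting. The dictionary fields (`violation`, `lat`, `lon`), the sample coordinates, and the `keep_top_types` helper are all assumptions for illustration, not the actual column names or values in the SF data.

```python
from collections import Counter

# Hypothetical placeholder records with made-up coordinates. Not real data.
sample_tickets = [
    {"violation": "STREET SWEEPING", "lat": 37.76, "lon": -122.43},
    {"violation": "STREET SWEEPING", "lat": 37.77, "lon": -122.42},
    {"violation": "STREET SWEEPING", "lat": 37.78, "lon": -122.41},
    {"violation": "METER", "lat": 37.79, "lon": -122.40},
    {"violation": "METER", "lat": 37.79, "lon": -122.41},
    {"violation": "BLOCKED DRIVEWAY", "lat": 37.75, "lon": -122.45},
]

def keep_top_types(records, n=6):
    """Keep only records whose violation type is among the n most frequent."""
    counts = Counter(r["violation"] for r in records)
    top_types = {violation for violation, _ in counts.most_common(n)}
    return [r for r in records if r["violation"] in top_types]

# n=2 here only because the sample has three types; the write-up uses 6.
filtered = keep_top_types(sample_tickets, n=2)
```

The filtered records keep their lat/long fields, so they can go straight to whatever mapping tool draws the points.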


VERSION 5

Final Question:

Does the number of tickets written in San Francisco vary by citation type, weekday of issue, and location of issue?


Final Image