A2-GabriellaBrignardello

From cs448b-wiki

Exploratory Data Analysis of NYC Restaurant Inspection Results

Domain and Data

Before settling on the domain of NYC Restaurant Inspection Results, I downloaded and explored about 10 datasets from the World Bank, specifically relating to the global indicators of education, fertility, GDP, unemployment, population growth, and population composition (i.e. youth, elderly), as well as several about prescription drug use and smoking trends by state in the U.S. After playing around with several of these datasets in Tableau, I was unsatisfied with the initial results and decided that it would be better to focus in on a more specific domain to avoid cluttered and confusing data visualizations that attempt to tell too much (as this was a major part of my feedback from Assignment 1).

As I am a bit of a foodie as well as a "clean freak", I started looking around for datasets on health/food inspections and came across the NYC Restaurant Inspection Results website, which is actively updated as new inspections are completed -- in fact, the last entry in the dataset that I downloaded is April 14, 2016.

Exploratory Process

First Steps: Visualization Creation Round 1

After loading the data into Tableau (without doing any pre-processing), I started playing around with the different dimensions and measures to see what would be interesting to explore. I created 4 basic visualizations: the first two plotted the number of inspection records over time -- both by year [Fig. 1] and by month (aggregated over the years) [Fig. 2] -- while the third was a bar chart of the number of inspection records in each NYC borough, colored by violation type [Fig. 3], and the fourth was the number of inspection records for each cuisine description (also broken down by grade) [Fig. 4]. I found that the time series graphs were not particularly interesting, although they did show me that more inspections have been done in recent years and that most inspections are done at the end of the second quarter of the year (April-June) and at the end of the year. On the other hand, the violation type bar graph and the distribution of inspections across cuisine types were both overwhelming and would not be easily digestible by a viewer. These visualizations, however, helped shape my initial questions and gave me ideas for visualizations that I will discuss later on.

Through these first steps, I started to think about what questions I could try to answer with this dataset. The two that came to mind were: (1) What types of cuisines get the best/worst grades? and (2) What boroughs get the best/worst grades?

Follow Up to First Steps

At this point, I realized I needed to do some data manipulation to clean up the data. I used RStudio for all of my data manipulation/pre-processing/filtering. In order to guide this process, I decided to take a closer look at the explanatory documentation for this dataset: (1) "How We Score and Grade" and (2) the "About" document included in the dataset folder. In the "About" document, I discovered that not all of the inspection records in the dataset are gradable. Gradable inspections can be identified when the following statements are all true: (1) INSPECTION TYPE in (Cycle Inspection/Initial Inspection, Cycle Inspection/Re-Inspection, Pre-Permit (Operational)/Initial Inspection, Pre-Permit (Operational)/Re-Inspection); (2) ACTION in (Violations were cited in the following area(s), No violations were recorded at the time of this inspection, Establishment Closed by DOHMH); and (3) INSPECTION DATE > July 26, 2010. Thus, to get only the gradable inspections, I used these 3 conditions to filter the data. I also filtered out any negative scores. Lastly, I selected the most recent inspection report for each unique institution to avoid any double counting when dealing with the number of records.
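As a rough illustration of the filtering steps described above (this is not the RStudio script itself, which is not reproduced here), a minimal Python sketch might look like the following. The dictionary keys such as `restaurant_id` and `inspection_type` are hypothetical names chosen for the example, and the category strings are copied from the list above rather than checked against the raw file.

```python
from datetime import date

# Category values as listed in the text above (assumed to match the data exactly).
GRADABLE_TYPES = {
    "Cycle Inspection/Initial Inspection",
    "Cycle Inspection/Re-Inspection",
    "Pre-Permit (Operational)/Initial Inspection",
    "Pre-Permit (Operational)/Re-Inspection",
}
GRADABLE_ACTIONS = {
    "Violations were cited in the following area(s)",
    "No violations were recorded at the time of this inspection",
    "Establishment Closed by DOHMH",
}
CUTOFF = date(2010, 7, 26)


def is_gradable(row):
    """Apply the three 'About' conditions plus the non-negative score filter."""
    return (
        row["inspection_type"] in GRADABLE_TYPES
        and row["action"] in GRADABLE_ACTIONS
        and row["inspection_date"] > CUTOFF
        and row["score"] >= 0
    )


def latest_gradable(rows):
    """Keep only the most recent gradable inspection for each restaurant."""
    latest = {}
    for row in rows:
        if not is_gradable(row):
            continue
        key = row["restaurant_id"]  # hypothetical unique-institution key
        if key not in latest or row["inspection_date"] > latest[key]["inspection_date"]:
            latest[key] = row
    return list(latest.values())
```

Keeping one row per restaurant at the end is what prevents double counting in any later "number of records" measure.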

In addition to manipulating the NYC Health Inspection Results dataset, I also created my own "boroughs dataset" that included zipcodes (retrieved from the NY Department of Health), population (retrieved from the NYC Planning website), and size in square miles (retrieved from Wikipedia). I then joined this dataset to the NYC Health Inspection Results dataset so that I could verify that all of the zipcodes were associated with the right borough (as I found some were not) and also so that I would have this additional information about each borough available when creating my data visualizations in Tableau.
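The join and the zipcode check described above could be sketched roughly as follows. This is only an assumption about how such a step might be written in Python; the field names and the lookup tables (`zip_to_borough`, `borough_info`) are hypothetical stand-ins for the boroughs dataset, and any values used with them would need to come from the sources named above.

```python
def find_borough_mismatches(inspections, zip_to_borough):
    """Return (restaurant_id, recorded borough, expected borough) for disagreeing rows."""
    mismatches = []
    for row in inspections:
        expected = zip_to_borough.get(row["zipcode"])
        # Unknown zipcodes are skipped rather than reported as mismatches.
        if expected is not None and expected != row["borough"]:
            mismatches.append((row["restaurant_id"], row["borough"], expected))
    return mismatches


def attach_borough_info(inspections, zip_to_borough, borough_info):
    """Left-join borough attributes onto each row, correcting the borough by zipcode."""
    joined = []
    for row in inspections:
        borough = zip_to_borough.get(row["zipcode"], row["borough"])
        # borough_info maps a borough name to e.g. population and square_miles.
        joined.append({**row, "borough": borough, **borough_info.get(borough, {})})
    return joined
```

Running the mismatch check before the join makes it possible to see which records were corrected, instead of silently overwriting the borough column.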

Questions from First Steps

Thus, I had a cleaned dataset of 24,472 entries (which corresponds to the fact that "the NYC Health Department inspects about 24,000 restaurants a year to monitor compliance with City and State food safety") as well as a better idea of how to tackle showing off this data in a visual form. From here, I was ready to tackle (refined versions of) the first real questions that I had generated in my first steps above:

(1) What types of cuisines have the highest proportion of A grades?
(2) Do some boroughs have better health inspection grades than others?

Visualization Creation Round 2

Question 1: Health & Cuisine Type

With the cleaned dataset and my initial questions in mind, I started creating visualizations in Tableau. To ensure that I was testing out all of the possibilities, I made several versions of the same type of graph and even did some "user testing" to see which visualizations appealed to viewers and which were confusing or boring.

Fig. 5

I first set out to refine the bar chart of the number of inspection records across each cuisine description (broken down by grade) [Fig. 4], as that would help me answer the first question about the relationship between cuisine type and grade. My first step was to group the columns, as there were many duplicates (e.g. I grouped Soups, Salads, Sandwiches, Sandwiches & Salads, etc. into Sandwiches/Salads/Soups) or similar types of cuisines (e.g. Greek and Mediterranean), and then I excluded several of the categories that I felt were not "Cuisines", such as Fruits/Vegetables, Smoothies, Cafes, Donuts, and Nuts. This left me with 18 groups -- a much more manageable number to compare in the form of a bar chart. Following the same style as Fig. 4, I created a bar chart that is broken down by grade [Fig. 5]. A grades are obviously dominant, as one can clearly see that orange dominates the visualization; however, beyond that insight, there is not much else to be gained from this visualization.

I then decided to see if more could be gained from making this into a 100% stacked bar chart so that the proportion of each grade can be more clearly seen [Fig. 6]. I also sorted it from the largest proportion of A grades to the smallest, demonstrating which types of cuisines get the better health scores and grades. From the visualization, it is evident that the less "mainstream" cuisines (e.g. Hawaiian/Filipino, African, Korean, etc.) are not as compliant as their counterparts (e.g. Italian, American, Greek/Mediterranean, Mexican, etc.). I think that this visualization does a good job of telling a simple story and avoids unnecessary complications and excess information. However, one point that I think might be confusing is that the health records have a mix of true cuisine types and types of foods (e.g. Hamburgers/Hotdogs). For this reason, I put it on the list of possibilities for the final product; nevertheless, there is more exploration to be done.
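The numbers behind a 100% stacked bar chart sorted by A-grade share can be sketched as below. Tableau computed these for me, so this is only an illustrative Python equivalent, assuming each cleaned row has hypothetical `cuisine` and `grade` fields.

```python
from collections import Counter, defaultdict


def grade_shares(rows):
    """Return {cuisine: {grade: share}} where shares within a cuisine sum to 1."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["cuisine"]][row["grade"]] += 1
    shares = {}
    for cuisine, grades in counts.items():
        total = sum(grades.values())
        shares[cuisine] = {grade: n / total for grade, n in grades.items()}
    return shares


def rank_by_a_share(shares):
    """Order cuisines from the largest to the smallest proportion of A grades."""
    return sorted(shares, key=lambda c: shares[c].get("A", 0.0), reverse=True)
```

Normalizing within each cuisine is what lets a small group be compared fairly against a large one such as American, which a raw count chart cannot do.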


Fig. 6
Question 2: Health & Borough
Fig. 7

I then briefly explored the answer to the second question about which boroughs get the best scores. After making a simple 100% stacked bar chart across the 5 boroughs [Fig. 7], I saw that the proportions of A's, B's, and C's are pretty much the same across the board, even though the number of inspection records for each borough differs substantially (specifically, Manhattan has substantially more than the other 4).

Given that there was not much more to look into for this question, I decided to pivot and explore solely the quantity of unique restaurants in each borough and each grouped cuisine type, using the data in a different way than I had initially expected. So, the next questions to answer were:

(1) How diverse are the boroughs in terms of cuisine types?
(2) How dense are the boroughs in terms of number of restaurants compared to their respective size and population?





Visualization Creation Round 3

Question 1: Cuisine Diversity in the Boroughs
Fig. 8

I was interested in exploring this question because I have heard from my friends who are from NYC that they will go to different places in the city to get specific types of food. I also know that different ethnic and racial groups tend to live together in the different boroughs of NYC so I wanted to see if the types of cuisines present in each borough would reflect this trend. Lastly, I hypothesized that the boroughs with more restaurants will probably be more diverse.

To start out with, I looked at the general cuisine type diversity of restaurants in NYC as a whole to see which dominated. I was not surprised by the fact that American cuisine definitely has the largest presence, followed by Chinese, Italian, Pizza, and Bagels. I chose to represent this as a bubble chart because you can immediately see the more common cuisine types, as those are the largest bubbles that are also labeled [Fig. 8]. While in concept this visualization seems good at telling the story that I want it to, even with the list of grouped cuisines that I had created for earlier visualizations, there were still too many for each to have its own unique color, which is problematic.

Fig. 9
As I was more interested in diversity by borough, I tried to split this bubble chart into a 5x1 grid of smaller bubble charts for each of the boroughs but this was far from effective for similar reasons to those that I mentioned above and also the data was not easily compared between boroughs which is necessary in order to answer my question. Scrapping that, I moved to a 100% stacked bar chart (as we already know the distribution of restaurants across boroughs from previous visualizations) in order to show how much each of the cuisine type groups contributes to the whole for each borough [Fig. 9]. From this visualization, it is apparent that all of the boroughs are quite diverse. Specifically, you can see that Manhattan has a large American cuisine presence but also has a nice spread of other cuisines. Chinese cuisine makes up a large proportion of the restaurants in the Bronx, Queens, and Brooklyn -- Caribbean/Latin cuisine also has a larger presence in these 3 boroughs than in the other 2. Staten Island and the Bronx have a lot of pizza establishments and Italian food can be easily found in Staten Island as well. While I can gather all of these insights from looking closely at the visualization and referencing the legend, this visualization has a lot going on -- in fact, too much going on -- taking away from its ability to clearly answer the question in a way that is both visually aesthetic and easily understood.
Fig. 10

I made one last attempt at simply displaying borough diversity by overlaying gantt bars showing the unique count of cuisine types in every borough over a bar chart of the distribution of restaurants across boroughs [Fig. 10]. This simple visualization actually told me that all of the boroughs are pretty diverse in that they have upwards of 30 cuisine types; Manhattan, Brooklyn, and Queens are the most diverse, but that also makes sense given that those 3 boroughs have the most restaurants. Thus, I can conclude that the boroughs are all pretty diverse and that the number of restaurants doesn't affect that diversity as much as I expected.
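The two measures overlaid in this chart, the number of distinct restaurants and the number of distinct cuisine types per borough, could be computed along these lines. This is a sketch only, with hypothetical field names, and it mirrors what Tableau's distinct count does rather than reproducing my actual workbook.

```python
from collections import defaultdict


def cuisine_diversity(rows):
    """Return {borough: {"restaurants": n, "cuisine_types": m}} using distinct counts."""
    cuisines = defaultdict(set)
    restaurants = defaultdict(set)
    for row in rows:
        cuisines[row["borough"]].add(row["cuisine"])
        restaurants[row["borough"]].add(row["restaurant_id"])
    return {
        borough: {
            "restaurants": len(restaurants[borough]),
            "cuisine_types": len(cuisines[borough]),
        }
        for borough in cuisines
    }
```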









Question 2: Density of Restaurants Across the Boroughs
Fig. 11

In many of the visualizations that I have already created, the distribution of food establishments across the boroughs is apparent -- Manhattan obviously has many more than the other 4 boroughs and Staten Island has the fewest. For reference, I made a simple distribution bar chart so that this insight can be easily seen without any other distracting factors (e.g. grade, cuisine type, etc.) [Fig. 11].

Fig. 12

My first step in answering this question of density of food establishments across boroughs was to look at the quantity of establishments (through the number of unique institutions that had been inspected) and compare that to the population density and size of each of the boroughs. To do this, I used the data that I had collected to make the borough dataset that I joined to the inspection dataset. My initial attempt to visualize all of these factors (i.e. population, size, and quantity of establishments) involved using a filled map. Although I was initially prepared to map out the boroughs using zipcodes, Tableau actually recognized each of the boroughs as counties and divided the NYC area up automatically. The geographic space of each borough is colored based on its population (I chose a grayscale as I think it's best for a background if more information is going to be overlaid) and then the colored circle placed in the center of each borough denotes the quantity of establishments [Fig. 12]. The map is able to effectively show the physical size of each borough and their respective population densities; and while the circles do show the number of restaurants, they don't contribute that well to the visualization as a whole. Overall, I was not pleased with how this came out and tried experimenting with bar charts to display the same information.

I then created a series of 3 bar charts [Fig. 13, 14, 15] that show similar information to the map. The first [Fig. 13] compares the distinct count of restaurants to the size of each borough in square miles. Even though Manhattan is physically the smallest borough, it has the most restaurants, while Queens is physically the largest -- 5x the size of Manhattan -- and has the third-fewest restaurants -- approximately half of the number in Manhattan. I then looked at comparing the population of each borough to the distinct count of restaurants [Fig. 14]. Again, Manhattan stands out as the outlier in that although it is the borough with the third largest population, it is the borough with the highest number of restaurants. The other boroughs, on the other hand, seem to be fairly comparable to one another in terms of how the population matches up to the quantity of restaurants.


The last visualization combines elements to compare the number of restaurants per 100,000 people to the physical size of each borough [Fig. 15]. This emphasizes the results shown in Fig. 13 and 14 -- that Manhattan is overly populated in restaurants while Queens is underpopulated -- because both Manhattan and Queens are the obvious outliers given the gap between the gantt line showing size and the bars showing number of restaurants per 100,000 people. This gave me the idea to look at population density (population/square miles) to see if Manhattan is truly overpopulated in restaurants and Queens is underpopulated. By creating a visualization that compares number of restaurants to population density (number of people per square mile) [Fig. 16], it becomes evident that Manhattan's population density actually matches its quantity of restaurants -- because so many people are packed into a small space, it makes sense to have so many restaurants to meet that demand. In Fig. 16, it also becomes evident that perhaps Queens is overpopulated in restaurants, while Staten Island and Brooklyn are pretty much at par and the Bronx is lacking.
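The two derived measures used in Fig. 15 and Fig. 16 are simple ratios, sketched below in Python for clarity. The function and key names are hypothetical, and no real borough figures are included here; the inputs would come from the boroughs dataset described earlier.

```python
def density_metrics(restaurants, population, square_miles):
    """Compute the two derived measures for one borough.

    restaurants  -- distinct count of inspected restaurants
    population   -- number of residents
    square_miles -- land area of the borough
    """
    return {
        # Restaurants per 100,000 people (the bars in Fig. 15).
        "restaurants_per_100k": restaurants * 100_000 / population,
        # People per square mile (the population density used in Fig. 16).
        "people_per_sq_mile": population / square_miles,
    }
```

For example, a made-up borough with 500 restaurants, 200,000 residents, and 20 square miles would have 250 restaurants per 100,000 people and a density of 10,000 people per square mile.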

Final Visualization

Given the variety of subsections of the data that I explored, it was difficult to choose which area to focus in on for my final visualization. Ultimately, I decided to refine my last visualization that compares the population density (number of people per square mile) to the number of distinct restaurants in each borough.

Thus, my final question is: What is the relationship between each borough's population, physical size, and quantity of restaurants?

While you can see the trend with the bar plot overlaid with the gantt lines in Fig. 16, the graph is not as effective as it could be. I decided to change it into a scatter plot so that a trend line could be identified to show how quantity of restaurants and population density are somewhat positively correlated. As you can see below, Manhattan, Brooklyn, and Staten Island fit well with this trend line, showing that their population densities match their number of distinct restaurants. Thus, my initial insight about Manhattan being overpopulated with restaurants was wrong, as this visualization shows that it is reasonable to have more restaurants in a borough that is densely populated -- which also makes sense logically, given that there is more demand for food when there are more people in a given space. The outliers in this visualization are the Bronx and Queens; the positioning of the Bronx above the trend line suggests that this borough is underpopulated with restaurants as the population density is outweighing the number of restaurants, while the positioning of Queens below the trend line suggests that it is overpopulated with restaurants as the number of restaurants outweighs the population density. I had reached these conclusions previously when analyzing the bar/gantt line plot; however, I think they are more clearly demonstrated in this scatter plot form, as a viewer with limited knowledge about this dataset or NYC can easily pick up the trend and message that I am putting forth with this visualization. I tested this visualization with several people to ensure that this assumption was correct and got positive feedback, confirming my choice to use this as my final visualization for this project. Ultimately, I feel quite content with this visualization as it is concise, clear, and straightforward, and does not try to show too much information like my visualization in Assignment 1.
I think it is not only aesthetically pleasing in its simplicity and visual elements (i.e. each point being shaped like its respective borough, green color matching one's mental perception of land coloring on maps), but also does a good job of answering my final question and telling a story about the relationship between population density and quantity of restaurants in each borough that I discovered through this exploratory analysis process.
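For readers who want to reproduce the trend-line reading outside Tableau, a minimal sketch follows. It assumes an ordinary least-squares line with the number of restaurants on the x-axis and population density on the y-axis, which is my reading of "above" and "below" in the paragraph above; the function names are hypothetical and no real borough values are included.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(points, slope, intercept):
    """Return {name: vertical distance from the line} for {name: (x, y)} points.

    A positive value means the point sits above the trend line (more density
    than its restaurant count predicts); a negative value means it sits below.
    """
    return {
        name: y - (slope * x + intercept)
        for name, (x, y) in points.items()
    }
```

With only five boroughs, a line fitted this way is strongly influenced by any single point, so the residuals are better read as a rough guide than as a firm measure of over- or under-supply.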


FinalViz-GabriellaBrignardello-A2.png