A2-PascalOdek

Assignment 2

Exploratory Data Analysis

The task in this assignment is to use existing software tools to formulate and answer a series of specific questions about a data set of your choice. After answering the questions you should create a final visualization that is designed to present the answer to your question to others. You should maintain a web notebook that documents all the questions you asked and the steps you performed from start to finish. The goal of this assignment is not to develop a new visualization tool, but to understand better the process of exploring data using off-the-shelf visualization tools.

Domain of Interest: Businesses

  • Reviews
  • Operational hours
  • Ratings
  • Services offered
  • Parking availability

Initial Question

Do users of different genders show different patterns when giving reviews, tips, or ratings to businesses?

Data Set Selection

The Yelp Academic Challenge dataset fit my study domain, businesses. However, the users data, although detailed, did not include gender, so the initial question could not be explored. Further inspection of the dataset suggested other questions within the same domain:

  • What is the distribution of ratings across all businesses in the dataset?
  • What is the distribution of the number of reviews across all businesses in the dataset?
  • What ratings did businesses that ended up closing have?

Shaping Data

The Yelp Challenge Dataset is distributed as several JSON files covering businesses, reviews, tips, users and check-ins. Each file contains a single object type, with one JSON object per line. This format is not easy to query, so I converted each file into a single JSON array of objects. For example, the businesses file became an array in which each item is an object holding the details of one business. I repeated this for each of the following:

  • Businesses
  • Tips
  • Reviews
  • Users
  • Check-ins
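
The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the notebook: the function name and file names are hypothetical, and it assumes each input file is small enough to hold in memory.

```python
import json
from pathlib import Path


def json_lines_to_array(src: Path, dst: Path) -> int:
    """Convert a one-object-per-line JSON file into a single JSON array file.

    Returns the number of objects written.
    """
    records = []
    with src.open(encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines between objects
                records.append(json.loads(line))
    with dst.open("w", encoding="utf-8") as out:
        json.dump(records, out)
    return len(records)
```

Calling `json_lines_to_array(Path("business.json"), Path("business_array.json"))` (hypothetical file names) would write one array containing every business object from the input file.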

The resulting files were loaded into a MongoDB (NoSQL) database on my computer. Each category listed above became a collection that I could query easily to get the initial figures used in generating the visualizations. After analysing the data in MongoDB, I entered the values into Google Sheets for further analysis and fine-tuning.
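
As an illustration of the kind of query this setup allows, the sketch below builds a MongoDB aggregation pipeline that counts documents per value of a field. It is an assumption about how such figures could be obtained, not the queries actually run for the notebook. The field name `stars`, the collection name `businesses` and the database name `yelp` are assumed for the example.

```python
def count_by_field_pipeline(field: str) -> list:
    """Build a MongoDB aggregation pipeline counting documents per value of `field`."""
    return [
        # Group documents by the field's value and count each group.
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        # Sort the groups by the grouped value, ascending.
        {"$sort": {"_id": 1}},
    ]


# Assumed field name: "stars" holds a business's rating.
ratings_pipeline = count_by_field_pipeline("stars")

# With pymongo installed and a MongoDB server running, the pipeline could be
# executed along these lines (database and collection names are hypothetical):
#   from pymongo import MongoClient
#   db = MongoClient()["yelp"]
#   rows = list(db["businesses"].aggregate(ratings_pipeline))
```

Each resulting row would pair one rating value with the number of businesses holding it, which is the shape of table that can then be copied into a spreadsheet.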

Visualization Iterations

To answer the questions about how other factors relate to business ratings, I first made a graph showing the distribution of ratings across the businesses in the dataset.

No. of Businesses vs Ratings

To get started, I visualized the distribution of business ratings. The most common rating is 3.5 stars, and the least common is 1.0 star.

Allrating.png

No. of Businesses vs No. of Reviews

This question was very promising, but the data was hard to visualize completely: the number of businesses is very large at low review counts and falls off steadily as the review count grows. The first iteration of the graph shows little beyond the very large number of businesses with few reviews. The second iteration shows a little more than the first, but scaling was still an issue.

Iteration 1

Reviews100.png

Iteration 2

Reviews10.png
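
For reference, the counts behind a graph like these could be produced by grouping businesses into fixed-width review-count bins. The sketch below is only one possible way to do it: the bin width is a hypothetical parameter, not necessarily what either iteration used, and the sample values are made up rather than taken from the dataset.

```python
from collections import Counter


def bin_review_counts(review_counts, bin_width):
    """Count businesses per review-count bin.

    Each bin is keyed by its lower edge: with bin_width=10, key 0 covers
    0-9 reviews, key 10 covers 10-19 reviews, and so on.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    bins = Counter((n // bin_width) * bin_width for n in review_counts)
    return dict(sorted(bins.items()))


# Made-up review counts, for illustration only.
sample = [3, 5, 8, 12, 14, 27, 31, 140]
binned = bin_review_counts(sample, 10)
```

Even in this tiny sample, most businesses fall into the lowest bin while a single business sits far out in the tail, which is the scaling problem described above.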

Rating Distribution of Closed Businesses

My research then pivoted to businesses that have closed. My goal was to find out whether there was a correlation between a business closing and its rating. The graph shows that a higher percentage of the mid-rated (2.0-4.0) businesses were closed than of the very low-rated and high-rated businesses. I had expected businesses with very low ratings to have a higher percentage of closings, but that was not the case.

Closed.png
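
A percentage like this depends on the denominator chosen. The sketch below takes one possible reading, the share of businesses at each rating that are closed, and is not necessarily the calculation behind the graph. The field names `stars` and `open` are assumptions about the business records, and the sample records are made up.

```python
from collections import defaultdict


def closed_share_by_rating(businesses):
    """Return {rating: percentage of businesses at that rating that are closed}."""
    totals = defaultdict(int)
    closed = defaultdict(int)
    for business in businesses:
        rating = business["stars"]   # assumed field name for the rating
        totals[rating] += 1
        if not business["open"]:     # assumed boolean field: False means closed
            closed[rating] += 1
    return {r: 100.0 * closed[r] / totals[r] for r in sorted(totals)}


# Made-up records, for illustration only.
sample_businesses = [
    {"stars": 3.5, "open": False},
    {"stars": 3.5, "open": True},
    {"stars": 1.0, "open": True},
    {"stars": 5.0, "open": False},
]
shares = closed_share_by_rating(sample_businesses)
```

The other reading, the share of all closed businesses that hold each rating, would divide by the total number of closed businesses instead. That version tends to favour whichever ratings are most common overall.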

Sources and Software