A2-SanbyLee

The final visualization is shown below with a short caption and analysis. The full wiki notebook follows.




The chart above shows words used in song lyrics in the 1960s versus the 2000s. The words are plotted by their relative popularity in each decade, measured as a ranking from #1 to #1500 out of all words used in the dataset. Words that stay the same in popularity fall along a diagonal trend line, while words that shot up in popularity fall above the line, and words that declined in relative popularity fall below the line. In addition, the 2x2 matrix neatly buckets words into 4 categories: Those that are popular in both decades ("stalwarts"), those that rose in popularity ("rising stars"), those that declined relative to the population ("out of fashion"), and those that are not popular in either decade ("unimportant"). Analyzing this chart leads to a few interesting observations:

  • There was a huge rise in swear words, which is not surprising, especially given censorship rules in the past. However, it is surprising just how far the F-word shot up in popularity: out of 1500 words used in the 1960s, it ranked #1448, but by the 2000s it became #214, surpassing words like "beautiful" (#333), "money" (#342), and "together" (#338).
  • Compare the words that gained popularity ("bleed", "scared", "scream") with those that declined in popularity ("sunday", "guitar", "sunshine", "joy"). Does this reflect changes in society or our collective psyche? Or changes in the function of music?
  • Some things are less popular than you think - yes, "sex" and "drugs" went up in popularity - but they're not even in the top half of most used words in either decade. Similarly, "romance" and "woo" dropped in popularity - but they weren't that popular to begin with.
  • However, one does see different concepts of love. In the 1960s, love seems to be associated with stability, but juxtaposed against the need for excitement: you see words like "married" and "tender", but also "hurry" and "thrill." The data also hints at methods of courtship that have fallen out of fashion - "letter" was the #615 most popular word in the 60s, falling to #1110 by the 2000s, and "flowers" went from #542 to a depressing #1057.
  • There are many female terms of endearment - ranging from the old-fashioned "sugar" (falling from #702 to #1452), to those falling slightly out of fashion - "darling", "babe", "mama", "honey" (falling from #249 to #769) - but "baby" has always held strong (#79 despite falling from #25).
  • And some things are always popular. Despite "church" having fallen out of fashion, "lord" still ranks in the top half in both the 1960s and the 2000s, and "god" has actually risen from #341 to #198. Words like "kiss", "forever", and "hope" are always popular, as are the words of despair: "lonely", "cry", "pain." And of course, money is a constant - actually rising from #392 to #342. However, to answer the eternal question (money vs. love) - it's still way behind "love" (at #31 in the 2000s despite falling from #16 previously).


Domain of Interest

I was inspired by the article posted on Piazza about analysis of dialogue by gender in Hollywood screenplays. I thought it was fascinating how the authors used data analysis and visualization to bring insight to a controversial question that is often debated from a political perspective rather than a data-oriented one. After reading that article, I followed a link to a related article about the "largest vocabulary in hip-hop," which ranks modern hip-hop artists by the number of unique words used in their lyrics, and compares their vocabulary to that of Shakespeare and Moby Dick. Again, I loved how the authors used data to answer an unusual question.

Therefore, I decided on song lyrics as my domain of interest. Not only does a song reflect the background and life experiences of the artist, but a popular song is also crafted by the artist to appeal to emotions in the listeners, thus revealing what we as an audience are interested in hearing about, and what is collectively on our minds. My goal was to analyze the word choice used by artists in their song lyrics, in order to find insights about our cultural environment.

Initial Question

What are the most common words used in song lyrics? Are there differences by region, genre, or gender of artist?

Initial Concept

My initial idea for a visualization was to use word clouds to present a breakdown of the most common words. Words that are more common would be shown as larger; words that are less common would be shown as smaller. I was initially drawn to the idea of word clouds because they are visually appealing and attention-grabbing. In order to show the differences by region, genre, etc, I could show an individual word cloud for each, and let the viewer compare them visually.

Iteration 1: Changing the concept and question

After doing some research on word clouds, I realized that they are not the best visualization for what I want to represent, as explained in this article. In particular, visually comparing the word clouds for different demographics is imprecise and difficult for the viewer to parse. Therefore, I decided to brainstorm other methods of visualization. I decided to first look for an appropriate dataset, and let the dataset and results of data analysis guide me toward the best visualization.

I found the Million Song Dataset, which has feature analysis and metadata for one million songs, compiled by the Laboratory for the Recognition and Organization of Speech and Audio (LabROSA) at Columbia University. In exploring this dataset, I realized that they had tagged songs by the year in which they were released, and most amazingly, they had data going back to 1922!

Finding this dataset inspired me to update my question. I decided to focus specifically on change over time. How has word choice in song lyrics changed over the years? We all know that slang words and certain expressions become more or less common throughout the years; I wanted to study this phenomenon by looking at song lyrics.

Data Source & Transformations

The data that I used is the Million Song Dataset, available here. The data and accompanying information were presented in the following publication: Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011. Further contact information and citations for the Million Song Dataset can be found here.

The data that I downloaded consisted of several separate components: (1) Word count dictionary for ~200K songs. Each song was listed as a separate record and tagged with a track ID. The words were identified by word ID, corresponding to 5000 stemmed words, which covered ~90% of the dataset. (2) Map from track ID to the year it was published. (3) Map from stemmed version of words to one possible unstemmed version of the word. The data was in CSV format.

In order to do my analysis, I needed to perform several transforms on the data, which I did using a Python script. The script performed several tasks: (a) Link track IDs in dataset #1 to track years in dataset #2. (b) Transform dataset #1 from a {song -> wordcount} map to a {word -> list of wordcounts by year} map. This was done by adding up all occurrences of a word in a year across songs. (c) Map from stemmed to unstemmed version of words for easier analysis.
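Steps (a) through (c) can be sketched as follows. This is a minimal illustration, not the original script: the function names and the in-memory dictionary shapes are assumptions, and the sketch assumes each stem maps to a distinct unstemmed word.

```python
from collections import defaultdict


def counts_by_word_and_year(song_counts, track_years):
    """Turn {track_id: {word: count}} into {word: {year: total count}}.

    song_counts and track_years are assumed shapes for datasets #1 and #2.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for track_id, word_counts in song_counts.items():
        year = track_years.get(track_id)
        if year is None:
            continue  # skip tracks that have no known year
        for word, count in word_counts.items():
            totals[word][year] += count  # add up occurrences across songs
    return {word: dict(years) for word, years in totals.items()}


def unstem(totals, stem_to_word):
    """Relabel stems with one possible unstemmed form (dataset #3).

    Stems missing from the map are kept unchanged.
    """
    return {stem_to_word.get(stem, stem): years for stem, years in totals.items()}
```

For example, two 1965 songs that use "love" 2 and 3 times contribute a single entry of 5 for "love" in 1965, and a song with no year is left out entirely.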

The final output was a CSV file where each line followed the format: word,x,x,x,x,x... where each x is the number of occurrences of the word in 1922, 1923, 1924, etc.
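A sketch of writing that format, assuming the aggregated data is held as a {word: {year: count}} dictionary (an assumed shape) and that years with no occurrences are written as 0:

```python
import csv
import io


def write_word_rows(totals, first_year, last_year, out):
    """Write one line per word: word,count_first_year,...,count_last_year."""
    writer = csv.writer(out, lineterminator="\n")
    for word in sorted(totals):
        # Fill in 0 for years in which the word never appears.
        row = [totals[word].get(year, 0) for year in range(first_year, last_year + 1)]
        writer.writerow([word] + row)


# Usage with a tiny made-up example covering 1922-1924.
buffer = io.StringIO()
write_word_rows({"love": {1922: 4, 1924: 1}, "baby": {1923: 2}}, 1922, 1924, buffer)
```

Padding with zeros keeps every row the same length, so each column position always corresponds to the same year.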

Some caveats for the data: The FAQ for the Million Song Dataset lists how the songs were chosen. In looking at the change over time, it is possible that their algorithm selects songs in such a way that it systematically excludes certain songs, meaning the songs selected in a certain year are not representative of that year. Directionally, the results of our analysis suggest that this is not the case, but it is an important caveat to keep in mind and explore further. In addition, the mapping from stemmed to unstemmed words is only one possible mapping, so we lose the resolution between the change in different versions of a word - for example, we only know how "love" changed over time, and we don't know the difference between "loving", "loved", "lovable." For the purpose of my current analysis, I focus on studying the broad cultural differences over time, rather than specific word usage.

Iteration 2: Zooming in on a subset of the data

Since my updated question is "How has word choice changed over the years?" I decided that a time series plot would be the best visualization to show this data. My first attempt at visualization is below.


n=100 for visual clarity


We can see some immediate problems with this visualization. To begin with, there is far too much data here. The visualization above shows only 100 words as an example, but already the graph is too dense to make out any interesting patterns.

The solution was to simplify my dataset: (1) I decided to look only at the years from about 1960 to 2010. While we have data going back to the 1920s, the sample size for the early years is so small that it is hard to draw meaningful insight from it. (2) I decided to bucket the data by decades instead of looking at individual years. This lower resolution is more appropriate for our task, which is looking at broader trends over a long period of time. (3) There are 5000 words, but there is a long tail, and the top 1500 words account for ~92% of all word usage. Therefore, I decided to look only at the top 1500 words. All of these changes helped to reduce data overload and made it easier to do analysis.
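The three simplifications can be sketched as below. The helper names are hypothetical, and the exact year bounds (1960 through 2009) are an assumption, since the text only says "about 1960 to 2010".

```python
from collections import defaultdict


def bucket_by_decade(year_counts, first_year=1960, last_year=2009):
    """Sum {year: count} into {decade_start: count}, dropping years out of range."""
    decades = defaultdict(int)
    for year, count in year_counts.items():
        if first_year <= year <= last_year:
            decades[year // 10 * 10] += count  # e.g. 1967 -> 1960
    return dict(decades)


def top_words(totals_by_word, n):
    """Return the n most-used words and the share of all usage they cover."""
    ranked = sorted(totals_by_word, key=lambda w: (-totals_by_word[w], w))
    kept = ranked[:n]
    grand_total = sum(totals_by_word.values())
    coverage = sum(totals_by_word[w] for w in kept) / grand_total if grand_total else 0.0
    return kept, coverage
```

Calling `top_words(totals, 1500)` on the full data is how a coverage figure like the ~92% quoted above could be checked.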

Iteration 3: Changing to scatterplot instead of time series

After simplifying my dataset, I made an updated time series plot. My next attempt at visualization is below.


n=50 for visual clarity


This updated visualization still did not provide much insight. Although I was able to see more clearly how word count had risen over time, it was still difficult to visually compare the trends of different words. Looking at the top two lines, it seems the top word grew a bit more from the 1990s to the 2000s than the bottom word - but what does that mean? There wasn't really a larger insight I could draw from this. In addition, all words followed the same overall trend - the number of occurrences rose over time. This trend is probably a function of the simple fact that the number of songs produced (and included in the data set) rose over time.

However, I did notice something interesting. While all words followed the same overall trend over 4 decades, when I looked at the endpoints for 1960s and 2000s specifically, some words did rise drastically more than others. I realized that zooming in on these two endpoints would further clarify the picture.

More importantly, looking at the data this way led me to update my question to its final version. Given that all word counts are rising, but some words rise more drastically than others, let us ask: Which words shot up in popularity? And which words experienced slow growth (and therefore, a relative decline)? Instead of simply looking at the absolute word counts over time, we look at how words are changing relative to one another.

Reformulating my question inspired me to use a completely different visualization: a scatterplot in the form of a 2x2 matrix. This visualization is a powerful and clean way to show my data: I can plot a word's popularity in the 1960s on one axis, and a word's popularity in the 2000s on the other axis. Words that stay the same in relative popularity will fall along a diagonal trend line - words that shot up in popularity will be above the line, while words that declined in relative popularity will fall below the line. In addition, the 2x2 matrix neatly buckets words into 4 categories: Those that are popular in both decades ("stalwarts"), those that rose in popularity ("rising stars"), those that declined relative to the population ("out of fashion"), and those that are not popular in either decade ("unimportant"). I was inspired by the 2x2 matrices that are often used in business contexts to plot the growth potential of companies for making investment decisions.
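The four buckets can be expressed as a small classification rule. The cutoff of 750 (the midpoint of 1500 ranks) is an assumption for illustration; the text does not state where the quadrant boundaries were drawn.

```python
def quadrant(rank_1960s, rank_2000s, cutoff=750):
    """Place a word in one of the four 2x2 buckets.

    A lower rank number means a more popular word. The cutoff is hypothetical.
    """
    popular_then = rank_1960s <= cutoff
    popular_now = rank_2000s <= cutoff
    if popular_then and popular_now:
        return "stalwarts"
    if popular_now:
        return "rising stars"
    if popular_then:
        return "out of fashion"
    return "unimportant"
```

With the rankings reported at the top of this page, "baby" (#25 to #79) is a stalwart, while "letter" (#615 to #1110) is out of fashion.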

Reformulating my question also solved another issue I had with my dataset: previously, I was using the absolute counts to plot a word's growth over time. However, the total number of songs (and word count) increased over time, so I needed a way to account for this; otherwise it would just look like every word became more popular. There are several ways of doing this - measure a word's contribution to each year as a % of total, index a word's year-over-year growth to population growth, or convert absolute word counts into rankings - and I chose to use rankings. In each decade, I sorted words by their absolute word counts, then ranked them #1 to #1500. This measure reflected a word's popularity relative to other words, irrespective of the absolute counts.
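A minimal sketch of the ranking step for one decade. Breaking ties alphabetically is an assumption; the text does not say how ties were handled.

```python
def rank_words(counts):
    """Map each word to its rank within a decade, #1 being the most used.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: position for position, word in enumerate(ordered, start=1)}


# Counts that grow tenfold across the board produce identical rankings,
# which is why ranks are unaffected by the growth in the number of songs.
ranks_small = rank_words({"love": 500, "baby": 400, "letter": 50})
ranks_large = rank_words({"love": 5000, "baby": 4000, "letter": 500})
```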

To summarize, the changes that I made in this iteration: (1) Focus on the two endpoints of the 1960s and 2000s; (2) Change my question to look at which words became more or less popular relative to other words; (3) Plot data on a scatterplot 2x2 matrix instead of as a time series; (4) Measure word popularity by rankings relative to other words, instead of absolute word count.

Iteration 4: Using bubble size to give a sense of scale & adding visual cues

With these changes, I created the next version of my visualization. At this point, I switched to using Tableau for my final data analysis. Previously, I had been using Google Sheets to create simple charts for preliminary exploration, as it allowed me to quickly edit and transform data directly in the sheets. However, Tableau is much more powerful for creating visualizations. After using Tableau, I came to really appreciate its features - in particular, it allows very precise manipulation of graphical elements such as color, which becomes important later on. Below is the first version of my 2x2 scatterplot.


n=1500


This scatterplot is much more helpful for analyzing the data! It shows a clear trend - the data points roughly follow a diagonal line down and to the right - suggesting that words that are more popular in the 1960s (lower # on the y-axis) are also more popular in the 2000s (lower # on the x-axis which is reversed), and vice versa. However, we can also see that there are some clear outliers. There is a cluster of points in the upper right - these are the words that became much more popular by the 2000s. There is also a cluster of points in the lower left - these are the words that were popular in the 1960s, but fell out of fashion by the 2000s.
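The outliers can also be found numerically by measuring how far each word sits from the diagonal. This is a sketch with hypothetical helper names; the sample ranks are the ones quoted at the top of this page, with "f-word" used as a placeholder label.

```python
def rank_change(ranks_1960s, ranks_2000s):
    """Positive values mean the word climbed toward #1 by the 2000s."""
    shared = ranks_1960s.keys() & ranks_2000s.keys()
    return {word: ranks_1960s[word] - ranks_2000s[word] for word in shared}


def biggest_movers(changes, n):
    """Return (top n risers, top n fallers), each ordered by size of the move."""
    ordered = sorted(changes, key=lambda word: (-changes[word], word))
    return ordered[:n], ordered[-n:][::-1]


ranks_60s = {"f-word": 1448, "god": 341, "letter": 615, "flowers": 542}
ranks_00s = {"f-word": 214, "god": 198, "letter": 1110, "flowers": 1057}
changes = rank_change(ranks_60s, ranks_00s)
risers, fallers = biggest_movers(changes, 2)
```

A change of 0 corresponds to a point exactly on the diagonal; large positive and large negative values correspond to the two outlier clusters.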

However, there are still a few improvements to make. First, I wanted to account for the fact that some words may rise heavily in popularity, but still not be used very often in absolute terms. For example, a word that rises in ranking from #1430 to #957 will appear to have a meteoric rise, but the actual word count may be very low, as it is still in the bottom half of all words used. Second, I wanted to give a sense of scale. For example, even if a word rises from #500 to #200 - how does that compare to a word that rises from #5 to #4? Again, the first word appears to have a meteoric rise, so we may conclude that it is now widely used - however, if the absolute word count of a #4 ranked word is still many times that of a #200 ranked word, the first word may be less "important" than we think, even though the trajectory of its rise is very steep.

Therefore, I decided to use bubble size to give a sense of scale. I modified the visualization above to change the size of the bubbles, based on their "all time" word counts measured by summing absolute word counts from 1960s to 2000s. This gives us a sense of how "important" each word is, in the overall picture of things. The revised scatterplot is shown below.


n=1500


Again, this modification clarifies the picture further! We see now that while some words in the upper right corner have risen greatly in popularity, in absolute terms the frequency of their usage is still nowhere near that of the most popular words in the bottom right corner. This makes sense. Once words are used extremely frequently - on the order of ~300-500K per decade - it becomes much harder to climb up the ranks of popularity because that word needs ~100K additional "usages" to "overtake" the next closest competitor. On the other hand, a word ranked around #300 may only be used ~20K times, and therefore need far fewer usages to rise up in the rankings. Therefore, we would expect to see much more movement in rankings among the lower-ranked players, versus the top-ranked ones.
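The argument about overtaking can be made concrete by computing, for each word, how many extra usages it would need to pass the word ranked just above it. The counts below are made-up numbers chosen only to match the orders of magnitude mentioned above.

```python
def gap_to_next_rank(counts):
    """For every word except #1, the extra usages needed to pass the word above it."""
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    gaps = {}
    for above, below in zip(ordered, ordered[1:]):
        gaps[below] = counts[above] - counts[below] + 1  # +1 to strictly overtake
    return gaps


# Hypothetical per-decade counts, not values from the dataset.
example = {"love": 500_000, "baby": 400_000, "letter": 20_000, "sugar": 19_500}
gaps = gap_to_next_rank(example)
```

Here "baby" would need about 100K more usages to pass "love", while "sugar" needs only about 500 to pass "letter", so neighbouring low-ranked words swap places much more easily.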

NOTE: For my final visualization, I transformed the "all time" counts to be scaled logarithmically, indexed them against the minimum size, and used the result to draw bubble size. For some reason, my original attempt at using the raw all time counts yielded bubble sizes that did not provide useful information, so I had to do this transformation. However, when I recreated the chart above just now, the bubble sizes looked much closer to how I had originally intended, and provide useful information. In a future iteration, I would consider redoing the final visualization with the bubble sizes shown above.
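One possible reading of this transformation is sketched below: take the base-10 log of each count and divide by the log of the smallest count, so the least-used word gets size 1.0. The exact formula used in Tableau is not given in the text, so both the formula and the function name are assumptions.

```python
import math


def bubble_sizes(all_time_counts):
    """Log-scale counts and index them against the smallest one (assumed formula)."""
    smallest = min(all_time_counts.values())
    if smallest <= 1:
        raise ValueError("counts must be greater than 1 so the log of the minimum is positive")
    base = math.log10(smallest)
    return {word: math.log10(count) / base for word, count in all_time_counts.items()}


# Hypothetical all-time counts: a 1000x difference in usage becomes a 2x difference in size.
sizes = bubble_sizes({"love": 1_000_000, "sugar": 1_000})
```

Compressing the range this way keeps the smallest bubbles visible, at the cost of understating how much more often the top words are used.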

After all these changes, this scatterplot shows a much clearer picture of the data than we had initially, and it is much easier to do analysis. However, as a visualization, the scatterplot still looks dry. Therefore, I decided to use color as an additional visual cue, to reinforce the information shown by bubble size. Even though I use two separate visual cues to represent the same information (all time count), doubling up on visual cues is helpful because it reinforces the information visually, and makes the overall visualization more powerful and easy to grasp at a glance.

The final visualization in Tableau is shown below.


n=1500


Iteration 5: Labeling unique words

The final step is to look at the actual words! The scatterplot shows that some words have risen greatly in popularity, while others have declined, and others have always been and always will be popular - so what are those words? Tableau has a useful function that allows you to mouse over a data point and see all the associated data - so I started sifting through the data and matching individual words to data points. An important distinction is that this part of my analysis shows the subjective/qualitative side of data analysis. Before this, I was using quantitative data to plot points to reveal trends. However, for this part of my analysis, I started using qualitative analysis to find patterns and tell a story. This type of analysis is inherently subjective, although it is also valuable. One potential modification could be to make this visualization interactive - letting the audience mouse over different points to reveal which word each one is, and draw conclusions for themselves.

As I examined each word, I found that some of the words that rose in popularity were Spanish or German words. One explanation for this is that songs in Spanish or German have become much more popular over the last few decades. For the purposes of this analysis, I visually excluded those data points from consideration. In order to provide a completely rigorous analysis, another iteration of this visualization would involve going back to the original data set and excluding those words while calculating rankings.
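The more rigorous version described above could look like the sketch below: drop the excluded words first, then rank what remains. The exclusion list is hypothetical (the text does not name the Spanish or German words), and the counts are made up.

```python
def rank_without(counts, excluded):
    """Rank words by count after removing an exclusion list.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    kept = {word: count for word, count in counts.items() if word not in excluded}
    ordered = sorted(kept, key=lambda word: (-kept[word], word))
    return {word: position for position, word in enumerate(ordered, start=1)}


# Hypothetical counts and exclusion list, for illustration only.
decade_counts = {"love": 900, "que": 700, "baby": 600, "und": 500, "kiss": 100}
ranks = rank_without(decade_counts, excluded={"que", "und"})
```

Because every remaining word moves up to fill the gaps, excluding words before ranking gives different numbers than simply hiding their points on the finished chart.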

The last step was to add the word labels, and format the visualization in Google Drawings. The final visualization is shown below.


A2-SanbyLee-assign2.jpg


When we look at the visualization above, we can start to draw inferences about how word choice, and by extension our cultural psyche, have changed over the last few decades. My analysis is presented at the top of this page. As I mentioned, a great next step would be to make this visualization interactive, so that viewers can look at the data themselves and draw their own conclusions!