A2-ZachMaurer

From cs448b-wiki

Original Intentions

I'm interested in humanitarian issues, and I wanted to work with a dataset directly related to this topic. I also knew that I wanted to learn about different tools for joining and manipulating datasets, so I started to think about interesting combinations of datasets. After browsing data.hdx.org (a repository for many datasets from the UN and other organizations) for a while, I happened on an ACLED dataset on political violence in African countries. I decided to combine this with Mobile Phone Subscription data and Internet Use data from the World Bank for my assignment.

Based on these 3 datasets:

  • ACLED Conflict Data (1997-2015)
  • Mobile Phone Subscribers Per 100 People (1960-2014)
  • Internet Users Per 100 People (1960-2013)

I set out to investigate the following questions:

  • Have protests increased with internet and mobile access?
  • Who are the primary perpetrators of violent acts?
  • Has religion played an increasing role at different times?
  • Has religion changed in its prevalence across Africa?
  • Have incidents become more or less fatal?

Entry #1 - Examining Connectivity Growth

First, I decided to plot mobile phone and internet connectivity for each year. (Color represents different countries, but I didn't provide a legend because it would be 54 entries long.)

At this step, I wanted to test my assumption that phone and internet prevalence was increasing over time. Furthermore, I wanted to get a sense of how uniform the rate of change was across different countries.

Some specific steps that I went through:

  • Considered a number of different ways to manipulate the data, including some of the methods mentioned in class.
  • Ended up choosing to learn a new set of tools: iPython Notebooks, pandas and Python3
  • Really impressed by how quickly I was able to start using these different tools.
  • Had to do some initial country name manipulation and matching to filter phone and internet datasets by African countries.
  • Checked for any mismatches between labels and manually added to the list (e.g. “CONGO” vs. “CONGO, REPUBLIC OF”)
  • Limited the date ranges with pandas.
  • Had a very hard time figuring out how to display both measures in Tableau. Eventually, placing the pills labelled "Measure Names/Values" on the left axis caused the lines to be overlaid on each other.
  • Noticed that mobile phone subscriptions per person exceeded 2 in some cases. This is not necessarily an error in the data, since in many developing countries individuals own multiple different phones for different uses.
Mobile phone subscribers per 100 people over time in African countries.
Internet users per 100 people over time in African countries.
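The name-matching and date-limiting steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook actually used: the column name `country`, the alias table `NAME_FIXES`, the helper `filter_african`, and the toy values are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical alias table mapping one dataset's spelling to the other's.
NAME_FIXES = {"CONGO, REPUBLIC OF": "CONGO"}

def filter_african(wide, african_names, first_year, last_year):
    """Keep African countries and a year range from a wide country-by-year table."""
    out = wide.copy()
    # Normalise case and whitespace, then apply the manual fixes.
    out["country"] = out["country"].str.strip().str.upper().replace(NAME_FIXES)
    out = out[out["country"].isin(african_names)]
    # Year columns are assumed to be strings such as "1997".
    year_cols = [c for c in out.columns
                 if c != "country" and first_year <= int(c) <= last_year]
    return out[["country"] + year_cols].reset_index(drop=True)

# Toy table with made-up values, for illustration only.
phones = pd.DataFrame({
    "country": ["Congo, Republic of", "Kenya", "France"],
    "1996": [1.0, 2.0, 3.0],
    "1997": [1.5, 2.5, 3.5],
    "1998": [2.0, 3.0, 4.0],
})
african = {"CONGO", "KENYA"}

subset = filter_african(phones, african, 1997, 2014)
# Names from the African list that never matched still need a manual fix.
unmatched = african - set(subset["country"])
print(subset)
print("unmatched:", unmatched)
```

Printing the unmatched names is what makes the manual check for label mismatches quick: any country left in that set needs an entry in the alias table.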

Entry #2 - Where is Connectivity Lagging?

After producing the previous two visualizations, I noticed a sort of doubly-clustered growth pattern. The countries were very roughly split in half, with one group growing much faster than the other in terms of cell phone and internet connectivity.

I thought that this might be valuable information for later investigations, so I mapped the bottom 50% of the internet and phone connectivity datasets.

This required me to export two lists of countries from the phone and internet datasets from Tableau to csv. From there, I loaded up the csv into an iPython Notebook and wrote a python script to do a short set difference comparison on the lists.
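A comparison of this kind can be sketched with Python's built-in sets. The single `country` column, the helper `read_countries`, and the placeholder country names are assumptions for illustration; the real exports from Tableau may be laid out differently.

```python
import csv
import io

def read_countries(handle):
    """Read country names from a CSV that has a 'country' header column."""
    reader = csv.DictReader(handle)
    return {row["country"].strip().upper() for row in reader}

# Stand-ins for the two exported files; the names are placeholders.
phone_csv = io.StringIO("country\nCountry A\nCountry B\nCountry C\n")
internet_csv = io.StringIO("country\nCountry A\nCountry B\nCountry D\n")

slow_phone = read_countries(phone_csv)
slow_internet = read_countries(internet_csv)

slow_both = slow_phone & slow_internet        # lagging on both measures
phone_only = slow_phone - slow_internet       # lagging on phones only
internet_only = slow_internet - slow_phone    # lagging on internet only

print(sorted(slow_both), sorted(phone_only), sorted(internet_only))
```

With real files, `open("some_export.csv", newline="")` would replace the `io.StringIO` stand-ins.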

Some observations:

  • Not surprising that Central African countries have the slowest growth in connectivity, given development conditions relative to other African countries.
  • Interesting to note how internet growth was slow in the north but not the south and vice versa for phone growth.
African countries with the lowest internet and mobile phone access.

Entry #3 - How frequently do these events occur? (Or are recorded?)

To answer this question, I had to group the data by year and by type of political event.

  • In the graph below, it was interesting to note how "protests and riots" generally followed a similar trend to "violence against civilians" and "battles", and especially how the number of protests has far eclipsed the number of other events in the past couple of years.
Total occurrences of certain types of political events by year.

Entry #4 - What are the most fatal types of events?

At this point, I wanted to get an overall sense of how violence was distributed across different countries and what the causes of political violence were in the aggregate. In pandas, I grouped each year's ACLED data by event type and created annual totals for the number of events that occurred and the number of fatalities caused by each type of event.
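A grouping like this can be sketched with `groupby` and named aggregation. The column names `year`, `event_type`, and `fatalities` and the toy rows are assumptions for the example; the actual ACLED export may use different headers.

```python
import pandas as pd

# Toy rows with made-up values; one row per recorded event.
events = pd.DataFrame({
    "year": [1997, 1997, 1997, 1998, 1998],
    "event_type": ["Battles", "Battles", "Riots/Protests",
                   "Battles", "Riots/Protests"],
    "fatalities": [10, 5, 0, 3, 1],
})

# One row per (year, event type): how many events, and how many deaths.
annual = (
    events.groupby(["year", "event_type"])
    .agg(n_events=("fatalities", "size"),
         total_fatalities=("fatalities", "sum"))
    .reset_index()
)
print(annual)
```

The resulting long table (one row per year and event type) is a shape that plotting tools such as Tableau generally accept directly.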

This process gave me the following four graphs.

  • The first two are graphs of the data transformation described above. In the second graph, I've removed the fatalities data-points that likely corresponded to the Angola Civil War and Eritrea-Ethiopia War in '98-99 to get a better sense of the trends in the bulk of the data. It was astonishing to see the huge number of fatalities as a result of these events, relative to other historical records.
  • The third graph was experimenting with a different representation of similar data.
  • The fourth graph is a small multiple variation of the first two, where I re-organized the data in terms of Country. Although this graph was too large to view here (I've only included a crop), it was interesting to scan through and see how many countries experienced spikes at different times or had gradually increasing trends.


Fatalities due to certain political events by year.
Same as previous, but outliers excluded to see trends in bulk of data.
Total aggregate fatalities due to political events in African countries.
Fatalities due to certain types of events for each year, broken down by country.

Entry #5 - How are phone and internet access related to protests?

I iterated on a number of different "big" questions for my final visualization for this assignment. However, partially due to the amount of time I spent learning how to use pandas, I ended up having to rule a number of them out because there wasn't enough time for me to do certain types of analysis. For example, I had thought I might be able to manually classify the different actors involved in the ACLED conflict data by religious category to see how religious-political violence has changed over time. Unfortunately, there were close to 3000 unique groups, which made that task infeasible.

So, since my data was mostly in the proper format for answering questions related to phone/internet connectivity and protests, I started to think about the best way of displaying that information.

  • Based on the previous graphs, I knew that displaying 50+ countries on a single graph was not particularly effective. However, focusing in on a single country seemed to ignore the scale of the dataset that I was working with. So, I decided to treat the entire dataset as a set of tuples mapping number of protests in a given year to internet/phone access. This choice started to give me some sort of relationship which I believed I could work on clarifying and displaying effectively.
  • To produce this graph, I had to (1) pivot the original tables into a "stacked" form in pandas, (2) double-check that all country names matched, (3) linearly interpolate any missing phone/internet access values and then (4) manually copy and paste the values together in Excel so Tableau would digest my data properly.
  • The next two graphs were experiments looking for any clustering around countries or years. I realized that a multi-hue color scheme would not work well for encoding aspects of this data because it's too jumbled.
  • From the second graph, I realized that the Arab Spring protests in Egypt were a massive outlier in terms of frequency. Certainly, an interesting and relevant point for this overall topic, but since I was more concerned about overall trends related to phone and internet access, this seemed like an outlier data point.
Number of protests in a given year for African countries compared to internet and phone access.
Number of protests in a given year for African countries compared to internet and phone access. Country is color-coded and date is displayed as a floating label.
Total numbers of protests 1997-2015 in different African countries.
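The stacking and interpolation steps described in this entry can be sketched in pandas as below. Note one deliberate swap: the write-up joined the tables by hand in Excel, whereas this sketch uses `DataFrame.merge` for that step. The column names, the helper `to_long`, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def to_long(wide, value_name):
    """Stack a country-by-year table into one row per (country, year)."""
    long = wide.melt(id_vars=["country"], var_name="year", value_name=value_name)
    long["year"] = long["year"].astype(int)
    long = long.sort_values(["country", "year"]).reset_index(drop=True)
    # Fill gaps between known values, country by country, on a straight line.
    long[value_name] = long.groupby("country")[value_name].transform(
        lambda s: s.interpolate(method="linear", limit_area="inside")
    )
    return long

# Toy tables with made-up values and placeholder country names.
phones = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [1.0, 2.0],
    "2001": [np.nan, 4.0],
    "2002": [3.0, 6.0],
})
protests = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2002],
    "protests": [5, 7, 2],
})

phone_long = to_long(phones, "phones_per_100")
# validate= raises if a (country, year) pair is duplicated on the right side.
joined = protests.merge(phone_long, on=["country", "year"],
                        how="left", validate="many_to_one")
print(joined)
```

A left join keeps every protest row, so any country whose name failed to match shows up as a missing access value rather than silently disappearing.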

Entry #6 - Refining the Final Visualization

The two smaller thumbnails below are iterations on my final visualization. The main things that I learned at this point were:

  • A color gradient could be used effectively to connote the passing of time (i.e. the "year" value in my data). This helps communicate the increasing trend of protests and connectivity over time.
  • I tried sampling just the top 10 protesting countries to see if that reduced the noise near the origin. However, I concluded that this sampling filtered out too many data points. Instead, I decided to use an exponential trend line to reflect the relationship between protests and connectivity. Hopefully, the shallower-than-expected slope of the trend line indicates that a number of values are densely clustered near the origin or the x-axis.
Draft #1. Number of protests compared to internet/phone access. Big change here was using a blue gradient to denote the year of the data.
Draft #2. Drawing on the previous entry's conclusions, tried to see if there was a stronger trend by just sampling the top-10 protesting countries.
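For readers who want to reproduce an exponential trend line outside Tableau, one common approach is a straight-line least-squares fit on the logarithm of the y values. The sketch below assumes that approach; the write-up does not say how Tableau computes its trend line, and the function name `fit_exponential` and the synthetic data are illustrative only.

```python
import numpy as np

def fit_exponential(x, y):
    """Fit y ~ a * exp(b * x) by least squares on ln(y).

    Points with y <= 0 are dropped, because their logarithm is undefined.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = y > 0
    b, ln_a = np.polyfit(x[keep], np.log(y[keep]), 1)
    return np.exp(ln_a), b

# Synthetic points lying exactly on y = 2 * exp(0.5 * x).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * x)

a, b = fit_exponential(x, y)
print(a, b)
```

Under this method, country-years with zero recorded protests are excluded from the fit, which is worth keeping in mind when many points sit on the x-axis.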

Entry #7 - Final Visualization: Have protests increased with internet and mobile access?

Number of annual protests in African countries compared to normalized annual phone and internet access statistics. Phone and internet access statistics compiled from World Bank data. Protest totals by country and year sourced from ACLED African Political Violence Center. 2013-14 Egypt and South Africa data excluded as extreme outliers in order to provide a more detailed view of the majority of the data.

It is clear that the annual frequency of protests in African countries has increased along with internet and mobile phone access over the past 18 years. Although the trend line plots a more conservative rate of change than expected, due to the concentration of data points near the origin and the x-axis, there are a number of recent instances where greater phone and internet access have accompanied greater annual protest frequencies. This conservative rate of change may suggest that internet and mobile phone access are not strong causal factors for increasing numbers of protests. Instead, they may only increase the scale and momentum that protest movements develop over time. This hypothesis could be supported by the presence of extreme outlier data points representing the 2013-2014 Arab Spring protests in Egypt and the housing protests in South Africa (excluded from view to show detail in the bulk of the data), some of which had almost five times as many protests in one year (~1800) as the top data point displayed on this graph.

To further understand the role of phone and internet connectivity in protest activity in African countries over time, it would be valuable to supplement this analysis with datasets estimating the participation rate for protests over time. I assumed that an exponential trend line would be appropriate because of the exponential number of connections that a phone/internet medium allows between individuals in a network. However, it would be important to investigate how the size of protests has changed over time in comparison to phone and internet access to understand this relationship in more detail.