This is the last question of Problem set 5. In this problem you will use real data from the Titanic to calculate conditional probabilities and expectations.

tldr: the ship sinks

On April 15, 1912, the Titanic, then the largest passenger liner ever built, sank after colliding with an iceberg during her maiden voyage. The sinking killed 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people were more likely to survive than others.

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes of the person, including whether they survived ($S$), their age ($A$), their passenger class ($C$), their sex ($G$) and the fare they paid ($X$).

[Question 12] Write a program in C, C++, Java or Python that reads the data file and finds the answers to the following questions:

  1. Calculate the conditional probability that a person survives given their sex and passenger-class:
    $P(S=\text{ true | } G = \text{female}, C = 1)$
    $P(S=\text{ true | } G = \text{female}, C = 2)$
    $P(S=\text{ true | } G = \text{female}, C = 3)$
    $P(S=\text{ true | } G = \text{male}, C = 1)$
    $P(S=\text{ true | } G = \text{male}, C = 2)$
    $P(S=\text{ true | } G = \text{male}, C = 3)$
  2. What is the probability that a child who is in third class and is 10 years old or younger survives? Since the number of data points that satisfy the condition is small, use the "Bayesian" approach and represent your probability as a beta distribution. Calculate a belief distribution for:
    $S=\text{ true | } A \leq 10, C = 3$
    You can express your answer as a parameterized distribution.
  3. How much did people pay to be on the ship? Calculate the expectation of fare conditioned on passenger-class:
    $E[X \text{ | } C = 1]$
    $E[X \text{ | } C = 2]$
    $E[X \text{ | } C = 3]$
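
The three kinds of quantities above can be sketched on toy data as follows. This is a minimal illustration, not a reference solution: the records in `toy` are made up (they are not rows from titanic.csv), the helper names are hypothetical, and the beta parameters assume a uniform Beta(1, 1) prior, which the problem statement does not specify.

```python
# Hypothetical toy records: (survived, passenger_class, sex, age, fare).
# These values are invented for illustration only.
toy = [
    (1, 1, "female", 29.0, 80.0),
    (0, 1, "male", 40.0, 60.0),
    (1, 3, "female", 4.0, 10.0),
    (0, 3, "male", 8.0, 12.0),
    (0, 3, "male", 30.0, 8.0),
    (1, 2, "female", 35.0, 20.0),
]


def conditional_probability(rows, event, condition):
    """Estimate P(event | condition) as count(event and condition) / count(condition)."""
    matching = [r for r in rows if condition(r)]
    if not matching:
        return None  # no row satisfies the condition, so the estimate is undefined
    return sum(1 for r in matching if event(r)) / len(matching)


def beta_posterior(rows, event, condition, prior_a=1, prior_b=1):
    """Return (a, b) for a Beta belief about P(event | condition).

    Assumes a Beta(prior_a, prior_b) prior; the default Beta(1, 1) is uniform.
    Each matching row where the event holds adds 1 to a, every other
    matching row adds 1 to b.
    """
    matching = [r for r in rows if condition(r)]
    successes = sum(1 for r in matching if event(r))
    failures = len(matching) - successes
    return prior_a + successes, prior_b + failures


def conditional_expectation(rows, value, condition):
    """Estimate E[value | condition] as the sample mean over matching rows."""
    values = [value(r) for r in rows if condition(r)]
    if not values:
        return None
    return sum(values) / len(values)


def survived(r):
    return r[0] == 1


print(conditional_probability(toy, survived, lambda r: r[2] == "female" and r[1] == 1))
print(beta_posterior(toy, survived, lambda r: r[3] <= 10 and r[1] == 3))
print(conditional_expectation(toy, lambda r: r[4], lambda r: r[1] == 3))
```

Passing the event and the condition as functions keeps one helper reusable for all six sex and class combinations, at the cost of scanning the rows once per query.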

You only have to submit your answers, not your program. You could therefore get away with calculating these statistics by hand, but use a program anyway. This is a warm-up to Problem set 6, where you will write machine learning algorithms (in C, C++, Java or Python) that read data and perform more advanced calculations.

Don't know how to create a new project from scratch? Ask a TA or download a blank CS106A Java project or a blank CS106B C++ project.

Aside: In making this problem I learned that there were somewhere between 80 and 153 passengers from present-day Lebanon (then part of the Ottoman Empire) on the Titanic. At the upper estimate, that would be about 7% of the people aboard.

Update (May/12): We removed commas from the name field in the dataset to make parsing easier.
Titanic Dataset


Dataset columns:
  • 0: Survived Indicator
  • 1: Passenger Class
  • 2: Name
  • 3: Sex
  • 4: Age
  • 5: Siblings Aboard
  • 6: Parents Aboard
  • 7: Fare paid in £s
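
One way to turn the column indices above into typed records is sketched below. It makes several assumptions that the description does not confirm: that the file has a header row, that the survived indicator is stored as 0 or 1, and that sex is stored as the words male and female. The `Passenger` type, the `read_passengers` function, and the two sample rows are hypothetical and exist only for this illustration.

```python
import csv
import io
from typing import NamedTuple


class Passenger(NamedTuple):
    survived: bool   # column 0
    pclass: int      # column 1
    name: str        # column 2
    sex: str         # column 3
    age: float       # column 4
    siblings: int    # column 5
    parents: int     # column 6
    fare: float      # column 7, in pounds


def read_passengers(lines, has_header=True):
    """Parse an iterable of CSV lines into Passenger records.

    Assumes a header row by default and a 0/1 survived indicator.
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    passengers = []
    for row in reader:
        if len(row) < 8:
            continue  # skip blank or malformed lines
        passengers.append(Passenger(
            survived=row[0].strip() == "1",
            pclass=int(row[1]),
            name=row[2].strip(),
            sex=row[3].strip().lower(),
            age=float(row[4]),
            siblings=int(row[5]),
            parents=int(row[6]),
            fare=float(row[7]),
        ))
    return passengers


# Made-up sample in the assumed layout, not real rows from the dataset.
sample = io.StringIO(
    "Survived,Pclass,Name,Sex,Age,Siblings,Parents,Fare\n"
    "0,3,Example Person One,male,22,1,0,7.25\n"
    "1,1,Example Person Two,female,38,1,0,71.28\n"
)
people = read_passengers(sample)
print(len(people), people[0].sex, people[1].fare)
```

For the real file, the same function could be called as `read_passengers(f)` inside `with open("titanic.csv", newline="") as f:`, passing `has_header=False` if the file turns out to have no header row.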

Extensions?

See if you can find something surprising in the dataset. Can you predict p? Can you find interesting correlations?
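
As one possible starting point for looking at correlations, the sketch below computes the Pearson correlation coefficient between two numeric columns. The choice of Pearson correlation is an assumption (the extension does not name a measure), `pearson_correlation` is a hypothetical helper, and the fare and survival values are made up rather than taken from the dataset.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation of two equal-length numeric sequences.

    Returns None when either sequence has zero variance, because the
    coefficient is undefined in that case.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Invented values: fare paid and a 0/1 survival indicator for four people.
toy_fares = [10.0, 20.0, 30.0, 40.0]
toy_survived = [0, 0, 1, 1]
print(pearson_correlation(toy_fares, toy_survived))
```

A coefficient near 1 or -1 suggests a strong linear relationship, but it says nothing about cause, and with a binary column such as survival it is only a rough summary.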