A2-BrandonLiu


Pick a domain

I chose to look into professional sports salaries, specifically Major League Baseball.

Initial question

  • What is the relationship between money and team wins?
  • Does more money correlate with more wins?


Assess the fitness of your data

There were a number of data sets offering data that traced many years back. For the most part, the data sets were very complete or could be pieced together (baseball-reference.com contained a large number of statistics). Initially, I figured that the increases in salaries and payrolls through the years would impact the results, but this proved not to be an issue, as I simply plotted multiple years at a time using small multiples. I could not find any data sets containing both salaries and team win percentages per year, so I had to process the data myself. I merged several data frames in R and transposed some of the columns to align them properly in order to get a full data set. Some players did not have salaries associated with their names, so I attempted to find the salaries online. Sometimes I could not (those salaries tended to be quite small and belonged to temporary players), so I ended up discarding most null values.
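The merge described above could look roughly like the following sketch. It is written in Python rather than the R that was actually used, and the row layout, field names, team codes, and every number in it are hypothetical placeholders, not the real data.

```python
from collections import defaultdict


def team_payrolls(salary_rows):
    """Sum player salaries per (year, team), skipping rows with no salary."""
    payroll = defaultdict(float)
    for row in salary_rows:
        if row["salary"] is None:  # discard null salaries
            continue
        payroll[(row["year"], row["team"])] += row["salary"]
    return dict(payroll)


def merge_payroll_and_records(salary_rows, record_rows):
    """Inner-join payroll totals with win/loss records on (year, team)."""
    payroll = team_payrolls(salary_rows)
    merged = []
    for rec in record_rows:
        key = (rec["year"], rec["team"])
        if key not in payroll:  # no salary data for this team-year
            continue
        games = rec["wins"] + rec["losses"]
        merged.append({
            "year": rec["year"],
            "team": rec["team"],
            "payroll": payroll[key],
            "win_pct": rec["wins"] / games,
        })
    return merged


# Hypothetical input rows (made-up teams, players, and amounts).
salary_rows = [
    {"year": 2015, "team": "AAA", "player": "P1", "salary": 10_000_000},
    {"year": 2015, "team": "AAA", "player": "P2", "salary": None},
    {"year": 2015, "team": "AAA", "player": "P3", "salary": 500_000},
    {"year": 2015, "team": "BBB", "player": "P4", "salary": 20_000_000},
]
record_rows = [
    {"year": 2015, "team": "AAA", "wins": 81, "losses": 81},
    {"year": 2015, "team": "BBB", "wins": 90, "losses": 72},
    {"year": 2015, "team": "CCC", "wins": 70, "losses": 92},
]

merged = merge_payroll_and_records(salary_rows, record_rows)
```

Team "CCC" drops out of the result because it has no salary rows, which mirrors the choice to discard records that could not be completed.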


Exploratory Analysis

First Attempt


I began simply by graphing the overall win percentages of different teams (Fig 1.1) and their payrolls (Fig 1.2) before putting the two together in a makeshift trellis plot. While payrolls varied widely, winning percentages stayed essentially within the 0.400 to 0.600 range.

Fig 1.1: Team win percentages by year
Fig 1.2: Team payrolls by year


I used small multiples to show the relationship between a team's payroll and the number of wins it achieved. Plotting year by year showed a positive relationship between the amount of money spent and the number of wins a team had that year. This agrees with commonly held beliefs: with more money, teams can afford more expensive (and better) players. Note: I spent a substantial amount of time fiddling with Tableau trying to generate a trellis plot directly but was unable to do so. I settled for filtering to different views and taking screenshots of those.

Fig 1.3: Trellis plot of team win percentages and team payrolls. Note the positive trend: as more money is spent, teams win more games.
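One way to put a number on the positive trend seen in the plots is a Pearson correlation coefficient per year. The sketch below is only an illustration of that idea: the analysis itself was visual, and the payroll and win-percentage values here are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical single-season data: payroll in millions of dollars
# and win percentage for five made-up teams.
payrolls_musd = [60, 90, 120, 150, 180]
win_pcts = [0.450, 0.470, 0.520, 0.510, 0.560]

r = pearson_r(payrolls_musd, win_pcts)
```

A value of r near 1 would indicate a strong positive linear relationship, a value near 0 little linear relationship. Computing it separately for each year would match the small-multiples layout.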

Second Attempt



From my first visualizations, I developed a number of additional questions, specifically:

  • How much is a win worth?
  • How does this value differ between teams?
  • How much is spending increasing and have particular teams changed their spending over time?

I reworked the data to compute a cost per win for each team. Then I generated new visualizations of the same data: one to better display changes (increases) in spending over time and another to identify outliers in salary and performance. I also added a bar chart of 2015 wins by team to better illustrate some of the points I am documenting.

Fig 2.1: Cost per win for MLB teams by year. The box-and-whisker plot allows better viewing of individual points and identification of outliers.
Fig 2.2: Trends in cost per win for MLB teams by year. The line graphs enable the user to better see changes in total salaries and within individual teams.


The median MLB team spent $1,456,787.77 per win in 2015. To illustrate how this value differs between teams, the LA Dodgers spent a staggering $3,240,993.59 per win, over twice the league median. Though the Dodgers did make the playoffs with 92 wins, five other major league teams won more games than they did. The St. Louis Cardinals spent $1,282,415.00 per win, slightly below the league median, and led all teams with 100 wins. Meanwhile, the Pittsburgh Pirates won 98 games at a cost of $1,065,892.85 per win. The Chicago Cubs had 97 wins and were also below the median at $1,247,216.20 per win. Of course, a team with more wins spreads its payroll across more of them, which lowers its displayed cost per win, but the contrast is still striking. In fact, teams like the Dodgers (and even more so the New York Yankees) appear to have consistently paid more for each of their wins.
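The arithmetic behind these comparisons is simple enough to sketch. The per-win figures below are the ones quoted in the paragraph above; the `cost_per_win` helper and the payroll passed to it at the end are hypothetical and only illustrate the division.

```python
def cost_per_win(payroll, wins):
    """Dollars of payroll spent per win in a season."""
    if wins <= 0:
        raise ValueError("wins must be positive")
    return payroll / wins


# 2015 figures quoted in the text (dollars per win).
LEAGUE_MEDIAN = 1_456_787.77
quoted = {
    "LA Dodgers": 3_240_993.59,
    "St. Louis Cardinals": 1_282_415.00,
    "Pittsburgh Pirates": 1_065_892.85,
    "Chicago Cubs": 1_247_216.20,
}

# How each team's cost per win compares with the league median.
ratio_to_median = {team: value / LEAGUE_MEDIAN for team, value in quoted.items()}

# Hypothetical payroll and win total, chosen only to show the calculation.
example = cost_per_win(162_000_000, 81)
```

With these figures the Dodgers' ratio comes out a little above 2.2, while the Cardinals, Pirates, and Cubs all fall below 1.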


Fig 2.3: Team wins in 2015


Final Attempt and Visualization

The previous analysis treats baseball teams as high-level organizations. However, it ignores the most important part of the game: the players. The next step in my research was to investigate how individual player value relates to money. To do this, I used a statistical measure called Wins Above Replacement (WAR). WAR is essentially a one-statistic summary of a player's contributions to their team (since it condenses everything into one stat, it is imperfect, but it is a valuable comparator between players). Another way to explain WAR: should a player need to be replaced (i.e. traded or injured), their WAR is the number of wins they contribute above the replacement-level player who would take their place. For these purposes, I chose to focus on offensive WAR = (Batting Runs + Base Running Runs + Fielding Runs + Positional Adjustment + League Adjustment + Replacement Runs) / (Runs Per Win).
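The WAR formula above can be written as a small function. This is a minimal sketch of that one formula, not of how baseball-reference.com derives each component. The function name is made up, and the input values, including the runs-per-win figure of 10, are illustrative assumptions rather than real player data.

```python
def war_from_runs(batting, base_running, fielding, positional,
                  league, replacement, runs_per_win):
    """Apply the formula from the text: sum the run components, divide by runs per win."""
    total_runs = (batting + base_running + fielding
                  + positional + league + replacement)
    return total_runs / runs_per_win


# Hypothetical component values for a single player-season.
example_war = war_from_runs(
    batting=30.0,
    base_running=3.0,
    fielding=5.0,
    positional=2.0,
    league=1.0,
    replacement=19.0,
    runs_per_win=10.0,  # assumed conversion rate, for illustration only
)
```

Here the components sum to 60 runs, so the sketch yields a WAR of 6.0 for this invented player.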

I sourced the data from baseball-reference.com. For my final visualization, I was curious about the specific value teams were getting from players, and about where teams like the Dodgers had gone wrong compared to teams that seemingly generated more value from rosters assembled at far lower salaries. The main questions that arose were the following:

  • Who are some of the most "valuable" players? Are non-free agents more valuable?
  • Who are some of the best value players at their position?
  • What is the cost of WAR in the league?

For the data selection, I chose to use only players who had positive WAR, and I also made a note to filter out players who had not played the majority of the season. This shrank the sample size considerably but arguably made for a more accurate data set. That being said, I opted not to calculate WAR or cost-of-WAR totals for whole teams, since there would be significant discrepancies given the missing data for players who had not played the whole season and the large volume of trades that happen late in the baseball season.
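The selection rules above might be expressed as follows. This is a sketch under assumptions: the exact cutoff for "majority of the season" is not stated, so more than half of a 162-game schedule is assumed here, and the player records and field names are invented.

```python
SEASON_GAMES = 162  # length of an MLB regular season


def select_players(players, min_games=SEASON_GAMES // 2 + 1):
    """Keep players with positive WAR who appeared in at least min_games games.

    The default cutoff (82 games) is an assumed reading of
    "majority of the season", not a threshold taken from the analysis.
    """
    return [p for p in players if p["war"] > 0 and p["games"] >= min_games]


def dollars_per_war(player):
    """Salary paid per unit of WAR; only meaningful when WAR is positive."""
    return player["salary"] / player["war"]


# Hypothetical player records.
players = [
    {"name": "Player A", "salary": 500_000, "war": 5.0, "games": 150},
    {"name": "Player B", "salary": 20_000_000, "war": 4.0, "games": 140},
    {"name": "Player C", "salary": 3_000_000, "war": -0.5, "games": 120},
    {"name": "Player D", "salary": 1_000_000, "war": 1.0, "games": 40},
]

kept = select_players(players)
ranked = sorted(kept, key=dollars_per_war)  # cheapest WAR first
```

Filtering on positive WAR before dividing also avoids negative or undefined cost-of-WAR values.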

Fig 3.1 graphs WAR against player salary and highlights some of the top players for bargain value. The tendency is that the highest value players are non-free agents.

Fig 3.1: Player value and WAR. Of note is that the best player value tends to come from players who cost in the small millions (~$100,000 - $6,000,000). This is no coincidence: players in their early years in the league are ineligible for free agency, whereas free agents can negotiate for higher pay and sign with other teams. Thus, players like Bryce Harper and Mike Trout receive much smaller salaries despite posting WAR values as high as or higher than those of veteran players making several times as much.

Fig 3.2 shows the relative cost of WAR for individual players.

Fig 3.2: Area demonstrates the relative cost of WAR for individual players. The larger the box for a player, the more expensive their WAR is. Players like Troy Tulowitzki are quite expensive relative to their contributions to the team. Of note is the difference between Fig 3.2 and Fig 3.1. Long-time household names like Joe Mauer and Yadier Molina provide extremely expensive WAR. These veteran players negotiated high-paying contracts. Clubs are forced to sign these contracts, all the while knowing that players' value tends to decrease with age.
Fig 3.2.1: See individual players with the highest WAR.

Lastly, I chose to look at the players with the highest positional value. Different positions in baseball tend to produce different types of numbers. For instance, catchers tend to hit below average because they expend more energy during games behind the plate. Fig 3.3 looks at trends in WAR by position:


Fig 3.3: Salary and WAR are plotted at each position in a trellis plot to allow for distinction by position.

Fig 3.3 highlights the highest "value" players at each position by WAR and shows that particular positions tend to have lower WARs than others (e.g. catcher is much lower than first base). Also note the trendlines associated with each position. For third base, it appears that increasing salaries does not yield an obviously higher WAR (of course, this is a simplification of the trend, but it does indicate returns that are not fantastic).
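A per-position trendline of this kind can be approximated with an ordinary least-squares line fitted to salary and WAR within each position. The sketch below assumes a simple linear fit, which may not match the exact model Tableau used, and all players, positions, and values in it are invented.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares slope and intercept; xs must not all be equal."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def trend_by_position(players):
    """Fit one salary-vs-WAR line per position."""
    groups = defaultdict(list)
    for p in players:
        groups[p["position"]].append((p["salary_musd"], p["war"]))
    return {
        pos: fit_line([s for s, _ in pts], [w for _, w in pts])
        for pos, pts in groups.items()
    }


# Hypothetical players: salary in millions of dollars, WAR for the season.
players = [
    {"position": "3B", "salary_musd": 1, "war": 3.0},
    {"position": "3B", "salary_musd": 10, "war": 3.0},
    {"position": "3B", "salary_musd": 20, "war": 3.0},
    {"position": "1B", "salary_musd": 1, "war": 1.0},
    {"position": "1B", "salary_musd": 11, "war": 3.0},
    {"position": "1B", "salary_musd": 21, "war": 5.0},
]

trends = trend_by_position(players)
```

In this made-up data the third-base slope is zero, the kind of flat trend described above, while the first-base slope is positive.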

In summary, Figs 3.1 through 3.3 highlight some of the most valuable players in MLB overall and at each position. Notably, Mike Trout and Bryce Harper have the highest WAR and among the lowest salaries. The best bargains tend to come early in players' careers, when they produce high numbers for low salaries (prior to free agency). Many older free agents provide substantially less value for their price--this may also be due to factors like age. Interestingly, Harper and Trout, in addition to having extremely high WAR-to-salary ratios, posted the highest raw WAR values. Thus, with the WAR "ceiling" attainable by some of the least expensive young players, the data may indicate that teams should avoid high-cost, aging free agents.


Data Sources

Data Tools

For the most part, I used Tableau to generate the visualizations. I relied on Excel and R to handle data transformations as necessary.