About me

I am a behavioral scientist and research director at Stanford University. I study how to measure and affect academic motivation on a large scale. You can read about my work and its motivation in my dissertation.

PERTS

I co-founded PERTS, the Project for Education Research That Scales, and I spend most of my time directing it. PERTS is an applied research center at Stanford University that partners educational organizations, like schools and online learning platforms, with behavioral research teams to measure and change the way students think about learning. We are delighted to be collaborating with Professors Greg Walton, Carol Dweck, Jo Boaler, James Gross, and David Yeager on a wide array of projects.

Interests

I'm a huge nerd with a long list of interests. But, most of all, I like my work to have a real-world impact. That means I often work with foundations, schools, research groups, and businesses to help them measure, understand, and affect individuals' behavior.

 

Fun with R

I love R and the R community. It seems like every time I have an obscure question about data analysis or graphing, several good Samaritans have already posted their answers on Stack Overflow, Cross Validated, or other wonderful R pages and blogs. When I can, I try to give back. In that spirit, I'm including some functions below that may prove useful to others (and to my future self).

 

Logistic Regression Power Analysis

I recently needed to determine the necessary sample size for an analysis with a dichotomous outcome (earning credit in a math course). I soon discovered that power calculations for logistic regression models are complicated by their non-linear nature (Hsieh, Bloch, & Larsen, 1998). Although a number of analytical solutions have been proposed for computing the power of logistic regressions, these formulae generally hold only under particular assumptions (Demidenko, 2007). For simplicity, and to avoid the bias associated with analytical approximations, I wrote a few R functions that conduct Monte Carlo power simulations of logistic regression.

To use the functions below, you specify the proportion of subjects assigned to treatment (R; .5 yields a 50/50 split), the expected event prevalence in the control group (CP) and treatment group (TP), and the number of observations per sample (n).

			library(data.table)	# for faster processing

			#	specify simulation parameters
			# 	R 	= proportion of subjects assigned to treatment
			#	CP 	= event prevalence in control group 
			#	TP 	= event prevalence in treatment group 
			#	n 	= observations per sample

			power_specifications <- list(
				#	what's the power with n=200 or 500
				#	if the control prevalence (CP) is 20%
				#	and the treatment increases prevalence by 10% or 15%
				#	i.e., TP = .3 or .35
				data.frame( R=.5, CP=.2, TP=.3, n=200 ) ,
				data.frame( R=.5, CP=.2, TP=.35, n=200 ) ,
				data.frame( R=.5, CP=.2, TP=.3, n=500 ) ,
				data.frame( R=.5, CP=.2, TP=.35, n=500 )
			)

			#	convert the specifications to a single data.table
			ldf 	<- do.call( rbind , power_specifications )
			ldf$id 	<- 1:nrow( ldf )
			ldf 	<- as.data.table( ldf )

			#	make data.frame with requested binomial parameters
			mdf <- function( R, CP, TP, n ){
				treated_n 	<- floor(R*n)
				control_n	<- n-treated_n
				condition	<- c( rep("treated",treated_n), rep("control",control_n ) )
				treated_obs 	<- rbinom( treated_n, 1, TP )
				control_obs 	<- rbinom( control_n, 1, CP )
				return( data.frame( cond=condition, value=c(treated_obs, control_obs) ) )
			}

			#	return the p-value of the treatment term from a logistic regression
			logit_p 	<- function(df){ 
				summary(glm( value ~ cond, family=binomial, data=df ))$coefficients[2,4] 
			}

			#	return the proportion of simulations with p < .05 (default sims=1000)
			sig_prop 	<- function( R, CP, TP, n, sims=1000 ){
				ps <- replicate( sims, logit_p( mdf( R, CP, TP, n ) ) ) 
				return( mean(ps < .05) )
			}

			#	run the simulations
			#	put the power in the "power" column
			ldf[ , power := sig_prop( R, CP, TP, n ) , by=id ]
			#	calculate the effect size in terms of the effect on prevalence
			ldf$effect <- as.factor( ldf$TP - ldf$CP )

			#	see what the power is...
			ldf

			#	output
			# 	    R  CP   TP   n id power effect
			#	1: 0.5 0.2 0.30 200  1 0.367    0.1
			#	2: 0.5 0.2 0.35 200  2 0.664   0.15
			#	3: 0.5 0.2 0.30 500  3 0.738    0.1
			#	4: 0.5 0.2 0.35 500  4 0.959   0.15
		

For example, the output above suggests that with a sample of 500, a control-group prevalence of 20%, and a treatment that raises prevalence by 10 or 15 percentage points, the power would be 74% or 96%, respectively. Power also depends on where the baseline prevalence sits: because the outcome variance is largest at 50%, power falls as the baseline prevalence approaches 50% and rises as it moves away. See for yourself by changing the CP and TP variables and re-running the simulation.
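For instance, you could compare a baseline prevalence near 50% to the 20% baseline, holding the 10-point effect and n=500 constant (this reuses the sig_prop function defined above):

			#	same 10-point effect at two different baseline prevalences;
			#	power should be lower near 50%, where outcome variance is largest
			sig_prop( R=.5, CP=.20, TP=.30, n=500 )
			sig_prop( R=.5, CP=.45, TP=.55, n=500 )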

Obviously, these functions would be useful for a broader range of situations if they could control for covariates of various strengths (a rough sketch of that idea follows below). They would also be better if they could handle nested data via mixed-effects models. Maybe I'll add those features in the future if they prove necessary. I'm also open to other suggestions.
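Here is a minimal sketch of the covariate idea. The helpers mdf_cov, logit_p_cov, and sig_prop_cov, along with the b_x coefficient, are my own illustrative assumptions rather than part of the functions above; note that once a covariate enters the model, CP and TP become the prevalences at the covariate's mean (x = 0) rather than exact marginal prevalences:

			#	like mdf() above, but adds a covariate "x" whose strength on the
			#	log-odds scale is b_x (hypothetical helper for illustration)
			mdf_cov <- function( R, CP, TP, n, b_x=.5 ){
				treated	<- rbinom( n, 1, R )
				x		<- rnorm( n )
				logodds	<- qlogis(CP) + ( qlogis(TP) - qlogis(CP) )*treated + b_x*x
				value	<- rbinom( n, 1, plogis(logodds) )
				data.frame( cond=ifelse( treated==1, "treated", "control" ),
							x=x, value=value )
			}

			#	p-value of the treatment term, controlling for the covariate
			logit_p_cov <- function(df){
				summary(glm( value ~ cond + x, family=binomial, data=df ))$coefficients[2,4]
			}

			#	power with the covariate controlled, analogous to sig_prop() above
			sig_prop_cov <- function( R, CP, TP, n, b_x=.5, sims=1000 ){
				ps <- replicate( sims, logit_p_cov( mdf_cov( R, CP, TP, n, b_x ) ) )
				mean( ps < .05 )
			}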

 

Basic Graphing with ggplot2

My favorite package for graphing in R is ggplot2. Hats off to Hadley Wickham, its creator. ggplot2 provides lots of useful options that make it easy (once you know what you're doing) to make highly informative graphs. However, it's not always entirely intuitive, and I often find myself digging through my notes or the documentation to remind myself how to change plot parameters. In this section, I provide a few basic use cases in the hope that they will prove useful to others (and to my future self).

Version information

ggplot2 underwent some major revisions in version 0.9.2.1. Those broke a lot of my old ggplot2 code, but I can't complain because the changes made it easier to customize plots. The examples below assume you're using version 0.9.2.1 or greater.

Simulate some data

To do some graphing, we'll need data to graph. Let's simulate some data from a hypothetical study. In this study, 2/3 of students are assigned to a treatment designed to boost math scores, and 1/3 are assigned to a control condition. Their math scores are collected before and after treatment. We want to visualize the effects of the treatment on follow-up scores.

			#	generate a data frame "d" with the data to graph
			set.seed( 1 ) 		#	set the random seed
			n					<- 200	#	number of observations
			es					<- .7 	#	effect size
			# oversample the treatment condition
			conditions			<- c( rep("treatment",2) , "control" )
			assigned_condition	<- sample( conditions , n , replace=TRUE )
			treated				<- assigned_condition == "treatment"
			prescore			<- rnorm( n )
			postscore			<- prescore + rnorm( n ) 
			#	create an interaction between treatment and prescore
			#	so that treatment is more effective for initial low-performers
			interaction			<- es - prescore[treated]
			postscore[treated] 	<- interaction + postscore[treated]
			d					<- data.frame(	prescore 	= prescore, 
												postscore	= postscore, 
												condition	= assigned_condition )
		

Create a nice theme

I love ggplot2, but I'm not crazy about some of the aesthetic defaults. Here we create a theme called horzi_theme that makes the background white and draws thin gray horizontal gridlines (and no vertical ones). We'll use horzi_theme in each of the subsequent graphs.

			#	horzi theme makes the background white with thin, horizontal lines
			library( grid ) 	# for units
			horzi_theme <- theme(	#	remove the gray background
				panel.background 	= element_blank() ,
				#	make the major gridlines light gray and thin
				panel.grid.major.y	= element_line( size=.1, colour="#666666" ) ,
				#	suppress the vertical grid lines
				panel.grid.major.x 	= element_blank() ,
				#	suppress the minor grid lines
				panel.grid.minor 	= element_blank() ,
				#	add axes
				axis.line			= element_line( size=.2 , colour="#666666" ),
				#	adjust the axis ticks
				axis.ticks 			= element_line( size=.2 , colour="#666666" ),
				#	move the y-axis over to the left 
				axis.title.y		= element_text( angle=90, vjust=-.1, hjust=.5 ),
				#	increase the margin so the moved-over y-axis label fits
				plot.margin = unit( c(.5,.25,.25,.5) , "in") 
			)


		
 

Distribution plots

It's always smart to look at distributions first. Let's compare the two conditions' distributions with: 1) histograms in separate facets and 2) overlapping density curves.

Histogram

Give each condition its own facet and histogram. Note that it's hard to compare the distributions by eye because there are more students in the treatment condition. That's lame, and it's where density plots come in.

			ggplot( d , aes( postscore, fill=condition ) ) +
				#	binwidth=1 groups responses along bins of 1 x-axis unit
				geom_histogram( binwidth=1 ) +
				#	override the default colors for each condition
				scale_fill_manual( values=c("control"="red","treatment"="blue" ) ) +
				#	include horzi_theme to change the default background etc.
				horzi_theme +
				#	create separate facets for condition
				facet_grid( . ~ condition )
		
Density plot

Density curves look cooler. They're also more useful when group sizes are unbalanced: each curve shows the relative density of its own distribution rather than the absolute count.

			#	the scales library enables % formatting (used for y-axis below)
			library( scales )
			ggplot( d , aes( postscore, fill=condition ) ) +
				#	we need to make it transparent so we can see the overlap
				geom_density( alpha = .3 , size=0 ) +
				scale_y_continuous( labels=percent ) +
				#	set the curve colors
				scale_fill_manual( values=c( "red","blue" ) ) +
				horzi_theme +
				#	make the legend a little prettier
				guides( fill = guide_legend( title = "Condition" ,
									#	change font-face to remove bold
									title.theme 	= element_text( face="plain", angle=0 ) ,
									#	remove the black border around keys
									override.aes 	= list( colour="white" )
									)
						)
		
 

Bar graphs

What about that venerable classic (sarcasm), the bar graph?

			#	make a simple graph showing postscore by condition
			ggplot( d , aes( condition, postscore, fill=condition ) ) +
				geom_bar( stat="summary", fun.y="mean", position="dodge" ) +
				#	manually set colors and corresponding labels; no need for a legend
				scale_fill_manual( 		guide  = "none",
										values = c( "control"="red", "treatment"="blue" ) ,
										breaks = c( "control", "treatment" ) ,
										labels = c( "control group", "treatment group" ) 
									) +
				#	label the graph and axes
				ggtitle( "A totally lame bar graph" ) +
				#	add a y-axis title
				scale_y_continuous( "Post-Study Math Score" ) +
				horzi_theme 
		
Bar graph with error bars

Now let's make a bar graph with error bars. To make error bars we'll pre-compute a new data frame "g" with the condition means and error bar limits.

			#	compute the means and save to g
			g	<- aggregate( postscore ~ condition, data = d , mean )
			#	append the standard deviation to g
			gsd	<- aggregate( postscore ~ condition, data = d , sd )
			g	<- merge( g, gsd , by="condition", suffixes=c("",".sd") )
			#	append the n to g
			glen<- aggregate( postscore ~ condition, data = d , length )
			g	<- merge( g, glen , by="condition", suffixes=c("",".len") )
			#	calculate the standard error of the mean
			g$se	<- g$postscore.sd/(g$postscore.len)^.5
			#	compute the error bar limits
			g$upper	<- 	g$postscore + g$se
			g$lower	<- 	g$postscore - g$se
		

And now for the actual graph code...

			#	now make the bar graph with error bars
			ggplot( g , aes( condition, postscore, fill=condition) ) +
				#	stat="identity" because the value is already in g (no averaging needed)
				geom_bar( stat="identity", position="dodge" ) +
				#	manually set colors and corresponding labels; no need for a legend
				scale_fill_manual( 		guide  = "none",
										values = c( "control"="red", "treatment"="blue" ) ,
										breaks = c( "control", "treatment" ) ,
										labels = c( "control group", "treatment group" ) 
									) +
				#	add the error bars
				geom_errorbar( 	aes( ymax=upper, ymin=lower ) , 
								width	=.25, 
								position="dodge",
								color	="#666666" ) +
				#	label the graph
				ggtitle( "A slightly more informative graph\n(because of the error bars)" ) +
				#	adjust the title font and size, just for fun
				theme( title = element_text( family="Verdana", size=10 ) ) +
				#	add a y-axis title
				scale_y_continuous( "Post Scores" ) +
				horzi_theme 
		
 

Line graph

Neither of the bar graphs gave us any indication that the treatment was more effective for initial low-performers. So let's make a line graph to visualize this moderation effect.

			#	make a line + scatter graph showing prescore and postscore by condition
			ggplot( d , aes( prescore, postscore, color=condition, linetype=condition) ) +
				#	add points (scatterplot)
				geom_point( 	position="jitter" , # 	jitter the points to prevent overlap
								alpha=.3 			#	reduce opacity
								) +
				#	create a trendline for each condition 
				geom_smooth( 	method="lm", 	#	method="lm" fits a linear model via stats::lm()
								se=TRUE , 		#	se=TRUE, show 95% confidence bands
								fullrange=TRUE 	#	fullrange=TRUE, extend trendlines to margin
							) +
				#	manually set colors and corresponding labels
				scale_colour_manual( 	values = c( "control"="red", "treatment"="blue" ) ,
										#	define the groups
										breaks = c( "control", "treatment" ) ,
										#	rename the groups on the legend
										labels = c( "control group", "treatment group" ) ,
										#	change the legend parameters
										guide  = guide_legend( 	
													title="Treatment Condition",
													title.theme = element_text( 
															size 	= 13, 
															face 	= "plain", 
															color 	= "#333333", 
															angle 	= 0 ) ,
													label.theme = element_text( 
															size 	= 12, 
															face 	= "plain", 
															color 	= "#333333", 
															angle 	= 0 )
													)
									) +
				#	manually set linetypes and corresponding labels
				scale_linetype_manual( 	values = c( "control"="solid", "treatment"="dashed" ) ,
										guide="none"	#	suppress linetype legend
										) +
				#	adjust the zoom of the graph (not useful in this case)
				coord_cartesian( xlim=c(-2,2) , ylim=c(-2,2) ) +
				#	label the graph and axes
				ggtitle( "A much more informative graph" ) +
				xlab( "Pre-Score" ) +
				#	just for fun, we'll change the y-axis breaks
				scale_y_continuous( "Post-Score" , breaks=c(-1:1) ) +
				#	change the look of the graph
				theme( 	#	make the title and axis titles 12 point Verdana
						title = element_text( family="Verdana", size=12 ) , 
						axis.title.x = element_text( size=12 ) ,
						axis.title.y = element_text( size=12 ) 
						) +
				horzi_theme
		
 

Stratified Randomization

Stratified randomization keeps the composition of randomly assigned experimental conditions balanced across specified sample characteristics, e.g., race, gender, prior performance. The two functions below make it easy to conduct stratified randomization in R.

Why do stratified randomization?

When conducting experiments, it is important to ensure that the sample of individuals in each experimental condition is similar. For example, let's say that proportionally more high achievers than low achievers were randomly assigned to the treatment condition. If the treatment condition experienced greater improvement along the outcome of interest, it would be difficult to ascertain whether the improvement was due to the treatment itself or due to high achievers improving more quickly than low achievers in general. Although there are post-hoc ways to statistically control for such imbalances, failures of randomization can complicate analyses and damage researchers' ability to rigorously draw causal inferences about the effects of a given treatment.
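To see how easily simple (unstratified) randomization can produce such imbalances in a small sample, here's a quick simulation of my own (it isn't part of the functions below): it assigns 40 students to two conditions 1,000 times and records how often the conditions end up substantially imbalanced on prior achievement.

			#	how often does simple randomization leave the two conditions
			#	noticeably imbalanced on a prior-achievement covariate?
			set.seed( 2 )
			imbalance <- replicate( 1000 , {
				prior	<- rnorm( 40 )	#	standardized prior achievement
				cond	<- sample( rep( c("treatment","control"), 20 ) )
				means	<- tapply( prior, cond, mean )
				abs( means["treatment"] - means["control"] )
			} )
			#	proportion of samples where condition means differ by > .5 SD
			mean( imbalance > .5 )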

Stratified randomization functions

This code has two functions (the second builds on the first) and some example code for testing them.

			#	balanced_randomization
			#	Inputs:
			#		ids		vector of unique ids to randomize to one of the groups
			#		groups 	groups to randomize ids to
			#	Returns:
			#		data frame with columns id and randomly assigned group
			#	Description:
			#		ids are randomly assigned to groups. Groups are balanced, i.e., 
			#		each group has the same # of ids. If the # of ids is not divisible
			#		by the # of groups, the remainder is assigned randomly.

			balanced_randomization <- function( ids, groups = c() ) {
				#	make sure ids are not duplicated
				if( any( duplicated(ids) ) ){ stop("Duplicate IDs not allowed!") }

				#	ids should be characters because factors can match incorrectly
				ids	<- as.character( ids )

				#	generate a data frame that will contain the group assignments
				sampling_df <- data.frame( 	id=ids, 
											group=rep( NA, length(ids) ) ,
											stringsAsFactors=FALSE
											)
				#	keep track of ids that have already been randomized		 	
				already_sampled <- c()
				#	how many ids should be assigned to each group?
				size_per_group <- floor( length(ids) / length(groups) )

				#	randomize ids to each group
				for(i in 1:length(groups) ) {
					#	sample the ids for this group, excluding those already assigned
					ids_in_group <- sample( 
									ids[ ! ids %in% already_sampled ] , 
									size_per_group , 
									replace=FALSE 
									)
					#	update the list of already assigned ids
					already_sampled <- c( already_sampled, ids_in_group )
					#	set the group assignment
					sampling_df$group[ sampling_df$id %in% ids_in_group  ] <- groups[i]
				}	

				#	if the number of ids is not evenly divisible by the number of groups,
				#	randomly fill out the remaining, unfilled entries
				remainder <- length(ids) %% length(groups)
				if(remainder > 0){
					sampling_df$group[ is.na(sampling_df$group) ] <- 
						sample( groups , remainder, replace=FALSE)
				}
				return( sampling_df )
			}

			#	stratified_randomization
			#	(requires the package "plyr" and the "balanced_randomization" function)
			#	Inputs:
			#		DF		data frame containing ids to randomly assign and the
			#				corresponding stratification characteristics for each id
			#		id		a string specifying which column of DF is the id to randomize
			#		strata	a vector of strings specifying how to stratify the data
			#		groups 	groups to randomize ids to
			#	Returns:
			#		data frame with a "group" column appended with the randomly assigned group
			#	Description:
			#		For each stratum (unique combination of strata), ids are randomly assigned
			#		to one of the passed in groups in a balanced way (using balanced_randomization).
			stratified_randomization <- function( DF, id, strata, groups ){
				#	we need plyr for its handy dlply function. Thanks, Hadley!
				if( ! require( plyr ) ){ stop( "You need to install plyr!" ) }

				#	set the strata (features you want to stratify on)
				strata <- dlply( DF , strata )

				#	perform balanced randomization within each stratum,
				#	appending each stratum's assignments to strat_df
				#	(starting from an empty data frame avoids the bug where a
				#	variable of the same name already exists in the environment)
				strat_df <- data.frame()
				for( stratum in strata ){
					strat_df <- rbind( strat_df ,
									balanced_randomization( stratum[ , id ], groups ) )
				}

				return( merge( DF, strat_df, by=id ) )
			}

			#	BALANCED RANDOMIZATION EXAMPLES
			#	Sample from 3 groups equally
			ids_to_sample 	<- c(1:100)
			groups 			<- c("Group 1","Group 2", "Group 3")
			sampled_df 		<- balanced_randomization( ids = ids_to_sample, groups = groups)
			table(sampled_df$group)

			#	Sample from 3 groups where you weight Group 1 2x
			ids_to_sample 	<- c(1:100)
			groups 			<- c( rep( "Group 1", 2 ),"Group 2", "Group 3")
			sampled_df 		<- balanced_randomization( ids = ids_to_sample, groups = groups)
			table(sampled_df$group)

			#	STRATIFIED RANDOMIZATION EXAMPLE
			#	generate a fake dataset "d"
			ids_to_sample 	<- c(1:100)
			gender			<- sample( c("M","F") , 100, replace=TRUE )
			race			<- sample( c("W","B","L") , 100, replace=TRUE )
			d				<- data.frame( id=ids_to_sample , gender=gender, race=race )

			#	define the groups to randomize to
			groups 			<- c( "Group 1","Group 2", "Group 3")

			#	specify the data frame, the id, the stratification features, and the groups to randomize to
			stratified		<- stratified_randomization( d , id="id", strata=c("race","gender"), groups=groups )

			#	confirm the conditions are balanced
			aggregate( id ~ group + gender + race, stratified, length )

		

 

DataPlay.js

I made DataPlay.js to manipulate data with JavaScript. It was motivated by my desire to distribute the processing load associated with generating reports for educators and researchers who work with PERTS. We considered crunching the data in PHP or MySQL, or by calling R from Apache, but there was an obvious appeal to doing the processing client-side. It turned out to be pretty easy to make a little JS library that makes data lookup and aggregation a snap. Take a look at the GitHub repository to download it or to see examples.

 

Comments

Found something you like or something you don't like? I'd love to get your feedback.