I'm a data scientist focusing on the role of big data and computation in scientific inference, working in three principal areas:

Inference: Big data methods

Instead of collecting data to test a particular hypothesis, researchers are now generating hypotheses by direct inspection of the data, then using the same data to test those hypotheses. What counts as a significant finding in this case? Can we estimate the likelihood that such a finding will be replicated in a new sample? How effectively does statistical methodology translate to big data settings? [Related Publications]
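
To make the concern concrete, here is a small, purely illustrative simulation (not drawn from any particular study): fifty noise variables are inspected, the most promising one is selected, and that hypothesis is then tested either on the same data or on a held-out split. The inflated false-positive rate in the first case is exactly what the questions above point to; all names and sample sizes below are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_vars, n_obs, n_sims = 50, 200, 500
p_same_data, p_held_out = [], []

for _ in range(n_sims):
    # Pure noise: no variable truly has a nonzero mean.
    X = rng.normal(size=(n_obs, n_vars))
    train, test = X[:n_obs // 2], X[n_obs // 2:]

    # Workflow 1: generate the hypothesis ("variable j looks interesting")
    # by inspecting the full data, then test it on that same data.
    j_same = np.argmax(np.abs(X.mean(axis=0)))
    p_same_data.append(stats.ttest_1samp(X[:, j_same], 0).pvalue)

    # Workflow 2: generate the hypothesis on one half, test on the held-out half.
    j_split = np.argmax(np.abs(train.mean(axis=0)))
    p_held_out.append(stats.ttest_1samp(test[:, j_split], 0).pvalue)

print("Fraction 'significant' at 0.05, same data:", np.mean(np.array(p_same_data) < 0.05))
print("Fraction 'significant' at 0.05, held out: ", np.mean(np.array(p_held_out) < 0.05))
```

Even though every variable is noise, the same-data workflow declares a "significant" finding far more often than the nominal 5%, while the held-out test stays near 5%, which is one way to frame what replication in a new sample should look like.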

Open Data: Data science dissemination

When computation is used in research, it becomes part of the methods used to derive a result. What information is needed to verify and replicate data science and computational findings? How should the code, data, and workflow behind a result be made available to the community? How can datasets and software be repurposed to catalyze new discoveries? [Related Publications]
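
As one way to make "what information is needed" concrete, the sketch below (hypothetical, not tied to any particular tool or standard) records provenance that might accompany a published computational finding: a fingerprint of the input data, the code version, the software environment, the analysis parameters, and the random seed. The file name, commit string, and parameters are placeholders.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def data_fingerprint(path):
    """SHA-256 of the input data file, so readers can confirm they have the same data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance_record(data_path, code_commit, parameters, seed):
    """Bundle the details needed to re-run and check the analysis."""
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "data_sha256": data_fingerprint(data_path),
        "code_commit": code_commit,      # e.g. a version-control hash of the analysis scripts
        "python": sys.version,
        "platform": platform.platform(),
        "parameters": parameters,
        "random_seed": seed,
    }

# Hypothetical usage: write the record alongside the result it documents.
# record = provenance_record("survey.csv", "abc1234", {"alpha": 0.05}, seed=42)
# with open("result_provenance.json", "w") as f:
#     json.dump(record, f, indent=2)
```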

Infrastructure: What tools and computational environments are needed to enable data science?

We have an opportunity to think about data science as a life cycle -- from experimental design and databases through algorithms and methodology to the identification and dissemination of scientific findings -- and to design tools and environments that enable reliable scientific investigation and inference at scale. A key goal is to enable the science in data science to be carried out in silico. [Related Publications]
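
As a purely illustrative sketch of the life-cycle framing (no particular framework or tool is implied), the example below models an analysis as an ordered set of named stages whose inputs and outputs are logged, so the path from raw data to reported finding can be inspected end to end. The stage names and placeholder computations are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Stage:
    name: str                      # e.g. "ingest", "analyze", "report"
    run: Callable[[Any], Any]      # the computation performed at this stage

@dataclass
class LifeCycle:
    stages: list
    log: list = field(default_factory=list)

    def execute(self, data):
        """Run stages in order, recording what each one received and produced."""
        for stage in self.stages:
            result = stage.run(data)
            self.log.append({"stage": stage.name,
                             "input_summary": repr(data)[:80],
                             "output_summary": repr(result)[:80]})
            data = result
        return data

# Hypothetical usage with placeholder stages standing in for real components.
pipeline = LifeCycle(stages=[
    Stage("ingest", lambda _: [1.0, 2.0, 3.0, 4.0]),        # stand-in for loading a dataset
    Stage("analyze", lambda xs: sum(xs) / len(xs)),          # stand-in for a statistical method
    Stage("report", lambda m: f"estimated mean = {m:.2f}"),  # stand-in for dissemination
])
print(pipeline.execute(None))
print(pipeline.log)
```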