Hi, I'm Victoria. I'm a data scientist working on open data and its implications.

My group focuses on understanding the effects of big data and computation on scientific inference in three principal areas:

Inference: How effectively does statistical methodology translate to big data settings?

Instead of collecting data to test a particular hypothesis, researchers are now generating hypotheses by direct inspection of the data, then using the same data to test those hypotheses. What counts as a significant finding in this case? Can we estimate how likely that finding is to be replicated in a new sample?
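To make the problem concrete, here is a minimal simulation sketch (an illustration of the issue, not a description of my group's methodology). It screens many pure-noise features, "discovers" the one that looks most associated with the outcome, and compares the naive p-value computed on the same data with the p-value from a held-out split. The variable names, sample sizes, and the use of sample splitting as the comparison are my own assumptions for the example.

```python
# Illustrative simulation: selecting a hypothesis by inspecting the data,
# then testing it on that same data, vs. testing on a held-out split.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 500                       # observations, candidate features (all noise)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)            # outcome independent of every feature

# "Generate a hypothesis by inspecting the data": pick the feature most
# correlated with y, then test that same feature on the same data.
corrs = np.array([stats.pearsonr(X[:, j], y)[0] for j in range(p)])
best = int(np.argmax(np.abs(corrs)))
naive_p = stats.pearsonr(X[:, best], y)[1]

# One hedge: select on one half of the data, test on the other half.
half = n // 2
corrs_train = np.array([stats.pearsonr(X[:half, j], y[:half])[0] for j in range(p)])
best_split = int(np.argmax(np.abs(corrs_train)))
split_p = stats.pearsonr(X[half:, best_split], y[half:])[1]

print(f"naive p-value (select and test on the same data): {naive_p:.4f}")
print(f"held-out p-value (sample splitting):              {split_p:.4f}")
```

Even with no signal anywhere, the naive p-value will usually look "significant" because the test inherits the selection; the held-out p-value stays roughly uniform, which is one way to get an honest read on whether a finding would replicate in a new sample.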

Open Data: What information is needed to verify and replicate data science findings?

When computation is used in research, it becomes part of the methods used to derive a result. How should these steps be made openly available to the community for inspection, verification, replication, and re-use? How can datasets and software be repurposed to catalyze new discoveries?
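One concrete, if minimal, way to expose those steps is to publish a machine-readable manifest alongside the result, recording the code, data, environment, and seeds needed to re-run and check it. The sketch below illustrates the idea rather than any specific tool; the function name, file names, and package list are assumptions for the example.

```python
# Minimal provenance-manifest sketch (names are illustrative).
# Records enough of the computational context to re-run and verify a result.
import hashlib
import json
import platform
import sys
from importlib import metadata
from pathlib import Path

def sha256(path: Path) -> str:
    """Content hash of a file, so readers can confirm they have the same data or code."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_manifest(data_file: Path, script_file: Path, seed: int,
                   packages=("numpy", "scipy"), out: Path = Path("manifest.json")) -> None:
    manifest = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {pkg: metadata.version(pkg) for pkg in packages},
        "random_seed": seed,
        "inputs": {data_file.name: sha256(data_file)},
        "code": {script_file.name: sha256(script_file)},
    }
    out.write_text(json.dumps(manifest, indent=2))

# Usage (assuming these files sit alongside the analysis):
# write_manifest(Path("observations.csv"), Path("analysis.py"), seed=42)
```

A manifest like this does not solve verification or re-use on its own, but it makes the computational steps inspectable: anyone can check that they are running the same code on the same inputs in a comparable environment.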

Infrastructure: What tools and computational environments are needed to enable data science discoveries?

We have an opportunity to think about data science as a life cycle -- from experimental design and databases through algorithms and methodology to the scientific findings -- and design tools and environments that enable reliable scientific investigation and inference at scale. A key question is how to enable the science aspect of data science in silico.