I work for Stanford, just down the road at Hopkins Marine Station, on the Tagging of Pacific Predators project. Basically we stick electronic sensor tags on everything from squid to blue whales to take a peek at what they do, and at environmental parameters while they're doing it. In the end, a fair bit of what we find ends up being used to establish policy for various environmental and conservation efforts. We're arguably a "Big Data"-driven project, integrating large geographic information and satellite remote sensing data sets with our own tag data; but I think we all have at least some personal notion of what's meant by "Big Data." So for just a few minutes I'm going to swap my computerist hat for my environmentalist hat and express some concerns centered on what seems like a fundamental question: are analysis methods that work great in, for example, linguistics really applicable when it comes to, say, setting environmental policy?

One thing you'll need to keep in mind here is the contentious nature of conservation work. We have actually received death threats at our leatherback sea turtle research station in Costa Rica. Not, as you might expect, from poachers, who, with a little prodding, are starting to embrace eco-tourism as a more sustainable means of income, but from big-moneyed beachfront development interests with strong ties to major US real estate companies. We even have our own personal little Rovian on-line swift-boat campaign to deal with. What that all boils down to is that we can't afford even the slightest question about data quality, filtering, or analysis if we're to maintain our credibility for making recommendations to policy makers.

That said, some of you may be familiar with our on-line awareness-raising event, "The Great Turtle Race." The race is really an afterthought that came about while we were examining our incoming satellite data streams.
Those arrive from sensors harnessed on sea turtles to collect data for use in confirming a hypothesized turtle migration corridor from Costa Rica past the Galapagos. Long story short, last June we published a report in the on-line journal Public Library of Science. And just three weeks ago, largely due to that report, the IUCN (International Union for Conservation of Nature) put forth a resolution to protect the critical areas as we defined them. So we get to put up cool articles about it on our website, and it's all good. Well, all good in blog world, anyway. In the real world, resolutions are a little more stark, and the problems of handling data become very much like the problems of maintaining the chain of evidence in a criminal proceeding.

For example, here is an issue that could have been used to call our quality control into question and kill our credibility. This is an overlay of temperature-versus-depth records obtained from diving sea turtles. It doesn't look too impressive (it's not), but it exposes an error in an on-board depth data compression algorithm that the sensor manufacturer missed until I spotted it and we pointed it out to them. The error was actually exposed because of accepted models of ocean thermal structure: individual data records look reasonable in and of themselves, but in aggregate they are just "not quite right" according to the models. Which leads to two "Big Data" questions: when can you drop the notion of needing models, and hence the scientific method, from your analysis? And just how far will you go in trusting others to filter your data for you?

That's amusing, yes, but it's also an example of what happens when you turn the scientific method on its head and start looking for reasons for your data instead of reasoning from your data, in essence ignoring an issue extremely important to policy makers: validation, which is another potential "Big Data" issue. Here's something a bit less Orwellian.
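That kind of model-based sanity check can be sketched in a few lines. This is purely illustrative, not our actual QC code: the thermal model, function names, constants, and tolerance below are all stand-in assumptions, with temperature decaying from a warm mixed layer toward cold deep water.

```python
import math

def expected_temp(depth_m):
    """Crude illustrative model of ocean thermal structure: temperature
    decays exponentially from a warm surface layer to cold deep water."""
    surface_c, deep_c, scale_m = 27.0, 4.0, 200.0
    return deep_c + (surface_c - deep_c) * math.exp(-depth_m / scale_m)

def flag_anomalies(records, tolerance_c=3.0):
    """Return (depth_m, temp_c) records deviating from the model by more
    than tolerance_c; in practice such outliers would prompt a closer
    look at the sensor or its on-board compression."""
    return [(d, t) for d, t in records
            if abs(t - expected_temp(d)) > tolerance_c]

# A reading of 30 C at 300 m is a perfectly plausible number on its own,
# but wildly inconsistent with the assumed thermal structure:
dives = [(10.0, 26.5), (100.0, 18.0), (300.0, 30.0)]
print(flag_anomalies(dives))  # only the 300 m record is flagged
```

The point isn't the particular model; it's that without some model in the loop, none of these records would ever look wrong.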
This page is from a project by a gentleman who studies deer migration, and even though his data set is absolutely tiny, he can still be affected by issues at the cloud computing end of "Big Data." His processing system is literally an iPhone, which he uses to collect his satellite tag data; he drops that into Google Spreadsheets for processing, and then into Google Earth for display and assessment. Not much to go wrong. But part of what is being looked at here is how the physical environment affects behavior. That means he has to trust that Google Earth images are always representative of the current situation in his study region. Not a bad bet in this case, but for more dynamic studies, maybe not so good.

And just how trusting can you be? As more and more reliance is put on the correctness of more and more remote data sets, are we setting ourselves up for perhaps even maliciously introduced data problems? Maybe something akin to "Big Data" man-in-the-middle attacks? Certainly our detractors wouldn't hesitate if they had the skills. And, of course, last but not least, what does this gentleman, or any of us, do when we finally get that call, as so eloquently expressed by Mick and Keith some 40-odd years ago?
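One standard, minimal defense against that kind of tampering is to verify a cryptographic digest of a downloaded data set against a digest published over a separate, trusted channel. A sketch using Python's standard library, where the data and digest are of course made up for illustration:

```python
import hashlib
import hmac

def sha256_hex(data: bytes) -> str:
    """Digest of a downloaded data set, for comparison against a digest
    the provider publishes separately from the data itself."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected_hex: str) -> bool:
    """True only if the data matches the published digest;
    hmac.compare_digest avoids timing side channels."""
    return hmac.compare_digest(sha256_hex(data), expected_hex)

original = b"tag_id,depth_m,temp_c\n42,300,9.1\n"
published = sha256_hex(original)           # imagine this came from the provider
print(verify(original, published))         # True: data intact
print(verify(original + b"x", published))  # False: altered in transit
```

This catches a man in the middle only if the digest itself travels a path the attacker can't touch, which is exactly the kind of assumption that gets harder to make as the data supply chain grows.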