Research Discussions:

The following log contains entries starting several months prior to the first day of class, involving colleagues at Brown, Google and Stanford, invited speakers, collaborators, and technical consultants. Each entry contains a mix of technical notes, references and short tutorials on background topics that students may find useful during the course. Entries after the start of class include notes on class discussions, technical supplements and additional references. The entries are listed in reverse chronological order with a bibliography and footnotes at the end.

April 25, 2015

To: Mindreaderz
Subject: Project Suggestion from Adam Marblestone

Contributor: Adam Marblestone
Title: Simulate the Rosetta Brain

Adam has an interesting potential project suggestion: designing and evaluating ROSETTA [111] configurations. He’ll be talking about ROSETTA in class on Monday, May 4, and you can find the paper on his calendar entry. Here’s a quick sketch of the project he has in mind:

  1. Take an EM connectomic volume, already analyzed.

  2. Annotate it (essentially ‘‘draw’’ on the images) with a bunch of FISSEQ [96] barcodes, at various densities and sub-cellular localizations.

  3. Simulate what that would look like in ExM.

In this way, come up with a set of design constraints in terms of: how many barcodes we need, what density and distribution in the cell we need, what FISSEQ rolony [1] size we need, what spatial resolution we need, etc. You could start with something similar to Figure 3 of Conneconomics (PDF) (based on Yuriy Mishchenko’s work), but then go much further in constraining the design space.

[1] I think this is a typo and Adam meant to type ‘‘polony’’ as in polony sequencing.

Adam would be happy to advise on such a project if you are interested in taking a shot at it.

April 24, 2015

Ed Boyden from MIT will be speaking on Monday, 27 April. Ed will focus on his expansion microscopy (ExM) technology and the prospects for light microscopy relying on modern super-resolution image-processing techniques to rival the resolving power of an electron microscope. The primary reading [26] describes ExM and the supplementary reading [33] is concerned with quantifying the limitations of different neural recording technologies in terms of their ability to separate the activity of one neuron from that of its neighbors.

Sebastian Seung from Princeton will join us on Wednesday, 29 April. Sebastian will be talking about neural modeling and structural connectomics. I’ve selected four papers: two that emphasize his theoretical work on (i) the neural basis for reinforcement learning [145] and (ii) how the cortex selects active sets of neurons [62]. The other two papers focus on connectomics: one emphasizing the advantages of densely sampling neurons [146] and the second describing how machine learning can accelerate segmentation [79].

Adam Marblestone, a research scientist in Ed’s lab, will talk on the following Monday, May 4. Before joining Ed’s lab, Adam was a graduate student at Harvard working with George Church where he helped to develop a number of technologies for fluorescent staining using combinatorial codes that leverage high-throughput RNA sequencing to identify a large number of different molecules in a single assay [96, 111, 110, 112]. He is currently working on the team developing ExM.

Eric Jonas will join us on Wednesday, May 6. Eric completed his PhD at MIT and is currently a postdoc at the University of California, Berkeley. He co-founded Prior Knowledge and served as its CEO through its acquisition by Salesforce at the end of 2012. His collaboration with Konrad Kording has yielded some interesting insights into what can be inferred from even noisy connectomes [81].

The readings for each presentation are on the course calendar pages. Jonas will join us in person; the others will join virtually. Aditya and I have finally worked out a satisfactory solution for audio. By having the presenter call us on a landline, we achieve good voice quality, and we’ve purchased a high-quality powered speaker to amplify the audio output of the presentation laptop and adjust the volume so we don’t have to strain to hear the speaker. If we lose the IP connection, at least we’ll have a copy of the speaker’s slides and will be able to hear the presentation clearly.

April 23, 2015

David Cox’s presentation stimulated a lot of discussion yesterday. Unfortunately most of it was out in the quad after David’s talk. If you have lingering questions for David, I expect he’d be glad to hear from you. His running baseball analogy was inspired. After a review of earlier recording technologies, David described their calcium-indicator and two-photon technologies for recording from rodents and mentioned their earlier work using micro-electrode arrays [30]. Their automated experimental setup allows them to run dozens of rodent experiments every day [190].

In terms of new or improved recording technologies, David mentioned work by Alipasha Vaziri and his colleagues using their wide-field, temporal-focusing imaging technology which, combined with a nuclear-localized calcium indicator, NLS-GCaMP5K, supports unambiguous discrimination of individual neurons in densely packed tissue [131, 142]. He showed how GCaMP technology has been rapidly maturing and enthused over recent improvements in genetically encoded voltage indicators like ASAP1 [157].

Relevant to our interest in inferring the function of neural circuits in the ventral visual stream, he told us how he is building on work out of Jim DiCarlo’s lab comparing human and machine vision [183]. In particular, David described his recent work incorporating constraints from psychophysics into the loss function and regularizer used in training image classifiers to model human visual ability [139]. The results are promising and the research methodology is definitely an intriguing new approach to modeling neural function.

I mentioned in my introduction that David had done some early work in comparing human and machine vision. He applied methods derived from computational genomics for high-throughput screening and employed a clever strategy for training and evaluation to search for biologically-motivated models that rival state-of-the-art image classifiers [130]. He was also one of the first to use GPUs at scale to accelerate the evaluation of thousands of different network topologies in parallel in searching for the best performing models [130, 87].

April 21, 2015

Giret et al [52] describes some of the birdsong work that Joergen presented in yesterday’s meeting with Max Planck. Surya Ganguli, one of the authors, told me about this work in yesterday’s class at Stanford. (PDF) Jerome Lecoq, a postdoc in Mark Schnitzer’s lab, also attended class and we talked about using one of their miniature fluorescence microscopes [51] so that the experimental birds could move about naturally.

Spectral methods and graph algorithms figure prominently in Surya’s analyses. Cliques are fully connected subgraphs that often correspond to computational motifs like winner-take-all networks. It turns out that it is hard to find small cliques in sparsely connected networks, where spectral methods don’t work well. Deshpande and Montanari [36] describe the problem and provide an efficient algorithm. (PDF)
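To make the spectral baseline concrete, here is a minimal numpy sketch (my own illustration, not code from [36]): plant a clique in a dense Erdos-Renyi graph and recover it from the leading eigenvector of the centered adjacency matrix. This works when the clique is large relative to sqrt(n); Deshpande and Montanari address the harder sparse, small-clique regime where this simple method breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 80
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric Erdos-Renyi G(n, 1/2)
clique = rng.choice(n, size=k, replace=False)
A[np.ix_(clique, clique)] = 1.0              # plant a k-clique
np.fill_diagonal(A, 0.0)

# The leading eigenvector of the centered adjacency matrix concentrates
# on the planted clique when k is large relative to sqrt(n).
vals, vecs = np.linalg.eigh(A - 0.5)
v = vecs[:, np.argmax(vals)]
guess = np.argsort(-np.abs(v))[:k]
overlap = len(set(guess) & set(clique)) / k
print(f"recovered {overlap:.0%} of the planted clique")
```

With n = 400 and k = 80 the planted clique sits well above the spectral detection threshold, so the top-k entries of the eigenvector recover most of the clique; shrink k toward sqrt(n) and the recovery degrades, which is exactly the regime [36] targets.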

Surya also told me about an interesting technique borrowed [155] from non-equilibrium statistical physics for learning deep, generative models consisting of many (thousands) layers or time steps in the case of recurrent networks. The idea is to ‘‘systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process [and] then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data’’. (PDF)
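As a toy illustration of the forward half of that recipe (a sketch of my own, not taken from [155]): start with a bimodal distribution and repeatedly shrink it toward zero while injecting small amounts of Gaussian noise. The marginal converges to a standard normal, erasing the original structure; the generative model is then trained to run this diffusion in reverse.

```python
import numpy as np

rng = np.random.default_rng(0)
# Structured data: a bimodal mixture of two tight Gaussians.
x = np.concatenate([rng.normal(-2, 0.3, 5000), rng.normal(2, 0.3, 5000)])

# Forward diffusion: shrink toward zero and add a little Gaussian noise.
# The variance recursion v <- (1 - beta) v + beta has fixed point 1, so
# after many steps the marginal approaches a standard normal.
beta = 0.02
for _ in range(400):
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=x.size)

print(f"mean={x.mean():+.2f}  std={x.std():.2f}")
```

After 400 steps the two modes at ±2 have been scaled down by roughly e^-4, so the samples are indistinguishable from a unit Gaussian; the learning problem in [155] is to reverse each of these small, nearly-Gaussian steps.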

April 20, 2015

I’m searching for papers that make a credible attempt to model simple and complex cells in terms of cortical columns. My interest was sparked by the Larkum paper [94] that Costas Anastassiou told us about in his class presentation and by subsequently revisiting the Felleman and Van Essen classic paper [42] that appeared in the first issue of Cerebral Cortex including one of the most often reproduced diagrams (Figure 4) in computational neuroscience.

Dileep George developed an interesting biologically-motivated laminar instantiation of the belief-update equations he used in implementing a version of Jeff Hawkins’ Hierarchical Temporal Memory [50]. Dileep’s model was novel in tackling a rather complex computation, but he was primarily interested in showing how neural circuits could, in principle, be arranged in columns to perform such computations, and not necessarily how or whether there actually exist neural circuits in the mammalian neocortex that perform said computations.

Whatever else you might argue, at least Dileep’s model is a computational model that any computer scientist would recognize as such, albeit one that starts with a specific algorithm in mind, rather than starting from what is known about the anatomy, physiology and behavior and attempting to derive an algorithm that obeys the biological constraints. Whether or not it is a biologically plausible model of neural computation is another question altogether. There are many ways of accomplishing the same computational task and, given that inference is intractable, if the brain does solve the belief-update equations, then it does so using an efficient approximation that would be quite interesting to discover.

In any case, I was looking for something simpler computationally speaking and more challenging biologically: a detailed hypothesis mapping the computations that Hubel and Wiesel [74, 73] assigned to what they called ‘‘simple’’ and ‘‘complex’’ cells in primary visual cortex to anatomical structures, along with some form of corroborating evidence from the literature. I did find a few papers of peripheral relevance [186, 165, 4], but less than what I was hoping for.

April 19, 2015

The primary reading [47] for Surya Ganguli’s class discussion on Monday cites a number of interesting related papers, including Ganguli and Sompolinsky [46] which I’ve provided as supplementary reading. This paper invokes the theory of compressed sensing to better understand what can be learned from analyzing neural recordings. The basic strategy of finding a low-dimensional basis for encoding a high-dimensional signal has instantiations in many subareas of statistics and machine learning. The following note will provide some context.

We have already encountered one such instantiation when Viren Jain and then Peter Li brought up the idea of employing unsupervised learning to reduce the reliance on labeled data which is more often than not in short supply. Peter mentioned using an autoencoder to learn a model that feeds a high-dimensional feature vector through a lower-dimensional hidden layer to an output layer of the same dimensionality as the input. During training you set the input and output to be the same — your objective is to recover the original signal and in doing so learn a compressed representation.
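Here is a minimal numpy sketch of that training setup (illustrative only; all sizes and names are my own choices, not Peter’s): a linear autoencoder with a 3-unit bottleneck trained by gradient descent to reproduce 10-dimensional inputs that secretly lie in a 3-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10-dimensional inputs that actually lie in a 3-dimensional subspace.
Z = rng.normal(size=(500, 3))
X = Z @ rng.normal(size=(3, 10))

W_enc = 0.1 * rng.normal(size=(10, 3))   # encoder: 10 -> 3 bottleneck
W_dec = 0.1 * rng.normal(size=(3, 10))   # decoder: 3 -> 10

def recon_mse():
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

before = recon_mse()
lr = 0.01
for _ in range(500):
    H = X @ W_enc                        # compressed code (the bottleneck)
    R = H @ W_dec - X                    # reconstruction error
    W_dec -= lr * H.T @ R / len(X)       # gradient descent on squared error
    W_enc -= lr * X.T @ (R @ W_dec.T) / len(X)
print(f"reconstruction MSE: {before:.3f} -> {recon_mse():.3f}")
```

Because input and target are the same, minimizing reconstruction error forces the 3-unit hidden layer to learn a compressed representation of the data; with nonlinear units and deeper stacks the same objective yields the autoencoders Peter mentioned.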

The low-dimensional hidden layer in the autoencoder method is often referred to as a bottleneck, and the associated approach to machine learning as the information bottleneck method [166]. As an aside, one of the three authors of [166] is Bill Bialek, a theoretical biophysicist at Princeton who has trained a number of outstanding computational neuroscientists and produced a body of work well worth your time scanning when you have a chance. Here are a few of my favorites: [114, 113].

Another idea related to compressed sensing is the notion of sparse coding, which first appeared in reference to neural coding in the work of Horace Barlow, who characterized the statistics of visual stimuli as a means to better understand the nature of perception and the neural codes that support memory [87]. Sparse coding became popular in computational neuroscience largely due to the work of Bruno Olshausen and David Field [121, 122].

While there are many learning algorithms, there are only two fundamental methods of learning that we have discovered so far: maximizing margins and minimizing models. Maximizing margins is the method used in support vector machines (SVM) and is theoretically motivated by applying the Vapnik-Chervonenkis (VC) dimension [173] in the context of Leslie Valiant’s PAC model of learning [172, 17].

The method of minimizing models is related to various model-selection strategies that typically make use of some form of Occam’s Razor, e.g., by applying the minimum description length (MDL) principle, or, in statistical machine learning, by using either the Akaike (AIC) or Bayesian (BIC) information criteria. We are assuming in this discussion that learning implies the ability to generalize from seen to unseen examples. In point of fact, this is impossible without additional information about the problem [180179], and this often takes the form of a restriction or prior on the family of models considered.
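To make the model-minimization idea concrete, here is a small sketch on hypothetical data: fit polynomials of increasing degree to noisy quadratic observations and score each fit with AIC and BIC. Both criteria trade the likelihood of the data against the number of parameters; BIC penalizes complexity more heavily and here typically selects the true degree.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 80)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.2, x.size)  # true degree: 2

n = x.size
scores = {}
for degree in range(1, 7):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 2                                    # coefficients plus noise variance
    ll = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)   # Gaussian max log-likelihood
    scores[degree] = (2 * k - 2 * ll, k * np.log(n) - 2 * ll)  # (AIC, BIC)
    print(f"degree {degree}: AIC={scores[degree][0]:7.1f}  BIC={scores[degree][1]:7.1f}")

best_bic = min(scores, key=lambda d: scores[d][1])
print("BIC selects degree", best_bic)
```

The degree-1 fit pays a large likelihood penalty for missing the quadratic term, while degrees above 2 buy almost no likelihood for each extra parameter, so the complexity penalty dominates; the same logic, with description length in place of parameter count, underlies MDL.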

Terms like stochastic gradient descent, expectation maximization, genetic algorithms, nearest-neighbor methods, maximum likelihood and least squares refer to search methods, loss functions, families of models or all three. In learning artificial neural networks, the search method may be gradient descent, the loss function squared error and the family of models multi-layer perceptrons, but the method of learning is model minimization and takes the form of restrictions on the family of models, e.g., models with only one hidden layer.
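The decomposition in the last sentence can be sketched as follows (a toy example of my own, not a recipe from class): the family of models is one-hidden-layer perceptrons with eight tanh units, the loss function is squared error, and the search method is plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (256, 2))
y = X[:, 0] * X[:, 1]                     # a target no linear model can fit

# Family of models: one-hidden-layer perceptrons with 8 tanh units.
W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, 8), 0.0

def predict(X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

mse_before = np.mean((predict(X) - y) ** 2)
lr = 0.1
for _ in range(5000):                     # search method: gradient descent
    H = np.tanh(X @ W1 + b1)
    err = H @ W2 + b2 - y                 # loss function: squared error
    dH = np.outer(err, W2) * (1 - H**2)   # backpropagate through tanh
    W2 -= lr * H.T @ err / len(X)
    b2 -= lr * err.mean()
    W1 -= lr * X.T @ dH / len(X)
    b1 -= lr * dH.mean(axis=0)
mse_after = np.mean((predict(X) - y) ** 2)
print(f"MSE: {mse_before:.3f} -> {mse_after:.3f}")
```

Any of the three ingredients can be swapped independently, e.g., a different loss or a stochastic rather than full-batch search; the restriction to a single hidden layer of fixed width is the model-minimization part.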

April 17, 2015

The readings for all the classes through May 11 are available from their respective entries in the calendar. There you’ll also find the presentations through last Wednesday, April 15. Some of the entries have three or four papers. Unless otherwise indicated, you can assume the first listed paper is the primary paper to read for class. Surya Ganguli from Stanford is on for this coming Monday and David Cox from Harvard is participating on Wednesday. Get the most out of these opportunities; take some time to read the papers this weekend and come to class prepared to ask questions.

Costas’ calendar entry now includes a sample dataset from one of his cortical models in the format of a tar ball consisting of simulated intra- and extra-cellular traces, spike-timing rasters, and movies illustrating the structure and dynamics of the simulated tissue sample. As Costas pointed out in class, he is interested in partnering with students to produce alternative datasets that might be better suited to functional analysis or exhibit different dynamics. The tar ball also includes Python scripts for reformatting the data and extracting content from the binary files.

April 16, 2015

Peter Li’s slides and references from yesterday’s class are available here. Up-to-date course notes are available on Mindreaderz and here. The HHMI Janelia FIB-SEM drosophila data that Peter mentioned is available here and includes ground truth and the original grey scale EM data. If you’re interested, tell Aditya you’d like to play with it, and he’ll download it to the Stanford servers. We may be able to get the full seven-column dataset for class projects, but experiment with the smaller dataset first since even the single-column dataset is rather large.

A couple of Google jokers made the following addition to one of the bathrooms on the Google Quad Campus: entering the men’s room, you see this fellow lurking at the back near the toilet stalls; curious, you move nearer for a better look; and then, peering closely at the helmet visor, half expecting to find it occupied, sure enough you see this note.

April 15, 2015

Costas Anastassiou’s slides and copies of papers relevant to our discussion in class are linked directly off his calendar page here. I’ve included the explanatory notes that were included in his last posting here:

Here are the two citations I mentioned right in the end: The first paper (Shai et al [147]) presents the putative role of active dendrites and the Ca-hotzone for encoding. It has both experiments and modeling. Especially in the end we suggest that the absence of the active, Ca-hotzone results in fundamentally different types of computation occurring at the single-neuron level. The second paper (Larkum [94]) is a beautiful theory of how active properties of dendrites can have functional and behavioral roles. Hopefully, the network simulations we discussed today will soon (or not so) be able to test some of these theories. Finally, here is a rather extensive review regarding the function and role of active dendrites: Major et al [107].

April 13, 2015

I want you to appreciate the opportunities that models of the sort that Costas and his team have developed open up to you for interesting projects. We have good EM datasets with sufficient ground truth so that you can learn connectomes. We’ll be hearing from Eric Jonas about approaches applying spectral analysis to affinity matrices obtained from connectomes in order to infer interesting properties of the underlying circuits. We also have sources for calcium imaging (CI) data from mouse, fly and fish.

We don’t have ground truth for the CI data — knowledge of the circuits and the action potentials that produced the data — nor do we have a great deal of correlated EM and CI data. That will change soon as AIBS and Janelia continue to make progress. In some sense, however, we will never have enough data — even though we will be drowning in it as we test the limits of modern data storage systems, and we will never have the ‘‘right’’ data — even though we can always gather more.

‘‘Never’’ is a long time, and ‘‘right’’ is a relative term in this context, but the point is that until we have the ability to perturb the system when, where and how we want, we will be at the mercy of those conducting the experiments and gathering the data. Technologies like optogenetics [28, 188], robotic patch clamping [90] and fluorescence endoscopes [82] offer the promise of precise excitation or inhibition, but it will take time to incorporate such technology into efficient workflows that don’t require prohibitive preparation.

In the meantime, simulations like those Costas, Sean Hill and Henry Markram are developing hold out the promise of allowing us to test the hypothesis — some would substitute ‘‘monstrous conceit’’ for ‘‘hypothesis’’ given the apparent complexity of the task — that, once we have lots of correlated EM and CI data and the machinery to perturb the neural circuits of awake, behaving animals at will, we will be able to unravel the mysteries of their function. While the models Costas talked about are not complete and not likely to be able to reproduce the full spectrum of behavior of the circuits they seek to model, they are arguably good enough to test the above hypothesis, at least to the extent that it bodes well if we can infer non-trivial function from the simulated brain tissue.

Following up on my question at the end of Costas’ talk and his answer, I envision projects that start with a specific model — constituting your target organism, tissue and circuit — from which you can obtain as much simulated CI data as you need for training and testing as well as all the ground truth you could possibly need, including all local field potentials, cell types, circuit reconstructions, neurotransmitters and synapse valence information (excitatory or inhibitory). Armed with this detailed understanding of your target, you can now apply machine learning tools to try to learn the input-output behavior or statistical characterizations of the dynamics and evaluate the resulting models against the ground truth. If you deem it necessary, you can adjust the model parameters, design new experiments, inject perturbations or otherwise control the environment to gain greater insight into your model’s ability to reproduce the behavior of the target organism.

If you could do this, and I think Costas, his colleagues at Allen and my team at Google would agree, the implications would be significant as it would demonstrate that at least in principle — modulo the accuracy and complexity of the models Costas’s team have developed — the research programs at Allen, Harvard, Janelia, Max Planck, Stanford, MIT, etc., have the potential to realize their goal of recording from and inferring the function of neural circuits of some size and complexity. The resulting workflow could serve as a complement to experiments on real tissue and ensure to the extent possible that we don’t start from a state of complete ignorance in taking on the considerable challenges of dealing with living tissue.

Tomorrow, Peter Li will be talking about the deep neural networks that we have been developing at Google for classifying voxels, identifying membranes, segmenting cells and tracing their processes. Peter is an expert in developing and testing neural network architectures, and his algorithms have surpassed the current state of the art by a substantial margin. He’ll provide a short primer on ANNs and then discuss the various network topologies he’s tried along with what worked and what didn’t. His experience working with E.J. Chichilnisky doing recordings on primate retina, mapping receptive fields of retinal ganglion cells and understanding the structure and function of retinal circuits makes him a great resource.

April 11, 2015

Davi sent a note with some additional papers you might want to at least skim before his lecture:

In addition to the 2011 paper [18], the class should read [89] --- a review of imaging methods. And optionally, read [67] and subsequent follow-on work from that lab. Is this a more efficient approach than EM to answer questions about how function relates to network connectivity? How are the two approaches redundant, and how are they complementary?

April 9, 2015

Adam Marblestone sent me a 2014 Neuron paper [109] that he and Ed Boyden wrote in which they consider the prospects for ‘‘assumption-free brain mapping’’ and draw on examples from Mario’s work to illustrate their premise. They point out the advantages of pairing solution-driven engineers with problem-driven scientists and note that incentives are all working in the right direction with engineers seeking applications for their technologies and scientists seeking technologies to solve their problems.

Adam and Ed suggest that we will need to build bias-free brain mapping technologies that work ‘‘backward from the fundamental properties of the brain and are equal to the challenge of mapping their mechanisms’’, instead of starting from a set of known building blocks and working forward, tacitly assuming your collection of blocks is up to the task of solving currently insoluble problems. The latter is like the drunk who, despite knowing he lost his keys elsewhere, looks for them under the lamppost because that is the only place where there is enough light to see.

I’ll try to reduce Adam and Ed’s argument to its essence: There are multiple levels at which we can explain how the brain works. In the following we consider four such levels: molecular, cellular, functional and behavioral. Our present state of knowledge at each level varies considerably.

Physics provides powerful tools to explain how molecules interact. Biology has revealed a great deal about cellular processes, but we believe there is more to know before we can give a complete account at either the functional or behavioral level. We can observe molecules and cells with varying degrees of scale and precision.

By functional, we mean algorithmic and computational. Given a collection of molecules, a set of initial conditions and the laws of quantum electrodynamics, we could in principle predict the behavior of those molecules. Similarly, given an algorithmic account of the brain, we could in principle explain how a brain would respond to any stimulus.

If we could describe the brain at the molecular level we could in principle simulate the brain and predict what we would observe if we were able to measure its electrical and chemical properties. Of course, a physical realization of a system is the most efficient way to predict its behavior at the molecular level. We want a more succinct and transparent description.

At the cellular level we would like an explanation in terms of cellular processes like communication, gene expression and respiration. An explanation at this level would afford a bridge between the molecular and functional levels. We are not confident we know all the cellular processes necessary to provide a complete account at the cellular level.

We cannot directly observe function; we have to infer function from our observations of what’s going on at the physical level — the cellular substrate in which the computations are realized. We can observe behavior but the behavioral level does not suffice as an adequate explanation for mind nor as a basis for dealing with its pathologies.

The functional level provides a bridge between the cellular and the behavioral. However, our cellular level understanding is incomplete. Adam and Ed advocate we learn more about the aggregate behavior of molecules by developing technology to observe them directly. Such technology will help us to identify inaccurate or inadequate descriptions of cellular processes.

As our cellular level understanding improves, we will be better prepared to improve our functional account. By being able to directly observe cellular processes at work in awake behaving organisms, we will be less likely to jump to conclusions based on an existing flawed or incomplete theory at the cellular level. Summarizing the above:

I asked Davi Bock if he would release the calcium imaging data they collected for their 2011 paper [18] and he said that if I wanted it I should talk to Clay, but that they only got a decent signal for 14 or so neurons. Davi said that:

You might do much better talking to Wei-Chung Allen Lee, now an Instructor at Harvard, who stayed after Clay went to the Allen and inherited a fast calcium imaging rig and the first-generation TEM camera array. This let him do some really nice follow-up work to our 2011 paper, and he has a manuscript currently in revision describing it. Instead of using OGB he used GCaMP3 and functionally characterized many more neurons (somata of neurons in layer 2/3 and apical dendrites from layer 5). He finds evidence for like-to-like anatomical connectivity across a number of stimulus parameters, and I could imagine analyzing his calcium imaging + EM data would be much more satisfying (higher N, broader stimulus space) than our 2011 data.

A researcher by the name of Dimitri Perrin from Queensland University of Technology asked me to take a look at a Google Research Award proposal that he was preparing to submit. It was an interesting proposal and when I asked him if there were any papers on the technology he was proposing, he sent me two of his recent papers [161, 164] published in Cell. The titles are tantalizing; those of you working in Karl Deisseroth’s lab might take a quick look and report back to the rest of us.

April 9, 2015

Mario Galarreta gave a great presentation in Monday’s class — his slides are now available on the calendar page — challenging us to be aware of our biases, the dogma that predominates in the field and the conceit that we know more than we do. Then he launched into a deep dive into the details of his research, chronicling his fifteen years of doing electrophysiology at Stanford, what he learned, what he didn’t and what we still don’t know and don’t even know that we don’t know. The quotes from Ramón y Cajal might have seemed relevant only to the early 20th century, but they are true today more than ever despite the putative fact that we know a great deal more now than we did in Cajal’s day. This log entry is a bit long and the next three paragraphs a little philosophical, and so if you got this far and are inclined to skip the rest, please fast forward to the last paragraph and pay special attention to the two footnotes that introduce the work of a couple of our speakers.

I’m continually amazed at what people think we know about the brain. Oddly enough, university faculty and graduate students are particularly prone to this sort of exaggeration, perhaps because the textbooks they write or study from emphasize the known and give short shrift to the unknown. Industrial research scientists are somewhat less prone, probably because in developing products such as pharmaceuticals and medical equipment they are constantly constrained by the limitations of what we know and, unlike the academic researcher, cannot opportunistically switch to work on some other problem when faced by a lack of knowledge. It may be a prerequisite for pursuing basic science that one is fundamentally optimistic, but really good scientists can compartmentalize their optimism and their skepticism so as to remain enthusiastic and committed while exercising their critical faculties.

Relevant to the areas of neuroscience we are exploring in this class, Mario’s talk laid bare the ignorance and bias behind statements by noted neuroscientists that we won’t learn anything of value from connectomics and that, in particular, once we have one connectome there will be diminishing returns from obtaining additional connectomes. The technologies for collecting data that we will leverage in class projects are like any scientific instruments: they exploit what we know from current physics, chemistry and biology to yield tantalizing glimpses into natural phenomena, but they require interpretation and, generally, a great deal of cleverness to apply to specific questions. In hindsight, it may seem that a scientist builds an apparatus specifically to answer a particular question, employs the resulting instrument to collect data from experiments that are obvious, and thereby resolves his question. This is hardly ever the case.

In our case, the data will be noisy, incomplete and generally only a rough proxy for what we would really like to know. As computational neuroscientists, our tool box includes all of mathematics, statistics, algorithms, numerical analysis and machine learning. The first rule of exploratory statistical analysis is to know your data: analyze the sources of noise and error, identify outliers, quantify dependencies among variables, etc. The second rule — and I’m making this one up — is to take advantage of what you’ve learned from this analysis to exploit the properties of the data (e.g., approximate normality) and transform the data into a more tractable form (e.g., by applying principal component analysis). Academics have written books about exploratory data analysis, and I don’t expect you to be proficient in this area; my primary advice is to constantly question your assumptions, don’t be afraid to re-frame the problem, and come to terms with the data you have; don’t get distracted in a quixotic quest to find the perfect dataset or solve an intractable problem.
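As a sketch of the second rule (on hypothetical data of my own devising): simulated recordings driven by a few latent factors can be reduced to a tractable low-dimensional representation with principal component analysis, here implemented via the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 50-channel recordings driven by 3 latent factors plus noise.
latents = rng.normal(size=(1000, 3))
X = latents @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(1000, 50))

Xc = X - X.mean(axis=0)                   # always center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
Z = Xc @ Vt[:3].T                         # 1000 x 3 reduced representation
print("variance explained by first 3 PCs:", round(float(explained[:3].sum()), 3))
```

Checking how the explained-variance spectrum decays is itself a know-your-data step: a sharp elbow, as here, suggests a low-dimensional description is adequate, while a flat spectrum warns that the transformation is throwing structure away.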

I’ll provide a couple of examples to illustrate the above points. The first example involves work by Peiran Gao and Surya Ganguli in which they describe a relatively standard statistical-analysis workflow practiced by computational neuroscientists. They then ask three questions of the sort Mario described in his presentation, i.e., questions that most scientists would not have bothered to ask or wouldn’t see the need to ask given they failed to recognize the irregularities that Peiran and Surya observed. Finally, they develop an intriguing theory that provides answers to the three questions and has far-reaching consequences for how we think about computation in the brain1. As another example here is a recording of a technical talk given at Google by Eric Jonas on his joint work with Konrad Kording looking at extracting cell types and microcircuitry from neural connectomics2. Surya will be joining us on Monday, April 20, and Eric on Wednesday, May 6. These presentations are relatively late in the quarter, and so if you find their work interesting enough that you want to look into possible related projects, I suggest you read their papers and contact them directly with your questions. The calendar now has the readings for all of the talks through the first week in May.

April 6, 2015

For those of you taking CS379C, please accept the invitation I sent out a couple of days ago inviting you to join the Mindreaderz email group. After this message, all subsequent class notes, relevant papers, project suggestions, etc., will go to Mindreaderz. Administrative announcements, individual and course-related correspondence will use the email addresses provided on the Axes course roster. I don’t expect this will happen, but if the email traffic generated by Mindreaderz gets annoying or distracting, you can always change your delivery option to ‘‘digest’’ and get just one summary message per day. The remainder of this message includes some notes that have been accumulating in my research log and that I’ve been intending to add to the CS379C class discussion page.

If you are curious about the scientific vision for the BRAIN Initiative — BTW the BRAIN acronym stands for Brain Research through Advancing Innovative Neurotechnologies — you might want to check out this report to Francis Collins, the Director of the NIH, that was prepared by the Advisory Committee of the BRAIN Initiative, led by Cornelia Bargmann and Bill Newsome: BRAIN 2025: A Scientific Vision.

In the summary report produced by the CS379C class of 2013, we were cautiously optimistic that technologies combining the use of light and sound, e.g., photoacoustic spectroscopy and photoacoustic tomography, would have a significant impact on the field of neuroscience in the next couple of years. These technologies work by selectively illuminating the target tissue with electromagnetic energy — typically in the infrared region of the spectrum in order to penetrate deep into the tissue — and then recording the resulting changes in pressure by sensing radiated acoustic energy — typically in the ultrasound range.

We were particularly excited by whole-brain recording technologies such as the neural dust proposal out of Berkeley by Seo et al [143], which was also featured in Marblestone et al [112]. A new paper [185] just out in Nature Methods describes a particularly promising technology for photoacoustic microscopy (PAM). In this case, the PAM technology is used for high-speed imaging of the oxygen saturation of hemoglobin, and hence is a possible alternative to fMRI but with higher spatial and temporal resolution:

We present fast functional photoacoustic microscopy for three-dimensional high-resolution, high-speed imaging of the mouse brain, complementary to other imaging modalities. We implemented a single-wavelength pulse-width-based method with a one-dimensional imaging rate of 100 kHz to image blood oxygenation with capillary-level resolution. We applied PAM to image the vascular morphology, blood oxygenation, blood flow and oxygen metabolism in both resting and stimulated states in the mouse brain.

In this class, we will repeatedly return to the question of how best to articulate hypotheses about the function of neural circuits. It is our contention that current approaches fall far short in terms of explanatory value when it comes to describing meso-scale function [113] and that models from machine learning and computer vision originally motivated by results from neurobiology might serve as a source of such models [32]. Along similar lines, Lim et al [100, 101] ‘‘explore the idea that there are common and general principles that link network structures to biological functions, principles that constrain the design solutions that evolution can converge upon for accomplishing a given cellular task. We describe approaches for classifying networks based on abstract architectures and functions, rather than on the specific molecular components of the networks.’’

Increasingly, computational neuroscientists are applying ideas from statistical mechanics and dynamical systems theory to understanding the statistical properties of ensembles of neurons, with the motivation that the aggregate behavior of the ensemble is best characterized not in terms of the interactions between individual neurons but rather in terms of the interactions between self-organizing, constantly forming and reforming, highly-connected components — cliques in the parlance of graph theory — of the constituent neurons. Here’s a paper by Liam Paninski, Sarah Woolley Ramirez and their colleagues [132] that provides an interesting example of this approach; we will hear more from Surya Ganguli later in the quarter.
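For readers who haven't met cliques: a clique is a subset of nodes that are all pairwise connected, and the classic Bron–Kerbosch recursion enumerates the maximal ones. Here is a minimal self-contained sketch; the tiny "co-activity" graph below is made up for illustration, not real connectivity data:

```python
# Toy functional-connectivity graph: an edge means two neurons are
# strongly co-active. Node names and edges are invented for illustration.
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")}
nodes = {v for e in edges for v in e}
adj = {v: {u for u in nodes if (u, v) in edges or (v, u) in edges} for v in nodes}

def bron_kerbosch(r, p, x, out):
    """Enumerate maximal cliques: r is the clique under construction,
    p the candidates that extend it, x the nodes already processed."""
    if not p and not x:
        out.append(r)  # r is maximal: nothing can extend it
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}
        x = x | {v}

cliques = []
bron_kerbosch(set(), set(nodes), set(), cliques)
print(sorted(sorted(c) for c in cliques))
# -> [['a', 'b', 'c'], ['b', 'c', 'd'], ['d', 'e']]
```

Note that {a, b, c, d} is not reported because the edge a–d is missing; only fully connected subsets that cannot be extended qualify.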

There are now a few academic labs that routinely release their data to the community. It isn’t exactly a common occurrence as yet, but the trend looks promising. The EM data from the Bock et al paper [18] is available at the National Center for Microscopy and Imaging Research in their Cell Centered Database (CCDB) repository. You can find it on the CCDB website by typing "8448" — the dataset ID — into the search window at the upper left hand corner of the splash page. The site also provides viewers for exploring the data. If you think you want to experiment with this data, tell Aditya and me and we’ll consider copying the data to Stanford servers. Some of the datasets are very large, and network capacity and disk space, while relatively inexpensive — at least the latter — are not free. You might want to hold off for a while before downloading more than a terabyte. At any rate, we want to be good stewards of Stanford computing resources, so please ask Aditya or me before moving a lot of data.

April 1, 2015

The course website and calendar are up to date through April 8. This includes the first two lectures and links to relevant papers, plus entries for Mario’s and Viren’s presentation / discussion sessions, including the readings for those classes. Check it out and if you encounter any dead links or incomprehensible content please tell me — the splash page and its calendar entry are the only ones I can vouch for at this time, but I’ll have the discussion list ready by the end of the weekend. Here are some current news stories and answers to questions from the first lectures:

CMU researchers have used data mining to build a publicly available website that acts like Wikipedia, indexing the decades’ worth of physiological data collected about the billions of neurons in the brain. Researchers at NYU have captured images of dendrite nerve branches that show how mouse brains sort, store, and make sense of information during learning. In a study published online in the journal Nature on March 30, the NYU Langone neuroscientists tracked neuronal activity in dendritic nerve branches as the mice learned motor tasks such as how to run forward and backward on a small treadmill. They found that the generation of calcium ion spikes — which appeared in screen images as tiny ‘‘lightning bolts’’ in these dendrites — was tied to strengthening or weakening connections between neurons, hallmarks of learning new information.

The EM neural-tissue-sample-preparation protocols are linked to Wednesday’s calendar entry. For reference here’s an expansion of what I said in class about fixation, contrast agents and heavy metals:

Uranyl [238] — Uranyl acetate is an acetate salt of uranium. The advantage of UA is that it produces the highest electron density and image contrast, as well as imparting a fine grain to the image, due to uranium’s atomic weight of 238. The uranyl ions bind to proteins and lipids with sialic acid carboxyl groups, such as glycoproteins and gangliosides, and to the nucleic acid phosphate groups of DNA and RNA.

Lead [207] citrate — Lead citrate enhances the contrasting effect for a wide range of cellular structures such as ribosomes, lipid membranes, the cytoskeleton and other compartments of the cytoplasm. The enhancement of the contrasting effect depends on the interaction with reduced osmium, since it allows the attachment of lead ions to the polar groups of molecules. Osmium is used routinely as a fixative. Lead citrate also interacts, to a weaker extent, with UA, and therefore lead citrate staining is employed after UA staining.

Osmium [190] tetroxide — Osmium tetroxide fixative enhances the contrast. It acts as a fixative as well as an enhancer of contrast during post-staining by interacting with uranyl acetate and lead citrate. It has been indicated that fixation time has an effect on the contrast obtained by uranyl acetate contrasting. A long fixation with osmium tetroxide decreases, e.g., the contrast of chromatin.

As I alluded to in class, beyond the basic physics and chemistry this step is more art than science, but it can make a big difference in image quality.

Electroporation was mentioned in Kevin Briggman’s slides in the context of how you introduce DNA into live cells, generally referred to as transfection. I’ve just listed the definition below, but the Wikipedia page for this term does a pretty good job. You might want to follow some of the related links and learn about viral transfection methods that use adenovirus vectors and the family of retroviruses that are popular for neural circuit tracing.

Electroporation is the use of high-voltage electric shocks to introduce DNA into cells — it can be used with most cell types, yields a high frequency of both stable transformation and transient gene expression and, because it requires fewer steps, can be easier than alternative techniques.

My cheesy-serial-section demo included mention of a paper by researchers at HHMI Janelia, Hayworth et al [65], that just came out in Nature Methods describing their new ‘‘hot knife’’ method and why and how it is important in improving their segmentation work using FIB-SEM (focused-ion-beam scanning electron microscopy). This technology was developed by a team of scientists at Janelia working in Harald Hess’s lab and led by Ken Hayworth3. While it is ideally suited to scaling FIB-SEM, the technology has wider application. I’ll describe the technology in some detail in class today. The abstract of the Nature Methods paper follows:

Focused-ion-beam scanning electron microscopy (FIB-SEM) has become an essential tool for studying neural tissue at resolutions below 10 nm x 10 nm x 10 nm, producing data sets optimized for automatic connectome tracing. We present a technical advance, ultrathick sectioning, which reliably subdivides embedded tissue samples into chunks (20 μm thick) optimally sized and mounted for efficient, parallel FIB-SEM imaging. These chunks are imaged separately and then ‘volume stitched’ back together, producing a final three-dimensional data set suitable for connectome tracing.

I talked about how neurobiology labs today are populated by scientists and engineers from many disciplines whose expertise is often relatively narrow, and that I certainly don’t expect you to master all the subjects that we’ll touch upon in class. This diversity of people and technical expertise is one of the appealing features of working in the field of neurobiology — you learn something new every day, and making connections among ideas from different fields is the source of many great ideas. Here’s an expansion of what I briefly mentioned in the introductory lecture regarding the sort of background I expect for this course:

I am expecting that you will be coming from varied backgrounds. In particular, while coding skills are important, you may not have a lot of experience with machine learning, and, while some familiarity with neurobiology is important, you may not have the same depth of knowledge as a graduate student in neuroscience. I am, however, expecting that you have some depth in either neuroscience — in particular the primary visual cortex — or computer science — in particular machine learning and signal processing. If you’re weak on the former, you should probably review the material in Psychophysics of Vision: Primary Visual Cortex. If you’re weak on the latter, I suggest that you look at the survey paper on deep networks by Jürgen Schmidhuber [140], the ACL tutorial by Socher, Manning and Bengio [153], and the documentation available for the Theano and Torch libraries. If you’re interested in getting started working on structural connectomics immediately, you might take a look at the ISB Dataset that Aditya has uploaded to the Stanford servers.

March 23, 2015

Sunday night Jo and I watched a YouTube science documentary called ‘‘Bionics, Transhumanism, and the end of Evolution’’. I have no idea who directed, produced or wrote the screenplay. The person who recommended the video to me thought it was a BBC production, but it definitely isn’t. The video includes commentary by a number of reputable scientists, philosophers and science fiction writers. Ray was featured in several segments. I was impressed with the technology selected for discussion and the relative even-handedness of the presentation. Depending on your biases, you may see the future predicted by the commentators as apocalyptic, depressing or wonderfully exciting, but it’s hard for a scientist or engineer not to see it as inevitable.

The documentary starts with a clip from Burning Man on the last night of the event when the giant wooden statue of a standing man is set ablaze to the cheering of an enthusiastic crowd. There is subtly ominous music in the background. It ends with a continuation of the clip showing the statue engulfed in flames and beginning to disintegrate and the crowd cheering even more enthusiastically than in the earlier scene. Against the backdrop of the surging crowd and artful conflagration, Bruce Sterling delivers the following soliloquy:

It’s important to realize that the posthuman epoch is coming. We really do want to violate human limits and we’re getting closer to having the technology to do so, but it’s also important to realize that this is not the end of history. It does not solve any of our other problems, it just creates new problems that are going to intensify, and there’s going to be more than one kind of humanity. The mere fact that you’re no longer human doesn’t mean that you don’t have the same personality problems that you had before. It doesn’t liberate you from yourself, it probably makes you more you not less. You’re not going to clank and beep like robocop, you’re just going to have more abilities and new powers. Dealing with power is troublesome; if you have more power, you have more responsibility not less. — (Bionics, Transhumanism, and the end of Evolution)

As I watched the video, I thought of the Enlightenment and the work of the Scottish, English, German and French philosophers of the time: Hobbes, Locke, Hume, Montesquieu, Rousseau, Condorcet, Diderot, d’Alembert and Voltaire. Their writings and the scholarly biographies chronicling their lives reflect their wit, intelligence, enthusiasm and impatience. They knew they were in the midst of a period of profound change. They realized that the existing social contract was dissolving and that it was their responsibility to forge a new one. They were impatient to enact the changes they judged most appropriate and they were appallingly ignorant of the world they lived in.

Given what there was to know at the time, they were well educated, veritable polymaths compared to most of their contemporaries. They were interested in and held strong opinions about capital markets, competition, conflict, public education, property rights, altruism and morality, the control of technology, the role of government, slavery and women’s rights, just to name a few topics. They had no idea of the impending social cataclysm soon to be wrought by the industrial revolution. I thought about Anthony Pagden’s description [128] of the debate between Denis Diderot, David Hume, Immanuel Kant, Laplace and Montesquieu concerning standing armies and the prospects for armed conflict in the different futures they were imagining.

I saw these 18th-century philosophers mirrored in the views expressed by the 21st-century scientists and technology enthusiasts interviewed in the video. We have already started to build machines that kill and destroy villages, we have gradually been granting more and more autonomy to these machines, and we will cede more as the machines become smarter and better able to carry out our wishes. More than one commentator suggested we shouldn’t worry about the machines turning against us because we will program safeguards into these machines so we can render them inert if they run amok. Most computer scientists would laugh at their ignorance and perhaps shudder at the risks many of today’s leading scientists and engineers are willing to take with their children’s future. I’m caught up in the same frenzy of enthusiasm and excitement and have no credibility to criticize.

P.S. There’s a sequel entitled ‘‘Better, Stronger, Faster: The Future of the Bionic Body’’ which I haven’t seen yet but am tempted to, given the admittedly-low-bar better-than-the-discovery-channel quality of the above-mentioned documentary — you can find it here. On a related note, we met with two of the founders of Nervana Systems, Amir Khosrowshahi and Arjun Bansal — neuroscientists who came from Bruno Olshausen’s and John Donoghue’s labs, respectively; Arjun worked with me on probabilistic graphical models of cortex when I was at Brown University — right after their meeting with several partners at Google Ventures. Arjun mentioned the nano-scale implantable neural devices being developed at Berkeley in Jan Rabaey’s lab; here are two representative papers:

March 19, 2015

There are three new (2015) papers just out that extend the use of LSTM models beyond linear chains focusing on the representation of tree structures in NLP applications: Zhu et al [189], Tai et al [83] and Vinyals et al [175]. The Tai et al work is evaluated on semantic relatedness4 and sentiment analysis, where the former task uses the Sentences Involving Compositional Knowledge (SICK) dataset that we encountered in Bill MacCartney’s NLI work.

The Zhu et al paper compares the Recursive Tensor Neural Network of Socher et al [154] with the same model in which the authors have replaced the tensor layer with an LSTM, resulting in a recurrent neural network5. Unfortunately, all three papers depend on labeled data — though Vinyals et al augment the relatively scarce human-annotated data using automated parsing technology — in the form of parse trees for training. In the sequel, I’ll focus first on the Tai et al model, comparing it with the Zhu et al model, followed by a brief discussion of the Vinyals et al work.

The standard linear-chain LSTM ‘‘composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step,’’ whereas ‘‘the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units.’’

Figure 1 from [83]: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor. Compare with Figure 1 from Zhu et al [189].

In Tree-LSTM units, gating vectors and memory cell updates are potentially dependent on the state of multiple child units. Instead of a single forget gate, the Tree-LSTM unit has one forget gate fjk for each child k, thereby allowing the Tree-LSTM unit to selectively incorporate information from its children. For example, ‘‘a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.’’
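A minimal numpy sketch of the Child-Sum Tree-LSTM update described above, following the equations in Tai et al [83]; the dimensions are toy sizes and the weights are randomly initialized rather than trained:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 4, 5  # toy input and hidden sizes
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Randomly initialized parameters; W_* act on the input x, U_* on hidden states.
W = {g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "ifou"}
U = {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "ifou"}
b = {g: np.zeros(d_h) for g in "ifou"}

def child_sum_tree_lstm(x, children):
    """One Child-Sum Tree-LSTM step: `children` is a list of (c_k, h_k)
    pairs; the input, output and update gates see the sum of child hidden
    states, while there is a separate forget gate f_k per child."""
    h_sum = sum((h for _, h in children), np.zeros(d_h))
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])
    c = i * u
    for c_k, h_k in children:
        # One forget gate per child, conditioned on that child's h_k,
        # lets the unit selectively incorporate each child's memory.
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"])
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return c, h

leaf1 = child_sum_tree_lstm(rng.normal(size=d_in), [])
leaf2 = child_sum_tree_lstm(rng.normal(size=d_in), [])
root_c, root_h = child_sum_tree_lstm(rng.normal(size=d_in), [leaf1, leaf2])
print(root_h.shape)  # -> (5,)
```

With an empty child list the update reduces to a leaf cell, and with exactly one child it reduces to the ordinary chain LSTM, which is why the chain model is a special case of the tree model.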

Figure 2 from [83]: Composing the memory cell c1 and hidden state h1 of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness. Compare with Figure 2 from Zhu et al [189].

The Vinyals et al [175] work is perhaps not as novel or obviously tree-like, but their approach to parsing is arguably more interesting and their results more compelling than either of Tai et al or Zhu et al. They use a completely different LSTM architecture, namely the sequence-to-sequence (S2S) LSTM model of Sutskever et al [162]. Vinyals et al augment the S2S architecture in [162] so that it produces a linear encoding — an S-expression — of the parse tree and then use a stack to keep track of the level of nesting.
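To make the linearization concrete, here is a toy sketch of decoding such an S-expression-style output with a stack; the bracket symbols and the ‘‘XX’’ terminal placeholder are assumptions loosely modeled on the paper's output format, not their exact implementation:

```python
# The decoder emits one symbol at a time; a stack tracks the current
# nesting depth, so the sequence can be checked to encode a well-formed
# tree. The tag set and sentence below are invented for illustration.
tokens = ["(S", "(NP", "XX", ")NP", "(VP", "XX", "(NP", "XX", ")NP", ")VP", ")S"]

def decode(tokens):
    stack = []   # open nonterminals awaiting their closing symbol
    root = None
    for t in tokens:
        if t.startswith("("):
            node = (t[1:], [])                # (label, children)
            if stack:
                stack[-1][1].append(node)     # attach to current parent
            else:
                root = node
            stack.append(node)
        elif t.startswith(")"):
            node = stack.pop()
            assert node[0] == t[1:], "mismatched close symbol"
        else:
            stack[-1][1].append(t)            # terminal (POS placeholder)
    assert not stack, "unclosed nonterminal"
    return root

tree = decode(tokens)
print(tree)
# -> ('S', [('NP', ['XX']), ('VP', ['XX', ('NP', ['XX'])])])
```

The same stack discipline can be used during decoding to mask out symbols that would produce an ill-formed tree, which is one way a sequence model can be constrained to emit valid parses.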

In the tradition of the best Google research, Vinyals et al gain advantage by automatically collecting a large supplementary dataset that enables a substantial difference in performance. The authors ‘‘train a deep LSTM model with 34M parameters on a dataset consisting of 90K sentences (2M tokens) obtained from various treebanks and 7M sentences from the web that are automatically parsed with the Berkeley Parser [...]. The additional automatically-parsed data can be seen as an indirect way of injecting domain knowledge into the model.’’ Simple, elegant and eminently practical.

Hopefully this is enough detail to pique your interest. I found the Tai et al paper the clearest with a nice mix of intuition and technical detail. While different in detail, Tai et al and Zhu et al are the closest in spirit of the three. The Vinyals et al paper is simpler and more elegant than either of the other two; the only reason I didn’t rank it higher is that I was prepared for it as a consequence of our thinking along similar lines — I readily admit, however, that their solution is more elegant than any I came up with.

In looking around for papers on hierarchical models that might apply to either Descartes or Neuromancer use cases, I ran across a 2008 paper by Richard Socher that appeared in a symposium on medical imaging. Socher et al [150] present a hierarchical model for segmenting tubular structures observed in low-radiation X-ray images (3-D CT). They apply the model to segmenting blood vessels in angiographic videos. This paper does not use artificial neural networks, deep or shallow, and relies on three machine learning techniques that might provide some interesting ideas for tracing neural circuitry — see the footnote at the end of this sentence for references6.

For your convenience, you can find the BibTeX entries for the papers mentioned above, including abstracts, by following the footnote at the end of this sentence7. All of the papers but the 2008 Socher et al paper are available from arXiv as 2015 submissions — search for the first author, limiting the search to the current year. Socher’s paper is on his DBLP page or among Comaniciu’s publications (PDF). It might be instructive to have Oriol, Ilya or Lukasz present their paper at a Descartes / Neo meeting.

March 15, 2015

Here’s a note that I sent to Demis Hassabis during the weekend asking questions about his work in cognitive neuroscience on the relationship between constructive remembering and imaginative forecasting. I am particularly interested in the connection of his work to what I perceive as the related problems of planning to achieve goals and generating responses to questions:

This weekend I got interested in learning more about the secondary / multi-modal association areas. I pulled the usual volumes I consult when I’m clueless and unsure how to proceed. Bear, Connors and Paradiso [10] and Kandel, Schwartz and Jessell [84] didn’t have much to offer, but then I ran across an article in Gazzaniga [48] by some of your colleagues, Addis, Buckner and Schacter [138], that was interesting and led me to more of your work. I was particularly intrigued with the experiments and results reported in [137, 64].

I’m curious if you know of any work that attempts to provide a computational (algorithmic) explanation of what the circuits described in [137] are actually doing when imagining possible futures. I’m thinking along the lines of the O’Reilly and Frank [124, 123, 125] model of working memory and executive control or Eliasmith and Stewart’s SPAUN architecture [16, 160, 159], both of which support different forms of gating that enable variable binding — and the authors claim are biologically plausible.

I’m generally of the opinion that these high-level descriptions of cortical computation are unlikely to shed much light on cortical circuitry at the microscale, but at least in the case of O’Reilly’s and Eliasmith’s work their models are described in enough detail that a computer scientist has some chance of understanding what the authors mean when they suggest that some area of the brain is, say, capable of variable binding. With those caveats, I think the effort to be clear is well worth the risk of being wrong, since at least then rational people can agree on what they are arguing about.

The tasks you had subjects perform in your experiments were particularly useful for my introspective gedanken experiments in trying to solve variants of the binding problem using cascaded embedding spaces. For example, in imagining a future free of debt, the subject, call her Jan, might be reminded of a friend, Alice, who in similar circumstances got a second job cleaning offices at night. To imagine a debt-free future, Jan might recall the story about Alice, remove the dependence on Alice as the active agent, add herself as the agent, and imagine working extra hours in an otherwise empty office space late at night8.

Assuming an embedding-vector representation, why can’t we retrieve the vector representing the story, subtract the vector for Alice and add the vector for Jan to produce a representation that captures the meaning in the hypothesized future? The answer stems from the fact that we don’t know if the Alice story is relevant to reducing Jan’s debt, we don’t know that Alice is the active agent — or what an ‘‘active agent’’ is for that matter, and we don’t know whether the other dependencies in the Alice story are compatible with Jan’s circumstances. Short of finding a much better semantic-frame parser than currently exists, we can’t easily disassemble and reassemble distributed representations to support the sort of imaginative reconstructive remembering that you describe in [138].
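A toy numpy illustration of why the naive edit is both trivially easy and semantically empty; the vocabulary and embedding values below are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy word embeddings; all vectors are random stand-ins for illustration.
vocab = ["alice", "jan", "pays", "debt", "cleans", "offices"]
emb = {w: rng.normal(size=8) for w in vocab}

def sent_vec(words):
    """Bag-of-words sentence embedding: just the sum of word vectors."""
    return sum(emb[w] for w in words)

story = sent_vec(["alice", "cleans", "offices", "pays", "debt"])
# The naive edit: subtract Alice, add Jan.
imagined = story - emb["alice"] + emb["jan"]
# Arithmetically it "works": the result equals the reworded story.
target = sent_vec(["jan", "cleans", "offices", "pays", "debt"])
print(np.allclose(imagined, target))  # -> True

# But the representation has no notion of roles: swapping who does what
# to whom yields exactly the same vector, so the edit cannot distinguish
# agent from patient, let alone check relevance or compatibility.
a = sent_vec(["alice", "pays", "jan"])
b = sent_vec(["jan", "pays", "alice"])
print(np.allclose(a, b))  # -> True
```

The second pair of sentences is the crux: any purely additive composition is order- and role-invariant, which is exactly the information that imaginative reconstruction needs to preserve.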

Perhaps vector addition and subtraction aren’t up to the task; how about (potentially) more expressive tensor operations on connectionist slot-filler representations, e.g., Smolensky [148]? Unfortunately, while Smolensky gives lip service to encoding graphs and slot-and-filler structures, his distributed representations assume that you map more conventional GOFAI representations onto slices of tensors, and in so doing he punts on the parts of the problem I’m most interested in, namely learning how to perform these mappings and manipulations in a fully distributed fashion. Javier Snaider on the Descartes team has developed a technology called Modular Composite Representation (MCR) that addresses some of the shortcomings of Smolensky’s approach but still requires parsing [149]. MCR may be our best bet in the short term, but I’m still looking for a better compromise.
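For concreteness, here is a minimal numpy sketch of tensor-product role-filler binding in the spirit of Smolensky [148]; the role and filler vectors are arbitrary toy values, and the orthonormal-roles assumption is what makes exact unbinding possible:

```python
import numpy as np

# Orthonormal role vectors (agent vs patient) and arbitrary filler vectors.
agent = np.array([1.0, 0.0])
patient = np.array([0.0, 1.0])
alice = np.array([0.3, -1.2, 0.7])
jan = np.array([-0.5, 0.4, 1.1])

# "Alice pays Jan": bind each filler to its role with an outer product,
# then superpose the bindings by addition.
T = np.outer(alice, agent) + np.outer(jan, patient)

# Unbinding: contract the tensor with a role vector to recover its filler.
print(np.allclose(T @ agent, alice))    # -> True
print(np.allclose(T @ patient, jan))    # -> True

# Substituting Jan for Alice as the agent is now a well-defined edit,
# unlike the role-blind additive case:
T2 = T - np.outer(alice, agent) + np.outer(jan, agent)
print(np.allclose(T2 @ agent, jan))     # -> True
```

This is precisely the manipulation the Alice/Jan example calls for; the catch the entry complains about remains, though: something outside the tensor machinery still has to decide which roles exist and which fillers occupy them.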

March 13, 2015

This week has been a rather fallow period. I put aside the time to think about the connectionist binding problem, since most of the research group, including Neuromancer, was on a ski trip. After a lot of isolated thought, I took down the three major neuroscience tomes on my bookshelf: Bear, Connors and Paradiso [10], Kandel, Schwartz and Jessell [84], and Gazzaniga [48], and read everything they had to say about the primary and secondary association areas in the cortex. Most of it was review, but, for example, I never knew how much more developed (proportionally larger) the secondary association areas are in human cortex compared to any other mammal.

The main thread that I’m following involves the idea that sensory input moves from the periphery to the primary sensory areas, into the primary (unimodal) association areas, and, finally, the secondary association areas, producing increasingly rich composite representations and, ultimately, complete and integrated episodic memories, and that those memories serve as the basis for conditioning procedural strategies and, at least in the case of primates and some birds and other mammals, planning to solve problems in novel situations.

It may be that some types of simple planning / action selection do not require complex machinery for variable binding, alignment and substitution. It’s worth thinking about how a kitten might learn from observing its mother and siblings. Does this involve the sort of creative / reconstructive recall that apparently characterizes much of human planning? Imagine a familiar scene that includes a friend or work colleague. Next imagine subtracting out your friend / colleague and adding in yourself in his / her place. Now ask yourself how this altered memory ‘‘feels’’? Do you ‘‘fit’’ in this familiar but nonetheless fictional (counterfactual) account of the past? Is it ‘‘natural’’? What would it take to make the reconstructed memory feel ‘‘natural’’?

On the whole, however, I was disappointed in what I learned, though I might have expected as much given how little is known about the primary visual association area, inferotemporal cortex (IT) [18361741813137182169]. Today I’m taking a break in the hope that my thoughts will coalesce into something useful. I took the time to write a note to my Stanford colleagues asking them to recommend that their best students take part in CS379C this spring9, and I caught up with the AIBS team working on the IARPA MICrONS proposal.

March 7, 2015

I wrote up some notes including a few papers and technical reports that should help new recruits to better understand the challenges facing Neuromancer and our strategies for addressing them. The challenges are divided into three categories: (1) connectomics (circuits), (2) recordings (activity), and (3) analyses (function), where the last is the least well defined in terms of agreed-upon outcomes and priorities for pursuing them:

  1. CIRCUITS: Here’s a pretty reasonable extrapolation of existing and emerging technologies leading to economical whole-brain connectomics, which I’ve excerpted from [110]. Check out the full document. I think the authors have done a good job including the front-runners as well as some of the most promising alternatives. The time frames for whole-brain connectomes run from two to ten years, depending on the organism and technology. It’s obviously much easier predicting how the technologies of incumbents like Zeiss will fare than the more exotic ideas coming out of the academic labs:

    Due to advances in parallel-beam instrumentation, whole mouse brain electron microscopic image acquisition could cost less than $100 million, with total costs presently limited by image analysis to trace axons through large image stacks. Optical microscopy at 50 to 100 nm isotropic resolution could potentially read combinatorially multiplexed molecular information from individual synapses, which could indicate the identities of the pre-synaptic and post-synaptic cells without relying on axon tracing. An optical approach to whole mouse brain connectomics may be achievable for less than $10 million and could be enabled by emerging technologies to sequence nucleic acids in-situ in fixed tissue via fluorescent microscopy. Novel strategies relying on bulk DNA sequencing, which would extract the connectome without direct imaging of the tissue, could produce a whole mouse brain connectome for $100k to $1 million or a mouse cortical connectome for $10k to $100k. Anticipated further reductions in the cost of DNA sequencing could lead to a $1000 mouse cortical connectome.

    We’re putting most of our money on reconstruction from EM using current and soon-to-be-current technologies like the new Zeiss line of multi-beam microscopes that Winfried Denk is now working with while at the same time developing his extra-wide, perfect-crystal, whole-brain, serial-sectioning diamond-knife [40], but we are also placing side bets on Boyden’s expansion-microscopy technology [26] which we believe is very promising and keeping close tabs on some of the work coming out of the Church [111] and Zador [96] labs.

  2. ACTIVITY: This is an area full of opportunity with lots of new ideas and talent from complementary disciplines. We wrote a technical report on neural recording technologies that is still pretty current [34]. One of my colleagues, Adam Marblestone — Ph.D. with George Church and currently a postdoc with Ed Boyden — corralled a group of biologists, chemists, physiologists, physicists, electrical engineers, etc., to put together a somewhat more speculative — understandably so, given the additional complexity of working with an awake, behaving organism — extrapolation that is definitely worth your time reading [112].

    For the time being, we are banking on calcium imaging as being the recording technology that is likely to scale to satisfy our requirements. The current GECIs have much improved response kinetics and signal amplitudes compared with earlier generations [78], the necessary GECI-expressing transgenic mouse lines already exist and the Allen Institute has world-class neuroscientists with expertise in working with them. There has been some work on miniature fluorescence microscopes suitable for mounting on the head of a mouse, thereby allowing the animal limited mobility [51], but so far the incumbent GECI and fixed-camera technologies seem way out in front in terms of scale and reliability.

  3. FUNCTION: This is by far the least well explored of the three technical categories. The reason is pretty obvious: we have never had data on the scale that we anticipate from the Allen Institute MindScope Project. There has been speculation about the structure and function of cortical columns, but no compelling evidence to support any of the current hypotheses. We’ve been working with Costas Anastassiou and his team at AIBS in developing simulations of small portions of cortex consisting of 5,000-50,000 neurons, but this doesn’t even account for a single cortical column. We could simulate much larger models at Google, but at this point in time it doesn’t much matter, since, if we wanted to create a model of a small patch of cortex spanning multiple cortical columns using state-of-the-art neural modeling tools, we would be hard pressed to do so given our limited knowledge of cortical cell types, connectivity and dynamics.

    We’re designing a series of progressively more difficult modeling challenges. The first couple of challenges involve learning the input-output functions of a set of artificial neural network (ANN) models. We don’t pretend these models are necessarily good models of biological networks; however, if we can’t learn a reasonably well-behaved network we’ve engineered, there isn’t much sense in trying to learn a real neural network given all the unknowns associated with biological systems. The next set of experiments will make use of the models that Costas’ team is developing. These models consist of networks of reconstructed, multi-compartmental, virtually-instrumented and spiking pyramidal neurons and basket cells, plus ion- and voltage-dependent currents and local field potentials so we can generate the same sort of rasters we expect to collect during calcium imaging. Once again we have a highly-controlled sandbox in which to evaluate machine-learning technologies.
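    A minimal harness for the first challenge might look something like the following sketch, in which the teacher architecture, sizes and strawman learner are all placeholder assumptions, not the actual challenge specification:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # "Teacher": a fixed, randomly initialized two-layer tanh network whose
    # input-output function the first challenge asks us to recover.
    W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
    W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

    def teacher(x):
        return np.tanh(x @ W1.T + b1) @ W2.T + b2

    # Record input-output pairs and split into train / test, as the challenge would.
    X = rng.normal(size=(2000, 4))
    Y = teacher(X)
    Xtr, Xte, Ytr, Yte = X[:1500], X[1500:], Y[:1500], Y[1500:]

    # Strawman learner: least squares on random tanh features -- a baseline any
    # serious challenge entry would need to beat, not a proposed solution.
    P = rng.normal(size=(4, 256))
    def feats(x): return np.tanh(x @ P)
    w, *_ = np.linalg.lstsq(feats(Xtr), Ytr, rcond=None)

    # Generalization score on held-out inputs.
    mse = float(np.mean((feats(Xte) @ w - Yte) ** 2))
    ```

    The point of the harness is the evaluation protocol, not the learner: any candidate model is scored only on held-out inputs, exactly as we would have to score a model of a real circuit.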

    We hope to start getting recorded activity data from AIBS by the end of summer if not sooner. We may get access to data from experiments carried out by Clay Reid while still at Harvard that we can play with while waiting for MindScope data. Given real data, the first order of business is to see if we can replicate the training data and generalize to the test data. Interpreting success will be challenging; the best anyone can do may be to capture summary statistics of the output or identify emergent, dynamical-system behavior. In order to have any chance of reproducing spiking behavior, we may have to restrict our attention to smaller circuits, assuming we can identify their boundaries. We’ll also want to exploit any connectomic, proteomic or transcriptomic information we glean from the fixed and registered tissue after the activity-recording stage. Once we have mouse recordings, we are in terra incognita with much to learn.

March 5, 2015

It is easy to fall into the trap of thinking that just because it is possible to trace individual processes over substantial distances in dense neural tissue, tracing all of the neurons in an entire mammalian brain is just a matter of scale. The mouse brain has ~10^8 neurons and ~10^11 synapses in a volume of ~500 mm^3. Kilometers of neuronal wiring pass through any cubic millimeter of tissue and the relevant anatomical features are on the scale of 100 nm [110].
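To see why scale is the obstacle, it helps to turn the numbers above into data volumes; assuming one (deliberately optimistic) byte per voxel:

```python
# Back-of-envelope: voxel counts for imaging a whole mouse brain.
MM_IN_NM = 1e6                         # nanometers per millimeter
volume_nm3 = 500 * MM_IN_NM**3         # ~500 mm^3 brain volume, in nm^3

def voxels(resolution_nm):
    """Number of isotropic voxels at the given linear resolution (nm)."""
    return volume_nm3 / resolution_nm**3

optical = voxels(100)   # ~100 nm: the super-resolution / expansion regime
em = voxels(10)         # ~10 nm: the electron-microscopy regime

# At one byte per voxel: ~5e14 bytes (~500 TB) optical versus
# ~5e17 bytes (~500 PB) EM -- before oversampling or multiple channels.
```

The three-orders-of-magnitude gap between the two regimes is one reason the optical approaches above are so attractive if they can deliver the resolution.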

Accurate tracing of individual axonal processes is certainly possible using a number of laboratory techniques. Both anterograde (soma to synapse) and retrograde (synapse to soma) tracing methods, which work by exploiting different forms of labeling and axonal transport, are reasonably well developed but still require special care to administer. Some progress has been made by using retroviruses to propagate labels from one neuron to another [115, 22], and there are now transsynaptic anterograde tracers, which can cross the synaptic cleft, labeling multiple neurons along an extended path [116, 19].

These methods suffer from the problem that, while an individual process is relatively easy to trace, once you label all or most of the processes in a given volume, you end up with many of the same problems that surface in tracing processes stained with conventional preparations using the sort of protocols and algorithms we’ve discussed elsewhere. Alternative methods that rely on propagating either unique labels or one of many distinguishable labels may provide a solution.^11

The Zador et al [187] method for attaching unique molecular barcodes to each neuronal connection works by converting the problem of tracing connectivity into a form that can be solved by high-throughput DNA sequencing. A related method leveraging fluorescent in situ nucleic acid sequencing [96] offers similar functionality with more detailed annotation — additional markers for diverse molecules — and a theoretically simpler method for reading off the information encoded in the neural tissue [111].

For more on the technical details as well as other alternative technologies, take a look at the Dean et al report [34] produced by the Spring 2013 class of CS379C or the Marblestone et al report [112] describing the physical principles relevant to scaling neural recording.

March 3, 2015

Here’s an email exchange between Costas Anastassiou and me about how we might use his cortical-model simulations. The messages are in reverse chronological order, as are all the entries in this log.

From: Costas Anastassiou

Many thanks for the ideas and insights. First, let me agree with you that there are so many unanswered questions about cortical computation even decades after the beautiful work on cortical microcircuits by Martin and Douglas. Yet, in the last decade a spectrum of large-scale experimental efforts have been undertaken with the aim of providing data (especially connectivity) to fill in the blanks. Here are a few examples: Hofer et al, Nature Neuroscience (2011) [72]; Ko et al, Nature (2011) [89]; Ko et al, Nature (2013) [88]; Lien & Scanziani, Nature Neuroscience (2013) [99]; Packer & Yuste, Journal of Neuroscience (2011) [127]; Perin et al, PNAS (2011) [129].

Whereas these efforts are heroic and the associated data priceless, it is unclear what computational scheme they boil down to. Here is where I think computational modeling can play a key role: to integrate such experimental observations and use multiple approaches to come up with reduction strategies that will help us understand the fundamental types of computations occurring in various stages of visual processing. Regarding the specific questions you asked, here are my responses:

The question then becomes, if one can implement orientation or direction selectivity (by brute, feedforward force), can we then generalize to, say, complex images such as natural scenes? The issue is that (assuming purely feedforward processing) we would need to co-activate various LGN neurons in ways that correlate with the image’s characteristics, which is something we cannot yet do. So, for anything substantially more complicated than simple sine waves, we do not have the computational framework in place — though we are working on it.

Regarding visualizations, I guess the question is: what do you mean when you say ‘‘make sense’’? For example, such visualizations can help us detect epileptic activity, etc. Moreover, we are in the process of creating simulated Ca-imaging movies (dF/F) — essentially the output gathered in 2-photon experiments — in order to study what these experiments really tell us about cortical processing. But you are right, visualizations are also often used in ways that are not informative — which is why, before making any of them, it would be good to know what we are trying to get out of them.

Here is the email that prompted the above reply:

It will help if we have some baseline expectations to prime the pump with as it were. Here are a few questions to help with basic impedance matching, then we can go from there: To what extent can we model and simulate a collection of cortical columns and their connections? I know that’s a lot of neurons but I’d like to be clear on what we can’t do. I’m guessing this is out of the question for several reasons.

To what extent can we retinotopically map simulated optic tract / LGN input onto contiguous chunks of simulated cortex? Is there any way that we could simulate visual input corresponding to a sine-wave grating? I expect the answers to these questions are, respectively, ‘‘hardly any extent worth talking about’’ and ‘‘you’ve got to be kidding!’’

Since synaptic strength is not something we can plausibly assign a priori, the tabula rasa version of the network can’t do much at all. If we could apply some model of synaptic plasticity and train one of your models to realize something like Gabor filters, that would be a huge result, right?

So what sort of behaviors might we expect to see given the limits of your ability to initialize and thereby preprogram these models? Of course we can and may have to work with random input and just hope that this will stimulate interesting convergent global behaviors that we can then try to infer and replicate in our trained models. How do you imagine generating simulated input? I really have no idea at this point!


P.S. I answered most of my questions by doing a quick literature search and asking local experts (Viren Jain and Peter Li) on my team. As far as I can tell, the state of the art is pretty dismal. Amazing that it has been over three decades since the first hypotheses concerning cortical columns were made and we still know very little about the circuitry. You’d think someone would have traced a column or its analog in mouse, shrew or even turtle. Helmstaedter et al [66] was the most relevant paper I came across and Boudewijns et al [19] was one of the more innovative. If you have additional suggestions, I’d love to hear them.

February 25, 2015

As time permits over the next month, I’ll be adding notes for the class at Stanford in the Spring quarter. After my presentation to the Stanford Computer Science faculty last week, I thought about how to present this material to a non-neuroscience audience more comfortable with machine-learning and computer-vision technology than with diffuse neuromodulation and electron microscopy. Here’s a rough outline for one approach to motivating the problem and getting computer scientists excited about the extraordinary challenges and opportunities awaiting us:

  1. Neuroscience until very recently: patch clamping (gold standard), multi-electrode arrays, Nematodes, Cephalopods, Murinae, Primates;

  2. You are the first generation to have the opportunity to do neuroscience research without ever touching any living or dead model organism;

  3. What if understanding the brain were no more — and no less — difficult than reverse engineering an integrated circuit;

  4. Modern Rutherfords and Faradays — experimentalists with physics, materials science, biochemistry and nanotechnology training;

  5. Why the problem is hard from a theoretical perspective, why it is hard from a practical standpoint, and how it can be made easier;

  6. Persistent hypotheses perceived as facts with no conclusive evidence, e.g., single-cortical-algorithm, symbolic-processing reducibility;

I also thought about how to describe my current strategy for tackling functional connectomics problems in a staged manner so as to maximize the probability that students will be able to successfully complete a project involving synthetic and real data from our collaborators within the time and resource constraints of a busy Spring quarter. Here are my preliminary notes:

Industrial Espionage

Before we get into thinking about whether and how we might infer the function of real neural circuits, let’s briefly consider how we might infer the function computed by an engineered computing device. Suppose you are given recordings of the inputs and outputs of some sort of computing device: an Intel CPU chip, Arduino or Raspberry Pi circuit board, an ASIC or FPGA. Could you infer the function that the device is computing? What if you knew the wiring diagram in addition to the input-output recordings? What if, in addition, you were able to supply your own inputs and observe the outputs? Would this make the problem any easier?

Note this is basically what a semiconductor manufacturer might do having obtained — legally or otherwise — the latest chip from a rival company. It’s called reverse engineering and is generally considered a form of industrial espionage though the practice is believed to be widespread and considered in some circles to be just a normal part of doing business.

Reverse Engineering ANN

Now consider the related problem of inferring the function computed by an artificial neural network (ANN). It’s worth pointing out that we don’t really know what the different layers of a deep neural network are computing, though knowing would be enormously useful in debugging and optimizing such networks. Would it help to have recordings of their inputs and outputs without knowing anything about the network structure?

Suppose we are presented with a black box containing a deep neural network of unknown architecture. If we were to record from the unknown network’s inputs and outputs could we learn to replicate its behavior? Recent work on distilling more compact networks from previously trained networks suggests that we could [135, 70]. Could we infer the structure / architecture given just the inputs and outputs? Given the complexity of learning Boolean circuits, I would guess not [85].
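As a toy illustration of the distillation idea (a linear sketch of my own, not the recipe from the cited work), a ‘‘student’’ can be trained on the temperature-softened class probabilities emitted by a black-box ‘‘teacher’’:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Black-box "teacher": a fixed random linear classifier (3 classes, 10 features).
Wt = rng.normal(size=(3, 10))
X = rng.normal(size=(500, 10))
soft_targets = softmax(X @ Wt.T, T=2.0)    # temperature-softened teacher outputs

# "Student": a second linear classifier trained by gradient descent on the
# cross-entropy between its softened predictions and the teacher's.
Ws = np.zeros((3, 10))
for _ in range(2000):
    p = softmax(X @ Ws.T, T=2.0)
    grad = (p - soft_targets).T @ X / len(X)   # CE gradient (1/T folded into step)
    Ws -= 1.0 * grad

# Fraction of inputs on which student and teacher pick the same hard label.
agreement = float(np.mean(
    np.argmax(X @ Ws.T, axis=1) == np.argmax(X @ Wt.T, axis=1)))
```

Nothing about the teacher’s internals is used, only its input-output behavior, which is exactly the setting of the black-box question above.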

What if we were to record from all the units in a network, and we are given their approximate coordinates in the frame of reference of a 2-D projection that preserves the structure of the network shown in the diagrams used to represent the network in print? Would the locality information help to infer the input-output behavior? Could we infer the wiring diagram given the input-output behavior?

Computational Complexity

What if we had the complete circuit — the ‘‘wires’’ and ‘‘components’’, but not what the components compute — their activation functions? Could we infer them? What if we could supply our own input and then observe the output as well as the units in the hidden layers? This capability turns out to be very useful and can transform an NP-complete problem [54, 53] into a polynomial-time problem [3, 35].

Though I didn’t say so explicitly, I was assuming there is no information encoded in the order in which inputs are presented to our target computing device, and I was assuming that the inputs are presented as discrete samples — integer or real valued. In the case of real neural networks, however, the inputs and outputs are continuous and there definitely is information encoded in the changing amplitude of the continuously varying signals. In recording from the brain, we sample these signals, hopefully at higher than the Nyquist rate.
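To make the Nyquist point concrete, here is a toy aliasing demonstration (the frequencies are arbitrary): a sinusoid sampled below its Nyquist rate is indistinguishable from a lower-frequency one.

```python
import numpy as np

fs = 40.0                    # sampling rate in Hz -- below Nyquist for 30 Hz
t = np.arange(64) / fs       # 64 sample instants

# A 30 Hz sine sampled at 40 Hz is sample-for-sample identical to a negated
# 10 Hz sine: the energy "folds" down to |30 - 40| = 10 Hz.
s30 = np.sin(2 * np.pi * 30 * t)
s10 = np.sin(2 * np.pi * 10 * t)
aliased = bool(np.allclose(s30, -s10))   # True
```

No amount of post-processing can undo this: once the samples are taken, the two signals are literally the same data.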

In addition, the observations will be noisy and may only serve as a proxy for the signal we really want to record. Specifically, we employ two-photon microscopy and fluorescent proteins called genetically encoded calcium indicators (GECIs) to carry out calcium imaging. Variation in the excitation of the GECI fluorophores due to light-scattering, photo-bleaching and quantum effects introduces noise, and calcium is a lagging proxy for the underlying electrical activity that we would prefer to record if it were feasible [78].

Learning from Simulation

There are a lot of things that can go wrong in recording neural signals from biological computing devices. The recording instruments are extremely sensitive, the target tissue can be damaged in the process of recording, and, as already pointed out, there are sources of noise and device limitations that can interfere with the fidelity of the recorded signal, and, finally and somewhat ominously, we are not entirely sure we are recording the ‘‘right’’ signals required to infer function.

Instead of starting with recordings of biological systems, we begin by using simulations to generate data that, to the best of our knowledge, accurately represent the behavior of our target systems. If we can’t learn the function of systems that we understand well enough to simulate, then we have little or no chance of learning the function of systems about which we know relatively little — or, to be a little more optimistic, about which we think we know quite a lot but aren’t entirely sure and could be deluding ourselves.

We know — or think we know — a lot about how individual neurons behave in terms of their input-output behavior. The foundational work of Hodgkin and Huxley [71] provided us with a mathematical model of how neurons process and propagate information. The basic model consists of an electrical-circuit analog and a system of ordinary differential equations that can be used to predict the time course of the membrane potential. Multi-compartmental variants allow more precision in modeling neurons with extensive axonal and dendritic arborization.
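For concreteness, here is a minimal single-compartment Hodgkin-Huxley simulation using the standard squid-axon parameters and simple forward-Euler integration (a sketch, not a substitute for a proper simulator):

```python
import numpy as np

# Classic Hodgkin-Huxley squid-axon parameters.
C, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3        # uF/cm^2, mS/cm^2
E_Na, E_K, E_L = 50.0, -77.0, -54.387            # reversal potentials, mV

# Voltage-dependent gating-variable rate functions (1/ms).
def a_m(V): return 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
def b_m(V): return 4.0 * np.exp(-(V + 65) / 18)
def a_h(V): return 0.07 * np.exp(-(V + 65) / 20)
def b_h(V): return 1.0 / (1 + np.exp(-(V + 35) / 10))
def a_n(V): return 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
def b_n(V): return 0.125 * np.exp(-(V + 65) / 80)

def simulate(I_ext=10.0, dt=0.01, steps=5000):
    """Forward-Euler integration; returns the membrane-potential trace (mV)."""
    V, m, h, n = -65.0, 0.053, 0.596, 0.317      # approximate resting state
    trace = np.empty(steps)
    for i in range(steps):
        I_Na = g_Na * m**3 * h * (V - E_Na)
        I_K = g_K * n**4 * (V - E_K)
        I_L = g_L * (V - E_L)
        V += dt * (I_ext - I_Na - I_K - I_L) / C
        m += dt * (a_m(V) * (1 - m) - b_m(V) * m)
        h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
        n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
        trace[i] = V
    return trace

v = simulate()   # sustained 10 uA/cm^2 drive elicits repetitive spiking
```

A multi-compartmental model is essentially many copies of this system coupled by axial currents, which is why the simulations scale so badly with morphological detail.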

There are limitations to such models [1] as well as complicating intra- and extracellular factors that have to be accounted for in order to explain certain aspects of neural computation. These include genetic regulation, ephaptic coupling, and a host of other factors that are required to fully account for even the simplest neural computations.

We are fortunate to work with the accomplished neural modelers [21, 33, 38, 86] at AIBS, who help us navigate this complex space and provide us with state-of-the-art simulations of neural circuits. We will ‘‘instrument’’ these circuits in order to generate simulated data to test our learning algorithms prior to trying them out on the real thing. We can also use these models along with connectomic ground-truth data to produce connectomes in the form of adjacency matrices / affinity graphs that we can use to test our assumptions about the errors produced by existing automated-connectomics algorithms.
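As a toy sketch of that last idea, one can corrupt a ground-truth adjacency matrix with synthetic missed-edge and spurious-edge errors and then score the simulated reconstruction (the error rates below are made-up placeholders, not measured properties of any pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

# Ground-truth connectome: a sparse directed adjacency matrix.
n = 200
truth = (rng.random((n, n)) < 0.05).astype(int)
np.fill_diagonal(truth, 0)

# Toy error model: drop real edges (split-like errors) and add
# spurious ones (merge-like errors) at assumed rates.
p_miss, p_spurious = 0.10, 0.01
noise_drop = rng.random((n, n)) < p_miss
noise_add = rng.random((n, n)) < p_spurious
recovered = np.where(noise_drop, 0, truth)
recovered = np.where(noise_add & (truth == 0), 1, recovered)
np.fill_diagonal(recovered, 0)

# Edge-level precision and recall of the simulated reconstruction.
tp = int(np.sum((recovered == 1) & (truth == 1)))
precision = tp / max(int(recovered.sum()), 1)
recall = tp / max(int(truth.sum()), 1)
```

Feeding such corrupted adjacency matrices to a downstream function-learning algorithm tells us how sensitive it is to realistic reconstruction errors before any real connectome arrives.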

Learning Neural Function

Neuroscience covers a great many disciplines that have something to say about the brain. At the micro-scale, we have theories that operate at the cellular and molecular levels such as the Hodgkin-Huxley model. At the macro-scale, cognitive neuroscience has given us a wealth of theories and experimental findings from psychology, psychophysics and cognitive science.

There is a substantial gulf between the micro and macro scale. It’s as though we can talk at the low-level about digital circuits, machine registers and assembly code and at the high-level about the behavior of applications like Photoshop and Microsoft Office, but there isn’t even a suitable ontology in which to frame theories about the critically-important middle-ware that adds functionality on top of the operating system in order to enable the development of complex applications software.

An increasing number of computational neuroscientists are coming to the conclusion that the gap between the micro- and macro-scale is too wide to bridge directly and that we need some sort of meso-scale modeling language to connect the two [113, 24, 23]. What would such a language look like? In particular, what would be the canonical circuits and computational primitives at such a level? I don’t know the answer for general cognition, but in the case of vision we may have a good start.

Researchers in computer vision have a long and close relationship with scientists studying biological vision [32]. Many of the basic operators and algorithms that comprise computer vision libraries such as OpenCV have taken their inspiration from neuroscience. See elsewhere in my notes and in the recent paper I coauthored with David Cox [32] for a list of computational components inspired by neuroscience and speculation about related technologies that might serve to explain them.
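As a concrete example, the oriented Gabor filters found in essentially every computer-vision toolbox (e.g., OpenCV's getGaborKernel) descend directly from models of V1 simple-cell receptive fields; a from-scratch version, with arbitrary default parameters, is only a few lines:

```python
import numpy as np

def gabor_kernel(size=15, theta=0.0, lam=6.0, sigma=3.0, gamma=0.5, psi=0.0):
    """Oriented Gabor kernel: a sinusoid windowed by a Gaussian envelope,
    the standard model of a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier

# A small filter bank at four orientations, as in an edge-energy front end.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```

Convolving an image with such a bank and pooling the responses is, in miniature, the simple-cell / complex-cell story that runs through both the neuroscience and the computer-vision literatures.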

Learning Neural Structure

Peter Li, who is attending a workshop and hackathon at the Janelia Farm Campus of HHMI, mentioned that the drosophila connectome team is nearing completion on the seven-column^12 dataset and that the full connectome will soon be available along with the EM data from which it was generated. Peter and Jon Shlens, who did their Ph.D. work on primate retinas with E.J. Chichilnisky at the Salk Institute before joining Google, have graciously offered to serve as consultants on CS379C projects.

Researchers have successfully constructed the full connectome of one animal: the roundworm C. elegans [177]. Partial connectomes of a mouse retina [21] and mouse primary visual cortex [18] have also been successfully constructed. Bock et al’s [18] complete 12TB data set is publicly available at Open Connectome Project.

I’ve continued to investigate if, how, and where ambiguity and multiple consistent hypotheses are handled in cortex. My conversations with Steven Zucker have focused on the task of finding contours by combining low-level, bottom-up and high-level, top-down information sources. As Steven points out, this task is closely related to the Gestalt notion of closure. It is located (conceptually) in the rather large space of human competencies that current academic information-processing challenges overlook, and squarely in the critical path of our requirements both for tracing neural processes as part of connectomic analysis and for addressing the so-called ‘‘binding’’ problem in our paraphrase, translation and document-summarization work. It’s worth emphasizing this last point: a general solution to this problem would go some way toward solving several key problems of interest to the intelligence community — the ‘‘i’’ in ‘‘iARPA’’ — namely, tracing unpaved roads in satellite data and improving our ability to automatically recognize, translate and summarize all manner of natural-language input — including, alas, our cell-phone conversations.

Steven has a relatively recent paper [191] on the topic that does a good job of describing the problem. He describes the underlying computation in terms of the field equations for a reaction-diffusion process^13 and suggests that the corresponding information processing may involve the glia surrounding an ensemble of spiking neurons. That’s one possible hypothesis, but, given the slim evidence we have to go on at this juncture, one might also conjecture networks of (inhibitory) GABAergic interneurons primed by (excitatory) pyramidal neurons as the substrate for the underlying computations and some sort of mode-seeking, distributed mean-shift as a more appropriate algorithmic basis [61, 45, 114, 176]. In any case, if striate cortex performs this sort of computation, it would be an important discovery from both a scientific and a practical standpoint — the latter since we don’t have good solutions for these problems in our artificial neural networks. Here’s the abstract from Steven’s paper:

Border ownership is an intermediate-level visual task: it must integrate (upward flowing) image information about edges with (downward flowing) shape information. This highlights the familiar local-to-global aspect of border formation (linking of edge elements to form contours) with the much less studied global-to-local aspect (which edge elements form part of the same shape). To address this task we show how to incorporate certain high-level notions of distance and geometric arrangement into a form that can influence image-based edge information. The center of the argument is a reaction-diffusion equation that reveals how (global) aspects of the distance map (that is, shape) can be ‘‘read out’’ locally, suggesting a solution to the border ownership problem. Since the reaction-diffusion equation defines a field, a possible information processing role for the local field potential can be defined. We argue that such fields also underlie the Gestalt notion of closure, especially when it is refined using modern experimental techniques.

Steven also pointed me to a 2014 paper of his on the closely related problem of resolving uncertainty in low-level edge features using curvature constraints as a proxy for high-order statistics. The paper makes interesting (evocative) connections to multiple computational frameworks, from Boltzmann machines and statistical mechanics to Markov and conditional random fields, all of which have been applied to this problem by computer-vision researchers. It’s an interesting ‘‘thought’’ piece, but less problem-focused and hypothesis-driven than the earlier work. Here’s a redacted version of the abstract from [192]:

Vision problems are inherently ambiguous: Do abrupt brightness changes correspond to object boundaries? Are smooth intensity changes due to shading or material properties? For stereo: Which point in the left image corresponds to which point in the right one? What is the role of color in visual information processing? To answer these (seemingly different) questions we develop an analogy between the role of orientation in organizing visual cortex and tangents in differential geometry. Machine learning experiments suggest using geometry as a surrogate for high-order statistical interactions. The cortical columnar architecture becomes a bundle structure in geometry. Connection forms within these bundles suggest answers to the above questions, and curvatures emerge in key roles. More generally, our path through these questions suggests an overall strategy for solving the inverse problems of vision: decompose the global problems into networks of smaller ones and then seek constraints from these coupled problems to reduce ambiguity. Neural computations thus amount to satisfying constraints rather than seeking uniform approximations. Even when no global formulation exists one may be able to find localized structures on which ambiguity is minimal; these can then anchor an overall approximation.

February 12, 2015

In an earlier post, I mentioned that according to the MICrONS BAA we are supposed to come up with a novel, biologically plausible machine-learning algorithm that (presumably) addresses a machine-perception task — auditory, olfactory, visual, etc. I realized last night that several of us at Google are working on a problem that has the following characteristics: (i) it addresses a felt need in computer vision, (ii) it was identified early on by computational neuroscientists, including Shimon Ullman, Elie Bienenstock, David Mumford and Stuart Geman, (iii) we have some ideas for solving the problem using deep neural networks, and (iv) it may have a biological solution that appears early in the visual pathway.^14 My notes from last night:

Here is a paper by Geoff Hinton and two of his students [69] that has gained some interest in the neural-network community as it identifies a shortcoming of convolutional architectures and indeed most — all that I know of — neural-network architectures that attempt object recognition. The same shortcoming also applies to many of the most popular non-NN approaches, including spatial pyramid matching [55, 184]. I include in this sweeping generalization alternating-simple-complex-cell hierarchies such as HMAX.

The problem, as Geoff points out, is that the otherwise desirable property of learning invariant features, achieved primarily through the use of multiple stages of pooling, results in increasingly coarse filters that, while appropriately succeeding on some ambiguous cases, thereby increasing recall, fail on others that would not be ambiguous were it not for the pooling, thereby decreasing precision. Stu Geman [49] refers to this as the selectivity-invariance tradeoff, in analogy to the bias-variance tradeoff in statistics.

Simplifying somewhat, Geoff’s strategy in resolving the dilemma is to use pooling to increase recall but include geometry — for example, the coordinates of the maximum value as identified in the frame of reference of the associated receptive field — in a separate channel so as to filter out false positives further along in the processing. This might be important, say, in the case of resolving detail in a low resolution image where context is critical in recognizing objects.

For example, a pink blob in the corner of an image is initially unrecognizable, but, as it becomes increasingly clear that we’re looking at a scene of a barnyard, it is much more likely that the pink blob is a pig and the additional information in the side channel can now be used to check on other constraints that might confirm or disconfirm this hypothesis, e.g., the blob is close enough to the ground for the ersatz pig to be standing on solid earth and an elliptical bounding box is about the right size for a pig as seen from the inferred distance to the barn.
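The side-channel idea above can be sketched as standard max-pooling augmented to emit each maximum's in-window coordinates (the window size and toy array are arbitrary):

```python
import numpy as np

def maxpool_with_coords(x, k=2):
    """2-D max-pooling that also returns, for each window, the (row, col)
    offset of the winning unit -- the geometric side channel."""
    h, w = x.shape
    vals = np.empty((h // k, w // k))
    coords = np.empty((h // k, w // k, 2), dtype=int)
    for i in range(0, h, k):
        for j in range(0, w, k):
            window = x[i:i + k, j:j + k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            vals[i // k, j // k] = window[r, c]
            coords[i // k, j // k] = (r, c)
    return vals, coords

x = np.array([[1., 9., 2., 3.],
              [4., 5., 6., 7.],
              [8., 0., 1., 2.],
              [3., 4., 5., 6.]])
vals, coords = maxpool_with_coords(x)
# vals[0, 0] is 9.0 and coords[0, 0] is (0, 1): the max and where it was.
```

The `vals` channel gives the usual invariant response; the `coords` channel retains exactly the geometry that pooling normally discards, available to later stages for checking spatial constraints.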

The first mention of the problem in the neuroscience and neural-network literature that I am aware of appeared — along with a proposed solution — in the work of Shimon Ullman [170, 171]. It has also surfaced in discussions of compositional features and the importance of compositionality in explaining visual processing in the ventral stream — see Bienenstock, Geman and Potter [14, 15] for some early, influential work and Hanuschkin et al [63] for a more recent example looking at sequential compositionality in feed-forward networks as a model of complex song generation in the Bengalese finch.

January 19, 2015

In chasing citations forward and backward starting from the Fernández et al paper [43], I turned up a bunch of older work on hierarchical RNN models, including the oft-cited but mainly-of-historical-interest NIPS paper by El Hihi and Bengio [41]. I didn’t find any useful techniques that haven’t been incorporated into newer models or discarded for good reason. Regarding work that followed Fernández et al, I recommend Koutník et al [91], which appeared in last year’s ICML. This work, which came out of Schmidhuber’s lab, is memorable for its coinage of the name ‘‘Clockwork RNN’’ and bears all the marks of Jürgen’s encyclopedic knowledge of the history of neural networks. Koutník et al compare their CW-RNN models with Fernández et al’s CTC-trained DBLSTM models.

If nothing else, I recommend you skim through the related-work section for an excellent if brief review of past work on hierarchical neural network models. The ideas mentioned include using delayed forward connections, hidden units operating at different time scales, recurrent connections with time lags, leaky-integrate-and-fire neurons that introduce variable-duration hysteresis, equivalently weighted self connections that decay exponentially in time, and Schmidhuber’s own sequence-chunking and neural-history-compression work. The authors also allude to the repertoire of models from control theory including variable-time-scale and hierarchical variants of HMMs, Kalman filters, and continuous-time stochastic processes, all of which is interesting but largely irrelevant to our use cases.

Lest I mislead you, Clockwork RNN (CW-RNN) models are not hierarchical except in the sense that they facilitate reasoning about processes at multiple time scales. If that’s your working definition of abstraction, then you might characterize them as hierarchical. In a CW-RNN, the hidden layer is ‘‘partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.’’ The authors claim that ‘‘[r]ather than making the standard RNN [SRN] models more complex, CW-RNN reduces the number of SRN parameters, improves the performance significantly in the tasks tested [TIMIT spoken-word classification and handwriting recognition], and speeds up the network evaluation.’’ The modules can be ordered — and even stacked — to emphasize the implicit temporal hierarchy:

Clockwork Recurrent Neural Networks [CW-RNN] like SRNs, consist of input, hidden and output layers. There are forward connections from the input to hidden layer, and from the hidden to output layer, but, unlike the SRN, the neurons in the hidden layer are partitioned into g modules of size k. Each of the modules is assigned a clock period Tn ∈ {T1, ..., Tg}. Each module is internally fully interconnected, but the recurrent connections from module j to module i exists only if the period Ti is smaller than period Tj. Sorting the modules by increasing period, the connections between modules propagate the hidden state right-to-left, from slower modules to faster modules.
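
To make the partitioning concrete, here is a minimal NumPy sketch of a single CW-RNN timestep under the connectivity rule quoted above. The module count, sizes, periods and weights are invented for illustration; none of this comes from the Koutník et al code.

```python
import numpy as np

rng = np.random.default_rng(0)
g, k, n_in = 4, 3, 5              # g modules of size k, input dimension
periods = [1, 2, 4, 8]            # clock periods T_1..T_g, sorted increasing
H = g * k

W_in = rng.normal(scale=0.1, size=(H, n_in))
W_hh = rng.normal(scale=0.1, size=(H, H))

# Module j feeds module i only if T_i < T_j (slower -> faster), and each
# module is fully self-connected; zero out every other recurrent block.
for i in range(g):
    for j in range(g):
        if i != j and not periods[i] < periods[j]:
            W_hh[i*k:(i+1)*k, j*k:(j+1)*k] = 0.0

def cw_rnn_step(h, x, t):
    """Update only the modules whose clock fires at time t."""
    h_new = np.tanh(W_in @ x + W_hh @ h)
    for i, T in enumerate(periods):
        if t % T != 0:                        # module i is dormant at t
            h_new[i*k:(i+1)*k] = h[i*k:(i+1)*k]
    return h_new

h = np.zeros(H)
for t, x in enumerate(rng.normal(size=(8, n_in))):
    h = cw_rnn_step(h, x, t)
```

The dormant modules simply carry their state forward, which is what lets the slow modules preserve long-range context cheaply.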

CW-RNN models are primarily useful for handling long-term dependencies, and it is in this regard that they extend LSTM constant-error carousels to efficiently handle even longer-term dependencies. The basic idea can be applied to complement the Fernández et al CTC-trained DBLSTM models. Since the sub-sequence abstractions that characterize Shalini’s hierarchical document modeling task are variable in span, the latter is most likely to prove useful for our purposes, but given the potential value of keeping track of relevant context over the span of several paragraphs or chapters, it may prove useful to substitute CW-RNN hidden layers for the vanilla LSTM hidden layers in Fernández et al.

In general, it seems unlikely we will have supervision in the form of target sequences at intermediate levels in the hierarchy. However, given sufficient data, Fernández et al suggest it may be possible to discover enough structure to automatically identify structural boundaries:

Finally, the total error signal Δ_i received by network g_i is the sum of the contributions in equations (12) and (13). In general, the two contributions can be weighted depending on the problem at hand. This is important if, for example, the target sequences at some intermediate levels are uncertain or not known at all.
Δ_i = λ_i Δ_i^target + Δ_i^backprop
with 0 ≤ λ_i ≤ 1. In the extreme case where λ_i = 0, no target sequence is provided for the network g_i, which is free to make any decision that minimizes the error Δ_i^backprop received from levels higher up in the hierarchy. Because training is done globally, the network can, potentially, discover structure in the data at intermediate levels that results in accurate predictions at higher levels of the hierarchy.
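
As a worked instance of the per-level weighting in Fernández et al, with made-up error values, the two extremes of the mixing coefficient look like this:

```python
# Each level i mixes its own target error with the error backpropagated
# from the level above. The numeric values here are invented.

def total_error(delta_target, delta_backprop, lam):
    assert 0.0 <= lam <= 1.0
    return lam * delta_target + delta_backprop

# Level with reliable intermediate targets: weight them fully.
full = total_error(0.25, 0.5, 1.0)    # 0.75

# Level with no intermediate targets (lam = 0): trained purely by the
# error received from higher levels of the hierarchy.
none = total_error(0.25, 0.5, 0.0)    # 0.5
```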

In the case of the hierarchical document model, we have supervision at the word, sentence, and paragraph level, though not nearly as precise or informative as the phoneme-level supervision provided in the case of TIDIGITS — the speaker-independent connected-digit speech recognition problem and dataset. The hierarchy starting with MFCCs on the lowest rung, followed by phonemes and culminating in words, is of a very different sort than one starting with words and culminating in paragraph or chapter boundary markers15.

If we have supervision for intermediate layers, then we can use Graves’ DBLSTM-plus-CTC model [58, 43] along with the feedforward topology described in [57] and a modification to the output layers for the LSTM hidden layers. In the next installment, we consider the case in which we don’t have sources of intermediate supervision.

January 17, 2015

Here’s a paper by Fernández, Graves and Schmidhuber [43] that describes a hierarchical model similar to the sort of thing we’re after. Specifically, their model supports multiple levels of abstraction and provides an elegant solution to the problem of handling abstractions that span arbitrary-length sub-sequences of the input. This is accomplished by employing an extended label set L′ = L ∪ {BLANK}, and introducing the notion of a path corresponding to a sequence of labels from L′. A few weeks ago Javier and I traded email about Graves’ use of the BLANK label in his thesis, but this IJCAI paper focused my attention and gave me a greater appreciation for how Graves uses it.

The same label can appear multiple times consecutively in the output sequence; however, the interpretation of repeated-label sequences is necessarily a bit more complicated when it comes to defining a probability measure on output sequences. Each layer of the network is implemented as a CTC (Connectionist Temporal Classification) layer — introduced by Graves at ICML the year before [58] and refined in his thesis, which came out the following year [56] — and each CTC layer has its own softmax layer which ‘‘forces the hierarchical system to make decisions at every level and use them in the upper levels to achieve a higher degree of abstraction. The analysis of the structure of the data is facilitated by having output probabilities at every level.’’
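
To see how the extended label set behaves, here is a toy version of the CTC collapsing map, assuming the standard convention of first merging repeated labels and then deleting blanks; the label alphabet is invented.

```python
# Collapse a CTC path over L' = L ∪ {BLANK} to a labelling: merge runs of
# identical labels, then remove blanks. Repeats separated by a blank survive.

BLANK = "-"

def collapse(path):
    merged = [path[0]] + [b for a, b in zip(path, path[1:]) if b != a]
    return [s for s in merged if s != BLANK]

collapse(list("aa-ab-"))   # -> ["a", "a", "b"]
collapse(list("--a--b"))   # -> ["a", "b"]
```

This is exactly why the BLANK label matters: without it there would be no way to emit the same label twice in a row.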

They compare their approach with hierarchical HMMs — using the HMM Tool Kit from Steve Young’s group at Cambridge — on a spoken-digits task. These experiments involve relatively small — ~1/4 million parameters — models applied to relatively simple problems, but presumably this is the same architecture that Graves used to obtain the state-of-the-art speech-recognition results reported in his 2013 paper with Navdeep Jaitly [59].

Shalini plans on taking advantage of the implicit abstraction-boundary markers available in documents, e.g., end-of-sentence, end-of-paragraph and end-of-chapter markers, in developing and testing her ideas for hierarchical LSTM models, and this is definitely the right way to move forward as it allows for greater supervision in training intermediate layers of the network and provides ground truth to test unsupervised approaches.

Javier has proposed that we might be able to use MCR to facilitate unsupervised learning to segment: ‘‘We do have labels for the word level but the problem is that we do not have labels for the higher layers sentences, paragraphs, etc. One option is to create these labels (actually vectors) using MCR, so we can train the upper layers using sequences of MCR as target, and let the network to learn the segmentation.’’ The three of us — and anyone else who wants to participate — should set aside some time to talk about these ideas in detail next week.

Fernández et al mention but don’t compare head-to-head with Sanjay Kumar and Martial Hebert’s work on hierarchical CRF models [93]. It would be interesting to get Sanjay’s take on state-of-the-art CRF technology for this application, though it’s my impression from talking with Kevin Murphy that work on CRF models has taken a backseat to deep neural networks, largely due to the complexity of training such models. It is also probably worth checking with Andrew McCallum and Geoff Hinton, both of whom have experimented extensively with hierarchical CRF models for, respectively, language processing [163] and computer vision [126]. I’ll have to go back and read the Graves and Jaitly paper [59] describing their state-of-the-art, end-to-end ASR system implemented as a deep bidirectional long-short-term-memory recurrent neural network trained with the connectionist temporal classification objective function.

January 15, 2015

The success of the Socher et al [152] matrix-vector models on parsing and sentiment analysis helped lead to a resurgence of interest in tensor-product models, though at first they didn’t characterize their work as employing tensors and in fact contrasted it with prior work involving tensors. They did, however, recognize that they were using compositional vector spaces to represent the variability in word meanings.

Socher et al suggest that every word might be represented by an n-dimensional vector plus an n × n matrix, but noted that the dimensionality of the model becomes too large for the vector sizes — n = 100 or larger — commonly used in practice, and so ‘‘[i]n order to reduce the number of parameters, [they] represent word matrices by the following low-rank plus diagonal approximation: A = UV + diag(a), where U ∈ ℝn × r, V ∈ ℝr × n, and a ∈ ℝn.’’
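
A quick sketch of the parameter savings from the low-rank-plus-diagonal trick, with illustrative sizes (n = 100, r = 4) rather than anything taken from the paper:

```python
import numpy as np

# Per-word matrix A = U V + diag(a): store n*r + r*n + n numbers instead
# of the full n*n. Sizes and random values are purely illustrative.

n, r = 100, 4
rng = np.random.default_rng(1)
U = rng.normal(size=(n, r))
V = rng.normal(size=(r, n))
a = rng.normal(size=n)

A = U @ V + np.diag(a)                 # reconstructed word matrix

full_params = n * n                    # 10,000 per word
lowrank_params = n * r + r * n + n     # 900 per word
```

For a vocabulary of tens of thousands of words the factor-of-ten reduction per word is what makes per-word matrices feasible at all.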

In Socher et al [154] they back off from allocating a matrix to every word and start using tensor models to represent the complex relationships between words that shade their composite meanings. Geoff Hinton has a long history of incorporating non-linear, multiplicative components into networks to provide greater flexibility and representational power. Geoff and his students were using bilinear16 and tensor17 models long before the rest of the NIPS crowd caught on. NTN (Neural Tensor Network) models [151] are particularly good at reasoning about the multiplicative interactions between relationships in natural-language processing [117, 154] and natural-logic inference [20]. This log entry focuses on work by Srikumar and Manning [156] using an NTN model to learn distributed representations for predicting structured output in the form of sequences, segmentations, labeled trees and arbitrary graphs.

Let x ∈ X be an input corresponding to a sentence or document and y ∈ Y an output structure representing x. Φ(x,y) → ℝn is a feature function that captures the relationship between x and y as an n-dimensional vector and w ∈ ℝn is a linear scoring model so that arg maxy wT Φ(x,y) defines the combinatorial optimization problem of finding the best structure y representing x. Note that this problem is at the heart of several parsing, alignment and segmentation problems.
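
A minimal sketch of this setup, with a toy feature function and an explicit candidate set standing in for the combinatorial search over Y; the tags, features and weights here are all invented:

```python
import numpy as np

TAGS = ["N", "V"]

def phi(x, y):
    # toy features: indicator counts of (tag, word-length parity) pairs
    v = np.zeros(2 * len(TAGS))
    for word, tag in zip(x, y):
        v[2 * TAGS.index(tag) + len(word) % 2] += 1.0
    return v

def predict(x, candidates, w):
    # arg max_y  w . phi(x, y) over an explicit candidate set
    return max(candidates, key=lambda y: float(w @ phi(x, y)))

x = ("dogs", "run")
candidates = [("N", "N"), ("N", "V"), ("V", "N"), ("V", "V")]
w = np.array([1.0, 0.0, 0.0, 1.0])   # invented weights
```

In a real parser the arg max is over exponentially many structures and is computed with dynamic programming rather than enumeration; the linear-scoring form is the same.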

Here is Figure 1 from Srikumar and Manning [156] illustrating the three running examples used in the paper — which, by the way, are enormously helpful in understanding the paper:

The output is shown as one or more parts, each of which consists of a sequence of labels, e.g., yp = (y0, y1) is a part with two labels. Let L be the set of all M labels {l1, ..., lM}, e.g., a set of part-of-speech tags. We denote the set of parts in the structure for input x by Γx and each part p ∈ Γx is associated with a list of discrete labels, denoted by yp = (yp0, yp1, ...). To represent the various relationships among parts — e.g., emissions in the case of sequences or compositions in the case of nodes in a parse tree — the authors employ a set of d-dimensional unit vectors al, one for each label, which together constitute the columns of a d × M matrix A.

Since we are assuming that the labels are related to one another through multiplicative interactions, we model those interactions with a tensor whose elements essentially enumerate every possible combination of elements from input φ and label {ai} vectors — see here to understand how this multiplicative mixing is accomplished as a tensor product. Ψ(x, yp, A) is the recursively-defined feature tensor function that produces the feature representation vector ΦA(x, y), not to be confused with the input feature vector φ(x).

The definition of the feature tensor function is straightforward but notationally cumbersome, so check out the paper for the details. In lieu of the full details, the authors’ account of one of their running examples should give you a pretty good idea of how it works. The following graphic expands the tensor for the case of a compositional part with two labels — the middle example in the above figure. The vec(.) operator vectorizes a matrix or tensor by concatenating, respectively, the column vectors of a matrix or (recursively) the two-dimensional slices of a tensor:

Note that tensor products are associative and distributive but not commutative. We can expand A ⊗ φp(x) as a sequence of tensor products, as in A ⊗ φp(x) = al1 ⊗ al2 ⊗ al3 ⊗ φp(x). The size of the feature representation vector is exponential in the number of labels M. In the part-of-speech-tagging example there were on the order of 50 labels — 45 for English and 64 for Basque. The paper doesn’t directly indicate that this vector is sparse, but practically speaking it would have to be. I’ve asked the first author to resolve the ambiguity in defining Φ, the feature tensor function18.
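
The cited algebraic properties are easy to check numerically, with Kronecker products standing in for the tensor products (the standard vec-style flattened representation); the vector sizes below are arbitrary:

```python
import numpy as np

# Kronecker products of label vectors with input features: associative and
# distributive, but not commutative, and the result size is the product of
# the factor sizes. Vectors are random and purely illustrative.

rng = np.random.default_rng(2)
a1, a2 = rng.normal(size=3), rng.normal(size=3)
phi = rng.normal(size=5)

left = np.kron(np.kron(a1, a2), phi)
right = np.kron(a1, np.kron(a2, phi))
# associativity: both groupings flatten to the same 3 * 3 * 5 = 45 entries
# non-commutativity: np.kron(a1, a2) differs from np.kron(a2, a1)
```

The multiplicative size growth is exactly why the full feature representation blows up with the number of labels per part and has to be handled sparsely in practice.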

Training is supervised with labeled data for parsing and PoS tagging from the Penn Treebank or using the Stanford parser. The trained model can be used to score word sequences / sentence fragments with respect to the target class of structures or to provide a representation for classification or related interpretation tasks. The PoS experiments were interesting if not conclusive. I’ve asked Vivek, the first author, whether there have been any follow-on experiments with other problems such as paraphrasing.

Training is complicated due to the non-convex nature of the loss function. An alternating optimization is presented in which the alternation is between minimizing f(w, A) with respect to w while holding A fixed and minimizing f(w, A) with respect to A while holding w fixed, where the two restricted minimizations are convex. This appears to converge and provide good results, and the method sounds entirely reasonable given my prior experience.
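
A toy instance of the alternating scheme, using a rank-one factorization objective f(w, a) = ||M − w aᵀ||² that is jointly non-convex but convex in each argument separately; this is a stand-in for the paper’s actual objective, with all values invented:

```python
import numpy as np

# Alternate the two closed-form least-squares solves: fix a and solve for w,
# then fix w and solve for a. M is rank-one plus a small offset so the
# factorization is easy to recover.

u = np.arange(1.0, 7.0)                # 6-vector
v = np.array([1.0, -2.0, 0.5, 3.0])    # 4-vector
M = np.outer(u, v) + 0.01              # rank-one plus small perturbation

w = np.ones(6)
a = np.ones(4)
for _ in range(50):
    w = M @ a / (a @ a)        # argmin_w f(w, a), a held fixed
    a = M.T @ w / (w @ w)      # argmin_a f(w, a), w held fixed

residual = np.linalg.norm(M - np.outer(w, a))
```

Each step can only decrease the objective, which is why the alternation reliably converges even though it may only reach a local minimum in general.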

Residual thoughts: Remember to write a brief summary of the paraphrase [11] and document-summarization [12] work being done by Jonathan Berant, Percy Liang and Vivek Srikumar in Chris Manning’s lab.

January 11, 2015

In MacCartney [103] and Bowman et al [20], the objective is to classify the semantic relationships between pairs of sentences or sentence fragments corresponding to the textual analogs of well-formed natural-logic formulae19. This objective is realized in Bowman’s RN[T]N models as an output layer implementing a softmax over the set of semantic relations {→, ←, ↔, ..., ≠}20. For example, given the two natural-logic sentences ‘‘all reptiles walk’’ and ‘‘some turtles move’’, the softmax layer might yield P(→) = 0.9. Note that natural-logic textual analogs of logical formulae pack a lot into few words. For example, ‘‘all reptiles can walk’’ might be represented as ∀ x, isa(x,reptile) ⊃ walk(x) in first-order logic.
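
A toy version of such an output layer, with ASCII names standing in for the relation symbols {→, ←, ↔, ≠} and made-up logits standing in for the network’s comparison scores:

```python
import numpy as np

# Softmax over a small, invented set of semantic relations. In Bowman's
# models the logits would come from the comparison layer of the network.

RELATIONS = ["forward_entail", "reverse_entail", "equiv", "neg"]

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

logits = np.array([3.0, 0.5, 0.2, -1.0])
p = softmax(logits)                # p[0] plays the role of P(->)
```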

Of course you can unpack a natural-logic sentence as in ‘‘men are mortal’’ ∧ ‘‘Socrates is a man’’ ⊃ ‘‘Socrates is mortal’’, but it’s not common in everyday speech to be so pedantic. However, in the case of applying neural networks to infer semantic relations, the more concise — packed — version of the problem comparing ‘‘all men are mortal’’ and ‘‘Socrates is mortal’’ is somewhat easier to handle as it simplifies the monolingual-alignment problem — the neural network has to align ‘‘all men’’ with ‘‘Socrates’’ and realize that the latter is a subset of the former and thus downward-monotone [103]. We can’t expect inference problems to always be so conveniently structured, and more complicated nested formulae such as we find in the SICK (Sentences Involving Compositional Knowledge) corpus (HTML) are more likely to represent the norm.

Not only will we want to determine if a statement follows from statements expressed earlier in a conversation — a relatively simple directed task — but it will also be important to draw conclusions — particularly commonsense ones — that follow from what was said — a potentially open-ended task that we humans do all the time without breaking a sweat. Traditionally, declarative knowledge in the form of simple rules, e.g., Horn clauses, is used to derive new knowledge from old by either (a) working forward from antecedents — that we know to be true — to consequents using modus ponens or (b) working backward from consequents — that we hypothesize to be true — to antecedents which if true support the consequent using modus tollens.

Neither of these derivation / proof strategies is guaranteed to derive everything that follows from a set of facts and a set of rules in polynomial time. Hence, whatever sort of theorem prover, production system or logic-programming language we employ, we will have to be satisfied with a heuristic solution, albeit one guided by experience. In keeping with our interest in seeing how far one can go without resorting to GOFAI, consider how we might employ textual surrogates for quantified formulae — rules — and implement their instantiation and application in terms of vectors:

If you compress the meaning of words, phrases and sentence fragments into vectors in an embedding space — their meaning implicit in their proximity and direction to other vectors — how do you get meaning, words and nuance back out when you need them to express yourself or understand someone else? There is an important difference between a language model (LM) that we would learn using, say, SKIP-GRAM or CBOW, and the output layer of the encoder LSTM in the Sutskever et al [162] or Cho et al [27] approaches: the LM allows us to index the vectors for millions of words and short phrases via a one-hot vector, while the LSTM encoder allows us to generate a single vector representing a sequence of words of variable length, where that representation is optimized by the learning algorithm to facilitate computing the probability of the next word via the softmax layer.

To what extent can these encoder vectors serve as an inferential proxy for the input text beyond simply helping us to predict the next word — consider what the encoder has to incorporate into its recurrent output in order to correctly predict the appropriate type of reform in ‘‘As part of his Square Deal with the American people President Roosevelt introduced sweeping economic reform.’’ and ‘‘As part of his New Deal with the American people President Roosevelt introduced sweeping social reform.’’ There is clearly enough information to distinguish ‘‘Theodore’’ from ‘‘Franklin’’ in this context, but the words ‘‘New’’ and ‘‘Square’’ are more important for predicting the penultimate word.

A bag-of-words representation that includes ‘‘Deal’’ plus one of ‘‘New’’ or ‘‘Square’’ is likely to fall short for prediction — assuming the vocabulary doesn’t include the bigrams ‘‘New_Deal’’ and ‘‘Square_Deal’’ — but at least the LSTM encoder has the potential to differentiate between ‘‘New Deal’’ and ‘‘Square Deal’’. It would, however, potentially be useful in preparing to generate a response to have distinguished ‘‘Franklin Roosevelt’’ from ‘‘Theodore Roosevelt.’’

LSTM layers can be trained to compute differences between vectors, thereby ‘‘backing’’ out terms ‘‘bound’’ in a vector composition via, say, superposition. In principle, it should be possible for a layer to take the embedding for a sequence of words, say, ‘‘Franklin Roosevelt was an effective social reformer’’, back out some bound term, say, ‘‘Franklin’’, substitute an alternative term, say, ‘‘Theodore’’, and then check the result against a collection of natural-logic assertions including, say, ‘‘Theodore Roosevelt was an effective economic reformer’’ and find no or scant supporting evidence.
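
Here is the differencing idea in its simplest possible form, with contrived two-dimensional ‘‘embeddings’’ so the arithmetic is transparent; real embeddings would come from a trained model, and the clean cancellation below would then only hold approximately:

```python
import numpy as np

# Back out a bound term from a superposed composition and substitute an
# alternative. All vectors are invented for illustration.

emb = {
    "franklin": np.array([1.0, 0.0]),
    "theodore": np.array([0.0, 1.0]),
    "roosevelt_reformer": np.array([2.0, 3.0]),
}

sentence = emb["franklin"] + emb["roosevelt_reformer"]    # superposition
swapped = sentence - emb["franklin"] + emb["theodore"]    # back out, rebind
```

With superposition as the binding mechanism the subtraction is exact; richer binding schemes (e.g., the multiplicative ones in MCR or PSI) need a corresponding unbinding operator instead of plain subtraction.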

This simple differencing approach to binding variables may not be as effective as those used in MCR [149] (Modular Compositional Representation) or PSI [178] (Predication-based Semantic Indexing), but it may be worth running some experiments given that the differencing method could turn out to be easier to implement within the family of NN architectures we are considering, and, if not, perhaps we will figure out a hybrid approach that offers an attractive compromise. In that spirit, here’s a proposal for a quick-and-dirty experiment.

Given the arguably-false, natural-logic sentence ‘‘Franklin Roosevelt was an effective economic reformer’’ issued in the context of a conversation, we might ask if the utterance is entailed by ‘‘Theodore Roosevelt was an effective economic reformer’’ or ‘‘Franklin Roosevelt was an effective social reformer’’. The answer should be effectively ‘‘no’’ as indicated by P(→) < 0.5, using an inference ‘‘engine’’ like NLI. However, the sentence ‘‘Franklin Roosevelt was an effective social activist’’ would seem more probable given ‘‘Franklin Roosevelt was an effective social reformer’’.

Consider the following four sentences: (1) ‘‘Roosevelt was an effective social activist’’, (2) ‘‘social activists are generally social reformers’’, (3) ‘‘social reformers are generally advocates for improving the lives of the disadvantaged and impoverished’’, and (4) ‘‘Franklin Roosevelt significantly improved the lives of the poor and unemployed during the Great Depression’’. Is it reasonable to expect that a neural network model for natural-logic inference could conclude (4) from (1-3)?

There are a couple of key inference problems that we’re interested in addressing: Suppose the model has been trained on Wikipedia including the following sentence: ‘‘Roosevelt signed into law several bills that provided relief to those who lost their jobs in the Great Depression.’’ How might we answer the user’s question: ‘‘Did Roosevelt’s New Deal help the poor?’’ For one thing, the system would have to infer the user is talking about Franklin and not Theodore. And then there is the problem of bringing inference to bear on the selection of words and the form of the reply in the process of generating responses.

Given Descartes’ interest in simulating historical figures, we might hope to have a substantial corpus of dialogue like: Q: ‘‘Did McKinley have any impact on global trade ...’’, A: ‘‘William McKinley was more of a political follower than leader ...’’ and Q: ‘‘What economic policy was Grover Cleveland known for ...’’ A: ‘‘President Cleveland signed legislation ...’’. The specific answers found in the training corpus may not be appropriate to the conversation at hand, but the generative machinery — of the LSTM decoder trained on this corpus — may provide a script we can use to respond with an appropriate dialog / speech act.

We could supplement the dialog corpus as a source of such syntactic structure and intentional cues with knowledge specific to the entities named in the user input. For example, given a question or comment about Teddy Roosevelt’s economic impact, we could make use of ‘‘Roosevelt passed legislation ... antitrust laws ... regulating monopolistic trust corporations ...’’, or, asked about McKinley’s impact on global finance following the American Civil War, we could ‘‘fold in’’ the Wikipedia statement ‘‘McKinley ... raised protective tariffs to promote American industry ... maintained the gold standard in opposition to ...’’.

In earlier posts, I gave short shrift to the power of natural logic, thinking we would require some analog of conventional rules. Now I’m not so sure. Emulating forward or backward chaining in vector space may be more trouble than it’s worth. It depends on how closely we want to cleave to the syntactic and semantic precision of predicate calculus which is generally seen as being at odds with the flexibility and useful ambiguity of natural language. Full-blown logical inference would require vector / tensor machinery for applying rules of inference including universal instantiation, existential generalization, modus ponens, and applying DeMorgan’s laws as well as other transformations. Doable but worth putting off until we have a compelling use case that simpler natural-logic can’t handle21.

In natural logic the statement ‘‘Pop stars take drugs’’ is shorthand for ∀ x, isa(x,pop_star) ⊃ ∃ y, drug(y) ∧ ingest(x,y), but translating such statements into well-formed formulae is tedious and error prone. Natural-logic inference allows us to determine whether or not ‘‘Pop stars take drugs’’ entails ‘‘Michael_Jackson takes drugs’’. In fact, while natural logic does not support this conclusion, it would be interesting if we could derive — or would have trouble denying — a weak form of entailment for ‘‘Michael_Jackson takes drugs’’ from ‘‘Steven_Tyler mainlines heroin’’, ‘‘Alvin_Lee shoots crystal meth’’ and ‘‘Keith_Richards smokes crack’’ in the absence of any statements of the sort ‘‘Donny_Osmond doesn’t take drugs’’.

Residual thoughts: Read the Ba et al [5] paper on training an object recognition system with an attentional component that ostensibly avoids the overhead of convolutional layers — lots of wasted dot products — applied to large images. Also read their earlier work, Mnih et al [119], and scanned the related paper by Maes et al [106]. I was primarily interested in whether their model provides any insights into how we might build an attentional component to control inference. Ba et al start with a global context / salience map, and following each subsequent saccade they obtain a new foveated view of the image 22. In the case of dialog, the analog of a gist-like context might be a bag-of-words representation and salient regions might correspond to high-entropy sub-sequences of the input history — think of the textual analog of Itti and Koch [76, 77] salience heuristically related to recency and novelty.

Using Ba et al as motivation, Ilya Sutskever gave a technical talk on basic reinforcement learning in the Brain reading seminar, and Viren Jain mentioned Sebastian Seung’s paper [145] on how:

[T]he randomness of synaptic transmission is harnessed by the brain for learning, in analogy to the way that genetic mutation is utilized by Darwinian evolution. This is possible if synapses are ‘‘hedonistic,’’ responding to a global reward signal by increasing their probabilities of vesicle release or failure, depending on which action immediately preceded reward. Hedonistic synapses learn by computing a stochastic approximation to the gradient of the average reward. They are compatible with synaptic dynamics such as short-term facilitation and depression and with the intricacies of dendritic integration and action potential generation.

Sebastian uses the REINFORCE algorithm of Baxter and Bartlett [9] to demonstrate how a network of hedonistic synapses can be trained to perform a desired computation by administering reward appropriately, as illustrated here through numerical simulations of integrate-and-fire model neurons. Coincidentally, a paper in the latest issue of Neuron by Tremblay et al [168] shows how the simultaneous activity of ensembles of neurons in the primate lateral prefrontal cortex can be decoded to reliably predict the ‘‘allocation of attention on a single-trial basis’’. Oh, and I’ll probably never get back to this, but as I was looking for related work by Mnih I stumbled on a paper by Mnih and Hinton [118] on ‘‘hierarchical language models’’ which claims to constrain chunking to binary relations on consecutive terms or conjunctions of terms and — perhaps — identify the tags/terms corresponding to parts of speech, e.g., prepositions, conjunctions, etc., and group accordingly.
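
To make the hedonistic-synapse idea concrete, here is a toy REINFORCE-style simulation of a single stochastic synapse whose release probability is nudged by a global reward. The task (reward release itself) and all constants are invented, and this is far simpler than Seung’s integrate-and-fire simulations:

```python
import numpy as np

# Vesicle release is Bernoulli with probability sigmoid(theta). A global
# reward signal, combined with the REINFORCE eligibility (release - p),
# moves theta so that rewarded actions become more likely.

rng = np.random.default_rng(4)
theta = 0.0                          # release probability = sigmoid(theta)
lr = 0.5

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-theta))
    release = rng.random() < p        # stochastic release or failure
    reward = 1.0 if release else 0.0  # invented task: releasing is rewarded
    eligibility = (1.0 if release else 0.0) - p
    theta += lr * reward * eligibility

p_final = 1.0 / (1.0 + np.exp(-theta))
```

After training, the release probability is driven close to one, which is the single-synapse analog of a network climbing the gradient of average reward.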

November 29, 2014

I’m preparing for the Spring edition of my computational neuroscience class (CS379C) at Stanford, and this year we’ll be looking at machine-learning methods for functional connectomics. While I hope to get interesting recordings of neural activity from Ed’s lab and Clay’s team at the Allen Institute, I’d also like to be able to generate synthetic data by simulating molecular models using MCell or NEURON in order to conduct controlled computational experiments to get a better handle on what’s possible. Here is a video showing off an EM reconstruction of cells in rat hippocampus from the Salk Institute:

Reconstruction of a block of hippocampus from a rat approximately 5 micrometers on a side from serial section transmission electron microscopy in the lab of Kristen Harris at the University of Texas at Austin in collaboration with Terry Sejnowski at the Salk Institute and Mary Kennedy at Caltech. Josef Spacek, Daniel Keller, Varun Chaturvedi, Chandrajit Bajaj, Justin Kinney and Tom Bartol made major contributions to the reconstruction and the video. (YOUTUBE)

This reconstruction was used to conduct molecular-scale simulations of hippocampal neurons for the purpose of investigating hypotheses concerning the role of extra-synaptic neurotransmitter diffusion — sometimes referred to as ‘‘ectopic’’ transmission [29, 102]. Jed Wing was a key contributor to MCell and probably knows more about the code base than anyone else. Justin Kinney did a lot of the modeling work, which is described in Justin’s Ph.D. Thesis at UC San Diego: (PDF).

I would like to find a similarly-detailed molecular model and instrument it so that we can simulate a cortical circuit consisting of something on the same order of complexity as the Hill and Tononi work [68] — on the order of ~1000 cells and millions of connections. I’m in touch with Justin Kinney who is now in Ed Boyden’s lab and I will reach out to Terry Sejnowski and others on the MCell team at the Salk Institute to identify suitable models23.

The motivation is not to supplant the use of recordings from neural tissue but rather to anticipate the advent of such data, and examine the hypothesis that we will actually be able to make sense of such data when it is available at scale. There is a database of models indexed by simulator available on the NEURON website. There is only one MCell model listed on the Yale website. As one might expect, there are quite a few models that can be run on NEURON. There is probably a more extensive list of models for MCell available elsewhere.

Another problem for which synthetic data might prove useful is in testing algorithms that make use of an adjacency matrix obtained from the connectomic analysis of EM data to infer function, classify cell types or estimate the existence or strength of synapses. Suppose that we have an adjacency matrix obtained from a connectome for which we have additional information about the location, size and perhaps layer of cortex for the cell bodies of the neurons represented in the matrix.

Suppose that we also have an error model that provides information about the probability of an error in tracing an axonal or dendritic process as a function of the process length or other characteristics that can be reliably inferred from current state-of-the-art connectomic analysis, assuming tractable human correction / annotation. For small circuits, we might be able to combine this synthetic connectomic data with simulated calcium imaging of the sort described above.

November 5, 2014

Summary of the near-term goal of connectomics: high-throughput measurement of wiring diagrams in arbitrary nervous systems at the scale of microcircuits (~10^5 to 10^6 neurons and their synaptic interconnections). The measurement will be destructive (invasive) for now, pending fundamental advances in non-invasive nanometer-resolution imaging.

Technological impacts:

  1. A library of computing microcircuits.

    * Routine measurement of wiring diagrams will lead to a library of microcircuits that are found in various nervous systems; this library will be a resource that provides inspiration and models for artificial computing systems. As an analogy, high-throughput genomics has led to a library of protein-encoding sequences that have been the basis of the burgeoning synthetic biology industry. Organizing and analyzing a library of microcircuits will be a major intellectual challenge, but an inevitable and fruitful one.

    * What is the justification for the claim that knowledge of biological microcircuits can usefully inform artificial computing systems? A famous example is that convolutional networks (specifically, the notion of serial layers of filtering and pooling) can be traced back (via LeCun and Fukushima) to measurements in visual cortex by Hubel and Wiesel. At the time, they were restricted to low-throughput physiology techniques, rather than high-throughput measurements of network connectivity. Who knows what we might find in entire libraries of such microcircuits.

  2. Computational infrastructure for automated analysis of large-scale biological imaging data.

    * Biology and clinical diagnosis are undergoing a major transformation that emphasizes imaging technologies as a platform for data acquisition. This is due to fundamental advances in the resolution and cost-effectiveness of imaging technologies and labeling methods, and the potential for modern computing technology to work with the resulting datasets. What is missing is the software infrastructure and algorithms for automatically analyzing such data.

    * For example, for the problem of ‘‘segmenting’’ (reconstructing) structures in biological imaging datasets, previous work in computer vision has not emphasized the scenario in which there is a rigorous notion of ground truth, as segmentation for natural images is often ill-defined. Hence, research on connectomic reconstruction is leading to new ‘‘end-to-end’’ learning algorithms for the segmentation problem that will be broadly useful for segmentation of datasets in which there is a clear notion of ground truth.

    * For example, Neuromancer has built Google3 infrastructure for storing, accessing, and processing petabyte-scale 3D images. This infrastructure will be broadly useful as other imaging platforms (particularly those currently evolving for clinical usage) themselves also grow in scale. In the clinical sphere, at some point (much like with the Cloud Genomics effort), it will be logical to shift management of these datasets to centralized sources of storage and computation, though the timeline for that type of shift is highly uncertain. We anticipate the infrastructure will be useful for other volumetric datasets, including atmospheric data, fluid dynamics simulations, and geophysical models.

  3. Precise models for treating brain diseases.

    * Improved understanding of how microcircuits in the brain are organized (by comparing the circuitry of healthy and diseased brains) will inform any treatment of circuit-related brain disorders. For example, it is known that techniques such as deep brain stimulation can, in some patient populations, act as a significant therapy for depression and Parkinson's disease. However, it is not well understood why these techniques work, or why in many cases they do not. A more precise understanding of the relevant circuits will lead to interventions that are more precisely formulated and therefore more consistently effective.

  4. Precise models for neural interfaces.

    * It is inevitable that humans will want and achieve technology that directly interfaces with neural circuits in various parts of the brain. This will enable entirely new forms of human experience and communication, and will offer the ultimate solution to neural prosthetics of various kinds. The only hope for any efficient path to achieving this type of technology is to first have some clear hypothesis of where and how to interface with different kinds of neural circuits. Connectomics offers one plausible route to generating such hypotheses.

Here are two of the problems that Ed Boyden is obsessed with. I’m similarly obsessed:

  1. If we could reconstruct the prefrontal cortex, we could try to understand how brains arrive at rational decisions; with the visual cortex, how we process images and recognize objects. Many problems in AI have, of course, a natural-intelligence correlate.

  2. Mapping brain disorders. We don't understand the origins or pathology of any brain disorder at the circuit level. If we could find neural targets in the brain that are altered in brain disorders, we could design drugs or stimulators that modulate those neurons. So far, all treatments for brain disorders were found by chance, are not understood, and work poorly.

Here is the title and abstract of the talk that Ed will be giving at Google next Tuesday:

Mapping the brain at scale: collecting the data necessary to infer the computations carried out by neural circuits

If we are to understand the computational basis for intelligent behavior, we need new technologies for mapping the molecular and anatomical circuitry of the brain and recording its dynamic activity with sufficient detail to infer the computations performed by neural circuits. Our group is working on three new approaches to address this need.

First, we have developed a fundamentally new super-resolution light microscopy technology that is faster than any other super-resolution technology, on a per-voxel basis. We anticipate that our new microscopy method, and improved versions we are working on currently, will enable imaging of molecular and anatomical information throughout entire brain circuits, and perhaps even entire brains.

Second, we have adapted to neuroscience the technology of plenoptic or light-field microscopy, a technology that enables single-shot 3-D images to be acquired without moving parts, and thus can be used to record high-speed movies of neural activity (Nature Methods 11:727-730). We are continuing to improve such microscopes, to the point where they may be useful for imaging the entire mammalian cortex.

Finally, we are working to establish the world's smallest mammal, the Etruscan shrew, as a model system in visual neuroscience. The Etruscan shrew has a small brain, with a six-layer cortex just a few hundred microns thick, and a visual cortex with perhaps just 75,000 neurons, fewer than a fly. It is small enough that entire molecular and anatomical maps, as well as dynamic activity maps, of the visual cortex might be feasible using the above tools in the near future, thereby enabling fundamental new models of how the cortex operates.

October 3, 2014

Problem: Observe the spiking behavior of every neuron (302 total) in an awake, behaving nematode (C. elegans) at 10 to 50 Hz, while simultaneously recording the gross structure and activity of the worm and controlling all environmental stimuli, with the grand goal of building an artificial worm simulated down to the level of individual neurons. This would be the first such simulation ever and would herald a completely new paradigm for computational neuroscience.

Assets: Imaging and data acquisition: [131]; C. elegans connectome: [177]; early analysis work: [158].

August 7, 2014

Here are some notes and references for the prose for which the reviewer (Eliasmith) asked for citations. I didn't talk about sparse coding—for which I'd cite Barlow [7] and [121]—or examples from optimization and signal processing that make use of locally adaptive methods—regarding which I'd mention the utility of adaptive subgradient methods [39] for stochastic gradient descent training and regression kernels for object detection [144]—but I don't know much about precedents for work like Chris Rozell's [136]:

Before max pooling there was winner-take-all (WTA) in Fukushima’s Neocognitron [44]. The max pooling layers in the Riesenhuber and Poggio HMAX model [134] have all but replaced WTA in most DNNs though the jury is out on whether this is the best model in practical or biological terms [140].
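
The contrast between the two operations can be made concrete with a small numpy sketch (a toy illustration of the two nonlinearities, not an implementation from either paper):

```python
import numpy as np

def winner_take_all(x):
    """Neocognitron-style WTA over a pool: only the strongest unit survives."""
    out = np.zeros_like(x)
    out[np.argmax(x)] = x.max()
    return out

def max_pool_1d(x, width=2):
    """HMAX-style max pooling: each local pool reports its own maximum."""
    trimmed = x[: len(x) - len(x) % width]   # drop any ragged tail
    return trimmed.reshape(-1, width).max(axis=1)

x = np.array([0.2, 0.9, 0.1, 0.7, 0.3, 0.8])
print(winner_take_all(x))   # one surviving unit across the whole vector
print(max_pool_1d(x))       # one maximum per local pool of width 2
```

The difference the code makes visible: WTA is a global competition that suppresses all but one response, while max pooling preserves one response per local neighborhood and so retains coarse position information.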

The computer vision community has experimented with a number of nonlinearities derived from biological models. The logistic function is of far too general utility to credit neuroscience with inspiring its use in computer vision, but its role as a differentiable alternative to the classical threshold unit is certainly most computer scientists' introduction to sigmoids. Simple truncation methods, such as using a half-wave rectifier instead of the traditional sigmoid, were biologically motivated [108]. These rectified linear units have outperformed sigmoidal activation functions to obtain the best results in several benchmark problems [92].
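
The two activation functions mentioned above are a few lines each; this sketch just defines them for comparison:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: a smooth, differentiable alternative to the
    classical hard-threshold unit."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Half-wave rectification: passes positive input, clips the rest to zero."""
    return np.maximum(0.0, z)

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z))
print(relu(z))
# The practical difference is in the gradients: the sigmoid saturates at both
# extremes, while the rectifier keeps a constant unit gradient for z > 0.
```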

Neuroscientists prefer not to invoke global operations in their models, but local operators derived from biological models have turned out to be both effective and efficient. Surround suppression in classical receptive fields is applied in the form of local non-max suppression for edge and contour detectors and localization in object recognition [75]. Some form of local (divisive) normalization appears to operate in a number of neural systems [24, 25], and local contrast normalization is one of the most important operators in state-of-the-art object recognition systems [80].
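
The divisive-normalization idea reduces to dividing each unit's response by a pooled measure of its neighbors' activity. A minimal sketch (the pool here is the whole response vector and the semisaturation constant sigma is an assumed free parameter, following the general form in the normalization literature, not any one model's fit):

```python
import numpy as np

def divisive_normalize(responses, sigma=1.0):
    """Divide each response by the pooled (root-sum-square) activity of the
    population, plus a semisaturation constant sigma."""
    pooled = np.sqrt(sigma**2 + np.sum(responses**2))
    return responses / pooled

r = np.array([1.0, 2.0, 3.0])
print(divisive_normalize(r))
# Normalization rescales the population response without reordering it, so
# relative selectivity is preserved while overall gain is controlled.
```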


[1]   B. Agüera y Arcas, A.L. Fairhall, and W. Bialek. Computation in a single neuron: Hodgkin and Huxley revisited. Neural Computation, 15(8):1715--49, 2003.

[2]   Costas A. Anastassiou, Rodrigo Perin, Henry Markram, and Christof Koch. Ephaptic coupling of cortical neurons. Nature Neuroscience, 14(2):217--223, 2011.

[3]   Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87--106, 1987.

[4]   C. A. Atencio and C. E. Schreiner. Columnar connectivity and laminar processing in cat primary auditory cortex. PLoS ONE, 5(3):e9521, 2010.

[5]   Jimmy Lei Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In Submitted to International Conference on Learning Representations, page [arXiv:1412.7755], 2015.

[6]   Carlo Baldassi, Alireza Alemi-Neissi, Marino Pagan, James J DiCarlo, Riccardo Zecchina, and Davide Zoccolan. Shape similarity, better than semantic membership, accounts for the structure of visual object representations in a population of monkey inferotemporal neurons. PLoS computational biology, 9:e1003167, 2013.

[7]   Horace B. Barlow. Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217--234. MIT Press, Cambridge, MA, 1961.

[8]   Horace B. Barlow. Unsupervised learning. Neural Computation, 1:295--311, 1989.

[9]   J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319--350, 2001.

[10]   Mark F. Bear, Barry Connors, and Michael Paradiso. Neuroscience: Exploring the Brain (Third Edition). Lippincott Williams & Wilkins, Baltimore, Maryland, 2006.

[11]   Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.

[12]   Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Brad Huang, Christopher D. Manning, Abby Vander Linden, Brittany Harding, and Peter Clark. Modeling biological processes for reading comprehension. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.

[13]   William Bialek, Fred Rieke, R.R. de Ruyter van Steveninck, and D. Warland. Reading a neural code. Science, 252:1854--1857, 1991.

[14]   Elie Bienenstock and Stuart Geman. Compositionality in neural systems. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 223--226. Bradford Books/MIT Press, 1995.

[15]   Elie Bienenstock, Stuart Geman, and Daniel Potter. Compositionality, MDL priors and object recognition. In M.C. Mozer, M.I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 838--844. MIT Press, Cambridge, MA, 1998.

[16]   Peter Blouw and Chris Eliasmith. A neurally plausible encoding of word order information into a semantic vector space. In 35th Annual Conference of the Cognitive Science Society, pages 1905--1910, 2013.

[17]   Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 1989.

[18]   Davi D. Bock, Wei-Chung Allen Lee, Aaron M. Kerlin, Mark L. Andermann, Greg Hood, Arthur W. Wetzel, Sergey Yurgenson, Edward R. Soucy, Hyon Suk Kim, and R. Clay Reid. Network anatomy and in vivo physiology of visual cortical neurons. Nature, 471(7337):177--182, 2011.

[19]   Zimbo SRM Boudewijns, Tatjana Kleele, Huibert D. Mansvelder, Bert Sakmann, Christiaan PJ de Kock, and Marcel Oberlaender. Semi-automated three-dimensional reconstructions of individual neurons reveal cell type-specific circuits in cortex. Communications Integrative Biology, 4:486--488, 2011.

[20]   Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. Recursive neural networks for learning logical semantics. CoRR, abs/1406.1827, 2014.

[21]   K.L. Briggman, M. Helmstaedter, and W. Denk. Wiring specificity in the direction-selectivity circuit of the retina. Nature, 471:183--188, 2011.

[22]   E.M. Callaway. Transneuronal circuit tracing with neurotropic viruses. Current Opinion Neurobiology, 18(6):617--23, 2008.

[23]   M. Carandini. From circuits to behavior: a bridge too far? Nature Neuroscience, 15(4):507--509, 2012.

[24]   Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13:51--62, 2012.

[25]   Matteo Carandini, David J. Heeger, and J. Anthony Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17:8621--8644, 1997.

[26]   Fei Chen, Paul W. Tillberg, and Edward S. Boyden. Expansion microscopy. Science, 347(6221):543--548, 2015.

[27]   K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, arXiv:1406.1078, 2014.

[28]   Brian Y. Chow and Edward S. Boyden. Optogenetics and translational medicine. Science Translational Medicine, 5(177):177ps5, 2013.

[29]   Jay S. Coggan, Thomas M. Bartol, Eduardo Esquenazi, Joel R. Stiles, Stephan Lamont, Maryann E. Martone, Darwin K. Berg, Mark H. Ellisman, and Terrence J. Sejnowski. Evidence for ectopic neurotransmission at a neuronal synapse. Science, 309(5733):446--451, 2005.

[30]   D. D. Cox, A. M. Papanastassiou, D. Oreper, B. B. Andken, and J. J. DiCarlo. High-resolution three-dimensional microelectrode brain mapping using stereo microfocal x-ray imaging. Journal of Neurophysiology, 100(5):2966--2976, 2008.

[31]   David D. Cox and James J. DiCarlo. Does learned shape selectivity in inferior temporal cortex automatically generalize across retinal position? Journal of Neuroscience, 28(40):10045--10055, November 2008.

[32]   David Daniel Cox and Thomas Dean. Neural networks and neuroscience-inspired computer vision. Current Biology, 24:921--929, 2014.

[33]   Thaddeus R. Cybulski, Joshua I. Glaser, Adam H. Marblestone, Bradley M. Zamft, Edward S. Boyden, George M. Church, and Konrad P. Kording. Spatial information in large-scale neural recordings. Frontiers in Computational Neuroscience, 8:1--16, 2015.

[34]   Thomas Dean, Biafra Ahanonu, Mainak Chowdhury, Anjali Datta, Andre Esteva, Daniel Eth, Nobie Redmon, Oleg Rumyantsev, and Ysis Tarter. On the technology prospects and investment opportunities for scalable neuroscience. ArXiv preprint cs.CV/1307.7302, 2013.

[35]   Thomas Dean, Dana Angluin, Kenneth Basye, Sean Engelson, Leslie Kaelbling, Evangelos Kokkevis, and Oded Maron. Inferring finite automata with stochastic output functions and an application to map learning. Machine Learning, 18(1):81--108, 1995.

[36]   Yash Deshpande and Andrea Montanari. Finding hidden cliques of size \sqrt{N/e} in nearly linear time. Foundations of Computational Mathematics, pages 1--60, 2013.

[37]   James J. DiCarlo and David D. Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333--341, 2007.

[38]   Shaul Druckmann, Thomas K. Berger, Felix Schürmann, Sean Hill, Henry Markram, and Idan Segev. Effective stimuli for constructing reliable neuron models. PLoS Computational Biology, 7(8):e1002133, 2011.

[39]   John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121--2159, 2011.

[40]   A.L. Eberle, S. Mikula, R. Schalek, J.W. Lichtman, M.L. Tate, and D. Zeidler. High-resolution, high-throughput imaging with a multibeam scanning electron microscope. Journal of Microscopy, 2015.

[41]   Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In D. S. Touretzky, M. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, 1996.

[42]   D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in primate cerebral cortex. Cerebral Cortex, 1:1--47, 1991.

[43]   Santiago Fernández, Alex Graves, and Jürgen Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

[44]   K. Fukushima. Neocognitron: A self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193--202, 1980.

[45]   Mario Galarreta and Shaul Hestrin. Electrical and chemical synapses among parvalbumin fast-spiking gabaergic interneurons in adult mouse neocortex. PNAS, 99:12438--12443, 2002.

[46]   S. Ganguli and H. Sompolinsky. Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annual Review of Neuroscience, 35:485--508, 2012.

[47]   Peiran Gao and Surya Ganguli. On simplicity and complexity in the brave new world of large-scale neuroscience. CoRR, arXiv:1503.08779, 2015.

[48]   Michael S. Gazzaniga. The Cognitive Neurosciences (Third Edition). Bradford Books. MIT Press, Cambridge, MA, 2009.

[49]   Stuart Geman. Invariance and selectivity in the ventral visual pathway. Journal of Physiology --- Paris, 100(4):212--224, 2006.

[50]   Dileep George and Jeff Hawkins. Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, 5(10), 2009.

[51]   K.K. Ghosh, L.D. Burns, E.D. Cocker, A. Nimmerjahn, Y. Ziv, A.E. Gamal, and M.J. Schnitzer. Miniaturized integration of a fluorescence microscope. Nature Methods, 8(10):871--8, 2011.

[52]   Nicolas Giret, Joergen Kornfeld, Surya Ganguli, and Richard H. R. Hahnloser. Evidence for a causal inverse model in an avian cortico-basal ganglia circuit. Proceedings of the National Academy of Sciences USA, 111:6063--6068, 2014.

[53]   E. Mark Gold. System identification via state characterization. Automatica, 8:621--636, 1972.

[54]   E. Mark Gold. Complexity of automaton identification from given sets. Information and Control, 37:302--320, 1978.

[55]   Kristen Grauman and Trevor Darrell. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8:725--760, 2007.

[56]   Alex Graves. Supervised sequence labelling with recurrent neural networks. Diploma thesis. Technische Universität München, 2009.

[57]   Alex Graves. Generating sequences with recurrent neural networks. CoRR, arXiv:1308.0850, 2012.

[58]   Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369--376, New York, NY, USA, 2006. ACM.

[59]   Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31th International Conference on Machine Learning, volume 32, pages 1764--1772, 2014.

[60]   Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, arXiv:1410.5401, 2014.

[61]   Petilla Interneuron Nomenclature Group. Petilla terminology: nomenclature of features of GABAergic interneurons of the cerebral cortex. Nature Reviews Neuroscience, 9:557--568, 2008.

[62]   Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H. Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947--951, 2000.

[63]   Alexander Hanuschkin, Markus Diesmann, and Abigail Morrison. A reafferent and feed-forward model of song syntax generation in the bengalese finch. Journal of Computational Neuroscience, 31(3):509--532, 2011.

[64]   Demis Hassabis and Eleanor A. Maguire. The construction system of the brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 364:1263--1271, 2009.

[65]   Kenneth J. Hayworth, C. Shan Xu, Zhiyuan Lu, Graham W. Knott, Richard D. Fetter, Juan Carlos Tapia, Jeff W. Lichtman, and Harald F. Hess. Ultrastructurally smooth thick partitioning and volume stitching for large-scale connectomics. Nature Methods, 12:319--322, 2015.

[66]   M. Helmstaedter, C.P.J. de Kock, D. Feldmeyer, R.M. Bruno, and B. Sakmann. Reconstruction of an average cortical column in silico. Brain Research Reviews, 55(2):193--203, 2007.

[67]   Moritz Helmstaedter, Kevin L. Briggman, Srinivas C. Turaga, Viren Jain, H. Sebastian Seung, and Winfried Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500:168--174, 2013.

[68]   S. Hill and G. Tononi. Modeling sleep and wakefulness in the thalamocortical system. Journal of Neurophysiology, 93(3):1671--98, 2005.

[69]   Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In Proceedings of the International Conference on Artificial Neural Networks, pages 44--51, 2011.

[70]   Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In Deep Learning Workshop at the 2014 Conference on Neural Information Processing Systems, 2014.

[71]   Alan L. Hodgkin and Andrew F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117:500--544, 1952.

[72]   Sonja B. Hofer, Ho Ko, Bruno Pichler, Joshua Vogelstein, Hana Ros, Hongkui Zeng, Ed Lein, Nicholas A. Lesica, and Thomas D. Mrsic-Flogel. Differential connectivity and response dynamics of excitatory and inhibitory neurons in visual cortex. Nature Neuroscience, 14:1045--1052, 2011.

[73]   D. H. Hubel and T. N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology, 160:106--154, 1962.

[74]   D. H. Hubel and T. N Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195:215--243, 1968.

[75]   Aapo Hyvärinen. Statistical models of natural images and cortical visual representation. Topics in Cognitive Science, 2(2):251--264, 2010.

[76]   L. Itti and P. Baldi. A principled approach to detecting surprising events in video. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 631--637, San Diego, CA, 2005.

[77]   L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254--1259, Nov 1998.

[78]   J. Akerboom, T.W. Chen, T.J. Wardill, L. Tian, J.S. Marvin, S. Mutlu, N.C. Calderón, F. Esposti, B.G. Borghuis, X.R. Sun, A. Gordus, M.B. Orger, R. Portugues, F. Engert, J.J. Macklin, A. Filosa, A. Aggarwal, R.A. Kerr, R. Takagi, S. Kracun, E. Shigetomi, B.S. Khakh, H. Baier, L. Lagnado, S.S. Wang, C.I. Bargmann, B.E. Kimmel, V. Jayaraman, K. Svoboda, D.S. Kim, E.R. Schreiter, and L.L. Looger. Optimization of a GCaMP calcium indicator for neural activity imaging. The Journal of Neuroscience, 32:13819--13840, 2012.

[79]   Viren Jain, H. Sebastian Seung, and Srinivas C. Turaga. Machines that learn to segment images: a crucial technology for connectomics. Current Opinion in Neurobiology, 20(5):1--14, 2010.

[80]   Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the International Conference on Computer Vision. IEEE Computer Society, 2009.

[81]   Eric Jonas and Konrad Kording. Automatic discovery of cell types and microcircuitry from neural connectomics. CoRR, abs/1407.4137, 2014.

[82]   J.C. Jung and M.J. Schnitzer. Multiphoton endoscopy. Optics Letters, 28(11):902--904, 2003.

[83]   Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015.

[84]   E.R. Kandel, J.H. Schwartz, and T.M. Jessell. Principles of neural science (Fourth Edition). McGraw-Hill, Health Professions Division, 2000.

[85]   M. Kearns and L. G. Valiant. Cryptographic limitations on learning boolean functions and finite automata. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 433--444, 1989.

[86]   Georges Khazen, Sean L. Hill, Felix Schürmann, and Henry Markram. Combinatorial expression rules of ion channel genes in juvenile rat (rattus norvegicus) neocortical neurons. PLoS ONE, 7(4):e34786, 2012.

[87]   A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA: GPU run-time code generation for high-performance computing. Technical Report 2009-40, Scientific Computing Group, Brown University, Providence, RI, USA, November 2009.

[88]   Ho Ko, Lee Cossell, Chiara Baragli, Jan Antolik, Claudia Clopath, Sonja B Hofer, and Thomas D Mrsic-Flogel. The emergence of functional microcircuits in visual cortex. Nature, 496(7443):96--100, 2013.

[89]   Ho Ko, Sonja B. Hofer, Bruno Pichler, Katherine A. Buchanan, P. Jesper Sjostrom, and Thomas D. Mrsic-Flogel. Functional specificity of local synaptic connections in neocortical networks. Nature, 473:87--91, 2011.

[90]   Suhasa B. Kodandaramaiah, Giovanni T. Franzesi, Brian Y. Chow, Edward S. Boyden, and Craig R. Forest. Automated whole-cell patch-clamp electrophysiology of neurons in vivo. Nature Methods, 9(6):585--587, 2012.

[91]   Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. A Clockwork RNN. In Proceedings of the 31th International Conference on Machine Learning, volume 32, pages 1863--1871, 2014.

[92]   Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106--1114, 2012.

[93]   Sanjiv Kumar and Martial Hebert. Man-made structure detection in natural images using a causal multiscale random field. In Proceedings of IEEE Computer Vision and Pattern Recognition, volume 1, pages 119--126, 2003.

[94]   M. Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends Neuroscience, 36(3):141--151, 2013.

[95]   Matthew Lawlor and Steven W. Zucker. Third-order edge statistics: Contour continuation, curvature, and cortical connections. In Christopher J. C. Burges, Leon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, NIPS, pages 1763--1771, 2013.

[96]   Je H. Lee, Evan R. Daugharthy, Jonathan Scheiman, Reza Kalhor, Joyce L. Yang, Thomas C. Ferrante, Richard Terry, Sauveur S. F. Jeanty, Chao Li, Ryoji Amamoto, Derek T. Peters, Brian M. Turczyk, Adam H. Marblestone, Samuel A. Inverso, Amy Bernard, Prashant Mali, Xavier Rios, John Aach, and George M. Church. Highly Multiplexed Subcellular RNA Sequencing in Situ. Science, 343(6177):1360--1363, 2014.

[97]   Tai Sing Lee, David Mumford, Song Chun Zhu, and Victor Lamme. The role of V1 in shape representation. In Bower, editor, Computational Neuroscience, pages 697--703. Plenum Press, New York, 1997.

[98]   T.S. Lee, D. Mumford, R. Romero, and V. Lamme. The role of primary visual cortex in higher level vision. Vision Research, 38:2429--2454, 1998.

[99]   Anthony D. Lien and Massimo Scanziani. Tuned thalamic excitation is amplified by visual cortical circuits. Nature Neuroscience, 16:1315--1323, 2013.

[100]   W. A. Lim, R. Alvania, and W. F. Marshall. Cell biology 2.0. Trends in Cell Biology, 22(12):611--612, 2012.

[101]   W. A. Lim, C. M. Lee, and C. Tang. Design principles of regulatory networks: searching for the molecular algorithms of the cell. Molecular Cell, 49(2):202--212, 2013.

[102]   Vladan Lucic and Wolfgang Baumeister. Monte carlo places strong odds on ectopic release. Science, 309:387--388, 2005.

[103]   Bill MacCartney and Christopher Manning. Natural logic for textual inference. In Proceedings of ACL Workshop on Textual Entailment and Paraphrasing, 2007.

[104]   Bill MacCartney and Christopher D. Manning. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22Nd International Conference on Computational Linguistics - Volume 1, pages 521--528, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[105]   Bill MacCartney and Christopher D. Manning. An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics, pages 140--156, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[106]   Francis Maes, Ludovic Denoyer, and Patrick Gallinari. Structured prediction with reinforcement learning. Machine Learning, 77(2-3):271--301, 2009.

[107]   G. Major, M. E. Larkum, and J. Schiller. Active properties of neocortical pyramidal neuron dendrites. Annual Review Neuroscience, 36:1--24, 2013.

[108]   Jitendra Malik and Pietro Perona. Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7:923--932, 1990.

[109]   A. H. Marblestone and E. S. Boyden. Designing tools for assumption-proof brain mapping. Neuron, 83(6):1239--1241, 2014.

[110]   Adam H Marblestone, Evan R Daugharthy, Reza Kalhor, Ian D Peikon, Justus M Kebschull, Seth L Shipman, Yuriy Mishchenko, Je Hyuk Lee, David A Dalrymple, Bradley M Zamft, Konrad P Kording, Edward S Boyden, Anthony M Zador, and George M Church. Conneconomics: The economics of dense, large-scale, high-resolution neural connectomics. bioRxiv, 2014.

[111]   Adam H. Marblestone, Evan R. Daugharthy, Reza Kalhor, Ian D. Peikon, Justus M. Kebschull, Seth L. Shipman, Yuriy Mishchenko, Je Hyuk Lee, Konrad P. Kording, Edward S. Boyden, Anthony M. Zador, and George M. Church. Rosetta brains: A strategy for molecularly-annotated connectomics. CoRR, arXiv:1404.5103, 2014.

[112]   Adam H. Marblestone, Bradley M. Zamft, Yael G. Maguire, Mikhail G. Shapiro, Thaddeus R. Cybulski, Joshua I. Glaser, Ben Stranges, Reza Kalhor, Elad Alon, David A. Dalrymple, Dongjin Seo, Michel M. Maharbiz, Jose Carmena, Jan Rabaey, Edward S. Boyden, George M. Church, and Konrad P. Kording. Physical principles for scalable neural recording. ArXiv preprint cs.CV/1306.5709, 2013.

[113]   Gary Marcus, Adam Marblestone, and Thomas Dean. The atoms of neural computation. Science, 346:551--552, 2014.

[114]   Henry Markram, Maria Toledo-Rodriguez, Yun Wang, Anirudh Gupta, Gilad Silberberg, and Caizhi Wu. Interneurons of the neocortical inhibitory system. Nature Reviews Neuroscience, 5:793--807, 2004.

[115]   J.H. Marshel, T. Mori, K.J. Nielsen, and E.M. Callaway. Targeting single neuronal networks for gene expression and cell labeling in vivo. Neuron, 67(4):562--574, 2010.

[116]   Manuel Marx, Robert H. Gunter, Werner Hucko, Gabriele Radnikow, and Dirk Feldmeyer. Improved biocytin labeling and neuronal 3d reconstruction. Nature Protocols, 7:394--407, 2012.

[117]   Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive Science, 34(8):1388--1429, 2010.

[118]   Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1081--1088, 2008.

[119]   Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. CoRR, abs/1406.6247, 2014.

[120]   Javier Morante and Claude Desplan. Dissecting and staining drosophila optic lobes. In Bing Zhang, Marc R. Freeman, and Scott Waddell, editors, Drosophila Neurobiology: A Laboratory Manual, volume 2011, pages 652--656. CSHL Press, Cold Spring Harbor, New York, 2011.

[121]   B. A. Olshausen and D. J. Field. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7(2):333--339, 1996.

[122]   B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311--3325, 1997.

[123]   Randall C. O’Reilly. Biologically based computational models of high-level cognition. Science, 314(5796):91--94, 2006.

[124]   Randall C. O’Reilly and Michael J. Frank. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18(2):283--328, 2006.

[125]   Randall C. O’Reilly, Seth A. Herd, and Wolfgang M. Pauli. Computational models of cognitive control. Current Opinion in Neurobiology, 20(2):257--261, 2010.

[126]   Simon Osindero and Geoffrey Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1121--1128. MIT Press, Cambridge, MA, 2008.

[127]   Adam M Packer and Rafael Yuste. Dense, unspecific connectivity of neocortical parvalbumin-positive interneurons: a canonical microcircuit for inhibition? The Journal of Neuroscience, 31(37):13260--13271, 2011.

[128]   Anthony Pagden. The Enlightenment and Why It Still Matters. Random House, New York, NY, 2013.

[129]   Rodrigo Perin, Thomas K. Berger, and Henry Markram. A synaptic organizing principle for cortical neuronal groups. Proceedings of the National Academy of Sciences, 108(13):5419--5424, 2011.

[130]   Nicolas Pinto, Zac Stone, Todd Zickler, and David D. Cox. Scaling-up Biologically-Inspired Computer Vision: A Case-Study on Facebook. In IEEE Computer Vision and Pattern Recognition, Workshop on Biologically Consistent Vision, pages 35--42, 2011.

[131]   R. Prevedel, Y.G. Yoon, M. Hoffmann, N. Pak, G. Wetzstein, S. Kato, T. Schrödel, R. Raskar, M. Zimmer, E.S. Boyden, and A. Vaziri. Simultaneous whole-animal 3D-imaging of neuronal activity using light field microscopy. CoRR, abs/1401.5333, 2014.

[132]   Alexandro D. Ramirez, Yashar Ahmadian, Joseph Schumacher, David Schneider, Sarah M. N. Woolley, and Liam Paninski. Incorporating naturalistic correlation structure improves spectrogram reconstruction from neuronal activity in the songbird auditory midbrain. Journal of Neuroscience, 31:3828--3842, 2011.

[133]   Michael W. Reimann, Costas A. Anastassiou, Rodrigo Perin, Sean L. Hill, Henry Markram, and Christof Koch. A biophysically detailed model of neocortical local field potentials predicts the critical role of active membrane currents. Neuron, 79(2):375--390, 2013.

[134]   M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019--1025, November 1999.

[135]   Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.

[136]   C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10):2526--2563, 2008.

[137]   Daniel L. Schacter, Donna Rose Addis, Demis Hassabis, Victoria C. Martin, R. Nathan Spreng, and Karl K. Szpunar. The future of memory: Remembering, imagining, and the brain. Neuron, 76:677--694, 2012.

[138]   D.L. Schacter, D.R. Addis, and R.L. Buckner. Constructive memory and the simulation of future events. In M.S. Gazzaniga, editor, The Cognitive Neurosciences IV, pages 751--762. MIT Press, Cambridge, MA, 2009.

[139]   W. Scheirer, S. Anthony, K. Nakayama, and D. Cox. Perceptual annotation: Measuring human performance to improve machine vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:1679--1686, 2014.

[140]   Jürgen Schmidhuber. Deep learning in neural networks: An overview. Technical Report IDSIA-03-14, IDSIA, 2014.

[141]   Elad Schneidman, William Bialek, and Michael J. Berry II. An information theoretic approach to the functional classification of neurons. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 197--204, 2002.

[142]   Tina Schrödel, Robert Prevedel, Karin Aumayr, Manuel Zimmer, and Alipasha Vaziri. Brain-wide 3D imaging of neuronal activity in Caenorhabditis elegans with sculpted light. Nature Methods, 10:1013--1020, 2013.

[143]   Dongjin Seo, Jose M. Carmena, Jan M. Rabaey, Elad Alon, and Michel M. Maharbiz. Neural dust: An ultrasonic, low power solution for chronic brain-machine interfaces. arXiv preprint arXiv:1307.2196, 2013.

[144]   Hae Jong Seo and Peyman Milanfar. Training-free, generic object detection using locally adaptive regression kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1688--1704, 2010.

[145]   H. Sebastian Seung. Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40(6):1063--1073, 2003.

[146]   H. Sebastian Seung. Reading the book of memory: Sparse sampling versus dense mapping of connectomes. Neuron, 62(1):17--29, 2009.

[147]   A. S. Shai, C. A. Anastassiou, M. E. Larkum, and C. Koch. Physiology of layer 5 pyramidal neurons in mouse primary visual cortex: Coincidence detection through bursting. PLoS Computational Biology, 11(3), 2015.

[148]   P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159--216, 1990.

[149]   Javier Snaider and Stan Franklin. Modular composite representation. Cognitive Computation, pages 1--18, 2014.

[150]   Richard Socher, Adrian Barbu, and Dorin Comaniciu. A learning based hierarchical model for vessel segmentation. In IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Paris, France, 2008. IEEE.

[151]   Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems 26. 2013.

[152]   Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2012.

[153]   Richard Socher and Christopher D. Manning. Deep learning for NLP (without magic) tutorial. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, pages 1--3, 2013.

[154]   Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642. Association for Computational Linguistics, Stroudsburg, PA, USA, 2013.

[155]   Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.

[156]   Vivek Srikumar and Christopher Manning. Learning distributed representations for structured output prediction. In Advances in Neural Information Processing Systems 27, 2014.

[157]   Francois St-Pierre, Jesse D. Marshall, Ying Yang, Yiyang Gong, Mark J. Schnitzer, and Michael Z. Lin. High-fidelity optical reporting of neuronal electrical activity with an ultrafast fluorescent voltage sensor. Nature Neuroscience, 17:884--889, 2014.

[158]   Greg J. Stephens, Leslie C. Osborne, and William Bialek. Searching for simplicity in the analysis of neurons and behavior. Proceedings of the National Academy of Sciences, 108(Suppl. 3):15565--15571, 2011.

[159]   Terrence C. Stewart, Trevor Bekolay, and Chris Eliasmith. Learning to select actions with spiking neurons in the basal ganglia. Frontiers in Neuroscience, 6(2), 2012.

[160]   Terrence C. Stewart, Xuan Choo, and Chris Eliasmith. Symbolic reasoning in spiking neurons: A model of the cortex/basal ganglia/thalamus loop. In Proceedings of the 32nd Annual Meeting of the Cognitive Science Society, 2010.

[161]   E. A. Susaki, K. Tainaka, D. Perrin, F. Kishino, T. Tawara, T. M. Watanabe, C. Yokoyama, H. Onoe, M. Eguchi, S. Yamaguchi, T. Abe, H. Kiyonari, Y. Shimizu, A. Miyawaki, H. Yokota, and H. R. Ueda. Whole-brain imaging with single-cell resolution using chemical cocktails and computational analysis. Cell, 157(3):726--739, 2014.

[162]   Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.

[163]   Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2006.

[164]   K. Tainaka, S. I. Kubota, T. Q. Suyama, E. A. Susaki, D. Perrin, M. Ukai-Tadenuma, H. Ukai, and H. R. Ueda. Whole-body imaging with single-cell resolution by tissue decolorization. Cell, 159(4):911--924, 2014.

[165]   D. Takeuchi, T. Hirabayashi, K. Tamura, and Y. Miyashita. Reversal of interlaminar signal between sensory and memory processing in monkey temporal cortex. Science, 331(6023):1443--1447, 2011.

[166]   Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368--377, 1999.

[167]   A. Torralba, A. Oliva, M.S. Castelhano, and J.M. Henderson. Contextual guidance of attention in natural scenes. Psychological Review, 113:766--786, 2006.

[168]   Sébastien Tremblay, Adam Pieper, Florian Sachs, and Julio Martinez-Trujillo. Attentional filtering of visual information by neuronal ensembles in the primate lateral prefrontal cortex. Neuron, 85:202--215, 2015.

[169]   K. Tsunoda, Y. Yamane, M. Nishizaki, and M. Tanifuji. Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns. Nature Neuroscience, 4:832--838, 2001.

[170]   S. Ullman and S. Soloviev. Computation of pattern invariance in brain-like structures. Neural Networks, 12:1021--1036, 1999.

[171]   Shimon Ullman, Michel Vidal-Naquet, and Erez Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7):682--687, 2002.

[172]   L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134--1142, 1984.

[173]   V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264--280, 1971.

[174]   Bram-Ernst Verhoef, Rufin Vogels, and Peter Janssen. Inferotemporal cortex subserves three-dimensional structure categorization. Neuron, 73:171--182, 2012.

[175]   Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. Grammar as a foreign language. CoRR, abs/1412.7449, 2014.

[176]   Thomas Wennekers, Friedrich T. Sommer, and Gunther Palm. Iterative retrieval in associative memories by threshold control of different neural models. Neural Computation, 11:21--66, 1999.

[177]   J. G. White, E. Southgate, J. N. Thomson, and S. Brenner. The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society B: Biological Sciences, 314:1--340, 1986.

[178]   Dominic Widdows and Trevor Cohen. Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, 2014. doi:10.1093/jigpal/jzu028.

[179]   David H. Wolpert, editor. The Mathematics of Generalization. Addison-Wesley, Reading, Massachusetts, 1995.

[180]   David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67--82, 1997.

[181]   Yukako Yamane, Eric T. Carlson, Katherine C. Bowman, Zhihong Wang, and Charles E. Connor. A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience, 11:1352--1360, 2008.

[182]   Yukako Yamane, Kazushige Tsunoda, Madoka Matsumoto, Adam N. Phillips, and Manabu Tanifuji. Representation of the spatial relationship among object parts by neurons in macaque inferotemporal cortex. Journal of Neurophysiology, 96(6):3147--3156, 2006.

[183]   D.L. Yamins, H. Hong, C. Cadieu, and J.J. DiCarlo. Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. In Advances in Neural Information Processing Systems 26, pages 3093--3101, Tahoe, CA, 2013.

[184]   Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2009.

[185]   Junjie Yao, Lidai Wang, Joon-Mo Yang, Konstantin I. Maslov, Terence T. W. Wong, Lei Li, Chih-Hsien Huang, Jun Zou, and Lihong V. Wang. High-speed label-free functional photoacoustic microscopy of mouse brain in action. Nature Methods, advance online publication, 2015.

[186]   J. Yu and D. Ferster. Functional coupling from simple to complex cells in the visually driven cortical circuit. Journal of Neuroscience, 33(48):18855--18866, 2013.

[187]   Anthony M. Zador, Joshua Dubnau, Hassana K. Oyibo, Huiqing Zhan, Gang Cao, and Ian D. Peikon. Sequencing the connectome. PLoS Biology, 10(10):e1001411, 2012.

[188]   F. Zhang, V. Gradinaru, A.R. Adamantidis, R. Durand, R.D. Airan, L. de Lecea, and K. Deisseroth. Optogenetic interrogation of neural circuits: technology for probing mammalian brain structures. Nature Protocols, 5(3):439--56, 2010.

[189]   Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. Long short-term memory over tree structures. CoRR, abs/1503.04881, 2015.

[190]   D. Zoccolan, N. Oertelt, J. J. DiCarlo, and D. D. Cox. A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences, 106(21):8748--8753, 2009.

[191]   Steven W. Zucker. Local field potentials and border ownership: A conjecture about computation in visual cortex. Journal of Physiology - Paris, 106:297--315, 2012.

[192]   Steven W. Zucker. Stereo, shading, and surfaces: Curvature constraints couple neural computations. Proceedings of the IEEE, 102:812--829, 2014.

1 Following up on a conversation with Terry Sejnowski, I contacted Surya Ganguli about an analysis that Terry summarized in our discussion. Surya provided a nice summary of the analysis he did with Peiran Gao and a link to a paper under review for publication in Current Opinion in Neurobiology, a draft of which you can find here. Here’s the relevant excerpt from Surya’s reply:

We often record from 100’s of neurons — we are using Krishna Shenoy’s data from motor cortex, compute trial averages, do dimensionality reduction, and get low dimensional state space dynamics that make sense, and can decode single trial information well from this small subset of neurons. This raises conceptual questions:
  1. Why is the dimensionality so small, relative to the number of neurons?

  2. Why do the state space dynamics make sense, and why can we decode well despite recording so few neurons?

  3. Can we trust these results — would either the dimensionality or dynamics change if we recorded all 1 million neurons say?

We develop a theoretical framework that (a) derives an upper bound on the dimensionality of neural data given the complexity of the task and smoothness of neural responses and (b) connects the action of doing electrophysiology to doing a random projection of neural activity patterns onto the subspace of recorded neurons — this can be used to tell us how many neurons we need to record to accurately recover collective neural dynamics.

Using this theory, the answers to the above questions are:

  1. Because the task is too simple, and the dimensionality is actually as high as possible given the complexity of the task.

  2. Because you don’t need that many random projections (recorded neurons) to recover the geometry/manifold of neural activity patterns.

  3. Neither the dimensionality nor dynamics will change if we record more neurons, while doing the same task.

Hope that helps! Am happy to chat about this further if you wish.

best wishes,
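
Surya's point (b) is easy to illustrate numerically: if population activity lies on a low-dimensional manifold, then recording a random subset of neurons acts like a random projection that preserves the manifold's dimensionality. Here is a minimal sketch with synthetic data; all sizes, the 95% variance threshold, and the helper name `effective_dim` are illustrative choices, not taken from the paper under review:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent dynamics: d = 3 smooth trajectories over T time steps.
d, n_neurons, T = 3, 1000, 500
t = np.linspace(0, 4 * np.pi, T)
latent = np.stack([np.sin(t), np.cos(t), np.sin(2 * t)])   # (d, T)

# Embed the latent dynamics in a large population via a random readout.
readout = rng.standard_normal((n_neurons, d))
activity = readout @ latent                                 # (n_neurons, T)

def effective_dim(X, var_frac=0.95):
    """Number of principal components needed to explain var_frac of variance."""
    X = X - X.mean(axis=1, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False) ** 2
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, var_frac) + 1)

# "Electrophysiology": record a random subset of 50 of the 1000 neurons.
subset = rng.choice(n_neurons, size=50, replace=False)
print(effective_dim(activity))          # dimensionality of the full population
print(effective_dim(activity[subset]))  # same for the recorded subset
```

Both calls report dimensionality 3: the recorded subset recovers the geometry of the full population, which is the intuition behind answers 2 and 3 above.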

2 Here is the notice for a technical talk given at Google by Eric Jonas on his joint work with Konrad Kording:

Title: Automatic discovery of cell types and microcircuitry from neural connectomics [81]

Speaker: Eric Jonas is a postdoc working on measurement and computation with Ben Recht in EECS at UC Berkeley. He completed his PhD on stochastic circuitry for Bayesian inference at MIT in September of 2013, where he also received his M.Eng. and SB in EECS and an SB in Neurobiology. His research interests lie at the intersection of measurement, inference, and biology.

Abstract: Neural connectomics has begun producing massive amounts of data, necessitating new analysis methods to discover the biological and computational structure. It has long been assumed that discovering neuron types and their relation to microcircuitry is crucial to understanding neural function. Here we developed a nonparametric Bayesian technique that identifies neuron types and microcircuitry patterns in connectomics data. It combines the information traditionally used by biologists, including connectivity, cell body location and the spatial distribution of synapses, in a principled and probabilistically-coherent manner. We show that the approach recovers known neuron types in the retina and enables predictions of connectivity, better than simpler algorithms. It also can reveal interesting structure in the nervous system of C. elegans, and automatically discovers the structure of a microprocessor. Our approach extracts structural meaning from connectomics, enabling new approaches of automatically deriving anatomical insights from these emerging datasets.

Paper: Automatic discovery of cell types and microcircuitry from neural connectomics, Eric Jonas and Konrad Kording, CoRR, 2014.
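
The core premise — that cell types should be recoverable from connectivity alone — can be illustrated with a toy spectral embedding of a block-structured connectome. This is only a sketch of the problem setup, not the nonparametric Bayesian model of the paper; the two "cell types," their wiring probabilities, and all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy connectome: two hidden "cell types" with type-dependent wiring rates.
n_per_type = 40
n = 2 * n_per_type
labels = np.repeat([0, 1], n_per_type)
p = np.array([[0.6, 0.1],      # P(edge) between type pairs
              [0.1, 0.5]])
A = (rng.random((n, n)) < p[np.ix_(labels, labels)]).astype(float)

# Spectral embedding: the second singular vector of the symmetrized
# adjacency separates the two blocks.
S = (A + A.T) / 2
_, _, vt = np.linalg.svd(S)
guess = (vt[1] > 0).astype(int)

# Agreement with the true types, up to label permutation.
acc = max(np.mean(guess == labels), np.mean(guess != labels))
print(acc)
```

With this much block structure the sign of the second singular vector recovers nearly all type assignments; the paper's contribution is to do this without fixing the number of types in advance and while folding in cell body location and synapse distributions.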

3 Kenneth Hayworth has worked with some of the best researchers in the field of neuroscience and electron microscopy. He is a Senior Scientist at the HHMI Janelia Farm Research Campus, where he works in the Harald Hess lab. In addition to his work at Janelia, Ken is the President and Co-Founder of the Brain Preservation Foundation, which is, as its name suggests, dedicated to preserving the brains of humans, including their individual memories and identities, after they die.

4 For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the semantic similarity between the two sentences. To evaluate their model on the semantic relatedness task, Tai et al use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al 2014), which consists of ~10,000 sentence pairs divided into 4500/500/5000 training/validation/testing sets.
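
A standard baseline for this task scores a sentence pair by the cosine similarity of averaged word vectors; models are then evaluated by correlating predicted scores with the human ratings (e.g., Pearson's r). A toy sketch — the word vectors here are random stand-ins, not trained embeddings, and the tiny vocabulary is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy vocabulary of random word vectors (stand-ins for trained embeddings).
vocab = {w: rng.standard_normal(16) for w in
         "a dog is running in the park cat sleeping man playing guitar".split()}

def sentence_vec(sentence):
    """Additive composition: average the word vectors of in-vocabulary words."""
    vs = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vs, axis=0)

def relatedness(s1, s2):
    """Cosine similarity between the two composed sentence vectors."""
    a, b = sentence_vec(s1), sentence_vec(s2)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = relatedness("a dog is running in the park",
                    "a dog is playing in the park")
print(score)
# On SICK one would report the Pearson correlation between these predicted
# scores and the gold relatedness ratings, e.g. np.corrcoef(pred, gold)[0, 1].
```

Tree-LSTMs improve on this baseline precisely because averaging ignores word order and syntactic structure.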

5 The authors distinguish between recursive neural network models, in which the output feeds back to the input, and recurrent models, in which the hidden state persists over time, i.e., over successive applications of the model.
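
The distinction is easy to make concrete: a recurrent model folds one shared cell along a sequence, while a recursive model folds the same kind of cell over a tree. A minimal sketch with plain tanh cells; the dimensions and weight scales are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
W, U = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
Wl, Wr = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1

def recurrent(xs):
    """Recurrent: the hidden state persists over time, one step per input."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h

def recursive(tree):
    """Recursive: the same composition is applied at every internal tree node."""
    if isinstance(tree, tuple):                   # internal node: (left, right)
        return np.tanh(Wl @ recursive(tree[0]) + Wr @ recursive(tree[1]))
    return tree                                   # leaf: a word vector

words = [rng.standard_normal(d) for _ in range(4)]
h_seq = recurrent(words)                                          # chain
h_tree = recursive(((words[0], words[1]), (words[2], words[3])))  # balanced tree
print(h_seq.shape, h_tree.shape)
```

A chain-structured LSTM is the special case of the tree-structured version in which every node has exactly one child.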

6 Here are the three machine-learning technologies employed in the Socher et al [150] paper on segmenting tubular structures:

  1. Marginal space learning — A. Barbu, V. Athitsos, B. Georgescu, S. Boehm, P. Durlak, and D. Comaniciu, ‘‘Hierarchical learning of curves: Application to guidewire localization in fluoroscopy,’’ CVPR, 2007.

  2. Probabilistic boosting trees — Zhuowen Tu, ‘‘Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering,’’ ICCV, 2005.

  3. Steerable features — Y. Zheng, A. Barbu, B. Georgescu, M. Scheuering, and D. Comaniciu, ‘‘Fast automatic heart chamber segmentation from 3-D CT data using marginal space learning and steerable features,’’ ICCV, 2007.

7 Here are the BibTeX entries including abstracts for the papers mentioned in the March 19 log entry:

@inproceedings{Socher2008,
        title = {A Learning Based Hierarchical Model for Vessel Segmentation},
       author = {Richard Socher and Adrian Barbu and Dorin Comaniciu},
    booktitle = {IEEE International Symposium on Biomedical Imaging: From Nano to Macro},
    publisher = {IEEE},
      address = {Paris, France},
         year = {2008},
     abstract = {In this paper we present a learning based method for vessel segmentation in angiographic videos. Vessel Segmentation is an important task in medical imaging and has been investigated extensively in the past. Traditional approaches often require pre-processing steps, standard conditions or manually set seed points. Our method is automatic, fast and robust towards noise often seen in low radiation X-ray images. Furthermore, it can be easily trained and used for any kind of tubular structure. We formulate the segmentation task as a hierarchical learning problem over 3 levels: border points, cross-segments and vessel pieces, corresponding to the vessel's position, width and length. Following the Marginal Space Learning paradigm the detection on each level is performed by a learned classifier. We use Probabilistic Boosting Trees with Haar and steerable features. First results of segmenting the vessel which surrounds a guide wire in 200 frames are presented and future additions are discussed.}
}

@article{Zhu2015,
        title = {Long Short-Term Memory Over Tree Structures},
       author = {Xiaodan Zhu and Parinaz Sobhani and Hongyu Guo},
      journal = {CoRR},
       volume = {abs/1503.04881},
         year = {2015},
     abstract = {The chain-structured long short-term memory (LSTM) has showed to be effective in a wide range of problems such as speech recognition and machine translation. In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process. We call the model S-LSTM, which provides a principled way of considering long-distance interaction over hierarchies, e.g., language or image parse structures. We leverage the models for semantic composition to understand the meaning of text, a fundamental problem in natural language understanding, and show that it outperforms a state-of-the-art recursive model by replacing its composition layers with the S-LSTM memory blocks. We also show that utilizing the given structures is helpful in achieving a performance better than that without considering the structures.}
}

@article{Tai2015,
        title = {Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks},
       author = {Kai Sheng Tai and Richard Socher and Christopher D. Manning},
      journal = {CoRR},
       volume = {abs/1503.00075},
         year = {2015},
     abstract = {A Long Short-Term Memory (LSTM) network is a type of recurrent neural network architecture which has recently obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).}
}

@article{Vinyals2014,
       author = {Oriol Vinyals and Lukasz Kaiser and Terry Koo and Slav Petrov and Ilya Sutskever and Geoffrey E. Hinton},
        title = {Grammar as a Foreign Language},
      journal = {CoRR},
       volume = {abs/1412.7449},
          url = {},
         year = {2014},
     abstract = {Syntactic parsing is a fundamental problem in computational linguistics and natural language processing. Traditional approaches to parsing are highly complex and problem specific. Recently, Sutskever et al. (2014) presented a task-agnostic method for learning to map input sequences to output sequences that achieved strong results on a large scale machine translation problem. In this work, we show that precisely the same sequence-to-sequence method achieves results that are close to state-of-the-art on syntactic constituency parsing, whilst making almost no assumptions about the structure of the problem. To achieve these results we need to mitigate the lack of domain knowledge in the model by providing it with a large amount of automatically parsed data.}
}

8 In the process of imagining herself in Alice’s situation working two jobs, Jan might realize that this option for getting out of debt isn’t likely to work for her since Jan knows that she needs at least eight hours of sleep every night to function, and working two jobs would quickly exhaust her. How exactly she realizes her limitations in this regard is an open question and one that we’ve been thinking a lot about in developing applications for Descartes.

Specifically, how might Jan draw this particular conclusion but avoid drawing all of the myriad other true but irrelevant conclusions that have no bearing on the feasibility of her proposed ‘‘working two jobs’’ solution to getting out of debt? This is an example of the sort of ‘‘relevant deduction’’ that we would like from a natural logic inference (NLI) system of the sort we considered last quarter in discussing Bill MacCartney’s work [105, 104].

We would also like our NLI system to draw the same conclusion if Jan’s memory was less directly applicable, as in the case where Jan is reminded of Fred telling her about hiring a graduate student who worked nights in his convenience store to avoid taking out a student loan to cover her tuition and housing. We’re not so naïve as to believe there is a general or optimal solution to this problem, but we’re hoping that a flexible episodic memory might help to cope with the combinatorics.

10 These models developed by Costas Anastassiou and his team at AIBS and Sean Hill at EPFL consist of networks of reconstructed, multi-compartmental, virtually-instrumented and spiking pyramidal neurons and basket cells, plus ion- and voltage-dependent currents and local field potentials that allow us to generate the same sort of rasters we expect to collect during calcium imaging.

9 I’ll be teaching cs379c Computational Models of the Neocortex again this spring, but with a different focus than in previous years. This generation of computer scientists will be the first in history to have access to brain data in sufficient quantity and quality for large-scale structural and functional connectomics, and this year I’m trying to attract computer science students and computer-savvy engineers and neuroscientists interested in tackling some of the machine-learning and signal-processing challenges in analyzing such data.

In collaboration with the Allen Institute for Brain Science (AIBS), HHMI Janelia Farm, Max Planck Institute for Medical Research and MIT, we are compiling EM (Electron Microscopy) datasets that will enable computer scientists to reconstruct the neural circuits for several model organisms, and co-registered activity recordings using calcium imaging (CI) from which we hope to glean algorithmic insights by fitting various artificial neural network models to account for observed input/output behavior.

We’ll be working with two teams of scientists and engineers who are building the tools to acquire this data. We have several relatively-small (10TB) EM datasets (including ground truth) that students interested in circuit tracing (structural connectomics) can use in projects. Scientists at AIBS have volunteered to help students in understanding the data and technologies used to collect it. In addition, engineers from my team at Google will supply examples of algorithms that have worked well for us.

Inferring function from CI data is more challenging since until recently there haven’t been good datasets to work with. We now have several such datasets provided by our collaborators that can be used in student projects. In addition, we’ll be generating synthetic datasets for cortical circuits of 5-50K neurons using Hodgkin-Huxley models (see footnote 10) developed at AIBS and EPFL. These models and their associated simulators provide a controlled environment in which to experiment with and evaluate machine-learning technologies for functional connectomics.
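
For students who haven't met these models before: a single-compartment Hodgkin-Huxley neuron is only a few lines of code, though the synthetic cortical datasets will of course use far more elaborate multi-compartment variants. A minimal sketch with the classic squid-axon parameters and forward-Euler integration (the stimulus amplitude and duration are arbitrary illustrative choices):

```python
import numpy as np

# Classic Hodgkin-Huxley parameters (mV, ms, uF/cm^2, mS/cm^2).
C, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3
E_Na, E_K, E_L = 50.0, -77.0, -54.387

def rates(V):
    """Voltage-dependent opening/closing rates for the m, h, n gates."""
    a_m = 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
    b_m = 4.0 * np.exp(-(V + 65) / 18)
    a_h = 0.07 * np.exp(-(V + 65) / 20)
    b_h = 1.0 / (1 + np.exp(-(V + 35) / 10))
    a_n = 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
    b_n = 0.125 * np.exp(-(V + 65) / 80)
    return a_m, b_m, a_h, b_h, a_n, b_n

def simulate(I_ext=10.0, T=50.0, dt=0.01):
    """Forward-Euler integration under constant current; returns V(t) in mV."""
    V, m, h, n = -65.0, 0.05, 0.6, 0.32
    trace = []
    for _ in range(int(T / dt)):
        a_m, b_m, a_h, b_h, a_n, b_n = rates(V)
        I_ion = (g_Na * m**3 * h * (V - E_Na)
                 + g_K * n**4 * (V - E_K)
                 + g_L * (V - E_L))
        V += dt * (I_ext - I_ion) / C
        m += dt * (a_m * (1 - m) - b_m * m)
        h += dt * (a_h * (1 - h) - b_h * h)
        n += dt * (a_n * (1 - n) - b_n * n)
        trace.append(V)
    return np.array(trace)

trace = simulate()
spikes = np.sum((trace[1:] > 0) & (trace[:-1] <= 0))  # upward zero crossings
print(spikes)  # a supra-threshold current produces sustained firing
```

Fitting machine-learning models to rasters generated this way, with the ground-truth circuit in hand, is exactly the controlled setting described above.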

The prerequisites are basic high-school biology, good math skills, and familiarity with machine learning. Some background in computer vision and signal processing will be important for projects in structural connectomics. Familiarity with modern artificial neural network technologies is a plus for projects in functional connectomics. Please encourage your qualified students to consider taking the course. As an added incentive, I have a group of extraordinary scientists and engineers lined up to help make it a great course.

11 I asked two of my colleagues on the Neuromancer team to comment on the scalability of techniques like those championed in [116, 19] and here is what they offered. Peter Li, who worked on retina in E.J. Chichilnisky’s lab at the Salk Institute and has a good deal of hands-on practical experience in tracing circuits, had this to say:

[Peter]: I have experience filling neurons with neurobiotin, which is a very similar biotin derivative (slightly smaller than biocytin). You can get beautiful fills, and the binding to avidin is extremely strong, so you can easily augment the labeling in a variety of ways.

Interestingly, neurobiotin is small enough that it passes through many gap junctions (positive charge may also help in some cases), so it is often used in tracer coupling studies. We used it to investigate coupling between primate photoreceptors.

Scale is a bit of an issue. Normally you inject single cells with tracer using micropipettes. Biolistics should be an option for scaling up, but that’s a (literally) scattershot approach. In general with [non-genetically encoded dyes], the problem is how to fill larger numbers of cells without filling so many that you can’t sort anything out anymore. Similar issue with lipophilic dyes like DiI.

For tracer coupling, people do crude things like cut a slash through the tissue with a razor blade and then soak in biotin. You can then see how far into the tissue the dye spreads. For example, in some cases the spread was greater at night than during the day, indicating circadian modulation of gap junctions.

Viren Jain, who worked at HHMI Janelia Farm and has experience tracing individual neurons on a much larger scale than anything previously attempted, had this to say:
[Viren]: If you want sparse reconstruction of large numbers of individual neurons, you might as well go with GFP or variants thereof these days. Janelia is doing that approach on a massive scale, to image nearly every neuron in drosophila using optical microscopy (the main technological innovation being genetic driver lines to control expression only within very specific neurons). This still won’t tell you anything about connectivity, but is useful for cell type analysis and confirming the correctness of EM reconstructions.

12 The Drosophila visual system is composed of the retina and the optic lobes, which are the ganglia where photoreceptors project and where initial processing of visual inputs occurs. The optic lobes are formed by several structures that mediate different behaviors and represent different levels of processing: the lamina, the medulla, and the lobula complex, which is formed by the lobula and the lobula plate. The medulla is composed of columns, each consisting of about sixty cells, which serve as the basic functional unit of the medulla. The HHMI Janelia Farm seven-column dataset mentioned in the text consists of a single sample containing seven such columns and a small border of additional tissue. Source: Morante and Desplan [120].

13 Reaction-diffusion systems are mathematical models which describe how the concentration of one or more substances distributed in space changes under the influence of two processes: local chemical reactions, in which the substances are transformed into one another, and diffusion, which causes the substances to spread out in space. Reaction-diffusion systems arise naturally in chemistry, but they can also describe dynamical processes of a non-chemical nature, with examples found in biology, geology, physics and ecology. Mathematically, reaction-diffusion systems take the form of semi-linear parabolic partial differential equations. (SOURCE)
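As a concrete sketch (my own, with illustrative parameters, not drawn from any of the papers above), here is one of the simplest such systems, the Fisher-KPP equation ∂u/∂t = D ∂²u/∂x² + r u(1 − u), integrated in one dimension by explicit Euler finite differences:

```python
import numpy as np

def reaction_diffusion_step(u, D=0.1, r=1.0, dx=1.0, dt=0.1):
    """One explicit Euler step of the 1-D Fisher-KPP equation
    du/dt = D * d2u/dx2 + r * u * (1 - u),
    a minimal semi-linear parabolic reaction-diffusion model."""
    # Second spatial derivative via a three-point stencil; np.roll wraps
    # around, so we overwrite the endpoints with zero-flux boundaries.
    lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx**2
    lap[0] = (u[1] - u[0]) / dx**2
    lap[-1] = (u[-2] - u[-1]) / dx**2
    return u + dt * (D * lap + r * u * (1 - u))

# A localized seed of concentration spreads by diffusion while the
# reaction term drives occupied sites toward the stable state u = 1.
u = np.zeros(100)
u[45:55] = 0.5
for _ in range(200):
    u = reaction_diffusion_step(u)
```

The traveling-front behavior this produces is the prototype for the pattern-forming dynamics alluded to in the footnote.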

14 It appears that there is much more going on in primary visual cortex than edge detectors. The evidence of recurrent computations, geometry, contour completion, etc. has mounted over the decades since Hubel and Wiesel. I’m primarily aware of this through conversations with Bruno Olshausen and Steven Zucker, and through reading the work of my Brown colleague David Mumford and of Tai Sing Lee, David’s former graduate student and now a professor at Carnegie Mellon. Here are some representative papers from Lee and Mumford [97, 98] and here is the abstract from a NIPS paper by Lawlor and Zucker [95] that hints at how resolving geometric ambiguity might be explained in primary visual cortex by invoking the application of higher-order statistics:

Association field models have attempted to explain human contour grouping performance, and to explain the mean frequency of long-range horizontal connections across cortical columns in V1. However, association fields only depend on the pairwise statistics of edges in natural scenes. We develop a spectral test of the sufficiency of pairwise statistics and show there is significant higher order structure. An analysis using a probabilistic spectral embedding reveals curvature-dependent components.

15 Words are discrete, easily reproducible quanta of information that we have collectively agreed upon and learned to process. They can be conveyed with little or no loss over a noisy channel. They are more efficient than zeros and ones as a basis for spoken language. They are seemingly indivisible and immutable, and yet capable of subtle shades of meaning and readily adapted to describing new phenomena. They provide a solid basis for human communication. Alas, everything above them (thoughts and mental states) or below them (sensations and external stimuli) presents infinitely greater difficulty of interpretation.

16 Let V, W and X be three vector spaces over the same base field F. A bilinear map (SOURCE) is a function

B : V × W → X

such that for any w in W the map

v ↦ B(v, w)

is a linear map from V to X, and for any v in V the map

w ↦ B(v, w)

is a linear map from W to X.

In other words, if we hold the first entry of the bilinear map fixed, while letting the second entry vary, the result is a linear operator, and similarly if we hold the second entry fixed. Note that if we regard the product V × W as a vector space, then B is not a linear transformation of vector spaces (unless V = 0 or W = 0) because, for example B(2(v, w)) = B(2v, 2w) = 2B(v, 2w) = 4B(v, w).
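The scaling behavior in that last example is easy to check numerically. A minimal sketch, using an arbitrary matrix M (illustrative, not from any source above) to define the bilinear map B(v, w) = vᵀMw:

```python
import numpy as np

# A concrete bilinear map B : R^3 x R^2 -> R given by B(v, w) = v^T M w,
# where M is an arbitrary 3 x 2 matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 2))

def B(v, w):
    return v @ M @ w

v, w = rng.standard_normal(3), rng.standard_normal(2)

# Linear in each argument when the other is held fixed...
assert np.isclose(B(2 * v, w), 2 * B(v, w))
assert np.isclose(B(v, 2 * w), 2 * B(v, w))
# ...but not linear on the product space V x W: scaling the pair (v, w)
# by 2 scales the output by 4, exactly as in the footnote.
assert np.isclose(B(2 * v, 2 * w), 4 * B(v, w))
```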

18 The tensor product of two vector spaces V and W, denoted V ⊗ W and also called the tensor direct product, is a way of creating a new vector space analogous to multiplication of integers. The outer product (SOURCE) usually refers to the tensor product of vectors. If you want something like the outer product between an m × n matrix A and a p × q matrix B, you can use the generalization of the outer product, called the Kronecker product (SOURCE) and notated A ⊗ B.

Given two matrices, we can think of them as representing linear maps between vector spaces equipped with a chosen basis. The Kronecker product of the two matrices then represents the tensor product of the two linear maps. For example, if A is an m × n matrix and B is a p × q matrix, then the Kronecker product A ⊗ B is the mp × nq block matrix whose (i, j) block is aijB:

A ⊗ B =
⎡ a11B … a1nB ⎤
⎢  ⋮    ⋱  ⋮  ⎥
⎣ am1B … amnB ⎦

It is worth noting that recursively applied tensor products grow exponentially in dimension, so that for { ui ∈ ℝᵈ : 0 < i ≤ n }, by the associativity of tensor products:

un ⊗ ... ⊗ u3 ⊗ u2 ⊗ u1 = un ⊗ ( ... ( u3 ⊗ ( u2 ⊗ u1 ) ) ... ) = M

where M, once vectorized, has dⁿ entries.
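Both points can be sketched in NumPy, whose kron function implements the Kronecker product (the matrix shapes and the choice d = 3, n = 5 below are arbitrary):

```python
import numpy as np
from functools import reduce

# Kronecker product of an m x n matrix and a p x q matrix is mp x nq.
A = np.arange(6).reshape(2, 3)    # 2 x 3
B = np.arange(8).reshape(4, 2)    # 4 x 2
assert np.kron(A, B).shape == (2 * 4, 3 * 2)

# Recursively applied tensor products of n vectors in R^d grow as d^n.
d, n = 3, 5
us = [np.random.default_rng(i).standard_normal(d) for i in range(n)]
M = reduce(np.kron, us)           # un (x) ... (x) u1, already vectorized
assert M.size == d ** n           # 3^5 = 243 entries
```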

18 Message to Vivek Srikumar:

I’ve been reading your 2014 NIPS paper and I’m a bit puzzled by some of your notation, specifically the use of the integer-valued variables n and N that you use to specify the dimensionality of the weight vector w, the feature vector Φ(x, y) and the input feature vector φ(x).

If I assume that (uppercase) N is used exclusively for the user-defined features that determine the size of φ(x), then (lowercase) n defines the dimensionality of the weight vector and of Φ(x, y); but Φ seems to be a function of Ψ(x, yp, A) and |yp|, the latter of which could be of length anywhere from 1 to M, the total number of labels.

So I conclude it must be that yp = (y0, y1, ..., ym) represents a sparse vector with m ≪ M non-zeros and so therefore:

|Ψ(x, yp, A)| = dᵐ|φ(x)| for all yp in Γx

and furthermore:

|Φ(x, y, A)| is also dᵐ|φ(x)|

though presumably the latter is quite sparse or d is quite small.

Could you confirm or disconfirm this, and, if I’m wrong, provide an alternative interpretation? Thanks.

From: Vivek Srikumar

Thanks for your email! I didn’t quite understand how you got this conclusion:

So I conclude it must be that yp = (y0, y1, ..., ym) represents a sparse vector with m ≪ M non-zeros and so therefore:

But let me try to explain the notation a bit better. The dimensionalities involved are:

  1. n: The weight vector w is n-dimensional. Φ(x, y, A) is also n-dimensional.

  2. M: The number of labels in the problem, corresponding to the set {l1, l2, ..., lM}. We will call this set L.

  3. d: Each label li is associated with a d-dimensional vector ali. These vectors are the columns of the d × M matrix A.

  4. m: Let us take a specific part/factor p in Γx. Suppose this p is associated with m of the outputs in the factor graph. That is, part p has an m-tuple label (y0, y1, ..., ym−1). Each yi is an element of L.

By unrolling the recursion in (3), we have

Ψ(x, yp, A) = ay0ay1 ⊗ ... ⊗ aym−1 ⊗ φ(x)

This is a tensor of order m + 1. The first m elements of this tensor product are all d-dimensional vectors and the last one is |φ(x)|-dimensional.

So the dimensionality of vec(Ψ(x, yp, A)) is dᵐ|φ(x)| for all p ∈ Γx. Note that here m is just the number of variables associated with this part p and is not related to M, the number of labels in the problem.

In the paper, I use the (in hindsight unnecessary) additional indirection of using aly0 to refer to the vector corresponding to the label ly0. That is, the yi’s index into the label set L.
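[Editorial note: to make the dimension counting above concrete, here is a small sketch of my own (the values of d, M and the part labels are arbitrary) confirming that vec(Ψ(x, yp, A)) has dᵐ|φ(x)| entries.]

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
d, M_labels, phi_dim = 4, 6, 10

A = rng.standard_normal((d, M_labels))  # label vectors as columns of A
phi = rng.standard_normal(phi_dim)      # input feature vector phi(x)

# A part p associated with m = 3 outputs, each indexing a label in L.
part_labels = [2, 0, 5]
m = len(part_labels)

# vec(Psi(x, y_p, A)) = a_{y0} (x) a_{y1} (x) a_{y2} (x) phi(x)
psi = reduce(np.kron, [A[:, y] for y in part_labels] + [phi])
assert psi.size == d ** m * phi_dim     # d^m * |phi(x)| = 4^3 * 10 = 640
```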

To: Vivek Srikumar

We’re on the same page up until the penultimate paragraph. You seem to be saying that all the parts in Γx have the same number of labels. Indeed the exact same labels!

Then since we simply sum the vec(Ψ(x,yp, A)), that means |Φ(x, y, A)| = dm|φ(x)| = n = |w| and so I agree that in this case everything works. I had just assumed that a given x might have parts with different label sets.

From: Vivek Srikumar

Ah, not really. The input feature vector is defined to be part specific (φp instead of φ, as in Equation 3). In practice, we could pad the input features with enough zeros. For example, say there are two parts:

  1. part p1 has one label y1 and is associated with φ1, giving d|φ1(x)| features, and

  2. part p2 has two labels (y1, y2) and is associated with φ2, giving d²|φ2(x)| features.

We can pad the vectors vec(Ψ1) and vec(Ψ2) with zeros appropriately so that the two sets of features can be added. If the two feature spaces are completely orthogonal, then the actual feature vector will be of size d|φ1(x)| + d²|φ2(x)|.

We can make this statement more formally by thinking about the basis vectors of the different vector spaces. Say the basis vectors for the label vectors are {e1, e2, ..., ed}, the basis vectors for the range of φ1 are {f1, f2, ...}, with |φ1(x)| elements, and those for φ2 are {g1, g2, ...}, with |φ2(x)| elements.

Then we can say the following about the two feature tensors and their vectorizations:

  1. the tensor product ay1 ⊗ φ1(x) will have the basis tensors {ei ⊗ fj} with d|φ1(x)| elements. vec(ay1 ⊗ φ1(x)) will have the basis vectors {vec(ei ⊗ fj)}, and

  2. the tensor product ay1 ⊗ ay2 ⊗ φ2(x) will have the bases {ei ⊗ ej ⊗ gk} with d²|φ2(x)| elements. And its vectorization will have the basis vectors {vec(ei ⊗ ej ⊗ gk)}.

The full feature vector is the sum of these two vectors. But for the sum to be meaningful, the two vectors should be in the same space. That is, they should have the same basis vectors. To make that happen, we need to assume that the REAL feature space is defined by the basis vectors {vec(ei ⊗ fj)} ∪ {vec(ei ⊗ ej ⊗ gk)}. Let’s call this union FULL. The size of FULL is d|φ1(x)| + d²|φ2(x)| if the two sets do not share any common elements.

There are different ways of making the feature vectors from (1) and (2) exist in the FULL space. One way to easily achieve this is to define the output of the vectorization operator to produce vectors in the FULL space, with zeros for all bases that do not correspond to the corresponding part.

Note that this basically comes for free if we use sparse vector implementations that are internally defined as maps from strings to doubles. Strings then define the bases, and ⊗ on the bases is just string concatenation. I didn’t do this for efficiency reasons, though.
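[Editorial note: Vivek’s string-keyed trick can be sketched in a few lines; all names and values below are illustrative, not from the paper.]

```python
# Sparse feature vectors as maps from basis-name strings to doubles.
# The tensor product on bases is just string concatenation, and absent
# keys are implicitly zero, so padding into the FULL space comes for free.

def tensor(u, v, sep="*"):
    """Tensor product of two string-keyed sparse vectors."""
    return {f"{ku}{sep}{kv}": xu * xv
            for ku, xu in u.items() for kv, xv in v.items()}

def add(u, v):
    """Sum of two sparse vectors living in the FULL space."""
    out = dict(u)
    for k, x in v.items():
        out[k] = out.get(k, 0.0) + x
    return out

a_y1 = {"e1": 0.5, "e3": -1.0}   # label vector for part p1
phi1 = {"f1": 2.0}               # part-specific input features for p1
a_y2 = {"e2": 1.0}               # second label vector for part p2
phi2 = {"g1": 1.0, "g2": 3.0}    # part-specific input features for p2

# The two parts' features occupy disjoint subspaces of FULL, so their
# sum is well defined without any explicit zero-padding.
full = add(tensor(a_y1, phi1), tensor(tensor(a_y1, a_y2), phi2))
```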

To: Vivek Srikumar

Thanks. In my original message asking for clarification, I was thinking of something along the lines of your sparse solution for working within the FULL vector space. I suppose you could also include an L1 term in the loss function to keep the basis vectors sparse. In any case, if your approach works well enough it would be worth spending time on this to make it scale.

I’m working on hierarchical document models implemented as RNNs with LSTM hidden layers. In the case of words, sentences, paragraphs, etc., there are clear syntactic markers that suggest semantic boundaries, but obviously there are also meaningful fragments at the phrase and topic level likely to prove useful.

Completely automated detection of boundaries for such fragments is beyond the state of the art, and research has focused primarily on compositional grouping for parsing, alignment, sentiment analysis and traditional question answering. It may be that, using a method such as the one described in your NIPS paper with enough data, we can reveal more of the fine-grained structure for analysis.

P.S. I found your 2014 EMNLP paper with Jonathan Berant quite interesting and I look forward to following your work more closely in the future.

19 Semantic relations are model theoretic entities, e.g., entailment, contradiction and mutual consistency, which shouldn’t be confused with the binary operators that are part of the syntax of the logic, e.g., ⊃ for implication, ∧ for conjunction and ¬ for negation.

20 HTML 4.0 doesn’t have character codes for the natural-logic-relation symbols used by MacCartney [103] and Bowman et al [20], and so I’m using → for entailment, ← for reverse entailment, ↔ for equivalence, ¦ for alternation, and ≠ for independence.

21 Think about forward chaining producing a sequence of inferences in the form of bound atomic vectors that are fed into an LSTM layer in which each LSTM block is capable of remembering an entire embedding vector — the vector equivalent of a ground atomic formula, but with the mutability and flexibility of an embedding vector. We could use a variant of the NTM model [60] to set aside a round-robin buffer and feed the inferred consequents into the buffer, which is subsequently scanned for relevant information to construct a context to assist response generation. The problem is that as soon as I start thinking along these lines I immediately recall the unpleasant consequences of unfettered forward or backward chaining in applying theorem provers or classical planning systems to anything other than toy domains. Perhaps the same sort of wishful thinking will surface as we try to apply NLI at scale, but I’m hoping the enormous capacity of high-dimensional embedding spaces coupled with the graceful degradation in precision we see in the case of very large language models will carry the day.

22 The prediction process is seeded with a down-sampled, low-resolution, gist-like [167] version of the whole input image annotated by the ‘‘context-network’’ to provide ‘‘sensible hints on where the potentially interesting regions of the image are located’’. At each subsequent stage in the process, the system has a ‘‘foveal’’ view centered on the target location of the last saccade — a high-resolution focal region surrounded by a low-resolution peripheral region.

23 There is always the option of using higher-level modeling tools like Chris Eliasmith’s Spaun [159] (TED) but I’m not comfortable with the accuracy or level of detail of such models. Preferably I would like to simulate at the molecular level and use molecular-scale models to develop simulated instruments to mimic genetically encoded calcium indicators.