Research Discussions:

The following log contains entries starting several months prior to the first day of class, involving colleagues at Brown, Google and Stanford, invited speakers, collaborators, and technical consultants. Each entry contains a mix of technical notes, references and short tutorials on background topics that students may find useful during the course. Entries after the start of class include notes on class discussions, technical supplements and additional references. The entries are listed in reverse chronological order with a bibliography and footnotes at the end.

May 21, 2015

Cho et al [32] observe that while existing RNN-based encoder-decoder (RNN-ENCDEC) models for machine translation perform relatively well on short sentences without unknown words, performance degrades rapidly as the length of the sentence and the number of unknown words increase. They also claim and provide evidence that their recently-introduced gated recursive convolutional neural network (GATED-RCNN) models ‘‘learn a grammatical structure of a sentence automatically.’’ Both RNN-ENCDEC and GATED-RCNN models exhibit problems with longer sentences containing unknown words.

In their experiments, all models were trained end-to-end including the 620-dimensional word embeddings¹. The authors employ a variable-width beam search that accounts for the [predicted] sentence length and normalize the log probabilities of candidate translations with respect to the length of the translation². The vocabularies used in their comparisons consist of the 30,000 most commonly occurring words in both languages, English and French, and the authors run separate experiments using training data with and without unknown words. In both cases (data with / without unknowns), RNNs with gated hidden units performed better than gated recursive convolutional neural networks (BLEU scores³ of 14/23 and 10/18 respectively), and a conventional statistical machine translation (SMT) system (MOSES) performed better than either of the neural models (33/66).
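To make the length normalization concrete, here is a minimal sketch. It assumes the simplest possible normalization, dividing the summed log probability by the number of tokens; the exact form the authors use may differ, and the candidate probabilities below are made up for illustration.

```python
import math

def length_normalized_score(token_log_probs):
    """Average log probability per token of one candidate translation."""
    if not token_log_probs:
        raise ValueError("candidate must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)

# Two hypothetical candidates: a short one and a longer, more confident one.
short_candidate = [math.log(0.5)] * 2
long_candidate = [math.log(0.7)] * 5

# Raw scores are sums of log probabilities, so every extra token costs something.
raw_short = sum(short_candidate)
raw_long = sum(long_candidate)

# Normalized scores compare candidates per token instead.
norm_short = length_normalized_score(short_candidate)
norm_long = length_normalized_score(long_candidate)
```

With these toy numbers the raw score prefers the short candidate while the normalized score prefers the long one, which is the bias the normalization is meant to remove.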

The authors’ take-home message is that we need to scale up memory and computation to handle larger models and, in particular, larger vocabularies, and that more research is needed to prevent the neural models from under-performing with long sentences. The latter is a clear invitation to look at their recent work [7], in which they describe a neural network model that performs comparably to the existing state-of-the-art phrase-based system on English-to-French translation, with the added bonus of qualitatively pleasing soft word-and-phrase alignments. I gave a short synopsis of Bahdanau et al [7] in the last post, and here I provide just the relevant mathematics behind their technical contribution.

In RNN-ENCDEC models like Cho et al [31] and Sutskever et al [203], the decoder is trained to predict the next word $y_t$ given the context vector $c$ and all the previously predicted words $\{ y_1, \ldots, y_{t-1} \}$. The probability over the translation $y$ is defined by decomposing the joint probability into a sequence of conditionals:

$$ p(y) = \prod_{t=1}^{T} p( y_t \mid \{ y_1, \ldots, y_{t-1} \}, c ) $$

The contribution of Bahdanau et al is to have the encoder produce a set of embeddings that the decoder can ‘‘search’’ through in order to provide the most relevant information for predicting each word in the output sentence / translation. The following graphic from their paper illustrates the basic architecture and, simultaneously and potentially confusingly, the process by which the decoder generates a translation from an input sentence. The remainder of this log entry attempts to explain what’s going on, mostly by retelling the authors’ explanation in terms I find a little easier to understand:

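The authors use the term annotation to refer to the output embeddings produced by the encoder in processing the input sentence. There are $T_x$ such annotations, one for each word in the input:

$$ ( h_1, \ldots, h_{T_x} ) $$

Each conditional probability in the decoder is then defined as

$$ p( y_i \mid y_1, \ldots, y_{i-1}, x ) = g( y_{i-1}, s_i, c_i ) $$

where $s_i = f( s_{i-1}, y_{i-1}, c_i )$ is the hidden state of the decoder at time $i$ and the context $c_i$ for predicting $y_i$ is computed as a weighted sum of the annotations $\{ h_j \}$ as follows:

$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$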

The weight $\alpha_{ij}$ for each annotation defines its contribution to the $i$-th context and it is by adjusting the set of these weights $\{ \alpha_{ij} \}$ that the model searches for the most relevant information to aid in predicting the next word. The $\alpha_{ij}$ are computed as follows:

$$ \alpha_{ij} = \frac{\exp( e_{ij} )}{\sum_{k=1}^{T_x} \exp( e_{ik} )} $$

where

$$ e_{ij} = a( s_{i-1}, h_j ) $$

is an alignment model realized as a feedforward network that takes as input the hidden state $s_{i-1}$ of the decoder just before emitting $y_i$ and the $j$-th encoder output embedding (annotation) $h_j$ of the input sentence⁴. The alignment score $e_{ij}$ reflects how well the inputs around position $j$ and the output at position $i$ match. The feedforward network is trained jointly with the rest of the model.
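For my own benefit, here is a minimal NumPy sketch of a single decoding step of this soft search. It assumes the alignment network has one tanh hidden layer, and the parameter names W_a, U_a and v_a, the toy sizes and the random values are all illustrative assumptions rather than anything taken from the GroundHog code.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def attention_context(s_prev, annotations, W_a, U_a, v_a):
    """Return (context, weights) for one decoding step.

    s_prev:      decoder state just before emitting the next word, shape (n,)
    annotations: encoder annotations, one row per input word, shape (Tx, m)
    W_a, U_a, v_a: alignment-network parameters (names are assumptions)
    """
    # One alignment score per annotation: v_a . tanh(W_a s_prev + U_a h_j)
    hidden = np.tanh(W_a @ s_prev + annotations @ U_a.T)
    scores = hidden @ v_a
    weights = softmax(scores)        # non-negative and summing to one
    context = weights @ annotations  # weighted sum of the annotations
    return context, weights

rng = np.random.default_rng(0)
Tx, n, m, k = 6, 8, 10, 5            # toy sizes, chosen arbitrarily
annotations = rng.normal(size=(Tx, m))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(k, n))
U_a = rng.normal(size=(k, m))
v_a = rng.normal(size=k)

context, weights = attention_context(s_prev, annotations, W_a, U_a, v_a)
```

In a real model the parameters would be learned jointly with the encoder and decoder; here they are random, so only the shapes and the normalization of the weights are meaningful.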

There are many more details provided in the appendix of the arXiv version of the paper, which has been accepted at ICLR 2015 for oral presentation, but the above summary pretty much explains the basic architecture and the steps required in producing a translation of an input sentence. If the supplement doesn’t suffice (and there do appear to be ambiguities in some of the variable subscripts), implementations of both models built on top of Theano are available online in a GitHub repository called GroundHog.

May 19, 2015

Two interesting ideas are coming together to address some of the problems that I described in my note earlier this week. The first has to do with the emergence of convolutional architectures for neural networks operating on continuous models of language. This is really a powerful idea for all of the reasons that I gave back in November, but didn’t fully embrace at that time⁵.

The second idea involves the use of sparse coding and deconvolution for identifying relationships between variables in graphs and continuous language models and deals with one of the main shortcomings of vanilla convolutional nets⁶. Identifying such relationships is important in modeling genetic pathways and predicting protein structure from sequence alignments [50], and in multi-level hierarchical and compositional models for object recognition and scene analysis that learn cues at different levels of representation necessary to explain and exploit relationships between objects and their contexts [144].

Here are a few of the key papers concerned with convolutional networks for language modeling. If you’re going to read just one paper — or you are planning to stop reading the present document at this point, I recommend you read or at least skim Kalchbrenner, Grefenstette and Blunsom [103]:

The second idea isn’t really associated with a distinct network architecture. In the cases that immediately interest me, the basic idea is primarily concerned with the limitations of baking too much information into a single, fixed-length embedding vector. Encoder–decoder models used for machine translation encode a source sentence as a fixed-length vector from which a decoder generates a translation. Bahdanau et al [7] suggest that the fixed-length vector is the bottleneck in improving the performance of such models. To avoid the bottleneck, their models produce a set of vectors (think of it as a cache) that are kept separate from one another during encoding, and then, in predicting a target word in the midst of decoding, search through this cache looking for information to guide prediction. In the words of the authors:

‘‘The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.’’

An alternative approach akin to what I proposed in an earlier post is to recode the embedding vector in a sparse, over-complete intermediate representation that separates meaning in the text into separate, independent components that can easily be recovered when required during decoding. The Adaptive Deconvolutional Neural Networks of Zeiler et al [232] employ a similar approach for learning the structure of natural images. When I originally formulated my approach, I had a different use case in mind than either Bahdanau et al or Zeiler et al. None of the approaches I’ve read about so far or come up with on my own seem exactly right, but I have to re-read a couple of the papers that I didn’t understand completely.

May 17, 2015

Here are some more ideas on how we might explore the sparse sub fields with stable loci hypothesis that we discussed earlier. In this log entry I’m interested in finding out if current models already provide us with an instantiation of the hypothesized latent capability. The suggestions presented in this entry are significantly easier to implement and diagnostically more salient for an initial exploration than what a full-scale analysis would require.

Start with a fully-trained, encoder-decoder, paraphrase-or-translation model [7, 217]. We need just the encoder for the preliminary experiments. The basic idea is to use the model to generate embedding vectors from pairs of sentences, and then analyze how the embeddings differ. In the simplest case, the two sentences would differ primarily with respect to individual word assignments to sentence parts such as the subject or primary verb. Initially we will be looking for evidence of vector components that are routinely used to encode such sentence parts, but eventually we want to identify the stable loci of semantic features (slots) assuming such loci exist.

The simplest pairs would include single point substitutions in syntactically identical sentences, e.g., ‘‘Alice drove Frank to the store’’ and ‘‘Alice drove Sam to the store’’. Next come pairs involving phrasal substitutions, e.g., ‘‘Alice drove Suzy and Bill to the store’’; multiple substitutions and agreement, e.g., ‘‘Alice drove Suzy to the shopping mall’’ and ‘‘Alice drove her brother crazy to distraction’’; and pairs with recursive structure, e.g., ‘‘Alice drove the hitchhiker she picked up to the mall’’. Finally, there are pairs involving substitutions with similar surface structure but different semantics, e.g., ‘‘Alice drove herself to distraction’’.

Take a pair of sentences, e.g., P = {‘‘Gayle kicked the ball’’, ‘‘Victor kicked the ball’’}, and run each sentence through the encoder producing vectors v1 and v2 respectively. Now compare the two vectors and plot the component-wise squared difference. Pick a threshold or take the K largest differences to specify the non-zero entries in a mask M and then use M to create a new vector v1′ by replacing the components in v2 selected by M with the corresponding components in v1.
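The mask construction might be sketched as follows. The function names top_k_mask and splice are my own, and the vectors are synthetic stand-ins for encoder outputs, built so that v1 and v2 differ in exactly three known components.

```python
import numpy as np

def top_k_mask(v1, v2, k):
    """Boolean mask M marking the k components with the largest squared difference."""
    squared_diff = (v1 - v2) ** 2
    top_indices = np.argsort(squared_diff)[-k:]
    mask = np.zeros(v1.shape, dtype=bool)
    mask[top_indices] = True
    return mask

def splice(v1, v2, mask):
    """Copy v2, then overwrite the masked components with those of v1."""
    return np.where(mask, v1, v2)

# Synthetic stand-ins for two sentence embeddings (not real encoder output).
rng = np.random.default_rng(0)
v2 = rng.normal(size=16)
v1 = v2.copy()
v1[[2, 7, 11]] += 3.0   # pretend these components encode the pivot word

mask = top_k_mask(v1, v2, k=3)
v1_prime = splice(v1, v2, mask)
mask_indices = set(np.flatnonzero(mask).tolist())
```

With real embeddings the differences will not be this clean, so the choice between a threshold and a fixed K is something to tune by looking at the plot of squared differences.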

As a sanity check, cosine_distance(v1,v1′) should be smaller than cosine_distance(v1,v2). As a further check, we can use the selected model to compare v1 and v1′. For instance, use the model decoder and a beam search with a beam width of 1 to generate text sequences, e.g., a paraphrase or translation, from each vector. If the two text sequences differ, then try a larger beam width and generate larger samples for comparison.
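A minimal version of the sanity check, using small hand-made vectors and a hand-picked mask rather than real encoder output. The helper cosine_distance is defined here as one minus the cosine similarity, which is an assumption about what the check should measure.

```python
import numpy as np

def cosine_distance(a, b):
    """One minus the cosine similarity; 0 means the vectors point the same way."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made example: v1 and v2 differ mostly in component 1.
v1 = np.array([1.0, 0.0, 2.0, 1.0])
v2 = np.array([0.8, 3.0, 2.1, 1.0])
mask = np.array([False, True, False, False])

# v1' is v2 with the masked component replaced by the one from v1.
v1_prime = np.where(mask, v1, v2)

d_spliced = cosine_distance(v1, v1_prime)
d_original = cosine_distance(v1, v2)
passes_sanity_check = d_spliced < d_original
```

If the check fails on real embeddings, the mask is probably too small or the information about the pivot word is not localized in the way the hypothesis assumes.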

Next we might try to see if the indices of the non-zero entries in M are peculiar to our choice of ‘‘Gayle’’ and ‘‘Victor’’ or determined by some other criterion. Try the pair Q = {‘‘Veronica kicked the ball’’, ‘‘Alice kicked the ball’’} as well as the four pairs in the cross product of P and Q. If all of the masks created in carrying out these experiments identify roughly the same indices, it will be worth trying to determine if the indices — the locus — of the vector components that encode the pivot words — ‘‘Gayle’’, ‘‘Victor’’, ‘‘Veronica’’ or ‘‘Alice’’ — are a function of the location of the pivot in the input sequence, e.g., try {‘‘Mark yelled at Gayle’’, ‘‘Mark yelled at Victor’’}, or determined by the role of the word in the sentence, e.g., it could play a syntactic role like ‘‘subject’’ or a semantic role like ‘‘actor’’, e.g., {‘‘A confident Gayle gave the valedictory’’,‘‘A confident Victor gave the valedictory’’}.
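One simple way to quantify whether the masks identify roughly the same indices is the Jaccard overlap between their index sets. The sketch below uses made-up index sets for three of the pairs; the measure and the numbers are illustrative assumptions, not results.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection over size of the union of two index sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(masks):
    """Average Jaccard overlap over all pairs of masks in a dict."""
    pairs = list(combinations(masks.values(), 2))
    if not pairs:
        raise ValueError("need at least two masks to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical mask indices for three sentence pairs (made up for illustration).
masks = {
    "Gayle/Victor": {3, 17, 42},
    "Veronica/Alice": {3, 17, 40},
    "Gayle/Alice": {3, 17, 42},
}

overlap = mean_pairwise_jaccard(masks)
```

An overlap near one across the pairs in P, Q and their cross product would suggest a stable locus; an overlap near what random masks of the same size produce would suggest the indices are peculiar to the chosen names.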

If the locus of the pivot words doesn’t appear to be determined by their role in the sentence, or, worse, appears to be completely independent of role, then we need an alternative explanation for how these representations can be encoded so as to produce so much structural variation in surface forms. (I’m not implying by introducing the term surface form in this context that our models have to encode the deep structure of language [33].) If the locus does appear to be determined by role, this should provide some incentive to run additional experiments of a similar sort to those above targeting other roles and experimenting with more complicated sentences.

Long sentences may prove a challenge depending on whether the embeddings are biased to more strongly encode words appearing early (or late for that matter) in the input sequences. The same tools, i.e., linear logistic regression and linear discriminant analysis, mentioned earlier should suffice for these additional exploratory exercises. If we need a lot of such pairs it should be relatively straightforward to parse sample sentences from the training corpus to generate a large set of pairs.

May 13, 2015

After a disastrous first couple of lectures in which the invited speakers participating via Skype or Google Hangouts were basically inaudible and we all had to huddle around a laptop to have any chance of understanding what they said, we made a concerted effort to fix the problem once and for all. This wasn’t the first time we encountered problems relying on Hangouts and Skype for remote participants, especially with invited speakers calling in from Harvard, MIT and Princeton.

We asked our remote invited speakers to send a PDF of their slides ahead of time, find a quiet place to present, and, if at all possible, have a land-line on hand in case the VOIP gods were not smiling on us the day of their presentation. I bought a better-quality powered speaker to provide enough volume for everyone in the room to hear the presentation clearly; the cheap PC stereo speakers I got from Google inventory in 2013 were woefully inadequate. I also bought some cables from Fry’s and Radio Shack that were needed to connect my laptop to the ceiling-mounted overhead projector.

My cell phone and laptop were not quite up to the task of delivering high-quality audio and had to be adapted. A $2.99 smart-phone app was necessary to subvert the Android OS so as to amplify the output of my cell phone — there was no land-line phone in the classroom, and a $9.99 Chrome extension was required to get around my Mac Air trying to limit volume on the audio out — presumably to avoid damaging the hearing of laptop owners.

I used a retractable extension cord from my home to simplify dealing with power cables. The equipment including my laptop was too heavy to comfortably carry the half mile from the nearest visitor parking to my classroom on the Stanford quad, and so I bought a water-proof plastic file container from Walmart for $6.99 and an inexpensive mini hand truck from Fry’s for $29.99. The assembled equipment resulted in a significantly improved experience for the students and the invited speakers. Here’s a photo of the complete rig working on the day Davi Bock from HHMI Janelia Farm participated in class:

May 11, 2015

Prologue

I have been thinking a lot lately about distributed representations for encoding structured knowledge, e.g., parse trees, hierarchies, relational networks, slot-filler schemas — frames, protobuffers, etc. In the process, I’ve come to a deeper appreciation of several of the current technologies used to construct and deploy such representations [1871069816810516783209].

However, using these models for inference often requires symbolic manipulation of one sort or another⁹, and I am interested in a model in which all operations can be carried out in a single neural-network architecture that learns to perform these operations by training a model with back-propagation. There is some evidence to suggest that such a model is possible [2171002351479189193], but it is by no means conclusive.

I’ve focused primarily on two capabilities: (i) pointer following in service to semantic attachment, and (ii) substitution in support of variable binding, assuming that together these basic operations subsume most of the other capabilities that I imagine needing to build conversationally fluent applications. Currently I am exploring an hypothesis concerning how exchangeable pieces of embedding-space vectors might be implemented as a collection of sparse codes¹⁰. This is part of a new research strategy that I’ve been experimenting with, and I expect this strategy needs an introduction to motivate the methodology and provide some historical perspective.

Strategy

I’ll start with a bit of shaky dogma: (i) experience (mine and that of others whose opinions I generally trust) suggests that trying to define your own features is not just a waste of time but more often than not leads to poorer performance than if you just let the model sort things out for itself, since your only job is to present the ‘‘right’’ data with as little ‘‘meddling’’ as possible; (ii) pretty much the same advice applies to fiddling around with the details of network models trying to engineer into their architecture specific capabilities that you think they need, since in this case your only job is to provide a suitable repertoire of architectural motifs and let the data and back-propagation do the rest.

While I’ve pretty much bought into (i) and (ii), I have to admit that historically there has been some value in providing precise hypotheses concerning what features (representations) and mechanisms (computations) might be required or, in the case of modeling neural circuits, what features and mechanisms are actually implemented in a given circuit. Classic examples from neuroscience include the notion that direction-selective circuits in striate cortex implement Gabor filters, that the rest of the ventral visual stream is just a stack of alternating simple and complex cells a la Hubel and Wiesel [90], and that the whole of visual processing is carried out by a homogeneous sheet of computational units called cortical columns [4388155154153]. So my new research strategy is to ignore the dogma and generate hypotheses concerning functions — computational capabilities — that I believe are latent in some of the translation and paraphrase models we’ve been working on at Google. What follows is an example of my following this new strategy.

Hypothesis

There appears to be some evidence that recurrent neural networks can generate language so as to enforce constraints between words and word fragments in text, e.g., subject-verb agreement¹¹. We hypothesize that during training these networks learn rules governing the application of such constraints and construct distributed representations encoding the information necessary to apply those rules. The encoded information enables the resulting models to modify an existing distributed representation, in order, for example, to substitute a different subject, predicate or determiner obeying the encoded constraints. If true, it would be useful to know how and where (within the vectors that define an embedding space) this information is encoded.

There are several possibilities to consider. The information could be stored directly in the embedding vector either (sparsely) in a subset of the vector components or (densely) sprinkled throughout the vector (as in the case of holographic reduced representations [167, 98]). Alternatively¹², it could be stored in a location external to the vector in which case the vector must contain a reference (pointer) to that location (as in the case of applications of spatter codes and modular composite representations [187, 105]).
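To illustrate what the dense alternative looks like in practice, here is a small sketch of role-filler binding in the style of holographic reduced representations, using circular convolution to bind and circular correlation to approximately unbind. The role and filler names, the dimensionality and the random vectors are assumptions chosen for illustration.

```python
import numpy as np

def bind(a, b):
    """Circular convolution of two vectors, computed with the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, cue):
    """Circular correlation: an approximate inverse of bind for random cues."""
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(cue))))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n = 1024  # dimensionality, chosen arbitrarily

def random_vector():
    # Elements drawn with variance 1/n so vectors have roughly unit length.
    return rng.normal(0.0, 1.0 / np.sqrt(n), size=n)

subject_role, verb_role = random_vector(), random_vector()
jack, jill, went = random_vector(), random_vector(), random_vector()

# Every component of the sentence vector carries a little of every binding.
sentence = bind(subject_role, jack) + bind(verb_role, went)

decoded_subject = unbind(sentence, subject_role)
sim_jack = cosine(decoded_subject, jack)
sim_jill = cosine(decoded_subject, jill)
```

The decoded vector is only a noisy copy of the filler, so in practice it is cleaned up by comparing it against a dictionary of known fillers, as the two similarities above do in miniature.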

In this note, we explore the sparse variant of the former possibility. Specifically, we hypothesize that each linguistic constraint that the network and its corresponding embedding space are able to handle has an associated subset of embedding-space vector components that is used exclusively for encoding the information required to enforce the constraint. Intuitively, imagine we encode a user utterance as an embedding vector and subsequently retrieve that vector and wish to adapt the previously encoded utterance to the present circumstances by substituting a new subject appropriate to the current context. We hypothesize that the information required to enforce agreement between the new subject and the old verb can be reliably recovered from a subset of the vector components allocated during training for exactly that purpose. We refer to such a subset as a sparse sub field of an embedding-space vector.

Figure 1: A graphical model (top) of a collection of neurons (vertexes) and connections (edges) indicating active synapses, and the neural-network equivalent (bottom) showing the neurons (highlighted in red and green) that comprise particular sparse sub fields of the network.

There are a couple of concrete steps we might take in exploring these hypotheses: We could engineer neural networks to facilitate one of these strategies, e.g., integrate machinery for retrieving the necessary information from an external memory [71, 219] or modify the objective function to encourage sparsity and locality of reference so as to represent the required information internally within the embedding vectors [133, 73]. In addition, we could attempt to collect evidence supporting one of the alternative hypotheses by measuring how well a given class of network models enforces different agreement constraints, e.g., subject-verb or determiner-noun agreement, and identifying the locus of the embedded information required to apply these constraints.

Evidence

It may prove difficult to predict a particular linguistic constraint that we believe likely to be represented in an embedding vector and then precisely identify its corresponding sparse sub field. The difficulty arises for the simple reason that the neural network is not likely to learn the exact same rules of grammar defined by linguists and grammarians¹³. It still may be possible to find rough correlates of some familiar rules of agreement. The first step is to identify a class of models that produces the desired behavior. State-of-the-art paraphrase [14], paragraph [119] and translation [203] models offer a promising source of candidates.

The next step is to determine if instances of the selected class enforce the target linguistic constraint. One approach is to use a lexicalized dependency parser to generate parse trees for a sample of sentences from the training data that illustrate application of the constraint. In the case of subject-verb agreement, find examples of verbs that have different singular and plural forms, and search for sentences containing one of those verbs plus a noun for a subject that also has different singular and plural forms¹⁴. Assuming we observe good agreement, search for vector components that are highly correlated with the number (singular or plural) of the verb¹⁵.
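The search for correlated components could start with something as simple as the Pearson correlation between each vector component and a binary singular/plural label. The sketch below runs on synthetic data in which one component (index 5, an arbitrary choice) is made to depend on the label; real sentence embeddings would replace the random matrix X.

```python
import numpy as np

def component_label_correlation(X, y):
    """Pearson correlation between each column of X and the binary labels y."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    denom = np.sqrt((X_centered ** 2).sum(axis=0) * (y_centered ** 2).sum())
    denom = np.where(denom == 0, np.inf, denom)  # constant columns get 0
    return (X_centered * y_centered[:, None]).sum(axis=0) / denom

# Synthetic stand-in data: 200 "sentence embeddings" of dimension 32.
rng = np.random.default_rng(0)
num_sentences, dim = 200, 32
y = rng.integers(0, 2, size=num_sentences).astype(float)  # 0 singular, 1 plural
X = rng.normal(size=(num_sentences, dim))
X[:, 5] += 2.0 * y   # plant a component that tracks grammatical number

correlations = component_label_correlation(X, y)
best_component = int(np.argmax(np.abs(correlations)))
```

On real embeddings I would expect the signal to be spread over several components, so ranking components by absolute correlation and inspecting the top few seems more sensible than trusting a single maximum.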

May 9, 2015

The following is mostly my thinking ‘‘out loud’’ in prose, a habit that generally helps me focus attention and sort out hidden misconceptions. Parts of the following text have been lifted and adapted in later entries (those occurring earlier in this chronologically-reverse-ordered document), and the rest is left in place simply to chronicle the trajectory of my thinking and to return to if I lose my way.

Latent Sparse Slot Encoding Hypothesis

I have been thinking a lot lately about distributed representations for encoding structured knowledge, e.g., parse trees, hierarchies, relational networks, slot-filler schemas — frames, protobuffers, etc. In the process, I’ve come to appreciate a number of the technologies used to construct and deploy such representations [1871069816810516783209].

For the most part, the use of these models for inference requires a fair bit of symbolic manipulation, and I am interested in a model in which all operations can be carried out in a single neural-network architecture that learns to perform these operations by training a model with back-propagation. There is evidence to suggest that such a model is possible [2171002351479189193].

I’ve focused my attention primarily on two capabilities: (i) semantic attachment via pointer following, and (ii) substitution and variable binding, assuming that these subsume most of the other capabilities that I can imagine needing. Currently I am exploring an hypothesis concerning how exchangeable pieces of embedding-space vectors can be implemented as a collection of sparse codes. I like to think of these mutable pieces as genes since they can be exchanged between genomes, but the analogy doesn’t quite work since genes typically correspond to dense sequences of base pairs. Nevertheless I’ll return to this analogy later on.

Independent Property Sparse Codes

The vectors that are generated using the skip-gram algorithm [146] are distributed representations of words. The skip-gram model induces a measure of similarity in which words that appear commonly in the same contexts are deemed to be similar¹⁶. As demonstrated by Tomas Mikolov [147], these representations encode information about diverse properties of the words, and, at least in some cases, those properties are represented within the vectors in terms of specific dimensions [4]. This is borne out by the way in which arithmetical operations on embedding vectors are able to perform analogical reasoning, e.g. KING - MALE + FEMALE = QUEEN.
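The analogy arithmetic can be shown with a toy example. The three-dimensional vectors below are hand-made so that one dimension plays the role of royalty and another of gender; they are not skip-gram vectors, and the helper analogy is my own name for a nearest-neighbor lookup by cosine similarity.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy embeddings: dimensions are (royalty, gender, fruit).
vectors = {
    "king":   np.array([1.0,  1.0, 0.0]),
    "queen":  np.array([1.0, -1.0, 0.0]),
    "male":   np.array([0.0,  1.0, 0.0]),
    "female": np.array([0.0, -1.0, 0.0]),
    "apple":  np.array([0.0,  0.0, 1.0]),
}

answer = analogy("king", "male", "female", vectors)
```

In this toy space the gender property lives in a single dimension by construction, which is exactly the kind of clean allocation that real embeddings may or may not exhibit.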

It seems reasonable to expect that there are multiple — perhaps many — dimensions that are employed within vectors to represent properties of words and that all vectors rely on the same vector components to encode these properties. Furthermore, it might be the case that only a small number of vector components are required for any given property and that, to a large extent, the vector components used to represent a particular property P are allocated exclusively — or nearly so — for the purpose of representing P. If this is the case then one could construct a mask to be used to extract just the components used to represent a particular property.

Substitution and Variable Binding

Moreover one might exploit this modularity to perform a version of variable binding. For example we might mask out just the components that represent gender, extract a vector that represents the gender of one person and substitute the gender of another person in its place. If we had a vector representing a sentence like ‘‘Jack went to the store’’, and the representation of sentences included a dimension that represents the subject of the sentence then we might be able to modify the sentence to have a different subject, for example, ‘‘Jill went to the store’’.

It would be interesting to see if we can also perform more complicated substitutions that, for example, might handle subject-verb agreement, article-noun agreement, etc. Properties of words like gender, and properties of sentences like subject are not likely to be represented in all embedding vectors. Words, sentences, paragraphs, and fragments of these may allocate their corresponding vector components differently. Of course we could carry out substitutions symbolically if we could identify the right masks; the question is whether we can learn to perform such substitutions.

Attachment and Pointer Following

Recently there have been several extensions of neural networks that involve coupling them to external memory resources that they can interact with by attentional processes; these attentional models include the memory networks of Weston et al [219] and the neural Turing machines of Graves et al [71]. Subsequent related work focuses on attentional models for visual tasks [224615172].

Analogies to Synthetic Biology

There are different ways of adding information encoded in one vector to another vector — sum, average, convolve, but linking is a very powerful and general method. One way in which this can be done is to allocate a set of vector components to representing the address of the linked information.
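As a toy illustration of linking, the sketch below reserves the last few components of a vector for an address and follows the link by finding the best-matching key in a small memory. The layout, the constant ADDRESS_DIMS, the one-hot keys and the stored values are all hypothetical choices made for this example.

```python
import numpy as np

ADDRESS_DIMS = 4  # hypothetical: the last 4 components hold an address

def make_linked(content, address):
    """Concatenate content components with the components holding the address."""
    return np.concatenate([content, address])

def follow_pointer(vector, memory_keys, memory_values):
    """Return the stored value whose key best matches the address components."""
    address = vector[-ADDRESS_DIMS:]
    scores = memory_keys @ address   # dot product with every key
    return memory_values[int(np.argmax(scores))]

# A tiny memory with one-hot keys and arbitrary stored values.
memory_keys = np.eye(ADDRESS_DIMS)
memory_values = np.arange(12, dtype=float).reshape(ADDRESS_DIMS, 3)

content = np.zeros(5)
linked = make_linked(content, memory_keys[2])   # link to slot 2
retrieved = follow_pointer(linked, memory_keys, memory_values)
```

Replacing the argmax with a softmax over the scores would give a soft, differentiable read, which is closer in spirit to the attentional models than this hard lookup.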

The machinery for binding might be implemented using the attentional mechanisms described in the recent papers by Yoshua Bengio, Alex Graves and Jason Weston among others. The machinery for pointer following might be implemented using some variant of the read/write memory in the neural Turing machine model of Graves et al [71] or the attentional mechanisms mentioned above.

Think more about the conceptual relationships between terms from genomics, e.g., recombinant DNA, hybridization, recombinase, etc, and terms from linear algebra and spectral graph theory, e.g., affinity, sparsity, vector space, rank, etc.

Instead of thinking about vector arithmetic and embedding spaces, it may be useful to think about metaphors from genomics, specifically recombinant DNA, hybridization, recombinase. A gene is a vector whose components are modulo-four integers. The coding region for a given property of the embedded representation is just a sequence of base pairs marked with appropriate promoter, begin and end segments.

Miscellaneous loose ends: Here are some references to papers by Sebastian Seung and his colleagues that struck me as relevant to understanding the neural correlates of the sort of computations discussed in this entry:

‘‘We propose that the neocortex combines digital selection of an active set of neurons with analogue response by dynamically varying the positive feedback inherent in its recurrent connections.’’ [75]

‘‘We show analytically how using neurons with multiple bistable dendritic compartments can enhance the robustness of eye fixations to mistuning while reproducing the approximately linear and continuous relationship between neuronal firing rates and eye position, and the dependence of neuron pair firing rate relationships on the direction of the previous saccade. The response of the model to continuously varying inputs makes testable predictions for the performance of the vestibuloocular reflex. Our results suggest that dendritic bistability could stabilize the persistent neural activity observed in working memory systems.’’ ([65])

May 7, 2015

You have about four weeks left to complete your projects. Final proposals are due on Monday, May 11. By then you should know what project you’ll be working on and have a pretty good idea of how you will tackle the problems addressed in your proposal. Don’t spend a lot of time on the final proposal; at this stage, I’ll be happy if you just clearly state the problem you’re trying to solve and provide a status report on what you’ve accomplished.

For the rest of the quarter, we will only meet as a class on Mondays. Tuesdays will be devoted to ‘‘office hours’’ by which I mean that if you want to meet with me to discuss your project you have three options: we can agree to meet on campus during the regularly scheduled class time, either in Building 100, Room 101K or in one of two cafes: Coupa Cafe in Y2E2 or Bytes Cafe across from Gates. If you want to meet at other times, we can arrange to meet at Google or by Hangout. And, of course, I will do my best to respond quickly to email, especially if you include CS379C in the subject line.

I expect you to attend and participate in the remaining Monday classes. That shouldn’t be a burden, however, as we have three exciting speakers still to present:

Feel free to invite your interested colleagues to any or all of these lectures / discussions. In each case, the speaker will be describing new work that directly relates to several of your projects. Alipasha Vaziri will not be speaking as he is traveling extensively and we couldn’t find a convenient time in his itinerary to fit in a presentation. Ed mentioned a bit about the collaboration between his and Alipasha’s labs. I’m very interested in the technology they’re developing, and I hope we can get Alipasha to participate the next time the course is offered.

May 6, 2015

Following up on the discussion we had after class with Eric Jonas, here’s a new paper from Ian Stevenson’s lab on inferring synaptic connectivity from electrophysiology (PLoS), work from Scott Linderman combining stochastic block models with GLMs (NIPS), and a paper on discovering latent network structures from point process data: (CoRR). Here’s a related paper from Sebastian Seung’s lab — Sümbül et al [201] — published in Frontiers in Neuroscience — the journal also lists some interesting papers under the category of Quantitative Analysis of Neuroanatomy worth scanning (FiN). I also seem to recall that Eric mentioned a special issue of Neuron that Sebastian co-edited on a related topic, perhaps pertaining just to identifying cell types.

May 5, 2015

Here are a few references relating to the problem of inferring functional connections between neurons [198, 148, 53]. Adam mentioned one of the authors, Yuriy Mishchenko, in his talk yesterday. Eric Jonas, who will be visiting tomorrow, will be presenting work he’s done in collaboration with another, Konrad Kording, and may have some interesting insights into how one might learn an affinity matrix from activity data alone.

@article{StevensonetalCURRENT-08,
        title = {Inferring functional connections between neurons},
       author = {Ian H Stevenson and James M Rebesco and Lee E Miller and Konrad P Kording},
      journal = {Current Opinion in Neurobiology},
       volume = 18,
         year = 2008,
        pages = {1–7},
     abstract = {A central question in neuroscience is how interactions between neurons give rise to behavior. In many electrophysiological experiments, the activity of a set of neurons is recorded while sensory stimuli or movement tasks are varied. Tools that aim to reveal underlying interactions between neurons from such data can be extremely useful. Traditionally, neuroscientists have studied these interactions using purely descriptive statistics (cross-correlograms or joint peri-stimulus time histograms). However, the interpretation of such data is often difficult, particularly as the number of recorded neurons grows. Recent research suggests that model-based, maximum likelihood methods can improve these analyses. In addition to estimating neural interactions, application of these techniques has improved decoding of external variables, created novel interpretations of existing electrophysiological data, and may provide new insight into how the brain represents information.},
}
@article{MishchenckoetalAAS-11,
        title = {A Bayesian approach for inferring neuronal connectivity from calcium fluorescent imaging data},
       author = {Mishchenko, Yuriy and Vogelstein, Joshua T. and Paninski, Liam},
      journal = {The Annals of Applied Statistics},
    publisher = {The Institute of Mathematical Statistics},
       volume = 5,
       number = {2B},
         year = 2011,
        pages = {1229-1261},
     abstract = {Deducing the structure of neural circuits is one of the central problems of modern neuroscience. Recently-introduced calcium fluorescent imaging methods permit experimentalists to observe network activity in large populations of neurons, but these techniques provide only indirect observations of neural spike trains, with limited time resolution and signal quality. In this work, we present a Bayesian approach for inferring neural circuitry given this type of imaging data. We model the network activity in terms of a collection of coupled hidden Markov chains, with each chain corresponding to a single neuron in the network and the coupling between the chains reflecting the network’s connectivity matrix. We derive a Monte Carlo Expectation-Maximization algorithm for fitting the model parameters; to obtain the sufficient statistics in a computationally-efficient manner, we introduce a specialized blockwise-Gibbs algorithm for sampling from the joint activity of all observed neurons given the observed fluorescence data. We perform large-scale simulations of randomly connected neuronal networks with biophysically realistic parameters and find that the proposed methods can accurately infer the connectivity in these networks given reasonable experimental and computational constraints. In addition, the estimation accuracy may be improved significantly by incorporating prior knowledge about the sparseness of connectivity in the network, via standard $L_1$ penalization methods.},
}
@incollection{FletcherandRanganNIPS-14,
        title = {Scalable Inference for Neuronal Connectivity from Calcium Imaging},
       author = {Fletcher, Alyson K and Rangan, Sundeep},
    booktitle = {Advances in Neural Information Processing Systems 27},
       editor = {Zoubin Ghahramani and Max Welling and C. Cortes and N.D. Lawrence and K.Q. Weinberger},
         year = 2014,
        pages = {2843-2851},
    publisher = {Curran Associates, Inc.},
     abstract = {Fluorescent calcium imaging provides a potentially powerful tool for inferring connectivity in neural circuits with up to thousands of neurons. However, a key challenge in using calcium imaging for connectivity detection is that current systems often have a temporal response and frame rate that can be orders of magnitude slower than the underlying neural spiking process. Bayesian inference methods based on expectation-maximization (EM) have been proposed to overcome these limitations, but are often computationally demanding since the E-step in the EM procedure typically involves state estimation for a high-dimensional nonlinear dynamical system. In this work, we propose a computationally fast method for the state estimation based on a hybrid of loopy belief propagation and approximate message passing (AMP). The key insight is that a neural system as viewed through calcium imaging can be factorized into simple scalar dynamical systems for each neuron with linear interconnections between the neurons. Using the structure, the updates in the proposed hybrid AMP methodology can be computed by a set of one-dimensional state estimation procedures and linear transforms with the connectivity matrix. This yields a computationally scalable method for inferring connectivity of large neural circuits. Simulations of the method on realistic neural networks demonstrate good accuracy with computation times that are potentially significantly faster than current approaches based on Markov Chain Monte Carlo methods.},
}

May 3, 2015

Here is a recent review article [165] (PDF) on calcium imaging using genetically-encoded calcium indicators (GECI) and two-photon laser scanning microscopy (TPLSM) written by Karel Svoboda and his colleagues at HHMI Janelia Farm Campus. It does a good job of setting out the challenges as we move from 10s or 100s of neurons to thousands or millions, pointing to cell types that are particularly problematic to detect, the limitations of current gene delivery methods, and contrasting current GECI+TPLSM technology with electrophysiology and TPLSM using genetically-encoded voltage indicators.

Two recent papers exploring regulatory pathways implicated in controlling the process of learning and adaptation were highlighted on the Kurzweil AI news feed. The first paper, published in the Journal of Biological Chemistry, describes a gene that controls an important part of the process whereby dendritic spines extend filaments called filopodia that wave about in the extracellular fluid searching for axons that secrete chemicals to attract the filopodia. If one of these filaments contacts an axon, it develops into a dendritic spine forming the basis for memory formation.

The second paper appearing in Nature Neuroscience describes how neurons constantly make use of DNA demethylation to alter gene expression levels thereby controlling synaptic activity. The authors have identified one gene, Tet3, that plays a crucial role in the regulatory pathways that control synaptic activity, and shown that DNA demethylation controls the expression of Tet3.

Justin Sanchez, whom I met last time I was in Washington for a BRAIN-related meeting, runs the SUBNETS program at DARPA and is launching a new program aimed at erasing or restoring memories in traumatized soldiers returning from Afghanistan. I won’t comment on the near-term prospects for success, but the design sketch is interesting for involving non-trivial computation, part of it carried out on the implanted chip and part of it on an external device worn behind the ear. Power is provided by inductive coupling through a pair of RF coils. One coil is attached to the interior of the skull and connected to the implanted chip and the other is attached to the scalp and connected to an electronics package worn behind the ear. The surgical intervention is similar to that required for cochlear implants.

Since Tomáš Mikolov produced his first embedding-vector language model, starting with recurrent networks [145] and later using his more efficient and highly scalable WORD2VEC algorithm [146], researchers have been surprised and then mystified by the ability of these models to perform analogies using vector arithmetic [147]. Several researchers have tried to demystify the models using more rigorous arguments (see Levy and Goldberg [124] for one of the more credible attempts), but have repeatedly fallen short of their goal and, in the end, have only added to the mystery. But now a team of theoretical computer scientists at Princeton led by Sanjeev Arora has come up with what looks like a solid theoretical explanation, along with a new algorithm that mirrors insights from their analysis (Arora et al [4]).

April 29, 2015

I was thinking about Ed Boyden’s presentation on Monday and in particular about the circuit boards he described for pre-processing and compressing the data from 3-D MEAs. In this case, the temporal resolution might be something on the order of 1 kHz, in which case the system would produce about 1G readings per second, assuming a 100 × 100 array of multi-electrode linear probes with at least 100 electrodes per shank for a total of at least (10^2)^3 = 10^6 electrodes for the MEA. For his newest 1K-electrode linear probes that would be 10G readings per second, or about a terabyte every 10 seconds assuming long integers (8 bytes on 64-bit machines). His solution is to perform the spike sorting on ASICs and buffer/stream the data to an array of disks.
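The arithmetic behind these estimates can be checked with a short sketch. The array geometry, sampling rate and word size are the rough figures quoted above; the function and variable names are made up for illustration.

```python
# Back-of-the-envelope data rates for a 3-D multi-electrode array (MEA).
# All figures are the rough estimates quoted in the note, not measured values.

def readings_per_second(probes_per_side, electrodes_per_probe, sample_rate_hz):
    """Total samples per second for a square grid of linear probes."""
    n_electrodes = probes_per_side ** 2 * electrodes_per_probe
    return n_electrodes * sample_rate_hz

BYTES_PER_READING = 8      # long integers on a 64-bit machine
SAMPLE_RATE_HZ = 1_000     # roughly 1 kHz temporal resolution

current = readings_per_second(100, 100, SAMPLE_RATE_HZ)     # 100-electrode shanks
newest = readings_per_second(100, 1_000, SAMPLE_RATE_HZ)    # 1K-electrode shanks

bytes_per_second = newest * BYTES_PER_READING
seconds_per_terabyte = 1e12 / bytes_per_second

print(f"{current:.0e} and {newest:.0e} readings per second")
print(f"{bytes_per_second / 1e9:.0f} GB/s, one terabyte every {seconds_per_terabyte:.1f} s")
```

At 8 bytes per reading the larger array works out to 80 GB per second, which is the same order of magnitude as the "terabyte every 10 seconds" estimate.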

I was thinking about how you might avoid the storage—and some of the pre-processing—overhead by training models concurrently with running experiments and collecting data. We employ many strategies for exploiting parallelism in training our models at Google. Among the most useful strategies, we partition large models into component blocks, employ batch and mini-batch training, perform asynchronous stochastic gradient descent and train multiple instances of the model in parallel, periodically and asynchronously updating a single global set of model parameters using a multi-threaded parameter server [40].

It may be possible to reduce the overall computational cost by eliminating spike sorting using the idea Surya mentioned in his class presentation. I can’t remember the exact details, but he claimed that a linear combination of local observations, e.g., electrode recordings of local field potentials, was sufficient for recovering dynamics and therefore spike sorting was not strictly necessary and could even introduce additional noise and thus reconstruction error17. If one of you has a different recollection please chime in here. It would be relatively easy to solve the equations in parallel on a GPU to achieve high-throughput.

[05-01-15] Surya just got back to me and said he hadn’t published anything on the idea of eliminating spike sorting yet, but that the basic idea comes from the theory of random projections on smooth manifolds [9] and the implications of the theory for machine learning and compressed sensing. As far as the application to spike sorting is concerned, Surya writes that ‘‘the basic idea is if the neural manifold is random, then superimposing neurons on a single electrode is itself a random projection and should preserve information about the neural manifold.’’ He suggests that this observation explains the results in the following paper [24].
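To make the random-projection intuition concrete, here is a minimal sketch, not Surya's analysis. It assumes a hypothetical low-dimensional "neural manifold" (a closed curve traced out by a single latent variable), models each electrode as a Gaussian random weighted sum of neurons, and checks how well pairwise distances along the curve survive the projection. The sizes, the tuning model and the Gaussian mixing matrix are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, n_electrodes, n_points = 1000, 400, 200

# Hypothetical smooth neural manifold: a closed curve parameterized by theta,
# with each neuron tuned to the first three harmonics of theta. Values are
# deviations from a mean rate, so they may be negative.
theta = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
harmonics = np.concatenate(
    [np.stack([np.cos(k * theta), np.sin(k * theta)]) for k in (1, 2, 3)]
)                                                  # shape (6, n_points)
tuning = rng.normal(size=(n_neurons, 6))           # random tuning weights
rates = tuning @ harmonics                         # shape (n_neurons, n_points)

# Each electrode records a random weighted sum of neurons (assumed Gaussian).
mixing = rng.normal(size=(n_electrodes, n_neurons)) / np.sqrt(n_electrodes)
recorded = mixing @ rates                          # shape (n_electrodes, n_points)

def pairwise_distances(x):
    """Euclidean distances between all pairs of columns of x."""
    sq = np.sum(x ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x.T @ x
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n_points, k=1)                # distinct pairs only
ratio = pairwise_distances(recorded)[iu] / pairwise_distances(rates)[iu]
print(f"distance ratio after projection: min={ratio.min():.2f}, max={ratio.max():.2f}")
```

If the ratios stay close to one, the geometry of the curve is approximately preserved even though no individual neuron was isolated, which is the sense in which spike sorting might be unnecessary for recovering dynamics.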

To train a model, we would stream batched observations to a large number of model-instance workers that would distribute each batch to a bunch of model-component-block workers that would run a forward pass and then calculate the gradients to feed to the centralized parameter server to update the global model parameters. After each local update, the batch storage is recovered or used to hold the next batch. In principle, there would be no need to use more than the storage required for the current set of batches, assuming that we have enough workers and a fast enough network to handle the deluge of data. We might want to store some data simply to provide the rest of the community with enough, representative data so they could replicate our results.
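The streaming scheme described above can be sketched as a toy: several worker threads pull possibly stale parameters from a shared parameter server, compute a gradient on one mini-batch, push the update and move on to the next batch. This is only an illustration under strong assumptions (a linear model, synthetic data, Python threads standing in for distributed workers); the class and function names are invented and do not correspond to the system cited in [40].

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the single global parameter vector; updates are applied under a lock."""
    def __init__(self, dim, lr):
        self.w = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.w.copy()

    def push(self, grad):
        with self._lock:
            self.w -= self.lr * grad

def worker(server, batches):
    """Process a stream of mini-batches; each batch is used once and dropped."""
    for x, y in batches:
        w = server.pull()                          # possibly stale parameters
        grad = 2.0 * x.T @ (x @ w - y) / len(y)    # gradient of mean squared error
        server.push(grad)

rng = np.random.default_rng(0)
dim = 5
true_w = rng.normal(size=dim)

def make_batches(n_batches, batch_size=32):
    """Stand-in for streamed observations: noisy linear data (an assumption)."""
    batches = []
    for _ in range(n_batches):
        x = rng.normal(size=(batch_size, dim))
        y = x @ true_w + 0.01 * rng.normal(size=batch_size)
        batches.append((x, y))
    return batches

server = ParameterServer(dim, lr=0.05)
# Pre-generated here for simplicity; a real pipeline would stream and discard.
streams = [make_batches(200) for _ in range(4)]
threads = [threading.Thread(target=worker, args=(server, s)) for s in streams]
for t in threads:
    t.start()
for t in threads:
    t.join()

error = np.linalg.norm(server.pull() - true_w)
print(f"parameter error after asynchronous training: {error:.4f}")
```

The point of the design is that no worker ever needs more than its current batch, so storage is bounded by the number of batches in flight rather than by the length of the experiment.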

Note that our goal here is to infer a function that replicates the input-output behavior of the neural circuit. One complication is that in order to train an ANN, we would need to know the inputs and outputs. I’m assuming that by observing the activity in the neural circuit we could determine to a first approximation which locations in, say, a calcium-imaging raster correspond to axonal termini (synaptic boutons) and which locations are the termini of dendritic spines. That may be a naïve assumption on my part given the complications introduced by dendrodendritic synapses and related departures from the classic axodendritic connections, and depending on where the field potentials are measured, e.g., soma versus synapses.

April 25, 2015

To: Mindreaderz
Subject: Project Suggestion from Adam Marblestone

Contributor: Adam Marblestone
Title: Simulate the Rosetta Brain

Adam has an interesting potential project suggestion: designing and evaluating ROSETTA [138] configurations. He’ll be talking about ROSETTA in class Monday, May 4 and you can find the paper on his calendar entry. Here’s a quick sketch of the project he has in mind:

  1. Take an EM connectomic volume, already analyzed.

  2. Annotate it (essentially ‘‘draw’’ on the images) with a bunch of FISSEQ [121] barcodes, at various densities and sub-cellular localizations.

  3. Simulate what that would look like in ExM.

In this way, come up with a set of design constraints in terms of: how many barcodes we need, what density and distribution in the cell we need, what FISSEQ rolony [1] size we need, what spatial resolution we need, etc. You could start with something similar to Figure 3 of Conneconomics (PDF) (based on Yuriy Mishchenko’s work), but then go much further in constraining the design space.

[1] I think this is a typo and Adam meant to type ‘‘polony’’ as in polony sequencing.

Adam would be happy to advise on such a project if you are interested in taking a shot at it.
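One of the design constraints, how many barcodes are needed, can be roughed out before any simulation. The sketch below is a simplification and not part of Adam's proposal: it assumes each cell draws a single barcode uniformly at random from the available pool and asks how long a four-letter barcode must be so that a target fraction of cells carry a barcode shared with no other cell. The function names and the 99% target are hypothetical.

```python
def unique_fraction(n_cells, n_barcodes):
    """Expected fraction of cells whose barcode is shared with no other cell,
    assuming each cell draws one barcode uniformly at random."""
    return (1.0 - 1.0 / n_barcodes) ** (n_cells - 1)

def barcode_length_needed(n_cells, target=0.99, alphabet=4):
    """Smallest barcode length (in bases) whose unique fraction meets the target."""
    length = 1
    while unique_fraction(n_cells, alphabet ** length) < target:
        length += 1
    return length

for n_cells in (10**3, 10**5, 10**8):
    print(n_cells, "cells ->", barcode_length_needed(n_cells), "bases")
```

A full treatment would add the spatial constraints listed above (density, sub-cellular localization and optical resolution), which is where simulating on an annotated EM volume comes in.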

April 24, 2015

Ed Boyden from MIT will be speaking on Monday, 27 April. Ed will focus on his expansion microscopy (ExM) technology and the prospects for light microscopy relying on modern super-resolution image-processing techniques to rival the resolving power of an electron microscope. The primary reading [30] describes ExM and the supplementary reading [39] is concerned with quantifying the limitations of different neural recording technologies in terms of their ability to separate the activity of one neuron from that of its neighbors.

Sebastian Seung from Princeton will join us on Wednesday, 29 April. Sebastian will be talking about neural modeling and structural connectomics. I’ve selected four papers: two that emphasize his theoretical work on (i) the neural basis for reinforcement learning [183] and (ii) how the cortex selects active sets of neurons [75]. The other two papers focus on connectomics: one emphasizing the advantages of densely sampling neurons [184] and the second describing how machine learning can accelerate segmentation [95].

Adam Marblestone, a research scientist in Ed’s lab, will talk on the following Monday, May 4. Before joining Ed’s lab, Adam was a graduate student at Harvard working with George Church where he helped to develop a number of technologies for fluorescent staining using combinatorial codes that leverage high-throughput RNA sequencing to identify a large number of different molecules in a single assay [121, 138, 137, 139]. He is currently working on the team developing ExM.

Eric Jonas will join us on Wednesday, May 6. Eric completed his PhD at MIT and is currently a postdoc at the University of California, Berkeley. He co-founded and was the CEO of Prior Knowledge from its founding through its acquisition by Salesforce at the end of 2012. His collaboration with Konrad Kording has yielded some interesting insights into what can be inferred from even noisy connectomes [97].

The readings for each presentation are on the course calendar pages. Eric Jonas will join us in person; the others will join virtually. Aditya and I have finally worked out a satisfactory solution for audio. By having the presenter call us on a land line, we achieve good voice quality, and we’ve purchased a high-quality powered speaker to amplify the audio output of the presentation laptop and adjust the volume so we don’t have to strain to hear the speaker. If we lose the IP connection, at least we’ll have a copy of the speaker’s slides and will be able to hear the presentation clearly.

April 23, 2015

David Cox’s presentation stimulated a lot of discussion yesterday. Unfortunately most of it was out in the quad after David’s talk. If you have lingering questions for David, I expect he’d be glad to hear from you. His running baseball analogy was inspired. After a review of earlier recording technologies, David described their calcium-indicator and two-photon technologies for recording from rodents and mentioned their earlier work using micro-electrode arrays [36]. Their automated experimental setup allows them to run dozens of rodent experiments every day [236].

In terms of new or improved recording technologies, David mentioned work by Alipasha Vaziri and his colleagues using their wide-field, temporal-focusing imaging technology which, combined with a nuclear-localized calcium indicator, NLS-GCaMP5K, supports unambiguous discrimination of individual neurons in densely packed tissue [169, 180]. He showed how GCaMP technology has been rapidly maturing and enthused over recent improvements in genetically encoded voltage indicators like ASAP1 [196].

Relevant to our interest in inferring the function of neural circuits in the ventral visual stream, he told us how he is building on work out of Jim DiCarlo’s lab comparing human and machine vision [227]. In particular, David described his recent work incorporating constraints from psychophysics into the loss function and regularizer used in training image classifiers to model human visual ability [177]. The results look promising and the research methodology is definitely an intriguing new approach to modeling neural function.

I mentioned in my introduction that David had done some early work in comparing human and machine vision. He applied methods derived from computational genomics for high-throughput screening and employed a clever strategy for training and evaluation to search for biologically-motivated models that rival state-of-the-art image classifiers [166]. He was also one of the first to use GPUs at scale to accelerate the evaluation of thousands of different network topologies in parallel in searching for the best performing models [166, 110].

April 21, 2015

Giret et al [62] describe some of the birdsong work that Joergen presented in yesterday’s meeting with Max Planck. Surya Ganguli, one of the authors, told me about this work in yesterday’s class at Stanford. (PDF) Jerome Lecoq, a postdoc in Mark Schnitzer’s lab, also attended class and we talked about using one of their miniature fluorescence microscopes [61] so that the experimental birds could move about naturally.

Spectral methods and graph algorithms figure prominently in Surya’s analyses. Cliques are fully connected subgraphs that often correspond to computational artifacts like winner-take-all networks. Turns out that it is hard to find small cliques in sparsely connected networks. Spectral methods don’t work well. Deshpande and Montanari [44] describe the problem and provide an efficient algorithm. (PDF)
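For context, here is a minimal sketch of the classical spectral approach in the regime where it does work: a large clique planted in a dense random graph. This is not the Deshpande and Montanari algorithm, and the graph size, clique size and edge probability are assumptions chosen so that the planted clique stands out clearly in the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 100        # graph size and planted clique size (k well above sqrt(n))

# Random graph with edge probability 1/2 and a clique planted on k random vertices.
upper = np.triu(rng.random((n, n)) < 0.5, k=1)
adj = (upper | upper.T).astype(float)
clique = rng.choice(n, size=k, replace=False)
adj[np.ix_(clique, clique)] = 1.0
np.fill_diagonal(adj, 0.0)

# Center the off-diagonal entries so background edges average to zero;
# the clique then appears as an approximately rank-one bump in the spectrum.
centered = adj - 0.5
np.fill_diagonal(centered, 0.0)
eigvals, eigvecs = np.linalg.eigh(centered)        # eigenvalues in ascending order
top = eigvecs[:, -1]                               # eigenvector of the largest eigenvalue

# Guess the clique as the k vertices with the largest absolute entries.
guess = np.argsort(-np.abs(top))[:k]
overlap = len(set(guess.tolist()) & set(clique.tolist()))
print(f"recovered {overlap} of {k} planted vertices")
```

Shrinking k toward the square root of n pushes the clique's eigenvalue into the bulk of the noise spectrum, at which point the top eigenvector no longer carries usable information. That is the difficulty with small cliques noted above.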

Surya also told me about an interesting technique borrowed [194] from non-equilibrium statistical physics for learning deep, generative models consisting of many (thousands of) layers or time steps in the case of recurrent networks. The idea is to ‘‘systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process [and] then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data’’. (PDF)
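The forward half of this idea is easy to illustrate. The sketch below covers only the forward diffusion, not the learned reverse process, and makes its own assumptions: a toy one-dimensional bimodal data distribution, a constant per-step noise level `beta`, and a Gaussian diffusion kernel. After enough steps the two modes are gone and the samples are approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data distribution with obvious structure: two narrow modes at -2 and +2.
n_samples = 10_000
x = rng.choice([-2.0, 2.0], size=n_samples) + 0.1 * rng.normal(size=n_samples)
initial_variance = x.var()

beta = 0.01        # per-step noise variance (assumed constant schedule)
n_steps = 1_000

for _ in range(n_steps):
    # Gaussian diffusion kernel: shrink slightly toward zero, then add noise.
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=n_samples)

print(f"variance before diffusion: {initial_variance:.2f}")
print(f"after diffusion: mean={x.mean():.3f}, variance={x.var():.3f}")
```

Because each step changes the distribution only slightly, the reverse of each step is also close to Gaussian, which is what makes learning the reverse process tractable.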

April 20, 2015

I’m searching for papers that make a credible attempt to model simple and complex cells in terms of cortical columns. My interest was sparked by the Larkum paper [117] that Costas Anastassiou told us about in his class presentation and by subsequently revisiting the Felleman and Van Essen classic paper [51] that appeared in the first issue of Cerebral Cortex including one of the most often reproduced diagrams (Figure 4) in computational neuroscience.

Dileep George developed an interesting biologically-motivated laminar instantiation of the belief-update equations he used in implementing a version of Jeff Hawkins’ Hierarchical Temporal Memory [60]. Dileep’s model was novel for its tackling a rather complex computation, but he was primarily interested in showing how neural circuits could, in principle, be arranged in columns to perform such computations, and, not necessarily, how or whether there actually exist neural circuits in the mammalian neocortex that perform said computations.

Whatever else you might argue, at least Dileep’s model is a computational model that any computer scientist would recognize as such, albeit one that starts with a specific algorithm in mind, rather than starting from what is known about the anatomy, physiology and behavior and attempting to derive an algorithm that obeys the biological constraints. Whether or not it is a biologically plausible model of neural computation is another question altogether. There are many ways of accomplishing the same computational task and, given that inference is intractable, if the brain does solve the belief update equations, then it does so using an efficient approximation that would be quite interesting to discover.

In any case, I was looking for something simpler computationally speaking and more challenging biologically: a detailed hypothesis mapping the computations that Hubel and Wiesel [90, 89] assigned to what they called ‘‘simple’’ and ‘‘complex’’ cells in primary visual cortex to anatomical structures along with some form of corroborating evidence from the literature. I did find a few papers of peripheral relevance [2302065], but less than what I was hoping for.

April 19, 2015

The primary reading [57] for Surya Ganguli’s class discussion on Monday cites a number of interesting related papers, including Ganguli and Sompolinsky [56] which I’ve provided as supplementary reading. This paper invokes the theory of compressed sensing to better understand what can be learned from analyzing neural recordings. The basic strategy of finding a low-rank basis for encoding a high-dimensional signal has instantiations in many sub areas of statistics and machine learning. The following note will provide some context.

We have already encountered one such instantiation when Viren Jain and then Peter Li brought up the idea of employing unsupervised learning to reduce the reliance on labeled data which is more often than not in short supply. Peter mentioned using an autoencoder to learn a model that feeds a high-dimensional feature vector through a lower-dimensional hidden layer to an output layer of the same dimensionality as the input. During training you set the input and output to be the same — your objective is to recover the original signal and in doing so learn a compressed representation.
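A minimal sketch of the bottleneck idea follows, using a linear autoencoder trained by plain gradient descent in NumPy. Everything here is an assumption made for illustration (synthetic data with two latent dimensions, layer sizes, learning rate); it is not the model Peter described.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10-D observations generated from a 2-D latent signal plus noise.
n, d, k = 500, 10, 2
latent = rng.normal(size=(n, k))
x = latent @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))
x = (x - x.mean(axis=0)) / x.std()

# Linear autoencoder: d inputs -> k-unit bottleneck -> d outputs.
w_enc = 0.1 * rng.normal(size=(d, k))
w_dec = 0.1 * rng.normal(size=(k, d))

def loss(w_enc, w_dec):
    """Mean squared reconstruction error per example."""
    return np.mean(np.sum((x @ w_enc @ w_dec - x) ** 2, axis=1))

initial_loss = loss(w_enc, w_dec)
lr = 0.01
for _ in range(2000):
    hidden = x @ w_enc                     # compressed representation
    residual = hidden @ w_dec - x          # the target output is the input itself
    grad_dec = 2.0 * hidden.T @ residual / n
    grad_enc = 2.0 * x.T @ residual @ w_dec.T / n
    w_enc -= lr * grad_enc
    w_dec -= lr * grad_dec
final_loss = loss(w_enc, w_dec)

print(f"reconstruction error: {initial_loss:.3f} -> {final_loss:.3f}")
```

No labels are used anywhere: the only supervision is the input itself, and the two-unit hidden layer is forced to capture the directions that explain most of the variance.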

The low-dimensional hidden layer in the autoencoder method is often referred to as a bottleneck and the associated approach to machine learning as the bottleneck method [207]. As an aside, one of the three authors of [207] is Bill Bialek, a theoretical biophysicist at Princeton who has trained a number of outstanding computational neuroscientists and produced a body of work well worth your time scanning when you have a chance. Here are a few of my favorites: [117916].

Another idea related to compressed sensing is the notion of sparse coding which first appeared in reference to neural coding in the work of Horace Barlow in which he characterized the statistics of visual stimuli as a means to better understand the nature of perception and the neural codes that support memory [11, 10]. Sparse coding became popular in computational neuroscience largely due to the work of Bruno Olshausen and David Field [156, 157].

While there are many learning algorithms, there are only two fundamental methods of learning that we have discovered so far: maximizing margins and minimizing models. Maximizing margins is the method used in support vector machines (SVM) and is theoretically motivated by applying the Vapnik-Chervonenkis (VC) Dimension [215] in the context of Leslie Valiant’s PAC model of learning [214, 20].

The method of minimizing models is related to various model-selection strategies that typically make use of some form of Occam’s Razor, e.g., by applying the minimum description length (MDL) principle, or, in statistical machine learning, by using either the Akaike (AIC) or Bayesian (BIC) information criteria. We are assuming in this discussion that learning implies the ability to generalize from seen to unseen examples. In point of fact, this is impossible without additional information about the problem [223, 222], and this often takes the form of a restriction or prior on the family of models considered.

Terms like stochastic gradient descent, expectation maximization, genetic algorithms, nearest neighbor algorithm and maximum-likelihood least squares refer to search methods, loss functions, families of models or all three. In learning artificial neural networks, the search method may be gradient descent, the loss function squared error and the family of models multi-layer perceptrons, but the method of learning is model minimization and takes the form of restrictions on the family of models, e.g., models with only one hidden layer.

April 17, 2015

The readings for all the classes through May 11 are available from their respective entries in the calendar. There you’ll also find the presentations through last Wednesday, April 15. Some of the entries have three or four papers. Unless otherwise indicated, you can assume the first listed paper is the primary paper to read for class. Surya Ganguli from Stanford is on for this coming Monday and David Cox from Harvard is participating on Wednesday. Get the most out of these opportunities; take some time to read the papers this weekend and come to class prepared to ask questions.

Costas’ calendar entry now includes a sample dataset from one of his cortical models in the format of a tar ball consisting of simulated intra- and extra-cellular traces, spike-timing rasters, and movies illustrating the structure and dynamics of the simulated tissue sample. As Costas pointed out in class, he is interested in partnering with students to produce alternative datasets that might be better suited to functional analysis or exhibit different dynamics. The tar ball also includes Python scripts for reformatting the data and extracting content from the binary files.

April 16, 2015

Peter Li’s slides and references from yesterday’s class are available here. Up-to-date course notes are available on Mindreaderz and here. The HHMI Janelia FIB-SEM Drosophila data that Peter mentioned is available here and includes ground truth and the original grey scale EM data. If you’re interested, tell Aditya you’d like to play with it, and he’ll download it to the Stanford servers. We may be able to get the full seven-column dataset for class projects, but experiment with the smaller dataset first since even the single-column dataset is rather large.

A couple of Google jokers made the following addition to one of the bathrooms on the Google Quad Campus: entering the men’s room you see this fellow lurking at the back near the toilet stalls, curious you move nearer for a better look, and then, peering close at the helmet visor half expecting to find it occupied, sure enough you see this note.

April 15, 2015

Costas Anastassiou’s slides and copies of papers relevant to our discussion in class are linked directly off his calendar page here. I’ve included the explanatory notes that were included in his last posting here:

Here are the two citations I mentioned right in the end: The first paper (Shai et al [185]) presents the putative role of active dendrites and the Ca-hotzone for encoding. It has both experiments and modeling. Especially in the end we suggest that the absence of the active, Ca-hotzone results in fundamentally different types of computation occurring at the single-neuron level. The second paper (Larkum [117]) is a beautiful theory of how active properties of dendrites can have functional and behavioral roles. Hopefully, the network simulations we discussed today will soon (or not so) be able to test some of these theories. Finally, here is a rather extensive review regarding the function and role of active dendrites: Major et al [134].

April 13, 2015

I want you to appreciate the opportunities that models of the sort that Costas and his team have developed open up to you for interesting projects. We have good EM datasets with sufficient ground truth so that you can learn connectomes. We’ll be hearing from Eric Jonas about approaches applying spectral analysis to affinity matrices obtained from connectomes in order to infer interesting properties of the underlying circuits. We also have sources for calcium imaging (CI) data from mouse, fly and fish.

We don’t have ground truth for the CI data (knowledge of the circuits and the action potentials that produced the data), nor do we have a great deal of correlated EM and CI data. That will change soon as AIBS and Janelia continue to make progress. In some sense, however, we will never have enough data, even though we will be drowning in it as we test the limits of modern data storage systems, and we will never have the ‘‘right’’ data, even though we can always gather more.

‘‘Never’’ is a long time, and ‘‘right’’ is a relative term in this context, but the point is that until we have the ability to perturb the system when, where and how we want, we will be at the mercy of those conducting the experiments and gathering the data. Technologies like optogenetics [34, 233], robotic patch clamping [113] and fluorescence endoscopes [99] offer the promise of precise excitation or inhibition, but it will take time to incorporate such technology into efficient work-flows that don’t require prohibitive preparation.

In the meantime, simulations like those Costas, Sean Hill and Henry Markram are developing hold out the promise of allowing us to test the hypothesis — some would substitute ‘‘monstrous conceit’’ for ‘‘hypothesis’’ given the apparent complexity of the task — that, once we have lots of correlated EM and CI data and the machinery to perturb the neural circuits of awake, behaving animals at will, we will be able to unravel the mysteries of their function. While the models Costas talked about are not complete and not likely to be able to reproduce the full spectrum of behavior of the circuits they seek to model, they are arguably good enough to test the above hypothesis, at least to the extent that it bodes well if we can infer non-trivial function from the simulated brain tissue.

Following up on my question at the end of Costas’s talk and his answer, I envision projects that start with a specific model — constituting your target organism, tissue and circuit — from which you can obtain as much simulated CI data as you need for training and testing as well as all the ground truth you could possibly need, including all local field potentials, cell types, circuit reconstructions, neurotransmitters and synapse valence information (excitatory or inhibitory). Armed with this detailed understanding of your target, you can now apply machine learning tools to try to learn the input-output behavior or statistical characterizations of the dynamics and evaluate the resulting models against the ground truth. If you deem it necessary, you can adjust the model parameters, design new experiments, inject perturbations or otherwise control the environment to gain greater insight into your model’s ability to reproduce the behavior of the target organism.
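As a toy illustration of this workflow, the following sketch generates a spike train (the ground truth), converts it into a noisy fluorescence trace, and scores a naive spike detector against the ground truth. The first-order decay model, the parameter values and all function names are assumptions made purely for illustration; they are not the models Costas and his team have developed, which are far more detailed.

```python
import random

def simulate_spikes(n_steps, rate, rng):
    """Bernoulli spike train: 1 with probability `rate` at each time step."""
    return [1 if rng.random() < rate else 0 for _ in range(n_steps)]

def simulate_fluorescence(spikes, decay, amplitude, noise_sd, rng):
    """Assumed calcium model: c[t] = decay * c[t-1] + amplitude * s[t], observed with Gaussian noise."""
    trace, c = [], 0.0
    for s in spikes:
        c = decay * c + amplitude * s
        trace.append(c + rng.gauss(0.0, noise_sd))
    return trace

def detect_spikes(trace, decay, threshold):
    """Naive detector: flag steps where the trace exceeds its expected decay by more than `threshold`."""
    detected, prev = [], 0.0
    for f in trace:
        detected.append(1 if f - decay * prev > threshold else 0)
        prev = f
    return detected

def precision_recall(truth, detected):
    """Score detections against the simulated ground truth."""
    tp = sum(1 for t, d in zip(truth, detected) if t and d)
    fp = sum(1 for t, d in zip(truth, detected) if not t and d)
    fn = sum(1 for t, d in zip(truth, detected) if t and not d)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

rng = random.Random(0)  # fixed seed so the experiment is repeatable
truth = simulate_spikes(n_steps=2000, rate=0.02, rng=rng)
trace = simulate_fluorescence(truth, decay=0.95, amplitude=1.0, noise_sd=0.05, rng=rng)
detected = detect_spikes(trace, decay=0.95, threshold=0.5)
precision, recall = precision_recall(truth, detected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the simulator hands you the true spike train, you can raise `noise_sd` or lower `amplitude` and watch exactly where the detector starts to fail, which is the kind of controlled perturbation that is not yet possible with real tissue.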

If you could do this, and I think Costas, his colleagues at Allen and my team at Google would agree, the implications would be significant as it would demonstrate that at least in principle — modulo the accuracy and complexity of the models Costas’s team have developed — the research programs at Allen, Harvard, Janelia, Max Planck, Stanford, MIT, etc., have the potential to realize their goal of recording from and inferring the function of neural circuits of some size and complexity. The resulting workflow could serve as a complement to experiments on real tissue and ensure to the extent possible that we don’t start from a state of complete ignorance in taking on the considerable challenges of dealing with living tissue.

Tomorrow, Peter Li will be talking about the deep neural networks that we have been developing at Google for classifying voxels, identifying membranes, segmenting cells and tracing their processes. Peter is an expert in developing and testing neural network architectures and his algorithms have surpassed the current state of the art by a substantial amount. He’ll provide a short primer on ANNs and then discuss the various network topologies he’s tried along with what worked and what didn’t. His experience working with E.J. Chichilnisky doing recordings on primate retina, mapping receptive fields of retinal ganglion cells and understanding the structure and function of retinal circuits makes him a great resource.

April 11, 2015

Davi sent a note with some additional papers you might want to at least skim before his lecture:

In addition to the 2011 paper [21], the class should read [112] --- a review of imaging methods. And optionally, read [81] and subsequent follow-on work from that lab. Is this a more efficient approach than EM to answer questions about how function relates to network connectivity? How are the two approaches redundant, and how are they complementary?

April 9, 2015

Adam Marblestone sent me a 2014 Neuron paper [136] that he and Ed Boyden wrote in which they consider the prospects for ‘‘assumption-free brain mapping’’ and draw on examples from Mario’s work to illustrate their premise. They point out the advantages of pairing solution-driven engineers with problem-driven scientists and note that incentives are all working in the right direction with engineers seeking applications for their technologies and scientists seeking technologies to solve their problems.

Adam and Ed suggest that we will need to build bias-free brain mapping technologies that work ‘‘backward from the fundamental properties of the brain and are equal to the challenge of mapping their mechanisms’’, instead of starting from a set of known building blocks and working forward, tacitly assuming your collection of blocks is up to the task of solving currently insoluble problems — which is like the drunk who, despite knowing he lost his keys elsewhere, looks for them under the lamppost because that is the only place where there is enough light to see.

I’ll try to reduce Adam and Ed’s argument to its essence: There are multiple levels at which we can explain how the brain works. In the following we consider four such levels: molecular, cellular, functional and behavioral. Our present state of knowledge at each level varies considerably.

Physics provides powerful tools to explain how molecules interact. Biology has revealed a great deal about cellular processes, but we believe there is more to know before we can give a complete account at either the functional or behavioral level. We can observe molecules and cells with varying degrees of scale and precision.

By functional, we mean algorithmic and computational. Given a collection of molecules, a set of initial conditions and the laws of quantum electrodynamics, we could in principle predict the behavior of those molecules. Similarly, given an algorithmic account of the brain, we could in principle explain how a brain would respond to any stimulus.

If we could describe the brain at the molecular level we could in principle simulate the brain and predict what we would observe if we were able to measure its electrical and chemical properties. Of course, a physical realization of a system is the most efficient way to predict its behavior at the molecular level. We want a more succinct and transparent description.

At the cellular level we would like an explanation in terms of cellular processes like communication, gene expression and respiration. An explanation at this level would afford a bridge between the molecular and functional levels. We are not confident we know all the cellular processes necessary to provide a complete account at the cellular level.

We cannot directly observe function; we have to infer function from our observations of what’s going on at the physical level — the cellular substrate in which the computations are realized. We can observe behavior but the behavioral level does not suffice as an adequate explanation for mind nor as a basis for dealing with its pathologies.

The functional level provides a bridge between the cellular and the behavioral. However, our cellular level understanding is incomplete. Adam and Ed advocate we learn more about the aggregate behavior of molecules by developing technology to observe them directly. Such technology will help us to identify inaccurate or inadequate descriptions of cellular processes.

As our cellular level understanding improves, we will be better prepared to improve our functional account. By being able to directly observe cellular processes at work in awake behaving organisms, we will be less likely to jump to conclusions based on an existing flawed or incomplete theory at the cellular level. Summarizing the above:

I asked Davi Bock if he would release the calcium imaging data they collected for their 2011 paper [21] and he said that if I wanted it I should talk to Clay, but that they only got a decent signal for 14 or so neurons. Davi said that:

You might do much better talking to Wei-Chung Allen Lee (Wei-Chung_Lee@hms.harvard.edu), now an Instructor at Harvard, who stayed after Clay went to the Allen and inherited a fast calcium imaging rig and the first-generation TEM camera array. This let him do some really nice follow-up work to our 2011 paper, and he has a manuscript currently in revision describing it. Instead of using OGB he used GCaMP3 and functionally characterized many more neurons (somata of neurons in layer 2/3 and apical dendrites from layer 5). He finds evidence for like-to-like anatomical connectivity across a number of stimulus parameters, and I could imagine analyzing his calcium imaging + EM data would be much more satisfying (higher N, broader stimulus space) than our 2011 data.

A researcher by the name of Dimitri Perrin from Queensland University of Technology asked me to take a look at a Google Research Award proposal that he was preparing to submit. It was an interesting proposal and when I asked him if there were any papers on the technology he was proposing, he sent me two of his recent papers [202, 205] published in Cell. The titles are tantalizing; those of you working in Karl Deisseroth’s lab might take a quick look and report back to the rest of us.

April 9, 2015

Mario Galarreta gave a great presentation in Monday’s class — his slides are now available on the calendar page — challenging us to be aware of our biases, the dogma that predominates in the field and the conceit that we know more than we do. Then he launched into a deep dive into the details of his research chronicling his fifteen years of doing electrophysiology at Stanford, what he learned, what he didn’t and what we still don’t know and don’t even know that we don’t know. The quotes from Ramón y Cajal might have seemed only relevant to the early 20th century, but they are true today more than ever despite the putative fact that we know a great deal more now than we did in Cajal’s day. This log entry is a bit long and the next three paragraphs a little philosophical, and so if you got this far and are inclined to skip the rest, please fast forward to the last paragraph and pay special attention to the two footnotes that introduce the work of a couple of our speakers.

I’m continually amazed at what people think we know about the brain. Oddly enough, university faculty and graduate students are particularly prone to this sort of exaggeration, perhaps because the textbooks they write or study from emphasize the known and give short shrift to the unknown. Industrial research scientists are somewhat less prone, probably because in developing products such as pharmaceuticals and medical equipment they are constantly constrained by the limitations of what we know and, unlike the academic researcher, cannot opportunistically switch to work on some other problem when faced by a lack of knowledge. It may be a prerequisite for pursuing basic science that one is fundamentally optimistic, but really good scientists can compartmentalize their optimism and their skepticism so as to remain enthusiastic and committed while exercising their critical faculties.

Relevant to the areas of neuroscience we are exploring in this class, Mario’s talk laid bare the ignorance and bias behind statements by noted neuroscientists that we won’t learn anything of value from connectomics and that, in particular, once we have one connectome there will be diminishing returns from obtaining additional connectomes. The technologies for collecting data that we will leverage in class projects are like any scientific instruments: they exploit what we know from current physics, chemistry and biology to yield tantalizing glimpses into natural phenomena, but they require interpretation and, generally, a great deal of cleverness to apply to specific questions. In hindsight, it may seem that a scientist builds an apparatus specifically to answer a particular question, employs the resulting instrument to collect data from experiments that are obvious, and thereby resolves his question. This is hardly ever the case.

In our case, the data will be noisy, incomplete and generally only a rough proxy for what we would really like to know. As computational neuroscientists, our tool box includes all of mathematics, statistics, algorithms, numerical analysis and machine learning. The first rule of exploratory statistical analysis is to know your data — analyze the sources of noise and error, identify outliers, quantify dependencies among variables, etc. The second — and I’m making this one up — rule is to take advantage of what you’ve learned from this analysis to exploit the properties of the data — e.g., that it is normally distributed — and transform the data into a more tractable form — e.g., by applying principal components analysis. Academics have written books about exploratory data analysis, and I don’t expect you to be proficient in this area; my primary advice is to constantly question your assumptions, don’t be afraid to re-frame the problem, come to terms with the data you have, and don’t get distracted by a quixotic quest to find the perfect dataset or solve an intractable problem.
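The ‘‘know your data’’ rule can be made concrete in a few lines. The sketch below flags outliers using the median absolute deviation and measures the linear dependence between two variables with the Pearson correlation coefficient. The cutoff of 3.5, the sample values and the function names are illustrative assumptions, not recommendations tied to any dataset we will use in class.

```python
import math
import statistics

def mad_outliers(values, threshold=3.5):
    """Indices whose modified z-score, based on the median absolute deviation (MAD), exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    # 0.6745 rescales the MAD so that it is comparable to a standard deviation for normal data.
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical response amplitudes with one suspicious measurement.
responses = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 0.8]
print("outlier indices:", mad_outliers(responses))
print("correlation:", pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]))
```

The median-based score is used here instead of the usual mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself; whether that matters for your data is exactly the kind of assumption worth checking.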

I’ll provide a couple of examples to illustrate the above points. The first example involves work by Peiran Gao and Surya Ganguli in which they describe a relatively standard statistical-analysis workflow practiced by computational neuroscientists. They then ask three questions of the sort Mario described in his presentation, i.e., questions that most scientists would not have bothered to ask or wouldn’t see the need to ask given they failed to recognize the irregularities that Peiran and Surya observed. Finally, they develop an intriguing theory that provides answers to the three questions and has far reaching consequences for how we think about computation in the brain18. As another example, here is a recording of a technical talk given at Google by Eric Jonas on his joint work with Konrad Kording looking at extracting cell types and microcircuitry from neural connectomics19. Surya will be joining us on Monday, April 20, and Eric on Monday, May 6. These presentations are relatively late in the quarter, and so if you find their work interesting enough that you want to look into possible related projects, I suggest you read their papers and contact them directly with your questions. The calendar now has the readings for all of the talks through the first week in May.

April 6, 2015

For those of you taking CS379C, please accept the invitation I sent out a couple of days ago inviting you to join the Mindreaderz email group. After this message, all subsequent class notes, relevant papers, project suggestions, etc. will go to Mindreaderz. Administrative announcements, individual and course-related correspondence will use the email addresses provided on the Axess course roster. I don’t expect this will happen, but if the email traffic generated by Mindreaderz gets annoying or distracting, you can always change your delivery option to ‘‘digest’’ and get just one summary message per day. The remainder of this message includes some notes that have been accumulating in my research log and that I’ve been intending to add to the CS379C class discussion page.

If you are curious about the scientific vision for the BRAIN Initiative — BTW the BRAIN acronym stands for Brain Research through Advancing Innovative Neurotechnologies — you might want to check out this report to Francis Collins, the Director of the NIH that was prepared by the Advisory Committee of the BRAIN Initiative led by Cornelia Bargmann and Bill Newsome: BRAIN 2025: A Scientific Vision.

In the summary report produced by the CS379C class of 2013, we were cautiously optimistic that technologies combining the use of light and sound, e.g., photoacoustic spectroscopy and photoacoustic tomography, would have a significant impact on the field of neuroscience in the next couple of years. These technologies work by selectively illuminating the target tissue with electromagnetic energy — typically in the infrared region of the spectrum in order to penetrate deep into the tissue — and then recording the resulting changes in pressure by sensing radiated acoustic energy — typically in the ultrasound range.

We were particularly excited by whole-brain recording technologies such as the neural dust proposal out of Berkeley by Seo et al [181] which was also featured in Marblestone et al [139]. A new paper [229] just out in Nature Methods describes a new technology for photoacoustic microscopy (PAM) that seems particularly promising. In this case, the PAM technology is used for high-speed imaging of the oxygen saturation of hemoglobin, and hence is a possible alternative to fMRI but with higher spatial and temporal resolution:

We present fast functional photoacoustic microscopy for three-dimensional high-resolution, high-speed imaging of the mouse brain, complementary to other imaging modalities. We implemented a single-wavelength pulse-width-based method with a one-dimensional imaging rate of 100 kHz to image blood oxygenation with capillary-level resolution. We applied PAM to image the vascular morphology, blood oxygenation, blood flow and oxygen metabolism in both resting and stimulated states in the mouse brain.

In this class, we will repeatedly return to the question of how best to articulate hypotheses about the function of neural circuits. It is our contention that current approaches fall far short in terms of explanatory value when it comes to describing meso-scale function [140], and that models from machine learning and computer vision originally motivated by results from neurobiology might serve as a source of such hypotheses [38]. Along similar lines, Lim et al [127, 126] ‘‘explore the idea that there are common and general principles that link network structures to biological functions, principles that constrain the design solutions that evolution can converge upon for accomplishing a given cellular task. We describe approaches for classifying networks based on abstract architectures and functions, rather than on the specific molecular components of the networks.’’

Increasingly, computational neuroscientists are applying ideas from statistical mechanics and dynamical systems theory to understanding the statistical properties of ensembles of neurons, with the motivation that the aggregate behavior of the ensemble is best characterized not in terms of the interactions between individual neurons but rather in terms of the interactions between self-organizing, constantly forming and reforming, highly-connected components — cliques in the parlance of graph theory — of the constituent neurons. Here’s a paper by Liam Paninski, Sarah Woolley Ramirez and their colleagues [170] that provides an interesting example of this approach, and we will hear more from Surya Ganguli later in the quarter.

There are now a few academic labs that routinely release their data to the community. It isn’t exactly a common occurrence as yet, but the trend looks promising. The EM data from the Bock et al paper [21] is available at the National Center for Microscopy and Imaging Research in their Cell Centered Database (CCDB) repository. You can find it on the CCDB website by typing "8448" — the dataset ID — into the search window at the upper left hand corner of the splash page. The site also provides viewers for exploring the data. If you think you want to experiment with this data, tell Aditya and me and we’ll consider copying the data to Stanford servers. Some of the datasets are very large and network capacity and disk space while relatively inexpensive — at least the latter — are not free. You might want to hold off for a while before downloading more than a terabyte. At any rate, we want to be good stewards of Stanford computing resources, so please ask Aditya or me before moving a lot of data.

April 1, 2015

The course website and calendar are up to date through April 8. This includes the first two lectures and links to relevant papers, plus entries for Mario’s and Viren’s presentation / discussion sessions, including the readings for those classes. Check it out and if you encounter any dead links or incomprehensible content please tell me — the splash page and its calendar entry are the only ones I can vouch for at this time, but I’ll have the discussion list ready by the end of the weekend. Here are some current news stories and answers to questions from the first lectures:

CMU researchers have used data mining to create a publicly available website that acts like Wikipedia, indexing the decades’ worth of physiological data collected about the billions of neurons in the brain. Researchers at NYU have captured images of dendritic nerve branches that show how mouse brains sort, store, and make sense of information during learning. In a study published online in the journal Nature March 30, the NYU Langone neuroscientists tracked neuronal activity in dendritic nerve branches as the mice learned motor tasks such as how to run forward and backward on a small treadmill. They found that the generation of calcium ion spikes — which appeared in screen images as tiny ‘‘lightning bolts’’ in these dendrites — was tied to strengthening or weakening connections between neurons, hallmarks of learning new information.

The EM neural-tissue-sample-preparation protocols are linked to Wednesday’s calendar entry. For reference here’s an expansion of what I said in class about fixation, contrast agents and heavy metals:

Uranyl [238] — Uranyl acetate (UA) is an acetate salt of uranium. The advantage of UA is that it produces the highest electron density and image contrast as well as imparting a fine grain to the image, owing to uranium’s atomic weight of 238. The uranyl ions bind to proteins and lipids with sialic acid carboxyl groups such as glycoproteins and gangliosides and to nucleic acid phosphate groups of DNA and RNA.

Lead [207] citrate — Lead citrate enhances the contrasting effect for a wide range of cellular structures such as ribosomes, lipid membranes, cytoskeleton and other compartments of the cytoplasm. The enhancement of the contrasting effect depends on the interaction with reduced osmium, since it allows the attachment of lead ions to the polar groups of molecules. Osmium is used routinely as a fixative. Lead citrate also interacts, to a weaker extent, with UA and therefore lead citrate staining is employed after UA staining.

Osmium [190] tetroxide — Osmium tetroxide fixative enhances the contrast. It acts as a fixative as well as an enhancer of contrast during post-staining by interacting with uranyl acetate and lead citrate. It has been indicated that fixation time has an effect on the contrast obtained by uranyl acetate contrasting. A long fixation with osmium tetroxide decreases, e.g., the contrast of chromatin.

As I alluded to in class, beyond the basic physics and chemistry this step is more art than science, but can make a big difference in image quality.

This method was mentioned in Kevin Briggman’s slides in the context of how you introduce DNA into live cells, generally referred to as transfection. I’ve just listed the definition below, but the Wikipedia page for this term does a pretty good job. You might want to follow some of the related links and learn about viral transfection methods that use adenovirus vectors and the family of retroviruses that are popular for neural circuit tracing.

Electroporation is the use of high-voltage electric shocks to introduce DNA into cells. It can be used with most cell types, yields a high frequency of both stable transformation and transient gene expression and, because it requires fewer steps, can be easier than alternate techniques.

My cheesy-serial-section demo included mention of a paper by researchers at HHMI Janelia, Hayworth et al [78], that just came out in Nature Methods describing their new ‘‘hot knife’’ method and why and how it is important in improving their segmentation work using FIB-SEM (Focused Ion Beam Scanning Electron Microscopy). This technology was developed by a team of scientists at Janelia working in Harald Hess’s lab and led by Ken Hayworth20. While it is ideally suited to scaling FIB-SEM, the technology has wider application. I’ll describe the technology in some detail in class today. The abstract of the Nature Methods paper follows:

Focused-ion-beam scanning electron microscopy (FIB-SEM) has become an essential tool for studying neural tissue at resolutions below 10 nm x 10 nm x 10 nm, producing data sets optimized for automatic connectome tracing. We present a technical advance, ultrathick sectioning, which reliably subdivides embedded tissue samples into chunks (20 μm thick) optimally sized and mounted for efficient, parallel FIB-SEM imaging. These chunks are imaged separately and then ‘volume stitched’ back together, producing a final three-dimensional data set suitable for connectome tracing.

I talked about how neurobiology labs today are populated by scientists and engineers from many disciplines whose expertise is often relatively narrow, and that I certainly didn’t expect you to master all the subjects that we’ll touch upon in class. This diversity of people and technical expertise is one of the appealing features of working in the field of neurobiology — you learn something new every day and making connections among ideas from different fields is the source of many great ideas. Here’s an expansion of what I briefly mentioned in the introductory lecture regarding the sort of background I expect for this course:

I am expecting that you will be coming from varied backgrounds. In particular, while coding skills are important, you may not have a lot of experience with machine learning, and, while some familiarity with neurobiology is important, you may not have the same depth of knowledge as a graduate student in neuroscience. I am, however, expecting that you have some depth in either neuroscience — in particular the primary visual cortex — or computer science — in particular machine learning and signal processing. If you’re weak on the former, you should probably review the material in Psychophysics of Vision: Primary Visual Cortex. If you’re weak on the latter, I suggest that you look at the survey paper on deep networks by Jürgen Schmidhuber [178], the ACL tutorial by Socher, Manning and Bengio [192], and the documentation for the Theano (Python) and Torch (Lua) libraries. If you’re interested in getting started working on structural connectomics immediately, you might take a look at the ISB Dataset that Aditya has uploaded to the Stanford servers.

March 23, 2015

Sunday night Jo and I watched a YouTube science documentary called ‘‘Bionics, Transhumanism, and the end of Evolution’’. I have no idea who directed, produced or wrote the screenplay. The person who recommended the video to me thought it was a BBC production, but it definitely isn’t. The video includes commentary by a number of reputable scientists, philosophers and science fiction writers. Ray was featured in several segments. I was impressed with the technology selected for discussion and the relative even handedness of the presentation. Depending on your biases, you may see the future predicted by the commentators as apocalyptic, depressing or wonderfully exciting, but it’s hard for a scientist or engineer not to see it as inevitable.

The documentary starts with a clip from Burning Man on the last night of the event when the giant wooden statue of a standing man is set ablaze to the cheering of an enthusiastic crowd. There is subtly ominous music in the background. It ends with a continuation of the clip showing the statue engulfed in flames and beginning to disintegrate and the crowd cheering even more enthusiastically than in the earlier scene. Against the backdrop of the surging crowd and artful conflagration, Bruce Sterling delivers the following soliloquy:

It’s important to realize that the posthuman epoch is coming. We really do want to violate human limits and we’re getting closer to having the technology to do so, but it’s also important to realize that this is not the end of history. It does not solve any of our other problems, it just creates new problems that are going to intensify, and there’s going to be more than one kind of humanity. The mere fact that you’re no longer human doesn’t mean that you don’t have the same personality problems that you had before. It doesn’t liberate you from yourself, it probably makes you more you not less. You’re not going to clank and beep like robocop, you’re just going to have more abilities and new powers. Dealing with power is troublesome; if you have more power, you have more responsibility not less. — (Bionics, Transhumanism, and the end of Evolution)

As I watched the video, I thought of the Enlightenment and the work of the Scottish, English, German and French philosophers of the time: Hobbes, Locke, Hume, Montesquieu, Rousseau, Condorcet, Diderot, d’Alembert and Voltaire. Their writings and the scholarly biographies chronicling their lives reflect their wit, intelligence, enthusiasm and impatience. They knew they were in the midst of a period of profound change. They realized that the existing social contract was dissolving and that it was their responsibility to forge a new one. They were impatient to enact the changes they judged most appropriate and they were appallingly ignorant of the world they lived in.

Given what there was to know at the time, they were well educated, veritable polymaths compared to most of their contemporaries. They were interested in and held strong opinions about capital markets, competition, conflict, public education, property rights, altruism and morality, the control of technology, the role of government, slavery and women’s rights, just to name a few topics. They had no idea of the impending social cataclysm soon to be wrought by the industrial revolution. I thought about Anthony Pagden’s description [163] of the debate between Denis Diderot, David Hume, Immanuel Kant, Laplace and Montesquieu concerning standing armies and the prospects for armed conflict in the different futures they were imagining.

I saw these 18th century philosophers mirrored in the views expressed by the 21st century scientists and technology enthusiasts interviewed in the video. We have already started to build machines that kill and destroy villages, we have gradually been granting more and more autonomy to these machines and we will cede more as the machines become smarter and better able to carry out our wishes. More than one commentator suggested we shouldn’t worry about the machines turning against us because we will program safeguards into these machines so we can render them inert if they run amok. Most computer scientists would laugh at their ignorance and perhaps shudder at the risks many of today’s leading scientists and engineers are willing to take with their children’s future. I’m caught up in the same frenzy of enthusiasm and excitement and have no credibility to criticize.

P.S. There’s a sequel entitled ‘‘Better, Stronger, Faster: The Future of the Bionic Body’’ which I haven’t seen yet but am tempted to, given the admittedly-low-bar better-than-the-discovery-channel quality of the above-mentioned documentary — you can find it here. On a related note, we met with two of the founders of Nervana Systems, Amir Khosrowshahi and Arjun Bansal — neuroscientists who came from Bruno Olshausen’s and John Donoghue’s labs respectively; Arjun worked with me on probabilistic graphical models of cortex when I was at Brown University — right after their meeting with several partners at Google Ventures. Arjun mentioned the nano-scale implantable neural devices being developed at Berkeley in Jan Rabaey’s lab; here are two representative papers:

March 19, 2015

There are three new (2015) papers just out that extend the use of LSTM models beyond linear chains focusing on the representation of tree structures in NLP applications: Zhu et al [235], Tai et al [100] and Vinyals et al [217]. The Tai et al work is evaluated on semantic relatedness21 and sentiment analysis, where the former task uses the Sentences Involving Compositional Knowledge (SICK) dataset that we encountered in Bill MacCartney’s NLI work.

The Zhu et al paper compares the Recursive Neural Tensor Network of Socher et al [193] with the same model in which the authors have replaced the tensor layer with an LSTM resulting in a recurrent neural network22. Unfortunately, all three papers depend on labeled data — though Vinyals et al augment the relatively scarce human-annotated data using automated parsing technology — in the form of parse trees for training. In the sequel, I’ll focus first on the Tai et al model comparing it with the Zhu et al model, followed by a brief discussion of the Vinyals et al work.

Whereas the standard linear-chain LSTM ‘‘composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step, the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units.’’

Figure: 1 from [100]: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor. Compare with Figure 1 from Zhu et al [235].

In Tree-LSTM units, gating vectors and memory cell updates are potentially dependent on the state of multiple child units. Instead of a single forget gate, the Tree-LSTM unit has one forget gate f_jk for each child k, thereby allowing the Tree-LSTM unit to selectively incorporate information from its children. For example, ‘‘a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.’’

Figure: 2 from [100]: Composing the memory cell c1 and hidden state h1 of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness. Compare with Figure 2 from Zhu et al [235].
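To make the per-child forget gates concrete, here is a minimal sketch of a child-sum Tree-LSTM node in plain Python. It follows my reading of the composition rule described above: the input, output and update gates see the sum of the children's hidden states, while each child gets its own forget gate. The weight initialization, the dimensions and the tuple encoding of trees are all assumptions made for illustration, and no training is shown.

```python
import math
import random


def affine(gate, x, h):
    """Return W x + U h + b for one gate; matrices are lists of rows."""
    W, U, b = gate
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bias
            for w_row, u_row, bias in zip(W, U, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights for the input (i), forget (f), output (o) and update (u) gates."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {name: (matrix(hidden_dim, input_dim),
                   matrix(hidden_dim, hidden_dim),
                   [0.0] * hidden_dim)
            for name in ("i", "f", "o", "u")}


def tree_lstm_node(x, child_states, params):
    """Compose (h_j, c_j) from input x and a list of child (h_k, c_k) pairs."""
    dim = len(params["i"][2])
    h_sum = [sum(h_k[d] for h_k, _ in child_states) for d in range(dim)]
    i = sigmoid(affine(params["i"], x, h_sum))
    o = sigmoid(affine(params["o"], x, h_sum))
    u = [math.tanh(a) for a in affine(params["u"], x, h_sum)]
    # One forget gate per child, computed from that child's own hidden state.
    forget = [sigmoid(affine(params["f"], x, h_k)) for h_k, _ in child_states]
    c = [i[d] * u[d]
         + sum(f_k[d] * c_k[d] for f_k, (_, c_k) in zip(forget, child_states))
         for d in range(dim)]
    h = [o[d] * math.tanh(c[d]) for d in range(dim)]
    return h, c


def encode(tree, params):
    """Recursively encode a tree given as (input_vector, [subtrees])."""
    x, subtrees = tree
    return tree_lstm_node(x, [encode(t, params) for t in subtrees], params)


params = init_params(input_dim=3, hidden_dim=4)
leaves = [([1.0, 0.0, 0.0], []), ([0.0, 1.0, 0.0], []), ([0.0, 0.0, 1.0], [])]
root_h, root_c = encode(([0.5, 0.5, 0.0], leaves), params)
print(root_h)
```

With a single child the node reduces to the familiar chain LSTM step, which is one way to check the sketch against the linear-chain case.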

The Vinyals et al [217] work is perhaps not as novel or obviously tree-like, but their approach to parsing is arguably more interesting and their results more compelling than either of Tai et al or Zhu et al. They use a completely different LSTM architecture, namely the sequence-to-sequence (S2S) LSTM model of Sutskever et al [203]. Vinyals et al augment the S2S architecture in [203] so that it produces a linear encoding — an S-expression — of the parse tree and then use a stack to keep track of the level of nesting.
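The idea of emitting a parse tree as a flat token sequence and recovering the nesting with a stack can be sketched in a few lines. The bracket-token format below (an opening token such as `(NP` and a matching closing token `)NP`) and the tuple encoding of trees are illustrative choices of mine and may differ from the exact encoding used in the paper.

```python
def linearize(tree):
    """Depth-first encoding of a (label, child, ...) tuple tree as a token list."""
    if isinstance(tree, str):          # a leaf is just a word
        return [tree]
    label, *children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens


def delinearize(tokens):
    """Rebuild the tree, using a stack to track the current level of nesting."""
    stack = [[]]       # stack[-1] collects the children of the innermost open constituent
    labels = []        # labels of the currently open constituents
    for tok in tokens:
        if tok.startswith("("):
            labels.append(tok[1:])
            stack.append([])
        elif tok.startswith(")"):
            if not labels or labels[-1] != tok[1:]:
                raise ValueError("unbalanced bracket: " + tok)
            label = labels.pop()
            children = stack.pop()
            stack[-1].append((label, *children))
        else:
            stack[-1].append(tok)
    if labels or len(stack[0]) != 1:
        raise ValueError("sequence does not encode a single tree")
    return stack[0][0]


tree = ("S", ("NP", "John"), ("VP", ("V", "sleeps")))
tokens = linearize(tree)
print(" ".join(tokens))
print(delinearize(tokens) == tree)
```

A sequence model only has to learn to emit the token stream; the stack-based decoder is what rejects outputs whose brackets do not balance.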

In the tradition of the best Google research, Vinyals et al gain advantage by automatically collecting a large supplementary dataset that enables a substantial difference in performance. The authors ‘‘train a deep LSTM model with 34M parameters on a dataset consisting of 90K sentences (2M tokens) obtained from various treebanks and 7M sentences from the web that are automatically parsed with the Berkeley Parser [...]. The additional automatically-parsed data can be seen as an indirect way of injecting domain knowledge into the model.’’ Simple, elegant and eminently practical.

Hopefully this is enough detail to pique your interest. I found the Tai et al paper the clearest with a nice mix of intuition and technical detail. While different in detail, Tai et al and Zhu et al are the closest in spirit of the three. The Vinyals et al paper is simpler and more elegant than either of the other two; the only reason I didn’t rank it higher is that I was prepared for it as a consequence of our thinking along similar lines — I readily admit, however, that their solution is more elegant than any I came up with.

In looking around for papers on hierarchical models that might apply to either Descartes or Neuromancer use cases, I ran across a 2008 paper by Richard Socher that appeared in a symposium on medical imaging. Socher et al [188] present a hierarchical model for segmenting tubular structures observed in low-radiation X-ray images (3-D CT). They apply the model to segmenting blood vessels in angiographic videos. This paper does not use artificial neural networks, deep or shallow, and relies on three machine learning techniques that might provide some interesting ideas for tracing neural circuitry — see the footnote at the end of this sentence for references23.

For your convenience, you can find the BibTeX entries for the papers mentioned above including abstracts by following the footnote at the end of this sentence24. All of the papers but the 2008 Socher et al paper are available from arXiv as 2015 submissions — search for the first author having limited the search to the current year. Socher’s paper is on his DBLP page or among Comaniciu’s publications (PDF). It might be instructive to have Oriol, Ilya or Lukasz present their paper at a Descartes / Neo meeting.

March 15, 2015

Here’s a note that I sent to Demis Hassabis during the weekend asking questions about his work in cognitive neuroscience on the relationship between constructive remembering and imaginative forecasting. I am particularly interested in the connection of his work to what I perceive as the related problems of planning to achieve goals and generating responses to questions:

This weekend I got interested in learning more about the secondary / multi-modal association areas. I pulled the usual volumes I consult when I’m clueless and unsure how to proceed. Bear, Connor and Paradiso [13] and Kandel, Schwartz and Jessell [104] didn’t have much to offer, but then I ran across an article in Gazzaniga [58] by some of your colleagues, Addis, Buckner and Schacter [176] that was interesting and led me to more of your work. I was particularly intrigued with the experiments and results reported in [175, 77].

I’m curious if you know of any work that attempts to provide a computational (algorithmic) explanation of what the circuits described in [175] are actually doing when imagining possible futures. I’m thinking along the lines of the O’Reilly and Frank [159, 158, 160] model of working memory and executive control or Eliasmith and Stewart’s SPAUN architecture [19, 200, 199], both of which support different forms of gating that enable variable binding — and the authors claim are biologically plausible.

I’m generally of the opinion that these high-level descriptions of cortical computation are unlikely to shed much light on cortical circuitry at the microscale, but at least in the case of O’Reilly’s and Eliasmith’s work their models are described in enough detail that a computer scientist has some chance of understanding what the authors mean when they suggest that some area of the brain is, say, capable of variable binding. With those caveats, I think the effort to be clear is well worth the risk of being wrong, since at least then rational people can agree on what they are arguing about.

The tasks you had subjects perform in your experiments were particularly useful for my introspective gedanken experiments in trying to solve variants of the binding problem using cascaded embedding spaces. For example, in imagining a future free of debt, the subject, call her Jan, might be reminded of a friend, Alice, who in similar circumstances got a second job cleaning offices at night. To imagine a debt-free future, Jan might recall the story about Alice, remove the dependence on Alice as the active agent, add herself as the agent, and imagine working extra hours in an otherwise empty office space late at night25.

Assuming an embedding-vector representation, why can’t we retrieve the vector representing the story, subtract the vector for Alice and add the vector for Jan to produce a representation that captures the meaning in the hypothesized future? The answer stems from the fact that we don’t know if the Alice story is relevant to reducing Jan’s debt, we don’t know that Alice is the active agent — or what an ‘‘active agent’’ is for that matter, and we don’t know whether the other dependencies in the Alice story are compatible with Jan’s circumstances. Short of finding a much better semantic-frame parser than currently exists, we can’t easily disassemble and reassemble distributed representations to support the sort of imaginative reconstructive remembering that you describe in [176].
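A toy sketch may help show both why the subtract-and-add recipe is tempting and where it falls short. Everything here is an assumption for illustration: the word vectors are random rather than learned, and a "story" is represented as a plain sum of word vectors, which is far cruder than any real embedding model.

```python
import math
import random


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


rng = random.Random(0)
vocab = ["alice", "jan", "cleans", "offices", "hires"]
emb = {w: [rng.gauss(0.0, 1.0) for _ in range(50)] for w in vocab}


def bag(words):
    """Represent a 'story' as the sum of its word vectors (no roles, no order)."""
    total = [0.0] * 50
    for w in words:
        total = add(total, emb[w])
    return total


story = bag(["alice", "cleans", "offices"])
imagined = add(sub(story, emb["alice"]), emb["jan"])
print(cosine(imagined, bag(["jan", "cleans", "offices"])))  # close to 1: the swap "works"

# The same representation cannot say who the active agent is:
print(cosine(bag(["alice", "hires", "jan"]), bag(["jan", "hires", "alice"])))  # also close to 1
```

The swap succeeds only because the story was built as a sum in the first place. The second comparison shows the cost: the representation carries no trace of which participant is the agent, so there is nothing to tell us that Alice is the one to subtract.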

Perhaps vector addition and subtraction aren’t up to the task; how about (potentially) more expressive tensor operations on connectionist slot-filler representations, e.g., Smolensky [186]? Unfortunately, while Smolensky gives lip service to encoding graphs and slot-and-filler structures, his distributed representations assume that you map more conventional GOFAI representations onto slices of tensors, and in so doing he punts on the parts of the problem I’m most interested in, namely learning how to perform these mappings and manipulations in a fully distributed fashion. Javier Snaider on the Descartes team has developed a technology called Modular Composite Representation (MCR) that addresses some of the shortcomings of Smolensky’s approach but still requires parsing [187]. MCR may be our best bet in the short term but I’m still looking for a better compromise.

March 13, 2015

This week has been a rather fallow period. I put aside the time to think about the connectionist binding problem, since most of research including Neuromancer was on a ski trip. After a lot of isolated thought, I took down the three major neuroscience tomes on my bookshelf: Bear, Connor and Paradiso [13], Kandel, Schwartz and Jessell [104], Gazzaniga [58], and read everything they had to say about the primary and secondary association areas in the cortex. Most of it was review, but, for example, I never knew how much more developed (proportionally larger) are the secondary association areas in human cortex compared to any other mammal.

The main thread that I’m following involves the idea that sensory input moves from the periphery to the primary sensory areas, into the primary (unimodal) association areas, and, finally, the secondary association areas, producing increasingly rich composite representations and, ultimately, complete and integrated episodic memories, and that those memories serve as the basis for conditioning procedural strategies and, at least in the case of primates and some birds and mammals, planning to solve problems in novel situations.

It may be that some types of simple planning / action selection do not require complex machinery for variable binding, alignment and substitution. It’s worth thinking about how a kitten might learn from observing its mother and siblings. Does this involve the sort of creative / reconstructive recall that apparently characterizes much of human planning? Imagine a familiar scene that includes a friend or work colleague. Next imagine subtracting out your friend / colleague and adding in yourself in his / her place. Now ask yourself how this altered memory ‘‘feels’’? Do you ‘‘fit’’ in this familiar but nonetheless fictional (counterfactual) account of the past? Is it ‘‘natural’’? What would it take to make the reconstructed memory feel ‘‘natural’’?

On the whole, however, I was disappointed in what I learned though I might have expected as much given how little is known about the primary visual association area, inferotemporal cortex (IT) [22782162253745226211]. Today I’m taking a break in the hope that my thoughts will coalesce into something useful. I took the time to write a note to my Stanford colleagues asking them to recommend their best students take part in CS379C this Spring26 and caught up with the AIBS team working on the IARPA MICrONS proposal.

March 7, 2015

I wrote up some notes including a few papers and technical reports that should help new recruits to better understand the challenges facing Neuromancer and our strategies for addressing them. The challenges are divided into three categories: (1) connectomics (circuits), (2) recordings (activity), and (3) analyses (function), where the last is the least well defined in terms of agreed-upon outcomes and priorities for pursuing them:

  1. CIRCUITS: Here’s a pretty reasonable extrapolation of existing and emerging technologies leading to economical whole-brain connectomics which I’ve excerpted from [137]. Check out the full document. I think the authors have done a good job including the front-runners as well as some of the most promising alternatives. The time frames for whole-brain connectomes run from two to ten years, depending on the organism and technology. It’s obviously much easier predicting how the technologies of incumbents like Zeiss will fare than the more exotic ideas coming out of the academic labs:

    Due to advances in parallel-beam instrumentation, whole mouse brain electron microscopic image acquisition could cost less than $100 million, with total costs presently limited by image analysis to trace axons through large image stacks. Optical microscopy at 50 to 100 nm isotropic resolution could potentially read combinatorially multiplexed molecular information from individual synapses, which could indicate the identities of the pre-synaptic and post-synaptic cells without relying on axon tracing. An optical approach to whole mouse brain connectomics may be achievable for less than $10 million and could be enabled by emerging technologies to sequence nucleic acids in-situ in fixed tissue via fluorescent microscopy. Novel strategies relying on bulk DNA sequencing, which would extract the connectome without direct imaging of the tissue, could produce a whole mouse brain connectome for $100k to $1 million or a mouse cortical connectome for $10k to $100k. Anticipated further reductions in the cost of DNA sequencing could lead to a $1000 mouse cortical connectome.

    We’re putting most of our money on reconstruction from EM using current and soon-to-be-current technologies like the new Zeiss line of multi-beam microscopes that Winfried Denk is now working with while at the same time developing his extra-wide, perfect-crystal, whole-brain, serial-sectioning diamond-knife [48], but we are also placing side bets on Boyden’s expansion-microscopy technology [30] which we believe is very promising and keeping close tabs on some of the work coming out of the Church [138] and Zador [121] labs.

  2. ACTIVITY: This is an area full of opportunity with lots of new ideas and talent from complementary disciplines. We wrote a technical report on neural recording technologies that is still pretty current [41]. One of my colleagues Adam Marblestone — Ph.D. with George Church and currently a postdoc with Ed Boyden — corralled a group of biologists, chemists, physiologists, physicists, electrical engineers, etc., to put together a somewhat more speculative — understandably so given the additional complexity in working with an awake, behaving organism — extrapolation that is definitely worth your time reading [139].

    For the time being, we are banking on calcium imaging as being the recording technology that is likely to scale to satisfy our requirements. The current GECIs have much improved response kinetics and signal amplitudes compared with earlier generations [94], the necessary GECI-expressing transgenic mouse lines already exist and the Allen Institute has world-class neuroscientists with expertise in working with them. There has been some work on miniature fluorescence microscopes suitable for mounting on the head of a mouse, thereby allowing the animal limited mobility [61], but so far the incumbent GECI and fixed-camera technologies seem way out in front in terms of scale and reliability.

  3. FUNCTION: This is by far the least well explored of the three technical categories. The reason is pretty obvious: we have never had data on the scale that we anticipate from the Allen Institute MindScope Project. There has been speculation about the structure and function of cortical columns, but no compelling evidence to support any of the current hypotheses. We’ve been working with Costas Anastassiou and his team at AIBS in developing simulations of small portions of cortex consisting of 5,000-50,000 neurons, but this doesn’t even account for a single cortical column. We could simulate much larger models at Google, but at this point in time it doesn’t much matter, since, if we wanted to create a model of a small patch of cortex spanning multiple cortical columns using state-of-the-art neural modeling tools, we would be hard pressed to do so given our limited knowledge of cortical cell types, connectivity and dynamics.

    We’re designing a series of progressively more difficult modeling challenges. The first couple of challenges involve learning the input-output functions of a set of artificial neural network (ANN) models. We don’t pretend these models are necessarily good models of biological networks; however, if we can’t learn a reasonably well-behaved network we’ve engineered, there isn’t much sense in trying to learn a real neural network given all the unknowns associated with biological systems. The next set of experiments will make use of the models that Costas’ team is developing. These models consist of networks of reconstructed, multi-compartmental, virtually-instrumented and spiking pyramidal neurons and basket cells, plus ion- and voltage-dependent currents and local field potentials so we can generate the same sort of rasters we expect to collect during calcium imaging. Once again we have a highly-controlled sandbox in which to evaluate machine-learning technologies.

    We hope to start getting recorded activity data from AIBS by the end of summer if not sooner. We may get access to data from experiments carried out by Clay Reid while still at Harvard that we can play with while waiting for MindScope data. Given real data, the first order of business is to see if we can replicate the training data and generalize to the test data. Interpreting success will be challenging; the best anyone can do may be to capture summary statistics of the output or identify emergent, dynamical-system behavior. In order to have any chance of reproducing spiking behavior, we may have to restrict our attention to smaller circuits assuming we can identify their boundaries. We’ll also want to exploit any connectomic, proteomic or transcriptomic information we glean from the fixed and registered tissue after the activity-recording stage. Once we have mouse recordings, we are in terra incognita with much to learn.

March 5, 2015

It is easy to fall into the trap of thinking that just because it is possible to trace individual processes over substantial distances in dense neural tissue, tracing all of the neurons in an entire mammalian brain is just a matter of scale. The mouse brain has ~10^8 neurons and ~10^11 synapses in a volume of ~500 mm^3. Kilometers of neuronal wiring pass through any cubic millimeter of tissue and the relevant anatomical features are on the scale of 100 nm [137].
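A quick back-of-the-envelope calculation shows why scale alone is daunting. The neuron, synapse and volume figures are the ones quoted above; the choice of one voxel per 100 nm feature and one byte per voxel are assumptions of mine for illustration, and real electron-microscopy volumes use considerably finer voxels.

```python
# Figures quoted in the text for the mouse brain.
neurons = 1e8
synapses = 1e11
volume_mm3 = 500.0

# Assumptions for illustration only.
voxel_nm = 100.0        # isotropic voxel edge, matched to the 100 nm feature scale
bytes_per_voxel = 1.0   # a single 8-bit intensity value

synapses_per_neuron = synapses / neurons
voxels_per_mm = 1e6 / voxel_nm                 # 1 mm = 1e6 nm
total_voxels = volume_mm3 * voxels_per_mm ** 3
terabytes = total_voxels * bytes_per_voxel / 1e12

print(f"synapses per neuron: {synapses_per_neuron:.0f}")
print(f"voxels at {voxel_nm:.0f} nm: {total_voxels:.1e}")
print(f"raw image data: {terabytes:.0f} TB")
```

Halving the voxel edge multiplies the data volume by eight, so under these assumptions the storage estimate is best read as a floor rather than a forecast.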

Accurate tracing of individual axonal processes is certainly possible using a number of laboratory techniques. Both anterograde (soma to synapse) and retrograde (synapse to soma) tracing that work by exploiting different methods of labeling and axonal transport are reasonably well developed but still require special care to administer. Some progress has been made by using retroviruses to propagate labels from one neuron to another [142, 26] and now there are transsynaptic anterograde tracers, which can cross the synaptic cleft, labeling multiple neurons along an extended path [143, 22].

These methods suffer from the problem that, while an individual process is relatively easy to trace, once you label all or most of the processes in a given volume, you end up with many of the same problems that surface in tracing processes stained with conventional preparations using the sort of protocols and algorithms we’ve discussed elsewhere. Alternative methods that rely on propagating either unique or one of many distinguishable labels may provide a solution28.

The Zador et al [231] method for attaching unique molecular barcodes to each neuronal connection works by converting the problem of tracing connectivity into a form that can be solved by high-throughput DNA sequencing. A related method leveraging fluorescent in situ nucleic acid sequencing [121] offers similar functionality with more detailed annotation — additional markers for diverse molecules — and a theoretically simpler method for reading off the information encoded in the neural tissue [138].
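The core trick, turning connectivity into something you can read off with a sequencer, can be illustrated with an idealized simulation. The sketch below assumes every neuron carries a distinct random barcode, every synapse yields exactly one read consisting of the pre- and post-synaptic barcodes joined together, and sequencing is error-free. None of the molecular details of the actual protocol are modeled, and all function names are hypothetical.

```python
import random


def make_barcodes(n_neurons, length, rng):
    """Assign each neuron a distinct random nucleotide string."""
    barcodes = {}
    seen = set()
    for neuron in range(n_neurons):
        code = "".join(rng.choice("ACGT") for _ in range(length))
        while code in seen:  # redraw on the (very unlikely) collision
            code = "".join(rng.choice("ACGT") for _ in range(length))
        seen.add(code)
        barcodes[neuron] = code
    return barcodes


def sequence_synapses(synapses, barcodes, rng):
    """One joined barcode pair per synapse, returned in random order as a sequencer would."""
    reads = [barcodes[pre] + barcodes[post] for pre, post in synapses]
    rng.shuffle(reads)
    return reads


def reconstruct(reads, barcodes, length):
    """Recover the set of directed edges from the unordered reads."""
    owner = {code: neuron for neuron, code in barcodes.items()}
    return {(owner[r[:length]], owner[r[length:]]) for r in reads}


rng = random.Random(1)
LENGTH = 20
barcodes = make_barcodes(100, LENGTH, rng)
synapses = {(rng.randrange(100), rng.randrange(100)) for _ in range(300)}
reads = sequence_synapses(sorted(synapses), barcodes, rng)
recovered = reconstruct(reads, barcodes, LENGTH)
print(recovered == synapses)
```

Note that the reads carry no spatial information at all; the wiring diagram is recovered purely by table lookup, which is what makes the approach independent of axon tracing.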

For more on the technical details as well as other alternative technologies, take a look at the Dean et al report [41] produced by the Spring 2013 class of CS379C or the Marblestone et al report [139] describing the physical principles relevant to scaling neural recording.

March 3, 2015

Here’s an email exchange between Costas Anastassiou and me about how we might use his cortical-model simulations. The messages are in reverse chronological order as are all the entries in this log. From: Costas Anastassiou:

Many thanks for the ideas and insights. First, let me agree with you that there are so many unanswered questions about cortical computation even decades after the beautiful work on cortical microcircuits by Martin and Douglas. Yet, in the last decade a spectrum of large-scale experimental efforts have been undertaken with the aim to provide data (especially connectivity) to fill in the blanks. Here are a few examples: Hofer et al, Nature Neuroscience (2011) [87]; Ko et al, Nature (2011) [112]; Ko et al, Nature (2013) [111]; Lien & Scanziani, Nature Neuroscience (2013) [125]; Packer & Yuste, Journal Neuroscience (2011) [162]; Perin et al, PNAS (2011) [164].

Whereas these efforts are heroic and the associated data priceless, it is unclear what the computational scheme they boil down to is. Here is where I think computational modeling can play a key role: to integrate such experimental observations and use multiple approaches to come up with reduction strategies that will help us understand the fundamental types of computations occurring in various stages of visual processing. Regarding the specific questions you asked, here are my responses:

The question then becomes, if one can implement orientation or direction selectivity (by brute, feedforward force), could we then generalize into, say, complex images such as natural scenes. The issue then is that (assuming purely feedforward processing) we would need to co-activate various LGN-neurons in ways that correlate with the image’s characteristics which is something we do not have. So, for anything substantially more complicated than simple sine waves we do not have the computational framework in place to account for — though we are working on it.

Regarding visualizations, I guess the question is: what do you mean when you say ‘‘make sense’’? For example, such visualizations can help us detect epileptic activity, etc. Moreover, we are in the process of creating simulated Ca-imaging movies (dF/F) which is essentially the output gathered in 2-photon experiments in order to study what these experiments really tell us in terms of cortical processing. But you are right, visualizations are also often used in ways that are not informative — this is the reason that before making any of them it would be good to know what we are trying to get out of them.

Here is the email that prompted the above reply:

It will help if we have some baseline expectations to prime the pump with as it were. Here are a few questions to help with basic impedance matching, then we can go from there: To what extent can we model and simulate a collection of cortical columns and their connections? I know that’s a lot of neurons but I’d like to be clear on what we can’t do. I’m guessing this is out of the question for several reasons.

To what extent can we retinotopically map simulated optic tract / LGN input onto contiguous chunks of simulated cortex? Is there any way that we could simulate visual input corresponding to a sine-wave grating? I expect that the answers to these questions are, respectively, ‘‘hardly any extent worth talking about’’ and ‘‘you’ve got to be kidding!’’

Since synaptic strength is not something we can plausibly assign a priori, the tabula rasa version of the network can’t do much at all. If we could apply some model of synaptic plasticity and train one of your models to realize something like Gabor filters that would be a huge result, right?

So what sort of behaviors might we expect to see given the limits of your ability to initialize and thereby preprogram these models? Of course we can and may have to work with random input and just hope that this will stimulate interesting convergent global behaviors that we can then try to infer and replicate in our trained models. How do you imagine generating simulated input? I really have no idea at this point!

Tom

P.S. I answered most of my questions by doing a quick literature search and asking local experts (Viren Jain and Peter Li) on my team. As far as I can tell, the state of the art is pretty dismal. Amazing that it has been over three decades since the first hypotheses concerning cortical columns were made and we still know very little about the circuitry. You’d think someone would have traced a column or its analog in mouse, shrew or even turtle. Helmstaedter et al [80] was the most relevant paper I came across and Boudewijns et al [22] was one of the more innovative. If you have additional suggestions, I’d love to hear them.

February 25, 2015

As time permits over the next month, I’ll be adding notes for the class at Stanford in the Spring quarter. After my presentation to the Stanford Computer Science faculty last week, I thought about how to present this material to a non-neuroscience audience more comfortable with machine-learning and computer-vision technology than diffuse neuromodulation and electron microscopy. Here’s a rough outline for one approach to motivating the problem and getting computer scientists excited about the extraordinary challenges and opportunities awaiting us:

  1. Neuroscience until very recently: patch clamping (gold standard), multi-electrode arrays, Nematodes, Cephalopods, Murinae, Primates;

  2. You are the first generation to have the opportunity to do neuroscience research without ever touching any living or dead model organism;

  3. What if understanding the brain was no more — and no less — difficult than reverse engineering an integrated circuit;

  4. Modern Rutherfords and Faradays — experimentalists with physics, materials science, biochemistry and nanotechnology training;

  5. Why the problem is hard from a theoretical perspective, why it is hard from a practical standpoint, and how it can be made easier;

  6. Persistent hypotheses perceived as facts with no conclusive evidence, e.g., single-cortical-algorithm, symbolic-processing reducibility;

I also thought about how to describe my current strategy for tackling functional connectomics problems in a staged manner so as to maximize the probability that students will be able to successfully complete a project involving synthetic and real data from our collaborators within the time and resource constraints of a busy Spring quarter. Here are my preliminary notes:

Industrial Espionage

Before we get into thinking about whether and how we might infer the function of real neural circuits, let’s briefly consider how we might infer the function computed by an engineered computing device. Suppose you are given recordings of the inputs and outputs of some sort of computing device: an Intel CPU chip, Arduino or Raspberry Pi circuit board, an ASIC or FPGA. Could you infer the function that the device is computing? What if you knew the wiring diagram in addition to the input-output recordings? What if in addition you were able to supply your own inputs and observe the outputs? Would this make the problem any easier?

Note this is basically what a semiconductor manufacturer might do having obtained — legally or otherwise — the latest chip from a rival company. It’s called reverse engineering and is generally considered a form of industrial espionage though the practice is believed to be widespread and considered in some circles to be just a normal part of doing business.

Reverse Engineering ANN

Now consider the related problem of inferring the function computed by an artificial neural network (ANN). It’s worth pointing out that we don’t really know what the different layers of a deep neural network are computing, though it could be enormously useful in debugging and optimizing such networks. Would it help to have recordings of their inputs and outputs without knowing anything about the network structure?

Suppose we are presented with a black box containing a deep neural network of unknown architecture. If we were to record from the unknown network’s inputs and outputs could we learn to replicate its behavior? Recent work on distilling more compact networks from previously trained networks suggests that we could [173, 85]. Could we infer the structure / architecture given just the inputs and outputs? Given the complexity of learning Boolean circuits, I would guess not [108].
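As a toy version of replicating a black box from recordings, the sketch below trains a small "student" model on the soft outputs of a hidden "teacher" network. This is only in the spirit of the distillation work mentioned above, not a reproduction of it: the teacher's weights, the logistic student and the training settings are all hypothetical, and the student is deliberately simpler than the teacher to emphasize that we match behavior, not architecture.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def black_box(x):
    """Stand-in for the unknown network: a fixed two-unit hidden layer (hypothetical weights)."""
    h1 = math.tanh(1.5 * x[0] - 0.5 * x[1])
    h2 = math.tanh(-1.0 * x[0] + 2.0 * x[1])
    return sigmoid(2.0 * h1 + h2 - 0.3)


def cross_entropy(w, b, inputs, targets):
    """Mean cross-entropy between the student's outputs and the recorded soft targets."""
    total = 0.0
    for x, t in zip(inputs, targets):
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(inputs)


def fit_student(inputs, targets, steps=300, lr=0.5):
    """Full-batch gradient descent on a logistic student trained on the soft targets."""
    w, b = [0.0, 0.0], 0.0
    n = len(inputs)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, t in zip(inputs, targets):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - t
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
targets = [black_box(x) for x in inputs]  # the recorded input-output pairs
w, b = fit_student(inputs, targets)
print(cross_entropy([0.0, 0.0], 0.0, inputs, targets), cross_entropy(w, b, inputs, targets))
```

The student ends up with a lower loss than its untrained starting point, but nothing in its two weights and a bias reveals that the teacher had a hidden layer, which is the gap between replicating behavior and inferring structure.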

What if we were to record from all the units in a network, and we are given their approximate coordinates in the frame of reference of a 2-D projection that preserves the structure of the network shown in the diagrams used to represent the network in print? Would the locality information help to infer the input-output behavior? Could we infer the wiring diagram given the input-output behavior?

Computational Complexity

What if we had the complete circuit — the ‘‘wires’’ and ‘‘components’’, but not what the components compute — their activation functions? Could we infer them? What if we could supply our own input and then observe the output as well as the units in the hidden layers? This capability turns out to be very useful and can transform an NP-complete problem [6463] into a polynomial-time problem [342].
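To see why being able to drive the inputs and watch the hidden units helps so much, consider a toy circuit whose wiring is known but whose gates are not. In the sketch below, the three hidden gates, the wiring and the unit names are all made up for illustration. Because every unit is observable, each query reveals one row of every gate's truth table, and a handful of queries pins down all of them.

```python
from itertools import product

# Known wiring: each unit reads two named signals.
WIRING = {"g1": ("a", "b"), "g2": ("b", "c"), "out": ("g1", "g2")}

# Unknown to the experimenter (hypothetical gates, used only inside the device).
_HIDDEN_GATES = {
    "g1": lambda x, y: x & y,
    "g2": lambda x, y: x ^ y,
    "out": lambda x, y: x | y,
}


def run_circuit(a, b, c):
    """Apply one input and return the value of every signal, hidden units included."""
    values = {"a": a, "b": b, "c": c}
    for unit in ("g1", "g2", "out"):
        left, right = WIRING[unit]
        values[unit] = _HIDDEN_GATES[unit](values[left], values[right])
    return values


def infer_gates():
    """Fill in each gate's truth table from observed (inputs, output) pairs."""
    tables = {unit: {} for unit in WIRING}
    for a, b, c in product((0, 1), repeat=3):
        values = run_circuit(a, b, c)
        for unit, (left, right) in WIRING.items():
            tables[unit][(values[left], values[right])] = values[unit]
    return tables


tables = infer_gates()
for unit, table in tables.items():
    print(unit, sorted(table.items()))
```

In this example all four input combinations of the output gate happen to be reachable; in a deeper circuit some combinations may never occur, in which case that part of the gate's behavior is simply unobservable. Without access to the hidden units, the same eight queries would only give the overall input-output table, from which the individual gates cannot in general be read off.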

Though I didn’t say so explicitly, I was assuming there is no information encoded in the order in which inputs are presented to our target computing device, and I was assuming that the inputs are discrete — integer or real valued. In the case of real neural networks however, the inputs and outputs are continuous and there definitely is information encoded in the changing amplitude of the continuously varying signals. In recording from the brain, we sample these signals, hopefully at higher than the Nyquist rate.
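The reason the sampling rate matters can be shown with two sinusoids. In this sketch the frequencies and the sampling rate are arbitrary values chosen for illustration: a 9 Hz signal sampled at 10 Hz, well below twice its frequency, produces exactly the samples of an inverted 1 Hz signal, so the two are indistinguishable after sampling.

```python
import math


def sample(freq_hz, rate_hz, n):
    """Return n samples of sin(2*pi*f*t) taken at the given sampling rate."""
    return [math.sin(2.0 * math.pi * freq_hz * k / rate_hz) for k in range(n)]


def apparent_frequency(freq_hz, rate_hz):
    """Frequency in [0, rate/2] that the samples appear to have after aliasing."""
    return abs(freq_hz - rate_hz * round(freq_hz / rate_hz))


fast = sample(9.0, 10.0, 20)   # undersampled: 10 Hz is below 2 * 9 Hz
slow = sample(1.0, 10.0, 20)   # adequately sampled

# The undersampled 9 Hz signal matches a sign-flipped 1 Hz signal sample for sample.
worst_gap = max(abs(f + s) for f, s in zip(fast, slow))
print(apparent_frequency(9.0, 10.0), apparent_frequency(3.0, 10.0), worst_gap)
```

A 3 Hz component survives sampling at 10 Hz intact, whereas the 9 Hz component masquerades as 1 Hz, which is why any information carried above half the sampling rate is lost or, worse, misattributed.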

In addition, the observations will be noisy and may only serve as a proxy for the signal we really want to record. Specifically, we employ two-photon microscopy and fluorescent proteins called genetically encoded calcium indicators (GECI) to carry out calcium imaging. Variation in the excitation of the GECI fluorophores due to light-scattering, photo-bleaching and quantum effects introduces noise, and calcium is a lagging proxy for the local field potentials that we would prefer to record if it were feasible [94].
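A simple generative sketch conveys what "noisy, lagging proxy" means in practice. It assumes a first-order model in which each spike adds a unit step to the calcium signal, which then decays exponentially, with Gaussian noise added to the observed fluorescence. The time constant, frame interval and noise level are illustrative values, not measurements of any particular indicator.

```python
import math
import random


def simulate_fluorescence(spikes, dt, tau, noise_sd, rng):
    """Turn a binary spike train into a noisy, slowly decaying fluorescence trace."""
    decay = math.exp(-dt / tau)   # fraction of calcium signal left after one frame
    calcium = 0.0
    trace = []
    for s in spikes:
        calcium = calcium * decay + s
        trace.append(calcium + rng.gauss(0.0, noise_sd))
    return trace


rng = random.Random(0)
spikes = [0] * 10 + [1] + [0] * 20           # a single spike at frame 10
trace = simulate_fluorescence(spikes, dt=0.1, tau=1.0, noise_sd=0.05, rng=rng)

for frame in (9, 10, 15, 20):
    print(frame, round(trace[frame], 3))
```

Ten frames after the spike the signal is still more than a third of its peak, so spikes that arrive close together pile up in the trace and have to be separated by some form of deconvolution.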

Learning from Simulation

There are a lot of things that can go wrong in recording neural signals from biological computing devices. The recording instruments are extremely sensitive, the target tissue can be damaged in the process of recording, and, as already pointed out, there are sources of noise and device limitations that can interfere with the fidelity of the recorded signal, and, finally and somewhat ominously, we are not entirely sure we are recording the ‘‘right’’ signals required to infer function.

Instead of starting with recordings of biological systems, we begin by using simulations to generate data that, to the best of our knowledge, accurately represents the behavior of our target systems. If we can’t learn the function of systems that we understand well enough to simulate, then we have little or no chance of learning the function of systems about which we know relatively little — or, to be a little more optimistic, about which we think we know quite a lot, but aren’t entirely sure and could be deluding ourselves.

We know — or think we know — a lot about how individual neurons behave in terms of their input-output behavior. The foundational work of Hodgkin and Huxley [86] provided us with a mathematical model of how neurons process and propagate information. The basic model consists of an electrical-circuit analog and a system of ordinary differential equations that can be used to predict local field potentials. Multi-compartmental variants allow more precision in modeling neurons with extensive axonal and dendritic arborization.
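For readers who have not seen the equations in action, here is a minimal single-compartment sketch of the Hodgkin-Huxley model integrated with forward Euler. The parameter values are the commonly quoted squid-axon constants in the modern sign convention (rest near -65 mV), which I am assuming rather than taking from [86] directly. The sketch tracks the membrane potential of one isolated patch of membrane and ignores morphology, synapses and extracellular fields entirely.

```python
import math

# Commonly quoted squid-axon parameters (assumed; modern sign convention).
C_M = 1.0                               # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3       # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387   # reversal potentials, mV


def _ratio(x, y):
    """x / (1 - exp(-x / y)), using the limiting value y when x is close to 0."""
    return y if abs(x / y) < 1e-6 else x / (1.0 - math.exp(-x / y))


def rates(v):
    """Opening (alpha) and closing (beta) rates in 1/ms for the gates m, h and n."""
    return {
        "m": (0.1 * _ratio(v + 40.0, 10.0), 4.0 * math.exp(-(v + 65.0) / 18.0)),
        "h": (0.07 * math.exp(-(v + 65.0) / 20.0),
              1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))),
        "n": (0.01 * _ratio(v + 55.0, 10.0), 0.125 * math.exp(-(v + 65.0) / 80.0)),
    }


def simulate(current, duration_ms=50.0, dt=0.01):
    """Forward-Euler integration; returns the membrane potential trace in mV."""
    v = -65.0
    gates = {k: a / (a + b) for k, (a, b) in rates(v).items()}  # start at steady state
    trace = []
    for _ in range(int(duration_ms / dt)):
        m, h, n = gates["m"], gates["h"], gates["n"]
        i_ion = (G_NA * m ** 3 * h * (v - E_NA)
                 + G_K * n ** 4 * (v - E_K)
                 + G_L * (v - E_L))
        for k, (a, b) in rates(v).items():
            gates[k] += dt * (a * (1.0 - gates[k]) - b * gates[k])
        v += dt * (current - i_ion) / C_M
        trace.append(v)
    return trace


def count_spikes(trace, threshold=0.0):
    """Count upward crossings of the threshold (in mV)."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a < threshold <= b)


print(count_spikes(simulate(10.0)), count_spikes(simulate(0.0)))
```

With a constant injected current of 10 uA/cm^2 the model fires repetitively, while with no input it sits at rest. Multi-compartmental models chain many such patches together with axial resistances, which is where the cost of simulating realistic morphologies comes from.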

There are limitations to such models [1] as well as complicating intra- and extra-cellular factors that have to be accounted for in order to explain certain aspects of neural computations. These include genetic regulation, ephaptic coupling, and a host of other factors that are required to fully account for even the simplest neural computations.

We are fortunate to work with the accomplished neural modelers [217, 146, 109] at AIBS to help us navigate in this complex space, and provide us with state-of-the-art simulations of neural circuits. We will ‘‘instrument’’ these circuits in order to generate simulated data to test our learning algorithms prior to trying them out on the real thing. We can also use these models along with connectomic ground-truth data to produce connectomes in the form of adjacency matrices / affinity graphs that we can use to test our assumptions about the errors produced by existing automated-connectomics algorithms.

Learning Neural Function

Neuroscience covers a great many disciplines that have something to say about the brain. At the micro-scale, we have theories that operate at the cellular and molecular levels such as the Hodgkin-Huxley model. At the macro-scale, cognitive neuroscience has given us a wealth of theories and experimental findings from psychology, psychophysics and cognitive science.

There is a substantial gulf between the micro and macro scale. It’s as though we can talk at the low-level about digital circuits, machine registers and assembly code and at the high-level about the behavior of applications like Photoshop and Microsoft Office, but there isn’t even a suitable ontology in which to frame theories about the critically-important middle-ware that adds functionality on top of the operating system in order to enable the development of complex applications software.

An increasing number of computational neuroscientists are coming to the conclusion that the gap between the micro- and macro-scale is too wide to bridge and that we need some sort of meso-scale modeling language to connect the two [140, 28, 27]. What would such a language look like? In particular what would be the canonical circuits and computational primitives at such a level? I don’t know the answer for general cognition, but in the case of vision we may have a good start.

Researchers in computer vision have a long and close relationship with scientists studying biological vision [38]. Many of the basic operators and algorithms that comprise computer vision libraries such as OpenCV have taken their inspiration from neuroscience. See elsewhere in my notes and in the recent paper I coauthored with David Cox [38] for a list of computational components inspired by neuroscience and speculation about related technologies that might serve to explain.

Learning Neural Structure

Peter Li, who is attending a workshop and hackathon at the Janelia Farm Campus of HHMI, mentioned that the Drosophila connectome team is nearing completion on the seven-column29 dataset and that the full connectome will soon be available along with the EM data from which it was generated. Peter and Jon Shlens, who did their Ph.D. work on primate retinas with E.J. Chichilnisky at the Salk Institute before joining Google, have graciously offered to serve as consultants on CS379C projects.

Researchers have successfully constructed the full connectome of one animal: the roundworm C. elegans [220]. Partial connectomes of a mouse retina [25] and mouse primary visual cortex [21] have also been successfully constructed. Bock et al’s [21] complete 12TB data set is publicly available at Open Connectome Project.

I’ve continued to investigate if, how, and where ambiguity and multiple consistent hypotheses are handled in cortex. My conversations with Steven Zucker have focused on the task of finding contours from a combination of low-level, bottom up and high-level, top-down information sources. As Steven points out, this task is closely related to the Gestalt notion of closure and is located (conceptually) in the rather large space of human competencies that current academic information-processing challenges overlook and squarely in the critical path of our requirements for tracing neural processes as part of connectomic analysis and addressing the so-called ‘‘binding’’ problem in our paraphrase, translation and document summarization work. It’s worth emphasizing this last statement: finding a general solution to this problem would go some way toward solving several key problems of interest to the intelligence community — the ‘‘i’’ in ‘‘iARPA’’, namely, tracing unpaved roads in satellite data and improving our ability to automatically recognize, translate and summarize all manner of natural language input — including, alas, our cell phone conversations.

Steven has a relatively recent paper [237] on the topic that does a good job of describing the problem. He describes the underlying computation in terms of the field equations for a reaction-diffusion process30 and suggests that the corresponding information processing may involve the glia surrounding an ensemble of spiking neurons. That’s one possible hypothesis, but, given the slim evidence we have to go on at this juncture, one might also conjecture networks of (inhibitory) GABAergic interneurons primed by (excitatory) pyramidal neurons as the substrate for the underlying computations and some sort of mode-seeking, distributed-mean-shift as a more appropriate algorithmic basis [74, 55, 141, 218]. In any case, if striate cortex performs this sort of computation, it would be an important discovery from both a scientific and practical standpoint — the latter since we don’t have good solutions for solving these problems in our artificial neural networks. Here’s the abstract from Steven’s paper:

Border ownership is an intermediate-level visual task: it must integrate (upward flowing) image information about edges with (downward flowing) shape information. This highlights the familiar local-to-global aspect of border formation (linking of edge elements to form contours) with the much less studied global-to-local aspect (which edge elements form part of the same shape). To address this task we show how to incorporate certain high-level notions of distance and geometric arrangement into a form that can influence image-based edge information. The center of the argument is a reaction-diffusion equation that reveals how (global) aspects of the distance map (that is, shape) can be ‘‘read out’’ locally, suggesting a solution to the border ownership problem. Since the reaction-diffusion equation defines a field, a possible information processing role for the local field potential can be defined. We argue that such fields also underlie the Gestalt notion of closure, especially when it is refined using modern experimental techniques.

Steven also pointed me to a 2014 paper of his on the closely-related problem of resolving uncertainty in low-level edge features using curvature constraints as a proxy for high-order statistics. The paper makes interesting (evocative) connections to multiple computational frameworks from Boltzmann machines and statistical mechanics to Markov and Conditional random fields all of which have been applied to this problem by computer-vision researchers. It’s an interesting ‘‘thought’’ piece, but less problem-focused and hypothesis-driven than the earlier work. Here’s a redacted version of the abstract from [238]:

Vision problems are inherently ambiguous: Do abrupt brightness changes correspond to object boundaries? Are smooth intensity changes due to shading or material properties? For stereo: Which point in the left image corresponds to which point in the right one? What is the role of color in visual information processing? To answer these (seemingly different) questions we develop an analogy between the role of orientation in organizing visual cortex and tangents in differential geometry. Machine learning experiments suggest using geometry as a surrogate for high-order statistical interactions. The cortical columnar architecture becomes a bundle structure in geometry. Connection forms within these bundles suggest answers to the above questions, and curvatures emerge in key roles. More generally, our path through these questions suggests an overall strategy for solving the inverse problems of vision: decompose the global problems into networks of smaller ones and then seek constraints from these coupled problems to reduce ambiguity. Neural computations thus amount to satisfying constraints rather than seeking uniform approximations. Even when no global formulation exists one may be able to find localized structures on which ambiguity is minimal; these can then anchor an overall approximation.

February 12, 2015

In an earlier post, I mentioned that according to the MICrONS BAA we are supposed to come up with a novel, biologically plausible machine learning algorithm that (presumably) addresses a machine perception task — auditory, olfactory, visual, etc. I realized last night that several of us at Google are working on a problem that has the following characteristics: (i) it addresses a felt need in computer vision, (ii) it was identified early on by computational neuroscientists, including Shimon Ullman, Elie Bienenstock, David Mumford and Stuart Geman, (iii) we have some ideas for solving the problem using deep neural networks, and (iv) it may have a biological solution that appears early in the visual pathway31. My notes from last night:

Here is a paper by Geoff Hinton and two of his students [84] that has gained some interest in the neural network community as it identifies a shortcoming of convolutional architectures and indeed most — all that I know of — neural network architectures that attempt object recognition. The same shortcoming also applies to many of the most popular non-NN approaches including spatial pyramid matching [66, 228]. I include in this sweeping generalization alternating-simple-complex-cell hierarchies such as HMAX.

The problem, as Geoff points out, is that the otherwise desirable property of learning invariant features, achieved primarily through the use of multiple stages of pooling, results in increasingly coarse filters that, while appropriately succeeding on some ambiguous cases thereby increasing recall, fail on others that would not be ambiguous were it not for the pooling, thereby decreasing precision. Stu Geman [59] refers to this as the selectivity-invariance tradeoff in analogy to the bias-variance tradeoff in statistics.

Simplifying somewhat, Geoff’s strategy in resolving the dilemma is to use pooling to increase recall but include geometry — for example, the coordinates of the maximum value as identified in the frame of reference of the associated receptive field — in a separate channel so as to filter out false positives further along in the processing. This might be important, say, in the case of resolving detail in a low resolution image where context is critical in recognizing objects.

For example, a pink blob in the corner of an image is initially unrecognizable, but, as it becomes increasingly clear that we’re looking at a scene of a barnyard, it is much more likely that the pink blob is a pig and the additional information in the side channel can now be used to check on other constraints that might confirm or disconfirm this hypothesis, e.g., the blob is close enough to the ground for the ersatz pig to be standing on solid earth and an elliptical bounding box is about the right size for a pig as seen from the inferred distance to the barn.
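To make the "pooling plus a geometry side channel" idea concrete, here is a minimal NumPy sketch. It performs ordinary non-overlapping max pooling and, alongside each pooled value, records where in its window the maximum occurred. This is only an illustration of the idea as paraphrased above, not the architecture of [84]; the function name `max_pool_with_coords` and the toy feature map are assumptions of this example.

```python
import numpy as np


def max_pool_with_coords(feature_map, size=2):
    """Non-overlapping max pooling that also returns argmax coordinates.

    Returns (pooled, coords), where pooled has shape (H/size, W/size) and
    coords[i, j] is the (row, col) of the maximum inside window (i, j),
    expressed in that window's own frame of reference.
    """
    h, w = feature_map.shape
    assert h % size == 0 and w % size == 0, "map must tile evenly"
    out_h, out_w = h // size, w // size
    # blocks[i, j] is the flattened size x size window at pooled cell (i, j).
    blocks = (feature_map.reshape(out_h, size, out_w, size)
              .transpose(0, 2, 1, 3)
              .reshape(out_h, out_w, size * size))
    pooled = blocks.max(axis=2)                 # the invariant channel
    rows, cols = np.divmod(blocks.argmax(axis=2), size)
    coords = np.stack([rows, cols], axis=-1)    # the geometry side channel
    return pooled, coords


fm = np.array([[1, 2, 0, 0],
               [3, 4, 0, 9],
               [0, 0, 5, 0],
               [7, 0, 0, 0]])
pooled, coords = max_pool_with_coords(fm)
```

A later stage could use `pooled` to propose hypotheses and `coords` to test geometric constraints such as relative position, which is the role the side channel plays in the pig example.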

The first mention of the problem in the neuroscience and neural-network literature that I am aware of appeared — along with a proposed solution — in the work of Shimon Ullman [212, 213]. It has also surfaced in discussions of compositional features and the importance of compositionality in explaining visual processing in the ventral stream — see Bienenstock, Geman and Potter [17, 18] for some early, influential work and Hanuschkin et al [76] for a more recent example looking at sequential compositionality in feed-forward networks as a model of complex song generation in the Bengalese Finch.

January 19, 2015

In chasing citations forward and backward starting from the Fernández et al paper [52] I turned up a bunch of older work on hierarchical RNN models including the oft-cited but mainly-of-historical-interest NIPS paper by El Hihi and Bengio [49]. I didn’t find any useful techniques that haven’t been incorporated into newer models or discarded for good reason. Regarding work that followed Fernández et al, I recommend Koutník et al [114] which appeared in last year’s ICML. This work, which came out of Schmidhuber’s lab, is memorable for its coinage of the name ‘‘Clockwork RNN’’ and bears all the marks of Jürgen’s encyclopedic knowledge concerning the history of neural networks. Koutník et al compare their CW-RNN models with Fernández et al’s CTC-trained DBLSTM models.

If nothing else, I recommend you skim through the related-work section for an excellent if brief review of past work on hierarchical neural network models. The ideas mentioned include using delayed forward connections, hidden units operating at different time scales, recurrent connections with time lags, leaky-integrate-and-fire neurons that introduce variable-duration hysteresis, equivalently weighted self connections that decay exponentially in time, and Schmidhuber’s own sequence-chunking and neural-history-compression work. The authors also allude to the repertoire of models from control theory including variable-time-scale and hierarchical variants of HMMs, Kalman filters, and continuous-time stochastic processes, all of which is interesting but largely irrelevant to our use cases.

Lest I mislead you, Clockwork RNN (CW-RNN) models are not hierarchical except in the sense that they facilitate reasoning about processes at multiple time scales. If that’s your working definition of abstraction, then you might characterize them as hierarchical. In a CW-RNN, the hidden layer is ‘‘partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.’’ The authors claim that ‘‘[r]ather than making the standard RNN [SRN] models more complex, CW-RNN reduces the number of SRN parameters, improves the performance significantly in the tasks tested [TIMIT spoken-word classification and handwriting recognition], and speeds up the network evaluation.’’ The modules can be ordered — and even stacked — to emphasize the implicit temporal hierarchy:

Clockwork Recurrent Neural Networks [CW-RNN], like SRNs, consist of input, hidden and output layers. There are forward connections from the input to hidden layer, and from the hidden to output layer, but, unlike the SRN, the neurons in the hidden layer are partitioned into $g$ modules of size $k$. Each of the modules is assigned a clock period $T_n \in \{T_1, \ldots, T_g\}$. Each module is internally fully interconnected, but the recurrent connections from module $j$ to module $i$ exists only if the period $T_i$ is smaller than period $T_j$. Sorting the modules by increasing period, the connections between modules propagate the hidden state right-to-left, from slower modules to faster modules.
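The description above can be sketched in a few lines of NumPy. The sketch below is a reading of the quoted paragraph rather than the authors' code: a block mask keeps recurrent connections only from slower (or equal-period) modules to faster ones, and at time step t a module is updated only when t is a multiple of its period. The tanh nonlinearity, the absence of biases, the periods `[1, 2, 4]` and all function names are assumptions made for illustration.

```python
import numpy as np


def cw_rnn_mask(periods, k):
    """Block mask: module i receives from module j only if T_j >= T_i."""
    g = len(periods)
    mask = np.zeros((g * k, g * k))
    for i, t_i in enumerate(periods):
        for j, t_j in enumerate(periods):
            if t_j >= t_i:
                mask[i * k:(i + 1) * k, j * k:(j + 1) * k] = 1.0
    return mask


def cw_rnn_step(t, x, h_prev, W_in, W_rec, periods, k):
    """One hidden-state update; inactive modules keep their previous state."""
    mask = cw_rnn_mask(periods, k)
    candidate = np.tanh(W_in @ x + (W_rec * mask) @ h_prev)
    h = h_prev.copy()
    for i, period in enumerate(periods):
        if t % period == 0:  # module i ticks only at multiples of its period
            h[i * k:(i + 1) * k] = candidate[i * k:(i + 1) * k]
    return h


rng = np.random.default_rng(0)
periods, k, n_in = [1, 2, 4], 2, 3
n_hid = len(periods) * k
W_in = rng.standard_normal((n_hid, n_in))
W_rec = rng.standard_normal((n_hid, n_hid))
x = rng.standard_normal(n_in)
h0 = np.zeros(n_hid)
h1 = cw_rnn_step(1, x, h0, W_in, W_rec, periods, k)  # only period 1 fires
h2 = cw_rnn_step(2, x, h1, W_in, W_rec, periods, k)  # periods 1 and 2 fire
```

A real implementation would compute only the active rows of the update, which is where the reported speed-up in network evaluation would come from; this version computes everything and discards the inactive part for clarity.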

CW-RNN models are primarily useful for handling long-term dependencies, and it is in this regard that they extend LSTM constant-error carousels to efficiently handle even longer-term dependencies. The basic idea can be applied to complement the Fernández et al CTC-trained DBLSTM models. Since the sub-sequence abstractions that characterize Shalini’s hierarchical document modeling task are variable in span, the latter is most likely to prove useful for our purposes, but given the potential value of keeping track of relevant context over the span of several paragraphs or chapters, it may prove useful to substitute CW-RNN hidden layers for the vanilla LSTM hidden layers in Fernández et al.

In general, it seems unlikely we will have supervision in the form of target sequences at intermediate levels in the hierarchy. However, given sufficient data, Fernández et al suggest it may be possible to discover enough structure to automatically identify structural boundaries:

Finally, the total error signal $\Delta_i$ received by network $g_i$ is the sum of the contributions in equations (12) and (13). In general, the two contributions can be weighted depending on the problem at hand. This is important if, for example, the target sequences at some intermediate levels are uncertain or not known at all.
$$\Delta_i = \lambda_i \Delta_i^{\text{target}} + \Delta_i^{\text{backprop}}$$
with $0 \le \lambda_i \le 1$. In the extreme case where $\lambda_i = 0$, no target sequence is provided for the network $g_i$, which is free to make any decision that minimizes the error $\Delta_i^{\text{backprop}}$ received from levels higher up in the hierarchy. Because training is done globally, the network can, potentially, discover structure in the data at intermediate levels that results in accurate predictions at higher levels of the hierarchy.

In the case of the hierarchical document model, we have supervision at the word, sentence, and paragraph level, though not nearly as precise or informative as the phoneme-level supervision provided in the case of TIDIGITS — the speaker-independent connected-digit speech recognition problem and dataset. The hierarchy starting with MFCCs on the lowest rung, followed by phonemes and culminating in words, is of a very different sort than one starting with words and culminating in paragraph or chapter boundary markers32.

If we have supervision for intermediate layers, then we can use Graves’ DBLSTM plus CTC model [69, 52] along with the feedforward topology described in [68] and a modification to the output layers for the LSTM hidden layers. In the next installment, we consider the case in which we don’t have sources of intermediate supervision.

January 17, 2015

Here’s a paper by Fernández, Graves and Schmidhuber [52] that describes a hierarchical model similar to the sort of thing we’re after. Specifically, their model supports multiple levels of abstraction and provides an elegant solution to the problem of handling abstractions that span arbitrary-length sub-sequences of the input. This is accomplished by employing an extended label set L′ = L ∪ {BLANK}, and introducing the notion of a path corresponding to a sequence of labels from L′. A few weeks ago Javier and I traded email about Graves’ use of the BLANK label in his thesis, but this IJCAI paper focused my attention and gave me a greater appreciation for how Graves uses it.

The same label can appear multiple times consecutively in the output sequence; however, the interpretation of repeated-label sequences is necessarily a bit more complicated when it comes to defining a probability measure on output sequences. Each layer of the network is implemented as a CTC (Connectionist Temporal Classification) layer — introduced by Graves at ICML the year before [69] and refined in his thesis that came out the following year [67] — and each CTC layer has its own softmax layer which ‘‘forces the hierarchical system to make decisions at every level and use them in the upper levels to achieve a higher degree of abstraction. The analysis of the structure of the data is facilitated by having output probabilities at every level.’’
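The role of the BLANK label and of repeated labels is easiest to see in the standard CTC path-to-labeling map: merge consecutive repeats, then delete blanks. The sketch below implements that map and, for tiny inputs only, computes the probability of a labeling by brute-force summation over every path that collapses to it. Actual CTC training uses a forward-backward dynamic program instead of enumeration; the names `collapse` and `labeling_probability`, the `"_"` blank symbol and the toy distributions are assumptions of this example.

```python
from itertools import groupby, product

BLANK = "_"  # stand-in symbol for the extra BLANK label


def collapse(path):
    """Map a path over L' = L + {BLANK} to a labeling over L.

    Consecutive repeats are merged first, then blanks are removed, so a
    blank between two identical labels keeps them distinct.
    """
    merged = [label for label, _ in groupby(path)]
    return tuple(label for label in merged if label != BLANK)


def labeling_probability(probs, labeling):
    """Sum the probabilities of all paths that collapse to `labeling`.

    probs is a list with one dict per time step mapping each symbol in L'
    to its softmax probability. Exponential in len(probs): toy use only.
    """
    symbols = list(probs[0].keys())
    target = tuple(labeling)
    total = 0.0
    for path in product(symbols, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= probs[t][symbol]
            total += p
    return total


uniform = [{"a": 0.5, BLANK: 0.5}, {"a": 0.5, BLANK: 0.5}]
p_a = labeling_probability(uniform, ("a",))        # paths aa, a_, _a
p_empty = labeling_probability(uniform, ())        # path __
p_aa = labeling_probability(uniform, ("a", "a"))   # needs at least a_a
```

With two time steps the labeling ("a", "a") is unreachable, since emitting the same label twice requires a blank between the two occurrences.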

They compare their approach with hierarchical HMMs — using the HMM Tool Kit from Steve Young’s group at Cambridge — on a spoken-digits task. These experiments involve relatively small — ~1/4 million parameters — models applied to relatively simple problems, but presumably this is the same architecture that Graves used to obtain the state-of-the-art speech-recognition results reported in his 2013 paper with Navdeep Jaitly [70].

Shalini plans on taking advantage of the implicit abstraction-boundary markers available in documents, e.g., end-of-sentence, end-of-paragraph and end-of-chapter markers, in developing and testing her ideas for hierarchical LSTM models, and this is definitely the right way to move forward as it allows for greater supervision in training intermediate layers of the network and provides ground truth to test unsupervised approaches.

Javier has proposed that we might be able to use MCR to facilitate unsupervised learning to segment: ‘‘We do have labels for the word level but the problem is that we do not have labels for the higher layers sentences, paragraphs, etc. One option is to create these labels (actually vectors) using MCR, so we can train the upper layers using sequences of MCR as target, and let the network to learn the segmentation.’’ The three of us — and anyone else who wants to participate — should set aside some time to talk about these ideas in detail next week.

Fernández et al mention but don’t compare head-to-head with Sanjay Kumar and Martial Hebert’s work on hierarchical CRF models [116]. It would be interesting to get Sanjay’s take on state-of-the-art CRF technology for this application, though it’s my impression from talking with Kevin Murphy that work on CRF models has taken a backseat to deep neural networks, largely due to the complexity of training such models. It is also probably worth checking with Andrew McCallum and Geoff Hinton, both of whom have experimented extensively with hierarchical CRF models for, respectively, language processing [204] and computer vision [161]. I’ll have to go back and read the Graves and Jaitly paper [70] describing their state-of-the-art, end-to-end ASR system implemented as a deep bidirectional long-short-term-memory recurrent neural network trained with the connectionist temporal classification objective function.

January 15, 2015

The success of the Socher et al [191] matrix-vector models on parsing and sentiment analysis helped lead to a resurgence of interest in tensor-product models though at first they didn’t characterize their work as employing tensors and actually contrasted their work with prior work involving tensors. They did realize that they were using compositional vector spaces to represent the variability in word meanings.

Socher et al suggest that every word might be represented by an n-dimensional vector plus an n × n matrix, but noted that the dimensionality of the model becomes too large with the sizes of the vectors — n = 100 or larger — commonly used in practice, and so ‘‘[i]n order to reduce the number of parameters, [they] represent word matrices by the following low-rank plus diagonal approximation: $A = UV + \mathrm{diag}(a)$, where $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{r \times n}$, and $a \in \mathbb{R}^{n}$.’’
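A quick way to see why the low-rank plus diagonal form helps is to count parameters. The sketch below builds A = UV + diag(a) with NumPy and compares n × n against 2nr + n. The rank r = 3 is an arbitrary value picked for the illustration, and the function names are hypothetical.

```python
import numpy as np


def low_rank_plus_diag(U, V, a):
    """Assemble the n x n word matrix A = U V + diag(a)."""
    return U @ V + np.diag(a)


def param_counts(n, r):
    """(parameters of a full n x n matrix, parameters of U, V and a)."""
    return n * n, 2 * n * r + n


n, r = 100, 3  # r is an illustrative choice, not a value from the paper
rng = np.random.default_rng(0)
U = rng.standard_normal((n, r))
V = rng.standard_normal((r, n))
a = rng.standard_normal(n)
A = low_rank_plus_diag(U, V, a)
full, compact = param_counts(n, r)  # 10000 versus 700 per word
```

With these numbers each word costs 700 matrix parameters instead of 10,000, which is the saving the quoted passage is after.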

In Socher et al [193] they back off allocating a vector to every word and start using tensor models to represent the complex relationships between words that shade their composite meanings. Geoff Hinton has a long history of incorporating non-linear, multiplicative components into networks to provide greater flexibility and representational power. Geoff and his students were using bilinear33 and tensor34 long before the rest of the NIPS crowd caught on. NTN (Neural Tensor Network) models [190] are particularly good at reasoning about the multiplicative interactions between relationships in natural-language processing [149, 193] and natural-logic inference [23]. This log entry focuses on work by Srikumar and Manning [195] using an NTN model to learn distributed representations for predicting structured output in the form of sequences, segmentations, labeled trees and arbitrary graphs.

Let $x \in X$ be an input corresponding to a sentence or document and $y \in Y$ an output structure representing $x$. $\Phi(x, y) \in \mathbb{R}^{n}$ is a feature function that captures the relationship between $x$ and $y$ as an n-dimensional vector, and $w \in \mathbb{R}^{n}$ is a linear scoring model, so that $\arg\max_{y} w^{T} \Phi(x, y)$ defines the combinatorial optimization problem of finding the best structure $y$ representing $x$. Note that this problem is at the heart of several parsing, alignment and segmentation problems.

Here is Figure 1 from Srikumar and Manning [195] illustrating the three running examples used in the paper — which, by the way, are enormously helpful in understanding the paper:

The output is shown as one or more parts each of which consists of a sequence of labels, e.g., $y_p = (y_0, y_1)$ is a part with two labels. Let $L$ be the set of all $M$ labels $\{l_1, \ldots, l_M\}$, e.g., a set of part-of-speech tags. We denote the set of parts in the structure for input $x$ by $\Gamma_x$ and each part $p \in \Gamma_x$ is associated with a list of discrete labels, denoted by $y_p = (y_p^0, y_p^1, \ldots)$. To represent the various relationships among parts — e.g., emissions in the case of sequences or compositions in the case of nodes in a parse tree — the authors employ a set of d-dimensional unit vectors $a_l$, one for each label, which together constitute the columns of a $d \times M$ matrix $A$.

Since we are assuming that the labels are related to one another through multiplicative interactions, we model those interactions with a tensor whose elements essentially enumerate every possible combination of elements from input $\phi$ and label $\{a_i\}$ vectors — see here to understand how this multiplicative mixing is accomplished as a tensor product. $\Psi(x, y_p, A)$ is the recursively-defined feature tensor function that produces the feature representation vector $\Phi_A(x, y)$, not to be confused with the input feature vector $\phi(x)$.

The definition of the feature tensor function is straightforward but notationally cumbersome, so check out the paper for the details. In lieu of the full details, the authors’ account of one of their running examples should give you a pretty good idea of how it works. The following graphic expands the tensor for the case of a compositional part with two labels — the middle example in the above figure. The vec(.) operator vectorizes a matrix or tensor by concatenating, respectively, the column vectors of a matrix or (recursively) the two-dimensional slices of a tensor:

Note that tensor products are associative and distributive but non-commutative. We can expand $A \otimes \phi_p(x)$ as a sequence of vector-matrix tensor products as in $A \otimes \phi_p(x) = a_{l_1} \otimes a_{l_2} \otimes a_{l_3} \otimes \phi_p(x)$. The size of the feature representation vector is exponential in the number of labels M. In the example involving part-of-speech tagging there were on the order of 50 labels — 45 English and 64 Basque. The paper doesn’t directly indicate that this vector is sparse but practically speaking it would have to be. I’ve asked the first author to resolve the ambiguity in defining Φ the feature tensor function35.
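The algebraic properties mentioned above are easy to check numerically. In the sketch below the vectorized tensor product of label vectors with an input feature vector is computed with `np.kron`, which for 1-D inputs returns the flattened outer product. The dimensions d = 3 and n = 4 and the specific vectors are arbitrary, and the element ordering produced by `np.kron` may differ from the vec(.) convention used in the paper; this is an illustration of the operation, not a reimplementation of the authors' feature function.

```python
import numpy as np

d, n = 3, 4
a1 = np.array([1.0, 0.0, 0.0])        # unit vector for a first label
a2 = np.array([0.0, 1.0, 0.0])        # unit vector for a second label
phi = np.array([0.5, -1.0, 2.0, 0.0])  # input features for one part

# Vectorized tensor product of two label vectors with the part's features.
feat = np.kron(np.kron(a1, a2), phi)

# Associative: the grouping of the products does not matter.
assoc_ok = np.allclose(feat, np.kron(a1, np.kron(a2, phi)))

# Not commutative: swapping the label vectors moves the nonzero entries.
commutes = np.allclose(np.kron(a1, a2), np.kron(a2, a1))

# A part with k labels yields d**k * n entries.
sizes = {k: d ** k * n for k in (1, 2, 3)}
```

Even with these toy dimensions the vector for a three-label part has 108 entries, so with realistic label-vector and feature dimensions a sparse or implicit representation looks unavoidable in practice.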

Training is supervised with labeled data for parsing and PoS tagging from Penn Tree Bank or using the Stanford parser. The trained model can be used to score word sequences / sentence fragments with respect to the target class of structures or to provide a representation for classification or related interpretation tasks. The PoS experiments were interesting if not conclusive. I’ve asked Vivek, the first author, if there have been any follow-on experiments with other problems such as paraphrasing.

Training is complicated due to the non-convex nature of the loss function. An alternating optimization is presented in which the alternation is between minimizing f(w, A) with respect to w while holding A fixed and minimizing f(w, A) with respect to A while holding w fixed, where the two restricted minimizations are convex. This appears to converge and provide good results and the method sounds entirely reasonable given my prior experience.
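The alternating scheme is a general pattern for objectives that are convex in each block of variables separately but not jointly. As a stand-in for the paper's loss, which is not reproduced here, the sketch below applies the same pattern to a much simpler biconvex problem: fitting a rank-one matrix w aᵀ to data Y in the least-squares sense. Each half-step has a closed-form solution, so the loss can never increase. The objective, the function names and the toy data are all assumptions of this example.

```python
import numpy as np


def loss(Y, w, a):
    """Squared Frobenius error of the rank-one fit w a^T."""
    return float(np.sum((Y - np.outer(w, a)) ** 2))


def alternating_minimization(Y, iters=20, seed=0):
    """Alternate exact minimization over w (a fixed) and a (w fixed)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Y.shape[0])
    a = rng.standard_normal(Y.shape[1])
    history = [loss(Y, w, a)]
    for _ in range(iters):
        w = Y @ a / (a @ a)      # convex least-squares problem in w
        history.append(loss(Y, w, a))
        a = Y.T @ w / (w @ w)    # convex least-squares problem in a
        history.append(loss(Y, w, a))
    return w, a, history


Y = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])  # exactly rank one
w_hat, a_hat, history = alternating_minimization(Y)
```

Monotone decrease of the objective is guaranteed by construction; convergence to a global optimum is not guaranteed in general, which matches the cautious "appears to converge" in the paragraph above.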

Residual thoughts: Remember to write a brief summary of the paraphrase [14] and document-summarization [15] work being done by Jonathan Berant, Percy Liang and Vivek Srikumar in Chris Manning’s lab.

January 11, 2015

In MacCartney [129] and Bowman et al [23], the objective is to classify the semantic relationships between pairs of sentences or sentence fragments corresponding to the textual analogs of well-formed natural-logic formulae36. This objective is realized in Bowman’s RN[T]N models as an output layer implementing a softmax over the set of semantic relations {→, ←, ↔, ..., ≠}37. For example, given the two natural-logic sentences, ‘‘all reptiles walk’’ and ‘‘some turtles move’’, the softmax layer might yield P(→) = 0.9. Note that natural-logic textual analogs of logical formulae pack a lot into few words. For example, ‘‘all reptiles can walk’’ might be represented as ∀ x, isa(x,reptile) ⊃ walk(x) in first-order logic.

Of course you can unpack a natural-logic sentence as in ‘‘men are mortal’’ ∧ ‘‘Socrates is a man’’ ⊃ ‘‘Socrates is mortal’’, but it’s not common in everyday speech to be so pedantic. However, in the case of applying neural networks to infer semantic relations, the more concise — packed — version of the problem comparing ‘‘all men are mortal’’ and ‘‘Socrates is mortal’’ is somewhat easier to handle as it simplifies the monolingual-alignment problem — the neural network has to align ‘‘all men’’ with ‘‘Socrates’’ and realize that the latter is a subset of the former and thus downward-monotone [129]. We can’t expect inference problems to always be so conveniently structured, and more complicated nested formulae such as we find in the SICK (Sentences Involving Compositional Knowledge) corpus are more likely to represent the norm.

Not only will we want to determine if a statement follows from statements expressed earlier in a conversation — a relatively simple directed task, but it will also be important to draw conclusions — particularly commonsense ones — that follow from what was said — a potentially open-ended task that we humans do all the time without breaking a sweat. Traditionally, declarative knowledge in the form of simple rules, e.g., Horn clauses, is used to derive new knowledge from old by either (a) working forward from antecedents — that we know to be true — to consequents using modus ponens or (b) working backward from consequents — that we hypothesize to be true — to antecedents which if true support the consequent using modus tollens.
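For readers who have not seen the two strategies in code, here is a deliberately tiny sketch of forward and backward chaining over ground (variable-free) Horn clauses, where each rule is a pair of a body (a tuple of atoms) and a head. Restricting to ground atoms sidesteps unification and the harder questions that arise with quantified rules; the representation and the function names are assumptions of this example.

```python
def forward_chain(facts, rules):
    """Repeatedly fire rules whose bodies are satisfied (modus ponens)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return known


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Work back from a hypothesized goal to facts that support it."""
    if goal in facts:
        return True
    if goal in seen:  # already trying to prove this goal: avoid cycles
        return False
    for body, head in rules:
        if head == goal and all(
            backward_chain(atom, facts, rules, seen | {goal}) for atom in body
        ):
            return True
    return False


facts = {"man(socrates)"}
rules = [
    (("man(socrates)",), "mortal(socrates)"),
    (("mortal(socrates)",), "will_die(socrates)"),
]
derived = forward_chain(facts, rules)
proved = backward_chain("will_die(socrates)", facts, rules)
unproved = backward_chain("immortal(socrates)", facts, rules)
```

Forward chaining derives everything reachable whether or not it is needed, while backward chaining explores only what bears on the goal.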

Neither of these derivation / proof strategies is guaranteed to derive everything that follows from a set of facts and a set of rules in polynomial time. Hence, whatever sort of theorem prover, production system or logic-programming language we employ, we will have to be satisfied with a heuristic solution, albeit one guided by experience. In keeping with our interest in seeing how far one can go without resorting to GOFAI, consider how we might employ textual surrogates for quantified formulae — rules — and implement their instantiation and application in terms of vectors:

If you compress the meaning of words, phrases and sentence fragments into vectors in an embedding space — their meaning implicit in their proximity and direction to other vectors, how do you get meaning, words and nuance back out when you need them to express yourself or understand someone else? There is an important difference between a language model (LM) that we would learn using, say, SKIP-GRAM or CBOW, and the output layer of the encoder LSTM in the Sutskever et al [203] or Cho et al [31] approaches: the LM allows us to index the vectors for millions of words and short phrases via a one-hot vector and the LSTM encoder allows us to generate a single vector representing a sequence of words of variable length where that representation is optimized by the learning algorithm to facilitate computing the probability of the next word via the softmax layer.

To what extent can these encoder vectors serve as an inferential proxy for the input text beyond simply helping us to predict the next word — consider what the encoder has to incorporate into its recurrent output in order to correctly predict the appropriate type of reform in ‘‘As part of his Square Deal with the American people President Roosevelt introduced sweeping economic reform.’’ and ‘‘As part of his New Deal with the American people President Roosevelt introduced sweeping social reform.’’ There is clearly enough information to distinguish ‘‘Theodore’’ from ‘‘Franklin’’ in this context, but the words ‘‘New’’ and ‘‘Square’’ are more important for predicting the penultimate word.

A bag-of-words representation that includes ‘‘Deal’’ plus one of ‘‘New’’ or ‘‘Square’’ is likely to fall short for prediction — assuming the vocabulary doesn’t include the bigrams ‘‘New_Deal’’ and ‘‘Square_Deal’’, but at least the LSTM encoder has the potential to differentiate between ‘‘New Deal’’ and ‘‘Square Deal’’. It would be potentially useful, however, in preparing to generate a response to have distinguished ‘‘Franklin Roosevelt’’ from ‘‘Theodore Roosevelt.’’

LSTM layers can be trained to compute differences between vectors thereby ‘‘backing’’ out terms ‘‘bound’’ in a vector composition via, say, superposition. In principle, it should be possible for a layer to take the embedding for a sequence of words, say, ‘‘Franklin Roosevelt was an effective social reformer’’, back out some bound term, say, ‘‘Franklin’’, substitute an alternative term, say, ‘‘Theodore’’ and then check the result against a collection of natural-logic assertions including, say, ‘‘Theodore Roosevelt was an effective economic reformer’’ and find no or scant supporting evidence.
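The back-out-and-substitute step can be sketched with fixed random vectors and plain superposition (vector addition). This is only an illustration under strong assumptions: the word vectors are random rather than learned, composition is an unweighted sum that ignores word order, and cosine similarity stands in for whatever check a trained layer would perform. All names below are hypothetical.

```python
import math
import random

random.seed(1)
DIM = 512  # high dimension so random vectors are nearly orthogonal (assumed)

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

words = ["franklin", "theodore", "roosevelt", "effective",
         "social", "economic", "reformer"]
vec = {w: rand_vec() for w in words}

def superpose(ws):
    """Compose a phrase vector as the sum of its word vectors."""
    total = [0.0] * DIM
    for w in ws:
        total = add(total, vec[w])
    return total

franklin_social = superpose(["franklin", "roosevelt", "effective", "social", "reformer"])
theodore_social = superpose(["theodore", "roosevelt", "effective", "social", "reformer"])
theodore_economic = superpose(["theodore", "roosevelt", "effective", "economic", "reformer"])

# Back out the bound term "franklin" and substitute "theodore".
substituted = add(sub(franklin_social, vec["franklin"]), vec["theodore"])

match_social = cosine(substituted, theodore_social)      # essentially 1.0
match_economic = cosine(substituted, theodore_economic)  # noticeably lower
```

In this toy setting the substituted vector coincides with ‘‘Theodore Roosevelt was an effective social reformer’’ and only partially overlaps the stored assertion about economic reform, which is the ‘‘scant supporting evidence’’ signal. Four of the five words are shared, so the gap is modest; a real experiment would need a sharper test than raw cosine.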

This simple differencing approach to binding variables may not be as effective as those used in MCR [187] (Modular Compositional Representation) or PSI [221] (Predication-based Semantic Indexing), but it may be worth running some experiments given that the differencing method could turn out to be easier to implement within the family of NN architectures we are considering, and, if not, perhaps we will figure out a hybrid approach that offers an attractive compromise. In that spirit, here’s a proposal for a quick-and-dirty experiment.

Given the arguably-false, natural-logic sentence ‘‘Franklin Roosevelt was an effective economic reformer’’ issued in the context of a conversation, we might ask if the utterance is entailed by ‘‘Theodore Roosevelt was an effective economic reformer’’ or ‘‘Franklin Roosevelt was an effective social reformer’’. The answer should be effectively ‘‘no’’ as indicated by P(→) < 0.5, using an inference ‘‘engine’’ like NLI. However, the sentence ‘‘Franklin Roosevelt was an effective social activist’’ would seem more probable given ‘‘Franklin Roosevelt was an effective social reformer’’.

Consider the following four sentences: (1) ‘‘Roosevelt was an effective social activist’’, (2) ‘‘social activists are generally social reformers’’, (3) ‘‘social reformers are generally advocates for improving the lives of the disadvantaged and impoverished’’, and (4) ‘‘Franklin Roosevelt significantly improved the lives of the poor and unemployed during the Great Depression’’. Is it reasonable to expect that a neural network model for natural-logic inference could conclude (4) from (1-3)?

There are a couple of key inference problems that we’re interested in addressing: Suppose the model has been trained on Wikipedia including the following sentence: ‘‘Roosevelt signed into law several bills that provided relief to those who lost their jobs in the great depression.’’ How might we answer the user’s question: ‘‘Did Roosevelt’s New Deal help the poor?’’ For one thing, the system would have to infer the user is talking about Franklin and not Theodore. And then there is the problem of bringing inference to bear on the selection of words and the form of the reply in the process of generating responses.

Given Descartes’ interest in simulating historical figures, we might hope to have a substantial corpus of dialogue like: Q: ‘‘Did McKinley have any impact on global trade ...’’, A: ‘‘William McKinley was more of a political follower than leader ...’’ and Q: ‘‘What economic policy was Grover Cleveland known for ...’’ A: ‘‘President Cleveland signed legislation ...’’. The specific answers found in the training corpus may not be appropriate to the conversation at hand, but generative machinery — of the LSTM decoder trained on this corpus — may provide a script we can use to respond with an appropriate dialog / speech act.

We could supplement the dialog corpus as a source of such syntactic structure and intentional cues, with knowledge specific to the entities named in the user input. For example, given a question or comment about Teddy Roosevelt’s economic impact, we could make use of ‘‘Roosevelt passed legislation ... antitrust laws ... regulating monopolistic trust corporations ...’’, or asked about McKinley’s impact on global finance following the American Civil War we could ‘‘fold in’’ the Wikipedia statement ‘‘McKinley ... raised protective tariffs to promote American industry ... maintained the gold standard in opposition to ...’’.

In earlier posts, I gave short shrift to the power of natural logic, thinking we would require some analog of conventional rules. Now I’m not so sure. Emulating forward or backward chaining in vector space may be more trouble than it’s worth. It depends on how closely we want to cleave to the syntactic and semantic precision of predicate calculus which is generally seen as being at odds with the flexibility and useful ambiguity of natural language. Full-blown logical inference would require vector / tensor machinery for applying rules of inference including universal instantiation, existential generalization, modus ponens, and applying DeMorgan’s laws as well as other transformations. Doable but worth putting off until we have a compelling use case that simpler natural-logic can’t handle38.

In natural logic the statement ‘‘Pop stars take drugs’’ is shorthand for ∀ x, isa(x,pop_star) ⊃ ∃ y, drug(y) ∧ ingest(x,y), but translating such statements into well-formed formulae is tedious and error prone. Natural-logic inference allows us to determine whether or not ‘‘Pop stars take drugs’’ entails ‘‘Michael_Jackson takes drugs’’. In fact, while natural logic does not support this conclusion, it would be interesting if we could derive — or would have trouble denying — a weak form of entailment for ‘‘Michael_Jackson takes drugs’’ from ‘‘Steven_Tyler mainlines heroin’’, ‘‘Alvin_Lee shoots crystal meth’’ and ‘‘Keith_Richards smokes crack’’ in the absence of any statements of the sort ‘‘Donny_Osmond doesn’t take drugs’’.

Residual thoughts: Read the Ba et al [6] paper on training an object recognition system with an attentional component that ostensibly avoids the overhead of convolutional layers — lots of wasted dot products — applied to large images. Also read their earlier work Mnih et al [151] and scanned the related paper by Maes et al [132]. I was primarily interested in whether their model provides any insights into how we might build an attentional component to control inference. Ba et al start with a global context / salience map, and following each subsequent saccade they obtain a new foveated view of the image 39. In the case of dialog, the analog of a gist-like context might be a bag-of-words representation and salient regions might correspond to high-entropy subsequences of the input history — think of the textual analog of Itti and Koch [92, 93] salience heuristically related to recency and novelty.
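One crude way to read ‘‘salience heuristically related to recency and novelty’’ for text is to score each token by its surprisal under the unigram statistics of the history and discount that by age. The sketch below is my own assumption about what such a heuristic could look like, not the Itti and Koch model and not anything from Ba et al; the function name and the `half_life` parameter are hypothetical.

```python
import math
from collections import Counter

def salience(history, half_life=5.0):
    """Score each token as (unigram surprisal) * (exponential recency weight)."""
    counts = Counter(history)
    total = len(history)
    scores = []
    for pos, tok in enumerate(history):
        surprisal = -math.log(counts[tok] / total)  # novelty: rare tokens score high
        age = (total - 1) - pos                     # 0 for the most recent token
        recency = 0.5 ** (age / half_life)          # halves every `half_life` tokens
        scores.append((tok, surprisal * recency))
    return scores

history = "the new deal helped the poor and the unemployed".split()
scores = salience(history)
most_salient = max(scores, key=lambda s: s[1])[0]
```

Here the frequent token ‘‘the’’ is penalized for lack of novelty and early tokens are penalized for age, so the most recent rare word comes out on top.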

Using Ba et al as motivation, Ilya Sutskever gave a technical talk on basic reinforcement learning in the Brain reading seminar, and Viren Jain mentioned Sebastian Seung’s paper [183] on how:

[T]he randomness of synaptic transmission is harnessed by the brain for learning, in analogy to the way that genetic mutation is utilized by Darwinian evolution. This is possible if synapses are ‘‘hedonistic,’’ responding to a global reward signal by increasing their probabilities of vesicle release or failure, depending on which action immediately preceded reward. Hedonistic synapses learn by computing a stochastic approximation to the gradient of the average reward. They are compatible with synaptic dynamics such as short-term facilitation and depression and with the intricacies of dendritic integration and action potential generation.

Sebastian uses the REINFORCE algorithm of Baxter and Bartlett [12] to demonstrate how a network of hedonistic synapses can be trained to perform a desired computation by administering reward appropriately, as illustrated here through numerical simulations of integrate-and-fire model neurons. Coincidentally a paper in the latest issue of Neuron by Tremblay et al [210] shows how the simultaneous activity of ensembles of neurons in the primate lateral prefrontal cortex can be decoded to reliably predict the ‘‘allocation of attention on a single-trial basis’’. Oh, and I’ll probably never get back to this, but as I was looking for related work by Mnih I stumbled on a paper by Mnih and Hinton [150] on ‘‘hierarchical language models’’ which claims to constrain chunking to binary relations in consecutive terms or conjunctions of terms and — perhaps — identify the tags-terms corresponding to parts of speech, e.g., prepositions, conjunctions, etc, and group accordingly.
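Returning to the hedonistic-synapse idea above, the phrase ‘‘stochastic approximation to the gradient of the average reward’’ can be illustrated with a single stochastic unit. This is a deliberately tiny sketch under my own assumptions: one Bernoulli ‘‘synapse’’ whose release probability is a logistic function of a parameter `theta`, a reward that is simply 1 when release occurs, and a score-function (REINFORCE-style) update. It is not Seung’s model, which involves integrate-and-fire neurons and synaptic dynamics.

```python
import math
import random

def release_probability(theta):
    """Logistic map from the synaptic parameter to a release probability."""
    return 1.0 / (1.0 + math.exp(-theta))

def train_synapse(steps=2000, lr=0.5, seed=0):
    """Adjust theta so that the rewarded action (release) becomes more likely."""
    rng = random.Random(seed)
    theta = 0.0  # start at release probability 0.5
    for _ in range(steps):
        p = release_probability(theta)
        released = 1.0 if rng.random() < p else 0.0
        reward = released  # assumed task: vesicle release is what gets rewarded
        # d/dtheta of log P(action) for a Bernoulli-logistic unit is (released - p),
        # so reward * (released - p) is an unbiased sample of the reward gradient.
        theta += lr * reward * (released - p)
    return release_probability(theta)

final_p = train_synapse()
```

Because only rewarded trials change `theta`, and they always push it upward in this toy task, the release probability climbs toward 1. A task that rewarded failure instead would need a baseline or a signed reward to drive `theta` down.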

November 29, 2014

I’m preparing for the Spring edition of my computational neuroscience class (CS379C) at Stanford, and this year we’ll be looking at machine-learning methods for functional connectomics. While I hope to get interesting recordings of neural activity from Ed’s lab and Clay’s team at the Allen Institute, I’d also like to be able to generate synthetic data by simulating molecular models using MCell or NEURON in order to conduct controlled computational experiments to get a better handle on what’s possible. Here is a video showing off an EM reconstruction of cells in rat hippocampus from the Salk Institute:

Reconstruction of a block of hippocampus from a rat approximately 5 micrometers on a side from serial section transmission electron microscopy in the lab of Kristen Harris at the University of Texas at Austin in collaboration with Terry Sejnowski at the Salk Institute and Mary Kennedy at Caltech. Josef Spacek, Daniel Keller, Varun Chaturvedi, Chandrajit Bajaj, Justin Kinney and Tom Bartol made major contributions to the reconstruction and the video. (YOUTUBE)

This reconstruction was used to conduct molecular-scale simulations of hippocampal neurons for the purpose of investigating hypotheses concerning the role of extra-synaptic — sometimes referred to as ‘‘ectopic’’ transmission — neurotransmitter diffusion [35, 128]. Jed Wing (jedwing@google.com) was a key contributor to MCell and probably knows more about the code base than anyone else. Justin Kinney did a lot of the modeling work which is described in Justin’s Ph.D. Thesis at UC San Diego: (PDF).

I would like to find a similarly-detailed molecular model and instrument it so that we can simulate a cortical circuit consisting of something on the same order of complexity as the Hill and Tononi work [82] — on the order of ~1000 cells and millions of connections. I’m in touch with Justin Kinney who is now in Ed Boyden’s lab and I will reach out to Terry Sejnowski and others on the MCell team at the Salk Institute to identify suitable models40.

The motivation is not to supplant the use of recordings from neural tissue but rather to anticipate the advent of such data, and examine the hypothesis that we will actually be able to make sense of such data when it is available at scale. There is a database of models indexed by simulator available on the NEURON website. There is only one MCell model listed on the Yale website. As one might expect, there are quite a few models that can be run on NEURON. There is probably a more extensive list of models for MCell available elsewhere.

Another problem for which synthetic data might prove useful is in testing algorithms that make use of an adjacency matrix obtained from the connectomic analysis of EM data to infer function, classify cell types or estimate the existence or strength of synapses. Suppose that we have an adjacency matrix obtained from a connectome for which we have additional information about the location, size and perhaps layer of cortex for the cell bodies of the neurons represented in the matrix.

Suppose that we also have an error model that provides information about the probability of an error in tracing an axonal or dendritic process as a function of the process length or other characteristics that can be reliably inferred from current state-of-the-art connectomic analysis assuming tractable human correction / annotation. For small circuits, we might be able to combine this synthetic connectomic data with simulated calcium imaging of the sort described above.
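As a starting point for such experiments, here is a minimal sketch of how one might generate a ground-truth adjacency matrix and a corrupted ‘‘traced’’ version of it. Everything specific is an assumption on my part: cell bodies are placed uniformly in a cube, straight-line distance stands in for process length, the probability of losing a connection is 1 - exp(-length / length_scale), and only missed connections are modeled (no false merges). The function and parameter names are hypothetical.

```python
import math
import random

def synthetic_connectome(n_cells=50, p_connect=0.2, length_scale=200.0, seed=0):
    """Return (positions, true adjacency, observed adjacency with tracing errors)."""
    rng = random.Random(seed)
    # Cell-body positions in a 100 x 100 x 100 micrometer cube (assumed geometry).
    pos = [(rng.uniform(0, 100), rng.uniform(0, 100), rng.uniform(0, 100))
           for _ in range(n_cells)]
    true_adj = [[0] * n_cells for _ in range(n_cells)]
    observed = [[0] * n_cells for _ in range(n_cells)]
    for i in range(n_cells):
        for j in range(n_cells):
            if i == j or rng.random() >= p_connect:
                continue  # no self-connections; most pairs are unconnected
            true_adj[i][j] = 1
            length = math.dist(pos[i], pos[j])  # straight-line proxy for process length
            p_error = 1.0 - math.exp(-length / length_scale)  # longer means riskier
            if rng.random() >= p_error:
                observed[i][j] = 1  # the connection survived tracing
    return pos, true_adj, observed

pos, true_adj, observed = synthetic_connectome()
n_true = sum(map(sum, true_adj))
n_observed = sum(map(sum, observed))
```

An inference algorithm can then be scored on how well it recovers `true_adj` from `observed` plus the positions, and the error model can be swapped out once we have a better empirical estimate.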

November 5, 2014

Summary of near-term goal of connectomics: high-throughput measurements of wiring diagrams in arbitrary nervous systems at the scale of microcircuits (~10^5 to 10^6 neurons and their synaptic interconnections). The measurement will be destructive (invasive) for now, pending fundamental advances in non-invasive nanometer-resolution imaging.

Technological impacts:

  1. A library of computing microcircuits.

    * Routine measurement of wiring diagrams will lead to a library of microcircuits that are found in various nervous systems; this library will be a resource that provides inspiration and models for artificial computing systems. As an analogy, high-throughput genomics has led to a library of protein-encoding sequences that have been the basis of the burgeoning synthetic biology industry. Organizing and analyzing a library of microcircuits will be a major intellectual challenge, but an inevitable and fruitful one.

    * What is the justification for the claim that knowledge of biological microcircuits can usefully inform artificial computing systems? A famous example is that convolutional networks (specifically, the notion of serial layers of filtering and pooling) can be traced back (via LeCun and Fukushima) to measurements in visual cortex by Hubel and Wiesel. At the time, they were restricted to low-throughput physiology techniques, rather than high-throughput measurements of network connectivity. Who knows what we might find in entire libraries of such microcircuits.

  2. Computational Infrastructure for Automated Analysis of Large-scale Biological Imaging Data.

    * Biology and clinical diagnosis are undergoing a major transformation toward the use of imaging technologies as a platform for data acquisition. This is due to fundamental advances in the resolution and cost-effectiveness of imaging technologies and labeling methods, and the potential for modern computing technology to work with the resulting datasets. What is missing is the software infrastructure and algorithms for automatically analyzing such data.

    * For example, for the problem of ‘‘segmenting’’ (reconstructing) structures in biological imaging datasets, previous work in computer vision has not emphasized the scenario in which there is a rigorous notion of ground truth, as segmentation for natural images is often ill-defined. Hence, research on connectomic reconstruction is leading to new ‘‘end-to-end’’ learning algorithms for the segmentation problem that will be broadly useful for segmentation of datasets in which there is a clear notion of ground truth.

    * For example, Neuromancer has built Google3 infrastructure for storing, accessing, and processing 3d images that are petabyte-scale. This infrastructure will be broadly useful as other imaging platforms (particularly those currently evolving for clinical usage) themselves also grow in scale. In the clinical sphere, at some point (much like with the Cloud Genomics effort), it will be logical to shift management of these datasets to centralized sources of storage and computation, though the timeline for that type of shift is highly uncertain. We anticipate the infrastructure will be useful for other volumetric datasets, including atmospheric data, fluid dynamics simulations, geophysical models.

  3. Precise models for treating brain diseases.

    * Improved understanding of how microcircuits in the brain are organized (by comparing the circuitry of healthy and diseased brains) will inform any treatment of circuit-related brain disorders. For example, it is known that techniques such as deep brain stimulation can in some populations act as a significant therapeutic for depression and Parkinson’s. However it is not really known why this works, or why in many cases it doesn’t work. A more precise understanding of the relevant circuits will lead to interventions that are more precisely formulated and therefore more consistently effective.

  4. Precise models for neural interfaces.

    * It is inevitable that humans will want and achieve technology that directly interfaces with neural circuits in various parts of the brain. This will enable entirely new forms of human experience and communication, and will offer the ultimate solution to neural prosthetics of various kinds. The only hope for any efficient path to achieving this type of technology is to first have some clear hypothesis of where and how to interface with different kinds of neural circuits. Connectomics offers one plausible route to generating such hypotheses.

Here are two of the problems that Ed Boyden is obsessed with. I’m similarly obsessed:

  1. If we could reconstruct the prefrontal cortex, we could try to understand how brains derive rational decisions. With the visual cortex, we could understand how we process images and recognize objects. Many problems in AI have of course a natural intelligence correlate.

  2. Mapping brain disorders. We don’t understand the origins or pathology, at a circuit level, of any brain disorder. If we can find neural targets in the brain that are changed in brain disorders, we could design drugs or stimulators that modulate those neurons. So far, all treatments for brain disorders were found by chance, are not understood, and work poorly.

Here is the title and abstract of the talk that Ed will be giving at Google next Tuesday:

Mapping the brain at scale: collecting the data necessary to infer the computations carried out by neural circuits

If we are to understand the computational basis for intelligent behavior, we need new technologies for mapping the molecular and anatomical circuitry of the brain and recording its dynamic activity with sufficient detail to infer the computations performed by neural circuits. Our group is working on three new approaches to address this need.

First, we have developed a fundamentally new super-resolution light microscopy technology that is faster than any other super-resolution technology, on a per-voxel basis. We anticipate that our new microscopy method, and improved versions we are working on currently, will enable imaging of molecular and anatomical information throughout entire brain circuits, and perhaps even entire brains.

Second, we have adapted to neuroscience the technology of plenoptic or lightfield microscopy, a technology that enables single-shot 3-D images to be acquired without moving parts, and thus can be used to record high-speed movies of neural activity (Nature Methods 11:727-730). We are continuing to improve such microscopes, to the point where they may be useful for imaging the entire mammalian cortex.

Finally, we are working to get the world’s smallest mammal, the Etruscan shrew, going as a model system in visual neuroscience. The Etruscan shrew has a small brain, with a six-layer cortex just a few hundred microns thick, and a visual cortex with perhaps just 75,000 neurons—less than a fly. It is small enough that entire molecular and anatomical maps, as well as dynamic activity maps, of the visual cortex might be feasible using the above tools in the near future, thereby enabling fundamental new models of how the cortex operates.

October 3, 2014

Problem: Observe the spiking behavior of every neuron (302 total) in an awake behaving nematode (C. elegans) at between 10 and 50 Hz while simultaneously recording the gross structure and activity of the worm and controlling all environmental stimuli, with the grand goal of building an artificial worm simulated down to the level of individual neurons. This would be the first such simulation ever and herald a completely new paradigm for computational neuroscience.

Assets: Imaging and data acquisition: [169], C. elegans connectome: [220]. Early analysis work [197].

August 7, 2014

Here are some notes and references for the passages for which the reviewer (Eliasmith) asked for citations. I didn’t talk about sparse coding—for which I’d cite Barlow [10] and [156]—or examples from optimization and signal processing that make use of locally adaptive methods—regarding which I’d mention the utility of adaptive subgradient methods [47] for stochastic gradient descent training and regression kernels for object detection [182], but I don’t know much about precedents for work like Chris Rozell’s [174]:

Before max pooling there was winner-take-all (WTA) in Fukushima’s Neocognitron [54]. The max pooling layers in the Riesenhuber and Poggio HMAX model [172] have all but replaced WTA in most DNNs though the jury is out on whether this is the best model in practical or biological terms [178].
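For readers who haven’t seen the two operations side by side, here is a one-dimensional sketch. It reflects one simple reading of winner-take-all as a response function (keep the strongest unit in each window, silence the rest); the competition in the Neocognitron itself is more involved, so treat `winner_take_all` as an illustrative assumption rather than Fukushima’s mechanism.

```python
def max_pool(values, width):
    """Non-overlapping max pooling: one output per window, so the signal shrinks."""
    return [max(values[i:i + width]) for i in range(0, len(values), width)]

def winner_take_all(values, width):
    """Keep the largest unit in each window and zero the others; length is preserved."""
    out = []
    for i in range(0, len(values), width):
        window = values[i:i + width]
        best = max(range(len(window)), key=window.__getitem__)  # index of the winner
        out.extend(v if k == best else 0.0 for k, v in enumerate(window))
    return out

responses = [0.1, 0.9, 0.4, 0.3]
pooled = max_pool(responses, 2)          # [0.9, 0.4]
winners = winner_take_all(responses, 2)  # [0.0, 0.9, 0.4, 0.0]
```

The practical difference is that max pooling discards which unit won, buying some position tolerance, while this form of winner-take-all keeps the winner’s location.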

The computer vision community has experimented with a number of nonlinearities derived from biological models. The logistic function is of far too general utility to credit neuroscience with inspiring its use in computer vision, but its use as a differentiable alternative to the classical threshold unit is certainly the most common introduction to the sigmoid for most computer scientists. Simple truncation methods such as using a half-wave rectifier instead of the traditional sigmoid were biologically motivated [135]. These rectified linear units have outperformed sigmoidal activation functions to obtain the best results in several benchmark problems [115].
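The three nonlinearities mentioned here are easy to write down, and seeing them together makes the trade-off plain: the threshold unit is not differentiable, the logistic function is smooth but saturates on both sides, and the half-wave rectifier is linear for positive input and zero otherwise. A minimal sketch (function names are mine):

```python
import math

def threshold(x):
    """Classical threshold unit: fires (1.0) at or above zero, otherwise 0.0."""
    return 1.0 if x >= 0 else 0.0

def logistic(x):
    """Smooth, differentiable stand-in for the threshold; saturates at 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def half_wave_rectify(x):
    """Rectified linear unit: passes positive input unchanged, truncates the rest."""
    return max(0.0, x)

inputs = [-2.0, 0.0, 3.0]
rectified = [half_wave_rectify(x) for x in inputs]  # [0.0, 0.0, 3.0]
squashed = [logistic(x) for x in inputs]
```

Note that `logistic(3.0)` is already above 0.95, so large inputs are barely distinguishable from one another, whereas the rectifier preserves their magnitude.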

Neuroscientists prefer not to invoke global operations in their models, but local operators derived from biological models have turned out to be both effective and efficient. Surround suppression in classical receptive fields is applied in the form of local non-max suppression for edge and contour detectors and localization in object recognition [91]. Some form of local (divisive) normalization appears to operate in a number of neural systems [28, 29] and local contrast normalization is one of the most important operators in state-of-the-art object recognition systems [96].
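Both local operators can be sketched in one dimension. The divisive form below, squared response over a constant plus the summed squared responses in a neighborhood, is the generic textbook shape of normalization models; the neighborhood radius, the constant `sigma` and the function names are my assumptions, and the systems cited above use more elaborate two-dimensional, multi-channel versions.

```python
def divisive_normalize(x, radius=1, sigma=1.0):
    """Divide each squared response by sigma^2 plus the local pool of squared responses."""
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
        pool = sum(v * v for v in x[lo:hi])  # local energy, including the unit itself
        out.append(x[i] * x[i] / (sigma * sigma + pool))
    return out

def non_max_suppress(x, radius=1):
    """Keep a response only if it is the maximum of its local neighborhood."""
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
        out.append(x[i] if x[i] == max(x[lo:hi]) else 0.0)
    return out

normalized = divisive_normalize([0.0, 2.0, 0.0])        # [0.0, 0.8, 0.0]
peaks = non_max_suppress([1.0, 3.0, 2.0, 5.0, 4.0])     # [0.0, 3.0, 0.0, 5.0, 0.0]
```

Neither function looks beyond `radius` neighbors, which is the point: the operations are local, so they are cheap and parallelize trivially.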

References

[1]   B. Agüera y Arcas, A.L. Fairhall, and W. Bialek. Computation in a single neuron: Hodgkin and Huxley revisited. Neural Computation, 15(8):1715--49, 2003.

[2]   Costas A. Anastassiou, Rodrigo Perin, Henry Markram, and Christof Koch. Ephaptic coupling of cortical neurons. Nature Neuroscience, 14(2):217--223, 2011.

[3]   Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87--106, 1987.

[4]   Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. CoRR, arXiv:1502.03520, 2015.

[5]   C. A. Atencio and C. E. Schreiner. Columnar connectivity and laminar processing in cat primary auditory cortex. PLoS ONE, 5(3):e9521, 2010.

[6]   Jimmy Lei Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. Submitted to International Conference on Learning Representations, arXiv:1412.7755, 2015.

[7]   Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, arXiv:1409.0473, 2015.

[8]   Carlo Baldassi, Alireza Alemi-Neissi, Marino Pagan, James J DiCarlo, Riccardo Zecchina, and Davide Zoccolan. Shape similarity, better than semantic membership, accounts for the structure of visual object representations in a population of monkey inferotemporal neurons. PLoS computational biology, 9:e1003167, 2013.

[9]   Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51--77, 2009.

[10]   Horace B. Barlow. Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217--234. MIT Press, Cambridge, MA, 1961.

[11]   Horace B. Barlow. Unsupervised learning. Neural Computation, 1:295--311, 1989.

[12]   J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319--350, 2001.

[13]   Mark F. Bear, Barry Connors, and Michael Paradiso. Neuroscience: Exploring the Brain (Third Edition). Lippincott Williams & Wilkins, Baltimore, Maryland, 2006.

[14]   Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.

[15]   Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Brad Huang, Christopher D. Manning, Abby Vander Linden, Brittany Harding, and Peter Clark. Modeling biological processes for reading comprehension. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.

[16]   William Bialek, Fred Rieke, R.R. de Ruyter van Steveninck, and D. Warland. Reading a neural code. Science, 252:1854--1857, 1991.

[17]   Elie Bienenstock and Stuart Geman. Compositionality in neural systems. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 223--226. Bradford Books/MIT Press, 1995.

[18]   Elie Bienenstock, Stuart Geman, and Daniel Potter. Compositionality, MDL priors and object recognition. In M.C. Mozer, M.I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 838--844. MIT Press, Cambridge, MA, 1998.

[19]   Peter Blouw and Chris Eliasmith. A neurally plausible encoding of word order information into a semantic vector space. In 35th Annual Conference of the Cognitive Science Society, pages 1905--1910, 2013.

[20]   Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 1989.

[21]   Davi D. Bock, Wei-Chung Allen Lee, Aaron M. Kerlin, Mark L. Andermann, Greg Hood, Arthur W. Wetzel, Sergey Yurgenson, Edward R. Soucy, Hyon Suk Kim, and R. Clay Reid. Network anatomy and in vivo physiology of visual cortical neurons. Nature, 471(7337):177--182, 2011.

[22]   Zimbo SRM Boudewijns, Tatjana Kleele, Huibert D. Mansvelder, Bert Sakmann, Christiaan PJ de Kock, and Marcel Oberlaender. Semi-automated three-dimensional reconstructions of individual neurons reveal cell type-specific circuits in cortex. Communications Integrative Biology, 4:486--488, 2011.

[23]   Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. Recursive neural networks for learning logical semantics. CoRR, abs/1406.1827, 2014.

[24]   Christie BP, Tat DM, Irwin ZT, Gilja V, Nuyujukian P, Foster JD, Ryu SI, Shenoy KV, Thompson DE, and Chestek CA. Comparison of spike sorting and thresholding of voltage waveforms for intracortical brain-machine interface performance. Journal of Neural Engineering, 12:016009, 2015.

[25]   K.L. Briggman, M. Helmstaedter, and W. Denk. Wiring specificity in the direction-selectivity circuit of the retina. Nature, 471:183--188, 2011.

[26]   E.M. Callaway. Transneuronal circuit tracing with neurotropic viruses. Current Opinion in Neurobiology, 18(6):617--23, 2008.

[27]   M. Carandini. From circuits to behavior: a bridge too far? Nature Neuroscience, 15(4):507--509, 2012.

[28]   Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13:51--62, 2012.

[29]   Matteo Carandini, David J. Heeger, and J. Anthony Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17:8621--8644, 1997.

[30]   Fei Chen, Paul W. Tillberg, and Edward S. Boyden. Expansion microscopy. Science, 347(6221):543--548, 2015.

[31]   K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, arXiv:1406.1078, 2014.

[32]   Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, arXiv:1409.1259, 2015.

[33]   Noam Chomsky. Knowledge of Language: Its Nature, Origin and Use. Praeger, New York, NY, 1986.

[34]   Brian Y. Chow and Edward S. Boyden. Optogenetics and translational medicine. Science Translational Medicine, 5(177):177ps5, 2013.

[35]   Jay S. Coggan, Thomas M. Bartol, Eduardo Esquenazi, Joel R. Stiles, Stephan Lamont, Maryann E. Martone, Darwin K. Berg, Mark H. Ellisman, and Terrence J. Sejnowski. Evidence for ectopic neurotransmission at a neuronal synapse. Science, 309(5733):446--451, 2005.

[36]   D. D. Cox, A. M. Papanastassiou, D. Oreper, B. B. Andken, and J. J. DiCarlo. High-resolution three-dimensional microelectrode brain mapping using stereo microfocal x-ray imaging. Journal of Neurophysiology, 100(5):2966--2976, 2008.

[37]   David D. Cox and James J. DiCarlo. Does learned shape selectivity in inferior temporal cortex automatically generalize across retinal position? Journal of Neuroscience, 28(40):10045--10055, November 2008.

[38]   David Daniel Cox and Thomas Dean. Neural networks and neuroscience-inspired computer vision. Current Biology, 24:921--929, 2014.

[39]   Thaddeus R. Cybulski, Joshua I. Glaser, Adam H. Marblestone, Bradley M. Zamft, Edward S. Boyden, George M. Church, and Konrad P. Kording. Spatial information in large-scale neural recordings. Frontiers in Computational Neuroscience, 8:1--16, 2015.

[40]   Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223--1231. Curran Associates, Inc., 2012.

[41]   Thomas Dean, Biafra Ahanonu, Mainak Chowdhury, Anjali Datta, Andre Esteva, Daniel Eth, Nobie Redmon, Oleg Rumyantsev, and Ysis Tarter. On the technology prospects and investment opportunities for scalable neuroscience. ArXiv preprint cs.CV/1307.7302, 2013.

[42]   Thomas Dean, Dana Angluin, Kenneth Basye, Sean Engelson, Leslie Kaelbling, Evangelos Kokkevis, and Oded Maron. Inferring finite automata with stochastic output functions and an application to map learning. Machine Learning, 18(1):81--108, 1995.

[43]   Thomas Dean, Greg S. Corrado, and Jonathon Shlens. Three controversial hypotheses concerning computation in the primate cortex. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[44]   Yash Deshpande and Andrea Montanari. Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time. Foundations of Computational Mathematics, pages 1--60, 2013.

[45]   James J. DiCarlo and David D. Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333--341, 2007.

[46]   Shaul Druckmann, Thomas K. Berger, Felix Schürmann, Sean Hill, Henry Markram, and Idan Segev. Effective stimuli for constructing reliable neuron models. PLoS Computational Biology, 7(8):e1002133, 2011.

[47]   John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121--2159, 2011.

[48]   A.L. Eberle, S. Mikula, R. Schalek, J.W. Lichtman, M.L. Tate, and D. Zeidler. High-resolution, high-throughput imaging with a multibeam scanning electron microscope. Journal of Microscopy, 2015.

[49]   Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In D. S. Touretzky, M. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, 1996.

[50]   Soheil Feizi, Daniel Marbach, Muriel Medard, and Manolis Kellis. Network deconvolution as a general method to distinguish direct dependencies in networks. Nature Biotechnology, 31:726--733, 2013.

[51]   D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in primate cerebral cortex. Cerebral Cortex, 1:1--47, 1991.

[52]   Santiago Fernández, Alex Graves, and Jürgen Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

[53]   Alyson K Fletcher and Sundeep Rangan. Scalable inference for neuronal connectivity from calcium imaging. In Zoubin Ghahramani, Max Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2843--2851. Curran Associates, Inc., 2014.

[54]   K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193--202, 1980.

[55]   Mario Galarreta and Shaul Hestrin. Electrical and chemical synapses among parvalbumin fast-spiking GABAergic interneurons in adult mouse neocortex. PNAS, 99:12438--12443, 2002.

[56]   S. Ganguli and H. Sompolinsky. Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annual Review of Neuroscience, 35:485--508, 2012.

[57]   Peiran Gao and Surya Ganguli. On simplicity and complexity in the brave new world of large-scale neuroscience. CoRR, arXiv:1503.08779, 2015.

[58]   Michael S. Gazzaniga. The Cognitive Neurosciences (Third Edition). Bradford Books. MIT Press, Cambridge, MA, 2009.

[59]   Stuart Geman. Invariance and selectivity in the ventral visual pathway. Journal of Physiology --- Paris, 100(4):212--224, 2006.

[60]   Dileep George and Jeff Hawkins. Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, 5(10), 2009.

[61]   K.K. Ghosh, L.D. Burns, E.D. Cocker, A. Nimmerjahn, Y. Ziv, A.E. Gamal, and M.J. Schnitzer. Miniaturized integration of a fluorescence microscope. Nature Methods, 8(10):871--8, 2011.

[62]   Nicolas Giret, Joergen Kornfeld, Surya Ganguli, and Richard H. R. Hahnloser. Evidence for a causal inverse model in an avian cortico-basal ganglia circuit. Proceedings of the National Academy of Sciences USA, 111:6063--6068, 2014.

[63]   E. Mark Gold. System identification via state characterization. Automatica, 8:621--636, 1972.

[64]   E. Mark Gold. Complexity of automaton identification from given sets. Information and Control, 37:302--320, 1978.

[65]   Mark S. Goldman, Joseph H. Levine, Guy Major, David W. Tank, and H.S. Seung. Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron. Cerebral Cortex, 13(11):1185--1195, 2003.

[66]   Kristen Grauman and Trevor Darrell. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8:725--760, 2007.

[67]   Alex Graves. Supervised sequence labelling with recurrent neural networks. Diploma thesis. Technische Universität München, 2009.

[68]   Alex Graves. Generating sequences with recurrent neural networks. CoRR, arXiv:1308.0850, 2012.

[69]   Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369--376, New York, NY, USA, 2006. ACM.

[70]   Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1764--1772, 2014.

[71]   Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, arXiv:1410.5401, 2014.

[72]   Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.

[73]   V. Gripon and C. Berrou. Sparse neural networks with large learning diversity. IEEE Transactions on Neural Networks, 22(7):1087--1096, 2011.

[74]   Petilla Interneuron Nomenclature Group. Petilla terminology: nomenclature of features of GABAergic interneurons of the cerebral cortex. Nature Reviews Neuroscience, 9:557--568, 2008.

[75]   Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H. Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947--951, 2000.

[76]   Alexander Hanuschkin, Markus Diesmann, and Abigail Morrison. A reafferent and feed-forward model of song syntax generation in the Bengalese finch. Journal of Computational Neuroscience, 31(3):509--532, 2011.

[77]   Demis Hassabis and Eleanor A. Maguire. The construction system of the brain. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364:1263--1271, 2009.

[78]   Kenneth J. Hayworth, C. Shan Xu, Zhiyuan Lu, Graham W. Knott, Richard D. Fetter, Juan Carlos Tapia, Jeff W. Lichtman, and Harald F. Hess. Ultrastructurally smooth thick partitioning and volume stitching for large-scale connectomics. Nature Methods, 12:319--322, 2015.

[79]   L. Heck and H. Huang. Deep learning of knowledge graph embeddings for semantic parsing of twitter dialogs. In Proceedings of the IEEE Global Conference on Signal and Information Processing, 2014.

[80]   M. Helmstaedter, C.P.J. de Kock, D. Feldmeyer, R.M. Bruno, and B. Sakmann. Reconstruction of an average cortical column in silico. Brain Research Reviews, 55(2):193--203, 2007.

[81]   Moritz Helmstaedter, Kevin L. Briggman, Srinivas C. Turaga, Viren Jain, H. Sebastian Seung, and Winfried Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500:168--174, 2013.

[82]   S. Hill and G. Tononi. Modeling sleep and wakefulness in the thalamocortical system. Journal of Neurophysiology, 93(3):1671--98, 2005.

[83]   Geoff E. Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46:47--75, 1990.

[84]   Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In Proceedings of the International Conference on Artificial Neural Networks, pages 44--51, 2011.

[85]   Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling knowledge in a neural network. In Deep Learning Workshop at the 2014 Conference on Neural Information Processing Systems, 2014.

[86]   Alan L. Hodgkin and Andrew F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117:500--544, 1952.

[87]   Sonja B. Hofer, Ho Ko, Bruno Pichler, Joshua Vogelstein, Hana Ros, Hongkui Zeng, Ed Lein, Nicholas A. Lesica, and Thomas D. Mrsic-Flogel. Differential connectivity and response dynamics of excitatory and inhibitory neurons in visual cortex. Nature Neuroscience, 14:1045--1052, 2011.

[88]   Jonathan C Horton and Daniel L Adams. The cortical column: a structure without a function. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):837--862, 2005.

[89]   D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology, 160:106--154, 1962.

[90]   D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195:215--243, 1968.

[91]   Aapo Hyvärinen. Statistical models of natural images and cortical visual representation. Topics in Cognitive Science, 2(2):251--264, 2010.

[92]   L. Itti and P. Baldi. A principled approach to detecting surprising events in video. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 631--637, San Diego, CA, 2005.

[93]   L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254--1259, Nov 1998.

[94]   Akerboom J, Chen TW, Wardill TJ, Tian L, Marvin JS, Mutlu S, Calderón NC, Esposti F, Borghuis BG, Sun XR, Gordus A, Orger MB, Portugues R, Engert F, Macklin JJ, Filosa A, Aggarwal A, Kerr RA, Takagi R, Kracun S, Shigetomi E, Khakh BS, Baier H, Lagnado L, Wang SS, Bargmann CI, Kimmel BE, Jayaraman V, Svoboda K, Kim DS, Schreiter ER, and Looger LL. Optimization of a GCaMP calcium indicator for neural activity imaging. The Journal of Neuroscience, 32:13819--13840, 2012.

[95]   Viren Jain, H. Sebastian Seung, and Srinivas C. Turaga. Machines that learn to segment images: a crucial technology for connectomics. Current Opinion in Neurobiology, 20(5):1--14, 2010.

[96]   Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the International Conference on Computer Vision. IEEE Computer Society, 2009.

[97]   Eric Jonas and Konrad Kording. Automatic discovery of cell types and microcircuitry from neural connectomics. CoRR, abs/1407.4137, 2014.

[98]   M. N. Jones and D. J. K. Mewhort. Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114:1--37, 2007.

[99]   J.C. Jung and M.J. Schnitzer. Multiphoton endoscopy. Optics Letters, 28(11):902--904, 2003.

[100]   Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015.

[101]   Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2013.

[102]   Nal Kalchbrenner and Phil Blunsom. Recurrent convolutional neural networks for discourse compositionality. Proceedings of the 2013 Workshop on Continuous Vector Space Models and their Compositionality, 2013.

[103]   Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

[104]   E.R. Kandel, J.H. Schwartz, and T.M. Jessell. Principles of neural science (Fourth Edition). McGraw-Hill, Health Professions Division, 2000.

[105]   P. Kanerva. The binary spatter code for encoding concepts at many levels. In M. Marinaro and P. Morasso, editors, Proceedings of International Conference on Artificial Neural Networks, pages 226--9. Springer-Verlag, 1994.

[106]   Pentti Kanerva. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2):139--159, 2009.

[107]   Dimitri Kartsaklis, Nal Kalchbrenner, and Mehrnoosh Sadrzadeh. Resolving lexical ambiguity in tensor regression models of meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 212--217, Baltimore, USA, 2014. Association for Computational Linguistics.

[108]   M. Kearns and L. G. Valiant. Cryptographic limitations on learning boolean functions and finite automata. In Proceedings of the Twenty First Annual ACM Symposium on Theoretical Computing, pages 433--444, 1989.

[109]   Georges Khazen, Sean L. Hill, Felix Schürmann, and Henry Markram. Combinatorial expression rules of ion channel genes in juvenile rat (Rattus norvegicus) neocortical neurons. PLoS ONE, 7(4):e34786, 2012.

[110]   A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA: GPU run-time code generation for high-performance computing. Technical Report 2009-40, Scientific Computing Group, Brown University, Providence, RI, USA, November 2009.

[111]   Ho Ko, Lee Cossell, Chiara Baragli, Jan Antolik, Claudia Clopath, Sonja B Hofer, and Thomas D Mrsic-Flogel. The emergence of functional microcircuits in visual cortex. Nature, 496(7443):96--100, 2013.

[112]   Ho Ko, Sonja B. Hofer, Bruno Pichler, Katherine A. Buchanan, P. Jesper Sjostrom, and Thomas D. Mrsic-Flogel. Functional specificity of local synaptic connections in neocortical networks. Nature, 473:87--91, 2011.

[113]   Suhasa B. Kodandaramaiah, Giovanni T. Franzesi, Brian Y. Chow, Edward S. Boyden, and Craig R. Forest. Automated whole-cell patch-clamp electrophysiology of neurons in vivo. Nature Methods, 9(6):585--587, 2012.

[114]   Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. Clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1863--1871, 2014.

[115]   Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106--1114, 2012.

[116]   Sanjiv Kumar and Martial Hebert. Man-made structure detection in natural images using a causal multiscale random field. In Proceedings of IEEE Computer Vision and Pattern Recognition, volume 1, pages 119--126, 2003.

[117]   M. Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in Neurosciences, 36(3):141--151, 2013.

[118]   Matthew Lawlor and Steven W. Zucker. Third-order edge statistics: Contour continuation, curvature, and cortical connections. In Christopher J. C. Burges, Leon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, NIPS, pages 1763--1771, 2013.

[119]   Quoc Le and Tomáš Mikolov. Distributed representations of sentences and documents. CoRR, abs/1405.4053v2, 2014.

[120]   Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998.

[121]   Je H. Lee, Evan R. Daugharthy, Jonathan Scheiman, Reza Kalhor, Joyce L. Yang, Thomas C. Ferrante, Richard Terry, Sauveur S. F. Jeanty, Chao Li, Ryoji Amamoto, Derek T. Peters, Brian M. Turczyk, Adam H. Marblestone, Samuel A. Inverso, Amy Bernard, Prashant Mali, Xavier Rios, John Aach, and George M. Church. Highly Multiplexed Subcellular RNA Sequencing in Situ. Science, 343(6177):1360--1363, 2014.

[122]   Tai Sing Lee, David Mumford, Song Chun Zhu, and Victor Lamme. The role of V1 in shape representation. In Bower, editor, Computational Neuroscience, pages 697--703. Plenum Press, New York, 1997.

[123]   T.S. Lee, D. Mumford, R. Romero, and V. Lamme. The role of primary visual cortex in higher level vision. Vision Research, 38:2429--2454, 1998.

[124]   Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171--180, Ann Arbor, Michigan, 2014. Association for Computational Linguistics.

[125]   Anthony D. Lien and Massimo Scanziani. Tuned thalamic excitation is amplified by visual cortical circuits. Nature Neuroscience, 16:1315--1323, 2013.

[126]   W. A. Lim, R. Alvania, and W. F. Marshall. Cell biology 2.0. Trends in Cell Biology, 22(12):611--612, 2012.

[127]   W. A. Lim, C. M. Lee, and C. Tang. Design principles of regulatory networks: searching for the molecular algorithms of the cell. Molecular Cell, 49(2):202--212, 2013.

[128]   Vladan Lucic and Wolfgang Baumeister. Monte Carlo places strong odds on ectopic release. Science, 309:387--388, 2005.

[129]   Bill MacCartney and Christopher Manning. Natural logic for textual inference. In Proceedings of ACL Workshop on Textual Entailment and Paraphrasing, 2007.

[130]   Bill MacCartney and Christopher D. Manning. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 521--528, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[131]   Bill MacCartney and Christopher D. Manning. An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics, pages 140--156, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[132]   Francis Maes, Ludovic Denoyer, and Patrick Gallinari. Structured prediction with reinforcement learning. Machine Learning, 77(2-3):271--301, 2009.

[133]   Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791--804, 2012.

[134]   G. Major, M. E. Larkum, and J. Schiller. Active properties of neocortical pyramidal neuron dendrites. Annual Review of Neuroscience, 36:1--24, 2013.

[135]   Jitendra Malik and Pietro Perona. Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7:923--932, 1990.

[136]   A. H. Marblestone and E. S. Boyden. Designing tools for assumption-proof brain mapping. Neuron, 83(6):1239--1241, 2014.

[137]   Adam H Marblestone, Evan R Daugharthy, Reza Kalhor, Ian D Peikon, Justus M Kebschull, Seth L Shipman, Yuriy Mishchenko, Je Hyuk Lee, David A Dalrymple, Bradley M Zamft, Konrad P Kording, Edward S Boyden, Anthony M Zador, and George M Church. Conneconomics: The economics of dense, large-scale, high-resolution neural connectomics. bioRxiv, 2014.

[138]   Adam H. Marblestone, Evan R. Daugharthy, Reza Kalhor, Ian D. Peikon, Justus M. Kebschull, Seth L. Shipman, Yuriy Mishchenko, Je Hyuk Lee, Konrad P. Kording, Edward S. Boyden, Anthony M. Zador, and George M. Church. Rosetta brains: A strategy for molecularly-annotated connectomics. CoRR, arXiv:1404.5103, 2014.

[139]   Adam H. Marblestone, Bradley M. Zamft, Yael G. Maguire, Mikhail G. Shapiro, Thaddeus R. Cybulski, Joshua I. Glaser, Ben Stranges, Reza Kalhor, Elad Alon, David A. Dalrymple, Dongjin Seo, Michel M. Maharbiz, Jose Carmena, Jan Rabaey, Edward S. Boyden, George M. Church, and Konrad P. Kording. Physical principles for scalable neural recording. ArXiv preprint cs.CV/1306.5709, 2013.

[140]   Gary Marcus, Adam Marblestone, and Thomas Dean. The atoms of neural computation. Science, 346:551--552, 2014.

[141]   Henry Markram, Maria Toledo-Rodriguez, Yun Wang, Anirudh Gupta, Gilad Silberberg, and Caizhi Wu. Interneurons of the neocortical inhibitory system. Nature Reviews Neuroscience, 5:793--807, 2004.

[142]   J.H. Marshel, T. Mori, K.J. Nielsen, and E.M. Callaway. Targeting single neuronal networks for gene expression and cell labeling in vivo. Neuron, 67(4):562--574, 2010.

[143]   Manuel Marx, Robert H. Gunter, Werner Hucko, Gabriele Radnikow, and Dirk Feldmeyer. Improved biocytin labeling and neuronal 3D reconstruction. Nature Protocols, 7:394--407, 2012.

[144]   Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus. Deconvolutional networks. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2528--2535, 2010.

[145]   T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. In Spoken Language Technology Workshop (SLT), 2012 IEEE, pages 234--239, 2012.

[146]   Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111--3119, 2013.

[147]   Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2013.

[148]   Yuriy Mishchenko, Joshua T. Vogelstein, and Liam Paninski. A Bayesian approach for inferring neuronal connectivity from calcium fluorescent imaging data. The Annals of Applied Statistics, 5(2B):1229--1261, 2011.

[149]   Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive Science, 34(8):1388--1429, 2010.

[150]   Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1081--1088, 2008.

[151]   Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. CoRR, abs/1406.6247, 2014.

[152]   Javier Morante and Claude Desplan. Dissecting and staining Drosophila optic lobes. In Bing Zhang, Marc R. Freeman, and Scott Waddell, editors, Drosophila Neurobiology: A Laboratory Manual, volume 2011, pages 652--656. CSHL Press, Cold Spring Harbor, New York, 2011.

[153]   Vernon Mountcastle. An organizing principle for cerebral function: the unit model and the distributed system. In Gerald Edelman and Vernon Mountcastle, editors, The Mindful Brain, pages 7--50. MIT Press, Cambridge, MA, 1978.

[154]   Vernon B. Mountcastle. The columnar organization of the neocortex. Brain, 120(4):701--722, 1997.

[155]   Vernon B. Mountcastle. Introduction to the special issue on computation in cortical columns. Cerebral Cortex, 13(1):2--4, January 2003.

[156]   B. A. Olshausen and D. J. Field. Natural image statistics and efficient coding. Computation in Neural Systems, 7(2):333--339, 1996.

[157]   B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311--3325, 1997.

[158]   Randall C. O’Reilly. Biologically based computational models of high-level cognition. Science, 314(5796):91--94, 2006.

[159]   Randall C. O’Reilly and Michael J. Frank. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18(2):283--328, 2006.

[160]   Randall C. O’Reilly, Seth A. Herd, and Wolfgang M. Pauli. Computational models of cognitive control. Current Opinion in Neurobiology, 20(2):257--261, 2010.

[161]   Simon Osindero and Geoffrey Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1121--1128. MIT Press, Cambridge, MA, 2008.

[162]   Adam M Packer and Rafael Yuste. Dense, unspecific connectivity of neocortical parvalbumin-positive interneurons: a canonical microcircuit for inhibition? The Journal of Neuroscience, 31(37):13260--13271, 2011.

[163]   Anthony Pagden. The Enlightenment and Why It Still Matters. Random House, New York, NY, 2013.

[164]   Rodrigo Perin, Thomas K. Berger, and Henry Markram. A synaptic organizing principle for cortical neuronal groups. Proceedings of the National Academy of Sciences, 108(13):5419--5424, 2011.

[165]   Simon Peron, Tsai-Wen Chen, and Karel Svoboda. Comprehensive imaging of cortical networks. Current Opinion in Neurobiology, 32:115--123, 2015.

[166]   Nicolas Pinto, Zac Stone, Todd Zickler, and David D. Cox. Scaling-up Biologically-Inspired Computer Vision: A Case-Study on Facebook. In IEEE Computer Vision and Pattern Recognition, Workshop on Biologically Consistent Vision, pages 35--42, 2011.

[167]   Tony Plate. Holographic reduced representations: Convolution algebra for compositional distributed representations. In International Joint Conference on Artificial Intelligence, pages 30--35. Morgan Kaufmann, 1991.

[168]   Tony A. Plate. Holographic Reduced Representation: Distributed Representation for Cognitive Structures. CSLI Publications, Stanford, CA, USA, 2003.

[169]   R. Prevedel, Y.G. Yoon, M. Hoffmann, N. Pak, G. Wetzstein, S. Kato, T. Schrödel, R. Raskar, M. Zimmer, E.S. Boyden, and A. Vaziri. Simultaneous whole-animal 3D-imaging of neuronal activity using light field microscopy. CoRR, arXiv:1401.5333, 2013.

[170]   Alexandro D. Ramirez, Yashar Ahmadian, Joseph Schumacher, David Schneider, Sarah M. N. Woolley, and Liam Paninski. Incorporating naturalistic correlation structure improves spectrogram reconstruction from neuronal activity in the songbird auditory midbrain. Journal of Neuroscience, 31:3828--3842, 2011.

[171]   Michael W. Reimann, Costas A. Anastassiou, Rodrigo Perin, Sean L. Hill, Henry Markram, and Christof Koch. A biophysically detailed model of neocortical local field potentials predicts the critical role of active membrane currents. Neuron, 79(2):375--390, 2013.

[172]   M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019--1025, November 1999.

[173]   Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.

[174]   C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10):2526--2563, 2008.

[175]   Daniel L. Schacter, Donna Rose Addis, Demis Hassabis, Victoria C. Martin, R. Nathan Spreng, and Karl K. Szpunar. The future of memory: Remembering, imagining, and the brain. Neuron, 76:677--694, 2012.

[176]   D.L. Schacter, D.R. Addis, and R.L. Buckner. Constructive memory and the simulation of future events. In M.S. Gazzaniga, editor, The Cognitive Neurosciences IV, pages 751--762. MIT Press, Cambridge, MA, 2009.

[177]   W. Scheirer, S. Anthony, K. Nakayama, and D. Cox. Perceptual annotation: Measuring human performance to improve machine vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:1679--1686, 2014.

[178]   Jürgen Schmidhuber. Deep learning in neural networks: An overview. Technical Report IDSIA-03-14, 2014.

[179]   Elad Schneidman, William Bialek, and Michael J. Berry II. An information theoretic approach to the functional classification of neurons. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 197--204, 2002.

[180]   Tina Schrödel, Robert Prevedel, Karin Aumayr, Manuel Zimmer, and Alipasha Vaziri. Brain-wide 3D imaging of neuronal activity in Caenorhabditis elegans with sculpted light. Nature Methods, 10:1013--1020, 2013.

[181]   Dongjin Seo, Jose M. Carmena, Jan M. Rabaey, Elad Alon, and Michel M. Maharbiz. Neural dust: An ultrasonic, low power solution for chronic brain-machine interfaces. ArXiv preprint cs.CV/1307.2196, 2013.

[182]   Hae Jong Seo and Peyman Milanfar. Training-free, generic object detection using locally adaptive regression kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1688--1704, 2010.

[183]   H. Sebastian Seung. Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40(6):1063--1073, 2003.

[184]   H. Sebastian Seung. Reading the book of memory: Sparse sampling versus dense mapping of connectomes. Neuron, 62(1):17--29, 2009.

[185]   A. S. Shai, C. A. Anastassiou, M. E. Larkum, and C. Koch. Physiology of layer 5 pyramidal neurons in mouse primary visual cortex: Coincidence detection through bursting. PLoS Computational Biology, 11(3), 2015.

[186]   P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159--216, 1990.

[187]   Javier Snaider and Stan Franklin. Modular composite representation. Cognitive Computation, pages 1--18, 2014.

[188]   Richard Socher, Adrian Barbu, and Dorin Comaniciu. A learning based hierarchical model for vessel segmentation. In IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Paris, France, 2008. IEEE.

[189]   Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 2013. Association for Computational Linguistics.

[190]   Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems 26. 2013.

[191]   Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2012.

[192]   Richard Socher and Christopher D. Manning. Deep learning for NLP (without magic) tutorial. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, pages 1--3, 2013.

[193]   Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642. Association for Computational Linguistics, Stroudsburg, PA, USA, 2013.

[194]   Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, arXiv:1503.03585, 2015.

[195]   Vivek Srikumar and Christopher Manning. Learning distributed representations for structured output prediction. In Advances in Neural Information Processing Systems 27, 2014.

[196]   Francois St-Pierre, Jesse D. Marshall, Ying Yang, Yiyang Gong, Mark J. Schnitzer, and Michael Z. Lin. High-fidelity optical reporting of neuronal electrical activity with an ultrafast fluorescent voltage sensor. Nature Neuroscience, 17:884--889, 2014.

[197]   Greg J. Stephens, Leslie C. Osborne, and William Bialek. Searching for simplicity in the analysis of neurons and behavior. Proceedings of the National Academy of Sciences, 108(3):15565--15571, 2011.

[198]   Ian H Stevenson, James M Rebesco, Lee E Miller, and Konrad P Kording. Inferring functional connections between neurons. Current Opinion in Neurobiology, 18:1--7, 2008.

[199]   Terrence C. Stewart, Trevor Bekolay, and Chris Eliasmith. Learning to select actions with spiking neurons in the basal ganglia. Frontiers in Neuroscience, 6(2), 2012.

[200]   Terrence C. Stewart, Xuan Choo, and Chris Eliasmith. Symbolic reasoning in spiking neurons: A model of the cortex/basal ganglia/thalamus loop. In Proceedings of the 32nd Annual Meeting of the Cognitive Science Society, 2010.

[201]   Uygar Sümbül, Aleksandar Zlateski, Ashwin Vishwanathan, Richard H. Masland, and H. Sebastian Seung. Automated computation of arbor densities: a step toward identifying neuronal cell types. Frontiers in Neuroanatomy, 8(139), 2014.

[202]   E. A. Susaki, K. Tainaka, D. Perrin, F. Kishino, T. Tawara, T. M. Watanabe, C. Yokoyama, H. Onoe, M. Eguchi, S. Yamaguchi, T. Abe, H. Kiyonari, Y. Shimizu, A. Miyawaki, H. Yokota, and H. R. Ueda. Whole-brain imaging with single-cell resolution using chemical cocktails and computational analysis. Cell, 157(3):726--739, 2014.

[203]   Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, arXiv:1409.3215, 2014.

[204]   Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2006.

[205]   K. Tainaka, S. I. Kubota, T. Q. Suyama, E. A. Susaki, D. Perrin, M. Ukai-Tadenuma, H. Ukai, and H. R. Ueda. Whole-body imaging with single-cell resolution by tissue decolorization. Cell, 159(4):911--924, 2014.

[206]   D. Takeuchi, T. Hirabayashi, K. Tamura, and Y. Miyashita. Reversal of interlaminar signal between sensory and memory processing in monkey temporal cortex. Science, 331(6023):1443--1447, 2011.

[207]   Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368--377, 1999.

[208]   A. Torralba, A. Oliva, M.S. Castelhano, and J.M. Henderson. Contextual guidance of attention in natural scenes. Psychological Review, 113:766--786, 2006.

[209]   David S. Touretzky and Geoffrey E. Hinton. A distributed connectionist production system. Cognitive Science, 12:423--466, 1988.

[210]   Sébastien Tremblay, Adam Pieper, Florian Sachs, and Julio Martinez-Trujillo. Attentional filtering of visual information by neuronal ensembles in the primate lateral prefrontal cortex. Neuron, 85:202--215, 2015.

[211]   K. Tsunoda, Y. Yamane, M. Nishizaki, and M. Tanifuji. Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns. Nature Neuroscience, 4:832--838, 2001.

[212]   S. Ullman and S. Soloviev. Computation of pattern invariance in brain-like structures. Neural Networks, 12:1021--1036, 1999.

[213]   Shimon Ullman, Michel Vidal-Naquet, and Erez Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7):682--687, 2002.

[214]   L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134--1142, 1984.

[215]   V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264--280, 1971.

[216]   Bram-Ernst Verhoef, Rufin Vogels, and Peter Janssen. Inferotemporal cortex subserves three-dimensional structure categorization. Neuron, 73:171--182, 2012.

[217]   Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. Grammar as a foreign language. CoRR, abs/1412.7449, 2014.

[218]   Thomas Wennekers, Friedrich T. Sommer, and Gunther Palm. Iterative retrieval in associative memories by threshold control of different neural models. Neural Computation, 11:21--66, 1999.

[219]   Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, 2014.

[220]   J. G. White, E. Southgate, J. N. Thomson, and S. Brenner. The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society B: Biological Sciences, 314:1--340, 1986.

[221]   Dominic Widdows and Trevor Cohen. Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, doi:10.1093/jigpal/jzu028, 2014.

[222]   David H. Wolpert, editor. The Mathematics of Generalization. Addison-Wesley, Reading, Massachusetts, 1995.

[223]   David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67--82, 1997.

[224]   Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, arXiv:1502.03044, 2015.

[225]   Yukako Yamane, Eric T. Carlson, Katherine C. Bowman, Zhihong Wang, and Charles E. Connor. A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience, 11:1352--1360, 2008.

[226]   Yukako Yamane, Kazushige Tsunoda, Madoka Matsumoto, Adam N. Phillips, and Manabu Tanifuji. Representation of the spatial relationship among object parts by neurons in macaque inferotemporal cortex. Journal of Neurophysiology, 96(6):3147--3156, 2006.

[227]   D.L. Yamins, H. Hong, C. Cadieu, and J.J. DiCarlo. Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. In Advances in Neural Information Processing Systems 26, pages 3093--3101, Tahoe, CA, 2013.

[228]   Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2009.

[229]   Junjie Yao, Lidai Wang, Joon-Mo Yang, Konstantin I. Maslov, Terence T. W. Wong, Lei Li, Chih-Hsien Huang, Jun Zou, and Lihong V. Wang. High-speed label-free functional photoacoustic microscopy of mouse brain in action. Nature Methods, advance online publication, 2015.

[230]   J. Yu and D. Ferster. Functional coupling from simple to complex cells in the visually driven cortical circuit. Journal of Neuroscience, 33(48):18855--18866, 2013.

[231]   Anthony M. Zador, Joshua Dubnau, Hassana K. Oyibo, Huiqing Zhan, Gang Cao, and Ian D. Peikon. Sequencing the connectome. PLoS Biology, 10(10):e1001411, 2012.

[232]   M.D. Zeiler, G.W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In IEEE International Conference on Computer Vision, pages 2018--2025, 2011.

[233]   F. Zhang, V. Gradinaru, A.R. Adamantidis, R. Durand, R.D. Airan, L. de Lecea, and K. Deisseroth. Optogenetic interrogation of neural circuits: technology for probing mammalian brain structures. Nature Protocols, 5(3):439--56, 2010.

[234]   Xiang Zhang and Yann LeCun. Text understanding from scratch. CoRR, abs/1502.01710, 2015.

[235]   Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. Long short-term memory over tree structures. CoRR, abs/1503.04881, 2015.

[236]   D. Zoccolan, N. Oertelt, J. J. DiCarlo, and D. D. Cox. A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences, 106(21):8748--8753, 2009.

[237]   Steven W. Zucker. Local field potentials and border ownership: A conjecture about computation in visual cortex. Journal of Physiology - Paris, 106:297--315, 2012.

[238]   Steven W. Zucker. Stereo, shading, and surfaces: Curvature constraints couple neural computations. Proceedings of the IEEE, 102:812--829, 2014.


1 In their experiments, Cho et al [32] train the whole network including the word embedding matrix. The embedding dimensionality was chosen to be quite large, as [their] preliminary experiments with 155-dimensional embeddings showed rather poor performance. Source:  [32]

2 At each time step of the decoder, we keep the s translation candidates with the highest log-probability, where s = 10 is the beam-width. During the beam-search, we exclude any hypothesis that includes an unknown word. For each end-of-sequence symbol that is selected among the highest scoring candidates the beam-width is reduced by one, until the beam-width reaches zero.

The beam-search to (approximately) find a sequence of maximum log-probability under RNN was proposed and used successfully in (Graves, 2012) and (Boulanger-Lewandowski et al., 2013). Recently, the authors of (Sutskever et al., 2014) found this approach to be effective in purely neural machine translation based on LSTM units.

When we use the beam-search to find the k best translations, we do not use a usual log-probability but one normalized with respect to the length of the translation. This prevents the RNN decoder from favoring shorter translations, behavior which was observed earlier in, e.g., (Graves, 2013). Source:  [32]
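
The decoding procedure quoted in this footnote can be sketched roughly as follows. This is a minimal illustration and not the authors' code: `step_fn`, `toy_step` and the toy probability table are hypothetical stand-ins for the RNN decoder, and dividing the summed log-probability by the number of generated tokens is just one simple, assumed reading of "normalized with respect to the length of the translation".

```python
import math

def beam_search(step_fn, bos, eos, beam_width=10, max_len=50, unk=None):
    """Return finished hypotheses as (tokens, length-normalized log-prob),
    best first. step_fn(prefix) -> {next_token: probability} is a
    hypothetical stand-in for the decoder's softmax output."""
    live = [([bos], 0.0)]              # (tokens, summed log-probability)
    finished = []
    width = beam_width
    for _ in range(max_len):
        if width <= 0 or not live:
            break
        candidates = []
        for tokens, logp in live:
            for tok, p in step_fn(tokens).items():
                if tok == unk or p <= 0.0:
                    continue           # drop hypotheses with an unknown word
                candidates.append((tokens + [tok], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        live = []
        for tokens, logp in candidates[:width]:
            if tokens[-1] == eos:
                finished.append((tokens, logp))
                width -= 1             # shrink the beam for each finished hypothesis
            else:
                live.append((tokens, logp))
    # Normalize by the number of generated tokens (assumed normalization).
    scored = [(toks, logp / (len(toks) - 1)) for toks, logp in finished]
    scored.sort(key=lambda c: c[1], reverse=True)
    return scored

def toy_step(prefix):
    """Hypothetical first-order 'decoder': depends only on the last token."""
    table = {
        "<s>": {"a": 0.6, "b": 0.3, "<unk>": 0.1},
        "a": {"</s>": 0.5, "b": 0.5},
        "b": {"</s>": 0.9, "a": 0.1},
    }
    return table[prefix[-1]]

results = beam_search(toy_step, "<s>", "</s>", beam_width=3, max_len=10, unk="<unk>")
for tokens, score in results:
    print(" ".join(tokens), round(score, 3))
```

In this toy model the shorter hypothesis `<s> a </s>` has the higher raw log-probability, but after length normalization the longer `<s> a b </s>` ranks first, which is the bias correction the footnote describes.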

3 For CS379C students — or anyone else — not familiar with NLP scoring metrics, the BLEU metric scores a translation on a scale of 0 to 1, but is frequently displayed as a percentage value. The closer to 1 [100%], the more the translation correlates to a human translation. Put simply, the BLEU metric measures how many words overlap in a given translation when compared to a reference translation, giving higher scores to sequential words. Source: Wikipedia

4 Note that the decoder has the advantage of all the annotations to assist in generating each word in the translation, whose length could be more or less than T_x. Since the encoder is bidirectional, each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence.

5 I always like it when the same computational principles apply at multiple scales, modalities and levels of abstraction. It gives me some — perhaps undeserved — assurance that we’re on the right track in modeling cortical function.

6 The pooling layers in CNN models provide some measure of invariance but at the cost of selectivity. It is difficult to find a balance as the type and degree of invariance varies from level to level. Moreover since the input — whether pixels, words or some other sort of sequential data — is discarded after the first layer, higher layers of the model have an increasingly diluted connection to the input.

7 Dynamic max pooling generalizes LeCun et al [120] in two ways: First, it involves k-max pooling over a linear sequence of values by returning k values instead of just one, and, second, k is selected dynamically as a function of other signals in the network.
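
As a rough illustration of the two generalizations in this footnote, the sketch below implements k-max pooling over a sequence, keeping the selected values in their original order, plus one possible rule for choosing k. The function names and the depth-based rule in `dynamic_k` are assumptions made for the example; the footnote says only that k depends on other signals in the network.

```python
def k_max_pool(values, k):
    """Return the k largest entries of `values`, kept in their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values; ties go to the earlier position.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]

def dynamic_k(seq_len, layer, total_layers, k_top):
    """Hypothetical rule: k shrinks with depth but never drops below k_top.
    Uses integer ceiling division to avoid floating-point rounding."""
    needed = -(-(total_layers - layer) * seq_len // total_layers)
    return max(k_top, needed)

signal = [3, 1, 4, 1, 5, 9, 2, 6]
print(k_max_pool(signal, 3))   # three largest values in sequence order
print(k_max_pool(signal, 1))   # k = 1 reduces to ordinary max pooling
print(dynamic_k(seq_len=18, layer=1, total_layers=3, k_top=3))
```

With k fixed at 1 this is ordinary max pooling; letting k vary with sequence length and depth is what makes the pooling dynamic.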

8 Zhang and LeCun [234] recently applied a temporal variant of convolutional networks to hierarchical language modeling demonstrating some generality across tasks and relatively impressive performance on a number of natural language processing tasks.

9 I’m not opposed to hybrid models per se and agree with Descartes’ pragmatic approach to exploiting the best of symbolic and connectionist inference technologies as a bridging strategy on the path to a more integrated approach down the line.

10 I like to think of these mutable pieces of embedding vectors as genes since they can be exchanged wholesale by an operation akin to gene splicing and can also be altered piecewise by operations analogous to those associated with epigenetics such as methylation and phosphorylation, but the analogy doesn’t quite work since genes typically correspond to dense sequences of (contiguous) base pairs. Nevertheless I’ll return to this analogy later on.

11 Basically, singular subjects need singular verbs and plural subjects need plural verbs, but then it gets complicated, with lots of messy edge cases. For example, the indefinite pronoun, ‘‘none’’, can be either singular or plural, and it often doesn’t matter whether you use a singular or a plural verb unless something else in the sentence determines its number. Writers generally think of ‘‘none’’ as meaning ‘‘not any’’ and will choose a plural verb, as in ‘‘None of the engines are working,’’ but when something else makes us regard ‘‘none’’ as meaning ‘‘not one’’, we want a singular verb, as in ‘‘None of the food is fresh.’’ — see here for more examples.

12 A competing hypothesis is that the information required to apply a given constraint is not stored directly in the embedding vector or accessible indirectly by simply following a pointer to some location in an external memory, but rather the necessary information is recovered by inference. For example, to replace the subject of a sentence encoded in an embedding vector, determine the number — singular or plural — of the replacement subject, and, if necessary, alter the number of the verb as well.

13 According to the currently prevailing view of English grammar, verbs have five standard properties {tense, person, number, voice, mood} plus two additional properties that concern auxiliary verbs: tense {present, past, future} × {simple, perfect, progressive}, class {auxiliary, lexical {transitive, intransitive}}, auxiliarized {true, false}, person {first, second, third}, number {singular, plural}, voice {active, passive}, mood {indicative, subjunctive, imperative}.

14 Here are some examples of sentences that could be used to test if a network model is capable of enforcing subject-verb agreement: ‘‘None of the engines are running’’, ‘‘The starboard engine is working’’, ‘‘None of the food is fresh’’, ‘‘Organic vegetables tend to taste better’’, ‘‘My brothers like to play soccer’’, ‘‘My sister wants to be a gymnast’’, ‘‘I am partial to chocolate desserts.’’

15 In terms of statistical analysis, we have one dichotomous dependent variable, number, that takes on one of two values, e.g., singular or plural, and N continuous independent variables corresponding to the N scalar components (dimensions) of the embedding vectors. For this type of problem, there are two methods commonly applied in practice: linear logistic regression and Fisher’s linear discriminant analysis (LDA). LDA is typically associated with feature selection, i.e., dimensionality reduction, since the goal is to find a linear combination of a subset of the features (independent variables) that separates the two classes. However, LDA depends on the independent variables being normally distributed and so is somewhat more restrictive than linear logistic regression.
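
To make the logistic-regression option concrete, here is a minimal sketch that fits a linear logistic model by batch gradient descent. The data are synthetic and purely hypothetical: 5-dimensional vectors stand in for embeddings, and the singular/plural label is (by construction) encoded along dimension 2. Nothing here comes from a real embedding model, and the accuracy is measured on the training set only.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Linear logistic regression fit by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Synthetic stand-in for embedding vectors (assumption for illustration):
# label 1 = plural, 0 = singular, signalled only along NUMBER_DIM.
rng = random.Random(0)
DIM, NUMBER_DIM = 5, 2
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[NUMBER_DIM] += 2.0 if label else -2.0
    X.append(vec)
    y.append(label)

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
most_informative = max(range(DIM), key=lambda j: abs(w[j]))
print("training accuracy:", accuracy)
print("dimension with largest |weight|:", most_informative)
```

Inspecting the fitted weights is a crude way of asking which embedding dimensions carry the number information; with real embeddings one would evaluate on held-out sentences rather than on the training data.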

16 It is an interesting exercise to compare this measure of word similarity to the Hebbian dictum ‘‘neurons that fire together wire together’’.

17 Here is some recent work by a Berkeley statistician that might be relevant:

Title: Signal Recovery from Scattering Convolutional Networks

Speaker: Joan Bruna, UC Berkeley, Asst. Prof. of Statistics

Abstract: Scattering operators cascade several layers of wavelet decompositions and complex modulus to define stable and locally invariant signal representations, resulting in state-of-the-art classification results on several pattern and texture recognition problems. The reasons for such success lie on the ability to preserve discriminative information while generating stability with respect to high-dimensional deformations. In this talk, we will explore the discriminative aspect of the representation, giving conditions under which signals can be recovered from their scattering coefficients. Although the scattering recovery is non-convex and corresponds to a generalized phase recovery problem, gradient descent algorithms show good empirical performance. We will discuss connections to non-linear compressed sensing and applications to texture synthesis and perceptual coding.

18 Following up on a conversation with Terry Sejnowski, I contacted Surya Ganguli about an analysis that Terry summarized in our discussion. Surya provided a nice summary of the analysis he did with Peiran Gao and a link to a paper under review for publication in Current Opinion in Neurobiology, a draft of which you can find here. Here’s the relevant excerpt from Surya’s reply:

We often record from 100’s of neurons — we are using Krishna Shenoy’s data from motor cortex, compute trial averages, do dimensionality reduction, and get low dimensional state space dynamics that make sense, and can decode single trial information well from this small subset of neurons. This raises conceptual questions:
  1. Why is the dimensionality so small, relative to the number of neurons?

  2. Why do the state space dynamics make sense, and why can we decode well despite recording so few neurons?

  3. Can we trust these results — would either the dimensionality or dynamics change if we recorded all 1 million neurons say?

We develop a theoretical framework that (a) derives an upper bound on the dimensionality of neural data given the complexity of the task and smoothness of neural responses and (b) connects the action of doing electrophysiology to doing a random projection of neural activity patterns onto the subspace of recorded neurons — this can be used to tell us how many neurons we need to record to accurately recover collective neural dynamics.

Using this theory, the answers to the above questions are:

  1. Because the task is too simple, and the dimensionality is actually as high as possible given the complexity of the task.

  2. Because you don’t need that many random projections (recorded neurons) to recover the geometry/manifold of neural activity patterns.

  3. Neither the dimensionality nor dynamics will change if we record more neurons, while doing the same task.

Hope that helps! Am happy to chat about this further if you wish.

best wishes,
Surya
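
The random-projection argument in Surya’s note can be illustrated with a small simulation. This is a sketch under strong assumptions that are not taken from the paper: neural activity is generated as a random linear readout of a 3-dimensional latent state, and ‘‘recording’’ means picking a random subset of neurons (coordinate subsampling, a special case of random projection). All names and sizes are hypothetical.

```python
import math
import random

rng = random.Random(0)
N_NEURONS, N_RECORDED, LATENT_DIM, N_STATES = 1000, 200, 3, 20

# Each neuron is a random linear readout of a low-dimensional latent state.
weights = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(N_NEURONS)]
latents = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(N_STATES)]
activity = [[sum(w * z for w, z in zip(row, state)) for row in weights]
            for state in latents]

# "Electrophysiology": observe only a random subset of the neurons.
recorded = rng.sample(range(N_NEURONS), N_RECORDED)

def dist(a, b, idx):
    """Euclidean distance between two activity patterns over neurons `idx`."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))

# Rescale subsampled distances so they are comparable to full distances.
scale = math.sqrt(N_NEURONS / N_RECORDED)
ratios = []
for s in range(N_STATES):
    for t in range(s + 1, N_STATES):
        full = dist(activity[s], activity[t], range(N_NEURONS))
        sub = scale * dist(activity[s], activity[t], recorded)
        ratios.append(sub / full)

print("distance ratio range:", round(min(ratios), 3), "to", round(max(ratios), 3))
```

If the ratios stay close to 1, the pairwise geometry of the activity patterns is approximately preserved by the recorded subset, which is the intuition behind answer 2 above.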

19 Here is the notice for a technical talk given at Google by Eric Jonas on his joint work with Konrad Kording:

Title: Automatic discovery of cell types and microcircuitry from neural connectomics [97]

Speaker: Eric Jonas is a postdoc working on measurement and computation with Ben Recht in EECS at UC Berkeley. He completed his PhD on stochastic circuitry for Bayesian inference at MIT in September of 2013, where he also received his M. Eng. and SB in EECS and an SB in Neurobiology. His research interests lie at the intersection of measurement, inference, and biology.

Abstract: Neural connectomics has begun producing massive amounts of data, necessitating new analysis methods to discover the biological and computational structure. It has long been assumed that discovering neuron types and their relation to microcircuitry is crucial to understanding neural function. Here we developed a nonparametric Bayesian technique that identifies neuron types and microcircuitry patterns in connectomics data. It combines the information traditionally used by biologists, including connectivity, cell body location and the spatial distribution of synapses, in a principled and probabilistically-coherent manner. We show that the approach recovers known neuron types in the retina and enables predictions of connectivity, better than simpler algorithms. It also can reveal interesting structure in the nervous system of C. elegans, and automatically discovers the structure of a microprocessor. Our approach extracts structural meaning from connectomics, enabling new approaches of automatically deriving anatomical insights from these emerging datasets.

Paper: Automatic discovery of cell types and microcircuitry from neural connectomics, Eric Jonas and Konrad Kording, CoRR, 2014.

20 Kenneth Hayworth has worked with some of the best researchers in the fields of neuroscience and electron microscopy. He is a Senior Scientist at the HHMI Janelia Farm Research Campus and he works in the Harald Hess Lab. In addition to his work at Janelia, Ken is the President and Co-Founder of the Brain Preservation Foundation which is, as its name suggests, dedicated to preserving the brains of humans, including their individual memories and identities, after they die.

21 For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the semantic similarity between the two sentences. To evaluate their model on the semantic relatedness task, Tai et al use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al 2014), which consists of ~10,000 sentence pairs divided 4500/500/5000 into training/validation/testing sets.

22 The authors distinguish between recursive neural network models in which the output feeds back to the input and recurrent models in which the hidden state values persist over time, i.e., over recursive applications of the model.

23 Here are the three machine-learning technologies employed in the Socher et al [188] paper on segmenting tubular structures: marginal space learning (A. Barbu, V. Athitsos, B. Georgescu, S. Boehm, P. Durlak, and D. Comaniciu, ‘‘Hierarchical learning of curves: Application to guidewire localization in fluoroscopy,’’ CVPR, 2007), probabilistic boosting trees (Zhuowen Tu, ‘‘Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering,’’ ICCV, 2005), and steerable features (Y. Zheng, A. Barbu, B. Georgescu, M. Scheuering, and D. Comaniciu, ‘‘Fast automatic heart chamber segmentation from 3-D CT data using marginal space learning and steerable features,’’ ICCV, 2007).

24 Here are the BibTeX entries including abstracts for the papers mentioned in the March 19 log entry:

@inproceedings{SocheretalISBI-08,
        title = {A Learning Based Hierarchical Model for Vessel Segmentation},
       author = {Richard Socher and Adrian Barbu and Dorin Comaniciu},
    booktitle = {IEEE International Symposium on Biomedical Imaging: From Nano to Macro},
    publisher = {IEEE},
      address = {Paris, France},
         year = {2008},
     abstract = {In this paper we present a learning based method for vessel segmentation in angiographic videos. Vessel Segmentation is an important task in medical imaging and has been investigated extensively in the past. Traditional approaches often require pre-processing steps, standard conditions or manually set seed points. Our method is automatic, fast and robust towards noise often seen in low radiation X-ray images. Furthermore, it can be easily trained and used for any kind of tubular structure. We formulate the segmentation task as a hierarchical learning problem over 3 levels: border points, cross-segments and vessel pieces, corresponding to the vessel's position, width and length. Following the Marginal Space Learning paradigm the detection on each level is performed by a learned classifier. We use Probabilistic Boosting Trees with Haar and steerable features. First results of segmenting the vessel which surrounds a guide wire in 200 frames are presented and future additions are discussed.}
}
@article{ZhuetalCoRR-15,
        title = {Long Short-Term Memory Over Tree Structures},
       author = {Xiaodan Zhu and Parinaz Sobhani and Hongyu Guo},
      journal = {CoRR},
       volume = {abs/1503.04881},
         year = {2015},
     abstract = {The chain-structured long short-term memory (LSTM) has showed to be effective in a wide range of problems such as speech recognition and machine translation. In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process. We call the model S-LSTM, which provides a principled way of considering long-distance interaction over hierarchies, e.g., language or image parse structures. We leverage the models for semantic composition to understand the meaning of text, a fundamental problem in natural language understanding, and show that it outperforms a state-of-the-art recursive model by replacing its composition layers with the S-LSTM memory blocks. We also show that utilizing the given structures is helpful in achieving a performance better than that without considering the structures.}
}
@article{TaietalCoRR-15,
        title = {Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks},
       author = {Kai Sheng Tai and Richard Socher and Christopher D. Manning},
      journal = {CoRR},
       volume = {abs/1503.00075},
         year = {2015},
     abstract = {A Long Short-Term Memory (LSTM) network is a type of recurrent neural network architecture which has recently obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).}
}
@article{VinyalsetalCoRR-15,
       author = {Oriol Vinyals and Lukasz Kaiser and Terry Koo and Slav Petrov and Ilya Sutskever and Geoffrey E. Hinton},
        title = {Grammar as a Foreign Language},
      journal = {CoRR},
       volume = {abs/1412.7449},
          url = {http://arxiv.org/abs/1412.7449},
         year = 2014,
     abstract = {Syntactic parsing is a fundamental problem in computational linguistics and natural language processing. Traditional approaches to parsing are highly complex and problem specific. Recently, Sutskever et al. (2014) presented a task-agnostic method for learning to map input sequences to output sequences that achieved strong results on a large scale machine translation problem. In this work, we show that precisely the same sequence-to-sequence method achieves results that are close to state-of-the-art on syntactic constituency parsing, whilst making almost no assumptions about the structure of the problem. To achieve these results we need to mitigate the lack of domain knowledge in the model by providing it with a large amount of automatically parsed data.}
}

25 In the process of imagining herself in Alice’s situation working two jobs, Jan might realize that this option for getting out of debt isn’t likely to work for her since Jan knows that she needs at least eight hours of sleep every night to function, and working two jobs would quickly exhaust her. How exactly she realizes her limitations in this regard is an open question and one that we’ve been thinking a lot about in developing applications for Descartes.

Specifically, how might Jan draw this particular conclusion but avoid drawing all of the myriad other true but irrelevant conclusions that have no bearing on the feasibility of her proposed ‘‘working two jobs’’ solution to getting out of debt? This is an example of the sort of ‘‘relevant deduction’’ that we would like from a natural logic inference (NLI) system of the kind we considered last quarter in discussing Bill MacCartney’s work [131, 130].

We would also like our NLI system to draw the same conclusion if Jan’s memory was less directly applicable as in the case that Jan is reminded of Fred telling her about hiring a graduate student to work nights in his convenience store to avoid taking out a student loan to cover her tuition and housing. We’re not so naïve as to believe there is a general or optimal solution to this problem, but we’re hoping that a flexible episodic memory might help to cope with the combinatorics.

27 These models developed by Costas Anastassiou and his team at AIBS and Sean Hill at EPFL consist of networks of reconstructed, multi-compartmental, virtually-instrumented and spiking pyramidal neurons and basket cells, plus ion- and voltage-dependent currents and local field potentials that allow us to generate the same sort of rasters we expect to collect during calcium imaging.

26 I’ll be teaching cs379c Computational Models of the Neocortex again this spring, but with a different focus than in previous years. This generation of computer scientists will be the first in history to have access to brain data in sufficient quantity and quality for large-scale structural and functional connectomics, and this year I’m trying to attract computer science students and computer-savvy engineers and neuroscientists interested in tackling some of the machine-learning and signal-processing challenges in analyzing such data.

In collaboration with the Allen Institute for Brain Science (AIBS), HHMI Janelia Farm, Max Planck Institute for Medical Research and MIT, we are compiling EM (Electron Microscopy) datasets that will enable computer scientists to reconstruct the neural circuits for several model organisms, and co-registered activity recordings using calcium imaging (CI) from which we hope to glean algorithmic insights by fitting various artificial neural network models to account for observed input / output behavior.

We’ll be working with two teams of scientists and engineers who are building the tools to acquire this data. We have several relatively-small (10TB) EM datasets (including ground truth) that students interested in circuit tracing (structural connectomics) can use in projects. Scientists at AIBS have volunteered to help students in understanding the data and technologies used to collect it. In addition, engineers from my team at Google will supply examples of algorithms that have worked well for us.

Inferring function from CI data is more challenging since until recently there haven’t been good datasets to work with. We now have several such datasets provided by our collaborators that can be used in student projects. In addition, we’ll be generating synthetic datasets for cortical circuits of 5-50K neurons using Hodgkin-Huxley models27 developed at AIBS and EPFL. These models and their associated simulators provide a controlled environment in which to experiment with and evaluate machine-learning technologies for functional connectomics.

The prerequisites are basic high-school biology, good math skills, and familiarity with machine learning. Some background in computer vision and signal processing will be important for projects in structural connectomics. Familiarity with modern artificial neural network technologies is a plus for projects in functional connectomics. Please encourage your qualified students to consider taking the course. As an added incentive, I have a group of extraordinary scientists and engineers lined up to help make it a great course.

28 I asked two of my colleagues on the Neuromancer team to comment on the scalability of techniques like those championed in [143, 22], and here is what they offered. Peter Li, who worked on retina in E.J. Chichilnisky’s lab at the Salk Institute and has a good deal of hands-on practical experience in tracing circuits, had this to say:

[Peter]: I have experience filling neurons with neurobiotin, which is a very similar biotin derivative (slightly smaller than biocytin). You can get beautiful fills, and the binding to avidin is extremely strong, so you can easily augment the labeling in a variety of ways.

Interestingly, neurobiotin is small enough that it passes through many gap junctions (positive charge may also help in some cases), so it is often used in tracer coupling studies. We used it to investigate coupling between primate photoreceptors.

Scale is a bit of an issue. Normally you inject single cells with tracer using micropipettes. Biolistics should be an option for scaling up, but that’s a (literally) scattershot approach. In general with [non-genetically encoded dyes], the problem is how to fill larger numbers of cells without filling so many that you can’t sort anything out anymore. Similar issue with lipophilic dyes like DiI.

For tracer coupling, people do crude things like cut a slash through the tissue with a razor blade and then soak in biotin. You can then see how far into the tissue the dye spreads. For example, in some cases the spread was greater at night than during the day, indicating circadian modulation of gap junctions.

Viren Jain, who worked at HHMI Janelia Farm and has experience tracing individual neurons on a much larger scale than anything attempted previously, had this to say:

[Viren]: If you want sparse reconstruction of large numbers of individual neurons, you might as well go with GFP or variants thereof these days. Janelia is doing that approach on a massive scale, to image nearly every neuron in Drosophila using optical microscopy (the main technological innovation being genetic driver lines to control expression only within very specific neurons). This still won’t tell you anything about connectivity, but is useful for cell type analysis and confirming the correctness of EM reconstructions.

29 The Drosophila visual system is composed of the retina and the optic lobes, which are the ganglia where photoreceptors project and initial processing of visual inputs occurs. The optic lobes are formed by several structures that mediate different behaviors and represent different levels of processing: the lamina, the medulla, and the lobula complex, which is formed by the lobula and the lobula plate. The medulla is composed of columns which consist of about sixty cells and serve as the basic functional unit of the medulla. The HHMI Janelia Farm seven-column dataset mentioned in the text consists of a single sample containing seven such columns and a small border of additional tissue. Source: Morante and Desplan [152].

30 Reaction-diffusion systems are mathematical models that explain how the concentration of one or more substances distributed in space changes under the influence of two processes: local chemical reactions in which the substances are transformed into each other, and diffusion which causes the substances to spread out over a surface in space. Reaction-diffusion systems are naturally applied in chemistry. However, the system can also describe dynamical processes of a non-chemical nature. Examples are found in biology, geology, physics and ecology. Mathematically, reaction-diffusion systems take the form of semi-linear parabolic partial differential equations. (SOURCE)
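
For a single substance with concentration u, the general form is du/dt = D ∇²u + R(u), where D is a diffusion coefficient and R is the local reaction term. The sketch below simulates one standard instance in one dimension, with the logistic reaction R(u) = r u (1 - u) (the Fisher-KPP equation), using explicit finite differences. The choice of reaction term, the no-flux boundaries and all parameter values are assumptions made for illustration.

```python
def simulate_fisher_kpp(n_cells=100, steps=200, D=1.0, r=1.0, dx=1.0, dt=0.1):
    """Explicit Euler scheme for du/dt = D * d2u/dx2 + r * u * (1 - u) in 1-D.
    Returns the final profile and the total mass after every step."""
    lam = D * dt / dx ** 2
    if lam > 0.5:
        raise ValueError("explicit scheme is unstable when D*dt/dx^2 > 0.5")
    u = [0.0] * n_cells
    u[n_cells // 2] = 1.0                     # seed a single occupied cell
    mass = [sum(u) * dx]
    for _ in range(steps):
        new = []
        for i in range(n_cells):
            left = u[i - 1] if i > 0 else u[i]              # no-flux boundary
            right = u[i + 1] if i < n_cells - 1 else u[i]   # no-flux boundary
            diffusion = lam * (left - 2.0 * u[i] + right)   # spreads u out
            reaction = dt * r * u[i] * (1.0 - u[i])         # local growth
            new.append(u[i] + diffusion + reaction)
        u = new
        mass.append(sum(u) * dx)
    return u, mass

profile, mass = simulate_fisher_kpp()
print("initial mass:", mass[0], "final mass:", round(mass[-1], 2))
```

Diffusion alone conserves the total mass under these boundary conditions, so any growth in mass comes from the reaction term, while the spreading of the profile comes from diffusion.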

31 It appears that there is much more going on in primary visual cortex than edge detectors. The evidence of recurrent computations, geometry, contour completion, etc. has mounted over the decades since Hubel and Wiesel. I’m primarily aware of this through conversations with Bruno Olshausen and Steven Zucker, and through reading the work of my colleagues at Brown, David Mumford and Tai Sing Lee (David’s graduate student and now a professor at Carnegie Mellon). Here are some representative papers from Lee and Mumford [122, 123], and here is the abstract from a NIPS paper by Lawlor and Zucker [118] that hints at how resolving geometric ambiguity might be explained in primary visual cortex by invoking the application of higher-order statistics:

Association field models have attempted to explain human contour grouping performance, and to explain the mean frequency of long-range horizontal connections across cortical columns in V1. However, association fields only depend on the pairwise statistics of edges in natural scenes. We develop a spectral test of the sufficiency of pairwise statistics and show there is significant higher order structure. An analysis using a probabilistic spectral embedding reveals curvature-dependent components.

32 Words are discrete, easily reproducible quanta of information that we have collectively agreed upon and learned to process. They can be conveyed with little or no loss over a noisy channel. They are more efficient than zeros and ones as a basis for spoken language. They are seemingly indivisible and immutable and yet capable of subtle shades of meaning and readily adapted to describing new phenomena. They provide a solid basis for human communication. Alas, everything above (thoughts and mental states) or below (sensations and external stimuli) presents infinitely greater difficulty in interpretation.

33 Let V, W and X be three vector spaces over the same base field F. A bilinear map (SOURCE) is a function

B: V × W → X

such that for any w in W the map

v ↦ B(v, w)

is a linear map from V to X, and for any v in V the map

w ↦ B(v, w)

is a linear map from W to X.

In other words, if we hold the first entry of the bilinear map fixed, while letting the second entry vary, the result is a linear operator, and similarly if we hold the second entry fixed. Note that if we regard the product V × W as a vector space, then B is not a linear transformation of vector spaces (unless V = 0 or W = 0) because, for example B(2(v, w)) = B(2v, 2w) = 2B(v, 2w) = 4B(v, w).
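A small numerical check may make the distinction concrete. The sketch below assumes the particular bilinear map B(v, w) = Σ v_i M_ij w_j with V = ℝ², W = ℝ³ and X = ℝ. The matrix `M`, the test vectors and the helper names `add` and `scale` are hypothetical choices for illustration only.

```python
def B(v, w, M):
    """Bilinear form B(v, w) = sum over i, j of v[i] * M[i][j] * w[j]."""
    return sum(v[i] * M[i][j] * w[j]
               for i in range(len(v)) for j in range(len(w)))


def add(a, b):
    """Component-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def scale(c, a):
    """Scalar multiple of a vector."""
    return [c * x for x in a]


M = [[1, 2, 0],
     [0, -1, 3]]          # defines a map from R^2 x R^3 to R
v, v2 = [1, 2], [3, -1]
w = [2, 0, 1]

# Linearity in the first argument with the second held fixed.
lhs = B(add(scale(5, v), v2), w, M)
rhs = 5 * B(v, w, M) + B(v2, w, M)
print(lhs == rhs)

# Scaling the pair (v, w) by 2 scales B by 4, not 2, so B is not linear
# when V x W is regarded as a single vector space.
print(B(scale(2, v), scale(2, w), M), 4 * B(v, w, M), 2 * B(v, w, M))
```

Integer entries are used so that the equalities hold exactly rather than up to floating-point error.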

34 The tensor product of two vector spaces V and W, denoted V ⊗ W and also called the tensor direct product, is a way of creating a new vector space analogous to multiplication of integers. The outer product (SOURCE) usually refers to the tensor product of vectors. If you want something like the outer product between an m × n matrix A and a p × q matrix B, you can use the generalization of outer product, called the Kronecker product (SOURCE) and notated A ⊗ B.

Given two matrices, we can think of them as representing linear maps between vector spaces equipped with a chosen basis. The Kronecker product of the two matrices then represents the tensor product of the two linear maps. For example, if A is an m × n matrix and B is a p × q matrix, then the Kronecker product A ⊗ B is the mp × nq block matrix whose (i, j) block is a_ij B:

A ⊗ B = [ a_11 B ... a_1n B ; ... ; a_m1 B ... a_mn B ]   (block rows separated by semicolons)

more explicitly, the entry of A ⊗ B in row p(i − 1) + k and column q(j − 1) + l is a_ij b_kl, for 1 ≤ i ≤ m, 1 ≤ j ≤ n, 1 ≤ k ≤ p and 1 ≤ l ≤ q.
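The block structure can be checked directly. The following is a minimal pure-Python sketch of the Kronecker product, written with zero-based indices so that entry (p·i + k, q·j + l) of the result equals A[i][j] · B[k][l]. The function name `kron` and the example matrices are my own illustrative choices.

```python
def kron(A, B):
    """Kronecker product of an m x n matrix A and a p x q matrix B (lists of rows)."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    K = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    # Block (i, j) of the result is A[i][j] * B.
                    K[p * i + k][q * j + l] = A[i][j] * B[k][l]
    return K


A = [[1, 2],
     [3, 4]]
B = [[0, 5],
     [6, 7]]
K = kron(A, B)
for row in K:
    print(row)

# A 2 x 3 matrix with a 1 x 2 matrix gives a (2*1) x (3*2) result.
K2 = kron([[1, 2, 3], [4, 5, 6]], [[1, 10]])
print(len(K2), len(K2[0]))
```

In practice one would normally use a library routine for this; the explicit loops are only meant to make the index arithmetic visible.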

It is worth noting that recursively applied tensor products grow exponentially in dimension so that for { u_i ∈ ℝ^d : 0 < i ≤ n } by the associativity of tensor products:

u_n ⊗ ... ⊗ u_3 ⊗ u_2 ⊗ u_1 = [ u_n ⊗ [ ... [ u_3 ⊗ [ u_2 ⊗ u_1 ] ] ... ] ] = M

where M is an order-n tensor with d^n entries in total.
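To see the growth concretely: the tensor product of n vectors in ℝ^d, flattened into a single vector, has d^n entries. The sketch below is a pure-Python illustration under that flattening convention; the helper name `tensor` and the sample vectors are assumptions of mine.

```python
def tensor(u, v):
    """Flattened tensor (Kronecker) product of two vectors given as lists."""
    return [a * b for a in u for b in v]


d = 3
vectors = [[1, 2, 3], [0, 1, -1], [2, 2, 5], [1, 0, 4], [3, 1, 1]]

# Fold the vectors in one at a time and record the size after each step.
acc = vectors[0]
sizes = [len(acc)]
for u in vectors[1:]:
    acc = tensor(u, acc)
    sizes.append(len(acc))
print(sizes)  # d, d^2, d^3, ...

# Associativity: the grouping of the products does not change the result.
a, b, c = vectors[:3]
left = tensor(tensor(a, b), c)
right = tensor(a, tensor(b, c))
print(left == right)
```

Even for modest d and n the size quickly becomes impractical to store densely, which is why sparse representations come up later in these notes.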

35 Message to Vivek Srikumar:

I’ve been reading your 2014 NIPS paper and I’m a bit puzzled by some of your notation, specifically the integer-valued variables n and N that you use to specify the dimensionality of the weight vector w, the feature vector Φ(x, y) and the input feature vector φ(x).

If I assume that (uppercase) N is used exclusively for the user-defined features that determine the size of φ(x), then (lowercase) n defines the dimensionality of the weight vector and of Φ(x, y). But Φ(x, y) seems to be a function of Ψ(x, y_p, A) and |y_p|, the latter of which could be anywhere from 1 to M, the total number of labels.

So I conclude it must be that y_p = (y_0, y_1, ..., y_m) represents a sparse vector with m << M non-zeros and so therefore:

|Ψ(x, y_p, A)| = d^M |φ(x)| for all y_p in Γ_x

and furthermore:

|Ψ(x, y_p, A)| is also d^M |φ(x)|

though presumably the latter is quite sparse or d is quite small.

Could you confirm or disconfirm this, and, if I’m wrong, provide an alternative interpretation? Thanks.

From: Vivek Srikumar

Thanks for your email! I didn’t quite understand how you got this conclusion:

So I conclude it must be that y_p = (y_0, y_1, ..., y_m) represents a sparse vector with m << M non-zeros and so therefore:

But let me try to explain the notation a bit better. The dimensionalities involved are:

  1. n: The weight vector w is n-dimensional. Φ(x, y, A) is also n-dimensional.

  2. M: The number of labels in the problem is M, corresponding to the set {l_1, l_2, ..., l_M}. We will call this set L.

  3. d: Each label i is associated with a d-dimensional vector a_{l_i}. These vectors are the columns of the d × M matrix A.

  4. m: Let us take a specific part/factor p in Γ_x. Suppose this p is associated with m of the outputs in the factor graph. That is, part p has an m-tuple label (y_0, y_1, ..., y_{m−1}). Each y_i is an element of L.

By unrolling the recursion in (3), we have

Ψ(x, y_p, A) = a_{y_0} ⊗ a_{y_1} ⊗ ... ⊗ a_{y_{m−1}} ⊗ φ(x)

This is an order-(m + 1) tensor. The first m factors of this tensor product are all d-dimensional vectors and the last one is a |φ(x)|-dimensional one.

So the dimensionality of vec(Ψ(x, y_p, A)) is d^m |φ(x)| for all p ∈ Γ_x. Note that here m is just the number of variables associated with this part p and is not related to M, the number of labels in the problem.

In the paper, I use the (in hindsight unnecessary) additional indirection of using a_{l_{y_0}} to refer to the vector corresponding to the label l_{y_0}. That is, the y_i’s index into the label set L.
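As an aside to the exchange, the dimension count in this reply is easy to check numerically. The sketch below builds the flattened tensor product of m label vectors and an input feature vector and confirms that its length is d^m |φ(x)|. It is only an illustration under assumed toy values: the function `psi_vec`, the columns in `A_cols` and the vector `phi` are hypothetical and are not taken from the paper.

```python
def tensor(u, v):
    """Flattened tensor product of two vectors given as lists."""
    return [a * b for a in u for b in v]


def psi_vec(label_ids, A_cols, phi):
    """vec(a_{y_0} (x) ... (x) a_{y_{m-1}} (x) phi) for one part with m labels."""
    out = [1.0]                      # identity element for the product
    for y in label_ids:
        out = tensor(out, A_cols[y])  # fold in one d-dimensional label vector
    return tensor(out, phi)          # finally fold in the input features


d, M = 2, 4
# Hypothetical label vectors: the M columns of a d x M matrix A.
A_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
phi = [1.0, 0.5, 0.0]                # hypothetical input features, |phi(x)| = 3

y_p = (0, 3)                         # a part with m = 2 labels
vec = psi_vec(y_p, A_cols, phi)
print(len(vec), d ** len(y_p) * len(phi))
```

Note that the length depends on m, the number of labels attached to the part, and not on M, the size of the label set.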

To: Vivek Srikumar

We’re on the same page up until the penultimate paragraph. You seem to be saying that all the parts in Γ_x have the same number of labels. Indeed the exact same labels!

Then since we simply sum the vec(Ψ(x, y_p, A)), that means |Φ(x, y, A)| = d^m |φ(x)| = n = |w| and so I agree that in this case everything works. I had just assumed that a given x might have parts with different label sets.

From: Vivek Srikumar

Ah, not really. The input feature vector is defined to be part specific (φ_p instead of φ, as in Equation 3). In practice, we could pad the input features with enough zeros. For example, say there are two parts:

  1. part p_1 has one label y_1 and is associated with φ_1, which gives d |φ_1(x)| features, and

  2. part p_2 has two labels (y_1, y_2) and is associated with φ_2, which gives d^2 |φ_2(x)| features.

We can pad the vectors vec(Ψ_1) and vec(Ψ_2) with zeros appropriately so that the two sets of features can be added. If the two feature spaces are completely orthogonal, then the actual feature vector will be of size d |φ_1(x)| + d^2 |φ_2(x)|.

We can make this statement more formal by thinking about the basis vectors of the different vector spaces. Say the basis vectors for the label vectors are {e_1, e_2, ..., e_d}, the basis vectors for the range of φ_1 are {f_1, f_2, ...}, with |φ_1(x)| elements, and those for φ_2 are {g_1, g_2, ...}, with |φ_2(x)| elements.

Then we can say the following about the two feature tensors and their vectorizations:

  1. the tensor product a_{y_1} ⊗ φ_1(x) will have the basis tensors {e_i ⊗ f_j} with d |φ_1(x)| elements. vec(a_{y_1} ⊗ φ_1(x)) will have the basis vectors {vec(e_i ⊗ f_j)}, and

  2. the tensor product a_{y_1} ⊗ a_{y_2} ⊗ φ_2(x) will have the basis tensors {e_i ⊗ e_j ⊗ g_k} with d^2 |φ_2(x)| elements, and its vectorization will have the basis vectors {vec(e_i ⊗ e_j ⊗ g_k)}.

The full feature vector is the sum of these two vectors. But for the sum to be meaningful, the two vectors should be in the same space. That is, they should have the same basis vectors. To make that happen, we need to assume that the REAL feature space is defined by the basis vectors {vec(e_i ⊗ f_j)} ∪ {vec(e_i ⊗ e_j ⊗ g_k)}. Let’s call this union FULL. The size of FULL is d |φ_1(x)| + d^2 |φ_2(x)| if the two sets do not share any common elements.

There are different ways of making the feature vectors from (1) and (2) exist in the FULL space. One easy way to achieve this is to define the vectorization operator so that it produces vectors in the FULL space, with zeros for all basis vectors that do not belong to the part in question.

Note that this basically comes for free if we use sparse vector implementations that are internally defined as maps from strings to doubles. The strings then define the bases, and ⊗ for the bases is just string concatenation. I didn’t do this, though, for efficiency reasons.
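As an aside, the string-keyed sparse representation described in this reply can be sketched in a few lines of Python. This is my own illustration of the idea and not the implementation used in the paper: the helper names (`sparse`, `tensor`, `add`), the label vectors in `a` and the `*` separator placed between basis names (added so that concatenated keys stay unambiguous) are all assumptions.

```python
from itertools import product


def sparse(prefix, values):
    """Dense list -> dict keyed by basis names such as 'e0', skipping zeros."""
    return {f"{prefix}{i}": v for i, v in enumerate(values) if v != 0.0}


def tensor(u, v):
    """Tensor product of sparse vectors; basis names are concatenated."""
    return {ku + "*" + kv: a * b
            for (ku, a), (kv, b) in product(u.items(), v.items())}


def add(u, v):
    """Sum of sparse vectors; keys missing from one side count as zero."""
    out = dict(u)
    for key, x in v.items():
        out[key] = out.get(key, 0.0) + x
    return out


a = {0: [1.0, 0.0], 1: [0.5, 2.0]}      # hypothetical label vectors, d = 2
phi1 = sparse("f", [1.0, 0.0, 3.0])     # hypothetical features for part 1
phi2 = sparse("g", [2.0, 1.0])          # hypothetical features for part 2

part1 = tensor(sparse("e", a[0]), phi1)                             # one label
part2 = tensor(tensor(sparse("e", a[0]), sparse("e", a[1])), phi2)  # two labels

# The two parts use disjoint basis names, so adding them needs no padding.
full = add(part1, part2)
for key in sorted(full):
    print(key, full[key])
```

Because each part only ever writes to keys built from its own basis names, the union of keys plays the role of the FULL space and the zero padding never has to be stored.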

To: Vivek Srikumar

Thanks. In my original message asking for clarification, I was thinking of something along the lines of your sparse solution for working within the FULL vector space. I suppose you could also include an L1 term in the loss function to keep the basis vectors sparse. In any case, if your approach works well enough it would be worth spending time on this to make it scale.

I’m working on hierarchical document models implemented as RNNs with LSTM hidden layers. In the case of words, sentences, paragraphs, etc., there are clear syntactic markers that suggest semantic boundaries, but obviously there are also meaningful fragments at the phrase and topic level likely to prove useful.

Completely automated detection of boundaries for such fragments is beyond the state of the art, but research has focused primarily on compositional grouping for parsing, alignment, sentiment analysis and traditional question answering. It may be that, using a method such as the one described in your NIPS paper with enough data, we can reveal more of the fine-grained structure for analysis.

P.S. I found your 2014 EMNLP paper with Jonathan Berant quite interesting and I look forward to following your work more closely in the future.

36 Semantic relations are model theoretic entities, e.g., entailment, contradiction and mutual consistency, which shouldn’t be confused with the binary operators that are part of the syntax of the logic, e.g., ⊃ for implication, ∧ for conjunction and ¬ for negation.

37 HTML 4.0 doesn’t have character codes for the natural-logic-relation symbols used by MacCartney [129] and Bowman et al [23], and so I’m using → for entailment, ← for reverse entailment, ↔ for equivalence, ¦ for alternation, and ≠ for independence.

38 Think about forward chaining producing a sequence of inferences in the form of bound atomic vectors that are fed into an LSTM layer in which each LSTM block is capable of remembering an entire embedding vector, that is, the vector equivalent of a ground atomic formula, but with the mutability and flexibility of an embedding vector. We could use a variant of the NTM model [71] to set aside a round-robin buffer and feed the inferred consequents into the buffer, which is subsequently scanned for relevant information to construct a context to assist response generation. The problem is that as soon as I start thinking along these lines I immediately recall the unpleasant consequences of unfettered forward or backward chaining in applying theorem provers or classical planning systems to anything other than toy domains. Perhaps the same sort of wishful thinking will surface as we try to apply NLI at scale, but I’m hoping the enormous capacity of high-dimensional embedding spaces coupled with the graceful degradation in precision we see in the case of very-large language models will carry the day.
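To make the buffer idea slightly more concrete, here is a deliberately simplified Python sketch of a fixed-size round-robin buffer of embedding vectors that is scanned by cosine similarity. It is not an NTM and involves no learning; the class name `RoundRobinBuffer`, its methods and the use of cosine similarity as the relevance measure are all assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class RoundRobinBuffer:
    """Fixed-capacity store in which the oldest entry is overwritten first."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_slot = 0

    def write(self, vec):
        """Store an inferred consequent, overwriting the oldest slot when full."""
        self.slots[self.next_slot] = list(vec)
        self.next_slot = (self.next_slot + 1) % len(self.slots)

    def scan(self, query, k=1):
        """Return the k stored vectors most similar to the query."""
        filled = [s for s in self.slots if s is not None]
        return sorted(filled, key=lambda s: cosine(query, s), reverse=True)[:k]


buf = RoundRobinBuffer(capacity=2)
for consequent in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    buf.write(consequent)            # the third write evicts [1.0, 0.0]

print(buf.scan([1.0, 0.0], k=1))
```

The fixed capacity is what bounds the cost of unfettered chaining here: older consequents are simply forgotten, at the price of possibly discarding something that later turns out to be relevant.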

39 The prediction process is seeded with a down-sampled, low-resolution, gist-like [208] version of the whole input image annotated by the ‘‘context-network’’ to provide ‘‘sensible hints on where the potentially interesting regions of the image are located’’. At each subsequent stage in the process, the system has a ‘‘foveal’’ view centered on the target location of the last saccade: a high-resolution focal region surrounded by a low-resolution peripheral region.

40 There is always the option of using higher-level modeling tools like Chris Eliasmith’s Spaun [199] (TED) but I’m not comfortable with the accuracy or level of detail of such models. Preferably I would like to simulate at the molecular level and use molecular-scale models to develop simulated instruments to mimic genetically encoded calcium indicators.