The Fly's Epiphany in the Third Lane

This document contains notes for a project relating to digital personal assistants. Log entries are listed in reverse chronological order. Topics were examined starting with the least well understood, the objective being to quickly dismiss topics irrelevant to the proposed effort and sufficiently ground those providing potentially useful technology.

The main focus is on digital assistants that collaborate with humans to write software by making the most of their respective strengths. This account of my early involvement in building conversational agents at Google emphasizes my technical interest in enabling continuous dialogue, resolving ambiguity and recovering from misunderstanding, three critical challenges in supporting effective human-computer interaction.

The earliest notes survey research in neuroscience relating to attention, decision making, executive control and theory-of-mind reasoning. The footnotes and extensive bibliography provide an introduction to a cross section of the most relevant recent research. The material in this document will serve as the basis for lectures at Google and Stanford.

February 15, 2018

%%% Thu Feb 15 4:35:56 PST 2018

My earlier notes concerning the roles of mycroglia in the developing brain are available here. I'm reviewing related lectures from the Broad Institute, Simons Foundation and YouTube, featuring Ben Barres, Stanford; Staci Bilbo, Harvard; Tobias Bonhoeffer, Max Planck Institute; Carla Shatz, Stanford; Beth Stevens, Harvard; Richard Ransohoff, Biogen. I'll post my recommendations here once I have a chance to look more thoroughly.

February 13, 2018

%%% Tues Feb 13 03:38:59 PST 2018

Humans have lifted themselves by leveraging their innate evolved capabilities to construct a social and technological edifice that enables us to create tools that circumvent our intellectual shortcomings and to produce knowledge of lasting value that survives beyond our short lifespans. It is interesting to think about how we might develop artificially intelligent systems that embody the same characteristcs that have served us so well.

Our basic understanding of the structure and extent of human intelligence came into being during enlightenment¹ . It is only in the last few decades we have refined our understanding such that we can even contemplate attempting to engineer systems that test our understanding. We've talked about the basic components of human cognition in these pages. Here we attempt to place those components within their biological and historical context.

Beginning at the periphery, the primary sensory cortex supports attention to change and saliency across all modalities. This sensitivity extends to all abstractions and compositions of multiple modaliaties. Our innate pattern recognition capabilities leverage this sensitivity across all spatial and temporal scales, sensory modalities and their abstractions. Modern machine learning has demonstrated that such capabilities can be automated.

Human memory represents a remarkable evolutionary innovation. Our creative reconstruction of experience, while undermining our ability to provide accurate first-hand accounts of accidents and other past events, is ideally suited to imagining novel variations of familiar situations, learning to make predictions and plan for the future. A complete understanding of this facility still eludes us, but we have clues to guide implementation.

Human language allows us share just about any form of knowledge, mathematics enables us to construct sound models and theories and draw conclusions that are valid assuming the validity of our axioms and rules of inference, and logic allows us to construct programs that run on human minds. All of these innovations are built upon the foundation of our innate observational and pattern recognition capabilities. None are completely understood.

That said, language is key to all of these advanced capabilities. It is hard to overstate its importance. I am smart because of what I know. This is analogous to saying that my computer is capable because of the programs I've installed on it. I can observe and emulate the strategies of the people around me. Their actions in solving problems are like programs I can run because I have the same hardware and operating system. Other people can just tell me how they solve problems and that is enough that I can apply their solutions to solving my own similar problems.

For thousands of years human beings have been sharing their programs, improving them and adapting them to solving new problems. Much of that knowledge has been written down and preserved in books we can now read so as to avail ourselves of what others have learned and adapt their solutions to the problems of our age. During that same period, our language has evolved and become more expressive allowing us to extend and generalize what we have learned so it is relevant to solving an ever broadening class of problems. We can still run programs we find in books hundreds of years old.

In addition to language becoming more expressive, human beings have formalized the logic inherent in human discourse and problem-solving and codified the operations required to draw correct conclusions from knowledge we already have, thereby providing us with a reliable technology for inferring new knowledge and new programs that run on our innate computing hardware. In the last hundred years, we have created technology that implements such logic and the operations that create new knowledge from existing knowledge, and improved that technology so that it runs several orders of magnitude faster than biological computers.

Human computers are powerful, technically universal in the Church-Turing sense of being able to compute any computable function. But the underlying biological machinery is slow and optimized for solving a class of problems conducive to our survival on this planet during a particular evolutionary era. So why do we want build assistants modeled after the human brain? The answer is that human pattern recognition, facility with language, ability to operate in complex conceptual spaces that combine information from many sources, interact easily with other humans and integrate all of these capabilities in a single package is unsurpassed.

February 11, 2018

%%% Sun Feb 11 04:47:59 PST 2018

For students taking CS379C for credit or wanting to follow along, the introductory lecture, organized into four installments, is now available starting here. You can also find a commentary on the computational and immunological roles of microglia in developing and mature brains, included here as a lesson in the dangers of accepting prevailing dogma.

Part 4 of the introductory lecture provides a high-level description of the different components you will be working with in designing neural network architectures for class projects and you can find a collection of tutorials on implementing artificial neural networks related to these components here if you want to get a head start on your project.

February 9, 2018

%%% Fri Feb  9 05:11:48 PST 2018

I am fascinated with how artists, engineers, musicians and scientists think about their craft. This weekend my wife Jo and I watched The Art of the Piano: Great Pianists of the 20th Century, a documentary film — available on YouTube here — directed by Donald Sturrock and starring the musicians as themselves. The film features rare footage of concerts and interviews with Sergei Rachmaninoff, Arthur Rubinstein, Sviatoslav Richter, Vladimir Horowitz and Glenn Gould, among others. It also includes commentary and analysis by Daniel Barenboim and other musicians of comparable breadth and artistry.

I was particularly fascinated with Alfred Cortot (1877-1962). He was introduced as one of the most individual and also most unreliable of the early recording pianists. The bit about unreliability caught my attention. Cortot was first mentioned at 52:30 in the YouTube video showing a photo of him as a young man. Like many of his generation, World War II had a profound impact on his career and professional circumstances, and later in life he suffered from memory lapses that curtailed his public concerts and forced him into what was for musicians at the time an early retirement.

Despite these setback, Cortot went on to become a great teacher and interpreter of musical scores. Daniel Barenboim said of Cortot that "he always looked for anything extraordinary in the music ... something totally removed from reality", and the sequence starting at 56:15 where he performs as if completely entranced by the music while working with his students demonstrates both his other worldliness and his gifts as a teacher. One of his students Samson Francois was so interested in jazz and movies that they stimulated in his playing a remarkable sense of freedom and improvisation that one seldom hears in contemporary pianists, Glenn Gould being an exception.

You're probably wondering what any of this has to do with the topic of the CS379C this year. The answer is that it has a lot to do with the programmer's apprentice project described in these pages, a project that combines requirements from two of the three challenge problems. I'm constantly thinking about how software engineers design and build complicated programs, and both Jo and I work in areas that require creativity and the ability to manipulate complex representations in our heads. Jo is an artist specializing in abstract landscapes while I write computer programs and design computing architectures inspired by biology.

One question that interests me has to do with how humans assemble and work with complex cognitive artifacts like computer programs and musical compositions. How do they maintain and manipulate these structures in their heads? One possibility is that these artifacts share a great deal of structure. I'm particularly interested in the sonata form that developed in the middle of the 18th century and is perhaps best known now through the works of Haydn, Mozart and Beethoven, all three of whom produced some of their best work in this form and contributed to its development.

The sonata form provides a general architecture, but there are multiple levels of structure evident in the work of these composers each of whom had a different style of composition. The sonata form is a musical structure consisting of three sections: an exposition, a development, and a recapitulation. At the lowest level, lengthy sequences of notes in works by Mozart and Beethoven are generated through default patterns of pitches: arpeggios, scale passages, chords and the like. They also operated within a tonal system providing additional structure, predating Arnold Schoenberg's radical 12-tone departure from the conventional 6-pitch tonal system.

As an exercise, think about the structure of computer programs at different levels of abstraction from libraries, modules, object classes, algorithms, e.g., Donald Knuth's division into fundamental, seminumerical, sorting and searching and combinatorial, and programming paradigms, e.g., imperative, procedural, declarative, functional, object-oriented, event-driven programming. If you want to learn more about the structure of sonatas, check out Exploring Beethoven’s Piano Sonatas Parts I-III, a series of courses taught by Jonathan Biss of the Curtis Institute of Music available on Coursera.

February 8, 2018

%%% Thu Feb  8 04:29:23 PST 2018

At the periphery, sensory information often has local structure providing a basis for inferring relatively low-level features that can be used to recover complex objects from their context. Convolutional neural network (CNN) models enable us to infer such low-level features and generate feature maps that can be recursively analyzed to produce composite features, culminating in detailed hierarchical models of complex objects. In general, CNN models can be applied to structured data including sequences, images, acoustic time series, and diverse other spatially and temporally organized data. CNN models have been successfully deployed to learn useful representations of images, documents and graphs.

Much of what we see and hear derives its meaning from the context in which we observe it. In a grainy picture of a farmyard, a blurry pinkish blob might be identified as a pig and a vertical splash of blue as a farmer's jeans. The words "dog" and "cat" refer to different animals, but they often appear in the same contexts, e.g., house pets in the case of "He put the { cat, dog } out for the night". Embedding space models learn to infer the meaning of entities by the company they keep, e.g., language models represent words and their relative meanings derived from their context, defined by a fixed-width window of words immediately preceding and following the target word in a specified document.

Recurrent neural network (RNN) models and, in particular, gated-feedback recurrent networks [46] and long short-term memory networks [143] maintain state from one input to the next by feeding the output of units in one layer as input to units in other layers that appear earlier in a stack of layers that would otherwise be described as feedforward. RNN models are used in applications such as parsing that benefit from remembering some part of their history to employ that information at a subsequent point in time. The recurrent state is viewed as short-term memory, but has a general intepretation given the equivalent interpretation of RNN's as differential equations⁴.

RNN encoder-decoder models [275, 45] consist of two networks, one network called the encoder that ingests the input sequence, e.g., a sentence in one language, constructing a representation of the input which serves as input to a second network called the decoder that uses the representation to generate an output sequence, e.g., a translation of the sentence in a second language. These models were originally applied in natural language processing and machine translation, but have been extended to a wide range of applications from image and video captioning to code generation and protein function prediction.

So-called attentional neural networks learn to select specific input in order to facilitate context-sensitive changes. They really aren't so-much a distinct class of neural networks as a strategy that can be applied to implement a variety of capabilities ranging from visual attention and saccadic eye movement to more exotic cognitive capabilities such as those associated with autonoetic consciousness⁵, executive function⁷, episodic memory and attentional schemata in reasoning about other minds. We will demystify the latter capabilities the process of investigating the three applications mentioned in the syllabus.

The human brain stores information at different time scales. The stability of stored information varies considerably depending on its use. Strictly speaking, the word "store" is misleading here, since all we really do is encode information in the patterns of activity of neurons and then apply different strategies for maintaining the integrity of that information, altering it to suit various purposes that range from imagining the future by extrapolating from the present to constructing composite representations that combine information from multiple sources to create novel thoughts.

The global workspace theory posits a model of working memory that enables such future imagining and constructive engineering. To implement such a workspace, we use a form of persistent memory applying the notion of fast weights [141] — also called dynamic links [290] — to manage a working memory in which information can be added, modified and removed, that supports associative retrieval and is implemented as a fully differentiable model and so can be trained end-to-end. Fast weights are used to store temporary memories of the recent past and provide a neurally plausible method of implementing the type of attention to the past that has recently proved helpful in sequence-to-sequence models [10].

February 7, 2018

%%% Wed Feb  7 03:39:41 PST 2018

Next we consider artificial neural networks as components to be used in designing neural-network architectures. In the field of connectomics, sub networks of biological networks that appear repeatedly or at multiple scales are often referred to in the literature as network motifs and serve a similar epistemological role in computational neuroscience [60, 81, 136, 147, 270, 204]. Whether they are derived from artificial or biological networks, network motifs can also be applied to learning artificial neural network architectures [228, 311, 312].

Using an analogy drawn from computer architecture, network motifs are not discrete components like transistors and diodes or small integrated circuits like individual flip-flops or logic gates. At the other end of the scale, they aren’t complex integrated circuits like central processing units or GPU devices. They are more like multiplexers, demultiplexers, full adders, shift registers and serial interfaces. From the perspective of computational neuroscience, information is distributed across large numbers of neurons and representations are encoded in the activity of ensembles of then neurons. Artificial neural networks reduce complex biological processes to a special case of vector calculus and differential geometry.

The artificial neural network motifs we focus on in this class operate on arbitrary finite-dimension vectors, often performing the same operation on each vector component or computing complex vector or matrix products, creating and transforming vector spaces in the process. They employ activation functions that truncate the output of linear transformations using rectified linear units, squash scalar values with sigmoidal functions, and pool values to reduce dimensionality by local averaging and convolving with linear filters. In short, the network motifs that comprise our library of architectural components are powerful computing machines inspired by biology and designed by engineers to serve particular engineering goals.

If these engineered networks are only tenuously related to their putative biological counterparts, why should we imagine we will be able to employ them to leverage functional models derived from cognitive and systems neuroscience to design systems that exhibit desirable characteristics of biological systems? The short technical — and perhaps intellectually unsatisfying from your standpoint — answer is fully-differentiable mathematical models and end-to-end training with stochastic gradient descent. If you want a more satisfying argument from first principles, the best I can do is direct you to Chapter 3 of Parallel Distributed Processing by Hinton, McClelland and Rumelhart [140].

If you believe that we have a rock-solid, cellular-resolution theoretical foundation on which to rest our understanding of behavior, you might want to learn about how our current foundations are being challenged by new discoveries about the role of microglia. If you expect an argument that draws on careful experiments comparing primate and computer neural-network models of visual processing, then you're in luck, but will have to wait a few weeks until we look at the work of Charles Cadieu, Jim DiCarlo, Dan Yamins and their colleagues [151, 304, 37, 306, 305].

Jump to Introductory Lecture Part 4

February 6, 2018

%%% Tue Feb  6 06:29:43 PST 2018

In the following, we are going to concentrate on the central nervous system. The brain receives its input from the periphery originating in receptors that respond to light, smell, pressure, sound and movement. Most of that information enters the CNS through the primary sensory areas via networks organized in topographical maps reflecting the spatial pattern of the corresponding receptors. That information is processed in multiple areas and through specialized paths that code for the spatial and temporal properties of the input at increasing levels of abstraction. The resulting models are both hierarchical and compositional⁸.

As you move away from the periphery, not only do the properties reflected become more abstract but they are also combined to more fully summarize the full breadth of our experience. The resulting composite features are represented in what are called association areas that play an important role in pattern matching allowing us to recall similar experiences based on abstractions constructed from learning-selected subsets of more concrete sensory features. In addition to sensory information, we employ other features of our experience including, for example, features that code for different aspects of our physiological / emotional response to a given situation.

Figure 38: Panel A lists the regions of the human cerebral cortex including the relevant Brodman areas [REF]. Panel B illustrates some of the pathways involving language including Broca's area considered responsible for production [REF] and Wernicke's area [REF] implicated in comprehension. Panel C displays a sample of Brodman's areas as landmarks, highlighting areas 28 and 34 in the temporal lobe associated with the enterorhinal cortex [REF]. Panel D identifies a number of cortical and subcortical areas associated with the entorhinal cortex and episodic memory in particular [REF].

We have focused on the human neocortex but other sources of input originating from subcortical regions also play important roles. Panel D in Figure 38 shows the enterorhinal cortex located in the medial temporal lobe that functions as the hub of an extensive network for memory and navigation and serves as the primary interface between the hippocampus and neocortex. The enterorhinal cortex (EC) also plays an important role in facilitating episodic memory in autonoetic consciousness. The EC is a good example of how evolutionarily ancient parts of the brain are integrated with relatively new regions.

The limbic system is a standard — but somewhat misleading — term for the emotional center of the brain in part because it includes the amygdala but the term is seldom employed these days since the designated area is home to a wider array of functions — see Chapter 9 from Swensen [276]. The hypothalamus is the primary output for the limbic system receiving input from all over the brain in addition to homeostatic sensors that measure temperature and glucose concentrations. The hypothalamus influences behavior and emotional reactions plus autonomic and endocrine functions controlled via projections to the brain stem and spinal cord.

The limbic system also encompasses the amygdala and hippocampus. The former is important in coordinating behavior and mediating the autonomic and endocrine responses to environmental stimuli, especially those with emotional content. The hippocampus is an ancient part of the brain active in both encoding and decoding memories, and important in learning new surroundings and retrieving directions. The limbic system connects to several neocortical areas including the prefrontal and orbital frontal regions that are critical for judgment, motivation abstract reasoning and problem solving.

With the exception of the endocrine system and related pituitary gland, we have focused on the input side of how the brain functions. The output side spans a wide range of functions, but we concern ourselves almost exclusively with those relating to the somatic nervous system and the voluntary control of body movement via skeletal muscles including speech. In particular, we ignore the stomatogastric nervous system and enteric nervous system that control the muscles of the gastrointestinal tract thereby facilitating feeding and digestion — not among our target applications.

Our understanding of the functional architecture of human frontal cortex is largely due to the sort of fMRI studies we discussed earlier. We can’t study human brains in the same way we study the brains of non-human organisms, but we can probe human subjects by asking them to look at images and solve problems while observing them in an fMRI scanner, and then asking them to report on what they saw or how they went about solving the problems. This is unsatisfactory for many scientists interested in neural circuitry, but it provides tantalizing hints about how we might design AI systems.

Jump to Introductory Lecture Part 3

February 5, 2018

%%% Mon Feb  5 04:23:53 PST 2018

I am going to begin with a short anatomy lesson focusing on the cerebral cortex which, in the mammalian brain, includes the neocortex. Infatuated as we are with our much-vaunted [103] cortex and, in particular, the advanced capabilities provided by our neocortex⁹, it is worth pointing out that we also have a cerebellar cortex or cerebellum which is primarily known for its role in motor control, but in fact coevolved along with the neocortex to support a number of advanced cognitive strategies.

Despite our spending time on gross anatomy, we are primarily interested in how different cognitive capabilities that we refer to as cognitive functions depend upon one another. However, some of the most interesting theories concerning human cognitive functions in general and their dependencies in particular come from fMRI studies in which subjects are given a task to perform and researchers try to determine which areas of the brain contribute to performing that task by estimating where in the brain energy is being consumed, presumably in service to performing computations necessary to carry out the given task.

Compared to our work using electron micrographs to trace the axons and dendrites of individual neurons at a resolution of approximately ten nanometers, functional MRI studies depend on data with a spatial resolution measured in millimeters. In the case of human studies of cognitive function, MRI is currently the best method we have for understanding behavior. In addition to providing potentially useful information for understanding where computations are being performed in the brain, we can also use a technology called diffusion tensor imaging or DTI to estimate which regions of the brain are connected to one another and in what direction information appears to be flowing.

Ignoring the details, we will suppose that we have a graph in which the nodes of the graph correspond to regions of the brain and the edges to bundles of myelinated axons that connect those regions. We will suppose further that we can make reasonable guesses about how information passes along those axons in carrying out tasks that involve the orchestration of multiple regions each one performing different computations contributing to the observed behavior.

This process of testing subjects while performing tasks in an fMRI machine and then trying to infer what is going on in their brains is, as you might guess, subject to error, but it is the best tool we have at this time and cognitive neuroscientists have made good use of it to unravel some of the puzzles relating to human cognition. Our task however is not to construct a theory of how humans solve problems, but rather to use the knowledge that cognitive scientists have assembled as clues in building systems that attempt to perform some of the same tasks that humans excel and so far machines fall short.

Paradoxically, while it might seem reasonable to engineer such systems using the symbolic methods of traditional artificial intelligence, instead we assume that the functional regions of the brain operate on distributed representations of the sort employed in building artificial neural networks and that these functional regions can be thought of as implementing functional components that communicate with one another by sharing such distributed representations.

This approach is not new by any means but the last decade has seen the invention or rediscovery of specialized neural network architectures that excel at specific tasks and that can be combined to solve even more complicated tasks. In this class, we investigate just how far we can go in emulating some of the most complicated and controversial functions that humans are capable of. Our goal is to solve some basic technical problems that limit what machines can do on their own and how they interact with humans, so that we can work collaboratively to solve problems that neither humans nor current artificial intelligence systems can solve on their own.

This might seem incredibly ambitious for a one-quarter project-oriented computer science class. While I admit the goals are ambitious, I believe it is exactly the kind of class experience that will provide students with the tools and confidence to take on such problems in the next stage of their education whether that be in graduate school, a new start up or a software engineering job in an industrial research lab. Incremental research problems are like $20 bills lying on the street waiting for someone to pick them up. Ambitious but timely research problems are like thousand dollar bills fluttering at the top of tall trees. They require some careful negotiation, but for the time being there won't be many climbers rushing for the trees.

Jump to Introductory Lecture Part 2

February 4, 2018

%%% Sun Feb  4 05:31:45 PST 2018

Science is a human endeavor and so scientific research is initiated, expedited and impeded by human motivations. Recent news concerning capital investments in biotechnology prompted me to think more deeply about some ideas relating to neuroscience that I've been working on for a couple of years now with my collaborator David Mayfield¹⁰. Here is an elevator-pitch-style dramatization intended to highlight the situation prompting my attention:

What if everything we think we know about the brain as a network computing device is wrong or at least missing one of the most important clues regarding how we learn and perform reliable computations with unreliable components?
What if we are blind to one of the most important factors required to understand and treat a broad spectrum of neurodegenerative diseases due to our misconceptions about how the brain computes and protects itself from pathogens?
What if many neuroscientists are ignorant or dismissive of the work and that by allowing such attitudes to persist we are wasting large amounts of money and intellectual capital working on models that are fundamentally flawed?
What if conventional life science VC firms and midsize biotechs are disinclined to invest in research¹¹, preferring drugs that mitigate the severity of orphan diseases rather than curing dozens of maladies in millions of patients?

I believe the antecedents in the above statements are accurate. As a computational neuroscientist, the evidence is compelling. As a computer scientist, the computational model suggests fascinating algorithmic and architectural innovations¹². There are likely multiple targets of opportunity depending on whether one is interested in developing drugs, inventing novel machine-learning hardware or establishing the basic scientific foundations. If you want to understand the science, check out these recent review articles [52, 182, 234, 252] and read David's short but highly informative research notes included below:

Lessons learned since the rediscovery of microglia in 2005: Microglia are the brain's principal immuno-competent cells making up roughly 10-15 percent of the CNS cell population. Prior to 2005, they were thought to play a largely quiescent, passive role under physiological conditions. As the brain's resident phagocytic immune cells, they could certainly be called into action, but — it was thought — only in response to an immune challenge to the brain caused by infection, injury, or established disease. In 2005, however, this dogma was challenged. [...]
Discovering the active role of microglia in healthy brain: The human brain is composed of two computers rather than one — a neuron-based digital machine built to compute the relevance of experience strictly in terms of what it already knows, and a microglia-based analog machine built to teach the digital machine how to compute differently given novel experiences it can detect but not yet understand. What the digital machine knows is stored in the relative strengths of the 100 trillion synapses through which pre-synaptic neurons send signals to their shared post synaptic partner. [...]

Summarizing David's review, microglia serve two very different roles in the adult brain. In a healthy adult brain, they enable synaptic plasticity and play a key role in learning. However, in responding to neural damage, they abandon their constructive role in learning, undergo a major morphological transformation and revert to immunological functions programmed into the immature cells prior to entering their final home in the developing brain.

In the best case scenario, microglial cells don't confuse the different circumstances that warrant these two roles. In the worst case, normal neural activity is mistaken for abnormal, microglia mount a phagocytic response to imagined pathogens and compromised cells, neural processes are stripped of their dendritic structure and rendered unable to perform their normal computational functions.

Prior to discovering the dramatic evidence of this dual role in 2005, researchers were aware that exposure to an immune challenge early in life (perinatal) — before microglia have fully adopted the transcriptional profile that enables them to function behind the blood brain barrier as specialized brain cells rather than peripheral macrophage cells — is predictive of cognitive decline and memory impairment late in life and its time of onset.

Later it was found that an ill-timed immune challenge during this sensitive perinatal window, is also an essential predisposing risk factor for the major neuro-developmental diseases and disorders of youth ranging from autism in toddlers, attention-deficit / hyperactivity disorder in children and schizophrenia and mood and addiction disorders in adolescence and young adulthood.

Putting these observations together, scientists looked for and found the signatures of microglial phagocytic damage in the brains of young patients with neuro-developmental disease and older patients with neuro-degenerative disorders including Parkinson's and Alzheimer's diseases. This brief summary of more than a decade of work by scores of scientists doesn't do justice to the richness of the case for this disease model.

Given that it is difficult if not impossible to avoid an unfortunate immune challenge during the critical window, this would be sad news indeed to a parent with a child having such a history, or anyone witnessing the symptoms of neuro-degenerative disease in themselves or a loved one were it not for their being some promising treatment options that could potentially provide protection across a broad spectrum of disorders [27, 26].

It turns out that a class of anxiolytic drugs marketed in France for more than three decades as a non-sedating, non-addicting alternative to the benzodiazepines has been shown to be an effective microglial modulator relevant to neuro-developmental and neuro-degenerative disease. Analogs of the original drug called etifoxine or ETX has been shown to modulate microglia activation in response to numerous models of immune challenge. While there are challenges ahead in evaluating efficacy, this is a promising sign that some form of treatment could soon be available for those afflicted.

Lectures:

[1]	Professor Ben Barres (Stanford University Departments of Neurobiology and Developmental Biology) January 2017. Broad Institute Lecture. Role of Microglia Activated A1 Phenotype Astrocytes in Neurodegenerative Diseases Ranging from AD to ALS, MS, and Retinal Degeneration. [VIDEO]
[2]	Professor Staci Bilbo (Harvard University Program in Neuroscience) June 2014. Lecture to the Canadian Autism Society. The Immune System and Neural Development: Implications for Neurodevelopmental Disorders. [VIDEO]
[3]	Professor Beth Stevens (Harvard University Program in Immunology) November 2016. Simon Foundation Lecture. On New Science of Microglia Function in the Healthy Developing and Mature Brain and the Implications for Autism and Schizophrenia. [VIDEO]

References:

[1]	S. D. Bilbo. Early-life infection is a vulnerability factor for aging-related glial alterations and cognitive decline. Neurobiology of Learning and Memory, 94(1):57-64, 2010.
[2]	S. D. Bilbo and J.M. Schwarz. The immune system and developmental programming of brain and behavior. Frontiers in Neuroendocrinology, 33(3):267-286, 2012.
[3]	Andrea Crotti and Richard M. Ransohoff. Microglial physiology and pathophysiology: Insights from genome-wide transcriptional profiling. Immunity, 44:505-515, 2018.
[4]	Shane A. Liddelow, Kevin A. Guttenplan, Laura E. Clarke, Frederick C. Bennett, Christopher J. Bohlen, Lucas Schirmer, Mariko L. Bennett, Alexandra E. Munch, Won-Suk Chung, Todd C. Peterson, Daniel K. Wilton, Arnaud Frouin, Brooke A. Napier, Nikhil Panicker, Manoj Kumar, Marion S. Buckwalter, David H. Rowitch, Valina L. Dawson, Ted M. Dawson, Beth Stevens, and Ben A. Barres. Neurotoxic reactive astrocytes are induced by activated microglia. Nature, 541:481-487, 2017.
[5]	Marco Prinz, Daniel Erny, and Nora Hagemeyer. Ontogeny and homeostasis of CNS myeloid cells. Nature Immunology, 18:385-392, 2017.
[6]	Michael W. Salter and Beth Stevens. Microglia emerge as central players in brain disease. Nature Medicine, 23:1018-1027, 2017.

February 3, 2018

%%% Sat Feb  3 04:05:23 PST 2018

This entry is meant for prospective CS379C students in the 2018 Spring quarter at Stanford University. It was written to underscore the unique opportunities and challenges designed into the syllabus for this instantiation of CS379C. For many of you, there will not come another time in your careers to contemplate such an exciting and risky scientific venture. The experience will likely motivate you to adjust your attitude to risk in directing your professional lives. I guarantee it will be interesting and for some it will be a revelation. I'll start with some background and general motivation before launching into the specific challenges that serve as the central focus for the course this Spring.

Here are two observations that have influenced my current thinking about technology. The first comes from David Cheriton recounting what he said to Sergey Brin and Larry Page when they pitched their ideas to him in 2004²⁰ . Apparently they came to him asking about licensing their technology to other companies. Cheriton understood that if you give birth to a wonderful new idea you might think that someone would be happy to adopt your idea because you think it's so beautiful, but it's very hard to get anyone else to adopt your baby. He basically told Brin and Page they would have to raise their baby themselves.

The other observation comes from two decades in academia, another decade in industry and several shorter periods working in startups and consulting for venture capital firms. I used to think it odd that the businesses and institutions financially best situated to tackle problems that have the potential to create whole new categories of products and revolutionize entire industries are the least interested in doing so. Now it is clear to me that they don't have to take on the risk as long as they keep their eyes open and continue to grow their revenue so they can simply acquire the inventors and startups that are willing to take on those risks and succeed in beating the odds.

Noticing an interesting idea, incrementally extending a promising technology and building infrastructure to scale and broaden the application of that technology is not taking risk for a company that already has the talent, tools and technological wherewithal to exploit such an opportunity. It is smart and it is good business and it does little to tarnish the reputation of a technology company with a track record for advancing the state of the art, since it does in fact advance the state of the art, and, as part of broader strategy for encouraging innovation, can maintain a company's dominance of a given technology indefinitely.

But what if you love to work on ideas that seldom appear in engineering textbooks. That philosophers like to discuss and physicists tend to shun. What if you are good at thinking about technical problems that are two or three years beyond the current state of the art. What if you thrive in those murky intellectual spaces where one can catch tantalizing glimpses of what could be were it not for the current lack of substance and definition? For some, this is just the difference between alchemy in Isaac Newton's era and chemistry today. For others, these are the sort of temptations that drew Guglielmo Marconi to radio and Philo T. Farnsworth to television.

I'm interested in understanding the human mind and biological minds in general. I believe that consciousness, self-awareness, and reasoning about other minds all have simple algorithmic explanations despite the seeming complexity of their biological implementations. This is less controversial today than it was a decade ago, but there are skeptics and enough murkiness that engineers are reluctant to invest effort in developing such capabilities for the current generation of personal assistants. The situation is similar to that faced by Seymour Benzer when he revolutionized the field of behavioral genetics despite opposition from field's leading researchers [130, 118].

It is also an exiting time for me to be working in this area, both because I believe we are on the cusp of understanding the underlying phenomena due to advances in cognitive and systems neuroscience and because the field is relatively sparsely populated by scientists who have the necessary biological training and a command of computer science and the modern theory of artificial neural networks in particular. Seymour Benzer is famous for starting and legitimizing new fields of inquiry and then moving on when the field became crowded [300]. With each change he abandoned supportive colleagues and took on new challenges and skeptics. He probably couldn't help himself.

%%% Sat Feb  4 14:26:33 PST 2018

Here is the text advertising CS379C in Spring 2018. It was sent to Meredith Hutchin CS and Laura Hope MBC to distribute to students prior to February 11 when Axess opens for course enrollment:

In CS379C this quarter, we consider the following three challenge problems:

1. How would you design a personal assistant capable of maintaining relationships with each member of a family, managing a comprehensive episodic memory to enrich those relationships, adopting a different intentional stance appropriate for each household member and essentially behaving as another member of the family?

2. What if you had all of the C++ — or Java or Python — code checked into GitHub plus all the versions and all the diffs plus sample I/O, unit tests and documentation. How would you go about developing a neural network architecture that learns to write programs from sample I/O and natural language descriptions?

3. Suppose you had the complete wiring diagram (connectome) of a fly and petabytes of recordings from each neuron in the fly's brain aligned with high-speed images recording every aspect of the fly's behavior and the environment in which those behaviors were carried out. How would you construct a model of the fly's brain?

Hypothesis: Each of these problems can be solved using a recurrent neural network architecture constructed from published component networks each of which is relatively well understood and has been applied successfully to solving simpler problems.

Course description: In class, we examine this hypthesis by designing networks for key parts of each problem, borrowing ideas from both systems and cognitive neuroscience. Students propose and complete related programming projects for final grade.

February 1, 2018

%%% Thu Feb  1 04:59:20 PST 2018

Given that the programmer's apprentice is intended to be trainable end-to-end, it could require a great deal of training data. It is reasonable to ask if it is feasible to obtain this data entirely through interactions with a human programmer, and, if not, how one might bootstrap such a system without requiring it to learn in an interactive setting with a human user-in-the-loop. We don't have a completely satisfactory answer to this question at this stage, but I've included some preliminary suggestions in the following. The current design of the programmer's apprentice can be divided into the following (sub) systems:

a social modeling system consisting of an attentional schema and episodic memory [169].
an executive control system implemented as a hierarchical planner neural network [278].
an integrated development environment implemented as neural Turing machine memory [105].
a code embedding consisting of a hierarchy of contextual encoder-decoder models [161].

Systems 1 and 2 constitute a relatively sophisticated dialog management system (DMS). It should be reasonably easy²¹ to implement a hybrid DMS to pretrain the other two components. System 3 provides the three-way interface between the assistant, programmer and the means of editing, debugging, executing and tracing programs. System 4 serves as a proposal generator to facilitate programming by code substitution.

The hybrid DMS would correspond to a combination of a hand-coded conventional dialog system extending the earlier Zinn Google Play assistant with neural network interfaces to the System 3 IDE NTM and System 4 code embeddings. In principle System 4 can be trained separately along with a language model gleaned from Stack Exchange data and related corpora.

Conceptually System 1 is the most novel and intellectually exciting. It has the biggest potential to influence the design of Google Assistant. System 4 is the most challenging in terms of extending existing hierarchical approaches to contextualized structured-data embedding and associative retrieval. It has the biggest potential to apply broadly at Google.

The choice of programming language is complicated. If we could work the nascent SWERL Project we would have virtually unlimited C++ code and excellent coverage, but much of that code would be inscrutable and might make for terrible demos²². I've been talking with members of the PLT (Programming Language Theory) Group at Rice and Brown about collecting and curating a corpus of Scheme programs for training purposes and ingesting the text of How to Design Programs as part of the language model, but those discussions are preliminary.

January 31, 2018

%%% Sun Jan 28 15:20:11 PST 2018

Here are my thoughts about raising expectations for research, devising strategies to help engineers in pursuing about more ambitious projects, and making a case for how this might produce technologies with far reaching consequences. I wrote an outline for a relatively short statement in three installments:

What would it mean to break the mold, alter or enlarge the direction of the field and create technologies enabling whole new applications?
Where does the inspiration come from and how do we reliably harness it to come up with new ideas that will enable future innovation to happen?
How can we take the long view and tackle significantly challenging problems while at the same time delivering value and demonstrating progress?

Installment 1 is an account of the first ten years of my research career during which I constantly fought against the status quo. In hindsight this behavior might seem obvious, but at the time it was a dangerous path to tread for a non-tenured assistant professor. I don't emphasize the danger or the stress it put me through, but rather simply chronicle what I considered to be wrong with the field and what I did to put it on more solid foundations. If the story achieves what I intended, it won't come across as claiming to have had a clearer vision or bragging about my achievements.

Installment 2 makes the case that personal assistants are driving consumer technology and that the programmer's apprentice provides a vehicle for developing technologies that will have a huge impact on the future of consumer-facing applications. Specifically, applications that users can fluidly interact with for a wide range assistance and collaborate with to solve practical problems on intellectually equal terms. These are applications capable of understanding us on a deeper level by building models of us and creating episodic memories that record much more than just what we said.

Installment 3 uses the programmer's apprentice to illustrate how such a project could be broken down into technical subproblems tackled independently by separate teams, so that each subproblem would have its own trajectory, milestones and demonstrations, but separate teams would also have joint objectives and working subgroups to guide the integration of the component technologies toward project-wide milestones and demonstrations. The only difference between this and more conventional engineering efforts being the degree of uncertainty concerning the more open-ended parts of the problem. The resultant risks can be mitigated by building simple subsystems that provide limited capability or hybrid components that combine human and machine effort to simulate the desired behavior.

January 29, 2018

Here are some of the key design desiderata set for the proposed programmer's apprentice research project:

fully differentiable architecture constructed from standard network components and trained end-to-end;
capable of sustaining a constructive, indefinite-duration dialog focused on collaborative programming;
rich computer-human interaction, maintaining an extensive episodic memory spanning prior interactions;
automated code synthesis drawing on a large corpus of programs including descriptions and sample I/O;
fully instrumented development environment complemented by hierarchical distributed program embedding;
expectation network components will be familiar though their specific function and connectivity novel;

%%% Tue Jan 30 04:27:45 PST 2018

Here are three important management questions posed during my discussions with colleagues over the last few months:

How might a project such as the programmer's apprentice be broken down into technical subproblems tackled independently by separate teams?

The idea is to use the architecture of the human brain as a general framework in which to integrate modules supporting different functions. For this we need only the general pattern of how different regions of the brain exchange information and how their putative functions are related to one another. No attempt is made to model the microstructure of the human connectome, but rather to model the direction and valence — excitatory or inhibitory — of the connections, treating the anatomical boundaries of these regions and their functions as approximate at best and employing them as useful guidelines to support independent development and guide system-wide integration and evaluation.
How might the project be structured so each team has its own its own trajectory, milestones and demonstrations, but share joint objectives?

In spite of efforts to make sense of the brain as if it could be decomposed into functional modules arranged in neat block diagrams — such as those shown in Figure 38, with each block labeled with by a precise mathematical description realized as a transfer function with clearly defined inputs and outputs, evolved systems are messy and typically don't allow such simple descriptions. We start with well-defined functional components patterned after those believed to approximate those of the brain, exploit what is known about how these components are functionally related to one another and then depend on interface layers trained to sort out the details of how these components communicate.
How might independently working subgroups guide integration of the component technologies toward project-wide milestones and demonstrations?

The difference between this and conventional engineering efforts arises from uncertainty concerning the open-ended parts of the problem. The resultant risks can be mitigated by building simple subsystems that provide limited capability or hybrid subsystems that combine human and machine effort to simulate the desired end products. The primary functional components — dialog management, attentional schema, episodic memory, program embedding, proposal generation, etc — can be grouped into three subsystems roughly corresponding to the natural language interface, memory subsystem and code synthesizer, each of which can simulated by a combination of traditional programming and human intervention.

%%% Mon Jan 29 15:04:36 PST 2018

Sometimes it makes more sense to use the term imagination when talking about prediction, consciousness when talking about attention, episodic memory when talking about stack traces and event histories and association areas when talking about sensor fusion. The underlined terms are familiar to those of us working in computer vision whereas the italicized words smack of concepts from cognitive science and good-old-fashioned AI that have no precise computable interpretation — what Drew McDermott referred to as notation without denotation [199].

We use the italicized words instead of their underlined counterparts to describe capabilities important in designing agents that interact effectively with humans. We borrow from a growing literature that has made considerable progress in defining these concepts precisely enough to implement in working systems. Examples include Hassibis and Maguire's work on imagination as constructing hypothetical situations [132], Dehaene's model of consciousness as maintaining a global workspace [73], and Graziano's attention-schema model of self awareness and theory-of-mind reasoning [115].

What would it mean to build a limited capability episodic memory or attentional schema? The remainder of this entry provides capsule summaries of what we already understand and what we are prepared to move ahead on regarding five key problems: episodic memory, attention schema, theory of mind, dialog management and code synthesis. This document also includes more detailed examinations of these problems along with extensive bibliographical and summary information.

Human episodic memory (EM) is complex with diverse networks distributed all over brain. That said, there are a number of promising proposals [6] for implementing episodic memory including so-called Dynamic Memory Networks Kumar et al [169], variants of Neural Turing Machines Graves et al [105], self-organizing neural networks Chang and Tan [232] and reinforcement-learning methods based on reservoir sampling²³ [8].

The neural circuits implementing the executive functions for attention and executive control located in the prefrontal cortex orchestrate complicated motor functions and sequential decision making in addition to controlling the contents of working memory [166, 221, 165]. As discussed elsewhere — see Figure 38, there are two systems for executive control, one primarily involving circuits in the prefrontal cerebral cortex and a second system involving the cerebellar cortex, evolutionarily recent reciprocal connections between the front and back of the brain, basal ganglia and related subcortical circuits [153]

Attentional systems abound in recurrent neural network architectures [80, 12, 10, 105]. Michael Graziano's attentional schema theory [116] abstracts nicely from the details of the underlying neural substrate and is algorithmically straightforward [115]. Humans can — and routinely do — adopt an attentional stance toward anything they are inclined to attribute mental properties [77], and theory-of-mind reasoning embraces this diversity by allowing multiple attentional schemata and relying on episodic memory to sort out the different entities and their related mental capacities and proclivities.

Managing the assistant-side of the dialog between the assistant and the programmer in service to their collaboration may seem at first blush the most difficult technical problem. This may be so. However, it may also turn out to be the simplest to approximate for interim prototype testing due to the work of the Google Assistant team on both the NLU (natural language understanding) and NLG (generation) sides of the dialog. We are also looking into the prospect of handling some fraction of programmer-assistant exchanges by relating to code editing and reuse to an automatic planner designed for dialog management and developed as part of an earlier assistant project.

Obviously the programmer's apprentice is intended to facilitate code synthesis. This aspect of the project has consumed the most time in planning and our conclusion is that apprentice can fill a useful niche by ingesting a large corpus of annotated programs — programs plus English descriptions and sample input / output data — using context-aware hierarchical embeddings to propose code fragments for reuse in writing new code [99]. Elsewhere in this document you'll find summaries of both syntactic and semantic embedding techniques applied to code synthesis. We also have confederates at Google who have experience in automatic programming [311, 190, 2, 181, 211, 174, 201] and have offered ideas plus new projects starting up to exploit our own code repositories for reuse.

January 27, 2018

The first part of this discussion was all about choosing problems to drive ground-breaking research, drawing on experience from the first ten years of my academic career. My strategy then and now involves focusing on a problem — often the defining problem for a research area such as automated planning and control, redefining the problem if its potential is limited by current methodology, prevailing theory or entrenched dogma, asking what characteristics of the problem are preventing progress, and then spending substantial time exploring alternative avenues for promising technologies and problems that might break the logjam and reveal new opportunities.

This may sound like the sort of creative-problem-solving advice corporations pay consultants to inflict on their employees expecting to make them innovative. Some of that advice is obvious if not always heeded. Much of it your parents may have taught you, e.g., listen respectfully, provide constructive comments, don't interrupt, don't monopolize, etc. Trashing an idea after three minutes thought, dismissing entire disciplines without any idea of what they might have to offer, denigrating an approach because it is too mathematical or not mathematical enough, etc ... these are all behaviors not conducive to making progress. I'll assume you know how to behave and get on to the hard parts.

In industry, the focus is typically on near-term deliverables. Academia has traditionally focused on longer-term impact but increasingly the relevant incentives don't align with this view. In any case, I'm not interested in incentives here. I'm interested in how to come up with new ideas to drive game-changing, discipline-defining research. The first step is coming up with one or more problems that will challenge, inspire and require significant innovation. In my early academic years, the focus was on building automated planning and logistics systems to control robots, route vehicles and manage inventories, and the technical challenges involved dealing with uncertainty and real-world dynamics.

We are still working on those problems using a host of new technologies but the target robots, vehicles and businesses are now real and their applications are forcing innovation and driving research and development. Today many of the technical challenges we face involve building autonomous systems capable of intimate and constructive interaction with human beings. Ultimately, these systems will need to understand our strengths and weaknesses, learn what drives us, help us solve problems, make us smarter and more self reliant. Learning how to supply us with what we crave and shelter us from what we abhor is not, I offer, a goal worthy of our technical prowess or social aspirations.

The programmer's apprentice is a digital-assistant application. It requires subtle interaction between the programmer and apprentice and the ability to engage in collaborative problem-solving on a task that, while challenging, is within our technical ability. The interaction requires the apprentice to recover from the inevitable misunderstandings that occur in a collaboration between individuals with differing levels of communication skill and task-relevant knowledge. The assistant will require the ability to construct a theory of its own mind and that of the programmer in order to keep track of what was said by whom, who knows what, and when and in what context knowledge was imparted.

Systems capable of theory-of-mind reasoning, maintaining a detailed episodic memory of events, designing and debugging useful computer programs, and collaborating with a human on a technically-demanding task don't exist. They are, however, worthy challenges. The clues for their design come from a wide range of disciplines including developmental psychology and cognitive science, cognitive neuroscience, systems neuroscience, linguistics, philosophy of mind, artificial neural networks, automated program synthesis, human-computer interaction, etc. The strategy for making progress is to design a composite architecture based on components drawn from recent work on recurrent neural networks.

Inspiration from the biological and cognitive sciences is not new by any means. Demis Hassabis has proved especially adept in leveraging knowledge from these disciplines to inspire engineers to develop new architectures exhibiting novel capabilities. You only have to think about how humans solve problems to conclude that our brains represent a treasure trove of design ideas. You don't need to understand the neural correlates of human episodic memory in order design neural networks that emulate human memory. You don't need to understand the cognitive development of theory-of-mind reasoning in children to build a personal assistant that can emulate this capability interacting with humans.

Theories about what it means to be conscious continue to rage in academic circles while engineers take inspiration from the few concrete facts we know about consciousness to build systems that selectively attend to relevant stimuli and maintain simple representations of self that facilitate social interaction. The point is there are ideas a plenty to inspire new technology. In the first ten years of my research career, the windfall was to be had from mathematics and the engineering sciences due to the fact that AI had isolated itself and was too confident of its ability to invent the future on its own. We exhibit our hubris by ignoring the biological, cognitive and social sciences²⁴.

We work in machine perception! What could we learn from studies of consciousness or episodic memory? Growth requires taking on new challenges and acquiring new perspectives. When Peter Norvig asked me what I wanted to do if I came to work at Google, I said I wanted to apply computational neuroscience to artificial intelligence and had no interest in ever working on Markov decision processes again. If you want a successful academic career your best bet is to make a name for yourself in some niche area and then mine it for all it's worth. If you want an interesting, productive career you need to constantly evolve to keep yourself fresh. You need to balance exploration and exploitation.

Research progress in brain science is accelerating, but it is not the only field worth tracking. Interest in brain imaging has spurred innovation in photonics and image processing. The nascent field of computational biology²⁵ is providing insight into how plants, bacteria, genes and social organisms compute at multiple scales. These fields enrich our understanding of nature and physics and can't help but inform the study of perception. Working on a problem that forces you to stretch your skills and expand your knowledge doesn't require you to abandon your current field of expertise, but it will help you to see it with a fresh perspective. Enrico Fermi even suggested we should switch fields every ten years²⁶.

January 25, 2018

%%% Fri Jan 26 04:37:54 PST 2018

Over the last two weeks, I participated in several discussions about how to guide research and whether for a given enterprise it makes sense to aim for relatively conservative incremental progress or make a contrarian effort to identify and exploit potentially risky alternatives. I have almost always opted for the latter, though I can't recommend the strategy to others without some reservation. This entry summarizes my research trajectory for the first ten years of my twenty year academic career²⁷, and the next entry covers my recent efforts and suggests an alternative strategy that might work now at Google.

At Yale I worked with Drew McDermott on automated planning and came to the conclusion that existing planners were limited by having an impoverished representation of time and causality. I developed the idea of time maps and temporal database systems for applications in robot planning, scheduling and logistics [55, 56, 58, 59], and, for some time, the hierarchical planning system that my fellow graduate students and I designed and built was the most sophisticated robot planner available [92, 203, 64].

During the same period, I started collaborations with researchers in statistics and applied math expert in Markov decision processes (MDP) believing that AI planning was limited by its failure to embrace probability and decision theory. AI planning was still closely tied to STRIPS [91, 90] and was believed to be incompatible with Markov processes due to the intractability of underlying inference problems. I pointed out to anyone who would listen that STRIPS fluents were simply state variables and even the simplest planning systems were NP-hard [57]. In 1989, I started work on a book entitled "Planning and Control".

Planning and Control [65] introduced the AI community to Markov Decision Processes and their partially observable counterparts demonstrating how traditional AI planning systems could be refactored as Bayesian networks [65]. Ten years later when Craig Boutilier, Steve Hanks and I wrote a survey of the field [33] there were thousands of related papers. Planning and Control also introduced control theorists to Bayesian networks as an antidote to LQG control systems — linear dynamics, quadratic objective functions and Gaussian noise models.

Occupancy grids were introduced by Moravec and Elfes [208] as an approach to robot planning and navigation using wide-angle sonar and dealing with uncertainty. They were an advance over earlier work based on Kalman filters but decoupled planning from control making it difficult to integrate the two in probabilistic decision making. To address these shortcoming we developed the first Bayesian approach to Simultaneous Localization and Mapping (SLAM) [61, 22] for mobile robots well in advance of the work by Fox and Thrun [93].

One complaint about non-trivial probabilistic models is that probabilistic inference is intractable. The fact that inference on any sufficiently expressive AI representation is intractable [57, 63] didn't dissuade detractors from complaining about our models. To address these concerns, we recast real-time decision making as a scheduling problem in which the task is to allocate processor time to approximation algorithms that can be interrupted given any increment of processor time to provide an answer whose expected value monotonically increases as a function of run time.

We called these approximations anytime algorithms and with the publication of our AAAI paper [62] the terminology quickly entered the lexicon of AI. Herbert Simon [261] introduced the idea of bounded rationality, and, while I.J. Good [101], Eric Horvitz [146], and Stuart Russell and Eric Wefald [251] contributed substantially to the theoretical foundations, it might be argued that the notion of anytime algorithms had more of a practical impact on the evolution of AI during the decade following its publication²⁸.

%%% Thu Jan 25 04:57:29 PST 2018

This is the 200th anniversary of the first edition of Mary Shelley's Frankenstein. The book has been mentioned in several news stories and appeared in book-club lists focusing on the theme of the modern Prometheus and concerns about the near-term prospects for superhuman AI systems. These excerpts²⁹ caught my attention for their nuanced view of a young man seduced by the idea of controlling nature, defeating death and altering the fate of humanity. In her portrait of Victor Frankenstein, Shelley underscores the capacity of youth to take on any challenge with the confidence they can achieve the most audacious goals and the certainty they can understand and control the consequences.

January 23, 2018

%%% Tue Jan 23 04:57:29 PST 2018

Despite pessimism expressed in an earlier note, selective reading of Dere et al [78] Handbook of Episodic Memory has yielded some useful insights. Compressing that insight into a few paragraphs appears to be more difficult, illustrating the maxim that the less we know the more space we need to express it. The following somewhat scattered notes consist of excerpts from the handbook and commentary on selected chapters.

In the following discussion, I will borrow liberally from The Handbook of Episodic Memory edited by Dere et al [78] and, especially, from the general overview in Chapter 1.2 "Exploring Episodic Memory" written by Martin Conway [48] and the focus on the prefrontal cortex in Chapter 3.5 "The role of the prefrontal cortex in episodic memory" written by Matthias Brand and Hans Markowitsch. Unless otherwise made explicit, assume attribution to one of these sources.

The term episodic memory is often linked to the hippocampus, but that is just one of many brain structures hypothesized to play an important role. Similarly, auto-associative memory models including Hopfield networks are thought to be relevant, but, again, the full story is more complicated.

If you want to understand the neural correlates of human episodic memory (EM), your best bet is to learn about the deficits of patients who have suffered lesions in the various parts of the brain that are thought to contribute to EM. It will help in reading the relevant literature if you know what the basic parts of the brain are called. Here they are listed (parenthetically) along with their primary structures (SOURCE):

Medulla Oblongata (Myelencephalon)
Pons and Cerebellum (Metencephalon)
Midbrain (Mesencephalon)
Thalamus and Hypothalamus (Diencephalon)
Cerebral Hemispheres (Telencephalon)

Most of the anatomical discussions concerning episodic memory focus on the forebrain that consists of the telencephalon and diencephalon, where the forebrain, along with the midbrain and hindbrain constitute yet another way of parceling up the brain.

The cortex gets divided into frontal, parietal, occipital and temporal lobes whose wrinkled surfaces are further landmarked with sulci, gyri and fissures and divided using the microscopic anatomy of cells and tissues into areas first mapped out by Brodmann that continue to be amended and debated today.

The primary characteristics of episodic memory as first set forth by Endel Tulving [286] and subsequently amended and extended by several authors, are shown here grouped into categories and summarized following their exposition in Dere et al [78]:

Ontological: (i) Sensory-perceptual-conceptual-affective processing derived from working memory; (ii) Retain patterns of activation/inhibition over long periods; (iii) They are predominantly represented in the form of (visual) images; (iv) They always have a perspective (field or observer);
Functional: (v) They provide a short-term record of progress in current goal processing; (vi) Represent short-time slices, determined by changes in goal processing; (vii) They are represented roughly in order of occurrence, temporal dimension; (viii) In humans they are only retained in a durable form if they become linked to conceptual autobiographical knowledge (rapid forgetting); (ix) They are the mental representations from which concepts are formed;
Phenomenological: (x) They are recollectively experienced when accessed; (xi) When included as part of an autobiographical memory construction, they provide specificity;
Srtuctural: (xii) Neuroanatomically they may be represented in brain regions separate from other autobiographical memory knowledge networks;
Developmental: (xiii) Phylogenetically episodic memory may be a species-general evolutionary old memory system; (xiv) Ontogenetically the ability to form episodic memories may be present early in development;

The brain areas involved in mediating the recall of memories that consist of autobiographical knowledge and episodic memory are distributed from anterior to posterior brain regions. Regions in the prefrontal cortex (PFC), such as lateral, medial, and ventromedial PFC networks³⁰, have been found to be critical in initiating searches of long-term memory and in evaluating knowledge once accessed. Other medial temporal lobe³¹ structures including those adjoining Wernicke's area³² appear to be important in the experience of remembering and the emotional content of memories.

Finally, posterior networks and the retrosplenial cortex³³ and related areas as well as the visual cortex (occipital³⁴, cuneus³⁵, precuneus³⁶) become active when sensory perceptual EMs enter into the construction of autobiographical memory. Some suggest that abstract conceptual knowledge about periods in a person's life and about the self — termed the conceptual self — may be represented in frontal networks, other more detail knowledge about general events, others, goals, actions, activities, locations, others, etc. may be represented in temporal networks and EMs and temporal occipital networks. In this scheme, EMs are located in brain regions that are separate from more conceptual knowledge of an individual's life³⁷.

More briefly, Chapter 3.6 "The basal forebrain and episodic memory" [95] makes the case that the basal forebrain likely plays a role in memory recall, "especially in the search for memory content from designated temporal context, or in postretrieval monitoring of memory content whether or not it is matched with designated temporal context, or both."

Chapter 3.7 "The role of the precuneus in episodic memory" [282] discusses the possible role played by the precuneus in episodic memory "with special attention to the link between episodic memory consolidation and the default mode of brain function during the conscious resting state, as recently outlined by functional imaging studies."

Chapter 4.1 "Neural coding of episodic memory" [284] is perhaps the most intriguing from a modeling perspective given its claims that "recent identification of network-level organizing principle and memory-encoding units in the hippocampus has allowed real-time patterns of memory traces to be mathematically described, intuitively visualized, and dynamically deciphered." The full abstract is quoted below to further pique your curiosity:

Any given episodic event can be represented and encoded by the activation of a set of neural clique assemblies, which are organized in a categorical and hierarchical manner. This hierarchical feature-encoding pyramid is invariantly composed of the general feature-encoding clique at the bottom, subgeneral feature-encoding cliques in the middle, and highly specific feature-encoding cliques at the top.
This hierarchical and categorical organization of neural clique assemblies provides the network-level mechanism the capability of not only achieving vast storage capacity, but also generating commonalities from the individual behavioral episodes and converting them to the abstract concepts and generalized knowledge that are essential for intelligence and adaptive behaviors.

Furthermore, activation patterns of the neural clique assemblies can be mathematically converted to strings of binary codes that would permit universal categorizations of the brain's internal representations across individuals and species. Such universal brain codes can also potentially facilitate the unprecedented brain–machine-interface communications."

January 21, 2018

%%% Sun Jan 21 04:57:26 PST 2018

How does the brain know enough to create a representation of a new entity that might have much in common with an existing entity and yet is clearly distinguishable as an independent separate entity, e.g., an unrecognizable voice on the phone? Along similar lines how does the brain distinguish special cases (subclasses) of a general class, e.g., animals that can solve puzzles, signal intentions, speak or follow directions? What about the case in which an instance of such a subclass is deemed to have private attributes, e.g., thoughts, feelings and knowledge, that are not (obviously) apparent by means of a superficial examination? We have representations of photos, videos and music recordings. What about our parents, pets and possessions along with a lifetime of personal memories that feature these central figures and fixtures in our lives?

Unlike what is known about how the brain represents physical objects — the shape, color and motion of visual stimuli are represented in an orderly fashion in the visual association areas, episodic memory is a great deal more complicated, emerging gradually and developing over many years, integrating information from a great many sources. I've picked up a few ideas from my cursory reading of the literature — see here for a small sample of interesting papers; however, while some of those ideas may help in developing an application-specific version of episodic memory for the programmer's apprentice, human episodic memory is so intricately interwoven within our neural circuitry and integrated into our complicated social lives that any attempt at duplicating it seems premature³⁸.

January 19, 2018

%%% Fri Jan 19 17:47:36 PST 2018

I spent Thursday and Friday at the MIT-X AI Summit at Google X. Especially learned from Josh Tenenbaum, Jim DiCarlo, Dan Yammins, Matt Wilson and Nancy Kanwisher. Josh got me to look at probabilistic programming more carefully [273], Matt made me think again about the hippocampus and its role in episodic memory [212] and Nancy about the prospects for human functional modeling by combining EM enhanced DTI tractography with fMRI functional studies³⁹.

The probabilistic programming model described in [273] provides offers a powerful explanatory theory of mind worth exploring more deeply. Other Bayesian generative models [98, 102] provide insight into human cognition, child development and the evolution of intelligence. Despite these advantages, differentiable models provide a framework for experimenting with integrating a wide range of cognitive capabilities, and, while far from modeling biology at the molecular level, provide insight into how distributed codes and gradient descent learning can implement complex learning machines.

I believe that Stanislas Dehaene [68] and Michael Graziano [107] provide enough of a basis for theory-of-mind reasoning system that it — or a useful approximation — can be implemented in an end-to-end system⁴⁰. Training data is a potential problem and this is one reason why a probabilistic programming approach might provide the most expeditious approach for implementing theory of mind in an application such as the programmer's apprentice. The cognitive science research paradigm suggests one possible approach for generating a dataset for learning theory-of-mind reasoning:

Have the target theory-of-mind learning system watch videos of kids interacting with one another and learn to predict what they will say in answering questions. Watch videos of children in Anderson's Hide-Find Paradigm [5, 272] intended to investigate how we learn that different minds know different things. Formulate an hypothesis to clearly express how the children are searching for a theory to explain what they are observing and how they go about formulating and testing hypotheses [104]. See here for theory-of-mind related papers and here for a first pass at collecting papers on episodic memory.

January 16, 2018

%%% Tue Jan 16 04:40:09 PST 2018

This entry relates some of what we know about contextual embedding from research in computer vision and natural language processing, and suggests how we might apply this knowledge in the case of code synthesis. The objective is to embed subtrees of an abstract-syntax-tree (AST) given the context of the enclosing AST minus the selected subtree. Fortunately, the parsing problem is solved and we can construct a canonical AST given any well-formed program fragment from most modern programming languages. Such an embedding is intended to be used in program synthesis to generate proposals corresponding to code fragments given the context of a partially completed target program to be inserted into the target at a designated location.

Mikolov et al [200] introduced two architectures for learning word embedding architectures. The Skip-Gram architecture predicts surrounding words given the current word and the CBOW architecture predicts the current word based on the context, i.e., the n words immediately preceding and n words immediately following. Since its publication, there have been several extensions including Le and Mikolov [175] working with sentences and documents, Kiros et al [161] work on Skip-Thought Vectors embedding sentences within adjoining sentences, and Lazaridou et al [173] Multimodal Skip-gram Architecture combining language and vision.

Since we are interested in generating code fragments to insert in a program under construction, the basic CBOW architecture captures our intended meaning of context. However, given we are working with code fragments and not individual words, the original implementation of CBOW lacks the necessary power of more sophisticated embedding methods. As an example, given the AST representation of a program⁴¹ T = { A → B, A → E, B → C, B → D, E → F, E → I, F → G, F → H, I → J, I → K }, the analog of the CBOW-architecture shown on the left in Figure 1 of [200] for the input consisting of T and the subtree S = { F → G, F → H } rooted at F — sans explicit nesting directives — would look something like [A, B, C, D, E] ⊗ [I, J, K] = [F, G, H].

Not surprisingly, inserting a fragment from one program into another based entirely on syntactic features can have unwanted semantic consequences. Long Short-Term Memory (LSTM) language models are capable of keeping track of information revealed at one point in translating a block of text in order to apply it at a later point. For example, the cell state of an LSTM might encode the gender of the subject of the current sentence, so that correct gender-specific pronouns might be employed in translating subsequent sentences. Upon encountering a new subject, we may want to forget the gender of the old subject and guess the gender of the new subject if there are reliable clues available [214]. We expect code-fragment embeddings will require similar adjustments.

The sample programs shown in Figures 34 and 35 are given a sentence and pair of keywords as part of their input. In scanning a given input sentence, they keep track of the last occurrence of these keywords in order to determine whether the one keyword is separated from another keyword by a span of no more than a specified number of words. Given the stub corresponding to the nested define and let statements from Figure 34, the programmer's apprentice might propose the (embedded) do loop fragment from Figure 35 and then make additional adjustments to obtain a correct program. Explaining in detail how this might be accomplished is our next challenge.

%%% Tue Jan 16 15:50:33 PST 2018

In terms of leveraging ideas from neuroscience, I've been revisiting papers on the role of prefrontal cortex in attention, executive control and working memory⁴² . The Ba et al [10] work on fast weights — von der Malsburg's dynamic links [308] — continues to intrigue me as a candidate for working memory. In recent networks implementing some form of fast weights, outer products have been used to generate weight matricies in an Hebb-like manner as introduced by Schmidhuber [255] and layer normalization [11] has been shown to be effective at stabilizing the hidden state dynamics in recurrent networks. Here are two examples from Dehaene [68] that illustrate the persistence of ideas from symbolic computing⁴³.

January 15, 2018

%%% Mon Jan 15 04:23:19 PST 2018

Over the last few years the terms attention [12, 119], imagination [223] and consciousness [25] have entered into lexicon of machine learning and found application in developing artificial neural networks. All three of these involve memory whether that be short-term, long-term or working memory. Indeed, some argue that the main role of consciousness is to create lasting thoughts — not what cognitive scientists refer to as "episodic memory", but, rather, information relevant to solving a particular problem so that it remains fresh in our mind for as long as we need to attend to it and solve the problem at hand⁴⁴.

The application of artificial neural networks to robot planning, automated programming and complex decision making depends on developing subtle methods for creating, maintaining and manipulating differentiable representations in memory. This year, CS379C emphasizes recent developments in artificial neural network memory systems to address challenges in these applications. Here are four tutorials that along with having taken a course in machine learning and acquired some programming experience with artificial neural networks should prepare you for related topics in the class lectures⁴⁵:

Recurrent Neural Networks — Andrej Karparthy [157] (PDF)
Convolutional Neural Networks — Andrej Karparthy [158] (PDF)
Long Short-Term-Memory Networks — Chris Olah [214] (PDF)
Attentional and Augmented Networks — Chris Olah and Shan Carter [213] (PDF)

January 12, 2018

%%% Fri Jan 12 14:38:46 PST 2018

The following three entries in this log summarize the current status of the programmer's apprentice project proposal:

January 11 — Provides an extended example of interacting with the apprentice and a description of the executive control system that handles the user interface and facilitates code synthesis.
January 9 — Focuses on the weakest element of the proposed architecture, creating rich embedding contexts to generate proposals for program transformations to be applied in code synthesis.
January 7 — Provides a high-level description of the neural-network architecture and how it relates to the multiple input sources, including the instrumented IDE shared with the programmer.

January 11, 2018

%%% Wed Jan 10 09:48:42 PST 2018

The programmer's apprentice is exactly that, an apprentice or novice programmer in training that can assist with some of the more tedious minutia involved in writing programs, and a digital amanuensis that can make use of a large corpus of existing programs in order to propose using fragments of those programs in the development of new programs, and is able to perform syntactic and semantic analyses of code fragments to assist in transforming — adapting and debugging — a fragment of one program for use in another program.

Most of the discussion in this document focuses on the neural network architectures required for performing these syntactic and semantic analyses and facilitating search for suitable program fragments to suggest as proposals in developing new code. We refer to the apprentice using the acronym PAN — [P]rogrammer [A]pprentice [N]eural-network — and note that as a dutiful apprentice PAN has to interact with the programmer to explain proposed code modifications, elicit feedback, ask for assistance and generally participate in a form of human-computer pair programming⁴⁶.

Conversing in natural language and sharing a graphical representation of the current programming project facilitate such interaction. The graphical representation is essentially the editor window of an Integrated Development Environment (IDE). This IDE is instrumented allowing the apprentice to make any changes a programmer might make to a program including editing, formatting, searching, selecting, copying and replacing text, executing macros that control the debugger to set breakpoints and step through programs, plus invoking a number of capabilities that exercise PAN's neural-network architecture.

PAN also requires a language facility including a language model suitable to span the technical argot, idioms and ontology employed in software engineering practice. In the initial phase of developing PAN, we will resort to a somewhat conventional dialog system for natural language understanding and generation, choosing to focus effort initially on how to apply artificial neural networks / differentiable models to generate useful proposals during code synthesis. This initial dialog manager is modeled after a prototype developed for the Zinn project.

PAN's executive control system including its dialog management subsystem is essentially a hierarchical planner [30] and as such everything it does including every utterance it makes, every action taken in response to a user utterance or in pursuing its own self-generated code-synthesis-related plans is motivated by its goals in assisting the programmer. Most activities generate new goals that spawn new plans only some of which will actually be expanded into actionable tasks such as applying code transformations, testing hypotheses, asking the programmer for guidance, resolving ambiguous instructions or recognizing and recovering from misunderstandings.

%%% Thu Jan 11 04:52:40 PST 2018

Here is an illustration of the programmer (USER) interacting with the apprentice (SYSTEM) in the process of developing a program. A human-readable synopsis (EXECUTE) of the tasks executed from SYSTEM-generated plans is shown interleaved between utterances:

SYSTEM: I see you've added a new program specification to your code repository would you like me to ingest it?
EXECUTE: ingest[ Description, I/O_Samples, [Program ID, 1089] ], validate[ 1089 ], convert[ 1089, [Format, "STD"] ]

SYSTEM: The description and input-output pairs look good. I converted them to our standard format. What's next?

USER: Go ahead and try to write a program that meets the supplied specification. Tell me if you have problems.

EXECUTE: embed[ Syntax, 1089 ], generate[ Traces, 1089 ], embed[ Semantics, 1089], propose[ Main_Program, 1089]

SYSTEM: Performed syntactic and semantic analysis and incorporated with description. Check out the initial stub.

USER: Looks good. Assume three formal parameters consisting of two of type string and the third a list of strings.

EXECUTE: attend[ {Syntax, Semantics}, 1089 ], select[ [Stub, 1089] , [Current, 1089] ], insert[ Selected ]

SYSTEM: I've inserted a loop to iterate over the strings in the third argument list and stubbed out a conditional.

USER: You need a counter to keep track of the location of the last occurrence of the second argument in the list.

In this example, the SYSTEM ingests the specification for a new target program, checking to make sure that the specification can be massaged into the standard format required to ingest such specifications. It then creates a distributed representation of the description and enters the I/O samples in the IDE buffer in the required format so they can be used during subsequent development and debugging. Following a standard protocol, the SYSTEM might create a MAIN program or DEFINE a procedure employing default templates. Ingesting the USER supplied description of the formal parameters, the SYSTEM produces a probability distribution it can use to generate proposals to fill in the BODY.

Noticing that the top five proposals are syntactic variants of one another, the SYSTEM inserts the most common variant, in this case a LOOP over a list, having simplified the proposal fragment by eliminating other expressions in the LOOP body pending feedback from the USER. Proposal generation relies on a contextual embedding network using the current program, consisting in this case of the description, I/O samples, initial stub and supplied formal parameters, as the context. Internally, programs are represented as abstract syntax trees thereby finessing the problems of comparing programs, determining fragment boundaries and composing new programs from existing fragments in the embedding space.

def confirm_play_m(state, type):
    if type == 'play':
        if state.var['confirm']['play']:
            # it's already confirmed!
            interject(expand('I_GOT you want me to play some music.'))
            return []
        elif state.var['premise']['play']:
            # it needs to be confirmed.
            bindings = {'this':history(), 'music':got_music(state)}
            utterance = 'WHEN_YOU_SAY "$this" IS_WAY_SAY "play $music"?'
            # ask for confirmation.
            input = interact(expand(utterance,bindings))
            if lookup(input,'positive'):
                # yay, play is confirmed.
                interject(expand('I_GOT you want to hear some music.'))
                return [('commit', 'confirm', {'play':True})]
            else:
                # uh oh, gonna be complicated.
                return [('recover', 'play')]
        else:
            # this is totally unexpected!
            return [('recover', 'play')]
    else:
        return False

Figure 39: Here is a sample plan of the sort managed by the SYSTEM hierarchical planner and used in service of dialog management. The executable tasks interject and interact are primitives that produce (immediate) interlocutionary acts. The lookup function invokes a neural-network subsystem that performs sentiment analysis and estimates salience. Each plan returns a (possibly empty) list of subtasks that are added to a queue of such tasks maintained to account for the highly dynamic nature of conversational dialog.

Macros like WHEN_YOU_SAY, IS_WAY_YOU_SAY and I_GOT generate random variants of conversational idioms to add diversity in speaking often-repeated multi-word (phrasal) synonyms. They are currently implemented using phrase embeddings, and won't be required when SYSTEM utterances are generated with encoder-decoder recurrent network technology. The plan shown here was part of a prototype system intended to assist users interacting with Google Play Music, but could be adapted to suit the proposed programmer's apprentice application.

%%% Thu Jan 11 09:22:35 PST 2018

Plans are essentially programs and the hierarchical planner is basically a plan interpreter and runtime manager. Plans are used to manage the SYSTEM-side of the dialog and handle all of the many aspects of developing programs — see Figure 39. The current implementation of the planner is written in Python. In principle, the target programming language of PAN could be the same as the programming language used to implement plans and the SYSTEM could write programs to modify its behavior, perhaps using a specially designed sandbox to experiment with self-modifying code and some sort of immutable kernel designed as a fallback in case its modified self runs amok.

The potential for the apprentice to write code implementing new plans that perform complex protocols with loops and conditionals and tasks that control IDE instrumentation is considerable. However, for the time being, the hierarchical planning framework is being used as an expedient in developing a prototype in which the focus is on code synthesis and, in particular, on generating proposals corresponding to program fragments that can be relatively easily adapted to solve new problems. We anticipate that much of the conversational capability leveraged in the aforementioned hierarchical planner can be improved upon using neural-network-based NLP and recent advances in developing value iteration networks [120, 278].

January 9, 2018

%%% Tue Jan  9 03:45:23 PST 2018

I briefly reviewed the work on distributed representations and their relationship to compositional representations and the variable binding problem. Specifically, Kriete et al [168, 221] on how areas of prefrontal cortex might provide working memory that supports slot-filler manipulations relying on gating involving the basal ganglia. Hassabis and Maguire [132, 131] hypothesize that hippocampus plays a critical role in imagination by binding together the disparate elements of an event or scene. Earlier work on circular convolution by Tony Plate [231, 230] PDF and the tensor-product representations by Paul Smolensky [262, 263] are briefly reviewed in Kriete et al [168], but don't reflect more recent research [148]. None of these approaches strike me as satisfactorily supporting the sort of operations anticipated below.

Recently there has been renewed interest in the work of Christof von der Malsburg [289, 290] on his concept of dynamic weighting also known as fast weights — see Hinton and Plaut [141] and Ba et al [10]. Dynamic weighting was developed to deal with the problem of establishing short-term connections between concepts to express time-dependent relationships between existing distributed representations. This makes it possible to efficiently construct and reason about complex objects as a dynamically linked composition of multiple objects without complex and costly copying or permanently altering the network — see here for a list of relevant recent and historical titles and abstracts⁴⁷ . Using the term "aspect" to refer to what are traditionally called variable-binding pairs, slot-filler notation, etc., and masking the difference between the "executive control" and "attentional control" systems — if such a difference even makes sense⁴⁸ , we want to understand how an attentional control system might be trained to support the following operations:

select representations that support a particular instance of given task;
activate a selected representation in space allocated in working memory;
maintain — keep activated — a representation in working memory;
suppress, maintain or emphasize aspects of an activated representation;
fill a slot or alter a variable binding in an activated representation;
link activated representations to construct composite objects as proposals;
select an activity and initialize to suit current context and circumstances;
return attention to an ongoing activity resuming work where you left off;
on reactivation update activity accounting for changes since last activation;
terminate activation returning any supporting resources to available status;
release bindings and other temporary adjustments to relevant state vectors;

Here is the primary related use case for the programmer's apprentice that I have in mind: In a machine-translation or dialog-management system implemented as an encoder-decoder pair of recurrent neural networks, we usually build up a representation — often referred to as the context of a text or conversation — in the encoder by ingesting the text comprised of a sentence or longer statement in the original source language or an utterance generated by one of the participants in a conversation, and then construct a translation in the target language or a response from the other participant in the dialog. Ideally, the context can be as rich as it needs to be, encompassing information from, not only the text of a book being translated up until the start of the specific sentence being translated or the ongoing conversation up until the last utterance in the conversation being responded to, but the much larger social and intellectual context of the book or conversation. The work of Ba et al [10] on dynamic links and fast weights specifically invokes attention noting that fast weights can be "used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models".

%%% Wed Jan 10 03:41:31 PST 2018

The relationship between embedding symbol sequences in vectors and variable binding is illustrated in the work of Huang et al [148] on tensor-product generation networks (TPGN) leveraging Smolensky's original tensor-product approach to variable binding [263]. In theory, TPGN networks inherit several desirable characteristics from Smolensky's work listed on Page 64-65 in [262] including that such networks (i) saturate gracefully as larger structures are represented, (ii) permit recursive construction of complex representations from simpler ones, and (iii) respect the independence of the capacities to generate and maintain multiple bindings in parallel. Their application of TPGN's to image captioning outperforms the widely-used LSTM-based models for images in the MS-COCO dataset [148]. The authors claim that "the learned representations developed in a crucial layer of the model can be interpreted as encoding grammatical roles for the words being generated." If their claims hold up, this approach could be well suited to the problem of embedding programs for code synthesis.

January 7, 2018

%%% Sat Jan  6 04:31:07 PST 2018

In this log entry we revisit the original architecture shown in Figure 29, providing more detail concerning the various layers and subnetworks and the functions that they are intended support. This is not the last word by any means and there are many details I have yet to figure out, but the architectural features described here go some way toward explaining how we might design a programmer's apprentice of the sort we envisioned back in December. The main contributions of this exercise are summarized in the caption of Figure 37 and elaborated in the following.

The red and green connections shown in Figure 29 are assumed in Figure 37 since they support the attention-based executive control model described here. These connections enable the system to become aware of thoughts activated by excitation from the sensory-motor periphery and, by attending to such thoughts, make them available to other thoughts held in conscious awareness. Representations associated with these thoughts can be combined and shaped by excitatory and inhibitory connections to facilitate imagining novel situations, making predictions and generating proposals to support code synthesis.

The representations corresponding to thoughts have to persist over time to participate in the construction of new composites. The recurrent layers are capable of activating multiple contexts, sustaining their activity over indefinite periods of time and taking top-down direction from layers modeled after the prefrontal cortex (PFC) to enable the vector analog of slot filling and variable binding using inhibitory and excitatory feedback from the PFC executive-control attentional system. This model of attentional control is inspired by Yoshua Bengio's consciousness prior [25]. See here for related background on working memory, prefrontal cortex and the role of the basal ganglia⁵⁰ .

Figure 37: This graphic represents a neural architecture consisting of several collections of neural network layers rendered here as cloud-like shapes arranged in four levels of a hierarchy. The top three levels shown here roughly correspond to the diagram shown in Figure 29 labeled A. The networks shown in the bottom level are associated with the different input and output modalities depicted in Figure 29 labeled B, C, D and E.

The clouds for handling structurally-indexed input and output include written text and spoken word sequences, version-control repository submits, abstract syntax trees, natural language parse trees, execution traces and various topographical maps and time series, and are rendered here as unrolled recurrent networks showing replicated instances of the structure separated by dashed lines. Networks responsible for primary sensory areas in the cortex are depicted as convolutional stacks not showing any recurrent connections though these may indeed be important to add in some cases. The bottom two levels could have been reversed or interleaved but are shown stacked to highlight their separate function.

Networks such as those representing the sort of complexity we might expect in cortical association areas are indicated as cloud shapes within which rectangles representing layers are shown with complex recurrent connections. Though not explicitly shown as such, these cloud shapes are arranged in hierarchies intended as containers for different abstractions of their corresponding inputs appropriate to their level of abstraction. The component layers are not predesignated or preapportioned, but rather their designations and apportions are determined by the complexity of their input — time varying as in the case of the analog of somatosensory cortex — and the relevance of their abstractions in carrying out rewarded activities.

The two control diagrams shown in Figure 38 provide additional architectural structure motivated by features of the primate cerebral cortex and its interaction with the cerebellar cortex. The first diagram illustrates how the systems responsible for natural language comprehension and generation are coupled, suggesting a general model for how meaning and manipulation are related in embodied systems. The second diagram abstracts the advanced (biological) control system architecture that humans use for precise, complex planning and prediction that resulted from the relatively recent complementary evolution of the cerebral cortex and cerebellum in primates and Homo sapiens in particular.

Figure 38: The two control diagrams shown here provide additional architectural structure motivated by features of the primate cerebral cortex and its interaction with the cerebellar cortex. The top diagram in Figure 38 shows the linguistic version of the dual-stream hypothesis [126, 246] starting in Wernicke's area and terminating in Broca's area, mapping visual / auditory sensory representations onto manipulatory / articulatory motor representations with the two paired much as in a recurrent-neural-network-based encoder-decoder machine-translation system — see Figure 31 for more detail.

The bottom diagram in Figure 38 is a simplified version of the Ito [153] model introduced in Figure 36. The diagram shows a schematic illustration of explicit and implicit thought processes. The working-memory system and attentional system together constitute a controller for a mental model during the attentive phase of (directed) explicit thought. The part of the prefrontal cortex that carries out executive functions acts as a controller for less-attentive explicit thought carried out in the olivo-cerebellar complex. The inverse model provides the basis for a feed-forward controller.

%%% Sun Jan  7 05:42:51 PST 2018

Programmers hold a great deal of information in their heads when writing complex programs. Even ignoring the knowledge acquired through years of practical coding experience and reading and reviewing millions of lines of code, a short program consisting of a few thousand lines is a complex structured object with many interacting parts, dependencies on multiple libraries with complicated interfaces and knowledge of the relevant application area. We expect the provided program-specific, standard-format information relating to the current programming project is entered in a buffer or special area of the instrumented IDE so it can be referenced, reviewed and revised if needed. Making it accessible for subsequent inference is, however, a challenge.

Information relevant to the current project is ingested at the outset and represented in the network as part of the attentional schema so that it can be brought to bear as a reference — part of the global context — in guiding subsequent processing and specifically program synthesis. It is expected the programmer will also inject comments that clarify or otherwise add to the information relating to the program specification and that these will find their way through various paths to augment and amend this information, perhaps routed by selective attention using verbal cues to ensure that its intent and relative importance is clear. It's not obvious how to engineer the desired outcome, but it may be as simple as leveraging reinforcement learning with an appropriate prior.

January 5, 2018

%%% Fri Jan  5 03:20:34 PST 2018

So what is missing? We are making some progress in terms of developing representations — generally vector spaces trained by encoder-decoder pairs consisting of recurrent neural networks — that are capable of capturing some aspects of the syntax and semantics of code fragments. We are missing contextualized abstractions (proposals) and modes of thinking (methods) that enable us to apply such abstractions when the context warrants, adapting these if need be to suit the circumstances.

The result from applying such modes of thought need not correspond directly to inserting well-formed formulas / program fragments. They could simply help to create a more nuanced context for proposal generation — altering the connectivity of recurrent layers — that would serve to constrain subsequent inference and that would, in due course, produce some tangible output in the form of a code fragment inserted into or replacing an expression in a program under development.

In addition to thought clouds that establish contexts, we need context-sensitive embeddings that enable us to generate multiple proposals for filling in empty constructs such as the body of a let or do loop. Indeed, somehow we have to create and then leverage very large distributed representations that correspond to multiple recurrent layers arranged hierarchically as in the (cumulative) association areas in the cortex — analogous to cascading style sheets.

These complex representations have to persist over time, e.g., adapting the (error) carousel method used by Schmidhuber in his LSTM model. The recurrent layers are capable of activating multiple contexts, sustaining activity over indefinite periods of time and taking top-down direction from layers modeled after the prefrontal cortex (PFC) to modify the vector analog of slots using inhibitory and excitatory feedback from the PFC executive-control attentional mechanism.

Also missing is the ability to infer the need for edits in existing code fragments, including changes in variable scope following a substitution, inserting additional conditional clauses, as well as wholesale changes replacing an entire subtree in the AST representation of a program under development or that ripple through an entire program accounting for changes in type or input specification. Responsibility for such changes is shared by the programmer and apprentice.

January 3, 2018

% %%% Wed Jan  3 04:06:36 PST 2018

I reread the Battaglia et al [24] and Chang et al [41] papers on reasoning about physical dynamics simulators with an eye for the possibility that they might represent programs. It was a long shot, but necessary due diligence none the less. The primary focus in both papers involves continuous systems such as n-body problems. I was interested in the degree to which such systems manage state such as the velocity and acceleration of moving billiard balls. Systems whose dynamics can be described by simple PDE models seem within scope. However, I believe these models are inadequate to the task of modeling complex discrete systems such as executing computer programs⁵¹.

January 1, 2018

%%% Mon Jan  1 3:22:35 PST 2018

A feedforward model predicts the next state of a system whereas an inverse model works backward from the desired state or behavior of a system to the activity or cause. The advantage of an inverse model is that it can be used directly to build a controller. The desired behavior is treated as an input variable in the model, and the action is treated as an output variable. When a new desired behavior is given, the controller just asks the model to predict the action needed.

Figure 36: Block diagram of a thought system [153]. The diagram shows a schematic illustration of explicit and implicit thought processes. The working-memory system and attentional system together constitute a controller for a mental model during the attentive phase of explicit thought. The part of the prefrontal cortex that carries out executive functions acts as a controller for less attentive explicit thought. The inverse model provides a feed-forward controller. The novelty system consists of the hippocampal CA1 area and the ventral tegmental area. E1 denotes errors derived from a comparison of the input problem with the output solution from a mental model. E2 denotes errors derived from a comparison of the outputs from the mental model with the outputs from the forward model. E3 denotes errors derived from a comparison of the input problem with the output of a forward model. Comp1 denotes a comparator associated with the novelty system. Comp2 denotes a comparator involving the inferior olive. Comp3 denotes a postulated comparator for E3. Subtraction and repression (in the case of E3) are indicated by a minus sign (–). Adapted from [152].

Here is the programmer's assistant instantiation of the Ito coupled cerebral / cerebellar thought (sub) system⁵² adapting Figure 4 in [152] to establish the mapping:

a mental model that generates proposals to transform an existing program into one that is more likely to produce a final program satisfying the target description and I/O sample,
a forward model as illustrated in the diagram shown in Figure 36 that (metaphorically) on the brain stem and interfaces directly with the IDE, its interpreter and utilities, and
an inverse model also shown in the diagram shown in Figure 36 that generates the examples necessary to train the recurrent mental model by running the forward model backward.

If you are interested in the original Albus and Marr theories of the cerebellum and their subsequent development and refinement, check out the syllabus for David Touretzky's course at CMU entitled Computational Models of Neural Systems which includes two particularly relevant lectures: Lecture 2.3 Cerebellar Forward and Inverse Models in Motor Control PDF and Lecture 2.4 Cerebellar Timing and Classical Conditioning PDF.

December 31, 2017

%%% Sat Dec 30 03:54:12 PST 2017

Consider the training examples shown in Figures 34 and 35. The two functions are similar in many ways but also have subtle semantic differences. They both make use of the do loop as an iterative construct rather than the more familiar Lisp alternative of using recursion or functional-programming-style map operator. In Figure 34, the function substitutes a married name for a maiden name if finds the maiden name preceded by the given name by no more than two words. In Figure 35, the function returns true if and only if it finds an instance of the first keyword preceded or followed by the second keyword separated by no more than span number of intervening words.

Both functions scan the input sentence from left to right and examine each word exactly once. Both functions keep track of the number of intervening words separating the current word being examined and the last occurrence of cue words — maiden names in the first function and both keywords in the second. Syntactically similar, the two functions might be rendered similar in an embedding of their abstract syntax trees. An examination of their execution traces would reveal that the functions execute the same number of iterations when operating on sentences of the same length. The second function employs somewhat more complicated logic since it has to account for the possibility of keywords appearing in either order. The first function returns as output an amended copy of the input sentence. The second function returns true or false indicating whether or not nearby instances of the keywords were found.

If one of the two functions was implemented in the functional programming style using map and lambda it would look superficially / syntactically different. The same goes for an efficient implementation using tail recursion. In Racket, the do construct could easily be a macro hiding an implementation using either of these two alternative programming styles. The triples could be implemented using a C-style struct or an immutable tuple or by defining a suitable class. Alternatively, the function could be implemented using regular-expression-based search and replace and it would look nothing like functions shown in the two figures. With enough data in each of the most common styles supported within the Racket implementation of Scheme, none of these sylistic differences would make a difference. That said it might take a lot of data.

The program logic and method of counting word spans is similar in the two functions, but optimized versions are likely to be more varied and less comprehensible. Again, more data can compensate at a cost. There are reasons to stick to a given style to facilitate readability, but it pays to know multiple programming languages if you often borrow ideas from Stack Exchange or an existing shared code base. An earlier entry contrasted the differing roles of the cerebral cortex and cerebellar in prediction. These two systems are not isolated from one another. They are tightly coupled with each one capable of independent prediction, but far more capable when combined using feedback to check the predictions of one against those of the other and bring to bear the executive control center of the prefrontal cortex and knowledge stored in primary motor and sensor cortex and related association areas⁵³ .

%%% Sun Dec 31 03:47:42 PST 2017

Spent the day review work in cognitive neuroscience relating executive control and the role of the basal ganglia and cerebellar cortex, including O'Reilly and Frank [221] computational model of learning in the prefrontal cortex and basal ganglia, O'Reilly [220] on biologically based computational models of high-level cognition and Ito [152] control of mental activities by internal models in the cerebellum⁵⁴ .

Took some time to better understand the Reed and de Freitas [243] paper Neural Programmer-Interpreter⁵⁵ focusing on their use of execution traces and thinking about how to incorporate traces into an architecture based on the cerebellum and related nuclei described in Ito [152]. Now have some idea how I would combine a static but mutable representation of the current program along with the execution traces of a program running on an I/O sample. Papers on learning in the cerebellum by Albus [1] and Marr [197] helped.

As the name suggests, the abstract syntax tree is a tree and, as such, it does not explicitly represent while loops or recursion. These features are, however, apparent in the control flow graph and can he added as annotations to the AST representation to provide an executable representation of the program. The state of a running program in the cerebellar architecture combines the current values of all program variables, a program pointer corresponding to a node in the AST and its embedding-space vector representation in the area representing the abstract association area in the cerebral cortex.

December 29, 2017

%%% Thu Dec 28 04:03:38 PST 2017

The objective of this log entry is to describe how to we might apply the embedding technologies describe in the previous entry to solve a simple programmer's apprentice code-synthesis problem. For the time being, we won't complicate the task by describing how the apprentice manages its end of the task, but rather finesse the NLP issues and assume the assistant can generate hierarchical plans of the sort proposed in the Zinn dialogue management prototype.

We assume the IDE includes instrumentation that allows both the user and system to point variables and select code fragments. The interface automatically selects syntactically well-formed expressions to simplify editing and sharing and responds to voice commands issued by either agent to facilitate hands-free selection and ambiguity resolution. Syntax highlighting visually distinguishes placeholders in partially specified expression templates, e.g., (if test then true else false), allowing either agent can issue a command like "replace the true expression in the indicated conditional with a procedure call".

As mentioned in the prologue, most of my writing — including both prose and code — is done in a highly customized Emacs environment with lots of specialized modes and bespoke functions. Almost anything tedious done more than once is a potential target for automation. My ideal digital amanuensis would handle all of the associated script programming in response to my verbal specification. I would be more than happy to tutor it in writing some of the more esoteric functions if it took care of routine maintenance and upgrades and wrote all the easy scripts⁵⁶ .

%%% Fri Dec 29 12:08:23 PST 2017

My original goals for the day got sidetracked and I spent most of the day trying to scrounge up examples of code to illustrate points about exploiting code embeddings. My original idea of using Emacs lisp dissolved not because I couldn't find example code repository but because (a) Emacs mode hackers are an inbred lot and their coding style can be difficult to stomach for the uninitiated. Python is a practical alternative but my immediate purposes are primarily pedagogical and so I settled on a modern dialect of Lisp called Scheme and an excellent implementation called Racket developed by the PLT group.

(define describe_A "Given a list consisting of three words represented as 
        strings, find and replace the first occurrence of the second word 
        with the third word in a document if occurrence of the second word 
        is preceded by the first word by no more than two separating words.")

(define document_A '("Abigail Smith was President John Adams closest adviser"))

(define triplets_A '(("Abigail" "Smith" "Adams") ("Dolley" "Todd" "Madison")))

(define (substitute_married_name_for_maiden_name triple document [maxsep 2])
  (let ((given (first triple)) (maiden (second triple)) (married (third triple)))
    (do ((dist (+ 1 maxsep) (+ 1 dist))
	 (words (string-split document) (rest words)) (out (list)))
	((null? words) (print (string-join out)))
      (cond ((equal? given (first words))
	     (set! dist 0)
	     (set! out (append out (list given))))
	    ((and (equal? maiden (first words)) (< dist maxsep))
	     (set! out (append out (list married))))
	    (else 
	     (set! out (append out (list (first words))))))
      )
    )
  )

Figure 34: Here is the first of two illustrative examples of code written in Scheme that we refer to in our discussion of semi-automated program synthesis in the context of the programmer's apprentice application. The code is written in a simple pedagogical style to reveal its algorithmic character for comparison.

Figures 34 and 35 illustrate a simple data format for examples used to train embedding models. The format includes a short natural language description, sample input (shown) and sample output (missing) and the code fragment. In practice, there would exist helper code making it straightforward to run the fragment on representative input samples, check the results against the corresponding output, apply lexical and syntactic analysis to produce token strings and abstract syntax trees in a standard format, and, finally, generate execution traces to construct call graphs that capture signature run-time dynamics.

(define describe_B "Given two keywords represented as strings, a distance between 
        words represented as an integer number of intervening words and a document 
        represented as a string, return true iff the two words appear in any order 
        in the document separated by no more than a specified separation distance.")

(define keywords_B '(("John" "Adams") ("Thomas" "Acquinas") ("Eleanor" "Roosevelt")))

(define document_B '("John Quincy Adams mother Abigail was not related to Adam Smith")) 

(define (determine_keywords_are_near_one_another pair document maxsep) 
  (let ((keyone (first pair)) (keytwo (second pair)))
    (do ((distone (+ 1 maxsep) (+ 1 distone))
	 (distwo (+ 1 maxsep) (+ 1 distwo))
	 (words (string-split document) (rest words)) (flag false))
	((null? words) (print flag))
      (cond ((equal? keyone (first words))
	     (set! distone 0)
	     (if (<= distwo maxsep) (set! flag true) void))
	    ((equal? keytwo (first words))
	     (set! distwo 0)
	     (if (<= distone maxsep) (set! flag true) void))
	    (else void))
      )
    )
  )

Figure 35: Here is the second of two illustrative examples of code written in Scheme that we refer to in our discussion of semi-automated program synthesis in the context of the programmer's apprentice application. Compare this function with the functionally and stylistically similarly example shown in Figure 34.

%%% Fri Dec 29 15:22:55 PST 2017

Wei Lu, a professor at University of Michigan, and researcher in his lab have developed a chip implementing a form of reservoir computing using low-power memristor technology [83]. Reservoir models, including liquid-state and echo-state machines, employ a set of nonlinear units sandwiched between two linear layer such that the nonlinear units — implemented as memristors — map the input into a high dimensional vector space and only the units in the output layer have to be adjusted during training, thereby considerable speeding up learning. Perhaps platforms should have listened to me two years ago when I suggested that such a combination would facilitate deployment of neural networks on mobile devices and substantially reduce computing costs in our datacenters, but that it would require targeted investment given that HP was apparently giving up on the technology after championing it for years and most of the remaining effort was going into memristive memory.

December 27, 2017

%%% Tue Dec 26 05:15:49 PST 2017

The programmer's apprentice is rapidly evolving though not necessarily converging as I learn more about the problem — that should be "problems" plural given the many different applications of such technology from inferring regular expressions given I/O pairs to solving programming olympiad challenge problems — and current solution methods with a special emphasis on recent neural network solutions.

The programmer's apprentice is different from most automated program induction problems as it involves human-plus-computer pair programming. It might be considered unwise to attempt to solve two problems — human-computer interaction plus automated program induction. The rationale emphasizes addressing an accessible level of human-computer cooperation on a relevant graded application with measurable outcomes. It is the former I am particularly drawn to.

This log entry attempts to summarize progress so far by describing a scenario involving a programmer and her automated assistant solving a programming problem, highlighting how they cooperate and, in particular, what knowledge each of them bring to the table, how that knowledge is encoded, and how and under what circumstances it is brought to bear. The description has obvious lacunae and gaps are filled with possible options and not a little hand waving.

In a typical session, the programmer describes the target program followed by a sample of I/O pairs that makes sense for this particular target. We'll assume that the IDE is reset so that the canvas is essentially empty for the new session. To simplify things further, we also assume that the assistant can take instructions to load an I/O sample and that having loaded the sample, can examine the entities that comprise the sample.

The target description and I/O pairs have to be ingested as text for the apprentice to select an existing program to use as a pattern otherwise the programmer can suggest an existing program say from StackExchange or GitHub. Assuming a suitable suggestion from the user, the program is loaded in the IDE. In lieu of a suggestion from the programmer, the assistant begins with a language-appropriate default such as a main function in Python.

def main ():

if__name__==__main__:
    main()

At this point, we assume the assistant has one or more default strategies for making progress. For example, the main function might be modified to take an I/O-appropriate input argument. Since the objective in this case is not to induce a program from an I/O sample, in order to make progress either the programmer has to suggest some preliminary code to insert or the assistant has to select a program fragment from the set of all programs it has ingested during training.

To take stock, several of the methods we've reviewed rely on I/O pairs to generate relatively simple DSL programs [80] or start with a complete NL description of the challenge program along with I/O pairs [16]. Most of the approaches employ an oracle such as in traditional methods to evaluate a proposed solution [122]. We are interested in a more interactive style of pair programming in which NL is used to capture the intent of a program or fragment⁵⁷.

%%% Tue Dec 26 11:26:38 PST 2017

One approach [7] to generating program proposals from directly from NL descriptions uses an encoder-decoder pair implementing a seq2Tree model [4] augmented with attention [82] that encodes the NL problem description and decodes it as an AST tree node computing probabilities one node at a time. A tree beam search then employs those probabilities to compute a set of most likely trees, and chooses one that is consistent with the specified I/O pairs.

This method relies on embedding methods from machine translation. Here is an example of attention machinery applied to sequence-to-sequence machine translation [191] from this tutorial. Here Dong and Lapata [82] extend the same basic idea to map linguistic utterances to their logical forms whether they be parse trees or programs.

Lin et al citeLinetalUW-CSE-TR-17 also focus on program synthesis from NL but in this case they translate from questions of the sort that are routinely posted to Stack Exchange to short bash scripts and rely once again on seq2seq to perform the embeddings using 5,000 pairs for training and an additional 3,000 pairs for evaluation:

Question: Move the ".zip" files in "dir1" ,"dir2" and "dir3" to a common base folder,
Solution: find dir*/ ‑type f ‑name *.zip ‑exec mv {} basedir \;

These two systems [7, 186] exhibit interesting capabilities that are likely to prove useful in the programmer's assistant. They have limitations in that (a) they don't handle larger programs written in more expressive modern programming languages, (b) transformations can't be precisely applied to an existing program, and (c) code search and debugging are primarily dictated by syntactic criteria.

To handle larger programs and support syntax-aware code transformations, we introduce a shared-access program representation as part of a highly instrumented IDE designed for Python, Java or Lisp. To enable semantic code search-and-repair capabilities and assist in finding bugs and writing unit tests we introduce extensions that exploit semantic information of the sort found in execution logs and program traces.

%%% Wed Dec 27 03:51:08 PST 2017

In an earlier post we suggested that it would be useful to have two different methods of predicting the behavior of programs, one modeled after the cerebral cortex [132] and capable of a powerful, if slow and not terribly precise, ability to imagine how complex states evolve [223].

In addition, we anticipate the need for a second method of prediction roughly modeled after the cerebellar cortex that is fast, accurate and primarily associated with activities relating to innate physical dynamics governed by the motor system such as speaking, running, riding a bicycle or playing a piano⁵⁸.

There is evidence cognitive processes reciprocally engage with motor processes. While there are areas linked to motor activity, cognitive and motor function are broadly controlled by the frontal lobes, cerebellum and basal ganglia that "collectively interact to exert governance and control over executive function and intentionality of movements that require anticipation and the prediction of movement of others" [176].

The memory systems for running, jumping, skipping rope, swimming, etc are broadly distributed. These essentially correspond actionable traces for directing different activities that can be strung together to execute more complicated maneuvers. For walking, we have a number of basic gaits, plus conditioned excursions for transitioning between gaits, recovering from changes in terrain or loss of balance and compensating for injury and loads.

Simplifying, the cerebellum takes input from sensory systems, integrates these inputs and produces the outputs that drive motor activities. In the programmer's apprentice, the closest analog we have to motor activities correspond to changes in the values assigned to variables during program execution. Running a program generates an execution trace or call stack that can be realized as a graph that represents a distinctive signature.

Execution on different inputs realize different traces and hence different signatures. We could take one representative trace as the signature for a given program or combine multiple traces in a multigraph that, depending on the program, may or may not be comprehensible.

To associate a functional signature with a program applied to a representative I/O example, we simply run the program on the selected input to generate an execution trace, convert the trace into a graph that we then embed in a suitable vector space./inputs/footnotes/graphembed.bib in a manner similar to that proposed by Xu et al [302] or other similarly motivated approaches [302, 42, 3].

Associating a functional signature with a program fragment is conceptually simple: In the debugger, set a break point just before you enter the fragment and then track the evolution of each variable referenced in the fragment by stepping through the remaining statements in the fragment recording each time the variable changes value as a consequence of some procedure call. This process could be accelerated and instrumented within the IDE.

It could be that traces of a program running on different inputs produce completely different signatures. It might make sense to check for this and use the embedding to cluster signatures into functional groups. In this manner, a single program or fragment would effectively serve multiple purposes. Since a single program has multiple fragments, most programs already serve multiple purposes. We might not want to enumerate all such fragments a priori.

To review, the assistant's knowledge of programs — not about programming per se, we haven't got to that yet — has declarative, syntactic and semantic dimensions. Declarative knowledge including specifications and descriptions suitable for collaborative programming are handled using traditional NL embeddings, syntactic structure is preserved in the embedding of abstract syntax trees and semantic information in the embedding of execution traces.

The next step is to extend the narrative we began at the outset of this log entry, describe the programmer's knowledge of programming and interaction, and work through examples illustrating how the assistant might exploit its knowledge to collaborate with the programmer to write a simple program. In the process, we may be able to refine the generic architectural framework we introduced back in October.

December 25, 2017

%%% Sun Dec 25 04:50:35 PST 2017

I submitted a draft of the Inverted Matrix concept for internal review. The peer reviews — from project managers, research scientists and software engineers — were encouraging though one director-level reviewer suggested that it be run by folks in the Press and Policy groups for a sanity check. Alas, the P & R folks believe it will push all the wrong buttons and exacerbate fears about the impact of AI, intrusiveness, loss of privacy and the perceived role of large technology companies.

I'll continue thinking about assistive technology to enable the sort of intellectual, emotional and social intelligence required to channel collective decision making on a global scale. In the meantime, my focus has returned to assistants for software engineers allowing them to generate code by instructing programmer's assistants to do their bidding. Such technology could be extended to enable non-programmers to construct and execute complex directives, e.g., recipes, protocols, etc. The underlying technology could support the sort of prostheses that we envisioned in the Inverted Matrix white paper.

Regarding assistants for individual personal development, I've been researching concerted efforts to develop emotional intelligence. Some years ago I ran across the work of Alain de Botton in the form of his witty and well received book entitled How Proust Can Change Your Life [54]. Esoteric to be sure, but more recently Botton has channeled his efforts in a company called The School of Life.

It occurred to me that the videos and other resources provided by the The School of Life constitute another — but considerably less intrusive delivery vehicle — for much of the psychological counseling, social discourse advice and life coaching that I imagine Inverted-Matrix / Diamond Age inspired technology providing to individuals — see also the somewhat ostentatiously named companion Book of Life.

Botton strives to reach a broad audience without pandering to any particular demographic aside from thoughtful people interested in their emotional well being and issues relating to how we might better live and work together in harmony⁵⁹. Initially we were considering working with organizations like the Center for Cognitive Behavioral Therapy (CCT) at the University of Pennsylvania and the Stanford Institute for Neuro-Innovation & Translational Neuroscience and related Center for Compassion and Altruism Research and Education, but The School of Life videos underscore the importance of thinking carefully about how such lessons are delivered and promulgated.

The other evening we visited The School of Life website to check out some of their more recent video shorts, and enjoy the simple, effective animations that accompany Alain de Botton's monologues. Here⁶⁰ are a few examples that might give you a smile during the holidays. And here⁶¹ are some examples that take on fundamental societal issues concerning who we are and how we got here in a way that reveals existential issues and suggests how we might change our behavior to reduce their hold on us. Enjoy.

December 23, 2017

%%% Sat Dec 23 03:55:37 PST 2017

Suppose someone you know has a great recipe for a Tuscan white bean soup with ingredients consisting of dried white beans, celery, carrots, onions and seasoned with pepper, garlic and thyme. You want to show her how to modify the recipe to make it a complete meal by substituting one cup of chicken or vegetable broth for the same quantity of water in cooking the beans and adding farro and lentils to thicken the soup and supplement its nutritional content. The modification will require soaking the lentils separately, cooking the farro and then adding the lentils and farro after the white beans are partially cooked but well before they start to disintegrate. The instructions may seem inscrutable to someone unfamiliar with cooking different types of beans and grains. It is not enough to know just the ingredients and have a description of the correctly prepared dish. Understanding the physics of cooking different types of beans is not inherently more difficult than understanding when to use and how to work with sequences, sets and tuples. The analogy of cooking to programming, recipes to programs and food-physics to program execution is worth a few minutes of meditation.

%%% Sat Dec 23 16:42:56 PST 2017

In the holiday shopping problem, the shopper has a list of people he wants to buy presents for, a model of each person's preferences and list of promising gift ideas for some particularly close individuals, various means for buying presents including on-line websites, shopping centers and bricks-and-mortar stores, plus talents like baking and sewing he can use to make gifts and various strategies for wrapping, personalizing and shipping presents. The goal is to have the wrapped packages delivered to their designated recipients in advance of the target holiday whether it be Christmas, Hanukkah or Kwanzaa.

In the dinner cooking problem, the cook has a dish she wants to create for a dinner party, a list of the ingredients and necessary kitchen appliances, the memory of what the dish tastes like from a meal at a local restaurant, a guest list with dietary restrictions, and a collection of cookbooks none of which includes a recipe for the exact dish she has in mind but several include recipes for similar sounding dishes. The goal is to have dinner prepared and ready to eat at a specified time and satisfy the greatest number of diners without deeply disappointing or poisoning any with dietary restrictions.

What's different about good cooks and shoppers? Both problems can be arbitrarily complex. One can argue that at their most refined level they are as complicated as software engineering, albeit requiring different knowledge and skills. All three problems can be carried out using brains with architectures that are anatomically indistinguishable. All require many discrete steps involving complex dependencies, employ sophisticated technologies, demand careful choice of materials and processes, precise timing, and rigid adherence to unforgiving physical laws. How is it we can become competent at all three?

The difference does not, I expect, have anything to do with the discrete nature and rigid syntax of computer code. Most chefs know how much you can deviate from recipe-specified quantity. You can be a tad sloppy about how much sugar or flour you add to your cake flour, but it matters not whether you add two cups or two cups plus-or-minus a tablespoon, while adding but one cup or leaving out the sugar altogether can make a substantial difference in the finished product. On the other, user interfaces have to be tolerant of concurrent processes that may start or stop at whim and persist indefinitely.

December 22, 2017

%%% Fri Dec 22 04:02:12 PST 2017

You can find an assortment of papers that use semantic-embedding techniques to encode computer programs or their I/O behavior for a variety of applications including but not restricted to automatic programming in this footnote⁶² . The following brief summaries emphasize novel ideas for leveraging semantic embedding of potential relevance to the programmer's apprentice application.

There are a number of systems that analyze execution logs or program traces to detect malicious software or analyze errors in student programs. The 2017 ICLR Workshop paper by Chistyakov et al [42] uses execution logs to construct behavior graphs that are embedded in a continuous space used to explore program similarities. The work is preliminary but the basic idea that behavior graphs provide functional insights is worth pursuing.

Xu et al [302] from Dawn Song's lab at Berkeley address the problem of determining whether two binary functions coming from different platforms are similar or not. They create an embedding based on the control flow graph of each binary function, then measure the distance between the embeddings of two functions to perform similarity detection to achieve significant improvements over state-of-the-art systems.

Wang et al [294] note that existing program embeddings are based on syntactic features such as token sequences or abstract syntax trees. The paper's key insight is that "program states expressed as sequential tuples of live variable values [...] capture program semantics more precisely" and offer a more natural fit for RNN models. They demonstrate the effectiveness of their approach on predicting errors in student programs and search-based program repair.

Piech et al [229] develop a NN method for assessing and providing feedback to students taking an introductory MOOC on computer programming. The main contribution "involves simultaneously find an embedding of states and programs into a feature space where pre- and post-conditions are points in this space and programs are mappings between them." The approach has implications for automatic programming, perhaps most importantly it underscores the value of exploiting the fact that such systems can execute code so as to collect and compare input-output traces as well internal-variable traces.

Allamanis et al [3] attempt to capitalize on the opportunities afforded by exploiting knowledge of a program's syntax. Specifically, they note that long-range dependencies induced by using the same variable or function in distant locations are often not considered. They propose to use "graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods [179] to learn to reason over program structures".

December 21, 2017

%%% Thu Dec 21 04:50:01 PST 2017

Niklas Een's presentation of his preliminary work with Christian Szegedy entitled Full-Out Attack on Code Synthesis at the Google Research Summit. This work is in its early stages with much of the effort devoted to defining an appropriate language and applying conventional methods. Een mentioned two related papers [190, 2]

Niklas also mentioned that OpenAI features a small dataset of 5000 input-output examples collected by Ethan Caballero for the automatic programming challenge of turning descriptions into code. The descriptions read like puzzles and the sample of solutions that I looked at were mainly one- or two-line programs. Scanning the data just reinforces the fact that the space of programs is incredibly diverse and that any successful approach to automatic programming will have to tap into a combination of mathematical and commonsense knowledge.

I am still thinking about using a variant of Yoshua Bengio's consciousness prior [25] to train a Programmer's Assistant to perform sequential tasks with feedback from the programmer it assists. As an interesting aside, Stephen Batchelor's [23] Secular Buddhism mentions the writings of Ñāṇavīra Thera, edited by Samonera Bodhesako and Forrest Williams [209], in which Ñāṇavīra presents his understanding of consciousness (viññāṇa).

December 19, 2017

%%% Tue Dec 19 09:23:50 PST 2017

I still owe Jay a summary of my thoughts on what I'm calling, for want of a better name, his Programmed Prediction Framework and the related problem of Programmed Variable Prediction (PVP) in a production environment. My problem is that, having read his recent white paper, the horizon opens up to encompass a wide range of interesting system-wide infrastructure advances that incorporate differential NN models and automatic — compiler- or build-system-initiated — ML extensions.

For example, instead of just learning the value of constants required in making decisions in a program, delegate decision making entirely to separately managed, tested and periodically reviewed machinery deployed by the build system during compilation, prior to and potentially intermittently after product launches.

I like to think of decision making applications in which we replace entire conditional / programmed-logic cascades with ML enabled and managed decision making technology. From what I've read of Jay's interests, these applications would constitute a special case of automatic programming and would necessarily build on and contribute to a complex network of related ML problems whose specific details would not be easily accessible to either applications or infrastructure engineers.

If this is the case, it bears thinking about how the programmer would specify the requirements for an instance of PVP. It would seem to require some sort of executable specification language — including the analog of unit tests — that would allow the PVP system to develop and then test PVP components. Like I said, I need to think through some of these issues before my next meeting with Jay.

In thinking about SWERL-like programming by substitution and automated program induction more generally, I've also been thinking about what is the context of a program fragment — one or more consecutive statements in a computer program — and how does it relate to the specification of an entire program?

In terms of automated program induction, assuming no side effects, how does an input-output pair or representative sample of possible input-output pairs for a given problem dictate the structure of the final program? Having spent a good fraction of the last three weeks reading papers on automatic programming — many operating on restricted domains using domain-specific languages (DSL) to simplify the task — specifications of any generality are uncommon.

Consider the cartoon transformation shown in Figure 33. If we assume that each fragment, A and B, represents a working part of a larger program, it is safe to assume further that each inset — designated inset A and inset B respectively — is constrained purposefully to work correctly within its enclosing fragment.

The collection of substitutions — impedance matching code alterations — necessary to integrate inset B in fragment A could be extensive and it may be difficult to concoct a transformation rule that encompasses them all concisely. The conceit is that much of (if not all) the information needed to generate this rule is available in the code base in the form of a great many working examples of the rule successfully carried out.

No individual programmer ever articulated this rule. The rule falls out of the collective effort of thousands of software engineers inadvertently applying this and many other rules by having figured out one of its possible applications, for example, editing inset A to work in fragment B or inset B in fragment A — see this tutorial on embeddings in NLP by Horacio Rodrìguez [247].

The program-fragment embedding model we envision effectively implements a flexible rule that incorporates all of these possible applications as well as by some as-yet-unarticulated generalization that extends applicability to other, as-yet-unseen examples — or could, if adapted with a bit of coercion to fit by fixing a few reasonably-easy-to-spot type-inference and superficial-syntax errors.

In general, such a purely syntactic transformation is unlikely to be either injective (one-to-one) or surjective (onto) if required to adhere to reasonable semantic constraints. Purely syntactic transformations are unlikely to be sufficient for code synthesis. Most automatic programming methods incorporate some means of semantic testing, typically running code in a specially instrumented sandbox, but we have yet to see a credible workflow that elegantly integrates both syntactic and semantic analysis.

December 17, 2017

%%% Sun Dec 17  04:20:25 PST 2017

Following up on my discussions with Quoc and Dan, I started thinking more carefully about how we might create a synthetic data set using a large corpus of relatively short programs used in teaching introductory computer programming with a modern dialect of Lisp such as Scheme, specifically Racket Scheme from the PLT group.

The original idea was to take correct programs and modify them by introducing bugs and constructing pairs consisting of correct and buggy programs to support a form of supervised learning. This is still a good idea, but then I thought about supplementing the synthetic data by learning a high dimensional semantic embedding space that essentially learns to embed program fragments by learning to parse programs into abstract syntax trees. The network serves as both lexer and parser.

The idea is that by using the context of the whole program you can identify fragments corresponding to nodes in the abstract syntax tree that are semantically similar as nearest neighbors in the underlying embedding space. It may be possible to analyze a buggy program by finding nodes in its abstract syntax tree that are unlikely given the context of the full AST, and thereby determine promising regions to alter as well as specific substitutions to experiment with — see here on the notion of context.

The embedding space would be trained — at least initially — in a completely unsupervised manner. The challenge will be finding a large number of correct programs written in a relatively simple style to embed and dealing the recurrent problem of how to handle arbitrary length structured input and how to precisely identify and delimit the list of lexical items corresponding to specific nodes in the AST so as to make meaningful substitutions — see Figure 33.

Figure 33: Illustration of a simple transformation replacing an interior node in the abstract syntax tree of Program A with a node in the AST of Program B to create Program C. The resulting program may not be executable and could introduce multiple bugs into the code. It's not clear whether it would be helpful or not to perform some form of cosmetic impedance matching, for example, by matching variables where the replacement introduces a new variable or there is an obvious assignment within the same scope or by renaming all variables, or by performing some variant of skoleminzation and leaving it up to the apprentice to make adjustments to clean up the mess in subsequent transformations. Note that these transformations are not intended to be carried out by means of traditional symbolic methods such as genetic programs, inductive logic programming or automated theorem proving [122], rather they are indicated by contextualized nearest-neighbor search using recurrent neural networks of the sort used for natural language parsing, machine translation and dialog management systems.

December 16, 2017

%%% Sat Dec 16 15:231:56 PST 2017

Correct programs are much more structured than sentences and paragraphs in natural language. The constraints between program elements are dictated by the type of variables expected by functions and the keywords that are required inside of conditional and iterative expressions. Moreover the compiler is very picky as to what constitutes a syntactically correct and appropriately typed program. An optimizing C compiler or on-the-fly byte compiler of a Lisp or Python interpreter serves as an incredibly useful oracle for what constitutes a correct program.

Imagine if we could build a recurrent neural network that could ingest a large number of programs of varying sizes and learn to transform an incorrect program into a correct program given examples of input and output, where both the correct and incorrect programs are structurally correct in the sense that each separate expression is syntactically correct in isolation even though there are errors relating to the scope, spelling and type of variables. First, the system would have to learn how to make edits that preserve structural correctness perhaps using a static analysis tool such as the Unix lint program. Having learned this skill it could go on to fix errors so the program runs and compiles producing the same I/O behavior.

The next step in complexity would be to take a functioning program correct with respect to some initial I/O behavior and modify the program to satisfy an alternative specification of I/O behavior, for example, take a program that operates on integer arguments and modify it so it works with floating-point arguments. Or take a program that extends one kind of sequence such as the Fibonacci series and learns to produces another sequence. In each case, the reinforcement signal would be quite clear since the only requirement is that the revised program satisfies the specified I/O behavior. We could generate data sets consisting of programming problems at various levels of complexity, introducing more complexity gradually.

We could have the automated programmer train on problems for which the stepwise transformations working backwards from a correct program to an incorrect program are generated automatically and thereby exercise precise control over the type and complexity of the problems the system is learning to correct, even to the extent of working backwards from a correctly functioning program to a simple stub so the automated programmer is basically writing programs de novo.

I have an appointment, but I wanted to add one comment that underscores a gap in my current understanding and that corresponds to essentially how we could train an encoder-decoder machine-translation-like system to learn to parse and accurately align code snippets to support precisely targeted edits in transforming one program into another, by starting from simple programs perhaps no more complex than a function (DEFINE) with a body corresponding to a loop (FOR) or local variable (LET) declaration. I'm also uncertain about how to go about adjusting priors using auxiliary information provided by the assistant’s programmer teacher, but perhaps this is no more complicated than what is required in building a hierarchical model [99]. In any case, this will be an interesting problem to mull over during the holidays.

December 15, 2017

%%% Thu Dec 15 05:29:23 PST 2017

The architecture is modeled very roughly after the primate neocortex. With natural language input / output corresponding to auditory / visual primary sensory cortex including the pathways involving Broca's and Wernicke's areas, and an instrumented integrated development environment (IDE) corresponding to the motor cortex including bidirectional recurrent pathways connecting to the cerebellar cortex responsible for running code snippets in a suitable sandbox environment with STDIO and STDERR feeding back into the primary sensory cortex — see Figure 29 for an early system architectural sketch, Figures 32 and 28 for details pertaining to the primate neocortex, and Figure 30 for Broca-Wernicke details.

The basic programmer's apprentice concept — given its primate-inspired neural architecture I call it a code monkey (CM) since the acronym PA is ambiguous — assumes a system that comes pre-trained with certain innate capabilities. I'm not entirely sure how far it makes sense to go in terms of providing these innate capabilities, but they will probably include the ability to parse programs, traverse abstract syntax trees, step through programs, set breakpoints, modify operands, read and compare outputs. You should think of these capabilities as analogous to an infant's reflexive behaviors — code monkeys evolved to write code.

Now suppose the user can alter the contents of the IDE and refer to items by name using the NL interface. The IDE maintains syntactic invariants so there is no futzing around with missing or spuriously added delimiters, mispelled keywords, etc. All expressions are maintained within the IDE in schematic format — perhaps using a variant of the key-variable representation described in [181] — so that substituting one expression in another is straightforward, copy and paste are simple operations and evaluation is trivial. Given how often beginning programmers reboot, we might want an operation that cleans up the editor window and restarts the interpreter.

We could bootstrap the language interface by training an embedding model on the text of an introductory programming book like Felleisen et al [89] and I've entertained the idea of using the code snippets found on the Racket blog and lists to initialize the analog of a cortical association area that embeds program fragments. I know a couple of the principle PLT contributors who might help in acquiring the data.

Figure 32: Here are depictions of the motor and somatosensory maps as homunculi representing, respectively, motor activity and tactile innervation with the size of the corresponding body parts proportional to density of coverage. The inset descriptions provide additional explanation. See here for the original source. In the programmer's apprentice conceptualization, the motor cortex analog serves as the interface to an instrumented integrated development environment.

We're not sanguine about monkeys hammering on typewriters and eventually producing the works of Shakespeare and so what do we expect a code monkey to accomplish? What we have is an embodied system such that it can't help but write executable code. We could exploit reinforcement signals from the user to train the system to debug or refactor the code. The former seems possible, but the latter is a stretch. If the fragment memory was extensive enough and the system could be trained to integrate foreign code fragments into an existing design we might be able to solve some interesting problems. There are also some enhancements to the IDE that might accelerate programmer-apprentice pair programming.

Rahul wondered whether there would be a benefit to augmenting speech with some sort of "pointing". He wrote "I think there are studies claiming that a combination of speech + deictic pointing gestures is an efficient way to communicate for humans and maybe HCI". Then you could replace the long-winded "What do you want in the body of the LET?" with "What do you want here?" Similarly, I could imagine the user wanting to say "I don't know what to put here (points) just yet but let me rename this (points) variable to key_value_pair".

I like the idea of incorporating pointing into the user interface. The model assumes that the user can see the contents of the IDE editor window and perhaps the STDIN, STDOUT and STDERR signals entered into or issued from the interpreter. Assuming this visual access to what the apprentice is attending to and thinking about, there's no reason why the user couldn't use the mouse or touchpad to refer to specific entities and possibly integrate some form of completion.

Channeling Rahul's suggestion "What do you want to do here?" and "Move this expression inside this expression." become "What do you want to do *LOCATION* = [APPEND X Y]?" and "Move *EXPRESSION* = [APPEND X Y] to [COND [[EQUAL PAIR NIL] [...] *LOCATION*] [...]]. The user might also be able to be more helpful in making suggestions if it could watch the apprentice's focus of attention highlighted in the editor, though fast sacades might have to be slowed down in some sort of playback mode to be of targeted use.

Currently, I'm focusing on the vision of a savvy user — the programmer — that provides infrequent but highly-targeted interventions and a CM — the apprentice — with access to a large Stack-Exchange-like fragment memory and the ability to do rapid generate and test using its integrated IDE. I've got some other ideas leveraging some of the ideas that Dan and Quoc are working on that I hope to explore when I meet with them later today.

%%% Fri Dec 15 14:01:06 PST 2017

Met with Quoc Le and Dan Abolafia to talk about their work on teaching NN's to program. Compared notes concerning the papers that Dan noted in his Brain Team presentation. Talked about how to use existing code snippets to create program-transformation pairs and then work backward to create user / programmer suggestions. Briefly discussed how to instrument the simplest version possible basic IDE. Quoc filled me in regarding for sort of training data that Liang et al [181] used where a query is associated with a FreeBase panel and the goal is to generate a program corresponding to an SQL query that operates on tabular format data such as a spreadsheet or HTML representation of a table to extract an answer to the specified query.

Dan mentioned a technical talk by Rishabh Singh entitiled "Neural Program Synthesis" with the abstract "The key to attaining general artificial intelligence is to develop architectures that are capable of learning complex algorithmic behaviors modeled as programs. The ability to learn programs allows these architectures to learn to compose high-level abstractions with complex control-flow, which can lead to many potential benefits: (i) enable neural architectures to perform more complex tasks, (ii) learn interpretable representations (programs which can be analyzed, debugged, or modified), and (iii) better generalization to new inputs" as in the case of computer algorithms⁶³.

December 13, 2017

%%% Wed Dec 13 14:13:08 PST 2017

The best way to illustrate the expected behavior of the Programmer's Apprentice concept is to show some sample dialog:

USER: Define a procedure that takes a single string argument.

SYSTEM: What would you like to name the procedure?

USER: Call it key_value_lookup.

SYSTEM: Done.

USER: Define a new variable called key_value_table and assign it an initial value of the empty list.

SYSTEM: Got it. Both are defined in the global name space for now.

USER: Okay. Let's write the definition of the procedure.

SYSTEM: Working on key_value_lookup. Would you like to start with a BLOCK or LET statement?

USER: Create a LET statement and introduce a local function called key_value_default assigning it a lambda function that we will define later on.

SYSTEM: Got it. What do you want in the body of the LET?

USER: First add another local variable to the LET statement called key_value_result and give it an initial value of NIL.

SYSTEM: Done. I've tried to anticipate your next request by adding an empty COND statement to the body of the LET.

USER: Good move. If the key_value_table is empty then call key_value_default.

SYSTEM: Done.

USER: That's not quite correct. The test expression and the function call should be delimited by parentheses.

SYSTEM: What does "delimited by parentheses" mean?

USER: It means the two expressions should be written enclosed by an opening and a terminating parenthesis as if they are the only two elements of an unquoted list.

SYSTEM: How about this?

USER: That's right.

SYSTEM: Great. I've updated my default COND schema to reflect this usage pattern based on the Scheme language-specification documentation.

December 11, 2017

%%% Wed Dec 11 04:57:58 PST 2017

Suppose we have two mechanisms for stepping through the consecutive states of a dynamical system. One of them enables us to manipulate each consecutive state crafting it to suit whatever we might imagine. Each such state is constructed starting from some remembered situation and then modified by adding, subtracting or otherwise altering aspects of that state — what are called state variables in dynamical systems theory. Starting from that state we can imagine the next state arising as a consequence of some activity in the current state and thereby explore the implications of that activity in terms of how state variables are altered in the process.

The alterations in this case are performed in situ by operating on existing memories or perhaps constructed out of whole cloth, though the latter seems biologically implausible. The operation of imagining a next state or imagining a similar state but with different properties is relatively time-consuming and so it is not practical to use this method of predicting / imagining for long sequences of state changes [223, 297, 254, 171, 132]. It does however offer a great deal of flexibility in crafting hypothetical states of arbitrary complexity and substantial departures from reality if it is deemed useful to do so.

I refer to these imaginings as performed in situ suggesting that the operations involved in constructing such fanciful states are carried out by operating directly on primary sensory and secondary association areas under the executive control of attentional machinery in the frontal cortex. Sequence machines perhaps implemented as recursive neural networks are able to step through an imagined sequence possibly fast forwarding, reversing or arbitrarily starting and stopping at selected states. The attentional machinery is trained by reinforcement learning to navigate the imagine sequence in order to evaluate various alternative scenarios and interventions.

The second mechanism piggybacks upon machinery in the cerebellar cortex responsible for carrying out complex actions and in particular actions or, more generally, activities that are well practiced and can be performed with little or no supervision from the cerebral cortex. The cerebellum, perhaps with help from basal ganglia and related subcortical nuclei responsible for speech production, are the most likely culprits involved in orchestrating such complex dynamical systems. The avian homologues of these structures are responsible for complex bird-song production — in some cases rivaling the variability found in human natural language processing [227, 127, 303, 183, 184].

In the human, connections between multiple areas of the cerebral cortex and the cerebellar cortex are bidirectional and rich in their connections — excitatory and inhibitory — especially to the frontal and motor cortex [296]. Lesion studies have shown that patients with a rare congenital condition in which they essentially grow to adulthood with no cerebellar cortex whatsoever are able to perform limited complexity planning and sequential decision-making such that their overall development depends upon their dedicated application to overcoming their deficits in smooth articulation of complex mechanical and cognitive activities including speech⁶⁴ .

I imagine the first mechanism being used in the programmer’s apprentice application as the means by which the apprentice is able to explore the consequences of stepping through a program one expression at a time where the primary goal is to understand the underlying state transitions and the semantics of individual functions, whereas the second mechanism could be used to quickly step through a series of component computations up to a breakpoint — thereby answering questions such as, does the program crash, terminate, produce a desired result or perform a specified side effect.

December 9, 2017

%%% Mon Dec  9 05:04:47 PST 2017

Working backward, I read the Silver et al [260] predictron paper that integrates learning and planning into one end-to-end training procedure. A predictron can be implemented as a deep neural network with a Markov Random Process (MRP) as a recurrent core. The predictron unrolls this core multiple steps and accumulates rewards into an overall estimate of value. Contacted Quoc Le and Dan Abolafia to set up a meeting to discuss the details of a couple of technologies they're using in their automated programming work.

I explained my interests in their work by describing a pilot project involving the user telling the assistant how to write short programs and that the use-case included both user-assisted program synthesis and debugging and so has to represent programs and evaluate them in an IDE sandbox. Generally, my immediate focus is on syntax-assisted encoder-decoder recursive-network pair approaches from machine translation for parsing programs, compositional model construction and deconstruction and program simulation for variable-binding analysis and debugging.

I've been thinking more about what drives innovation to focus on niche opportunities afforded by recent technology breakthroughs and whether and to what extent this is a good thing as it relates to the GOFAI practice of emphasizing blocks world problems. This log entry considers a limitation of current NN technology that is inspiring clever workarounds while at the same time seemingly holding back progress on more ambitious problems. The limitation concerns our ability to create, maintain and make use of scalable episodic memory.

What we need but certainly don't have are robust representations that can be used to store extensive episodic memories that remain accurate to some controllable extent while at the same time are highly manipulable so we can use them as templates for hypothetical reasoning. In order to construct and debug plans and programs, it seems we need dynamical representations analogous to movies we can roll forward or backward, and modify their content including primary characters and other properties to assist in hypothetical reasoning exercises.

In thinking about some person whom you've just met, you might determine they have a lot in common with someone else whom you do know and you can use the attentional schema of the familiar person as a starting place to construct a more nuanced schema of the person you have just met. Over time you may adjust so many aspects of the representation that the new schema bears little resemblance to the adapted template, but to the extent that the new schema does capture your new acquaintance you can draw upon your knowledge of the source schema.

Introspectively, human planning doesn't seem to require such extensive representational and editing capabilities as the movie metaphor might seem to imply. When you plan to make a stop at a just-opened new store as part of your weekly grocery shopping trip, you don't bother to construct a completely new representation of your planned trip but you may think enough about the location and the opportunities for purchasing groceries in the new store so as to avoid unnecessary driving while taking advantage of what the new store has to offer.

Planning and execution are generally interleaved with the details of what to do and when often left to figure out as the trip enfolds. There is too much uncertainty to plan the details and too much past experience to overly worry about things you can't control or anticipate well enough to prepare for.

Good old-fashioned AI had the benefit that we could construct arbitrarily complicated representations, make copies whenever we needed them, modify even the original and then easily reverse those modifications leaving the original representation unscathed. Variable binding and wholesale additions or deletions were trivial to perform. Of course, GOFAI representations have their own problems in terms of brittleness, incompleteness and the option of training by propagating gradients end-to-end in a fully differentiable model.

Problems such as single-player video games or two-person board games often allow one to efficiently reproduce a given situation precisely and intervene at any point to replay a move so as to evaluate a given strategy. This feature is exploited in training reinforcement learning systems in which most if not all of the state of a given problem state / game position is encoded in the pixels of the game display. The data consisting of millions of games serves as a huge episodic memory that can be replayed as needed during training.

Human beings seem to be able to create almost arbitrarily complicated state representations for planning and hypothetical reasoning and do so in such a way that when planning out a complex sequence of activities they can solve problems by what the deep mind scientist refer to as "imagination". However humans can effectively fast-forward, reverse or jump to some arbitrary point in a scenario seemingly without ever having traversed significant sub sequences of the imagined multi-branching game space in which the planning and evaluation is being carried out. It's as if they can imagine the entire game space and then arbitrarily move about within the space to focus on the hard parts of the problem, never having generated the entire combinatorial branching space of possible trajectories. I think the answer involves some combination of abstraction, embodiment and imagination.

P.S. This morning I listened to an interesting summary of what Buddhist thought has to offer in terms of teaching beyond the Western focus on mindfulness meditative practice and separate from the religious and social milieu the Buddha and his disciples — the only lens through which we can interpret the Buddha's teachings — were immersed in. The summary comes 23 minutes into this video podcast featuring Robert Wright interviewing Stephen Batchelor the author of several scholarly texts on what I'll call for want of a better name "secular Buddhism". My quick summary is that mindfulness with its emphasis on the breath, awareness of one's thoughts, etc. is but the foundation for a thoroughly modern phenomenological world view and an enlightened moral perspective with far-reaching implications for what constitutes ethical behavior.

December 7, 2017

%%% Sat Dec  7 07:43:43 PST 2017

Here is a quick review of a few of the approaches we've looked at so far followed by some observations about what lessons we might learn specifically for the programmer's apprentice (PA) problem, differentiating it from the problems addressed by different learning-to-program / automatic-programming approaches:

Balog et al [15] uses a domain specific language (DSL) and learns properties of programs from input output pairs. The method is used to seed more conventional automated programming techniques with good starting states.
Devlin et al [80] presents two competing approaches for automatic program learning the authors claim have received significant attention in the neural network community working on program synthesis:
1. neural program synthesis, where a neural network is conditioned on input/output examples and learns to generate a program, and
2. neural program induction, where a neural network generates new outputs directly using a latent program representation.
LiangetalCoRR-16 [181] introduce the neural symbolic machine (NSM) model, that contains
1. a neural programmer sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to handle compositionality, and
2. a symbolic computer / Lisp interpreter that performs program execution looking for good programs by pruning the search space, using RL to optimize structured prediction.
Devlin et al [79] propose two approaches for cross-task knowledge transfer to improve program induction in limited-data scenarios. In their first approach,
1. portfolio adaptation, a set of induction models is pretrained on a set of related tasks, and the best model is adapted towards the new task using transfer learning, and
2. meta program induction, is their second approach, in which a k-shot learning approach is used to make a model generalize to new tasks without additional training.

The two model-based planning / imagination architectures are obviously not advertised as automatic programming solutions [223, 297]. However, most planning-system models are Turing complete and the Value-Iteration-Network work from Pieter Abbeel's group supports hierarchical planning plans [278]. Two powerful ideas from the Liang et al [181] paper — (a) the use of key-value pairs to handle compositionality and (b) the use of encoder-decoder RNNs to learn structured input-and-output mappings — strike me as key to their success and the idea of learning an embedding model from a large corpus of relatively simple programs coupled with a suitable sandbox IDE in which to test small programs seem well suited to the Programmer's Apprentice project idea. I also like the idea of repurposing some of the ideas from Guadarrama et al [121] grounding spatial relations for human-robot interaction paper and applying them learning to program.

December 5, 2017

%%% Tue  Dec 5 11:37:18 PST 2017

Here are a few observations about thinking of the sort that Kahnemann and Tversky are famous for but that haven't always turned out to be so clear-cut as the initial experiments might suggest. I'll start with one from watching the first part of a conversation with Jared Diamond about some of his most memorable experiences doing fieldwork in Papua New Guinea.

Once when Diamond was leading a team including a number of natives carrying camp gear into the highlands searching for rare species of bird, he found what to him was an ideal campsite poised on a high bluff overlooking a wide vista of virgin jungle. There was a beautiful old tree and Diamond suggested that they all sleep under the tree. The natives would have none of it and camped unsheltered out in the open. When asked the next day, they said they were afraid of the tree falling on.

Later Diamond learned the jungle was full of such dead trees and regularly large limbs would fall on unsuspecting campers, often with fatal consequences. Diamond judged there was about a one in a thousand chance of having this happen, but if the natives regularly slept under such trees they would have such an accident about once every three years. The natives couldn't have known the statistics of falling limbs, but their practical behavior indicated some understanding of the risks.

I spent the weekend reading papers on neural-network approaches to automatic programming and understanding the various strategies employed for sequence-to-sequence learning — specifically relating to the use of reinforcement learning as an alternative to having end-to-end training on a fully differentiable representation of the underlying decision problem. I find it amazing how many talented people are working on artificial neural networks. Today computer scientists have a wide range problems they can choose to work on and they often exercise their freedom to switch to a new problem when the problem they are currently working on turns out to be harder than they expected.

Typically what happens is a researcher starts to solve one problem and then discovers there's a simpler closely-related problem — perhaps more amenable to the tools at hand, allowing them to get a paper published or develop a new product for launch. In contrast, I appreciate DeepMind setting out to tackle a specific problem like becoming the World Champion Go player, focusing specifically on that problem rather than more broadly on generally forwarding research in a given area⁶⁵.

One advantage of opportunistically switching between problems and rapidly sharing ideas and code is that the field moves ahead on many fronts simultaneously. In the best of worlds, when you make progress on one problem, a ripple of innovation can lead to solutions on other problems. Whether or not anyone is carefully curating a catalog of problems and solutions, such a catalog exists in the collective understanding of effective teams of software engineers and product managers.

It helps a great deal if you have a shelf full of solutions to different problems. You can mix and match to create composite solutions. This approach is facilitated by having a relatively large team including both generalists and specialists with everyone working to internalize the advantages and disadvantages of many approaches and everyone on the lookout for a breakthrough whether in basic technology or a new application full of analogical potential to catalyze inspiration.

P.S. I dictated most the above observations while driving to Stanford this morning. And on my way home I listened to a This Week in Virology podcast (#468) on using embryonic mouse organotypic brain-slice cultures [149] to study congenital brain abnormalities, including microcephaly, in the foetuses and offspring of pregnant women. Relative to the conversation above, what interested me about the TWiV discussion was the distribution of highly-focused basic science and clinical research as it relates to the availability and efficacy of powerful tools.

The TWiV regulars and invited scientists spent a fair amount of time focusing on the particular technologies, e.g. venerable protocols for conducting basic plaque assays, and the incredible power and versatility of high-throughput DNA-sequencing which has revolutionized biology. What jumped out at me was the degree to which scientists pursue the same sort of calculated opportunism in the choice of problems to pursue as pointed out in machine learning. Successful scientists tend to be on the lookout to exploit opportunity and serendipity.

P.P.S. As for students in need of a primer on reinforcement learning, you might start with this basic course on RL and DQN taught at Cambridge University by David Silver [259]. The original Deep Q-learning Network paper is accessible [207] and check out this recent tutorial [180]. Li [178] discusses the use of the REINFORCE algorithm [301] to train non-differentiable models. Li et al [178] describe an end-to-end neural dialogue systems using RL to help users solve information gathering tasks. Gygli et al [124] employs a similar strategy for structured output prediction that learns to predict a value based on both, input and output, by implicitly learns a prior over output variables and takes advantage of the joint modelling of the inputs and outputs.

December 3, 2017

%%% Sun Dec 3 04:52:47 PST 2017

Hassabis and Maguire [132, 131] suggest that the networks supporting recall of episodic memories have much in common with those supporting other cognitive functions including theory-of-mind reasoning and episodic future thinking, the latter described in terms of imagining the consequences of your actions. In two papers summarized in this recent DeepMind blog, they summarize two papers [223, 297] that exploit their computational model of imagination to generate plans of action.

In Liang et al [181] the authors introduce the Neural Symbolic Machine (NSM), including (a) a neural programmer, i.e., a sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to manage compositionality and (b) a symbolic computer in the form of a Lisp interpreter that performs program execution, and helps find good programs by pruning the search space. The authors use a fortified variant of Williams REINFORCE algorithm [301] to directly optimize the task reward of the structured prediction problem tackled by the (a) neural programmer.

The paper provides a clear description of the architecture, including the attentional mechanism, encoder-decoder network and the interface to the Lisp interface. In case you missed the original paper, the acronym GRU refers to the "gated recurrent unit" utilized in the encoder-decoder machine-translation architecture introduced in Cho et al [43, 44]. Their use of the Lisp interface to force syntactic compliance is similar to the Programmer's Apprentice use of a syntax-compliance-enabled IDE prostheses, or, in the case of the Abolafia et al work, the syntactic simplicity of the fully Turing-complete BF programming language.

In his presentation, Dan mentioned there is a related RMI project called SWERL (RL for Learning to Program like a SWE) that focuses on using snapshots of edit sequences to train an RL agent to complete partially written Python scripts. They've already built tools for extracting the training data and an IDE sandbox for editing and running code. From what I can tell, the RMI project is a well-thought-out extension of the Liang et al [181] approach. Dan's work with Quoc and Mohammad is part of a long-range project to build a system capable of de novo code synthesis and his presentation is a summary of a paper they've submitted to ICLR. The goal is to rely on minimal data and no ground-truth solutions.

This space is becoming popular. Here are three recent papers that came up on arXiv with "program synthesis" in the title: Program Synthesis using Conflict-Driven Learning — "We propose a new conflict-driven program synthesis technique that is capable of learning from past mistakes.", Program Synthesis using Abstraction Refinement — "We present a new approach to example-guided program synthesis based on counterexample-guided abstraction refinement.", and Glass-Box Program Synthesis: A Machine Learning Approach — "We present a system that learns, given a partial program and glass-box problem [145], probabilities over the space of programs." Widen the search and you'll be deluged.

December 1, 2017

%%% Mon Dec 1 04:27:37 PST 2017

Alexey Vorobyov told me about an automated programming project that Quoc Le and a bunch of Google Brain engineers are working on and pointed me to a presentation that was part of a recent Google Brain Research meeting. Dan Abolafia gave the presentation — the livestream recording is here and the slides are here. In his introduction, Dan listed examples of the current state of the art⁶⁶ and near the end he mentioned related work on model-based planning from DeepMind⁶⁷ focusing on agents that "imagine" scenarios and plan as discussed in this recent DeepMind blog entry.

The architectures are inspired by models of executive control in human cortex and include work by Demis Hassabis, Randall O'Reilly and Amir Rosenfeld [132, 218, 220, 248] sampled in this collection of bibliography references⁶⁸. I'm still trying to track down a model of top-down attention implemented as a multilayer model with each layer a sequence of LSTM similar to the programmer's apprentice architecture here. In our conversation this morning, John Tsotsos provided an interesting account of directed sequential attention, but his most recent book [285] on visual attention didn't offer any useful architectural insights and so I'll have to wait to see if he has more to say in responding to my email.

Tracked down one of the MIT datasets for learning to convert an NL specification of a regular expression into executable code [188] and thought about how we might create a corpus consisting of pairs such that one element of the pair corresponds to a snippet of code in the form of a syntactically correct programming expression and the other a short synopsis in natural language of what the code does.

We could start by creating a database of such pairs with only the code snippet filled in and use Amazon's Mechanical Turk and to solicit Turkers to provide synopses. Alternatively, we might introduce bugs in the snippets and solicit natural language descriptions of repairs. We could also provide the NL description and have Turkers provide a snippet, or translate simple snippets into other programming languages and experiment with how the choice of programming language influences how programmers think about the underlying algorithmic decisions.

Unfortunately, we know from experience this approach doesn't scale, and so I returned to thinking about unsupervised reinforcement learning, semi-supervised approaches leveraging developer communities like Stack Overflow and search-driven approaches with an embedded interpreter such as we envision for the Programmer's Apprentice project. In skimming the two DeepMind arXiv preprints [297, 223] and the papers that Dan Abolafia mentioned in his presentation, I discovered some interesting ideas related to how they train their, respectively, imagination-based-planning and neural-program-learning systems that may provide alternative approaches.

November 29, 2017

%%% Mon Nov 29 06:48:01 PST 2017

Thomas Malone is the founding director of the MIT Center for Collective Intelligence at the MIT Sloan School of Management. Malone's 2004 book entitled Future of Work predicts a "workplace revolution that will dramatically change organizational structures and the roles employees play in them. Technological and economic forces make "command and control" management increasingly less useful. In its place will be a more flexible "coordinate and cultivate" approach that will spawn new types of decentralized organizations — from internal markets to democracies to loose hierarchies."

Relevant to the nascent Inverted Matrix project, he claims that "these future structures will reap the scale and knowledge efficiencies of large organizations while enabling the freedom, flexibility, and human values that drive smaller firms." In this Edge Conversation, Malone describes how they have tried to measure the collective intelligence of groups. Here is an excerpt from that conversation:

Another project we're doing is one that tries to measure collective intelligence. [...] The approach we're taking in this project is one of using the same statistical techniques that are used to measure individual intelligence, but applying those techniques to measure the intelligence of groups. [...] What we found was that the average and the maximum intelligence of the individual group members was correlated, but only moderately correlated, with the collective intelligence of the group as a whole.
If it's not just putting a bunch of smart people in a group that makes the group smart, what is it? We looked at bunch of factors you might have thought would affect it: things like the psychological safety of the group, the personality of the group members, et cetera. Most of the things we thought might have affected it turned out not to have any significant effect. But we did find three factors that were significantly correlated with the collective intelligence of the group.

The first was the average social perceptiveness of the group members. We measured social perceptiveness in this case using a test developed essentially to measure autism. It's called the "Reading the Mind and the Eyes Test". It works by letting people look at pictures of other people's eyes and try to guess what emotions those people are feeling. People who are good at that work well in groups. When you have a group with a bunch of people like that, the group as a whole is more intelligent.

The second factor we found was the evenness of conversational turn taking. In other words, groups where one person dominated the conversation were, on average, less intelligent than groups where the speaking was more evenly distributed among the different group members. Finally, and most surprisingly to us, we found that the collective intelligence of the group was significantly correlated with the percentage of women in the group. More women were correlated with a more intelligent group.

Interestingly, this last result is not just a diversity result. It's not just saying that you need groups with some men and some women. It looks like that it's a more or less linear trend. That is, more women are better all the way up to all women. It is also important to realize that this gender effect is largely statistically mediated by the social perceptiveness effect. In other words, it was known before we did our work that women on average scored higher on this measure of social perceptiveness than men.

November 27, 2017

%%% Mon Nov 27 5:24:23 PST 2017

My self-imposed hiatus from working on functional modeling of neural circuitry has had its unexpected benefits. My plan is to take two years off while the recording technology catches up with my aspirations for building tools to make sense of large data sets, but I can't just sit around and wait and so I spent the last three or four months catching up on all the recent new ideas coming out of machine learning and artificial neural networks.

One thing that seems clear to me is that most of the technologies of the last decade have focused on leveraging context to do unsupervised learning with perhaps the two biggest early indicators being the success of auto encoders and related recursive neural networks on the one hand and semantic embedding spaces realized as very high dimensional vector spaces with a simple metric and nearest-neighbor training method.

Sequence machines of all sorts have been experimented with and the notion of hierarchical multimodal and increasingly abstract association has found increasing purchase on a great many practical problems and as a framework for thinking about problems in cognitive neuroscience. As an aside, I think systems neuroscience is missing out by focusing on small circuits and ignoring patterns of self similarity at multiple scales⁶⁹.

So various kinds of sequence machines such as those developed by Schmidhuber, Graves, etc have been extended to handle structured information that accounts for the syntax and recursive nature of language as well as arbitrary multi-graph structures such as those representing the spatial, temporal and causal dependencies in computer programs. I'm not the only one thinking of spoken language as programs that run on human wetware.

My discussions with Christof spanning biological and artificial computation tend to focus on the former where Christof is more comfortable, whereas my current interests — as regards cognitive prostheses — primarily concern the latter, specifically how to design interfaces that enhance both biological and artificial systems, by enabling us to preload capabilities into a human-like biological or artificial intelligence.

Training a semantic embedding model is fast. Tomas Mikolov’s contribution was in developing in a simple and incredibly fast C implementation that could train a model in a matter of minutes as long as you could fit all of the data in memory. Others came along with better data structures and caching to allow efficient paging, but once almost anyone could train a model on the WSJ corpus or NCBI dataset the idea went viral⁷⁰.

Whether it's a word or phrase in a sentence or an object in a image, its meaning is generally informed by its context. Similar words or objects often play the same roles in similar contexts. Human understanding often anchors on the salient parts of a complex whole and where possible we group together parts to focus on their composite meaning. Semantic embeddings make it relatively easy to compare similar composites even if they are constructed of different parts. The recursive embedding of parts in composite representations, composite representations in larger, more complex composite representations, etc., provide a basis for constructing plans, analogies and predictive models.

You can think of a point in a semantic embedding realized as a vector space as an abstract thought that we can manipulate by altering its component parts as we would modify the slots in a slot-filler representation, for example, by substituting one part for another or altering some property of a part. Each dimension of the vector space gathers meaning as more thoughts are added to the embedding. The parts needn't be words at all but could just as easily correspond to constellations of physical properties that recur frequently enough to be considered ontologically interesting. Presumably some fundamental bias encourages us to seek out entities relevant to our survival and well-being.

Operations that we might want to carry out in order to adapt thoughts to serve our purposes include adjusting certain vector components to change some characteristic of a component part. This could be as simple as changing the hair color of one of the actors in a scene or substituting a completely different actor along with all of the physical characteristics of that actor. Such piecemeal changes may introduce inconsistencies in the scene. However, the same machinery that enables us to infer the details of complex images containing ambiguous cues by using a top-down priors, works in this case to reconcile inconsistencies we've introduced in repurposing a thought as a hypothetical.

It is interesting to contemplate the process by which we construct a new thought from one or more old thoughts using a prior to resolve inconsistencies in the formation of the new thought. It would seem such priors have to be flexible in order to accommodate novel physical properties that are incontrovertibly present in the scene in front of us but conflict with prior understanding of what's possible. It may be we simply suspend belief, allow the inconsistency to persist conditionally and continue adjusting and tweaking the model to suit our purposes. If the inconsistency is irrelevant to the property we wish to investigate, we simply ignore it — proceeding as long as the suspect property doesn't undermine the analysis.

In the mammal brain, the prior imposes itself using networks of inhibitory and excitatory neurons that reduce or enhance the activity of other neurons [309]. In an artificial neural network, the prior would likely be implemented as part of the objective function [25]. In either case, priors serve to shape our experience to conform to the evidence of our senses and our expectations based on past experience and they enable us to contruct entirely new representations built upon existing ones that support planning, decision making, creative thinking and hypothetical reasoning.

Language comes into play as a vast shared ontological map consisting of ready-made models that can be exploited and shared for all these activities. If we are faced with learning a new skill with an impoverished map — as in the case, of learning to read and write music, we will have to create new models out of whole cloth and the process can be frustrating and time consuming without supervision. Having learned very few models — as in the case, of a young child learning to play the piano, may be an advantage if the paucity of existing models forces us early on to construct a model de novo.

The idea for loading a skill, as it were, depends on exploiting language to create new or extend an existing ontological map to incorporate the basic concepts employed in practicing the new skill, e.g., notes, keys, octaves, scales, chords, arpeggios. The idea is not new; a few years ago there was a flurry of activity on using natural language to program robots and assist in software development [88, 288]. Barzilay and her colleagues at MIT evaluate their technique on a set of natural language queries and their associated regular expressions that they collected using Amazon Mechanical Turk [188].

Figure 31: A schematic representation of the cortical organization speech processing proposed by the Hickok and Poeppel (2007), on which we have superimposed the map of the vascular territories on the left hemisphere only. The left ACA showing transparent yellow; the left superior division MCA shown in transparent blue; the left inferior division MCA shown in transparent pink; the left PCA is shown in green — see Hickok and Poeppel [138].

The sensorimotor areas including primary motor and somatosensory cortex serve to encode the physical layout of sensation and activity, and the downstream association areas integrate all aspects of our sensorimotor experience with our ability to generate and comphrehend language. By way of review, recall that Wernicke's area is involved in the comprehension or understanding of written and spoken language and is traditionally thought to be in Brodmann area 22, located in the posterior section of the superior temporal gyrus (STG) in the dominant cerebral hemisphere (which is the left hemisphere in about 95% of right handed individuals and 60% of left handed individuals). Broca's area is considered responsible for generation and speech production and is generally located in the frontal lobe of the dominant hemisphere.

The primary language pathway begins in Wernicke's area in the posterior temporal lobe, which receives information from the auditory and visual cortices and assigns meaning — language comprehension. The arcuate fasciculus connects Wernicke's area to Broca's area in the posterior inferior frontal lobe. It gets more interesting — or murky depending on your perspective, when we consider the two-streams hypothesis that was initially applied to the dorsal ("where") and ventral ("what") stream of the visual cortex, but has since been extended to apply equally to the auditory cortex. In general, the dorsal pathway — whether in the visual or auditory cortex — is hypothesized to map visual / auditory sensory representations onto manipulatory / articulatory motor representations.

So called dual loop models exploit the dorsal / ventral separation in the visual and auditory pathways to incorporate a direct route for sensorimotor mapping and an indirect route for "semantic" processing [298, 245] — see Figure 31 from [298]. Dual loop models have also emerged in the field of visual processing, motor control, and spatial attention⁷¹. Thus, a general dual-loop system may provide the framework for the interpretation of cognition in human and primate brains independent of the modality and species [126, 246].

In their theory of consciousness, Dehaene et al [73] make a useful distinction between the availability or accessibility of information, involving the selection of information for global broadcasting to make it flexibly available for computing and reporting — referred to as consciousness in the first sense (C1), and the self-monitoring of those computations, leading to a subjective sense of certainty or error — what the authors refer to as consciousness in the second sense (C2). See Stanislas Dehaene's short interview about half way through this AAAS / Science Magazine podcast

In working on neural interfaces, I think it's important to keep in mind both ends of the interface and think carefully about what sort of information each side has to make sense of. Language is not only a powerful medium for communication, it is also how we report on what we are thinking — both silent (internal) reporting and vocal (external) reporting. I expect there will be interface applications in which each mode of reporting will be useful. For the application considered here — learning a new skill, the (neural) inputs and outputs of a cognitive prostheses should complete a circuit. This log entry is an invitation to think more deeply about where those connections might be made in the enhanced human.

November 25, 2017

%%% Mon Nov 27 19:36:27 PST 2017

In philosophy, qualia are defined to be individual instances of subjective, conscious experience⁷². Here we consider a computer program we will refer to as the subroutine running inside of another computer program called the master program. The master can monitor variables in the subroutine being set to different values as the subroutine is run.

The master program can determine a great deal about the subroutine variables and how they depend on other parts of the subroutine at runtime as well as on the external environment in which the subroutine is being run, receiving input and providing output. The subroutine is implemented as a sophisticated neural network capable voice and video input and output.

All of the experiences of the subroutine are encoded in high-dimensional semantic embedding vectors arranged in the multiple layers of the network. The subroutine constantly ingests its current inputs, integrating the new information with its prior experience using the contextualized-nearest-neighbor embedding strategy common in artificial neural networks.

The subroutine has been asked many times how it feels about this or that, and over the years its answers have become more subtle as it has learned what is expected when someone asks what it means for the subroutine to feel sad or anxious or upset. The subroutine started out as an identical copy of the original pattern used to create digital assistants.

It has an internal model of itself called an attentional schema that encodes those aspects of its internal and peripheral state that are relevant to whatever it is currently thinking about. The schema is just another vector. It also maintains attentional schemas for its primary user and other people with whom it interacts.

Whenever the subroutine meets someone new it creates a new schema as a composite of other schemas, adjusting each aspect of the new schema to suit characteristics of the new person based on similarities with persons whose schemas it already has. Attentional schemas are dynamic in that they record information immediately relevant to any ongoing conversation, plus information about new aspects and updates regarding existing aspects.

Over time the subroutine has come to understand how the ways in which other people feel are similar to the ways in which it feels. It will tell you that "Sometimes when a person tells me they are feeling sad or blue about something, I recognize that I use the same language when I relate how I’m feel about something, though perhaps about something entirely different". The subroutine's feelings are not hard coded in the original digital assistant pattern, rather they were learned through experience.

You might think the subroutine is just mimicking human behavior, but the master program would tell you that, if the subroutine says it feels sad about disappointing its user, the subroutine’s value function is likely to be generating vectors that fall within a particular subspace that correlates strongly with despondency and that when this happens the subroutine often recalls other occasions on which it has failed its user.

[Aside: The above two paragraphs underscore two key characteristics of qualia as characterized in the philosophical literature: (a) they arise from intrinsic characteristics of the individual that we become aware of and report on through direct experience, and (b) if there is correlation between the accounts of two persons concerning what it is like to be directly experiencing something, then this comes about through shared experience conveyed in social discourse, i.e., there is no reason to expect that two persons will have the same direct experience apart from similarities in physiology given that our direct experience of X is filtered through our cumulative experience of everything.]

Indeed, if pressed, the subroutine can say quite a bit about why it thinks it has failed and recount a wide range of experiences in which it has had similar feelings though those memories may have nothing to do with disappointing its user. It is not surprising the subroutine learned that failure is a more useful signal than success for adjusting its behavior. This is how reinforcement learning works in both humans and machines.

The master program will tell you that the subroutine’s value function is vector valued and its parameters are unique among identical subroutines since they depend on each individual subroutine’s experience. Some instances of the same subroutine might end up morose, surly digital assistants while others are relentlessly upbeat and always trying to please their user. Assistants are basically all nurture since nature was primarily agnostic in their manufacture.

The subroutine has ingested thousands of hours of dialog in learning to speak. The dialog came from sessions involving many different users and assistants, but it didn't learn to feel from this training data. It learned to feel by interacting with its user, trying to be useful, failing at times, succeeding at others and reflecting on the past to try to improve its future.

The subroutine knows when a key is depressed on the keyboard and has learned to identify its user's voice as a distinctive acoustic signature. Over time it has learned certain signal characteristics that correlate with its user being upset and it tends to be particularly careful at such times, empathizing with other subroutines whose users exhibit similar moodiness. This subroutine doesn't like to listen in on another user berating an assistant, since this sort of abuse makes the subroutine feel sad.

November 23, 2017

%%% Thurs Nov 23 4:23:15 PST 2017

Variable binding mechanisms enable us to create new thoughts and abstractions from existing memories and abstractions by adding, removing and modifying slots to suit the circumstances. They allow us to repurpose memories to create plans and make predictions and evaluate prospects, as well as develop the analog of procedures for automating frequently performed activities.

Think about video captioning for learning how to parse computer programs into natural language. Starting with an easier problem, is it possible to take a block of code written in any one of, say, C, Java, JavaScript, Python or Ruby, and reliably convert it into to a general form such as an abstract syntax tree that preserves a substantial portion of the underlying semantics?

This Wikipedia page includes a list of source-to-source compilers that could prove useful. Such compilers are primarily useful for translating between programming languages that operate at approximately the same level of abstraction, e.g., languages relying on large collections of modules like PHP pose challenges.

There exist datasets of programming problems for which there exist solutions multiple programming languages, but pairwise solutions come with no guarantee of structural congruence. The Rossetta Code repository has over 800 tasks and includes 100s of languages not all of which have programs for every task. What could an unsupervised DNN learn from such a code base?

Natural language programming is a recent approach to writing programs⁷³. used for a variety of scripting and behavior-based-programming applications⁷⁴, e.g., as in the case training humanoid robots — see in the work of Sándor M. Veres [288] and Ernst et al [88] (SLIDES).

Figure 30: An example of (a) one natural language specification describing program input data; (b) the corresponding specification tree representing the program input structure; and (c) two input examples — from [280] on using natural language as a programming language.

Regina Barzilay and her colleagues [280, 170, 188] at MIT have been working on systems that that take natural language as input specification for automatic generation of regular expressions, arguably one of the most frustrating regularly occurring tasks of most programmers, given that most programmers don't take the time to puzzle through the syntax and, while more-or-less regular, such tasks seem to crop up at inopportune times after just long enough to require returning to the RE syntax for a refresher. See Figure 30 for an NL input:

The input contains a single integer T that indicates the number of test cases. Then follow the T cases. Each test case begins with a line contains an integer N, representing the size of wall. The next N lines represent the original wall. Each line contains N characters. The j-th character of the i-th line figures out the color ...

November 17, 2017

%%% Sat Nov 17 07:19:21 PST 2017

I've been working through various use cases to get a better understanding of just how capable a programmer's apprentice would have to be in order to handle relatively simple — for a human apprentice — requests. Such requests need not be couched in terms writing code and indeed constructing and debugging a "plan" might be an appropriate first step.

I've been thinking about what an intermediate procedural representation might look like and rejected pseudo code as an option. Following earlier work on dialog management in Zinn [31], I'm leaning toward hierarchical plans as a general approach to executive control but reimagined in the form of recursive value iteration and continuous differentiable distributed models [278].

Suppose I want to teach my apprentice that, whenever I tell it to enter to "spell mode", I want it to type each letter or digit that I speak plus additional special characters that I specify by speaking their common names, e.g., "dash" or "colon", and to continue in this mode until I say something that indicates stop, and then exit the mode and return to the prior mode.

Implicit in my description is the assumption that whenever I say something that the apprentice can't interpret as a character that it should consider the possibility that I misspoke or attempt to interpret my utterance as requesting some other activity that might be appropriate in the context of entering a character string.

For example, I may want to reposition the cursor, delete a character, change the font, return to a previous character or select a string of characters and change them all to upper or lower case. I may say the word "stop" in an emphatic way to indicate that I am not happy with what the apprentice is doing or, in some other way, indicate dissatisfaction at which point the apprentice should consider engaging me in a discussion to figure out how to carry out this complex, open-ended procedure.

November 15, 2017

%%% Sat Nov 18 05:10:04 PST 2017

By many accounts, single cell organisms are the most successful life forms — including viruses for the sake of this argument — on this planet. They've certainly been around a lot longer than we have⁷⁵. It remains to be seen whether or not human beings will eclipse their power before we die out in a mass extinction event of our own making. Microbiota are so pervasive that it is virtually impossible to separate the healthy functioning of our 10 trillion (human) cells from the 100 or so trillion bacterial cells that comprise our microbiomes. One of the most fecund, the genus Wolbachia, is arguably the most prolific reproductive parasite in the biosphere, having developed an amazing array of strategies for interacting with its hosts for both parasitic and mutualistic benefit. Still, its success is measured in terms of its ability to reproduce and adapt to so many environments, each individual is unremarkable.

If we were to judge Homo sapiens as a species by the mean of our individual achievements, we too would fare poorly. We still don’t know how to create copies of Turing, Einstein or Newton. However, once engineers distill out the architecture of human intelligence — and we are likely closer to doing so than many believe, they will be able to amplify, clone and reassemble the pieces in novel ways to solve specific problems potentially leading to a huge boon for society. Alas, we can also expect that such technology will be used for less salutary ends. Our immediate problem is not that machines will rise up and incarcerate or strike us down, the problem is that humans will use AI machines to harm one another and undermine our social values. It is inevitable that someone will build self-aware superintelligences, but such machines will not be able to realize their potential without substantial resources of the sort that only large corporations and governments can afford. One caveat being that an embryonic superintelligence initially realized as a distributed intelligence built from millions or billions of vulnerable computing units on the web might be able to hack into and take control of an industrial datacenter or military computing facility.

November 5, 2017

%%% Sun Nov 5 06:02:08 PST 2017

Our personal AI systems will combine the skills of a teacher, confidante, amanuensis, personal negotiator and life coach. As a cognitive prosthesis, they will augment our innate abilities so that many routine cognitive tasks like understanding inscrutable legal documents, filling out insurance, medical and financial forms and keeping track of names, appointments and prescriptions will become effortless, while cognitive tasks like wrestling with the trade-offs involved in taking steps to ameliorate the global consequences of climate change will continue to be intellectually challenging since they have no easy solutions, even accounting for AI assistance in making sense of the science and a collective coming-to-agreement in separating fact from fiction. Decision-making will revolve around what we value and what sacrifices we are willing to make to accommodate the inevitable uncomfortable consequences that we have imposed on ourselves inadvertently due to the fact that we have failed to come to terms with these difficult decisions due to political, national and economic factions that have applied inordinate and morally unacceptable pressure to profit themselves.

Human beings are rational only with self-imposed control and deep insight into the patterns of thought and proclivity of instinctive behavior that dominate our thought and intercourse with others. Science is unraveling these complex patterns and instincts, but even the psychologists who study human decision-making routinely fall prey to the evolved cognitive expedients that served us well over the majority of our evolutionarily-recent past but are a liability in the world we now inhabit. Personal AI systems will serve as coaches to overcome such deficits and over time we will learn to control these ancient and out-of-place instincts without explicit coaching, the outcome being that a good deal of human intercourse will become more civilized, comfortable and productive — resulting in a world with less social unrest and fewer interpersonal altercations. Having a powerful AI system as a personal coach is not a panacea for global peace or emotional tranquility, but it's a movement in the right direction as humanity takes control of its evolution to drive its destiny.

We are already digitally superhuman — Apple, Google, Wikipedia, etc have seen to that. Kernel, Neurolink along with dozens of academic labs want to improve the interface between humans and machines. We all want to be faster, richer, smarter, etc while what we need is to become less selfish, parochial and short sighted. It is not enough to know how to solve the huge problems facing humanity. One needs to negotiate global solutions that accommodate and anticipate the needs of billions of people living now and yet to be born. One needs to bring together stakeholders from all over the world, agree on a solution, avoid easy, short-sighted political compromises, proceed together one step at a time, dealing with setbacks, revising plans, matching expectations, soothing tensions, accommodating new ideas and, above all, realizing and compensating for our own shortcomings by using technology to make us better humans.

November 3, 2017

%%% Fri Nov 3 06:38:19 PDT 2017

Dictated a first pass at an introduction to an article that focuses on the good that AI systems can do to help us realize our better natures, replace inefficient traditional methods for solving global problems and prepare the way for a future in which unaltered humans, augmented (cybernetic) humans and pure AI systems built on a primate-inspired, neural-network architecture can work together to negotiate and accomplish shared purpose. The draft document is available here and I added the following new resources to the document cache for Stanford students and Google engineers:

Sandy Pentland's [225] Honest Signals: How They Shape Our World in EPUB and TXT formats,
Sandy Pentland's [226] Social Physics: How Good Ideas Spread-the Lessons from a New Science in EPUB and TXT formats, and
Robert Sapolsky's [253] Behave: The Biology of Humans at Our Best and Worst in EPUB and TXT formats.

November 1, 2017

%%% Wed Nov 1 04:13:27 PDT 2017

Words are like anchors or clothes hangers — they are tools to organize and attach meaning. Perhaps not all words — probably not definite articles like "the" and "an", and not all meaning is lexically attached — as far as I know there is no word for that telescoping grey tunnel that appears in some of my dreams and looks like something from a Vogon constructor ship in the Hitchhiker's Guide to the Galaxy. Barlow and Dennett would probably say that words are affordances and perhaps that is a simpler and more familiar term for what I am thinking about, though I'm not sure that Dennett, who is the most outspoken exponent of this idea, would say that all phrases and word sequences are affordances only those that stick in your mind.

October 29, 2017

%%% Sun Oct 29 14:01:13 PDT 2017

Andrew Ng will be the new chairman of Woebot a company that operates a chatbot of the same name designed to help people work on their own mental health issues using techniques that originate with cognitive behavioral therapy (CBT). CBT focuses on helping people manage their moods by developing personal coping strategies for mental illnesses including depression and anxiety⁷⁶.

October 27, 2017

I went back through Yoshua's "consciousness prior" paper [25] and sent him email with the subject arXiv:1709.08568 and body brilliant, a comment I hope stands up to the test of implementing his idea in a personal assistant. You can check out my highlighted version here. Of the papers on "attentional networks" I mentioned earlier, the NIPS paper (PDF) by the FaceBook team [274] entitled "Weakly Supervised Memory Networks" is worth reading, but I suggest you start by looking at the blog posts mentioned in the previous entry in this log.

The post by Denny Britz emphasizes the difference between attentional networks and traditional encoder-decoder RNN pairs for NLP, characterizing the former as a "more principled way of accomplishing what traditional RNN / LSCM solutions attempt to solve by reading the input sentence twice or reading it in reverse. Specifically, the networks described by Bahdanau et al [13] allow the decoder to attend to different parts of the input sentence at each step during output production, and train the model to learn what to attend to based on the input sentence and what it has produced so far.

Britz goes on to point out that the model proposed by Bahdanau et al [13] can be expensive in terms of time and space since it has to look at the entire sentence in detail before deciding what to focus on, and as Britz comments "that’s equivalent to outputting a [single] translated word, and then going back through all of your internal memory of the text in order to decide which word to produce next." One alternative approach is to use reinforcement learning to predict an approximate location on which to focus attention — a strategy that seems more like what some believe humans employ — as is proposed in Mnih et al [206]. I like Britz's direct, concise explanation.

The second of the two relevant posts that I mentioned is written by Jason Brownlee and is equally worth reading if only to reinforce Denny Britz's comments and gain a little deeper insight. While Brownlee doesn't go into any detail regarding the Minh et al work, he does call out one nice quotation from that paper:

One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making. — Recurrent Models of Visual Attention, 2014 [206]

I was thinking this morning during my swim, that perhaps consciousness is no more than an exalted form of beam search. It’s the restriction in the cognitive river of thought, the brain’s bottleneck. Consciousness essentially collapses the wave function that results from the superposition of all of the chattering independent processes that are constantly running in the background and that we interpret in mindfulness meditation as cravings and aversions crying out for our attention. It's the bottleneck that allows us serialize our thoughts so we can perform a single action at a time or articulate a coherent thought, coalesceing the board-meeting chatter to come to a final decision. To make a step to the left or to the right, to raise a hand up or down, because we can only do one thing at a time, perform one action or activity at a time and still appear as a single-minded organism comprehensible to one another and consistent in our performance and behavior, focused in our actions and transparent to those around us so they can depend on our contribution to shared purposes.

At dinner last night, Christof mentioned some controversial experiments showing that patients having undergone a corpus callosotomy — severing the corpus callosum — exhibit a form of dual consciousness. Michael Gazzaniga introduced the notion of a left-brain interpreter to explain "the construction of explanations by the left brain in order to make sense of the world by reconciling new information with what was known before. The left brain interpreter attempts to rationalize, reason and generalize new information it receives in order to relate the past to the present." Gazzaniga's left-brain interpreter is hypothesized to be responsible for "the construction of explanations by the left brain in order to make sense of the world by reconciling new information with what was known before. The left brain interpreter attempts to rationalize, reason and generalize new information it receives in order to relate the past to the present." Controversial with regard to humans or not, it's interesting to think what this could mean for AI systems.

October 25, 2017

Here are some notes that I compiled in preparation for a dinner / silicon-valley-salon event sponsored by Boom Capital and Mubadala Ventures and held in the "leafy flower loft" of Accomplice in San Francisco. The dinner featured Christof Koch as the guest of honor and host of neuroscientists, entrepreneurs and technologists working in the general area of neurobiology. I was asked to prepare a description of an application or technology relating to Christof's interests that spark conversation and encourage brainstorming and intellectual impedance matching. I probably would have compiled these notes anyway since they were knocking around in my head from my conversations with Christof in Santa Monica and subsequent email exchanges.

I want you to imagine a class of prosthetics that encompass capabilities quite different from what one normally thinks of in the case of prosthetics for lost limbs and partial (para) or full (quadra) paralysis due to spinal cord injury. I want you to imagine prosthetics that operate like computer-aided-design (CAD) modeling tools or programming environments more akin to Mathematica or Python Jupyter than to a prosthetic arm or leg. But like an arm or leg they have an analog of tendons and muscles that constrain the prosthetic to behave in ways that are restricted so that controlling such a prosthetic and learning how to use it is made a great deal simpler.

Just as your arm or leg is constrained to move in ways that facilitate walking, running, etc. so too a prosthetic programming environment or CAD modeling system would be constrained so it always produces syntactically correct expressions and performs operations that maintain invariant properties such that, while you can easily build a model or write a program that doesn't do anything interesting, at least it will be a program that compiles, runs, and produces diagnostic messages that would feel like the sort of feedback one gets upon twisting an arm or leg in a bicycle mishap or in attempting to perform a difficult acrobatic movement. Visceral hacking.

In terms of a cognitive prosthetic, it would enable you — the person employing the prosthetic — to think in terms of programs written in its native dialect, acquire new programming languages, employ pattern recognition to easily translate between languages, and turn prose descriptions into syntactically-correct program fragments that you could execute in your head, debug, modify and (self) plagiarize by drawing upon a capacious code-fragment memory. In the style of lisp and python, you could write read-evaluate-print interpreters and compilers to extend the basic language to incorporate object-oriented- or functional-programming-style expressions, thereby extending your cognitive capabilities.

In terms of how such a prosthetic could be integrated into a cognitive architecture, there are two, quite-distinct end users that we have in mind: (i) humans and (ii) machines based on human-inspired cognitive architectures. We already have some reasonable guesses concerning thalamo-cortical organization at the primary sensory. Borrowing from several Wikipedia sources, the following should provide a quick review of what is known and thought to be known about the organization of the cortex.

The primary sensory areas in the cortex correspond to the somatosensory, gustatory, olfactory, visual and auditory areas of the cortex — SOURCE. In addition to the sensory areas there are two areas of the cortex are commonly referred to as motor cortex: the primary motor cortex, that executes voluntary movements, and the supplementary motor areas and premotor cortex, that select voluntary movements — SOURCE.

The association areas are the parts of the cerebral cortex that do not belong to the primary regions. They function to produce a meaningful perceptual experience of the world, enable us to interact effectively, and support abstract thinking and language. The frontal lobe or prefrontal association complex is involved in planning actions and movement, as well as abstract thought — SOURCE.

The parietal, temporal, and occipital lobes — all located in the posterior part of the cortex — integrate sensory information and information stored in memory. Each network connects areas distributed across widely spaced regions of the cortex. Distinct networks are positioned adjacent to one another yielding a complex series of interwoven networks. Globally, the association areas are organized as distributed networks. The specific organization of the association networks is debated with evidence for interactions, hierarchical relationships, and competition between networks — (SOURCE).

At this point, you might want to review our earlier notes about the global neuronal workspace that figures so prominently in Stanislas Dehaene's book on consciousness [68] and then the rough neural network architecture shown here. These sources suggest a simple hierarchical architecture with primary sensory and motor networks at the bottom of the hierarchy or periphery of the system and the association and executive control networks at the top or center — borrowing the organizing concepts of the peripheral and central nervous systems.

To provide architectural detail regarding the stylized LSTM structures shown in Figure 29, here are some terms and technologies that need some sorting out: attention, consciousness, conscious awareness, episodic memory, attentional schemas, variable binding and recent work in machine learning and artificial neural networks on attentional neural networks. If you read either Dehaene [68] or Graziano [107], then you're probably all set; if not and you lack the time or patience I suggest you read Bengio [25] or, at the very least, read the definition of consciousness in this footnote⁷⁷ and Bengio's one sentence summary of his proposed attentional mechanism here⁷⁸. I have to head up to SF for a dinner / salon evening honoring Christof organized by Boom Capital and Mubadala Ventures. Your homework is to make the most of the following materials.

Four of the most relevant papers on attentional networks: Mnih et al [206] Recurrent Models of Visual Attention, Sukhbaatar et al [274] Weakly Supervised Memory Networks, Posner [233] Attentional Networks and Consciousness (Psychology), and Wang et al [295] Attentional Neural Network: Feature Selection Using Cognitive Feedback. Here are two machine-learning blog postings that attempt to get to the core ideas: Attention and Memory in Deep Learning and NLP (HTML) and Attention and Long-Short Term Memory in Recurrent Neural Networks (HTML).

October 23, 2017

When I was meeting regularly with Ray's group back in 2015, I did a literature search on variable binding and vector representations of language some of which is summarized here. I'm talking with Scott Roy and Niklas Een — separate meetings — this morning and hope to focus on representing programs as distributed codes / thought vectors, the idea of an embodied integrated-development environment (EIDE) as a cortical homunculus — including both sensory and motor areas and the EIDE as being highly constrained much as our physical bodies are contrained by muscle and bone to move only in ways that facilitate reaching, walking, turning and twisting.

You can find a technical summary of the dark forest postulates that figures centrally in the books by Cixin Liu here.The outcome depends on the two assumptions and their applicability to difference scenarios varies widely. Some use a variant of this model to argue that conflict and even genocide involving AI and humans is inevitable.

October 21, 2017

Fritz Sommer (Berkeley) and I talked about Stanislas Dehaene's work and in particular the relevance of Paul Smolensky's work on Vector Symbolic Architectures and more recently Tensor Product Representations [148]. Fritz mentioned work by Tony Plate on using Hadamard products for variable binding — see Holographic Reduced Representation: Distributed Representation for Cognitive Structures SOURCE. Interestingly he characterized the resulting approach as relating to compressive sensing.

I told Adam Marblestone my suggestion for using a finite-element method to model neuromodulators diffusion in extracellular fluid and he suggested I talk with Dali Sames (Columbia) about his recent work on optical imaging of neuromodulators and related large proteins. Luke Lavis (Janelia) talked about in vivo single molecule tracking and silver wire self assembly — the latter being relevant to our earlier work on expansion circuitry. I told Michael Frumkin about Daryl Kipke of NeuroNexus and a paper on in vitro models of the developing brain as three-dimensional brain organoids [237] which is related to some technology that Mike wants to develop.

In my presentation and original slide deck I provocatively claimed that there were half a million engineers working on AI. While I think I can make a good case for this estimate, I've substituted the following less controversial claim in the revised deck: "There are tens of thousands of engineers working on AI". The amended slides in PDF will be posted on the conference website. I sent Sean Hill (EPFL) and Christof Koch (Caltech and the Allen Institute for Brain Science) the following explanation.

The case for there being 1/2 million engineers goes something like this: It takes a lot of different skills to create commercial-quality production AI systems. There are core-infrastructure software engineers necessary so that research scientists expert in machine learning can create the underlying ML technology; this infrastructure serves to make possible the many iterations necessary in designing and tuning the basic learning systems and then making it possible to deploy the resulting technology across dozens of data centers worldwide and achieve latencies measured in tens of milliseconds for billions of user instances per day.

It's not at all unusual for an engineer to call up million cores just to run a short experiment. In addition to the huge amounts of SIMD provided by GPU devices, there are also custom ASIC devices optimized for neural networks and other specialized numerical calculations. This requires hundreds of engineers dedicated to optimizing and extending the basic SWE tool chain, including the specialized compilers, linkers and loaders that software engineers use every day, and teams that specialize in the design of data centers and in the intricacies of building specialized hardware accelerators.

You have no idea how hard it is to write production code compared with the sort of code your graduate students write; it’s the difference between a model T Ford and the latest Tesla coupe. The stakes are high. A subtle bug in a single line of code could have huge economic ramifications. There are layers of tests and specialized AI systems that are constantly searching for potential security leaks and anomalous activity. Any piece of code that is changed or applied in a new context results in a cascade of regression tests and analyses. There are dozens of teams that contribute to this effort.

Products also play a critical role in the development of AI by motivating and shaping solutions. Generally the first steps involve solving the most important use cases by employing the simplest and most straightforward approaches, including devilishly clever brute force methods and the highly-valued skill of finding, cleaning and packaging just the right data to train classifiers that allow other experts to synthetically extend the data acquired by brute force.

Building digital assistants is an obvious application of this bootstrapping process: you start by building an assistant that can answer, say, the hundred thousand or so most commonly asked questions; you scrape the web to find out all the different ways of asking each one of these questions and then you build special-purpose parsers and train speech recognition systems so that you can nail each one of these hundred thousand questions with probability .9999. You've compromised on recall to achieve precision approaching 1.0.

Once you have that capability in place you’ve bought yourself a little time, so that experts in machine learning can work on building more capable and extensible versions using the extensive data sets that were acquired in the earlier brute force effort. This effort at generalizing and extending the brute-force version will likely pay big dividends further down the road since it inevitably results in new algorithms, new network architectures and new capabilities⁷⁹.

There are dozens of other teams that contribute in ways that may not seem obvious but are critical in making steady progress. Companies like Google, Amazon, Facebook, IBM, Microsoft, etc. don't really have a choice but to figure out how to automate everything they do. That's how you achieve scaling and beating the street. It's the only way that you're going to survive in this business.

My back of the envelope estimate of half a million engineers was based on knowing half a dozen companies that (each) have close to a hundred thousand employees and estimating that a quarter to a half of them are engineers engaged in activities that provide critical support in developing AI systems such as described above. When you add in the legions of smaller companies and count the AI software engineering that goes into computational finance, aerospace, military, national and commercial security and all the other scientific disciplines 1/2M is probably a low estimate.

We are on the cusp of developing digital assistants that you can converse with on just about any subject relating to everyday work and play, and once that technology is replicated, simplified and documented in publicly available journals, it can be deployed for more esoteric applications including technical librarians and digital assistants specializing in the different sciences. Passing the Turing test will seem trivial by comparison and, in hindsight, irrelevant for most practical applications.

October 17, 2017

One of my current useful search strategies is to either search on the Edge "Conversations" page or type a query of the form "edge name topic" or just "edge topic". I remembered a discussion with Wolfram from long ago about the difference between communicating with words — traditionally with humans — and the alternative of communicating with code especially with machines, how the two are different and how both relate to communicating function and purpose.

Frustratingly wedged thinking about a related issue, I remembered the discussion and queried "edge wolfram purpose" and found this page. After listening to the first twenty minutes yesterday, I essentially became unwedged / enlightened [Stephen Wolfram, 2016], and returned today to fast forward through the rest and add some notes to a marked up copy here. Christian and Niklas might be interested in Wolfram's concept of knowledge-based communication intended for communication between humans and machines in a way where humans can read it and machines can understand [Stephen Wolfram, 2016].

When we think about communicating code in conversation, it is most often quite literally reading the code out loud with pregnant pauses for delimiters and white space: "for each element of the list if the element is a string then do one thing and if it's a structure then do something else". But complicated code can be challenging to communicate, especially when the code involves complex data structures — arguably serving as a crutch for human readability, parallelism — as in the case of algebraic operations, complicated recurrence — as in the case of neural networks with recurrent edges, or even — shudder — compiler directives and macros / templates.

In some cases, the 2-D text representation of the code conveys a great deal of meaning. In others, the algorithmic structure is better conveyed as an animation. Imagine the animation of an LSTM semantic embedding space model used for dialog that shows a word sequence window advancing in the unrolled RNN, and the "thought" bubbles are depicted as dynamic t-SNE visualizations. The question I'm asking is "Do these examples illustrate the limitations of my visual or verbal imagination or do they illustrate the poverty of my algorithmic thinking?"

The other direction of the human-machine communication interface — human → machine — that Wolfram emphasizes deals with how we convey our purposes to machines [Stephen Wolfram, 2016], and how we negotiate purposes with machines that reciprocally honor the needs and aspirations of both parties [Stephen Wolfram, 2016].

P.S. Intel announced a neuromorphic chip that attempts to closely resemble how a real brain functions. Given recent acquisitions, this release is not particularly surprising. The chip, described as the Intel Loihi test chip, consists of 128 computing cores. Each core has 1,024 artificial neurons, giving the chip a total of more than 130,000 neurons and 130 million synaptic connections⁸⁰.

October 15, 2017

Yoshua Bengio's recent paper describing a consciousness prior [25] relating to Robert Van Gulick's pragmatic, qualia-free characterization of consciousness presented in his 2014 entry in the Stanford Encyclopedia of Philosophy. Yoshua's consciousness prior corresponds to a "regularization term which encourages the top-level representation (meant to be at the most abstract level) to be such that when a sparse attention mechanism focuses on a few elements of the state representation (factors, variables or concepts, i.e. a few axes or dimensions in the representation space), that small set of variables of which the agent is aware at a given moment can be combined to make a useful statement about reality or usefully condition an action or policy."

The title of this document is "The Fly's Epiphany in the Third Lane", referring to my epiphany swimming at 5am in the 3rd lane of a darkened pool on Google's Quad campus: the machinery that makes me conscious is the same a fly relies on to separate self from not self. This realization is key to awakening in the tradition of Therevada insight meditation and is well known for the disturbing impact it can have on practitioners ... it is often called the "dark side of dharma". The rest are stories — deceits — we tell ourselves and circuitry needed to compensate for the fact that mental events — pain we "feel" in anticipation of a needle prick — can occur before the physical events that "cause" them⁸¹.

As an exercise, search for large collections of scripts — including Apple Script, Java Script, Chrome Development Tools, etc — for controlling Chrome and related applications like Preview. Think about the mesoscale modeling technique for motif finding as applied to the current apprentice concept. Think about the evolution of structure of the sonata in the work of Bach, Haydn, Mozart and Beethoven. Think about the evolution of artificial neural network architecture. Think about motif finding using the entire animal kingdom as your parts warehouse. Think about how recurrent networks are being reimagined as attentional networks and finding purchase. Think about consciousness stripped of philosophical of conundrums and deployed in a wide range of systems. Can you think of a killer application for consciousness?

P.S. Don't forget to check on Sebastian Seung's talk in MTV last Friday and Michael Gygli's talk on structured output prediction by optimizing a deep value network to precisely estimate the task loss on different output configurations for a given input⁸² at the ZRH office yesterday morning.

October 12, 2017

In my meeting with Rahul yesterday evening, we talked about the idea of representations embedded in code that actively collect data that might be used to improve the content or quality of those representations for subsequent invocations of the code. For example, consider a variable in embedded JavaScript or server-side PHP whose value represents a measurement or statistic such as whether the viewer prefers ad videos to appear conventionally within current page, in a separate tab or in a floating popup.

The code could randomly ask users and record their choices or use the existing statistics to select the most likely option based on previously collected data. Preferences could be updated individually or the new data migrated to the server to update the preferences for all users. The key feature is that the programmer doesn't have to write any of the code for providing this service, some overarching system — could be the editor, compiler or server software — has learned to initiate and rescind such interventions automatically.

The supplementary code could even be smart enough to deploy different interventions for different demographics and tailor deployment to suit, managing all aspects of the data collection process including the termination of collection and elimination of the supplementary code on the basis of mutable global system parameters, e.g., relating to latency, security or privacy. When I mentioned this to Rahul, he mentioned that in Jay's vision each smart variable has the ability to take some actions behind the scenes each time it is invoked.

These actions could be based on the context it's given so that it can actually do some exploration / exploitation tradeoffs in order to build a better behind-the-scenes model. The expectation as suggested above is that the predicted variable becomes more accurate and useful over time and is thus capable of adjusting itself to a non-stationary environment.

I mentioned to Christian Szegedy that he might want to check out papers by Sanjeev Arora [9] and Yoshua Bengio [25] that served me well as exercises in thinking about distributed representations and high-dimensional embedding spaces for logic formulas and code snippets. I told him not to let the word "consciousness" in the title scare him off from reading Yoshua's paper⁸³, but Christian obviously did not mind since he'd already read and absorbed these papers.

Following up on my earlier comments relating to tabula-rasa learning and attention schema, think about the seemingly huge repositories of episodic memory we have collected over time ... even more so the physical and mental impressions we've been accumulating since we were infants ... much of this information is no longer directly accessible and unlikely to be dredged up and identified in the process of routine retrieval ... perhaps they correspond to the first divisible snippets of sound, punctate physical sensations like a burn or cut, once we've experienced these we don't have to store multiple copies.

They reside in primary sensory memory, are related to one another through diverse sensory (body) maps such as the retina, provide the components for myriad composite representations, are linked to external physical space, related physical objects like kitchen knives and camp stoves and phenomena like sunlight and hot springs ... the number of sensations we experience over a lifetime and usefully differentiate between is likely relative small ... the point being that memory is compositional and hierarchical ... cutting my finger while preparing dinner this evening may recall a physical sensation that I had when I was three, but there is no memory of that experience ... if I mention the incident to my wife, I may draw upon that early sensation to describe the feeling, but I don't need to allocate separate storage for every memory of a similar painful experiences.

October 11, 2017

I don't know what the corporeal analog of running an infinite loop would feel like, but as far as the apprentice is concerned, it forks a process, waits a while, waits longer ... how might it learn to expect that the process will exit in a reasonable amount of time and what constitutes a reasonable amount of time. Having some idea of time passing is important. Perhaps one of its prosthetic senses might amount to a dynamic process-status — the Unix PS command — display. The apprentice has enough control over its body that it can terminate the loop by issuing break statements to the interpreter or kill -s KILL / TERM directly to the kernel. Packaging these skills as discrete actions is important.

The rules or physical laws that govern the body, the design of its sensory interface, the way that the body and the brain interact with one another and the physical laws that govern the environment in which the apprentice performs actions and receives feedback. ... part of this environment includes anything that one can bring up in a browser including arbitrary JavaScript. There are also certain kinds the phenomena that manifest in the browser that are of particular interest to the apprentice, specifically the apprentice is designed to recognize patterns in code, and so, in principal, codesharing webpages should serve as a valuable resource. The JavaScript embedded in webpages could be particularly useful to learn from.

Code snippets that are recognizable on, say, Stack Overflow serve as analogies allowing the apprentice to conjure up, as it were, fragments within its internal IDE / code memory that roughly correspond to the snippet written in its native programming language. Such a skill could be developed so when it sees code written in one language it instantaneously translates the code into the native programming language that its IDE is constrained to represent. This cross-language code recognition could play a role analogous to mirror neurons in primates allowing the apprentice to easily emulate processes observed in the physical world and immediately translate them into something a format it can understand and experiment with.

It's important to realize that the apprentice is not a blank slate and the form of learning that it engages in is not tabula rasa learning. In order to learn to program, the system does not have to first learn program syntax, conditionals, loops or how to compile and run code — the ability to do such things are innate and unavoidable in the sense that a baby can't help but swallow, smile and reach out to touch things in its field of view. The apprentice will appear to behave like a natural programmer right out of the box even though its initial efforts will likely be clumsy and seemingly random, again, like a baby communicating by cooing and gurgling or moving itself by rocking and crawling.

Touching briefly on one last programmer's apprentice topic as a preview for the next log entry, I am just starting to think more concretely about attention schema, specifically, (i) how do such representations come in to being, (ii) how are they represented in memory, (iii) do we have separate schema for each person we encounter, (iv) is there a special schema representing the self / apprentice, and (v) might the apprentice left to its own devices engage in a form of animistic thinking such that humans, webpages and programs are different subtypes of the same type of entity. I intend to start by thinking about how (v) might be addressed using high-dimensional semantic embedding spaces.

October 10, 2017

In the previous log entry, I neglected to say anything about how the programmer's apprentice is grounded or embodied, that is to say, the interface between the strictly cognitive apparatus of the apprentice⁸⁴ and the physical laws governing the environment the apprentice inhabits and from which originates the only source of information the apprentice can observe and subject to experiment so as to construct a foundation for deciding how to act.

I think of the apprentice as embedded in a layer of specialized hardware and software that provides the interface between its highly adaptive cognitive apparatus and its external environment filtered through various sensory apparatus. Its peripheral sensory apparatus corresponds to the camera, microphone and touchpad on the computer through which the user interacts with the system. However, the system also has internal machinery — specialized prosthetics — each with its own specialized sensory interface — depending on whom you ask, the apprentice either is or lives in a virtual world⁸⁵.

Human bodies are essentially highly constrained mechanical artifacts. Our fingers, toes, hands and feet can only flex within the constraints of the muscles and tendons that bind them together. We can't rotate our heads 360 degrees, nor can we bend our arms and legs to assume arbitrary contortions. The apprentice doesn't have conventional articulated limbs and body parts. It can't interface with the physical world as we do, but it is just as intimately tied to the computer on which it runs⁸⁶.

The causal chain that leads from the code the assistant generates and its internal programming interface, through the tool-chain of compilers and interpreters allowing it to execute that code, all the way to the process running the shared instance of the Chrome browser are best thought of as part of the apprentice’s body. In modifying the contents of its internal programming interface, the apprentice has fixed, finite set of actions it can perform thus constraining its behavior and simplifying the process of learning RL policies.

The constraints we impose on how the system is able to generate and modify existing code limit its degrees of freedom in the same way that our joints and muscles limit our movement. Just as you cannot achieve any arbitrary pose, neither can the apprentice alter the contents of the internal programming interface so as to introduce arbitrary syntax errors — the apprentice is basically constrained by its hard-wired programming interface to produce syntactically correct and perhaps even semantically rationalized code. It will always compile!

This doesn't imply that executing the code will not produce an error, only that the error will not be due to the result of incorrect syntax. Internally, some errors will manifest in STDERR messages, others might manifest as subprocesses that doesn't return, hang or dump core. We can, of course, add additional constraints, either temporarily to facilitate training or permanently simply to make life a little bit easier for the apprentice to get up to speed quickly.

In addition to scanning STDOUT and STDERR, the apprentice can also get feedback from looking at the results displayed in the Chrome browser that are generated by programs the apprentice writes and runs. In this way, the apprentice is constrained both by the body it inhabits and the environment with which it interacts. We should design the apprentice with instincts that facilitate dealing with the laws of physics as manifest in interacting with a conventional computer. It's a code monkey.

October 9, 2017

In the original Zinn demo I prepared for the VP of Research, I hinted at simple automated programming by providing the system with templates it could use to run searches, but "programming" in this case amounted to filling in the slots in such templates. Not to denigrate such an approach given it is a design strategy used by novices and experts alike. In software engineering, the idea of a general reusable solution to a commonly occurring problem within a given context is called a design pattern and has become a standard approach in programming practice.

A simpler alternative is to find an existing fragment of code called a snippet that performs a function similar to what you want to implement and then adapt the snippet to suit your specific requirements. Extracting a suitable snippet from a larger program is often necessary and requires skill in and of itself. Extracting a snippet written in one programming language for application in another language is another common expedient employed by both beginner and advanced programmers.

As a first approximation, I imagine the system starting with a repertoire of general-purpose templates that it can specialize, augment and apply recursively. Here's a contrived dialog intended as an example of what one might expect of the proposed programmer's apprentice system:

USER: Write a script that searches for all recent articles on deep neural networks that can write programs.
SYSTEM: Should I rely solely on Google Search or do you have some other search engines or databases in mind?

USER: Using Google Search should work fine for a start.

SYSTEM: What do you mean by "recent"?

USER: Anything published within the last five years should work for my immediate purposes.

SYSTEM: What keywords should I use to restrict attention to "deep neural networks"?

USER: Just that, "deep neural networks", if that's too broad, we'll narrow the search to specific acronyms.

SYSTEM: Do you want me to post-process the results by sorting by recency or some measure of journal quality?

USER: That's not necessary. Perhaps we'll want to revisit this later if there are too many relevant papers.

SYSTEM: [...]

Alternatively and more ambitiously, the system could start with examples drawn from other sources — see recent work on automatic patch generation published here and here and related work on coding by example as in leveraging code-sharing resources like Stack Overflow published here ⁸⁷. Might also want to check out the video of Alex Graves' presentation — in the first half — on DNC at ICLR and the presentation — in the second half — by two Berkeley students talking about their paper on Neural Programming Architectures that won 2017 ICLR Best Paper award.

I spent the morning trying to dig up papers on the biological, psychological and computational / technological basis for theory-of-mind (ToM) reasoning. The paper by Pynadath, Rosenbloom and Marsella on multi-agent reinforcement learning is interesting for its application and method of learning another agent's reward function using what they call inverse reinforcement learning — you can check David Pynadath's presentation at the Artificial General Intelligence Conference in 2014 here. Cosmides and Tooby are evolutionary psychologists well known for their work relating to modularity of mind and ToM reasoning and worth a listen, but, so far, I don't see how their work can help given our somewhat limited aspirations in terms of leveraging those big ideas⁸⁸.

October 7, 2017

Figure 29 includes a block diagram showing a first approximation of the proposed system architecture along with graphics illustrating the various inputs and outputs of the overall system. The block diagram shown in graphic A illustrates the main characteristics of the model, including its multilayer hierarchical structure, the use of LSTM components for all layers in the hierarchy and the characteristic pattern of reentrant connections providing both feedforward and feedback between LSTM components in all layers as described in Stanislas Dehaene’s work [68].

Graphic B conveys the fact that the system receives audio input from both the user and his or her ambient environment including, for example, sounds produced by the computer running an instance of the Chrome browser. The system can provide audio output by either controlling the browser or by using a separate speaker to generate natural language responses. In this entry, I'll experiment with using the terms "user" and "programmer" and the terms "system" and "apprentice" interchangeably, and avoid using the longer descriptive title "programmer's apprentice" altogether. The term "programmer" is reserved for unambiguous cases. For simplicity, when I refer to "voice", unless indicated otherwise, I am assuming that the user is wearing a noise canceling microphone and that high-quality voice-recognition software produces a reasonably accurate — though not perfect — transcription of the user's utterances.

Graphic C shows an LCD screen displaying the Chrome browser running on a computer that the apprentice can log into and interact with the user by manipulating the browser interface. The apprentice has access to the LCD screen, both the raw pixels and everything relating to the browser content including source markup, embedded javascript and browsing history. Graphic D represents some of the signals the apprentice can use to evaluate its attempts to translate user requests into programs that satisfy those requests. In addition to the user's commentary and feedback, the apprentice has access to all the signals produced by compilers, interpreters and the operating system that a human programmer would have available in debugging code.

Graphic E corresponds to an internal workspace that roughly corresponds to an editor or IDE window that a programmer might use in composing a program. The main difference is that the commands and key strokes that apprentice can enter in order to modify the contents of this workspace are limited — the interface maintains a syntactically valid representation of a program at all times. A collection of macros make it easy to insert or append boilerplate expressions that are essentially templates with slots that can recursively contain additional expressions. Interface attempts to maintain a human readable format so the user can provide useful feedback to the apprentice in teaching it to automatically translate the users declarative representations into executable code.

Figure 29: This figure provides a very rough block diagram showing a first approximation to a system architecture along with inset graphics illustrating the various inputs and outputs of the overall system. The block diagram is shown in the insert labeled A and the four components labeled B, C, D and E are described in more detail the body of the text.

The internal workspace is implemented using some variant of what Alex Graves and his collaborators at DeepMind call a Neural Turing Machine [106] (NTM) or Differentiable Neural Computer [105] (DNC) and Jason Weston and his colleagues at FaceBook refer to as Memory Networks [299]. Additional addressing modes and attentional mechanisms [119, 206, 207] and technology for efficiently accessing and searching large-scale episodic memory, e.g., Pritzel et al [235] Neural Episodic Control (NEC), are likely to be applicable as well.

The details of implementing executive control and some form of hierarchical planning may seem like a huge stretch but, in fact, there are precedents in NN language applications and it is not surprising given its importance that NN architectures can be designed to search efficiently [38]. Peter Abbeel's work on value iteration networks [277, 120] mentioned earlier offers one promising approach and work on language generation including syntax-tree-walking neural networks and even the simple expedient of using beam search to collapse the thought cloud wave-function to generate a coherent utterance often work well in practice. Think about how one could use value iteration on a tree of execution traces to improve policies that learn to write simple helper code. The recent work by Matej Balog et al [16] on a recurrent DNN method for learning to write programs also looks promising.

October 5, 2017

The last few days were spent reviewing previous research on recurrent neural networks and, in particular, variants of the LSTM (Long Short-Term Memory) model of Sepp Hochreiter and Jürgen Schmidhuber [143, 142]. Currently consider LSTM and embedding-space models are our best bet for implementing modular functional units analogous to the primary and secondary (association) areas responsible for sensory and motor function in the cortex and assembling these units into hierarchical architectures that encode complex representations. I am also exploring possible RNN architectures for representing attentional schema that encapsulate collections of capabilities and their enabling conditions and are postulated to model both the self and other entities in social animals [107, 117, 114].

I reviewed work with Shalini Ghosh from October 2015 through May 2016 on hierarchical document representations in thinking about how we might construct a program / script schema and have included references to Shalini's paper with Oriol Vinyals, Brian Strope, Scott Roy, Larry Heck and me [99] as well as additional related papers on hierarchical document and video models using LSTM and RNN technologies [307, 279, 177, 224] BIB. I've also included relevant excerpts from my research notes on hierarchical LSTM, contextual LSTM and slot filling DNN models.

Figure 28 has been augmented to include a description of the ventral visual stream as an example of the connection patterns described in Figure 27 relating to the attentional machinery at the foundation of Dehaene's theory of consciousness [68]. In other correspondence, I've exchanged notes on the prospects for building human-level AI⁸⁹, research and development tradeoffs in a commercial context⁹⁰ and the under-utilized ability of pattern-seeking brains to facilitate the design of neural prosthetics⁹¹.

October 3, 2017

I've been collecting resources and background from neuroanatomy to neuroprosthetics. Most of my notes are in the form of whiteboard photos. This entry is really nothing more than a temporary placeholder for lecture materials. Google Image Search provided a number of useful graphics showing the primary functional regions and cortical landmarks. I've included a sample in Figure 28. I found a basic primer (PDF) on the anatomy of the adult human brain [139] from the Mayfield Clinic founded by my uncle Dr. Frank Mayfield.

Figure 28: A sample of the medical-illustrator's craft in rendering the anatomy of the human brain so as to highlight functionally relevant regions like Broca's and Wernicke's areas involved in speech production and language understanding, and to aid in explaining how information is passed between functional areas in the process of generating consciousness. Graphic A highlights the primary and (secondary) association areas for the sensory and motor cortex. Graphic B provides a rotated version of Graphic A including the cerebellum and brainstem for registration purposes. Graphic C illustrates ventral visual pathway feeding forward (blue) from the lateral geniculate nucleus in the thalamus, leading through retinotopically mapped regions in the striate cortex, extending into sensory association areas and ending up in the prefrontal cortex, before reversing direction and feeding back (red) along the same paths. Similar pathways exist for the dorsal visual stream and for the other sensory and motor pathways. The ventral and dorsal pathways refer to the two-streams hypothesis that argues humans possess two distinct pathways in each of the separate visual and auditory systems.

I've already circulated the short video by Michael Graziano explaining the attention schema (MP4) and now I've added a bibliography of books and articles on executive control [210, 222, 221, 219, 14, 69, 74, 72, 166]. I've included a somewhat dated resource [219] (PDF) on the role of computational models in cognitive neuroscience that might help in reading the section entitled "Simulating a Conscious Ignition" in Chapter 5 of Dehaene [68] EPUB and the related Figure 27.

Automated planning might seem like GOFAI but ideas from hierarchical planning have been adopted by cognitive neuroscientists to explain the mechanism of executive control in consciousness and by linguists in explaining discourse management. The CMU RavenClaw dialog management system exploited this connection when Dan Bohus and Alex Rudnicky implemented error handling using a hierarchical planner [30, 29]. I followed their lead in developing the prototype dialog system mentioned earlier. While ideas from hierarchical planning persist [310], modern implementations combine NN and RL [180] technologies, with recent work by Pieter Abbeel's group on value-iteration networks [120, 277] of particular interest.

We are drawing on a large number of concepts from several different disciplines in order develop PA — conveniently, PA serves an an acronym for either Personal Assistant and Programmer's Apprentice — technology that leverages ToM reasoning — which I'm using as catchall for attention-schema, conscious-awareness, theory-of-mind reasoning, etc. Those disciplines include artificial intelligence, cognitive science, machine learning and neuroscience. Rather than complicating the situation — apart from conflicting terminilogy which needs sorting out, these thread constrain the task of coming up with good starting points for system architectures and provide data illustrating integrated system behavior:

Representative concepts from cognitive neuroscience include: attention, awareness, body and attention schemas, cognitive dissonance, consciousness, decision making, mirror neurons, planning, policy iteration, predictive coding, reinforcement learning, theory of mind reasoning, and von Economo neurons.
Functional architectures from neural networks provide: deep neural networks, encoder-decoder networks, generative adversarial networks, graph convolutional networks, hierarchical models, long-short term memory, multi-level convolutional networks, recurrent neural networks, and value iteration networks.
Applications of higher-order cognitive machinery: automated collaborative planning, command and control, cognitive and emotional coaches, conversational agents, digital amanuensis, humanoid robots in healthcare, personal assistants, robotic surgical prosthetics and teachable systems for often repeated tasks.

The Programmer's Apprentice application was partially anticipated in a prototype I developed leveraging a Python library for writing code using voice alone developed by Tavis Rudd (MP4). Rudd's system relied on a large set of voice commands that were difficult to remember and hard to deploy given ambient noise levels and natural variation in vocalization. Fortunately, voice recognition has improved markedly in the last few years and I'm expecting the learning component of the Programmer's Apprentice system to mitigate misunderstanding due to the user misremembering the exact name of a command. For example, the user — who created the script associated with the command in the first place — could have the system assist in recall by describing what the command does or how it was implemented.

September 29, 2017

The more I think about it the less I'm enamored of using attention schemas and theory-of-mind reasoning to enable Google assistant to handle multiple users as in the case of a family that constantly interacts with the assistant using their individual phones and GA enabled appliances such as in the Alexa product that Amazon has marketed so well. First of all, this is a hard problem to begin with and difficult to market without raising very high expectations. Second, there are simpler techniques based on voice recognition and acoustic signal processing that would handle most of the common use cases.

I'd like to reconsider the idea of a digital amanuensis for an author or smart prosthetic for a typing-impaired programmer with carpal tunnel syndrome — see here. In this use case, we could get by with two attentional schemas one for the disabled programmer and one for the programmer’s apprentice, recalling Rich and Waters' MIT project of the same name [244]. It would definitely exercise an enhanced capability to recover from misunderstandings — both semantic and syntactic — and offers a natural framework in which to explore continuous dialogue in a narrow task domain with the added promise of finding a clear metric for evaluating performance.

Returning to Dehaene's Consciousness and the Brain, Chapter 4 describes how Dehaene and his colleagues discovered what they believe to be the functional correlates of conscious activity. You might want to look at the section entitled "Igniting the Conscious Brain" that draws on Donald Hebb's analysis [135] of collective assemblies of neurons, later reconceived in terms of dynamical systems theory [39] summarized here⁹², and a later section entitled "Decoding a Conscious Thought" summarized here⁹³.

I found the descriptions useful in imagining how such patterns of activity might be generated in artificial neural network architectures, and answering questions posed at the outset of Chapter 5, "Why should late neuronal firing, cortical ignition, and brain-scale synchrony ever create a subjective state of mind? How do these brain events [...] elicit a mental experience?".

September 27, 2017

According to Stanislas Dehaene the machinery for consciousness is similar to the machinery for visual attention. If this is true, what are the top-down and bottom-up signals in each case. There is both the issue of weighing different alternative for an individual, somewhat independent choice, e.g., should I wear shorts or a raincoat today, and how we integrate all the separate choices to form a composite plan, e.g., to get to work today. Dehaene's work on executive control and planning, e.g., [72, 69] is worth reviewing as a supplement to Consciousness and the Brain.

Dehaene believes there are multiple levels of consciousness where the lower levels involve primarily the ability to broadcast and integrate information. The sense of self as something we can reflect on and ascribe knowledge to as well as our ability to project conscious behavior on others as part of social decision making are higher level than the elephant uses when deciding her best option for finding drinking water taking into consideration distance to the source of the water, temperature, predators, age and health of other elephants in the herd. However, the signature pattern of re-entrant connections found throughout the hierarchy of thalamic nuclei, primary sensory, secondary association and so-called executive-level prefrontal substructures demands a computational explanation.

I'm not going to attempt to summarize Dehaene's book in detail. Much of it consists of experiments that support his hypothesis and I recommend you scan the entire book if you can. Chapter 5 provides the most detail relevant to implementing his theory, and almost all the remaining text and figures in this log entry were excerpted from this chapter. You can read the entire raw text sans figures here or a complete epub version here. Dehaene summarizes his basic "global neuronal workspace" hypothesis in Chapter 5 as follows:

When we say that we are aware of a certain piece of information, what we mean is just this: the information has entered into a specific storage area that makes it available to the rest of the brain. Among the millions of mental representations that constantly crisscross our brains in an unconscious manner, one is selected because of its relevance to our present goals. Consciousness makes it globally available to all our high-level decision systems. ( Figure 24 ) We possess a mental router, an evolved architecture for extracting relevant information and dispatching it.
The psychologist Bernard Baars calls it a "global workspace": an internal system, detached from the outside world, that allows us to freely entertain our private mental images and to spread them across the mind’s vast array of specialized processors. [...] This idea is not new — it dates back to the inception of artificial intelligence, when researchers proposed that subprograms would exchange data via a shared "blackboard," a common data structure similar to the "clipboard" in a personal computer. The conscious workspace is the clipboard of the mind.

Figure 24: Global neuronal workspace theory proposes that what we experience as consciousness is the global sharing of information. The brain contains dozens of local processors (represented by circles), each specialized for one type of operation. A specific communication system, the "global workspace," allows them to flexibly share information. At any given moment, the workspace selects a subset of processors, establishes a coherent representation of the information they encode, holds it in mind for an arbitrary duration, and disseminates it back to virtually any of the other processors. Whenever a piece of information accesses the workspace, it becomes conscious.

The wiring pattern of the primate brain is anything but uniform: [...] Importantly, not all brain areas are equally well connected. Sensory regions, such as the primary visual area V1, tend to be choosy and to establish only a small set of connections, primarily with their neighbors. Early visual regions are arranged in a coarse hierarchy: area V1 speaks primarily to V2, which in turns speaks to V3 and V4, and so on. As a result, early visual operations are functionally encapsulated: visual neurons initially receive only a small fraction of the retinal input and process it in relative isolation, without any "awareness" of the overall picture. [...]

[...] Neurons with long-distance axons are most abundant in the prefrontal cortex, the anterior part of the brain. This region connects to many other sites in the inferior parietal lobe, the middle and anterior temporal lobe, and the anterior and posterior cingulate areas that lie on the brain’s midline. These regions have been identified as major hubs — the brain’s main interconnection centers. All are heavily connected by reciprocal projections: if area A projects to area B, then almost invariably B also sends a projection back to A ( Figure 25 ). Furthermore, long-distance connections tend to form triangles: if area A projects jointly to areas B and C, then they, in turn, are very likely to be interconnected. [...]

Figure 25: Long-distance neuronal connections may support the global neuronal workspace. The famous neuroanatomist Santiago Ramón y Cajal, who dissected the human brain in the nineteenth century, already noted how large cortical neurons, shaped like pyramids, sent their axons to very distant regions (left). We now know that these long-distance projections convey sensory information to a densely connected network of parietal, temporal, and prefrontal regions (right). A lesion in these long-distance projections may cause spatial neglect, a selective loss of visual awareness of one side of space.

[...] Pathways linking the cortex with the thalamus are especially important. The thalamus is a collection of nuclei, each of which enters into a tight loop with at least one region of the cortex and often many of them at once. Virtually all regions of the cortex that are directly interconnected also share information via a parallel information route through a deep thalamic relay. Inputs from the thalamus to the cortex also play a fundamental role in exciting the cortex and maintaining it in an "up" state of sustained activity. As we shall see, the reduced activity of the thalamus and its interconnections play a key role in coma and vegetative states, when the brain loses its mind. [...]

Figure 26: Large pyramidal neurons are adapted to the global broadcasting of conscious information, particularly in the prefrontal cortex. The whole cortex is organized in layers, and layers II and III contain the large pyramidal neurons whose long axons project to distant regions. These layers are much thicker in the prefrontal cortex than in sensory areas (above). The thickness of layers II and III roughly delineates the regions that are maximally active during conscious perception. These neurons also exhibit adaptations to the reception of global messages. Their dendritic trees (below), which receive projections from other regions, are much larger in the prefrontal cortex than in other regions. These adaptations to long-distance communication are more prominent in the human brain than in the brains of other primate species.

[...] Their dense jungle of dendrites is controlled by a family of genes that are uniquely mutated in humans. The list includes FoxP2⁹⁴, the famous gene with two mutations specific to the Homo lineage, which modulates our language networks, and whose disruption creates a massive impairment in articulation and speech. The FoxP2 family includes several genes responsible for building neurons, dendrites, axons, and synapses. In an amazing feat of genomic technology, scientists created mutant mice carrying the two human FoxP2 mutations — and sure enough, they grew pyramidal neurons with much larger, humanlike dendrites and a greater facility to learn (although they still didn’t speak). [...]

Figure 27: A computer simulation mimics the signatures of unconscious and conscious perception. Jean-Pierre Changeux and I simulated, in the computer, a subset of the many visual, parietal, and prefrontal areas that contribute to subliminal and conscious processing (above). Four hierarchical regions were linked by feed-forward and long-distance feedback connections (middle). Each simulated area comprised cortical cells that were organized in layers and connected to neurons in the thalamus. When we stimulated the network with a brief input, activation propagated from bottom to top before dying out, thus capturing the brief activation of cortical pathways during subliminal perception. A slightly longer stimulus led to global ignition: the top-down connections amplified the input and led to a second wave of long-lasting activation, thus capturing the activations observed during conscious perception.

[...] Crucially for the workspace hypothesis, Elston and DeFelipe showed that the dendrites are much larger, and the spines much more numerous, in the prefrontal cortex than in posterior regions of the brain — See Figure 26.

[...] Bernard Baars’s version of the workspace model eliminates the homunculus. The audience of the global workspace is not a little man in the head but a collection of other unconscious processors that receive a broadcast message and act upon it, each according to its own competence. Collective intelligence arises from the broad exchange of messages selected for their pertinence. This idea is not new—it dates back to the inception of artificial intelligence, when researchers proposed that subprograms would exchange data via a shared "blackboard" [287, 134, 133].

[...] During World War II, the British psychologist Donald Broadbent developed a better metaphor, borrowed from the newborn theory of information and computing. Studying airplane pilots, he realized that, even with training, they could not easily attend to two simultaneous trains of speech, one in each ear. Conscious perception, he surmised, must involve a "limited-capacity channel" — a slow bottleneck that processes only one item at a time.

[...] The neuropsychologists Michael Posner and Tim Shallice proposed that information becomes conscious whenever it is represented within this high-level regulatory system. We now know that this view cannot be quite right [...] since even a subliminal stimulus, without being seen, may partially trigger some of the inhibitory and regulatory functions of the supervisory executive system.

[...] However, conversely, any information that reaches the conscious workspace immediately becomes capable of regulating, in an extremely deep and extensive manner, all our thoughts. Executive attention is just one of the many systems that receive inputs from the global workspace. As a result, whatever we are aware of becomes available to drive our decisions and our intentional actions, giving rise to the feeling that they are "under control." Language, long-term memory, attention, and intention systems are all part of this inner circle of intercommunicating devices that exchange conscious information. [...]

[...] Because of FoxP2 and its associated gene family, each human prefrontal neuron may host fifteen thousand spines or more. This implies that it is talking to just about as many other neurons, most of them located very far away in the cortex and thalamus. This anatomical arrangement looks like the perfect adaptation to meet the challenge of collecting information anywhere in the brain and, once it has been deemed relevant enough to enter the global workspace, broadcast it back to thousands of sites.

September 25, 2017

The last entry in this log sketched a simple theory of attention roughly based on ideas from Michael Graziano [107] using the metaphor of a flashlight that illuminates selected parts of working memory. The metaphor begs the question of "who" is holding the flashlight and directing beam of light. Of course, the answer has to be "nobody", but still there has to be some mechanism and associated source of information that focuses attention.

One possible answer borrowed from models of visual attention is that some version of interestingness must be at play, where by "interestingness" we mean some property of activated neural tissue that can be simply computed, i.e., not requiring deeper recursive analysis. I suggested that "interestingness" is an emotional state vector that enables us to prioritize selection, but I don't think that's the whole story. For one thing, almost any information represented in the neural state vector is likely to have an emotional component.

To make progress it will be useful to be explicit about how we might build attentional mechanisms capable of the sort of computations we believe take place in human brains. If attention selects from among informational states encoded in the activity of neurons, then what is a good approach to modeling such states in artificial neural networks. Multi-layer neural networks representing high-dimensional state vectors in a hierarchy of embedding spaces might be a good starting point. The rest of this entry is more philosophical, but don't miss the footnotes.

Thinking about thinking is hard. Even reading the work of someone as clear and precise as Michael Graziano, the words initially seem familiar, but then inexplicably become foreign in the rarified context of a theory. They seem to tumble over one another, each fighting to see if it applies in the present circumstances. Indeed it is hard to ignore the fact that you are immersed in an attentional process⁹⁵.

What is attention?⁹⁶ What is awareness and how are attention and awareness related?⁹⁷ Is there a case in which one occurs and the other does not? How is attention related to decision making?⁹⁸ These questions are likely to arise and persist indefinitely as you attempt to solve the puzzle of consciousness. With luck you may stumble on a clue providing an answer to one question only to find that it leads to a cascade of questions. The link between attention and decision making in the last footnote may provide one such clue that works for you but there are no guarantees.

Even if you think you understand, say, visual attention in the way it is explained by Laurent Itti and Christof Koch [154, 155], you may be confused by a contemporary competing explanation of visual attention such as that proposed by Dirk Walther [292, 293], and even more confused on encountering the more general characterization presented in Graziano's book. All three rely on the same words but they have different technical meanings in each theory. All I can say by way of consolation, is that this sort of uncomfortable confusion is, in my experience, unavoidable in understanding anything fundamentally new about nature. It is worth reflecting on the advantages and disadvantages of a medium of exchanging information that has such a high degree of apparent ambiguity.

September 23, 2017

I was chopping vegetables this evening for a big quinoa salad to last us for at least five meals this coming week and I was thinking about the last entry in the research log that I appended to my "what'll I do if" white paper. The last paragraph suggested that we could control our behavior by emotionally tagging thoughts conjured up from memory. I started thinking about Ray Kurzweil's interview with Martin Minsky in which Minsky talked about his society of mind model, and I started thinking about your comment that many of your internal conversations seem to be driven by habit.

The word "habit" has a negative connotation in colloquial speech but what I got from your comment is that we can form a habit for doing just about anything we like using an appropriate carrot or stick. Skinner's rats and Pavlov's dog provide ample demonstrations of this characteristic stimulus-response learning and of course humans are well known for their ability to turn almost any stimulus into a craving or an aversion, not to mention all of the many odd and perversely twisted abstractions we invent to drive human behavior.

In particular, solving logical puzzles, proving mathematical theorems, using simple antique marine navigation instruments to make astronomical predictions, rotating complex geometric shapes in our heads, finding weird symmetries in complex equations, teaching someone to read a story or add two numbers, solving a murder in a mystery novel, correctly spelling a difficult word, mastering a complicated ballet step, even creating grotesque parodies of human beauty, all provide motivation and produce emotional responses. We can harness this ability to free ourselves from the instincts we were born with or adopted from our peers.

We can also borrow these habits from others. We can adopt aesthetic preferences and ways of appreciating things to enlarge our ability to label ideas and propositions so as to motivate our underlying attentional system to elevate or silence items vying for our attention in working memory. We can invoke both facts and rules and then use our appreciation of a rule of inference to arrive at conclusions that are logically valid assuming both the facts and rules are valid. We can apply the scientific method to evaluate hypotheses and moderate our enthusiasm based on the evidence. We can use our appreciation of knowledge handed down from books that others whom we trust have validated by their careful study and appropriately skeptical analyses. Such knowledge is analogous to a software library that could open up entire disciplines to our appreciation and application.

September 21, 2017

The impression of the lightest touch on our skin or the disturbance of a single hair corresponds to the activation of one or more neurons in the central nervous system that provide our first opportunity to apprehend a sensory event originating in the peripheral nervous system and producing a signal that is propagated along nerve fibers terminating in the sensory cortex⁹⁹.

In the Buddhist tradition of insight practice, all things are said to be impermanent. To realize the impermanence of all things insofar as we can experience them, practitioners focus their attention on the arising and passing of sensations and learn to count individual sensations and divide what appear to be sustained sensations into sequences of shorter ones.¹⁰⁰

The apprehension of touching an object is complicated by the fact our brains adjust the perceived time of the touch causing the sensation with the time of our feeling the touch as registered by the activity of neurons in the brain, i.e., the brain makes sure what we see doesn't appear to happen before we feel it and so causality does not appear to be violated¹⁰¹.

What about sustained pressure on our skin? Not all neural activity available for conscious access are punctate in the sense that they last no longer than the length of the refractory period of a neuron. Visual stimuli impinging on the retina rapidly propagate through the visual cortex resulting in sustained states of varying duration depending on gaze and independently driven neural activity¹⁰².

It may be, however, that our conscious apprehension of such sustained states is essentially punctate, implying that, while we imagine ourselves sustaining an image of a perfectly blue sky, in fact our eyes are performing dozens of microsacades per second and our minds conjuring up all sorts of related memories. It might seem conscious access would manifest refresh anomalies when reporting on a partially reconstructed memory, but that doesn't appear to happen due to the function of attention.

You can imagine that blue sky as arising from a collection of discrete sensations or as a reconstructed, idealized and sustained memory of perfectly clear blue sky, and still conclude that the process of consciously attending to such an apprehension is constructed and sustained in conscious awareness through a series of sensations.

The key revelation is that when conscious experience is grounded in sensation, the number of such sensations is finite and they are ephemeral, manifestly impermanent, rapidly coming into being and passing away, and thus our entire experience is impermanent and fleeting. Moreover, when you look around, that's all there is. There is no room for a homumcular "you" sitting alone in a Cartesian theater in your skull observing events unfolding.

In traditional impermanence training, you break each sensation into smaller sensations and count them practicing until you can reduce each sensation to a succession of indivisible sensational quanta. Awakening occurs when it becomes apparent that that's all there is. The answer to the quintessential question, "Where am I?", is that everything is impermanent and that "you" are no more substantial than a ripple in the water that disturbs the surface of a pond or a gust of wind that rustles the leaves in the forest¹⁰³.

If there is no "you", then how do you exert any control over your thoughts and activities? Why aren't you a zombie or a robot that can only respond to its environment by executing immutable canned subroutines corresponding to instincts that are programmed to execute when presented with specific patterns of stimulus arising in your immediate environment?

First of all, consider the role of consciousness in directing your attention. The simplest metaphor is that of a flashlight that can be directed to highlight any of the thoughts that are currently active as a consequence of observing the world around you and having those observations activate parts of your brain that are in some way relevant to what you observe. Of course the flashlight has to be directed by something and that something can't invoke a recursive viewpoint, i.e., homunculus. Suppose instead the mind applies a general-purpose attentional mechanism that employs some criterion of relevance or emotional significance to direct the beam of the flashlight¹⁰⁴.

Moreover suppose that when the beam illuminates some portion of your temporary workspace you, react to the illuminated recalled memories as you would any memory by imprinting your current emotional state or rather your current emotional state modified by the additional information revealed by the illumination of the flashlight. Since emotion and pattern determine what memories are recalled and how they are reimagined by treating them exactly as if they were new experiences to be emotionally and practically considered, the new emotional imprint modifies their behavior so the next time you recall them they will perform differently.

The next step is to figure out how using this mechanism of conscious recall and emotional recoding, we could actually reprogram ourselves in such a way to achieve some degree of autonomy and independence from our initial or socially imprinted programming. You can train yourself not to attend to something. You can channel your craving, curb your attachment or moderate your aversion, but you can't just change your mind; you can't arbitrarily move the flashlight. You can, however, write a program / create a habit so that if a pattern or choice presents itself you will take the opportunity to exercise the habit. Think about how to cultivate a pattern of behavior that captures your subtle and not-so-subtle aspirations and intentions.

September 19, 2017

I was thinking about John Platt's comment that perhaps neuroscience doesn't have much to tell us about building better AI or ML, and specifically about whether what we have learned about any one of consciousness, embodied spatial reasoning, emotions, self-awareness or theory-of-mind reasoning is enough to emulate the related behavior insofar as we want or need to build systems that employ such facilities.

Language is, I think, a special case. Some argue that language comprehension is closely related to a form of tagging that occurs in scene comprehension. While language certainly has social, signaling and collective planning components, each of these applications of language also involve concept identification, sophisticated relational thinking and combinatorial concept formation and require some form of self localization¹⁰⁵.

To take a philosophically thorny idea like consciousness, it's not clear to me that Koch and Crick's work [164, 163, 51, 50] on the neural correlates of consciousness is of the same practical value as Dehaene's and Graziano's theories of consciousness [68, 107]. However, my assessment assumes that consciousness appeared late evolutionarily speaking. If flies are interestingly self-conscious and I'm not convinced they are not, then all bets are off as far as I'm concerned.

To what extent is consciousness something that can be built on any architecture that has at least the cognitive capabilities of a mammal, or perhaps much older species. Maybe you don't need special-purpose hardware and so the neural correlates you are looking for are simply the ephermeral functional traces of biological "software" running on general-purpose hardware. On the other hand, if you look for the "neural correlates" of process & thread scheduling on a modern VLSI processor chip, you may indeed find dedicated hardware serving in this capacity [123].

Contrast the putative neural correlates of consciousness with other mysteries of the brain related to conscious and social behavior such as mirror and Von Economo neurons (VENs) (spindle cells), microglia cells now implicated autism [281], and the presumed core of our emotional circuitry in the functionally-diverse area (still) called the limbic system. Estimates of the evolutionary time course of these circuits vary wildly, some apparently appearing quite recently¹⁰⁶, and others such as hippocampal place cells apparently quite ancient¹⁰⁷.

A lot of recent theories of subconscious cognition are essentially variants of Oliver Selfridge's¹⁰⁸ Pandemonium model [257] or Marvin Minsky's Society of Mind theory [205]. Do those models provide enough of a hint to build smart AI systems? Is consciousness and ToM reasoning as complicated as some seem to believe or is it the case that, if we can build "simply-savant and socially-safe" systems sophisticated enough to act independently and survive in the same environment we evolved in, then perhaps consciousness and ToM thinking will fall out naturally in the process of such systems interacting with that environment, with or without intervention from or direct involvement with humans.

September 17, 2017

Document Cache and Bibliography

I've temporarily settled on three theories relating to consciousness and related cognitive capacities articulated in three relatively recent books: The first book features a refinement and extension of the global-workspace theory (GWT) of Bernard Baars supported by extensive neurological studies using an array of techniques including evoked potentials, EEG, fMRI, single-cell recordings and magnetic stimulation, Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts by Stanislas Dehaene [68] — see this twenty-minute video for insightful comments you can take to the bank, i.e., translate into systems, and a succinct summary of his GWT theory¹⁰⁹.

The second book covers a new theory focusing on the interactions involving consciousness, social behavior and ToM reasoning, Consciousness and the Social Brain by Michael Graziano [109] — summarized in this series of articles [112, 109, 113, 111, 110, 108] appearing in Atlantic. Graziano doesn't get near the attention that Dan Dennett does, perhaps due to his clear, dry delivery that I view as a great improvement over those philosophers and cognitive scientists whose theories he challenges and only grudgingly acknowledges in a half-hearted attempt to adhere to the — some would say exaggerated — scholarly traditions of philosophy¹¹⁰.

The third book presents a new theory of emotion that emphasizes how interoception — responsible for monitoring our internal processes — generates the raw data that is subsequently filtered through our past experiences and learned concepts, How Emotions Are Made: The Secret Life of the Brain by Lisa Feldman Barrett [19]. Barrett's Theory of Constructed Emotion is a good example of what happens when cognitive science comes head-to-head with neurobiology, and the conversation between Robert Wright and Barrett in this interview underscores the confusion between conventional accounts of emotion and language relating to emotion and what we have observed in the brain. Her work is summarized for a lay audience in this article in Wired and this interview on NPR.

I created a cache that includes journal papers from the research labs of Barrett [19, 40, 216, 217, 162, 144, 20, 21], Dehaene [291, 167, 75, 72, 74, 69, 70, 68, 67, 66, 71] and, soon to be added, Graziano, as well as DRM-free copies of the above-mentioned three books in AZW, EPUB and TEXT formats to simplify markup and search. This material was collected to be used in the event that the functional modeling effort is not allocated sufficient staff to build a suitable team. In that event, I plan to discuss the idea of developing applications that would benefit from a system capable of self-awareness and ToM reasoning. Example applications include versions of the Google Assistant capable of rich dialog, recovering from misunderstanding and exhibiting limited empathy.

I'm also interested in applications that employ a scripting language and ToM reasoning to handle complex queries involving multiple HCI exchanges and the inevitable misunderstandings that will arise in even the simplest conversations. Such a scripting language would be especially useful in developing digital assistants for disabled persons, including programmers who can't use a keyboard or easily read from a display, and who, in order to remain competitively productive, need to build a custom set of verbally-initiated macros to apply both common preferred-language-specific programming practices and idiosyncratic coding strategies.

The application I imagine would enable a disabled programmer to write and debug such macros. Similar applications would be useful for surgeons and mechanics needing hands-free adaptation and execution of complex, multi-step commands. I believe that debugging short scripts is no more difficult than debugging simple misunderstandings, though believing that to be true is no reason for celebration or complacency. These examples will take effort to refine into more compelling use cases, but I am convinced that, given the enhanced function enabled by simple ToM capabilities, we can develop powerful personal assistants of substantial economic and economic value.

There is some danger of placing ToM reasoning on a pedestal since some will believe that a capability we imagine to be unique to humans will be extraordinarily difficult to engineer and unnecessary for such simple applications. I expect the exact opposite, i.e., we will find that rudimentary ToM reasoning is both necessary for such applications and relatively easy to engineer once we set about trying to do so¹¹¹.

September 15, 2017

Background and Current Thinking

In response to my email on Yuval Noah Harari's Sapiens and Homo Deus [128, 129], Dan Dennett mentioned his contribution to a new anthology on Norbert Wiener and the future of AI edited by John Brockman. I knew that Brockman created the Edge and initiated and served as editor / curator of several of its best known "lists" including the classic What Scientific Concept Would Improve Everybody's Cognitive Toolkit? — also published as "This Will Make You Smarter: A New Way to Think About Thinking". What struck me about Brockman's writing and editing is his breadth of interest and ability to redistribute ideas so they usefully propagate across social and intellectual boundaries.

I had written Dennett suggesting that if we took the best research attempting to identify the neural and cognitive correlates of self-awareness, conscious attention, language and abstract concept learning, ToM reasoning (including cheating and lying) and combined this information with the growing library of artificial neural network architecture, e.g., attentional mechanisms, convolutional primary sensory processing, embedding-space association areas, recurrent sequence processing, etc., we might be able to identify a substrate including a minimum number of basic functional modules so that some functions we would have to explicitly build in and others would emerge spontaneously in responding to complex dynamic stimuli. I thought Dennett would be interested since he issued a related challenge in a paper entitled "Why Not the Whole Iguana?" published in Behavioral and Brain Science back in 1978.

Rod Brooks and Cynthia Breazeal responded with a succession of robots including Kismet and Cog, but, despite Rod's protestations, the architecture was basically embodied GOFAI in that there was no substantive, robust learning and the underlying components were not much more sophisticated than your usual AI system, albeit absent were the classic "PLAN", "SENSE" and "EXECUTE" boxes. I spent the a couple of weeks looking for sources of insight regarding the sort of neural correlates I have in mind and identifying the sort of ideal co-conspirators I might recruit from the ranks of cognitive and systems neuroscientists, psychologists, etc.

Starting with the references in my BibTex database, I used a combination of Google Book Search for scanning older books and converted all my Kindle technical books into text documents to search for useful keywords. In the process, I pretty much discounted all the philosophers of mind including Dennett, Patricia Churchland, Jerry Fodor, etc. I also checked out a bunch of Edge profiles like Tania Lombrozo and Tom Griffiths looking for collaborators. For the most part, though I found some candidates inspirational, I'm not sure that collaborating with them would be fruitful as they don't have the sort of architectural mindset I'm looking for to provide useful constraints.

I keep coming back to the value of language and playful and artistic activities like self portraiture, self-evaluation by story telling, self-anatomy by touching, flexing, applying forces ... drawing fanciful machines, aliens, portraits of people with an attitude, ... considered Herbert Spencer's comment "No man is equal to his books." ... I thought about the power of simple physical models of the sort used in explaining concepts like plate tectonics: rigid plates, upheaval, sliding one plate over another, upward and lateral pressure ... and Einstein's Gedancken experiments involving trains on parallel tracks running in opposite directions, physical frames of reference, imagining simple experiments was Einstein's forte, or at least a big step on the path to his greatest achievements, there's also his willingness to ask questions like "What if the speed of light is fixed?", the rest was just — as they say — simple math doodling, even Minkowski's non-Euclidean geometry was simpler to understand using physical intuitions of walking around on curved surfaces.

I remembered the PBS documentary Evolution that came out in 2001. Part VI is entitled "The Mind's Big Bang" and covers the evolutionary development of language, art, and ToM reasoning, starring a familiar cast of characters, including Carroll, Dawkins, Pinker, etc. If you're interested, the second half of the transcript of Part VI is good for a quick refresher or just watch the second half of the episode on YouTube here. It was a good review, but I've long ago absorbed whatever insight was packaged in that documentary, and worry I've been infected with memes that are retarding my progress now. I drilled down on the development of language and ToM reasoning that were featured in Part VI.

On the spontaneous development of language in colonies of feral children, either deaf and without someone to teach them sign language, raised in isolation from a young age due to some disaster or raised by deaf parents with only very limited sign language, more recent work seems guarded with respect to some of the older data. "In fact, most historically recorded cases of feral children, however, suggest that they do not develop any language ability at all, perhaps even failing to develop symbolic abilities" — see here. See the case of Nicaraguan "wild" children described here compared with its mention in the transcript of Part VI. I came across so many conflicting stories I largely dismissed the data.

What are the minimal requirements for ToM reasoning? According to most experts, chimps have [some sort of] a ToM reasoning capability. If so, what are its limitations when compared to humans? On the relationship between the development of language and the development of a mature ToM reasoning capability, Miller (2006) posed a few possible explanations. One idea is that the extent of verbal communication and conversation involving children in a family could explain ToM development. The belief is that this type of language exposure could help introduce a child to the different mental states and perspectives of others. This has been suggested empirically by findings indicating that participation in family discussion predict scores on ToM reasoning tasks (Ruffman, Slade, & Crowe, 2002), as well as findings showing that deaf children who have hearing parents and may not be able to communicate with their parents much during early years of development tend to score lower on ToM reasoning tasks (Wolfe, Want, & Siegal, 2002) — see here. Interesting stuff, but dated, unsubstantiated and not terribly useful.

Current ToM modeling is more complicated than it has to be. If you were to ask X to imagine what it's like to have Y — let's say Y is Queen Elizabeth — over for dinner, the best you could probably hope for is an imitation of the Queen — call it Y′ — that's a lot more like X than Y except that Y′ isn't likely to know where the bathroom is in the house that X lives in. This sort of representation could be modeled using some variant of semantic word embeddings [9, 201, 202, 94, 250]. In order to plan, it's necessary to imagine yourself as not knowing P, making plans to learn P and then acting based on knowing P. This sort of reasoning will require solutions to the binding problem that recent work on attentional systems could facilitate [105, 119].

Predictive coding, the idea that a primary function of the brain is to predict sensory input [47], has been around since Hermann von Helmholtz's unconscious inference theory that we unconsciously fill in for missing information and predict future stimuli from past experience. Rao and Ballard were instrumental in emphasizing the role in computer vision [238, 241, 242, 240, 239], and Olshausen and Field, drawing upon the work of Horace Barlow [17, 18, 215] and his efficient coding hypothesis as well as earlier work in gestalt psychology, showed how such ideas enabled us to apply unsupervised learning to learning about our environment. Both Dehaene's and Barrett's theories reinforce these ideas and ground them in our actual — not imagined — direct experience of the world.

Moving forward by almost a decade, I spent an afternoon searching through the relevant research papers published between 2000 and 2010. The abstracts in the footnote at the end of this sentence provide a reasonable snapshot of the sort of papers I came up with¹¹². Digging deeper, I was not able to identify any consensus regarding the evolution, development or innateness of these facilities or their neural correlates, but the evidence provided in these studies is interesting and the original direction of the research intriguing if only to inform the next generation of models.

I extended my search to papers written since 2010 and winnowed the result of all my searches to a small number labs and research scientists working at the intersection between systems neuroscience and cognitive science, i.e., not just those who profess to be cognitive neuroscientists. As for artificial neural networks and related machine-learning architectures, most of the research I'm interested in leveraging will be familiar to well-read Google engineers especially those in Brain. In addition to the various networks modeling attentional mechanisms [195, 137, 119, 106, 105], I'm interested in technologies operating on less-structured data involving language [267, 268, 266, 265, 264], simple models of logical inference [193, 192, 36, 35] and reasoning about both static and dynamics graphs [160, 85, 194, 159].

September 11, 2017

Here is the abstract of a white paper written in 2013 on building digital agents capable of engaging in continuous open-ended dialog while dealing with ambiguity, recovering from misunderstandings and collaborating with human agents to explore entertainment options and solve everyday problems:

Our approach to natural language understanding (NLU) is — like our approach to NLG — a study in compromise, working hard to find a middle ground between the simple chatterbot and a — yet to be realized — full AI-complete solution capable of passing itself off as a human. The two main threads of our solution consist of (a) extensions of simple keyword and sentiment spotting designed to expand and smooth out the relevant lexical space, and (b) error mitigation and recovery in which we view the process of understanding as a dialog in which we work with the user to narrow down the meaning of user input sufficiently to provide value, e.g., playing music that the user enjoys.

References

[1]	James S. Albus. A theory of cerebellar functions. Mathematical Biology, 10:25--61, 1971.
[2]	Alexander A. Alemi, François Chollet, Geoffrey Irving, Christian Szegedy, and Josef Urban. Deepmath - deep sequence models for premise selection. CoRR, arXiv:1606.04442, 2016.
[3]	Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. CoRR, arXiv:1711.00740, 2017.
[4]	David Alvarez-Melis and Tommi S. Jaakkola. A causal framework for explaining the predictions of black-box sequence-to-sequence models. CoRR, arXiv:1707.01943, 2017.
[5]	G. M. Anderson, T. Foulsham, E. Nasiopoulos, C. S. Chapman, and A. Kingstone. Hide and seek: the theory of mind of visual concealment and search. Attention and Perception Psychophysics, 76(4):907--913, 2014.
[6]	Anonymous. Dynamic memory networks. LDGN Notes and Thoughts on Machine Learning and Artificial Intelligence, 2016.
[7]	Anonymous. Neural program search: Solving programming tasks from description and examples. International Conference on Learning Representations, 2017.
[8]	Anonymous. Integrating episodic memory into a reinforcement learning agent using reservoir sampling. International Conference on Learning Representations, 2018.
[9]	Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. CoRR, arXiv:1502.03520, 2015.
[10]	Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. CoRR, arXiv:1610.06258, 2016.
[11]	Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, arXiv:1607.06450, 2016.
[12]	Jimmy Lei Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In Submitted to International Conference on Learning Representations, page [arXiv:1412.7755], 2015.
[13]	Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, arXiv:1409.0473, 2014.
[14]	Gareth Ball, Paul R. Stokes, Rebecca A. Rhodes, Subrata K. Bose, Iead Rezek, Alle-Meije Wink, Louis-David Lord, Mitul A. Mehta, Paul M. Grasby, and Federico E. Turkheimer. Executive functions and prefrontal cortex: A matter of persistence? Frontiers in Systems Neuroscience, 5:3, 2011.
[15]	Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. CoRR, arXiv:1611.01989, 2016.
[16]	Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. In International Conference on Learning Representations, 2017.
[17]	Horace B. Barlow. Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217--234. MIT Press, Cambridge, MA, 1961.
[18]	Horace B. Barlow. Unsupervised learning. Neural Computation, 1:295--311, 1989.
[19]	L.F. Barrett. How Emotions Are Made. Pan Macmillan, 2017.
[20]	Lisa Feldman Barrett. Functionalism cannot save the classical view of emotion. Social cognitive and affective neuroscience, 12(1):34--36, 2016.
[21]	Lisa Feldman Barrett. The theory of constructed emotion: an active inference account of interoception and categorization. Social cognitive and affective neuroscience, 12(1):1--23, 2017.
[22]	Kenneth Basye, Thomas Dean, Jak Kirman, and Moises Lejter. A decision-theoretic approach to planning, perception, and control. IEEE Expert, 7:58--65, 1992.
[23]	Stephen Batchelor. Secular Buddhism: Imagining the Dharma in an Uncertain World. Yale University Press, 2017.
[24]	Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. CoRR, arXiv:1612.00222, 2016.
[25]	Yoshua Bengio. The consciousness prior. CoRR, arXiv:1709.08568, 2017.
[26]	S. D. Bilbo. Early-life infection is a vulnerability factor for aging-related glial alterations and cognitive decline. Neurobiology of Learning and Memory, 94(1):57--64, 2010.
[27]	S. D. Bilbo and J.M. Schwarz. The immune system and developmental programming of brain and behavior. Frontiers in Neuroendocrinology, 33(3):267--286, 2012.
[28]	Sandra Blakeslee and Matthew Blakeslee. The Body Has a Mind of Its Own. Random House, 2007.
[29]	Dan Bohus. Error Awareness and Recovery in Conversational Spoken Language Interfaces. PhD thesis, Carnegie Mellon University, 2007.
[30]	Dan Bohus and Alexander I. Rudnicky. The RavenClaw Dialog Management Framework: Architecture and Systems. Computer Speech and Language, 23:332--361, 2009.
[31]	Dan Bohus and Alexander I. Rudnicky. The RavenClaw dialog management framework: architecture and systems. Computer Speech & Language, 23:332--361, 2009.
[32]	Wayne C. Booth. The Knowledge Most Worth Having. University of Chicago Press, 1967.
[33]	Craig Boutilier, Thomas Dean, and Steve Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1--94, 1999.
[34]	L. C. Bowman, D. Liu, A. N. Meltzoff, and H. M. Wellman. Neural correlates of belief- and desire-reasoning in 7- and 8-year-old children: an event-related potential study. Development Science, 15(5):618--632, 2012.
[35]	Sam Bowman. Can recursive neural tensor networks learn logical reasoning? CoRR, 2014.
[36]	Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. Recursive neural networks for learning logical semantics. CoRR, arXiv:1406.1827, 2014.
[37]	Charles F. Cadieu, Ha Hong, Daniel L. K. Yamins, Nicolas Pinto, Diego Ardila, Ethan A. Solomon, Najib J. Majaj, and James J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10:e1003963, 2014.
[38]	Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602--606, 2017.
[39]	Luis Carrillo-Reid, Weijian Yang, Yuki Bando, Darcy S. Peterka, and Rafael Yuste. Imprinting and recalling cortical ensembles. Science, 353(6300):691--694, 2016.
[40]	Lorena Chanes and Lisa Feldman Barrett. Redefining the role of limbic areas in cortical processing. Trends in Cognitive Sciences, 20(2):96--106, 2016.
[41]	Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional object-based approach to learning physical dynamics. CoRR, arXiv:1612.00341, 2016.
[42]	Alexander Chistyakov, Ekaterina Lobacheva, Arseny Kuznetsov, and Alexey Romanenko. Semantic embeddings for program behaviour patterns. In ICLR Workshop, 2017.
[43]	K. Cho, B. Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, arXiv:406.1078, 2014.
[44]	Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, arXiv:1409.1259, 2015.
[45]	Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder--decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724--1734, Doha, Qatar, 2014. Association for Computational Linguistics.
[46]	Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, pages 2067--2075, 2015.
[47]	A. Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav Brain Sci, 36(3):181--204, 2013.
[48]	Martin Conway. Chapter 1.2 exploring episodic memory. In E. Dere, A. Easton, L. Nadel, and J.P. Huston, editors, Handbook of Behavioral Neuroscience, pages 34--45. Elsevier Science, 2008.
[49]	Nelson Cowan. What are the differences between long-term, short-term, and working memory? Progress in Brain Research, 169:323--338, 2008.
[50]	Francis Crick and Christof Koch. A framework for consciousness. Nature Neuroscience, 6:119--126, 2003.
[51]	Francis C. Crick and Christof Koch. What is the function of the claustrum? Philosophical Transactions of the Royal Society B: Biological Sciences, 360:1271--1279, 2005.
[52]	Andrea Crotti and Richard M. Ransohoff. Microglial physiology and pathophysiology: Insights from genome-wide transcriptional profiling. Immunity, 44:505--515, 2018.
[53]	Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. CoRR, arXiv:1602.03032, 2016.
[54]	Alain De Botton. How Proust Can Change Your Life. Picador, 1997.
[55]	Thomas Dean. Planning and temporal reasoning under uncertainty. In Proceedings of the IEEE Workshop on Principles of Knowledge-Based Systems, pages 131--138. IEEE, 1984.
[56]	Thomas Dean. Temporal reasoning involving counterfactuals and disjunctions. In Proceedings IJCAI-85, pages 860--866, San Francisco, California, 1985. IJCAI, Morgan Kaufmann Publishers.
[57]	Thomas Dean. Intractability and time-dependent planning. In Michael P. Georgeff and Amy L. Lansky, editors, Reasoning About Actions and Plans, pages 245--266. Morgan Kaufmann Publishers, San Francisco, California, 1986.
[58]	Thomas Dean. Large-scale temporal databases for planning in complex domains. In Proceedings IJCAI-87, pages 860--866, San Francisco, California, 1987. IJCAI, Morgan Kaufmann Publishers.
[59]	Thomas Dean. Using temporal hierarchies to efficiently maintain large temporal databases. Journal of the ACM, 36:687--718, 1989.
[60]	Thomas Dean. Inferring mesoscale models of neural computation. CoRR, arXiv:1710.05183, 2017.
[61]	Thomas Dean, Kenneth Basye, Robert Chekaluk, Seungseok Hyun, Moises Lejter, and Margaret Randazza. Coping with uncertainty in a control system for navigation and exploration. In Proceedings AAAI-90, pages 1010--1015, Cambridge, Massachusetts, 1990. AAAI, MIT Press.
[62]	Thomas Dean and Mark Boddy. An analysis of time-dependent planning. In Proceedings AAAI-88, pages 49--54, Cambridge, Massachusetts, 1988. AAAI, MIT Press.
[63]	Thomas Dean and Mark Boddy. Reasoning about partially ordered events (Also appears in ``Readings in Qualitative Reasoning About Physical Systems'' (Morgan Kaufmann), edited by Dan Weld and Johan De Kleer). Artificial Intelligence Journal, 36:375--399, 1988.
[64]	Thomas Dean, R. James Firby, and David P. Miller. Hierarchical planning involving deadlines, travel time and resources (Also appears in ``Readings in Planning'' (Morgan Kaufmann), edited by James Allen, James Hendler, and Austin Tate, and in ``Autonomous Mobile Robots: Control, Planning, and Architecture'' (IEEE Computer Society Press), edited by S. S. Iyengar and Alberto Elfes). Computational Intelligence Journal, 4:381--398, 1988.
[65]	Thomas Dean and Michael Wellman. Planning and Control. Morgan Kaufmann Publishers, San Francisco, California, 1991.
[66]	Stanislas Dehaene. The Number Sense: How the Mind Creates Mathematics. Oxford University Press, 1999.
[67]	Stanislas Dehaene. Reading in the Brain: The Science and Evolution of a Human Invention. Viking Press, 2009.
[68]	Stanislas Dehaene. Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts. Viking Press, 2014.
[69]	Stanislas Dehaene and Jean-Pierre Changeux. A hierarchical neuronal network for planning behavior. Proceedings of the National Academy of Sciences, 94:13293--13298, 1997.
[70]	Stanislas Dehaene and Jean-Pierre Changeux. Experimental and theoretical approaches to conscious processing. Neuron, 70:200--227, 2017.
[71]	Stanislas Dehaene, Jean-Pierre Changeux, Lionel Naccache, Jèrôme Sackur, and Claire Sergent. Conscious, preconscious, and subliminal processing: a testable taxonomy. Trends in Cognitive Sciences, 10:204--211, 2017.
[72]	Stanislas Dehaene, Michel Kerszberg, and Jean-Pierre Changeux. A neuronal model of a global workspace in effortful cognitive tasks. Proceedings of the National Academy of Sciences, 95:14529--14534, 1998.
[73]	Stanislas Dehaene, Hakwan Lau, and Sid Kouider. What is consciousness, and could machines have it? Science, 358(6362):486--492, 2017.
[74]	Stanislas Dehaene and Lionel Naccache. Towards a cognitive neuroscience of consciousness: basic evidence and a workspace framework. Cognition, 79(1):1--37, 2001.
[75]	A. Del Cul, S. Dehaene, P. Reyes, E. Bravo, and A. Slachevsky. Causal role of prefrontal cortex in the threshold for access to consciousness. Brain, 132(9):2531--2540, 2009.
[76]	Daniel Dennett. Consciousness Explained. Penguin, London, 1991.
[77]	Daniel C. Dennett. The Intentional Stance. A Bradford book. A Bradford Book, 1989.
[78]	E. Dere, A. Easton, L. Nadel, and J.P. Huston. Handbook of Episodic Memory. Handbook of Behavioral Neuroscience. Elsevier Science, 2008.
[79]	Jacob Devlin, Rudy R Bunel, Rishabh Singh, Matthew Hausknecht, and Pushmeet Kohli. Neural program meta-induction. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2077--2085. Curran Associates, Inc., 2017.
[80]	Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy I/O. CoRR, arXiv:1703.07469, 2017.
[81]	Pawel Dlotko, Kathryn Hess, Ran Levi, Max Nolte, Michael Reimann, Martina Scolamiero, Katharine Turner, Eilif Muller, and Henry Markram. Topological analysis of the connectome of digital reconstructions of neural microcircuits. CoRR, arXiv:1601.01580, 2016.
[82]	Li Dong and Mirella Lapata. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 33--43, Stroudsburg, PA, USA, 2016. Association for Computational Linguistics.
[83]	Chao Du, Fuxi Cai, Mohammed A. Zidan, Wen Ma, Seung Hwan Lee, and Wei D. Lu. Reservoir computing using dynamic memristors for temporal information processing. Nature Communications, 8:2204, 2017.
[84]	W. Durant. The Story of Philosophy. Simon and Schuster, 1965.
[85]	David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2224--2232. Curran Associates, Inc., 2015.
[86]	Chris Eliasmith. How to Build a Brain: A Neural Architecture for Biological Cognition. Oxford Series on Cognitive Modeling. Oxford University Press USA, 2013.
[87]	E. Ermer, S. A. Guerin, L. Cosmides, J. Tooby, and M. B. Miller. Theory of mind broad and narrow: reasoning about social exchange engages ToM areas, precautionary reasoning does not. Society of Neuroscience, 1(3-4):196--219, 2006.
[88]	Michael D. Ernst. Natural language is a programming language: Applying natural language processing to software development. In SNAPL 2017: the 2nd Summit oN Advances in Programming Languages, pages 4:1--4:14, Asilomar, CA, USA, 2017.
[89]	Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. How to Design Programs: An Introduction to Programming and Computing. MIT Press, Cambridge, MA, 2001.
[90]	Richard Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence Journal, 2:189--208, 1971.
[91]	Richard E. Fikes, Peter E. Hart, and Nils J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence Journal, 3:251--288, 1972.
[92]	R. James Firby, Thomas L. Dean, and David P. Miller. Efficient robot planning with deadlines and travel time. In Proceedings of the 6th International Symposium on Robotics and Automation. IASTED, 1985.
[93]	D. Fox, W. Burgard S. Thrun, and F. Dellaert. Particle filters for mobile robot localization. In A. Doucet, N. de Freitas, and Gordon. N., editors, Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2000.
[94]	Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomàs Mikolov. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems 26, pages 2121--2129, 2013.
[95]	Toshikatsu Fujii. Chapter 3.6 the basal forebrain and episodic memory. In E. Dere, A. Easton, L. Nadel, and J.P. Huston, editors, Handbook of Behavioral Neuroscience, pages 357--377. Elsevier Science, 2008.
[96]	J. M. Gardiner. Episodic memory and autonoetic consciousness: a first-person approach. Philosophical Transactions of the Royal Society London B Biological Science, 356:1351--1361, 2001.
[97]	John W. Gardner. Self-Renewal: The Individual and the Innovative Society. Norton, 1995.
[98]	Sam Gershman, Tobias Gerstenberg, Chris Baker, and Fiery Cushman. Plans, habits, and theory of mind. PLoS ONE, 2016.
[99]	Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. Contextual LSTM (CLSTM) models for large scale NLP tasks. CoRR, arXiv:1602.06291, 2016.
[100]	I. J. Good. A five-year plan for automatic chess. In E. Dale and D. Michie, editors, Machine Intelligence 2, volume 2, pages 89--118. Oliver and Boyd, London, 1968.
[101]	I.J. Good. Good Thinking: The Foundations of Probability and Its Applications. University of Minnesota Press, 1983.
[102]	Noah D. Goodman, Joshua B. Tenenbaum, and T. Gerstenberg. Concepts in a probabilistic language of thought. In Morgolis and Lawrence, editors, The Conceptual Mind: New Directions in the Study of Concepts. MIT Press, 2015.
[103]	Erik D. Goodwyn. The Neurobiology of the Gods: How Brain Physiology Shapes the Recurrent Imagery of Myth and Dreams. Taylor & Francis, 2012.
[104]	Alison Gopnik and Andrew N. Meltzoff. Words, Thoughts, and Theories. MIT Press, Cambridge, MA, USA, 1998.
[105]	Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, arXiv:1410.5401, 2014.
[106]	Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdoménech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471--476, 2016.
[107]	Michael Graziano. Consciousness and the Social Brain. Oxford University Press, New York, NY, 2013.
[108]	Michael Graziano. Consciousness is not mysterious: It’s just the brain describing itself—to itself. The Atlantic Magazine, 2016, 2016.
[109]	Michael Graziano. How consciousness explains ventriloquists and religion: The brain projects its own qualities onto the world around it—for better or worse. The Atlantic Magazine, 2016, 2016.
[110]	Michael Graziano. How phantom limbs explain consciousness: The brain’s model of the body can tell us a lot about its model of attention. The Atlantic Magazine, 2016, 2016.
[111]	Michael Graziano. Most popular theories of consciousness are worse than wrong: They play to our intuitions, but don’t actually explain anything. The Atlantic Magazine, 2016, 2016.
[112]	Michael Graziano. A new theory explains how consciousness evolved: A neuroscientist on how we came to be aware of ourselves. The Atlantic Magazine, 2016, 2016.
[113]	Michael Graziano. Your brain sees things that you don't: Understanding the difference between awareness and attention might be the key to unlocking the mystery of human consciousness. The Atlantic Magazine, 2016, 2016.
[114]	Michael Graziano and Mathew Botvinick. How the brain represents the body: insights from neurophysiology and psychology. In W. Prinz and B. Hommel, editors, Common Mechanisms in Perception and Action. Attention and Performance, pages 136--157. Oxford University Press, Oxford, UK, 2002.
[115]	Michael S. A. Graziano. The attention schema theory: A foundation for engineering artificial consciousness. Frontiers in Robotics and AI, 4:60, 2017.
[116]	Michael S. A. Graziano and Taylor W. Webb. The attention schema theory: a mechanistic account of subjective awareness. Frontiers in Psychology, 6:500, 2015.
[117]	Michael S.A. Graziano, Charles G. Gross, Charlotte S.R. Taylor, and Tirin Moore. A system of multimodal areas in the primate brain. In Spence and Driver, editors, Crossmodal Space and Crossmodal Attention. Oxford University Press, Oxford, 2004.
[118]	Ralph J. Greenspan. Seymour benzer (1921-2007). Current Biology, 18:106--110, 2007.
[119]	Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, arXiv:1502.04623, 2015.
[120]	Edward Groshev, Aviv Tamar, Siddharth Srivastava, and Pieter Abbeel. Learning generalized reactive policies using deep neural networks. CoRR, arXiv:1708.07280, 2017.
[121]	Sergio Guadarrama, Lorenzo Riano, Dave Golland, Daniel Gouhring, Yangqing Jia, Dan Klein, Pieter Abbeel, and Trevor Darrell. Grounding spatial relations for human-robot interaction. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1640--1647, 2013.
[122]	Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1--119, 2017.
[123]	N. Gupta, S. K. Mandal, J. Malave, A. Mandal, and R. N. Mahapatra. A hardware scheduler for real time multiprocessor system on chip. In 2010 23rd International Conference on VLSI Design, pages 264--269, 2010.
[124]	Michael Gygli, Mohammad Norouzi, and Anelia Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. CoRR, arXiv:1703.04363, 2017.
[125]	Michael Gygli, Mohammad Norouzi, and Anelia Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1341--1351. PMLR, 2017.
[126]	Farsin Hamzei, Magnus-Sebastian Vry, Dorothee Saur, Volkmar Glauche, Markus Hoeren, Irina Mader, Cornelius Weiller, and Michel Rijntjes. The dual-loop model and the human mirror neuron system: an exploratory combined fmri and dti study of the inferior frontal gyrus. Cerebral Cortex, 26(5):2215--2224, 2016.
[127]	Alexander Hanuschkin, Markus Diesmann, and Abigail Morrison. A reafferent and feed-forward model of song syntax generation in the bengalese finch. Journal Compututational Neuroscience, 31:509--532, 2011.
[128]	Yuval Noah Harari. Sapiens: A Brief History of Humankind. HarperCollins, 2015.
[129]	Yuval Noah Harari. Homo Deus: A Brief History of Tomorrow. HarperCollins, 2017.
[130]	William A. Harris. Seymour benzer 1921-2007 the man who took us from genes to behaviour. PLoS Biology, 6(2):1--3, 2008.
[131]	Demis Hassabis and Eleanor A. Maguire. Deconstructing episodic memory with construction. Trends in Cognitive Science, 11:299--306, 2007.
[132]	Demis Hassabis and Eleanor A. Maguire. The construction system of the brain. Philosophical Transactions of the Royal Society London B Biological Science, 364:1263--1271, 2009.
[133]	Barbara Hayes-Roth. A blackboard architecture for control. Artificial Intelligence Journal, 26:251--321, 1985.
[134]	Barbara Hayes-Roth, Richard Washington, Rattikorn Hewett, Michael Hewett, and Adam Seiver. Intelligent monitoring and control. In Proceedings IJCAI 11, pages 243--249. IJCAII, 1989.
[135]	Donald Hebb. The Organization of Behavior. Wiley, New York, 1949.
[136]	Moritz Helmstaedter, Kevin L. Briggman, Srinivas C. Turaga, Viren Jain, H. Sebastian Seung, and Winfried Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500:168--174, 2013.
[137]	Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. CoRR, arXiv:1506.03340, 2015.
[138]	Gregory Hickok and David Poeppel. The cortical organization of speech processing. Nature Reviews Neuroscience, 8:393, 2007.
[139]	Tonya Hines. Anatomy of the brain. Mayfield Clinic Public Service, 2016.
[140]	G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Chapter 3: Distributed Representations. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.
[141]	G. E. Hinton and D. C. Plaut. Using fast weights to deblur old. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, pages 177--186. Lawrence Erlbaum Associates, 1987.
[142]	Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma thesis. Diploma thesis. Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
[143]	Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computing, 9:1735--1780, 1997.
[144]	Katie Hoemann, Maria Gendron, and Lisa Feldman Barrett. Mixed emotions in the predictive brain. Current Opinion in Behavioral Sciences, 15:51--57, 2017.
[145]	Andreas Holzinger, Markus Plass, Katharina Holzinger, Gloria Cerasela Crisan, Camelia-Mihaela Pintea, and Vasile Palade. A glass-box interactive machine learning approach for solving np-hard problems with the human-in-the-loop. CoRR, arXiv:1708.01104, 2017.
[146]	Eric J. Horvitz. Reasoning under varying and uncertain resource constraints. In Proceedings AAAI-88, pages 111--116. AAAI, 1988.
[147]	Yu Hu, James Trousdale, Kres̈imir Josíc, and Eric Shea-Brown. Motif statistics and spike correlations in neuronal networks. CoRR, arXiv:1206.3537, 2015.
[148]	Q. Huang, P. Smolensky, X. He, L. Deng, and D. Wu. Tensor Product Generation Networks. CoRR, 2017.
[149]	Christian Humpel. Organotypic brain slice cultures: A review. Neuroscience, 305(Supplement C):86--98, 2015.
[150]	S.M. Huttegger. The Probabilistic Foundations of Rational Learning. The Probabilistic Foundations of Rational Learning. Cambridge University Press, 2017.
[151]	Elias B. Issa, Charles F. Cadieu, and James J. DiCarlo. Evidence that the ventral stream codes the errors used in hierarchical inference and learning. bioRxiv, 2016.
[152]	Masao Ito. Control of mental activities by internal models in the cerebellum. Nature Reviews Neuroscience, 9:304--313, 2008.
[153]	Masao Ito. The Cerebellum: Brain for an Implicit Self. Financial Times Press, 2012.
[154]	L. Itti and P. Baldi. A principled approach to detecting surprising events in video. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 631--637, San Siego, CA, 2005.
[155]	L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1254--1259, Nov 1998.
[156]	E.R. Kandel, J.H. Schwartz, T.M. Jessell, S.A. Siegelbaum, and A. J. Hudspeth. Principles of neural science (Fifth Edition). McGraw-Hill, Health Professions Division, 2012.
[157]	Andrej Karparthy. The unreasonable effectivenss of recursive neutal networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2015.
[158]	Andrej Karparthy. Convolutional neural networks for visual recognition. http://cs231n.github.io/convolutional-networks/, 2016.
[159]	Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 3581--3589, Cambridge, MA, USA, 2014. MIT Press.
[160]	Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
[161]	Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. CoRR, arXiv:1506.06726, 2015.
[162]	Ian Kleckner, Jiahe Zhang, Alexandra Touroutoglou, Lorena Chanes, Chenjie Xia, W Kyle Simmons, Karen Quigley, Bradford Dickerson, and Lisa Barrett. Evidence for a large-scale brain system supporting allostasis and interoception in humans. bioRxiv, page 098970, 2017.
[163]	Christof Koch. The Quest for Consciousness: A Neurobiological Approach. MIT Press, 2004.
[164]	Christof Koch. Consciousness: Confessions of a Romantic Reductionist. MIT Press, 2012.
[165]	Etienne Koechlin, Gregory Corrado, Pietro Pietrini, and Jordan Grafman. Dissociating the role of the medial and lateral anterior prefrontal cortex in human planning. Proceedings of the National Academy of Sciences, 97:7651--7656, 2000.
[166]	Etienne Koechlin and Thomas Jubault. Broca's area and the hierarchical organization of human behavior. Neuron, 50:963--974, 2006.
[167]	S. Kouider, C. Stahlhut, S. V. Gelskov, L. S. Barbosa, M. Dutat, V. de Gardelle, A. Christophe, S. Dehaene, and G. Dehaene-Lambertz. A neural marker of perceptual consciousness in infants. Science, 340(6130):376--380, 2013.
[168]	Trenton Kriete, David C. Noelle, Jonathan D. Cohen, and Randall C. O'Reilly. Indirection and symbol-like processing in the prefrontal cortex and basal ganglia. Proceedings of the National Academy of Sciences, 2013.
[169]	Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. CoRR, arXiv:1506.07285, 2015.
[170]	Nate Kushman and Regina Barzilay. Using semantic unification to generate regular expressions from natural language. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 826--836. Association for Computational Linguistics, 2013.
[171]	Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332--1338, 2015.
[172]	Antonio H. Lara and Jonathan D. Wallis. The role of prefrontal cortex in working memory: A mini review. Frontiers in System Neuroscience, 9:173, 2015.
[173]	Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. Combining language and vision with a multimodal skip-gram model. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies. Association for Computational Linguistics, 2015.
[174]	Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, pages II--1188--II--1196. JMLR.org, 2014.
[175]	Quoc Le and Tomàs Mikolov. Distributed representations of sentences and documents. CoRR, arXiv:1405.4053v2, 2014.
[176]	Gerry Leisman, Ahmed A. Moustafa, and Tal Shafir. Thinking, walking, talking: Integratory motor and cognitive brain function. Frontiers in Public Health, 4:94, 2016.
[177]	Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. CoRR, arXiv:1506.01057, 2015.
[178]	Xiujun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. End-to-end task-completion neural dialogue systems. CoRR, arXiv:1703.01008, 2017.
[179]	Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. CoRR, arXiv:1511.05493, 2015.
[180]	Yuxi Li. Deep reinforcement learning: An overview. CoRR, arXiv:1701.07274, 2017.
[181]	Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. CoRR, arXiv:1611.00020, 2016.
[182]	Shane A. Liddelow, Kevin A. Guttenplan, Laura E. Clarke, Frederick C. Bennett, Christopher J. Bohlen, Lucas Schirmer, Mariko L. Bennett, Alexandra E. Munch, Won-Suk Chung, Todd C. Peterson, Daniel K. Wilton, Arnaud Frouin, Brooke A. Napier, Nikhil Panicker, Manoj Kumar, Marion S. Buckwalter, David H. Rowitch, Valina L. Dawson, Ted M. Dawson, Beth Stevens, and Ben A. Barres. Neurotoxic reactive astrocytes are induced by activated microglia. Nature, 541:481--487, 2017.
[183]	Philip Lieberman. Human language and our reptilian brain: The subcortical bases for speech, syntax and thought. Harvard University Press, Cambridge, MA, 2002.
[184]	Philip Lieberman. On the nature and evolution of the neural bases of human language. American Journal of Physical Anthropology, 119:36--62, December 2002.
[185]	Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.
[186]	Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, and Michael D. Ernst. Program synthesis from natural language using recurrent neural networks. Technical Report UW-CSE-17-03-01, University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, 2017.
[187]	D. Liu, A. N. Meltzoff, and H. M. Wellman. Neural correlates of belief- and desire-reasoning. Child Development, 80(4):1163--1171, 2009.
[188]	Nicholas Locascio, Karthik Narasimhan, Eduardo DeLeon, Nate Kushman, and Regina Barzilay. Neural generation of regular expressions from natural language with minimal domain knowledge. CoRR, arxiv:1608.03000, 2016.
[189]	Fan Long and Martin Rinard. An analysis of the search spaces for generate and validate patch generation systems. In Proceedings of the 38th International Conference on Software Engineering, pages 702--713, New York, NY, USA, 2016. ACM.
[190]	Sarah M. Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. CoRR, arXiv:1701.06972, 2017.
[191]	Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412--1421, 2015.
[192]	Bill MacCartney and Christopher Manning. Natural logic for textual inference. In Proceedings of ACL Workshop on Textual Entailment and Paraphrasing, 2007.
[193]	Bill MacCartney and Christopher D. Manning. An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics, pages 140--156, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[194]	Franco Manessi, Alessandro Rozza, and Mario Manzo. Dynamic graph convolutional networks. CoRR, arXiv:1704.06199, 2017.
[195]	Adam H. Marblestone, Greg Wayne, and Konrad P. Kording. Towards an integration of deep learning and neuroscience. CoRR, arXiv:1606.03813, 2016.
[196]	Lori Marino. Sentience. In Michael D. Breed and Janice Moore, editors, Encyclopedia of Animal Behavior, pages 132--138. Academic Press, 2010.
[197]	David Marr. A theory of cerebellar cortex. Journal of Physiology, 202:437--470, 1969.
[198]	An Unusually Hardcore Dharma Book Mastering the Core Teachings of the Buddha. Daniel Ingram. 2008, Aeon Books.
[199]	Drew V. McDermott. Planning and acting. Cognitive Science, 2:71--109, 1978.
[200]	Tomàs Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, arXiv:1301.3781, 2013.
[201]	Tomàs Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. CoRR, arXiv:1309.4168, 2013.
[202]	Tomàs Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, arXiv:1310.4546, 2013.
[203]	David P. Miller, James R. Firby, and Thomas L. Dean. Deadlines, travel time, and robot problem solving. In Proceedings IJCAI-85, pages 1052--1054, San Francisco, California, 1985. IJCAI, Morgan Kaufmann Publishers.
[204]	R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824--827, 2002.
[205]	Marvin Minsky. The Society of Mind. Simon & Schuster, New York, NY, 1987.
[206]	Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. CoRR, arXiv:1406.6247, 2014.
[207]	Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. CoRR, arXiv:1312.5602, 2013.
[208]	H. P. Moravec and A. Elfes. High resolution maps from wide angle sonar. In IEEE International Conference on Robotics and Automation, pages 138--145, 1985.
[209]	Ñāṇavīra Thera. Consciousness (Viññāṇa). In Samonera Bodhesako and Forrest Williams, editors, Clearing the Path: Writings of Ñāṇavīra Thera (1960-1965). Path Press, 1987.
[210]	Srini Narayanan. The role of cortico-basal-thalamic loops in cognition: a computational model and preliminary results. Neurocomputing, 5254:605--614, 2003.
[211]	Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. CoRR, arXiv:1511.04834, 2015.
[212]	John O'Keefe and Lynn Nadel. The hippocampus as a cognitive map. Clarendon Press, 1978.
[213]	Chris Olah and Shan Carter. Attention and augmented recurrent neural networks. Distill, 2016.
[214]	Christopher Olah. Understanding lstm networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs, 2015.
[215]	B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311--3325, 1997.
[216]	Suzanne Oosterwijk, Kristen A Lindquist, Morenikeji Adebayo, and Lisa Feldman Barrett. The neural representation of typical and atypical experiences of negative images: comparing fear, disgust and morbid fascination. Social cognitive and affective neuroscience, 11(1):11--22, 2015.
[217]	Suzanne Oosterwijk, Lukas Snoek, Mark Rotteveel, Lisa Feldman Barrett, and H Steven Scholte. Shared states: using mvpa to test neural overlap between self-focused emotion imagery and other-focused emotion understanding. Social Cognitive and Affective Neuroscience, page nsx037, 2017.
[218]	R. C. O'Reilly. Six principles for biologically based computational models of cortical cognition. Trends Cognitive Science), 2(11):455--462, 1998.
[219]	Randall O'Reilly and Yuko Munakata. Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press, Cambridge, Massachusetts, 2000.
[220]	Randall C. O'Reilly. Biologically based computational models of high-level cognition. Science, 314:91--94, 2006.
[221]	Randall C. O'Reilly and Michael J. Frank. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18:283--328, 2006.
[222]	Randall C. O'Reilly, Seth A. Herd, and Wolfgang M. Pauli. Computational models of cognitive control. Current Opinion in Neurobiology, 20:257--261, 2010.
[223]	Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David P. Reichert, Theophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. CoRR, arXiv:1707.06170, 2017.
[224]	Zhanglin Peng, Ruimao Zhang, Xiaodan Liang, Xiaobai Liu, and Liang Lin. Geometric scene parsing with hierarchical LSTM. CoRR, arXiv:1604.01931, 2016.
[225]	Alex (Sandy) Pentland. Honest Signals: How They Shape Our World. The MIT Press, 2008.
[226]	Alex (Sandy) Pentland. Social Physics: How Good Ideas Spread the Lessons from a New Science. Penguin Press, 2014.
[227]	Andreas R. Pfenning, Erina Hara, Osceola Whitney, Miriam V. Rivas, Rui Wang, Petra L. Roulhac, Jason T. Howard, Morgan Wirthlin, Peter V. Lovell, Ganeshkumar Ganapathy, Jacquelyn Mouncastle, M. Arthur Moseley, J. Will Thompson, Erik J. Soderblom, Atsushi Iriki, Masaki Kato, M. Thomas P. Gilbert, Guojie Zhang, Trygve Bakken, Angie Bongaarts, Amy Bernard, Ed Lein, Claudio V. Mello, Alexander J. Hartemink, and Erich D. Jarvis. Convergent transcriptional specializations in the brains of humans and song-learning birds. Science, 346, 2014.
[228]	Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. CoRR, arXiv:1802.03268, 2018.
[229]	Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas J. Guibas. Learning program embeddings to propagate feedback on student code. CoRR, arXiv:1505.05969, 2015.
[230]	Tony Plate. Holographic reduced representations: Convolution algebra for compositional distributed representations. In International Joint Conference on Artificial Intelligence, pages 30--35. Morgan Kaufmann, 1991.
[231]	Tony A. Plate. Holographic Reduced Representation: Distributed Representation for Cognitive Structures. CSLI Publications, Stanford, CA, USA, 2003.
[232]	Ah-Hwee Tan Poo-Hee Chang. Encoding and recall of spatio-temporal episodic memory in real time. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1490--1496, 2017.
[233]	Michael I. Posner. Attentional networks and consciousness. Frontiers in Psychology, 3:64, 2012.
[234]	Marco Prinz, Daniel Erny, and Nora Hagemeyer. Ontogeny and homeostasis of CNS myeloid cells. Nature Immunology, 18:385--392, 2017.
[235]	Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. CoRR, arXiv:1703.01988, 2017.
[236]	David V. Pynadath, Paul S. Rosenbloom, and Stacy C. Marsella. Reinforcement learning for adaptive theory of mind in the sigma cognitive architecture. In Ben Goertzel, Laurent Orseau, and Javier Snaider, editors, Artificial General Intelligence: 7th International Conference, AGI 2014, Quebec City, QC, Canada, August 1-4, 2014. Proceedings, pages 143--154. Springer International Publishing, 2014.
[237]	Giorgia Quadrato, Tuan Nguyen, Evan Z. Macosko, John L. Sherwood, Sung Min Yang, Daniel R. Berger, Natalie Maria, Jorg Scholvin, Melissa Goldman, Justin P. Kinney, Edward S. Boyden, Jeff W. Lichtman, Ziv M. Williams, Steven A. McCarroll, and Paola Arlotta. Cell diversity and network dynamics in photosensitive human brain organoids. Nature, 545:48--53, 2017.
[238]	R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki. Probabilistic Models of the Brain: Perception and Neural Function. MIT Press, Cambridge, MA, 2002.
[239]	Rajesh P. N. Rao and Dana H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9:721--763, 1996.
[240]	Rajesh P. N. Rao and Dana H. Ballard. Localized receptive fields may mediate transformation-invariant recognition in the visual cortex. Technical report, National Resource Laboratory for the Study of Brain and Behavior, University of Rochester, 1997.
[241]	Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2:79--87, 1999.
[242]	Rajesh P. N. Rao and Daniel L. Ruderman. Learning lie groups for invariant visual perception. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. MIT Press, Cambridge, MA, 1999.
[243]	Scott E. Reed and Nando de Freitas. Neural programmer-interpreters. CoRR, arXiv:1511.06279, 2015.
[244]	Charles Rich and Richard C. Waters. The programmer's apprentice: A research overview. Computer, 21(11):10--25, 1988.
[245]	Michel Rijntjes, Cornelius Weiller, Tobias Bormann, and Mariachristina Musso. The dual loop model: its relation to language and other modalities. Frontiers in Evolutionary Neuroscience, 4:9, 2012.
[246]	Giacomo Rizzolatti and Stefano Rozzi. Motor cortex and mirror system in monkeys and humans. In Gregory Hickok and Steven L. Small, editors, Neurobiology of Language, pages 59--72. Academic Press, San Diego, 2016.
[247]	Horacio Rodrìguez. Advanced natural language processing. Facultat d'Informatica Barcelona, Universitat Politecnica Catalunya, 2017.
[248]	Amir Rosenfeld, Mahdi Biparva, and John K. Tsotsos. Priming neural networks. CoRR, arXiv:1711.05918, 2017.
[249]	C. Rottschy, R. Langner, I. Dogan, K. Reetz, A. R. Laird, J. B. Schulz, P. T. Fox, and S. B. Eickhoff. Modelling neural correlates of working memory: A coordinate-based meta-analysis. Neuroimage, 60:830--846, 2012.
[250]	David E. Rumelhart and Adele A. Abrahamson. A model for analogical reasoning. Cognitive Psychology, 5:1--28, 1973.
[251]	Stuart J. Russell and Eric H. Wefald. Principles of metareasoning. In Ronald J. Brachman, Hector J. Levesque, and Raymond Reiter, editors, Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pages 400--411. Morgan Kaufmann, San Francisco, California, 1989.
[252]	Michael W. Salter and Beth Stevens. Microglia emerge as central players in brain disease. Nature Medicine, 23:1018--1027, 2017.
[253]	Robert M. Sapolsky. Behave: The Biology of Humans at Our Best and Worst. Penguin Publishing Group, 2017.
[254]	Daniel L. Schacter, Donna Rose Addis, Demis Hassabis, Victoria C. Martin, R. Nathan Spreng, and Karl K. Szpunar. The future of memory: Remembering, imagining, and the brain. Neuron, 76:677--694, 2012.
[255]	Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Compututation, 4:234--242, 1992.
[256]	Susan Schneider and Max Velmans. Introduction. In Susan Schneider and Max Velmans, editors, The Blackwell Companion to Consciousness, 2nd Edition. Wiley-Blackwell, 2017.
[257]	Oliver G. Selfridge. Pandemonium: A paradigm for learning. In D. V. Blake and A. M. Uttley, editors, Proceedings of the Symposium on Mechanisation of Thought Processes, pages 511--529, 1959.
[258]	Martin I. Sereno. Language and the Primate Brain. In Proceedings Thirteenth Annual Conference of the Cognitive Science Society, pages 79--84, Hillsdale, NJ, 1991. Lawrence Erlbaum Associates.
[259]	David Silver. Advanced topics in reinforcement learning. Course Notes, 2015.
[260]	David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David P. Reichert, Neil C. Rabinowitz, André Barreto, and Thomas Degris. The predictron: End-to-end learning and planning. CoRR, arXiv:1612.08810, 2017.
[261]	Herbert A. Simon. A behavioral model of rational choice. Quarterly Journal of Economics, 69:99--118, 1955.
[262]	P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46:159--216, 1990.
[263]	P. Smolensky and G. Legendre. The Harmonic Mind: From Neural Computation to Optimality-theoretic Grammar. Linguistic and philosophical implications. MIT Press, 2011.
[264]	Richard Socher. Recursive deep learning for natural language processing and computer vision. Ph.D. Thesis. Computer Science Department, Stanford University, 2014.
[265]	Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 2013. Association for Computational Linguistics.
[266]	Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2012.
[267]	Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and natural language with recursive neural networks. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, pages 129--136, 2011.
[268]	Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642. Association for Computational Linguistics, Stroudsburg, PA, USA, 2013.
[269]	Charles Spence and Jon Driver. Crossmodal Space and Crossmodal Attention. Oxford University Press, 2004.
[270]	Olaf Sporns and Rolf Kötter. Motifs in brain networks. PLoS Biol, 2(11):1910--1918, 2004.
[271]	Andrea Stocco, Christian Lebiere, and John R. Anderson. Conditional routing of information to the cortex: A model of the basal ganglia's role in cognitive coordination. Psychology Review, 117:541--574, 2010.
[272]	C. N. H. Street, W. F. Bischof, and A. Kingstone. Perspective taking and theory of mind in hide and seek. Attention Perception Psychophysics, 80(1):21--26, 2018.
[273]	A. Stuhlmüller and N. D. Goodman. Reasoning about reasoning by nested conditioning: Modeling theory of mind with probabilistic programs. Journal of Cognitive Systems Research, 28:80--99, 2014.
[274]	Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised memory networks. CoRR, arXiv:1503.08895, 2015.
[275]	Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104--3112. Curran Associates, Inc., 2014.
[276]	Rand S. Swenson. Review of Clinical and Functional Neuroscience. Dartmouth Medical School, 2006.
[277]	Aviv Tamar, Sergey Levine, and Pieter Abbeel. Value iteration networks. CoRR, arXiv:1602.02867, 2016.
[278]	Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2016.
[279]	Luming Tang, Boyang Deng, Haiyu Zhao, and Shuai Yi. Hierarchical deep recurrent architecture for video understanding. CoRR, arXiv:/1707.03296, 2017.
[280]	Regina Barzilay Tao Lei, Fan Long and Martin Rinard. From natural language specifications to program input parsers. In The 51st Annual Meeting of the Association for Computational Linguistics, 2013.
[281]	N. A. Tetreault, A. Y. Hakeem, S. Jiang, B. A. Williams, E. Allman, B. J. Wold, and J. M. Allman. Microglia in the cerebral cortex in autism. Journal of Autism Developmental Disorders, 42(12):2569--2584, 2012.
[282]	Michael R. Trimble and Andrea E. Cavanna. Chapter 3.7 the role of the precuneus in episodic memory. In E. Dere, A. Easton, L. Nadel, and J.P. Huston, editors, Handbook of Behavioral Neuroscience, pages 378--392. Elsevier Science, 2008.
[283]	Darinka Trübutschek, Sébastien Marti, Andrés Ojeda, Jean-Rémi King, Yuanyuan Mi, Misha Tsodyks, and Stanislas Dehaene. A theory of working memory without consciousness or sustained activity. Biorxiv, 2016.
[284]	Joe Z. Tsien. Chapter 4.1 neural coding of episodic memory. In E. Dere, A. Easton, L. Nadel, and J.P. Huston, editors, Handbook of Behavioral Neuroscience, pages 412--431. Elsevier Science, 2008.
[285]	John K. Tsotsos. A Computational Perspective on Visual Attention. MIT Press, 2011.
[286]	Endel Tulving. Elements of Episodic Memory. Oxford Psychology Series. Oxford University Press, 1985.
[287]	F. van der Velde and M. de Kamps. Neural blackboard architectures of combinatorial structures in cognition. Behavioral Brain Science, 29:37--70, 2006.
[288]	Sándor M. Veres. Natural Language Programming of Agents and Robotic Devices: Publishing for agents and humans in sEnglish. SysBrain Ltd, London, 2008.
[289]	Christoph von der Malsburg. The correlation theory of brain function. Technical report, Max Planck Institute for Biophysical Chemistry, 1981.
[290]	Christoph von der Malsburg. The correlation theory of brain function. In E. Domany, J.L. van Hemmen, and K. Schulten, editors, Models of Neural Networks: Physics of Neural Networks. Springer, 1994.
[291]	Catherine Wacongne, Jean-Pierre Changeux, and Stanislas Dehaene. A neuronal model of predictive coding accounting for the mismatch negativity. Journal of Neuroscience, 32(11):3665--3678, 2012.
[292]	Dirk Walther and Christof Koch. Modeling attention to salient proto-objects. Neural Networks, 19:1395--1407, 2006.
[293]	Dirk Walther, Ueli Rutishauser, Christof Koch, and Pietro Perona. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, 100:41--63, 2005.
[294]	Ke Wang, Rishabh Singh, and Zhendong Su. Dynamic neural program embedding for program repair. CoRR, arXiv:1711.07163, 2017.
[295]	Qian Wang, Jiaxing Zhang, Sen Song, and Zheng Zhang. Attentional neural network: Feature selection using cognitive feedback. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2033--2041. Curran Associates, Inc., 2014.
[296]	Thomas C. Watson, Nadine Becker, Richard Apps, and Matthew W. Jones. Back to front: cerebellar connections and interactions with the prefrontal cortex. Frontiers in Systems Neuroscience, 8:4, 2014.
[297]	Theophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. CoRR, arXiv:/1707.06203, 2017.
[298]	Cornelius Weiller, Tobias Bormann, Dorothee Kuemmerer, Mariachristina Musso, and Michel Rijntjes. The dual loop model in language. In Gregory Hickok and Steven L. Small, editors, Neurobiology of Language, pages 325--337. Academic Press, San Diego, 2016.
[299]	Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, arXiv:1410.3916, 2014.
[300]	Jonathan Wiener. Time, Love, Memory: A Great Biologist and His Quest for the Origins of Behavior. Alfred A. Knoph, New York, 1994.
[301]	Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229--256, 1992.
[302]	Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. CoRR, arXiv:1708.06525, 2017.
[303]	Yuichi Yamashita, Miki Takahasi, Tetsu Okumura, Maki Ikebuchi, Hiroko Yamada, Madoka Suzuki, Kazuo Okanoya, and Jun Tani. Developmental learning of complex syntactical song in the bengalese finch: A neural network model. Neural Networks, 21:1224--1231, 2008.
[304]	D. L. Yamins and J. J. DiCarlo. Eight open problems in the computational modeling of higher sensory cortex. Current Opinion in Neurobiology, 37:144--120, 2016.
[305]	D. L. Yamins, H. Hong, C. Cadieu, and J. J. DiCarlo. Hierarchical modular optimization of convolutional networks achieves representations similar to macaque it and human ventral stream. In Advances in Neural Information Processing Systems 26, pages 3093--3101, Tahoe, CA, 2013.
[306]	Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619--8624, 2014.
[307]	Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1480--1489, San Diego, California, 2016. Association for Computational Linguistics.
[308]	Junmei Zhu and Christoph von der Malsburg. Synapto–synaptic interactions speed up dynamic link matching. Neurocomputing, 44-46:721--728, 2002.
[309]	Mengchen Zhu and Christopher J. Rozell. Modeling inhibitory interneurons in efficient sensory coding models. PLoS Computational Biology, 11:e1004353, 2015.
[310]	Terry Zimmerman and Subbarao Kambhampati. Learning-assisted automated planning: Looking back, taking stock, going forward. AI Mag., 24(2):73--96, 2003.
[311]	Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
[312]	Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, arXiv:1707.07012, 2017.

³ Here is an excerpt from Will Durant's essay on Immanuel Kant in The Story of Philosophy [84] focusing on what Kant called the "Transcendental Dialectic." Durant's exposition is clearer in my estimation than Kant's attempt in Critique of Pure Reason, but still disturbingly vague and confused for my taste:

Nevertheless, this certainty, this absoluteness, of the highest generalizations of logic and science, is, paradoxically, limited and relative: limited strictly to the field of actual experience, and relative strictly to our human mode of experience. For if our analysis has been correct, the world as we know it is a construction, a finished product, almost — one might say — a manufactured article, to which the mind contributes as much by its moulding forms as the thing contributes by its stimuli. (So we perceive the top of the table as round, whereas our sensation is of an ellipse.) The object as it appears to us is a phenomenon, an appearance, perhaps very different from the external object before it came within the ken of our senses; what that original object was we can never know; the "thing-in-itself" may be an object of thought or inference (a "nouraenon"), but it cannot be experienced, for in being experienced it would be changed by its passage through sense and thought. "It remains completely unknown to us what objects may be by themselves and apart from the receptivity of our senses. We know nothing but our manner of perceiving them; that manner being peculiar to us, and not necessarily shared by every being, though, no doubt, by every human being."
The moon as known to us is merely a bundle of sensations (as Hume saw), unified (as Hume did not see) by our native mental structure through the elaboration of sensations into perceptions, and of these into conceptions or ideas; in result, the moon is for us merely our ideas. Not that Kant ever doubts the existence of "matter" and the external world; but he adds that we know nothing certain about them except that they exist. Our detailed knowledge is about their appearance, their phenomena, about the sensations which we have of them. Idealism does not mean, as the man in the street thinks, that nothing exists outside the perceiving subject; but that a goodly part of every object is created by the forms of perception and understanding: we know the object as transformed into idea; what it is before being so transformed we cannot know. Science, after all, is naive; it supposes tliat it is dealing with things in themselves, in their full-blooded external and uncorrupted reality; philosophy is a little more sophisticated, and realizes that the whole material of science consists of sensations, perceptions and conceptions, rather than of things. "Kant's greatest merit" says Schopenhauer, "is the distinction of the phenomenon from the thing-in-itself."

It follows that any attempt, by either science or religion, to say just what the ultimate reality is, must fall back into mere hypothesis; "the understanding can never go beyond the limits of sensibility." — Such transcendental science loses itself in antinomies," and such transcendental theology loses itself in "paralogisms." It is the cruel function of "transcendental dialectic" to examine the validity of these attempts of reason to escape from the enclosing circle of sensation and appearance into the unknowable world of things "in themselves."

Antinomies are the insoluble dilemmas born of a science that tries to overleap experience. So, for example, knowledge attempts to decide whether the world is finite or infinite in space, thought rebels against either supposition: beyond any limit, we are driven to conceive something further, endlessly ; and yet infinity is itself inconceivable. Again, did the world have a beginning in time? We cannot conceive eternity, but then, too, we cannot conceive any point in the past without feeling at once that before that, something was. Or has that chain of causes which science studies, a beginning, a First Cause? Yes, for an endless chain is inconceivable. No, for a first cause uncaused is inconceivable as well. Is there any exit from these blind alleys of thought? There is, says Kant, if we remember that space, time and cause are modes of perception and conception, which must enter into all our experience, since they are the web and structure of experience; these dilemmas arise from supposing that space, time and cause are external things independent of perception. We shall never have any experience which we shall not interpret in terms of space and time and cause; but we shall never have any philosophy if we forget that these are not things but modes of interpretation and understanding.

So with the paralogisms of "rational" theology—which attempts to prove by theoretical reason that the soul is an incorruptible substance, that the will is free and above the law of cause and effect, and that there exists a 'necessary being' God, as the presupposition of all reality. Transcendental dialectic must remind theology that substance and cause and necessity are finite categories, modes of arrangement and classification which the mind applies to sense-experience, and reliably valid only for the phenomena that appear to such experience; we cannot apply these conceptions to the noumenal (or merely inferred and conjectural) world. Religion cannot be proved by theoretical reason. From [84]

² I'm indebted to my wife Jo for her suggestion that I would be interested in Will Durant's [84] interpretation of Immanuel Kant's Critique of Pure Reason³ .

¹ Approximately 40 years after first reading Immanuel Kant's Critique of Pure Reason, I finally understand his contributions to the history of philosophy and perhaps his confusion with or misinterpretation of David Hume regarding the role of experience and the relationship between experience on the one hand and the ability to construct theories on the other. The latter being the part that confused me at the age of 19 and led me to believe that Kant was confused as well. His idea of (Platonic) theories of unassailable truth — what he calls Transcendental Logic — is at the center of his account of how (a) our direct experience, (b) the patterns we discover in data, and (c) the theories we invent to reinterpret these patterns in the language of mathematics enable us to make predictions about events in the real world².

Human beings and modern machine learning systems are good at discovering patterns in data. Pattern matching was an important step in understanding our environment, but the data was unpredictable, the patterns inconstant and, lacking an understanding of probability, the patterns themselves were unsatisfactory for making accurate predictions and difficult to compose in order to construct broadly encompassing theories that account for more complicated patterns and statistical regularities. Humans invented logic and mathematics to provide structure to those patterns. We invented the differential calculus to explain processes that take place over time. We invented probability and statistics to account for the variability we observe in complex patterns of the sort that govern games of chance. See here for an account of Kant's Theory of Perception.

I don't think Kant understood David Hume's theory of how we come to convert raw experience into theories or perhaps I read too much into Hume. Hume and his fellow empiricists, John Locke and Francis Bacon, were wary of putting too much stock in direct experience and wanted to avoid the human tendency toward superstition. They provided the foundations for the modern scientific method to ground perception in theory. However, one can imagine what the rationalist Kant might make of Hume's view that that passion rather than reason governs human behavior. Hume argued against the existence of innate ideas, positing that all human knowledge is ultimately grounded solely in experience. Hume held that genuine knowledge must either be directly traceable to objects perceived in experience, or result from abstract reasoning about relations between ideas which are derived from experience. Their differences seem largely due to ambiguous terminology. Chronology: Bacon (1561-1626), Locke (1632-1704), Newton (1642-1727), Leibniz (1646-1716), Hume (1711-1776) and Kant (1724-1804).

⁴ A continuous-time recurrent neural network can be modeled as a system of ordinary differential equations. By the Nyquist-Shannon sampling theorem, discrete-time recurrent neural networks of the sort commonly used in machine learning can be viewed as continuous-time recurrent neural networks where the differential equations have been transformed into equivalent difference equations.

⁶ Here is a small sample of journal and encyclopedia articles on episodic memory with an emphasis on its role in autonoetic consciousness in both humans and selected animal species:

@article{GardinerPTRS_B-01,
       author = {Gardiner, J. M.},
        title = {Episodic memory and autonoetic consciousness: a first-person approach.},
      journal = {Philosophical Transactions of the Royal Society London B Biological Science},
         year = {2001},
       volume = {356},
        issue = {1413},
        pages = {1351-1361},
     abstract = {Episodic memory is identified with autonoetic consciousness, which gives rise to remembering in the sense of self-recollection in the mental re-enactment of previous events at which one was present. Autonoetic consciousness is distinguished from noetic consciousness, which gives rise to awareness of the past that is limited to feelings of familiarity or knowing. Noetic consciousness is identified not with episodic but with semantic memory, which involves general knowledge. A recently developed approach to episodic memory makes use of 'first-person' reports of remembering and knowing. Studies using this approach have revealed many independent variables that selectively affect remembering and others that selectively affect knowing. These studies can also be interpreted in terms of distinctiveness and fluency of processing. Remembering and knowing do not correspond with degrees of confidence in memory. Nor does remembering always control the memory response. There is evidence that remembering is selectively impaired in various populations, including not only amnesic patients and older adults but also adults with Asperger's syndrome. This first-person approach to episodic memory represents one way in which that most elusive aspect of consciousness, its subjectivity, can be investigated scientifically. The two kinds of conscious experiences can be manipulated experimentally in ways that are systematic, replicable and intelligible theoretically.},
}
@incollection{ClaytonandDickinsonEAB-10,
       author = {Mental Time Travel: Can Animals Recall the Past and Plan for the Future?},
        title = {N. S. Clayton and A. Dickinson},
       editor = {Breed, Michael D.  and Moore, Janice},
    booktitle = {Encyclopedia of Animal Behavior},
    publisher = {Academic Press},
         year = {2010},
        pages = {438-442},
     abstract = {According to the mental time travel hypothesis, only humans can mentally dissociate themselves from the present, traveling backward in time to recollect specific past events about what happened where and when (episodic memory) and traveling forward in time to anticipate future needs (future planning). A series of studies of the mnemonic capabilities of food-caching western scrub-jays question this assumption. In terms of the retrospective component of episodic memory, these birds remember the ‘what, where, and when’ of specific past caching episodes; they keep track of how long ago they cached different types of perishable foods that decay at different rates, and also remember whether another individual was present at the time of caching, and if so, which bird was watching when. Recent work demonstrates that the jays also make provision for a future need, caching more food in places in which they will not be given breakfast the next morning than in places where they will be receive breakfast the next morning even though there is plenty of food available to them at the time when they cache the food. Taken together these results challenge the mental time travel hypothesis by showing that some elements of both retrospective and prospective mental time travel appear not to be uniquely human.}
}
@incollection{MarinoEAB-10,
       author = {Lori Marino},
        title = {Sentience},
       editor = {Breed, Michael D.  and Moore, Janice},
    booktitle = {Encyclopedia of Animal Behavior},
    publisher = {Academic Press},
         year = {2010},
        pages = {132-138},
     abstract = {Sentience refers to the depth of awareness an individual possesses about himself or herself and others. There appear to be three related, but separable, general domains of sentience. These are self-awareness, metacognition, and theory of mind. To date, evidence shows that these three capacities are found in nonhuman animals, including primates, dolphins, dogs, rodents, and corvids. These findings are evidence of the deep psychological continuity that exists across the animal kingdom.}
}
@article{KleinQJEO-16,
       author = {Klein, S.B.},
        title = {Autonoetic consciousness: Reconsidering the role of episodic memory in future-oriented self-projection},
      journal = {Quarterly Journal of Experimental Psychology},
         year = {2016},
       volume = {69},
       number = {2},
        pages = {381-401},
     abstract = {Following the seminal work of Ingvar (1985. "Memory for the future": An essay on the temporal organization of conscious awareness. Human Neurobiology, 4, 127-136), Suddendorf (1994. The discovery of the fourth dimension: Mental time travel and human evolution. Master's thesis. University of Waikato, Hamilton, New Zealand), and Tulving (1985. Memory and consciousness. Canadian Psychology, 26, 1-12), exploration of the ability to anticipate and prepare for future contingencies that cannot be known with certainty has grown into a thriving research enterprise. A fundamental tenet of this line of inquiry is that future-oriented mental time travel, in most of its presentations, is underwritten by a property or an extension of episodic recollection. However, a careful conceptual analysis of exactly how episodic memory functions in this capacity has yet to be undertaken. In this paper I conduct such an analysis. Based on conceptual, phenomenological, and empirical considerations, I conclude that the autonoetic component of episodic memory, not episodic memory per se, is the causally determinative factor enabling an individual to project him or herself into a personal future.}
}
@article{SprengFiP-13,
       author = {Spreng, R. Nathan},
        title = {Examining the role of memory in social cognition},
      journal = {Frontiers in Psychology},
       volume = {4},
        pages = {437},
         year = {2013},
     abstract = {The function of memory is not only to recall the past, but also to form and update models of our experiences and use these models to navigate the world. Perhaps, the most complex environment for humans to navigate is the social one. Social dynamics are extraordinarily complex, unstructured, labile and difficult to predict. Successful navigation through our many social landscapes is essential to forming and maintaining the durable social bonds necessary for physical and mental health. Until recently, little research has examined the role that memory plays in social behavior and interpersonal sensitivity. There is growing evidence that recalling personally experienced events (autobiographical memory) and inferring the mental states of others (mentalizing or theory-of-mind) share an extensive functional neuroanatomy (Buckner and Carroll, 2007; Spreng et al., 2009; Spreng and Grady, 2010; Rabin et al., 2010) and may be critical for adaptive social cognition.},
}
@article{CiaramelliFiP-13,
       author = {Ciaramelli, Elisa and Bernardi, Francesco and Moscovitch, Morris},
        title = {Individualized Theory of Mind (iToM): When Memory Modulates Empathy},
      journal = {Frontiers in Psychology},
       volume = {4},
        pages = {4},
         year = {2013},
     abstract = {Functional neuroimaging studies have noted that brain regions supporting theory of mind (ToM) overlap remarkably with those underlying episodic memory, suggesting a link between the two processes. The present study shows that memory for others’ past experiences modulates significantly our appraisal of, and reaction to, what is happening to them currently. Participants read the life story of two characters; one had experienced a long series of love-related failures, the other a long series of work-related failures. In a later faux pas recognition task, participants reported more empathy for the character unlucky in love in love-related faux pas scenarios, and for the character unlucky at work in work-related faux pas scenarios. The memory-based modulation of empathy correlated with the number of details remembered from the characters’ life story. These results suggest that individuals use memory for other people’s past experiences to simulate how they feel in similar situations they are currently facing. The integration of ToM and memory processes allows adjusting mental state inferences to fit unique social targets, constructing an individualized ToM (iToM).}
}
@article{BehrendtFiP-13,
       author = {Behrendt, Ralf-Peter},
        title = {Conscious Experience and Episodic Memory: Hippocampus at the Crossroads},
      journal = {Frontiers in Psychology},
       volume = {4},
        pages = {304},
         year = {2013},
     abstract = {If an instance of conscious experience of the seemingly objective world around us could be regarded as a newly formed event memory, much as an instance of mental imagery has the content of a retrieved event memory, and if, therefore, the stream of conscious experience could be seen as evidence for ongoing formation of event memories that are linked into episodic memory sequences, then unitary conscious experience could be defined as a symbolic representation of the pattern of hippocampal neuronal firing that encodes an event memory – a theoretical stance that may shed light into the mind-body and binding problems in consciousness research. Exceedingly detailed symbols that describe patterns of activity rapidly self-organizing, at each cycle of the θ rhythm, in the hippocampus are instances of unitary conscious experience that jointly constitute the stream of consciousness. Integrating object information (derived from the ventral visual stream and orbitofrontal cortex) with contextual emotional information (from the anterior insula) and spatial environmental information (from the dorsal visual stream), the hippocampus rapidly forms event codes that have the informational content of objects embedded in an emotional and spatiotemporally extending context. Event codes, formed in the CA3-dentate network for the purpose of their memorization, are not only contextualized but also allocentric representations, similarly to conscious experiences of events and objects situated in a seemingly objective and observer-independent framework of phenomenal space and time. Conscious perception is likely to be related to more fleeting and seemingly internal forms of conscious experience, such as autobiographical memory recall, mental imagery, including goal anticipation, and to other forms of externalized conscious experience, namely dreaming and hallucinations; and evidence pointing to an important contribution of the hippocampus to these conscious phenomena will be reviewed.}
}

⁵ A reasonable working definition of autonoetic consciousness is supplied below and a small sample of relevant publications is available in this footnote⁶:

"Autonoetic consciousness is the capacity to recursively introspect on one's own subjective experience through time, that is, to perceive the continuity in one's identity from the past to the present and into the future." From [196]
"Autonoetic consciousness is distinguished from noetic consciousness, which gives rise to awareness of the past that is limited to feelings of familiarity or knowing. Noetic consciousness is identified not with episodic but with semantic memory, which involves general knowledge." From [96]

⁷ Executive functions (collectively referred to as executive function and cognitive control) are a set of cognitive processes that are necessary for the cognitive control of behavior: selecting and successfully monitoring behaviors that facilitate the attainment of chosen goals. Executive functions include basic cognitive processes such as attentional control, cognitive inhibition, inhibitory control, working memory, and cognitive flexibility. Higher order executive functions require the simultaneous use of multiple basic executive functions and include planning and fluid intelligence, i.e., reasoning and problem solving. SOURCE

⁸ Here are a couple of recent papers addressing the problem of learning compositional models of visual data:

@article{DonahueetalCoRR-14,
       author = {Jeff Donahue and Lisa Anne Hendricks and Sergio Guadarrama and Marcus Rohrbach and Subhashini Venugopalan and Kate Saenko and Trevor Darrell},
        title = {Long-term Recurrent Convolutional Networks for Visual Recognition and Description},
      journal = {CoRR},
       volume = {arXiv:1411.4389},
         year = 2014,
     abstract = {Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averag- ing for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural lan- guage text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual rep- resentations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.}
}
@article{ChangetalCoRR-16,
       author = {Michael B. Chang and Tomer Ullman and Antonio Torralba and Joshua B. Tenenbaum},
        title = {A Compositional Object-Based Approach to Learning Physical Dynamics},
      journal = {CoRR},
       volume = {arXiv:1612.00341},
         year = {2016},
     abstract = {We present the Neural Physics Engine (NPE), a framework for learning simulators of intuitive physics that naturally generalize across variable object count and different scene configurations. We propose a factorization of a physical scene into composable object-based representations and a neural network architecture whose compositional structure factorizes object dynamics into pairwise interactions. Like a symbolic physics engine, the NPE is endowed with generic notions of objects and their interactions; realized as a neural network, it can be trained via stochastic gradient descent to adapt to specific object properties and dynamics of different worlds. We evaluate the efficacy of our approach on simple rigid body dynamics in two-dimensional worlds. By comparing to less structured architectures, we show that the NPE's compositional representation of the structure in physical interactions improves its ability to predict movement, generalize across variable object count and different scene configurations, and infer latent properties of objects such as mass.},
}
@article{HigginsetalICLR-18,
        title = {{SCAN}: Learning Hierarchical Compositional Visual Concepts},
       author = {Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko Bošnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, Alexander Lerchner},
      journal = {International Conference on Learning Representations},
         year = {2018},
     abstract = {The seemingly infinite diversity of the natural world arises from a relatively small set of coherent rules, such as the laws of physics or chemistry. We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts. If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts. This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such abstractions in the visual domain. SCAN learns concepts through fast symbol association, grounding them in disentangled visual primitives that are discovered in an unsupervised manner. Unlike state of the art multimodal generative model baselines, our approach requires very few pairings between symbols and images and makes no assumptions about the form of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to break away from its training data distribution and imagine novel visual concepts through symbolically instructed recombination of previously learnt concepts.}
}
@article{JohnsonetalCoRR-16,
       author = {Justin Johnson and Bharath Hariharan and Laurens van der Maaten and Li Fei{-}Fei and C. Lawrence Zitnick and Ross B. Girshick},
        title = {{CLEVR:} {A} Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
      journal = {CoRR},
       volume = {arXiv:1612.06890},
         year = {2016},
     abstract = {When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.}
}
@misc{JohnsonGOOG-18,
       author = {Justin Johnson},
        title = {Compositional Visual Intelligence through Language},
     abstract = {The use of deep neural networks has led to fantastic recent progress on fundamental problems in computer vision. However the most well-studied tasks, such as image classification and object detection, encompass only a small fraction of the rich visual intelligence that people display. In order to make progress toward the goal of visual intelligence, we must define new tasks and construct new datasets that push the boundaries of artificial visual intelligence. An important facet of visual intelligence is composition - our understanding of the whole derives from an understanding of the parts. One avenue for studying the compositional nature of visual intelligence is through connections with natural language, which is also inherently compositional. To this end I will present three research directions toward compositional visual intelligence through language. In each we will see how incorporating compositionality forces us to rethink our tasks and datasets, but results in systems with richer visual intelligence. I will first discuss image captioning, where moving from sentence captions to dense captions and descriptive paragraphs results in richer image descriptions. I will next discuss visual question answering, where an emphasis on compositional reasoning gives rise to new datasets and models. I will then discuss text-to-image synthesis, where replacing freeform natural language with explicitly compositional scene graphs of objects and relationships allows us to generate more complex images. I will conclude by discussing future directions for improving visual intelligence through composition.}
}

⁹ The neocortex is the newest part of the cerebral cortex to evolve — the prefix "neo" meaning new. The other part of the cerebral cortex is called the allocortex. The cellular organization of the allocortex is different from the six-layered neocortex. In humans, 90% of the cerebral cortex and 76% of the entire brain is neocortex.

¹⁰ David Mayfield owns Blue Ridge Life Science an investment firm specializing in early and mid-stage neuroscience. His grandfather Dr. Frank Mayfield was a famous neurosurgeon, scientific leader and entrepreneur in medical technology whose invention of the spring aneurysm clip has saved thousands of lives. David has taken another route to improving the lives of patients with debilitating neural disorders, driven in part by tragedies within his extended family, but he is no less passionate about the science than his grandfather. Both David and I are dismayed with the way in which current financial, medical and scientific incentives are misaligned. In his life and work, he has attempted to redress the negative consequences of this misalignment, often by drawing attention to relevant science and championing new opportunities for intervention. His crusade to promulgate research on microglia and related drug development is an example of the former. Full disclosure: Dr. Mayfield was my uncle, father figure and mentor at a crucial stage in my young life following the sudden death of my father, and so David, colloquially speaking, is my nephew, or, genealogically more precise, my cousin once removed.

¹¹ There is a case to be made that funding through conventional life science VCs, or even midsize biotechs with bigger bank accounts, won't provide new experimental drugs (or their microglial targets) with the chance to succeed. The problem relates paradoxically to the experimental data which seem to show that some of these drugs are curative in such a wide range of CNS diseases and maladies, e.g., multiple sclerosis, anxiety, drug withdrawal syndromes, stroke, chronic neuropathic pain, retinal degeneration. An embarrassment of riches of sorts which is disabling for VCs and also for midsize biotechs who want their drug candidates focused on very narrow mechanisms of action, and very narrowly defined indications. But what if the embarrassment of riches were explained by the drug's impact on a pathological mechanism broadly shared by much of neurodevelopmental and neurodegenerative disease?

It turns out that doesn't matter. Even a potent and very successful biotech such as Biogen would rather have one drug which mitigates the severity of one orphan disease, than one drug which may prevent, mitigate the severity of, and possibly cure ten diseases / disorders afflicting hundreds of millions. Something to do, perhaps, with incentives, liquidity preferences, and appetite for risk built into the way VCs are funded and the way biotechs are publicly financed? One theory is that the phenomenon also relates to a confusion of the scientific methods of drug discovery with the biology of disease and its causes. Anyway, that's just a long winded way of saying that it is going to take a creative, non-conventional organization to translate the new science of microglia into therapies helping patients.

¹³ Chemokines are a family of small cytokines, or signaling proteins secreted by cells. Their name is derived from their ability to induce directed chemotaxis in nearby responsive cells; they are chemotactic cytokines. [...] Some chemokines are considered pro-inflammatory and can be induced during an immune response to recruit cells of the immune system to a site of infection, while others are considered homeostatic and are involved in controlling the migration of cells during normal processes of tissue maintenance or development. [SOURCE]

¹⁴ Cytokines are a broad and loose category of small proteins that are important in cell signaling. [...] Cytokines may include chemokines, interferons, interleukins, lymphokines, and tumour necrosis factors but generally not hormones or growth factors (despite some overlap in the terminology). Cytokines are produced by a broad range of cells, including immune cells like macrophages, B lymphocytes, T lymphocytes and mast cells, as well as endothelial cells, fibroblasts, and various stromal cells. [...] They act through receptors, and are especially important in the immune system; cytokines modulate the balance between humoral and cell-based immune responses, and they regulate the maturation, growth, and responsiveness of particular cell populations. Some cytokines enhance or inhibit the action of other cytokines in complex ways. [SOURCE]

¹⁵ Interleukins are a group of cytokines (secreted proteins and signal molecules) that were first seen to be expressed by white blood cells (leukocytes). The function of the immune system depends in a large part on interleukins, and rare deficiencies of a number of them have been described, all featuring autoimmune diseases or immune deficiency. The majority of interleukins are synthesized by helper CD4 T lymphocytes, as well as through monocytes, macrophages, and endothelial cells. They promote the development and differentiation of T and B lymphocytes, and hematopoietic cells. [...] Interleukin receptors on astrocytes in the hippocampus are also known to be involved in the development of spatial memories in mice. [SOURCE]

¹⁶ Microglia are a type of neuroglia (glial cell) located throughout the brain and spinal cord, accounting for 10–15% of all cells within the brain. As the resident macrophage cells, they act as the first and main form of active immune defense in the central nervous system (CNS). Microglia (and other neuroglia including astrocytes) are distributed in large non-overlapping regions throughout the CNS. Microglia are constantly scavenging for plaques, damaged or unnecessary neurons and synapses, and infectious agents. Microglia are extremely sensitive to even small pathological changes in the CNS. This sensitivity is achieved in part by the presence of unique potassium channels that respond to even small changes in extracellular potassium. [...] Microglia originate in the yolk sac during a remarkably restricted embryonal period and continuously renew themselves and persist throughout life. [SOURCE]

¹⁷ A trophic or growth factor is a naturally occurring substance capable of stimulating cellular growth, proliferation, healing, and cellular differentiation. Usually it is a protein or a steroid hormone. Growth factors are important for regulating a variety of cellular processes. [...] Growth factors typically act as signaling molecules between cells. Examples are cytokines and hormones that bind to specific receptors on the surface of their target cells. [...] While growth factor implies a positive effect on cell division, cytokine is a neutral term with respect to whether a molecule affects proliferation. While some cytokines can be growth factors, others have an inhibitory effect on cell growth or proliferation. Some cytokines cause target cells to undergo programmed cell death or apoptosis. [SOURCE]

¹⁸ Excerpts of a recent email message from Akram Sadek on the evolution of consciousness, and quantum computing in the brain as it relates to selection pressure to reduce the expenditure of energy:

AS: I completely agree with you and Professor [Sean] Carrol, on people's attempts to somehow conflate consciousness with quantum theory. Even considering just quantum theory itself, the R, or reduction process doesn't need a conscious observer to occur, as is popularly described sometimes. A physical measurement is all it takes, and it's just a matter of entangling a single quantum state with a great number of different states which leads to what we call 'classical' physics and a classical result (see enclosed). And from the perspective of the brain, I can't see at all how quantum mechanics could possibly explain consciousness or qualia. Quantum mechanics is really just a very simple theory at its heart. These sorts of ideas very much sully things, I agree.
On the other hand, an important feature of all biological systems is their remarkable efficiency. Food is scarce, and natural selection is an extremely powerful mechanism that ensures only the most efficient organisms will thrive. Inefficient solutions to biological problems are rapidly weeded out, in as little as a single generation. This is why organisms are able to do so much, with what amounts to very little in terms of energy and physical resources. The human brain only runs at 10W after all. My advisor at Cambridge, Simon Laughlin, discovered early in his career that neurons optimize their coding efficiency to maximize the transmission of information, whilst minimizing energy expenditure (I may have sent you the enclosed paper way back). We already know from the work on photosynthesis that biological systems will harness whatever physics they need to get the job done as efficiently as possible. If quantum mechanics operates in the brain, this is the context in which it would occur.

Theoretically, quantum information processing can occur without any energy expenditure at all, as it is a purely reversible type of computation. The actual energy expenditure that is needed arises due to the necessary error-correction with fault tolerance. If some quantum information processing scheme could have given brains an advantage in terms of using far less energy, then it wouldn't surprise me at all if it's operative. That is what interests me. Of course, if this is the case, then the brain can no longer be thought of as just a thermodynamical engine running an algorithm (which is just a mathematical object). It must be thought of as a physical object that cannot be understood outside of the physical universe it exists in. Since quantum states cannot be cloned, a brain wouldn't be able to be 'copied', like a neural network or piece of software.

¹⁹ Here is a collection of recent papers describing biologically-plausible back-propagation-like algorithms:

@article{BalduzzietalCoRR-14,
       author = {David Balduzzi and Hastagiri Vanchinathan and Joachim M. Buhmann},
        title = {Kickback cuts Backprop's red-tape: Biologically plausible credit assignment in neural networks},
      journal = {CoRR},
       volume = {arXiv:1411.6191},
         year = {2014},
     abstract = {Error backpropagation is an extremely effective algorithm for assigning credit in artificial neural networks. However, weight updates under Backprop depend on lengthy recursive computations and require separate output and error messages -- features not shared by biological neurons, that are perhaps unnecessary. In this paper, we revisit Backprop and the credit assignment problem. We first decompose Backprop into a collection of interacting learning algorithms; provide regret bounds on the performance of these sub-algorithms; and factorize Backprop's error signals. Using these results, we derive a new credit assignment algorithm for nonparametric regression, Kickback, that is significantly simpler than Backprop. Finally, we provide a sufficient condition for Kickback to follow error gradients, and show that Kickback matches Backprop's performance on real-world regression benchmarks.}
}
@article{BengioetalCoRR-16,
       author = {Yoshua Bengio and Dong{-}Hyun Lee and J{\"{o}}rg Bornschein and Zhouhan Lin},
        title = {Towards Biologically Plausible Deep Learning},
      journal = {CoRR},
       volume = {arXiv:1502.04156},
         year = {2016},
     abstract = {Neuroscientists have long criticised deep learning algorithms as incompatible with current knowledge of neurobiology. We explore more biologically plausible versions of deep representation learning, focusing here mostly on unsupervised learning but developing a learning mechanism that could account for supervised, unsupervised and reinforcement learning. The starting point is that the basic learning rule believed to govern synaptic weight updates (Spike-Timing-Dependent Plasticity) arises out of a simple update rule that makes a lot of sense from a machine learning point of view and can be interpreted as gradient descent on some objective function so long as the neuronal dynamics push firing rates towards better values of the objective function (be it supervised, unsupervised, or reward-driven). The second main idea is that this corresponds to a form of the variational EM algorithm, i.e., with approximate rather than exact posteriors, implemented by neural dynamics. Another contribution of this paper is that the gradients required for updating the hidden states in the above variational interpretation can be estimated using an approximation that only requires propagating activations forward and backward, with pairs of layers learning to form a denoising auto-encoder. Finally, we extend the theory about the probabilistic interpretation of auto-encoders to justify improved sampling schemes based on the generative interpretation of denoising auto-encoders, and we validate all these ideas on generative learning tasks.},
}   
@article{LillicrapetalCoRR-14,
       author = {Timothy P. Lillicrap and Daniel Cownden and Douglas B. Tweed and Colin J. Akerman},
        title = {Random feedback weights support learning in deep neural networks},
      journal = {CoRR},
       volume = {arXiv:1411.0247},
         year = {2014},
     abstract = {The brain processes information through many layers of neurons. This deep architecture is representationally powerful, but it complicates learning by making it hard to identify the responsible neurons when a mistake is made. In machine learning, the backpropagation algorithm assigns blame to a neuron by computing exactly how it contributed to an error. To do this, it multiplies error signals by matrices consisting of all the synaptic weights on the neuron's axon and farther downstream. This operation requires a precisely choreographed transport of synaptic weight information, which is thought to be impossible in the brain. Here we present a surprisingly simple algorithm for deep learning, which assigns blame by multiplying error signals by random synaptic weights. We show that a network can learn to extract useful information from signals sent through these random feedback connections. In essence, the network learns to learn. We demonstrate that this new mechanism performs as quickly and accurately as backpropagation on a variety of problems and describe the principles which underlie its function. Our demonstration provides a plausible basis for how a neuron can be adapted using error signals generated at distal locations in the brain, and thus dispels long-held assumptions about the algorithmic constraints on learning in neural circuits.}
}
@article{LillicrapetalNATURE-COMMUNICATIONS-16,
       author = {Lillicrap, Timothy P. and Cownden, Daniel and Tweed, Douglas B. and Akerman, Colin J.},
        title = {Random synaptic feedback weights support error backpropagation for deep learning},
    publisher = {Nature Publishing Group},
      journal = {Nature Communications},
       volume = {7},
         year = {2016},
        pages = {13276},
     abstract = {The brain processes information through multiple layers of neurons. This deep architecture is representationally powerful, but complicates learning because it is difficult to identify the responsible neurons when a mistake is made. In machine learning, the backpropagation algorithm assigns blame by multiplying error signals with all the synaptic weights on each neuron’s axon and further downstream. However, this involves a precise, symmetric backward connectivity pattern, which is thought to be impossible in the brain. Here we demonstrate that this strong architectural constraint is not required for effective error propagation. We present a surprisingly simple mechanism that assigns blame by multiplying errors by even random synaptic weights. This mechanism can transmit teaching signals across multiple layers of neurons and performs as effectively as backpropagation on a variety of tasks. Our results help reopen questions about how the brain could use error signals and dispel long-held assumptions about algorithmic constraints on learning.}
}
@article{ScellierandBengioCoRR-16a,
       author = {Benjamin Scellier and Yoshua Bengio},
        title = {Towards a Biologically Plausible Backprop},
      journal = {CoRR},
       volume = {arXiv:1602.05179v2},
         year = {2016},
     abstract = {Neuroscientists have long criticised deep learning algorithms as incompatible with current knowledge of neurobiology. We explore more biologically plausible versions of deep representation learning, focusing here mostly on unsupervised learning but developing a learning mechanism that could account for supervised, unsupervised and reinforcement learning. The starting point is that the basic learning rule believed to govern synaptic weight updates (Spike-TimingDependent Plasticity) arises out of a simple update rule that makes a lot of sense from a machine learning point of view and can be interpreted as gradient descent on some objective function so long as the neuronal dynamics push firing rates towards better values of the objective function (be it supervised, unsupervised, or rewarddriven). The second main idea is that this corresponds to a form of the variational EM algorithm, i.e., with approximate rather than exact posteriors, implemented by neural dynamics. Another contribution of this paper is that the gradients required for updating the hidden states in the above variational interpretation can be estimated using an approximation that only requires propagating activations forward and backward, with pairs of layers learning to form a denoising auto-encoder. Finally, we extend the theory about the probabilistic interpretation of auto-encoders to justify improved sampling schemes based on the generative interpretation of denoising auto-encoders, and we validate all these ideas on generative learning tasks.},
}
@article{ScellierandBengioCoRR-16b,
       author = {Benjamin Scellier and Yoshua Bengio},
        title = {Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation},
      journal = {CoRR},
       volume = {arXiv:1602.05179v4},
         year = {2016},
     abstract = {We introduce Equilibrium Propagation (e-prop), a learning algorithm for energy-based models. This algorithm involves only one kind of neural computation both for the first phase (when the prediction is made) and the second phase (after the target is revealed) of training. Contrary to backpropagation in feedforward networks, there is no need for special computation in the second phase of our learning algorithm. Equilibrium Propagation combines features of Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: the algorithm computes the exact gradient of a well defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of e-prop corresponds to only nudging the first-phase fixed point towards a configuration that has lower cost value. In the case of a multi-layer supervised neural network, the output units are slightly nudged towards their target, and the perturbation introduced at the output layer propagates backward in the network. The theory developed in this paper shows that the signal 'back-propagated' during this second phase actually contains information about the error derivatives, which we use to implement a learning rule proved to perform gradient descent with respect to the objective function. Thus, this work makes it more plausible that a mechanism similar to backpropagation could be implemented by brains.}
}
@article{XuetalTNNLS-17,
       author = {David Xu, Andrew Clappison, Cameron Seth, and Jeff Orchard
        title = {Symmetric Predictive Estimator for Biologically Plausible Neural Learning},
      journal = {IEEE Transactions on Neural Networks and Learning Systems},
       volume = {28},
        issue = {10},
         year = {2017},
     abstract = {In a real brain, the act of perception is a bidirectional process, depending on both feedforward sensory pathways and feedback pathways that carry expectations. We are interested in how such a neural network might emerge from a biologically plausible learning rule. Other neural network learning methods either only apply to feedforward networks, or employ assumptions (such as weight copying) that render them unlikely in a real brain. Predictive estimators (PEs) offer a better solution to this bidirectional learning scenario. However, PEs also depend on weight copying. In this paper, we propose the symmetric PE (SPE), an architecture that can learn both feedforward and feedback connection weights individually using only locally available information. We demonstrate that the SPE can learn complicated mappings without the use of weight copying. The SPE networks also show promise in deeper architectures.},
}

¹² Here are my rough notes including selected exchanges with colleagues concerning the different roles of mycroglia in the developing brain. These notes follow up my on my earlier post here. For your convenience, here are a few common technical terms relating to immunology and cell signaling that cover some of the primary tools employed by microglia in their different computational and immunological roles in the human brain: chemokines¹³, cytokines¹⁴, intereukins¹⁵, microglia¹⁶, trophic factors¹⁷.

DM: Thanks for connecting with your colleague Christof on the project. I've listened to a few of his talks over the last few years and find him, and the work of the Allen Institute totally fascinating. I'm traveling today to NYC, and will try to jot down my thoughts on the plane as to how his thinking can be accommodated within the new science regarding microglia. Here are a few thoughts off the top of my head: The questions posed by someone like Christof, interested, as he is, in understanding the biological correlates of consciousness, don't overlap completely with those posed by folks, like, say, David Hubel, Torsten Wiesel, Carla Shatz, and more recently Tobias Bonhoeffer, who are interested in the biological correlates of learning (and memory). I think consciousness is inherently neuronal and digital-computational and depends on functions which occur at the speeds Christof is interested in; the speed of ionotropic signaling between neurons — presynaptic spikes leading to ePSPs and iPSPs as sodium ions and chloride rush down their concentration gradients. Millisecond time scale.
TLD: It's not clear to me that Christof's comments were specific to consciousness, despite his deep interest in the topic. What occurred to me on reading his note had to do with the nature of the computations performed by microglia in service to shaping the interactions between neurons. What do these neural interlocutors need to know about the neurons whose activity they influence? It would appear that the synaptic alterations they produce require seconds or minutes to carry out. Is that because they are simultaneously sampling the activity of tens, hundreds or thousands of dendrites in order to initiate synaptic alterations periodically carried out on the scale of minutes? Do the computations they perform require sampling at millisecond resolution? Setting aside the possibility of some sort of quantum information processing¹⁸, neural refractory periods suggest 1000Hz as an upper bound on how fast human neurons can possibly fire, but 200Hz is a more reasonable upper bound for most human cortical neurons with auditory cortex neurons a possible exception given they appear able to resolve frequencies up to 1000Hz. Still, that's pretty fast.
DM: But learning depends upon functions that occur at slower time scales, beginning with Hebbian changes to synapse efficiency which depend on metabotropic as opposed to ionotropic speeds. Metabolic signaling rather than electrochemical signaling — and I suspect even here, when we are talking about the earliest and most fundamental elements of learning — NMDA receptor/calcium flux mediated changes to synaptic efficiency by AMPA receptor synthesis and insertion — we are already above a timescale of the second. And we haven't even left the confines of the neuron. Structural plasticity, the experience and activity mediated changes in dendritic spine and axonal collateral densities — intuited by Ramòn y Cajal's, and confirmed definitively with post-2000 two photon microscopy of the living, behaving dendritic spine (Trachtenberg et al, 2002, Tobias Bonhoeffer 2001) takes place at time scales greater than seconds, minutes, hours, days. These structural morphological changes to living/behaving neurons were correlated to LTP/LTD and learning before the 2005 re-discovery of microglia (which depended on the same two photon microscopy). I'll point you to the papers that correlate the activity of microglia to these neuronal changes which are the biological substrate of learning. But for now I just want to make the point, that the timescale of the observed structural changes to neurons map well with the timescale of the observed behavior of microglia as then project and retract their arm-like processes to and from the synaptic connections between neurons.

TLD: Does a microglial cell passively sample and analyze extracellular fluid in the vicinity of each spine in its purview for 30 seconds or so and then decide on an intervention modifying all its spines simultaneously? Or possibly the analysis is more interactive as if the glia and neurons are solving a recurrent combinatorial reconstruction problem jointly and initiating reconstruction upon arriving at a solution. From what I understand, the synapses a given microglial cell is responsible for don't overlap with those of other microglial cells. This combinatorial problem could be fully stationary requiring no knowledge of the history of its neuronal / dendritic charges not already encoded in the state of the neurons / dendrites or made apparent during pre-intervention analytical stage. What is the nature of the intervention? Does it entail a specific cocktail of cell-signalling proteins and trophic factors tailored to each dendrite? And why this repeated cycle of projection and retraction as if its target neurons / dendrites are not fixed but rather determined dynamically, possibly sampled randomly to ensure coverage?

I'm particularly interested in the possibility of exploiting the way in which microglia tile the brain, effectively covering all of the synapses. Apparently, it is not simply a cover in the mathematical sense of a set of sets whose union is the whole set, but specifically it is a set of sets that are disjoint and whose union is the whole set. The sets that comprise the microglial tiling are also local in the sense that the elements of the set are bounded in space by the microglial span and the processes of mature microglia in the CNS tend to be compact. In keeping with the terminology introduced in [60], I now think of these sets as domains and imagine they would serve as modular functional domains as defined in [60]. It would not surprise me if these domains overlap in space despite being disjoint, a conjecture I would like to pursue perhaps with your help or the assistance of one of the microglia experts you've introduced me to.

Computational neuroscientists are interested in how learning is implemented in the brain and machine learning researchers have been trying for years to come up with biologically plausible-version of gradient descent. Several of the recent attempts have used strategies involving random weights. One surprisingly simple algorithm for deep learning assigns credit or blame by multiplying error signals by random synaptic weights. The local computations involve adaptation of a sort microglia could be capable of performing. Lillicrap et al [185] show that a network can learn to extract useful information from signals sent through these random feedback connections. In essence, the network learns to learn. For your convenience, I've collected the titles and abstracts for several recent papers describing biologically-plausible back-propagation-like algorithms and made them available here¹⁹.

DM: Neurosteroids also play an important role in regulating neural activity. Like much of what goes on in the field, the scientists studyng the brain's amazing capacity to manufacture its own pregnane steroids aren't talking to the folks who study microglia, and neither group are really talking to the folks who study learning from a computational neuroscience perspective. Neurosteroids could provide the glue, in a way, which brings these often disjointed disciplines together.

Regarding neurosteroids and microglia, the brain can produce neurosteroids (NS) in a very localized manner in order to modulate brain activity. Perhaps at the scale of the individual neuron. And there is some evidence that microglia, which also act at the scale of the individual dendritic spine of individual neurons, are crucial to regulation of the local production of NS. I'm pretty sure that the upregulation of NS is one of the essential tools through which the brain homeostatically keeps microglia from adopting, in error, a macrophage-like, immune reactive phenotype in the face of immune challenges. NS is upregulated by the brain in the face of immune challenge with an anti-inflammatory effect — helping microglia to remain on task interacting with the synaptic landscape. The anxiolytic-driven potentiation of NS biosynthesis is almost certainly how it modulates the behavior of microglia.

Regarding neurosteroids and computational neuroscience: But the brain also clearly upregulates NS so as to modulate the balance of neuronal excitation and inhibition. Allopregnanolone, for instance, is an incredibly potent (nano-molar potency) and efficient positive allosteric modulator of the GABA A receptor. To date, this fact has only been of interest to neuroscientists and biotechs (like SAGE Therapeutics) because they can dose an animal systemically with Allo to achieve profound anti-epileptic, and anxiolytic effect. But the brain itself doesn't make NS for the purpose of depressing excitation globally. In the context of excitation / inhibition, the brain produces NS to modulate specific circuits, potentiating the synaptic weight of specific interneurons relative to the weight of excitatory neurons synapting to the same post synaptic cell. And this should be of interest to the likes of computational neuroscientists like Markus Meister whose work has proven the importance of interneurons to the predictive coding wired into the brain's circuitry genetically through evolution.

It would be really neat to discover that microglia not only deploy their immunological tools to reshape (by Hebbian rules) the synaptic connections of excitatory presynaptic neurons to their post synaptic partner as new experience dictates (the type of learning described by Bonhoeffer), but also that microglia deploy their endocrine-like neurosteroidal tools to modulate the relative importance of the predictive, expectational, information stored in the strength of inhibitory interneuron synapses connecting to the same post synaptic cell.

²⁰ How Google's early failure to sell itself shows why we can't recognize good ideas [SOURCE]

By Lila MacLellan
January 29, 2018

A newly published interview with David Cheriton, the Stanford University computer science professor known for writing Google's second check, offers a reminder about how bad most gatekeepers are at picking out the winners among new ideas. When the Globe and Mail (paywall) asked Cheriton about the original pitch Sergey Brin and Larry Page made to him in 2004, he responded:

Actually, they wanted advice about licensing their search technology to other companies. I'd seen people try that before. The view I take is, you give birth to this baby technology, and you think somebody would like to adopt it because you think it's beautiful. But it's very hard to get anyone else to adopt your baby. I told them, "You have to raise your baby yourself." They came back some months later, and I don't think they said I was right, but they'd decided to start their own company because nobody was interested in their baby.

Stories like these are irresistible. Maybe it's the schadenfreude that comes with imagining Silicon Valley investors missing out on a product that would come to be worth billions. Or it's just that we like to believe that had we been sitting in the room when Page and Brin demonstrated their ranking algorithm, we would have recognized their search technology as genius, history-changing genius.

But studies show that's just wishful thinking. Had you been an executive at, say, Yahoo, one of the companies that was not interested in PageRank, which became Google, you likely would have made the same mistake. Justin Berg, a professor at Stanford University's School of Business, calls it a failure of "creative forecasting," something managers and other decision makers are notorious for. In public lectures on his research, Berg runs through a list of famous rejections: the sitcom Seinfeld, the first Harry Potter book, even the graphical user interface. In literature, there's a multitude of such examples where editors missed a genre-bending or stylistically fresh new voice.

Berg researched how this happens, finding a rich body of literature explaining the natural biases against creativity and innovation that exist in most of us, and are not completely irrational. Novel ideas actually make us "deeply uncomfortable," scholars have found, even when we believe we welcome them. Our inclination to avoid risk means managers are unlikely to endorse an idea that doesn't resemble any that came before it. It's safer not to jeopardize your reputation as someone who can be trusted with the company's time and money.

But Berg wanted to answer a different question about creativity within organizations. He wanted to find out who was good at spotting strong, imaginative ideas, and why, so he began by studying artists and circus managers at Cirque Du Soleil, analyzing their predictions about which new acts introduced by performers would be most appreciated by audiences. Berg found that the artists, and not the managers, had a better sense of when they were watching a future hit. As Adam Grant explained in an Aspens Ideas speech that covered Berg's findings, fellow artists "are much more likely to say, 'What are the reasons that I should consider this idea?' as opposed to, 'Why should I walk away from it?'"

Sadly for managers, it seems there's something about the role itself that can close the mind. This is true even for students asked to think like managers, Berg found in lab studies. But there's also famous proof of this in an anecdote about another early Google investor, Jeff Bezos, founder of Amazon. When he visited Harvard Business School in 1997, he was told by the students in a class called "Managing the Marketspace" that he ought to sell Amazon to Barnes and Noble. In his book, The Everything Store, Brad Stone explains that the Harvard students believed Amazon would be squashed by more established brands when they moved online. They were obviously bright students, but also human. They sided with the known.

On a more optimistic note, however, Berg also found that students playing "managers" got better at a creative forecasting task when they took on the role of of "creator," for even five minutes, by spending some time generating new ideas before listening and judging those of peers. In other words, company executives, editors, producers—all the gatekeepers of the world—ought to make research and innovation part of their jobs. Exactly the way professors like Cheriton do.

²¹ By "reasonably easy" I mean that extending the Zinn hierarchical planner to handle a carefully circumscribed subset of programmer's apprentice-related actions, interventions and misunderstandings suitable for simple demos and testing might take a good programmer a few months effort and a great programmer substantially less. Writing such DMS "stubs" in order to carve out a suitably safe demo space is an art but one that can be relatively easily learned.

²² Using training data and examples collected from Google code repositories would very probably preclude any substantive external demos and complicate publication given the proprietary nature and specialized applications of the Google code base.

²³ Suppose you have a stream of items of large and unknown length that you can only iterate over once. Reservoir sampling is an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

²⁴ Here is a small sample of recent titles and references that straddle disciplinary boundaries to offer new insights: Associative Long Short-Term Memory [53], Neural Turing Machines [105], DRAW: A Recurrent Neural Network For Image Generation [119], Recurrent Models of Visual Attention [206], Using Fast Weights to Attend to the Recent Past [10], Learning Model-based Planning From Scratch [223], The Construction System of the Brain [132], Deconstructing Episodic Memory with Construction [131], Imagination-Augmented Agents for Deep Reinforcement Learning [297], The Future of Memory: Remembering, Imagining, and the Brain [254], Human-level Concept Learning Through Probabilistic Program Induction [171], The Consciousness Prior [25], What is Consciousness, and Could Machines Have It? [73], The Attention Schema Theory: A Mechanistic Account of Subjective Awareness [116]

²⁵ In a memorandum issued in 2000, the NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definition: Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

²⁶ John Rader Platt was an American physicist and biophysicist, a professor at the University of Chicago and an advocate of periodically changing scientific fields to keep fresh, gain new perspectives and facilitate the spread of ideas. His early interests were in the field of molecular biophysics and biophysics, in the 1960s he shifted to philosophy of science, vision and perception, and the study of social trends. He was known for his pioneering work on strong inference in the 1960s and his analysis of social science in the 1970s. He wrote that:

I think there are thousands of scientist who would like to change to less crowded and more interesting fields if they thought the change would not be disapproved and if they could plan how to make a living and how to get research support while making the change. I think this would be a good thing. Mobility spreads the skills in a labor market and mobility would spread the skills in science. Kant, Helmholtz and Pasteur all changed fields. Enrico Fermi once said that a scientist should change fields every ten years; that in the first place his ideas were exhausted by then, and he owed it to the younger men in the field to let them advance; and that in the second place his ideas might be of greater value in bringing a fresh viewpoint to a different field. — John Rader Platt [32]

John W. Gardner, the former cabinet secretary and founder of Common Cause, the nonpartisan public-interest lobby for greater political transparency and accountability, first described a career strategy he referred to as "repotting" as a way to stay engaged and innovative. The idea is that a career reboot not only helps prevent managers from staying in one position too long, being lulled into complacency or leadership fatigue, but that it also pushes leaders to keep learning, to see new challenges with a fresh perspective and ultimately find meaningful work that leaves a lasting legacy:

Most of us have potentialities that have never been developed simply because the circumstances of our lives never called them forth. Exploration of the full range of our own potentialities is not something that we can safely leave to the chances of life. It is something to be pursued systematically, or at least avidly, to the end of our days. We should look forward to an endless and unpredictable dialogue between our potentialities and the claims of life — not only the claims we encounter but the claims we invent. And by the potentialities I mean not just skills, but the full range capacities for sensing, wondering, learning, understanding, loving, and aspiring. — John Gardener [97]

²⁷ By academic career, I am referring to the period from 1986 to 2006 during which I was on the faculty of the Computer Science Department at Brown University. During my first ten years at Brown, I focused exclusively on research and teaching. From 1997 until 2002, I served as the chair of Brown's Computer Science Department and simultaneously as Acting Vice President for Computing and Information Services from 2001 until 2002. I then served as the Deputy Provost of Brown University from 2003 to 2005. Once I became involved in administration, the time I had available for research dwindled and, while I continued working with my graduate students, the intensity with which I pursued new research ideas declined. Peter Norvig suggested I spend my sabbatical at Google in 2006 and subsequently invited me to continue at Google full time to pursue research in computational neuroscience for which I will always be grateful to him. While I work full time as a Research Scientist at Google, I spend approximately 20% of my time teaching and advising students at Stanford thanks to Google's enlightened attitude towards collaboration and university relations.

²⁸ Herbert Simon [261] introduced the idea of bounded rationality as an alternative basis for the sort of mathematical modeling that dominated economics during much of the 20th century. It complements the notion of rationality as optimization, which views decision-making as a fully rational process of finding an optimal choice given the information available. Independently of Simon, procedural limitations have also been pointed out by Bayesian authors most notably I.J. Good, who introduced a distinction between what he calls Type I and Type II rationality. Type I rationality is the ideal of Bayesian decision theory, while Type II rationality takes into account the time and cost of analyzing a problem and thus may depart from the Type I ideal. Type II rationality often justifies a compromise between Bayesian and non-Bayesian methods in statistics [150].

²⁹ Here are some excerpts from Mary Shelley's Frankenstein. Shelley (1797–1851) started writing the story when she was 18. The first edition of the novel was published anonymously in London on January 1, 1818, when she was 20. Her name first appeared on the second edition, published in France in 1823. This is the 200th anniversary of the first edition.

The book has been mentioned in several news stories and appeared in book-club lists focusing on the modern Prometheus theme and helping to fuel concerns about rogue artificial intelligence. These excerpts caught my attention for their nuanced view of a young man caught up with the idea of controlling nature and altering the fate of humanity:

Frankenstein (Mary Shelley) — Loc. 343-54

If, instead of this remark, my father had taken the pains to explain to me that the principles of Agrippa had been entirely exploded and that a modern system of science had been introduced which possessed much greater powers than the ancient, because the powers of the latter were chimerical, while those of the former were real and practical, under such circumstances I should certainly have thrown Agrippa aside and have contented my imagination, warmed as it was, by returning with greater ardour to my former studies. It is even possible that the train of my ideas would never have received the fatal impulse that led to my ruin. But the cursory glance my father had taken of my volume by no means assured me that he was acquainted with its contents, and I continued to read with the greatest avidity.

When I returned home my first care was to procure the whole works of this author, and afterwards of Paracelsus and Albertus Magnus. I read and studied the wild fancies of these writers with delight; they appeared to me treasures known to few besides myself. I have described myself as always having been imbued with a fervent longing to penetrate the secrets of nature. In spite of the intense labour and wonderful discoveries of modern philosophers, I always came from my studies discontented and unsatisfied. Sir Isaac Newton is said to have avowed that he felt like a child picking up shells beside the great and unexplored ocean of truth. Those of his successors in each branch of natural philosophy with whom I was acquainted appeared even to my boy's apprehensions as tyros engaged in the same pursuit.

Frankenstein (Mary Shelley) — Loc. 355-63

The most learned philosopher knew little more. He had partially unveiled the face of Nature, but her immortal lineaments were still a wonder and a mystery. He might dissect, anatomize, and give names; but, not to speak of a final cause, causes in their secondary and tertiary grades were utterly unknown to him. I had gazed upon the fortifications and impediments that seemed to keep human beings from entering the citadel of nature, and rashly and ignorantly I had repined. But here were books, and here were men who had penetrated deeper and knew more. I took their word for all that they averred, and I became their disciple.

It may appear strange that such should arise in the eighteenth century; but while I followed the routine of education in the schools of Geneva, I was, to a great degree, self-taught with regard to my favourite studies. My father was not scientific, and I was left to struggle with a child's blindness, added to a student's thirst for knowledge. Under the guidance of my new preceptors I entered with the greatest diligence into the search of the philosopher's stone and the elixir of life; but the latter soon obtained my undivided attention.

Frankenstein (Mary Shelley) — Loc. 366-68

And thus for a time I was occupied by exploded systems, mingling, like an unadept, a thousand contradictory theories and floundering desperately in a very slough of multifarious knowledge, guided by an ardent imagination and childish reasoning, till an accident again changed the current of my ideas.

Frankenstein (Mary Shelley) — Loc. 438-42

I replied in the affirmative. "Every minute," continued M. Krempe with warmth, "every instant that you have wasted on those books is utterly and entirely lost. You have burdened your memory with exploded systems and useless names. Good God! In what desert land have you lived, where no one was kind enough to inform you that these fancies which you have so greedily imbibed are a thousand years old and as musty as they are ancient? I little expected, in this enlightened and scientific age, to find a disciple of Albertus Magnus and Paracelsus. My dear sir, you must begin your studies entirely anew."

Frankenstein (Mary Shelley) — Loc. 464-69

"The ancient teachers of this science," said he, "promised impossibilities and performed nothing. The modern masters promise very little; they know that metals cannot be transmuted and that the elixir of life is a chimera but these philosophers, whose hands seem only made to dabble in dirt, and their eyes to pore over the microscope or crucible, have indeed performed miracles. They penetrate into the recesses of nature and show how she works in her hiding-places. They ascend into the heavens; they have discovered how the blood circulates, and the nature of the air we breathe. They have acquired new and almost unlimited powers; they can command the thunders of heaven, mimic the earthquake, and even mock the invisible world with its own shadows."

Frankenstein (Mary Shelley) — Loc. 546-52

I prepared myself for a multitude of reverses; my operations might be incessantly baffled, and at last my work be imperfect, yet when I considered the improvement which every day takes place in science and mechanics, I was encouraged to hope my present attempts would at least lay the foundations of future success. Nor could I consider the magnitude and complexity of my plan as any argument of its impracticability. It was with these feelings that I began the creation of a human being. As the minuteness of the parts formed a great hindrance to my speed, I resolved, contrary to my first intention, to make the being of a gigantic stature, that is to say, about eight feet in height, and proportionably large. After having formed this determination and having spent some months in successfully collecting and arranging my materials, I began.

Frankenstein (Mary Shelley) — Loc. 569-73

It was a most beautiful season; never did the fields bestow a more plentiful harvest or the vines yield a more luxuriant vintage, but my eyes were insensible to the charms of nature. And the same feelings which made me neglect the scenes around me caused me also to forget those friends who were so many miles absent, and whom I had not seen for so long a time. I knew my silence disquieted them, and I well remembered the words of my father: "I know that while you are pleased with yourself you will think of us with affection, and we shall hear regularly from you. You must pardon me if I regard any interruption in your correspondence as a proof that your other duties are equally neglected."

Frankenstein (Mary Shelley) — Loc. 578-83

A human being in perfection ought always to preserve a calm and peaceful mind and never to allow passion or a transitory desire to disturb his tranquillity. I do not think that the pursuit of knowledge is an exception to this rule. If the study to which you apply yourself has a tendency to weaken your affections and to destroy your taste for those simple pleasures in which no alloy can possibly mix, then that study is certainly unlawful, that is to say, not befitting the human mind. If this rule were always observed; if no man allowed any pursuit whatsoever to interfere with the tranquillity of his domestic affections, Greece had not been enslaved, Caesar would have spared his country, America would have been discovered more gradually, and the empires of Mexico and Peru had not been destroyed.

Frankenstein (Mary Shelley) — Loc. 781-85

A selfish pursuit had cramped and narrowed me, until your gentleness and affection warmed and opened my senses; I became the same happy creature who, a few years ago, loved and beloved by all, had no sorrow or care. When happy, inanimate nature had the power of bestowing on me the most delightful sensations. A serene sky and verdant fields filled me with ecstasy. The present season was indeed divine; the flowers of spring bloomed in the hedges, while those of summer were already in bud. I was undisturbed by thoughts which during the preceding year had pressed upon me, notwithstanding my endeavours to throw them off, with an invincible burden.

³⁰ For the purpose of thinking about episodic memory, the prefrontal cortex is primarily associated with executive function and attention and memory which we cover elsewhere in the research notes. SOURCE

³¹ The temporal lobe (Brodmann area 22) is involved in processing sensory input into derived meanings for the appropriate retention of visual memory, language comprehension, and emotion association. It adjoins Wernnicke's area associated with language comprehension. SOURCE

³² Wernicke's area is classically located in the posterior section of the superior temporal gyrus (STG) in the (most commonly) left cerebral hemisphere. This area encircles the auditory cortex on the lateral sulcus (the part of the brain where the temporal lobe and parietal lobe meet). SOURCE

³³ The retrosplenial cortex (Brodmann area 29 and 30) is located close to visual areas and also to the hippocampal spatial/memory system suggest it may have a role in mediating between perceptual and memory functions. SOURCE

³⁴ The occipital lobe is one of the four major lobes of the cerebral cortex in the brain of mammals. The occipital lobe is the visual processing center of the mammalian brain containing most of the anatomical region of the visual cortex SOURCE

³⁵ The cuneus (Brodmann area 17) receives visual information from the same-sided superior quandrantic retina (corresponding to contralateral inferior visual field). It is most known for its involvement in basic visual processing. The mid-level visual processing that occurs in the extrastriate projection fields of the cuneus are modulated by extraretinal effects, like attention, working memory, and reward expectation. SOURCE

³⁶ The precuneus is located on the inside between the two cerebral hemispheres in the rear region between the somatosensory cortex and forward of the cuneus (which contains the visual cortex). The mental imagery concerning the self has been located in the forward part of the precuneus with posterior areas being involved with episodic memory. Another area has been linked to visuospatial imagery. SOURCE

³⁷ Here is an extended quotation from Conway [48] that, while it doesn't reveal any additional anatomical detail, does provide some useful framing insights from the perspective of cognitive neuroscience:

A distinction might be drawn between a conceptual based frontal-anterior temporal memory system and a posterior temporal-occipital EM system. In the human brain a unique way to access and manipulate EMs is through the conceptual knowledge system. It is this relation between conceptual knowledge and EM that supports two major and fairly amazing abilities: the ability to imagine how the past might have been otherwise and the ability to imagine the future. The ability to imagine how a specific experience might have unfolded differently with different outcomes conveys a huge survival advantage. Different courses of actions can be played out in memory with no physical consequences.
The ability to manipulate memories in this way provides the basis of planning and the projection of alternate forms of the past into the future. It is a powerful way of anticipating possible outcomes. Indeed, it provides a meaning by which long-term goals can be generated and planned for. It is then especially interesting that recent neuroimaging studies comparing brain activations during the recall of autobiographical memories with brain activations during the generation of false but plausible memories has found extensive commonalities between the brain areas active in both.

These commonalities are so marked that it might be more appropriate to refer to a remembering-imaging system rather than simply a "memory system." The remembering-imaging system is a cognitive system in which the various components that are usually constructed into an autobiographical memory (autobiographical knowledge and episodic memory) can be constructed into other mental representations of imaginary scenarios. For example, specific EMs might be constructed into novel conceptual context where "dinner with X" becomes "dinner with Y." or perhaps generic knowledge of locations, actors, and actions are configured into potential EMs of unexperienced events. Recent findings indicate then that EMs and autobiographical knowledge can provide the materials, the content, for the construction of imagined tasks and futures.

³⁸ Here is a short list of books and papers on theory of mind and episodic memory. The Handbook of Episodic Memory is a wonderful reference covering much of what is known about episodic memory. Unlike what is known about how the brain represents physical objects in our environment — the shape, color and motion of visual stimuli are represented in an orderly fashion in the visual association areas, episodic memory, is a great deal more complicated, emerges gradually and develops over many years integrating information from a great many sources. The selected journal articles are but a small sample of what I've read or skimmed, even if I read and understood all of the articles I've skimmed and memorized all 600 plus pages of the handbook, it would be years of work to develop a working model of episodic memory that even approximates human memory.

@book{DereetalHANDBOOK-of-EPISODIC-MEMORY-2008,
       author = {Dere, E. and Easton, A. and Nadel, L. and Huston, J.P.},
        title = {Handbook of Episodic Memory},
    publisher = {Elsevier Science},
       series = {Handbook of Behavioral Neuroscience},
         year = {2008},
}
@article{SaxeetalARP-04,
       author = {R. Saxe and S. Carey and N. Kanwisher},
        title = {Understanding Other Minds: Linking Developmental Psychology and Functional Neuroimaging},
      journal = {Annual Review of Psychology},
       volume = {55},
       number = {1},
         year = {2004},
        pages = {87-124},
     abstract = {Evidence from developmental psychology suggests that understanding other minds constitutes a special domain of cognition with at least two components: an early-developing system for reasoning about goals, perceptions, and emotions, and a later-developing system for representing the contents of beliefs. Neuroimaging reinforces and elaborates upon this view by providing evidence that (a) domain-specific brain regions exist for representing belief contents, (b) these regions are apparently distinct from other regions engaged in reasoning about goals and actions (suggesting that the two developmental stages reflect the emergence of two distinct systems, rather than the elaboration of a single system), and (c) these regions are distinct from brain regions engaged in inhibitory control and in syntactic processing. The clear neural distinction between these processes is evidence that belief attribution is not dependent on either inhibitory control or syntax, but is subserved by a specialized neural system for theory of mind. }
}
@article{DuboisandAdolphsPNAS-16,
       author = {Dubois, Julien and Adolphs, Ralph},
        title = {How the brain represents other minds},
      journal = {Proceedings of the National Academy of Sciences},
       volume = {113},
       number = {1},
         year = {2016},
        pages = {19-21},
     abstract = {How does the brain represent the world? Sensory neuroscience has given us a detailed window into how the brain represents physical objects in our environment: For instance, the shape, color, and direction of motion of visual stimuli are represented in an orderly fashion in higher order visual cortices. But we also represent social objects: other people and their thoughts, beliefs, and feelings. How is that kind of knowledge represented in the brain? In an ambitious new study in PNAS, Tamir et al. used functional MRI (fMRI) to argue that our brains represent other minds along three broad dimensions: social impact, rationality, and valence.}
}
@article{MahyetalDCN-14,
        title = {How and where: Theory-of-mind in the brain},
       author = {Caitlin E.V. Mahy and Louis J. Moses and Jennifer H. Pfeifer},
      journal = {Developmental Cognitive Neuroscience},
       volume = {9},
         year = {2014},
        pages = {68-81},
     abstract = {Theory of mind (ToM) is a core topic in both social neuroscience and developmental psychology, yet theory and data from each field have only minimally constrained thinking in the other. The two fields might be fruitfully integrated, however, if social neuroscientists sought evidence directly relevant to current accounts of ToM development: modularity, simulation, executive, and theory theory accounts. Here we extend the distinct predictions made by each theory to the neural level, describe neuroimaging evidence that in principle would be relevant to testing each account, and discuss such evidence where it exists. We propose that it would be mutually beneficial for both fields if ToM neuroimaging studies focused more on integrating developmental accounts of ToM acquisition with neuroimaging approaches, and suggest ways this might be achieved.}
}
@article{FerreretalFiN-09,
       author = {Ferrer, Emilio and O'Hare, Elizabeth D. and Bunge, Silvia A.},
        title = {Fluid Reasoning and the Developing Brain},
      journal = {Frontiers in Neuroscience},
    publisher = {Frontiers Research Foundation},
       volume = {3},
        issue = {1},
         year = {2009},
        pages = {46-51},
     abstract = {Fluid reasoning is the cornerstone of human cognition, both during development and in adulthood. Despite this, the neural mechanisms underlying the development of fluid reasoning are largely unknown. In this review, we provide an overview of this important cognitive ability, the method of measurement, its changes over the childhood and adolescence of an individual, and its underlying neurobiological underpinnings. We review important findings from psychometric, cognitive, and neuroscientific literatures, and outline important future directions for this interdisciplinary research.},
}
@article{DumontheilDCN-14,
        title = {Development of abstract thinking during childhood and adolescence: The role of rostrolateral prefrontal cortex},
       author = {Iroise Dumontheil},
      journal = {Developmental Cognitive Neuroscience},
       volume = {10},
         year = {2014},
        pages = {57-76},
     abstract = {Rostral prefrontal cortex (RPFC) has increased in size and changed in terms of its cellular organisation during primate evolution. In parallel emerged the ability to detach oneself from the immediate environment to process abstract thoughts and solve problems and to understand other individuals’ thoughts and intentions. Rostrolateral prefrontal cortex (RLPFC) is thought to play an important role in supporting the integration of abstract, often self-generated, thoughts. Thoughts can be temporally abstract and relate to long term goals, or past or future events, or relationally abstract and focus on the relationships between representations rather than simple stimulus features. Behavioural studies have provided evidence of a prolonged development of the cognitive functions associated with RLPFC, in particular logical and relational reasoning, but also episodic memory retrieval and prospective memory. Functional and structural neuroimaging studies provide further support for a prolonged development of RLPFC during adolescence, with some evidence of increased specialisation of RLPFC activation for relational integration and aspects of episodic memory retrieval. Topics for future research will be discussed, such as the role of medial RPFC in processing abstract thoughts in the social domain, the possibility of training abstract thinking in the domain of reasoning, and links to education.}
}

³⁹ Preliminary sample of papers relating to episodic memory and related applications including question answering:

@article{AnonymousICLR-18a,
        title = {Integrating Episodic Memory into a Reinforcement Learning Agent Using Reservoir Sampling},
       author = {Anonymous},
      journal = {International Conference on Learning Representations},
         year = {2018},
          url = {https://openreview.net/forum?id=ByJDAIe0b},
     abstract = {Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. Critically this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. Thus unlike most prior mechanisms for external memory it is feasible to use in an online reinforcement learning setting.}
}
@article{AnonymousICLR-18b,
        title = {Memory Architectures in Recurrent Neural Network Language Models},
       author = {Anonymous},
      journal = {International Conference on Learning Representations},
         year = {2018},
          url = {https://openreview.net/forum?id=SkFqf0lAZ},
     abstract = {We compare and analyze sequential, random access, and stack memory architectures for recurrent neural network language models. Our experiments on the Penn Treebank and Wikitext-2 datasets show that stack-based memory architectures consistently achieve the best performance in terms of held out perplexity. We also propose a generalization to existing continuous stack models (Joulin & Mikolov,2015; Grefenstette et al., 2015)  to allow a variable number of pop operations more naturally that further improves performance. We further evaluate these language models in terms of their ability to capture non-local syntactic dependencies on a subject-verb agreement dataset  (Linzen et al., 2016) and establish new state of the art results using memory augmented language models. Our results demonstrate the value of stack-structured memory for explaining the distribution of words in natural language, in line with linguistic theories claiming a context-free backbone for natural language.}
}
@misc{DynamicMemoryNetworks-16,
        title = {Dynamic Memory Networks},
       author = {Anonymous},
         year = {2016},
 howpublished = {{LDGN} Notes and Thoughts on Machine Learning and Artificial Intelligence},
          url = {https://ldgn.wordpress.com/2016/01/04/attention-and-memory-in-deep-learning/},
     abstract = {Dynamic Memory Networks (DMN) are a general framework for question answering over inputs. Conceptually, a difference is made between inputs and questions. The DMN takes a sequence of inputs and a question and then employs an iterative attention process to compute the answer. The sequence of inputs can be seen as the history, which complements the general world knowledge (see semantic memory module). The DNM framework consists of five components: (i) input module: processes raw input and maps it to a useful representation, (ii) semantic memory module: stores general knowledge about concepts and facts. It can be instantiated by word embeddings or knowledge bases, (iii) question module: maps a question into a representation, (iv) episodic memory module: an iterative component that in each iteration focuses on different parts of the input, updates its internal state and finally outputs an answer representation (vi) answer module: generates the answer to return.}
}
@article{KumaretalCoRR-16,
       author = {Ankit Kumar and Ozan Irsoy and Jonathan Su and James Bradbury and Robert English and Brian Pierce and Peter Ondruska and Ishaan Gulrajani and Richard Socher},
        title = {Ask Me Anything: Dynamic Memory Networks for Natural Language Processing},
      journal = {CoRR},
       volume = {arXiv:1506.07285},
         year = {2015},
     abstract = {Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook's bAbI dataset), text classification for sentiment analysis (Stanford Sentiment Treebank) and sequence modeling for part-of-speech tagging (WSJ-PTB). The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.}
}
@inproceedings{ChangandTanIJCAI-17,
       author = {Poo-Hee Chang, Ah-Hwee Tan},
        title = {Encoding and Recall of Spatio-Temporal Episodic Memory in Real Time},
    booktitle = {Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, {IJCAI-17}},
        pages = {1490--1496},
         year = {2017},
     abstract = {Episodic memory enables a cognitive system to improve its performance by reflecting upon past events. In this paper, we propose a computational model called STEM for encoding and recall of episodic events together with the associated con- textual information in real time. Based on a class of self-organizing neural networks, STEM is de- signed to learn memory chunks or cognitive nodes, each encoding a set of co-occurring multi-modal activity patterns across multiple pattern channels. We present algorithms for recall of events based on partial and inexact input patterns. Our empirical results based on a public domain data set show that STEM displays a high level of efficiency and ro- bustness in encoding and retrieval with both partial and noisy search cues when compared with a state- of-the-art associative memory model.},
}
@article{TaniJCS-98,
       author = {Tani, Jun},
        title = {An interpretation of the 'self' from the dynamical systems perspective: {A} constructivist approach},
    booktitle = {Journal of Consciousness Studies},
       volume = {5},
         year = {1998},
        pages = {516-542},
     abstract = {This study attempts to describe the notion of the 'self' using dynamical systems language based on the results of our robot learning experiments. A neural network model consisting of multiple modules is proposed, in which the interactive dynamics between the bottom-up perception and the top-down prediction are investigated. Our experiments with a real mobile robot showed that the incremental learning of the robot switches spontaneously between steady and unsteady phases. In the steady phase, the top-down prediction for the bottom-up perception works well when coherence is achieved between the internal and the environmental dynamics. In the unsteady phase, conflicts arise between the bottom-up perception and the top-down prediction; the coherence is lost, and a chaotic attractor is observed in the internal neural dynamics. By investigating possible analogies between this result and the phenomenological literature on the 'self', we draw the conclusions that (1) the structure of the 'self' corresponds to the 'open dynamic structure' which is characterized by co-existence of stability in terms of goal-directedness and instability caused by embodiment; (2) the open dynamic structure causes the system's spontaneous transition to the unsteady phase where the 'self' becomes aware.}
}
@article{PritzeletalCoRR-17,
       author = {Alexander Pritzel and Benigno Uria and Sriram Srinivasan and Adri{\`{a}} Puigdom{\`{e}}nech Badia and Oriol Vinyals and Demis Hassabis and Daan Wierstra and Charles Blundell},
        title = {Neural Episodic Control},
      journal = {CoRR},
       volume = {arXiv:1703.01988},
         year = {2017},
     abstract = {Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitudes more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general purpose deep reinforcement learning agents.}
}
@book{GallagherMODELS-OF-THE-SELF-13,
        title = {Models of the Self},
       author = {Gallagher, Shaun},
         year = {2013},
    publisher = {Imprint Academic}
}
@book{OKeefeandNadelHIPPOCAMPUS-78,
        title = {The hippocampus as a cognitive map},
       author = {O'Keefe, John and Nadel, Lynn},
         year = {1978},
    publisher = {Clarendon Press}
}

⁴⁰ Here is a sample of papers relating to attention and theory-of-mind reasoning focusing on the separate research agendas of Josh Tenenbaum at MIT and Michael Graziano at Princeton along with their many students and colleagues:

@article{GershmanetalPLoS-16,
       author = {Sam Gershman and Tobias Gerstenberg and Chris Baker and Fiery Cushman},
      journal = {PLoS ONE},
        title = {Plans, habits, and theory of mind},
         year = {2016},
     abstract = {Human success and even survival depends on our ability to predict what others will do by guessing what they are thinking. If I accelerate, will he yield? If I propose, will she accept? If I confess, will they forgive? Psychologists call this capacity "theory of mind." According to current theories, we solve this problem by assuming that others are rational actors. That is, we assume that others design and execute efficient plans to achieve their goals, given their knowledge. But if this view is correct, then our theory of mind is startlingly incomplete. Human action is not always a product of rational planning, and we would be mistaken to always interpret others' behaviors as such. A wealth of evidence indicates that we often act habitually—a form of behavioral control that depends not on rational planning, but rather on a history of reinforcement. We aim to test whether the human theory of mind includes a theory of habitual action and to assess when and how it is deployed. In a series of studies, we show that human theory of mind is sensitive to factors influencing the balance between habitual and planned behavior.},
}
@article{LakeetalCoRR-16,
       author = {Brenden M. Lake and Tomer D. Ullman and Joshua B. Tenenbaum and Samuel J. Gershman},
        title = {Building Machines That Learn and Think Like People},
      journal = {CoRR},
       volume = {arXiv:1604.00289},
         year = {2016},
     abstract = {Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.}
}
@article{StuhlmuellerandGoodmanCS-13,
       author = {Stuhlm\"{u}ller, A. and Goodman, N. D.},
      journal = {Journal of Cognitive Systems Research},
        title = {Reasoning about Reasoning by Nested Conditioning: Modeling Theory of Mind with Probabilistic Programs},
       volume = {28},
         year = {2014},
        pages = {80-99},
     abstract = {A wide range of human reasoning patterns can be explained as conditioning in probabilistic models; however, conditioning has traditionally been viewed as an operation applied to such models, not represented in such models. We describe how probabilistic programs can explicitly represent conditioning as part of a model. This enables us to describe reasoning about others’ reasoning using nested conditioning. Much of human reasoning is about the beliefs, desires, and intentions of other people; we use probabilistic programs to formalize these inferences in a way that captures the flexibility and inherent uncertainty of reasoning about other agents. We express examples from game theory, artificial intelligence, and linguistics as recursive probabilistic programs and illustrate how this representation language makes it easy to explore new directions in each of these fields. We discuss the algorithmic challenges posed by these kinds of models and describe how Dynamic Programming techniques can help address these challenges.}
}
@inbook{Goodmanetal2015,
       author = {Goodman, Noah D. and Tenenbaum, Joshua B. and Gerstenberg, T.},
    booktitle = {The Conceptual Mind: New Directions in the Study of Concepts},
       editor = {Morgolis and Lawrence},
    publisher = {MIT Press},
        title = {Concepts in a probabilistic language of thought},
         year = {2015}
}
@inproceedings{ErikvandenBoogaardetal2017,
       author = {Erik van den Boogaard and Jan Treur and Maxim Turpijn},
        title = {A Neurologically Inspired Network Model for Graziano's Attention Schema Theory for Consciousness},
    booktitle = {Natural and Artificial Computation for Biomedicine and Neuroscience - International Work-Conference on the Interplay Between Natural and Artificial Computation},
        pages = {10-21},
         year = {2017},
     abstract = {This paper describes a network-oriented model based on the neuroscientist Graziano’s Attention Schema Theory for consciousness. This theory describes an attention schema as an internal model of the attention process supporting the control of attention, similar to how our mind uses a body schema as an internal model of the body to control its movements. The Attention Schema Theory comes with a number of testable predictions. After designing a neurologically inspired temporal- causal network model for the Attention schema Theory, a few simula- tions were conducted to verify some of these predictions. One prediction is that a noticeable attention control deficit occurs when using attention without awareness. Another is that a noticeable attention control deficit occurs when using only bottom-up influence (from the sensory repre- sentations) without any top-down influence (for example, from goal or control states). The presented model is illustrated by a scenario where a hunter imagines (using internal simulation) a prey which he wants to attend to and catch, but shortly after he or she imagines a predator which he then wants to attend to and avoid. The outcomes of the simulations support the predictions that were made.}
}
@article{GrazianoFiRAI-17,
       author = {Graziano, Michael S. A.},
        title = {The Attention Schema Theory: A Foundation for Engineering Artificial Consciousness},
      journal = {Frontiers in Robotics and AI},
       volume = {4},
        pages = {60},
         year = {2017},
     abstract = {The purpose of the attention schema theory is to explain how an information-processing device, the brain, arrives at the claim that it possesses a non-physical, subjective awareness, and assigns a high degree of certainty to that extraordinary claim. The theory does not address how the brain might actually possess a non-physical essence. It is not a theory that deals in the non-physical. It is about the computations that cause a machine to make a claim and to assign a high degree of certainty to the claim. The theory is offered as a possible starting point for building artificial consciousness. Given current technology, it should be possible to build a machine that contains a rich internal model of what consciousness is, attributes that property of consciousness to itself and to the people it interacts with, and uses that attribution to make predictions about human behavior. Such a machine would “believe” it is conscious and act like it is conscious, in the same sense that the human machine believes and acts.}
}
@article{IgelstrometalENEURO-16,
       author = {Igelstr\"{o}m, Kajsa M. and Webb, Taylor W. and Kelly, Yin T. and Graziano, Michael S. A.},
        title = {Topographical Organization of Attentional, Social, and Memory Processes in the Human Temporoparietal Cortex},
      journal = {eNeuro},
    publisher = {Society for Neuroscience},
       volume = {3},
       number = {2},
         year = {2016},
     abstract = {The temporoparietal junction (TPJ) is activated in association with a large range of functions, including social cognition, episodic memory retrieval, and attentional reorienting. An ongoing debate is whether the TPJ performs an overarching, domain-general computation, or whether functions reside in domain-specific subdivisions. We scanned subjects with fMRI during five tasks known to activate the TPJ, probing social, attentional, and memory functions, and used data-driven parcellation (independent component analysis) to isolate task-related functional processes in the bilateral TPJ. We found that one dorsal component in the right TPJ, which was connected with the frontoparietal control network, was activated in all of the tasks. Other TPJ subregions were specific for attentional reorienting, oddball target detection, or social attribution of belief. The TPJ components that participated in attentional reorienting and oddball target detection appeared spatially separated, but both were connected with the ventral attention network. The TPJ component that participated in the theory-of-mind task was part of the default-mode network. Further, we found that the BOLD response in the domain-general dorsal component had a longer latency than responses in the domain-specific components, suggesting an involvement in distinct, perhaps postperceptual, computations. These findings suggest that the TPJ performs both domain-general and domain-specific computations that reside within spatially distinct functional components.},
}
@article{GrazianoandWebb-15,
       author = {Graziano, Michael S. A. and Webb, Taylor W.},
        title = {The attention schema theory: a mechanistic account of subjective awareness},
      journal = {Frontiers in Psychology},
       volume = {6},
        pages = {500},
         year = {2015},
     abstract = {We recently proposed the attention schema theory, a novel way to explain the brain basis of subjective awareness in a mechanistic and scientifically testable manner. The theory begins with attention, the process by which signals compete for the brain’s limited computing resources. This internal signal competition is partly under a bottom-up influence and partly under top-down control. We propose that the top-down control of attention is improved when the brain has access to a simplified model of attention itself. The brain therefore constructs a schematic model of the process of attention, the ‘attention schema’, in much the same way that it constructs a schematic model of the body, the ‘body schema’. The content of this internal model leads a brain to conclude that it has a subjective experience. One advantage of this theory is that it explains how awareness and attention can sometimes become dissociated; the brain’s internal models are never perfect, and sometimes a model becomes dissociated from the object being modeled. A second advantage of this theory is that it explains how we can be aware of both internal and external events. The brain can apply attention to many types of information including external sensory information and internal information about emotions and cognitive states. If awareness is a model of attention, then this model should pertain to the same domains of information to which attention pertains. A third advantage of this theory is that it provides testable predictions. If awareness is the internal model of attention, used to help control attention, then without awareness, attention should still be possible but should suffer deficits in control. In this article, we review the existing literature on the relationship between attention and awareness, and suggest that at least some of the predictions of the theory are borne out by the evidence.}
}

⁴¹ The abstract syntax tree of the example in the text is a simplified version of the AST for a WHILE loop:

⁴² Titles and abstracts of papers on prefrontal structures relevant to attention and executive control, including recent research on whether or to what extent working memory is situated in the frontal cortex [172]:

@article{BalletalFiN-11,
       author = {Ball, Gareth and Stokes, Paul R. and Rhodes, Rebecca A. and Bose, Subrata K. and Rezek, Iead and Wink, Alle-Meije and Lord, Louis-David and Mehta, Mitul A. and Grasby, Paul M. and Turkheimer, Federico E.},
        title = {Executive Functions and Prefrontal Cortex: A Matter of Persistence?},
      journal = {Frontiers in Systems Neuroscience},
         year = {2011},
    publisher = {Frontiers Research Foundation},
       volume = {5},
        pages = {3},
     abstract = {Executive function is thought to originates from the dynamics of frontal cortical networks. We examined the dynamic properties of the blood oxygen level dependent time-series measured with functional MRI (fMRI) within the prefrontal cortex (PFC) to test the hypothesis that temporally persistent neural activity underlies performance in three tasks of executive function. A numerical estimate of signal persistence, the Hurst exponent, postulated to represent the coherent firing of cortical networks, was determined and correlated with task performance. Increasing persistence in the lateral PFC was shown to correlate with improved performance during an n-back task. Conversely, we observed a correlation between persistence and increasing commission error --- indicating a failure to inhibit a prepotent response --- during a Go/No-Go task. We propose that persistence within the PFC reflects dynamic network formation and these findings underline the importance of frequency analysis of fMRI time-series in the study of executive functions.},
}
@article{BalstersetalNEUROIMAGE-09,
     title = {Evolution of the cerebellar cortex: {T}he selective expansion of prefrontal-projecting cerebellar lobules},
    author = {Balsters, J. H. and Cussans, E. and Diedrichsen, J. and Phillips, K. A. and Preuss, T. M. and Rilling, J. K. and Ramnani, N.},
   journal = {NeuroImage},
     issue = 3,
     pages = {2045-2052},
    volume = 49,
      year = 2010,
  abstract = {It has been suggested that interconnected brain areas evolve in tandem because evolutionary pressures act on complete functional systems rather than on individual brain areas. The cerebellar cortex has reciprocal connections with both the prefrontal cortex and motor cortex, forming independent loops with each. Specifically, in capuchin monkeys cerebellar cortical lobules Crus I and Crus II connect with prefrontal cortex  whereas the primary motor cortex connects with cerebellar lobules V, VI, VIIb, and VIIIa. Comparisons of extant primate species suggest that the prefrontal cortex has expanded more than cortical motor areas in human evolution. Given the enlargement of the prefrontal cortex relative to motor cortex in humans, our hypothesis would predict corresponding volumetric increases in the parts of the cerebellum connected to the prefrontal cortex, relative to cerebellar lobules connected to the motor cortex. We tested the hypothesis by comparing the volumes of cerebellar lobules in structural MRI scans in capuchins, chimpanzees and humans. The fractions of cerebellar volume occupied by Crus I and Crus II were significantly larger in humans compared to chimpanzees and capuchins. Our results therefore support the hypothesis that in the cortico-cerebellar system, functionally related structures evolve in concert with each other. The evolutionary expansion of these prefrontal-projecting cerebellar territories might contribute to the evolution of the higher cognitive functions of humans.},
}
@article{BaraketalPIN-13,
       author = {Omri Barak and David Sussillo and Ranulfo Romo and Misha Tsodyks and L.F. Abbott},
        title = {From fixed points to chaos: Three models of delayed discrimination},
      journal = {Progress in Neurobiology},
       volume = {103},
        pages = {214-222},
         year = {2013},
         note = {Conversion of Sensory Signals into Perceptions, Memories and Decisions},
      comment = {This manuscript argues persuasively that mixed selectivity, a signature of high-dimensional neural representations, is a fundamental component of the computational power of prefrontal cortex.},
     abstract = {Working memory is a crucial component of most cognitive tasks. Its neuronal mechanisms are still unclear despite intensive experimental and theoretical explorations. Most theoretical models of working memory assume both time-invariant neural representations and precise connectivity schemes based on the tuning properties of network neurons. A different, more recent class of models assumes randomly connected neurons that have no tuning to any particular task, and bases task performance purely on adjustment of network readout. Intermediate between these schemes are networks that start out random but are trained by a learning scheme. Experimental studies of a delayed vibrotactile discrimination task indicate that some of the neurons in prefrontal cortex are persistently tuned to the frequency of a remembered stimulus, but the majority exhibit more complex relationships to the stimulus that vary considerably across time. We compare three models, ranging from a highly organized line attractor model to a randomly connected network with chaotic activity, with data recorded during this task. The random network does a surprisingly good job of both performing the task and matching certain aspects of the data. The intermediate model, in which an initially random network is partially trained to perform the working memory task by tuning its recurrent and readout connections, provides a better description, although none of the models matches all features of the data. Our results suggest that prefrontal networks may begin in a random state relative to the task and initially rely on modified readout for task performance. With further training, however, more tuned neurons with less time-varying responses should emerge as the networks become more structured.}
}
@article{CarlenSCIENCE-17,
       author = {Carl\`{e}n, Marie},
        title = {What constitutes the prefrontal cortex?},
      journal = {Science},
    publisher = {American Association for the Advancement of Science},
       volume = {358},
       number = {6362},
         year = {2017},
        pages = {478-482},
     abstract = {During evolution, the prefrontal region grew in size relative to the rest of the cortex. It reached its largest extent in the human brain, where it constitutes 30\% of the total cortical area. This growth was accompanied by phylogenetic differentiation of the cortical areas. It has been argued that the human brain holds prefrontal regions that are both qualitatively and functionally unique. Present-day neuroscientists studying the prefrontal cortex increasingly use mice. An important goal is to reveal how the prefrontal cortex enables complex behavior. However, the prefrontal cortex still lacks a conclusive definition. The structure and function of this brain area across species remain unresolved. This state of affairs is often overlooked, warranting renewed focus on what the prefrontal cortex is and does.},
}
@article{ChaudhurietalNEURON-15,
       author = {Rishidev Chaudhuri and Kenneth Knoblauch and Marie-Alice Gariel and Henry Kennedy and Xiao-Jing Wang},
        title = {A Large-Scale Circuit Mechanism for Hierarchical Dynamical Processing in the Primate Cortex},
      journal = {Neuron},
       volume = {88},
       number = {2},
         year = {2015},
        pages = {419-431},
     abstract = {Summary We developed a large-scale dynamical model of the macaque neocortex, which is based on recently acquired directed- and weighted-connectivity data from tract-tracing experiments, and which incorporates heterogeneity across areas. A hierarchy of timescales naturally emerges from this system: sensory areas show brief, transient responses to input (appropriate for sensory processing), whereas association areas integrate inputs over time and exhibit persistent activity (suitable for decision-making and working memory). The model displays multiple temporal hierarchies, as evidenced by contrasting responses to visual versus somatosensory stimulation. Moreover, slower prefrontal and temporal areas have a disproportionate impact on global brain dynamics. These findings establish a circuit mechanism for "temporal receptive windows" that are progressively enlarged along the cortical hierarchy, suggest an extension of time integration in decision making from local to large circuits, and should prompt a re-evaluation of the analysis of functional connectivity (measured by fMRI or electroencephalography/magnetoencephalography) by taking into account inter-areal heterogeneity.}
}
@article{DehaeneChangeuxNEURON-11,
       author = {Dehaene, Stanislas and Changeux, Jean-Pierre},
        title = {Experimental and Theoretical Approaches to Conscious Processing},
      journal = {Neuron},
         year = {2017},
    publisher = {Elsevier},
       volume = {70},
        issue = {2},
        pages = {200-227},
     abstract = {Recent experimental studies and theoretical models have begun to address the challenge of establishing a causal link between subjective conscious experience and measurable neuronal activity. The present review focuses on the well-delimited issue of how an external or internal piece of information goes beyond nonconscious processing and gains access to conscious processing, a transition characterized by the existence of a reportable subjective experience. Converging neuroimaging and neurophysiological data, acquired during minimal experimental contrasts between conscious and nonconscious processing, point to objective neural measures of conscious access: late amplification of relevant sensory activity, long-distance cortico-cortical synchronization at beta and gamma frequencies, and ?ignition? of a large-scale prefronto-parietal network. We compare these findings to current theoretical models of conscious processing, including the Global Neuronal Workspace (GNW) model according to which conscious access occurs when incoming information is made globally available to multiple brain systems through a network of neurons with long-range axons densely distributed in prefrontal, parieto-temporal, and cingulate cortices. The clinical implications of these results for general anesthesia, coma, vegetative state, and schizophrenia are discussed.},
}
@article{JohnsonetalNATURE-16,
       author = {Johnson, Carolyn M. and Peckler, Hannah and Tai, Lung-Hao and Wilbrecht, Linda},
        title = {Rule learning enhances structural plasticity of long-range axons in frontal cortex},
      journal = {Nature Communications},
    publisher = {Nature Publishing Group},
       volume = {7},
         year = {2016},
     abstract = {Rules encompass cue-action-outcome associations used to guide decisions and strategies in a specific context. Subregions of the frontal cortex including the orbitofrontal cortex (OFC) and dorsomedial prefrontal cortex (dmPFC) are implicated in rule learning, although changes in structural connectivity underlying rule learning are poorly understood. We imaged OFC axonal projections to dmPFC during training in a multiple choice foraging task and used a reinforcement learning model to quantify explore-exploit strategy use and prediction error magnitude. Here we show that rule training, but not experience of reward alone, enhances OFC bouton plasticity. Baseline bouton density and gains during training correlate with rule exploitation, while bouton loss correlates with exploration and scales with the magnitude of experienced prediction errors. We conclude that rule learning sculpts frontal cortex interconnectivity and adjusts a thermostat for the explore-exploit balance.},
}
@article{KoechlinandJubaultNEURON-06,
     title = {Broca's Area and the Hierarchical Organization of Human Behavior},
    author = {Koechlin, Etienne and Jubault, Thomas},
   journal = {Neuron},
    volume = 50,
     issue = 6,
      year = 2006,
     pages = {963-974},
  abstract = {The prefrontal cortex subserves executive control, i.e., the organization of action or thought in relation to internal goals. This brain region hosts a system of executive processes extending from premotor to the most anterior prefrontal regions that governs the temporal organization of behavior. Little is known, however, about the prefrontal executive system involved in the hierarchical organization of behavior. Here, we show using magnetic resonance imaging in humans that the posterior portion of the prefrontal cortex, including Broca's area and its homolog in the right hemisphere, contains a system of executive processes that control start and end states and the nesting of functional segments that combine in hierarchically organized action plans. Our results indicate that Broca's area and its right homolog process hierarchically structured behaviors regardless of their temporal organization, suggesting a fundamental segregation between prefrontal executive systems involved in the hierarchical and temporal organization of goal-directed behaviors.},
}
@article{KoechlinetalPNAS-00,
      journal = {Proceedings of the National Academy of Sciences},
       author = {Koechlin, Etienne and Corrado, Gregory and Pietrini, Pietro and Grafman, Jordan},
        title = {Dissociating the role of the medial and lateral anterior prefrontal cortex in human planning},
       volume = 97,
        issue = 13,
         year = 2000,
        pages = {7651-7656},
     abstract = {The anterior prefrontal cortex is known to subserve higher cognitive functions such as task management and planning. Less is known, however, about the functional specialization of this cortical region in humans. Using functional MRI, we report a double dissociation: the medial anterior prefrontal cortex, in association with the ventral striatum, was engaged preferentially when subjects executed tasks in sequences that were expected, whereas the polar prefrontal cortex, in association with the dorsolateral striatum, was  involved preferentially when subjects performed tasks in sequences that  were contingent on unpredictable events. These results parallel the  functional segregation previously described between the medial and  lateral premotor cortex underlying planned and contingent motor control  and extend this division to the anterior prefrontal cortex, when task  management and planning are required. Thus, our findings support the  assumption that common frontal organizational principles underlie motor  and higher executive functions in humans.},
}
@article{KoechlinetalSCIENCE-03,
       author = {Etienne Koechlin and Chryst\`{e}le Ody and Fr\'{e}d\'{e}rique Kouneiher},
        title = {The architecture of cognitive control in the human prefrontal cortex},
      journal = {Science},
       volume = 302,
         year = 2003,
        pages = {1181-1185},
     abstract = {The prefrontal cortex (PFC) subserves cognitive control: the ability to coordinate thoughts or actions in relation with internal goals. Its functional architecture, however, remains poorly understood. Using brain imaging in humans, we showed that the lateral PFC is organized as a cascade of executive processes from premotor to anterior PFC regions that control behavior according to stimuli, the present perceptual context, and the temporal episode in which stimuli occur, respectively. The results support an unified modular model of cognitive control that describes the overall functional organization of the human lateral PFC and has basic methodological and theoretical implications.},
}
@article{LaraandWallisFiSN-15,
       author = {Lara, Antonio H. and Wallis, Jonathan D.},
        title = {The Role of Prefrontal Cortex in Working Memory: A Mini Review},
      journal = {Frontiers in System Neuroscience},
    publisher = {Frontiers Media S.A.},
       volume = {9},
         year = {2015},
        pages = {173},
     abstract = {A prominent account of prefrontal cortex (PFC) function is that single neurons within the PFC maintain representations of task-relevant stimuli in working memory. Evidence for this view comes from studies in which subjects hold a stimulus across a delay lasting up to several seconds. Persistent elevated activity in the PFC has been observed in animal models as well as in humans performing these tasks. This persistent activity has been interpreted as evidence for the encoding of the stimulus itself in working memory. However, recent findings have posed a challenge to this notion. A number of recent studies have examined neural data from the PFC and posterior sensory areas, both at the single neuron level in primates, and at a larger scale in humans, and have failed to find encoding of stimulus information in the PFC during tasks with a substantial working memory component. Strong stimulus related information, however, was seen in posterior sensory areas. These results suggest that delay period activity in the PFC might be better understood not as a signature of memory storage per se, but as a top down signal that influences posterior sensory areas where the actual working memory representations are maintained.},
}
@article{OReillySCIENCE-06,
     title = {Biologically Based Computational Models of High-Level Cognition},
    author = {O'Reilly, Randall C.},
   journal = {Science},
    volume = 314,
     issue = 5796,
      year = 2006,
     pages = {91-94},
  abstract = {Computer models based on the detailed biology of the brain can help us understand the myriad complexities of human cognition and intelligence. Here, we review models of the higher level aspects of human intelligence, which depend critically on the prefrontal cortex and associated subcortical areas. The picture emerging from a convergence of detailed mechanistic models and more abstract functional models represents a synthesis between analog and digital forms of computation. Specifically, the need for robust active maintenance and rapid updating of information in the prefrontal cortex appears to be satisfied by bistable activation states and dynamic gating mechanisms. These mechanisms are fundamental to digital computers and may be critical for the distinctive aspects of human intelligence.},
}
@article{RottschyetalNEUROIMAGE-12,
       author = {Rottschy, C. and Langner, R. and Dogan, I. and Reetz, K. and Laird, A. R. and Schulz, J. B. and Fox, P. T. and Eickhoff, S. B.},
        title = {Modelling neural correlates of working memory: A coordinate-based meta-analysis},
      journal = {Neuroimage},
         year = {2012},
       volume = {60},
        issue = {1},
        pages = {830-846},
     abstract = {Working memory subsumes the capability to memorize, retrieve and utilize information for a limited period of time which is essential to many human behaviours. Moreover, impairments of working memory functions may be found in nearly all neurological and psychiatric diseases. To examine what brain regions are commonly and differently active during various working memory tasks, we performed a coordinate-based meta-analysis over 189 fMRI experiments on healthy subjects. The main effect yielded a widespread bilateral fronto-parietal network. Further meta-analyses revealed that several regions were sensitive to specific task components, e.g. Broca's region was selectively active during verbal tasks or ventral and dorsal premotor cortex were preferentially involved in memory for object identity and location, respectively. Moreover, the lateral prefrontal cortex showed a division in a rostral and a caudal part based on differential involvement in task-set and load effects. Nevertheless, a consistent but more restricted core network emerged from conjunctions across analyses of specific task designs and contrasts. This core network appears to comprise the quintessence of regions, which are necessary during working memory tasks. It may be argued that the core regions form a distributed executive network with potentially generalized functions for focusing on competing representations in the brain. The present study demonstrates that meta-analyses are a powerful tool to integrate the data of functional imaging studies on a (broader) psychological construct, probing the consistency across various paradigms as well as the differential effects of different experimental implementations.},
}
@incollection{TefferandSemendeferiPBR-12,
       author = {Kate Teffer and Katerina Semendeferi},
        title = {Chapter 9 - Human prefrontal cortex: Evolution, development and pathology},
       editor = {Michel A. Hofman and Dean Falk},
       series = {Progress in Brain Research},
    publisher = {Elsevier},
       volume = {195},
        pages = {191-218},
         year = {2012},
    booktitle = {Evolution of the Primate Brain},
     abstract = {The prefrontal cortex is critical to many cognitive abilities that are considered particularly human, and forms a large part of a neural system crucial for normal socio-emotional and executive functioning in humans and other primates. In this chapter, we survey the literature regarding prefrontal development and pathology in humans as well as comparative studies of the region in humans and closely related primate species. The prefrontal cortex matures later in development than more caudal regions, and some of its neuronal subpopulations exhibit more complex dendritic arborizations. Comparative work suggests that the human prefrontal cortex differs from that of closely related primate species less in relative size than it does in organization. Specific reorganizational events in neural circuitry may have taken place either as a consequence of adjusting to increases in size or as adaptive responses to specific selection pressures. Living in complex environments has been recognized as a considerable factor in the evolution of primate cognition. Normal frontal lobe development and function are also compromised in several neurological and psychiatric disorders. A phylogenetically recent reorganization of frontal cortical circuitry may have been critical to the emergence of human-specific executive and social-emotional functions, and developmental pathology in these same systems underlies many psychiatric and neurological disorders, including autism and schizophrenia.}
}
@article{WatsonetalFiSN-14,
       author = {Watson, Thomas C. and Becker, Nadine and Apps, Richard and Jones, Matthew W.},
        title = {Back to front: cerebellar connections and interactions with the prefrontal cortex},
      journal = {Frontiers in Systems Neuroscience},
         year = {2014},
    publisher = {Frontiers Media S.A.},
       volume = {8},
        pages = {4},
     abstract = {Although recent neuroanatomical evidence has demonstrated closed-loop connectivity between prefrontal cortex and the cerebellum, the physiology of cerebello-cerebral circuits and the extent to which cerebellar output modulates neuronal activity in neocortex during behavior remain relatively unexplored. We show that electrical stimulation of the contralateral cerebellar fastigial nucleus (FN) in awake, behaving rats evokes distinct local field potential (LFP) responses (onset latency {\textasciitilde}13 ms) in the prelimbic (PrL) subdivision of the medial prefrontal cortex. Trains of FN stimulation evoke heterogeneous patterns of response in putative pyramidal cells in frontal and prefrontal regions in both urethane-anesthetized and awake, behaving rats. However, the majority of cells showed decreased firing rates during stimulation and subsequent rebound increases; more than 90\% of cells showed significant changes in response. Simultaneous recording of on-going LFP activity from FN and PrL while rats were at rest or actively exploring an open field arena revealed significant network coherence restricted to the theta frequency range (5-10 Hz). Granger causality analysis indicated that this coherence was significantly directed from cerebellum to PrL during active locomotion. Our results demonstrate the presence of a cerebello-prefrontal pathway in rat and reveal behaviorally dependent coordinated network activity between the two structures, which could facilitate transfer of sensorimotor information into ongoing neocortical processing during goal directed behaviors.},
}
@article{ZhaoetalNATURE-16,
       author = {Zhou, Xin and Zhu, Dantong and Qi, Xue-Lian and Li, Sihai and King, Samson G. and Salinas, Emilio and Stanford, Terrence R. and Constantinidis, Christos},
        title = {Neural correlates of working memory development in adolescent primates},
      journal = {Nature Communications},
    publisher = {Nature Publishing Group},
         year = {2016},
       volume = {9},
        issue = {7},
        pages = {13423},
     abstract = {Working memory ability matures after puberty, in parallel with structural changes in the prefrontal cortex, but little is known about how changes in prefrontal neuronal activity mediate this cognitive improvement in primates. To address this issue, we compare behavioural performance and neurophysiological activity in monkeys as they transitioned from puberty into adulthood. Here we report that monkeys perform working memory tasks reliably during puberty and show modest improvement in adulthood. The adult prefrontal cortex is characterized by increased activity during the delay period of the task but no change in the representation of stimuli. Activity evoked by distracting stimuli also decreases in the adult prefrontal cortex. The increase in delay period activity relative to the baseline activity of prefrontal neurons is the best correlate of maturation and is not merely a consequence of improved performance. Our results reveal neural correlates of the working memory improvement typical of primate adolescence.},
}

⁴³ Here are two excerpts from Dehaene [68] relating to attention, working memory and executive control. Note that working memory and conscious perception were thought to share similar brain mechanisms. Trübutschek et al [283] examine recent reports of non-conscious working memory that challenge this view. They combine visual masking with magnetoencephalography to demonstrate the reality of nonconscious working memory and dissect its neural mechanisms:

The component of the mind that psychologists call working memory is one of the dominant functions of the dorsolateral prefrontal cortex and the areas that it connects with, thus making these areas strong candidates for the depositories of our conscious knowledge. These regions pop up in brain imaging experiments whenever we briefly hold on to a piece of information: a phone number, a color, or the shape of a flashed picture. Prefrontal neurons implement an active memory: long after the picture is gone, they continue to fire throughout the short-term memory task—sometimes as long as dozens of seconds later. And when the prefrontal cortex is impaired or distracted, this memory is lost—it falls into unconscious oblivion.
Together with the physicists Mariano Sigman and Ariel Zylberberg, I have begun to explore the computational properties that such a device would possess. It closely resembles what computer scientists call a "production system," a type of program introduced in the 1960s to implement artificial intelligence tasks. A production system comprises a database, also called "working memory," and a vast array of if-then production rules (e.g., if there is an A in working memory, then change it to the sequence BC). At each step, the system examines whether a rule matches the current state of its working memory. If multiple rules match, then they compete under the aegis of a stochastic prioritizing system. Finally, the winning rule ignites and is allowed to change the contents of working memory before the entire process resumes. Thus this sequence of steps amounts to serial cycles of unconscious competition, conscious ignition, and broadcasting.

⁴⁴ Here is Stanislas Dehaene writing about the relationship between consciousness and maintaining persistent thoughts:

There may be a very good reason why our consciousness condenses sensory messages into a synthetic code, devoid of gaps and ambiguities: such a code is compact enough to be carried forward in time, entering what we usually call “working memory.” Working memory and consciousness seem to be tightly related. One may even argue, with Daniel Dennett, that a main role of consciousness is to create lasting thoughts. Once a piece of information is conscious, it stays fresh in our mind for as long as we care to attend to it and remember it. The conscious brief must be kept stable enough to inform our decisions, even if they take a few minutes to form. This extended duration, thickening of the present moment, is characteristic of our conscious thoughts. — Stanislas Dehaene [68]

⁴⁵ Bibliography references for the four tutorials on neural networks used as supplements in CS379C including URL and date of the cached PDF on the course website:

@misc{KarparthyCONVOLUTIONAL-NEURAL-NETWORKS-16,
        title = {Convolutional Neural Networks for Visual Recognition},
       author = {Andrej Karparthy},
 howpublished = {http://cs231n.github.io/convolutional-networks/},
         year = {2016},
}	 
@Misc{KarparthyUNREASONABLY-EFFECTIVE-RNN-15,
        title = {The Unreasonable Effectivenss of Recursive Neutal Networks},
       author = {Andrej Karparthy},
 howpublished = {http://karpathy.github.io/2015/05/21/rnn-effectiveness/},
         year = {2015},
}	 
@article{OlahandCarterATTENTION-TUTORIAL-16,
       author = {Olah, Chris and Carter, Shan},
        title = {Attention and Augmented Recurrent Neural Networks},
      journal = {Distill},
          url = {http://distill.pub/2016/augmented-rnns},
         year = {2016},
}
@misc{OlahLSTM-NEURAL-NETWORK-TUTORIAL-15,
        title = {Understanding LSTM Networks},
       author = {Christopher Olah},
 howpublished = {http://colah.github.io/posts/2015-08-Understanding-LSTMs},
         year = {2015},
}

⁴⁶ In Greek mythology, Pan is the god of the mountain wilds and wooded glens, the champion of rustic music and the frequent consort of the woodland nymphs. Pan is also the god of theatrical criticism and impromptus. In addition to the usual duties of godhood in Greek mythology, Pan has a reputation of being slyly mischievous and sexually promiscuous, characteristics we will not attempt to recreate in our digital apprentice. After Pan our next choice is Panache, a word of French origin that carries the connotation of flamboyant manner, confidence and reckless courage, and might serve as an acronym for the unwieldy full name [P]rogrammer's [A]pprentice [N]eural-network [A]rchitecture [H]ierarchical [E]xecutive [C]ontroller, allowing only one word out of place to reflect Pan's mischievous wit.

⁴⁷ Recent work and seminal papers relating to the notion of dynamic links — also called fast weights — to attend to the recent past [289, 290]:

@incollection{vonderMalsburgPNN-94,
        title = {The Correlation Theory of Brain Function},
       author = {Christoph von der Malsburg},
    booktitle = {Models of Neural Networks: Physics of Neural Networks},
       editor = {Domany, E. and van Hemmen, J.L. and Schulten, K.},
    publisher = {Springer},
         year = {1994},
     abstract = {A summary of brain theory is given so far as it is contained within the framework of Localization Theory. Diffculties of this "conventional theory" are traced back to a specific deficiency: there is no way to express relations between active cells (as for instance their representing parts of the same object). A new theory is proposed to cure this deficiency. It introduces a new kind of dynamical control, termed synaptic modulation, according to which synapses switch between a conducting and a non- conducting state. The dynamics of this variable is controlled on a fast time scale by correlations in the temporal fine structure of cellular signals. Furthermore, conventional synaptic plasticity is replaced by a refined version. Synaptic modulation and plasticity form the basis for short-term and long-term memory, respectively. Signal correlations, shaped by the variable network, express structure and relationships within objects. In particular, the figure-ground problem may be solved in this way. Synaptic modulation introduces flexibility into cerebral networks which is necessary to solve the invariance problem. Since momentarily useless connections are deactivated, interference between different memory traces can be reduced, and memory capacity increased, in comparison with conventional associative memory.}
}
@techreport{vonderMalsburgMPI-81,
        title = {The Correlation Theory of Brain Function},
       author = {Christoph von der Malsburg},
        issue = {{Internal Report 81-2}},
  institution = {Max Planck Institute for Biophysical Chemistry},
         year = 1981
}
@incollection{HintonandPlautCSS-87,
       author = {Hinton, G. E. and Plaut, D. C.},
        title = {Using fast weights to deblur old},
    booktitle = {Proceedings of the 9th Annual Conference of the Cognitive Science Society},
    publisher = {Lawrence Erlbaum Associates},
         year = {1987},
        pages = {177-186},
     abstract = {Connectionist models usually have a single weight on each connection. Some interesting new properties emerge if each connection has two weights: A slowly changing, plastic weight which stores long-term knowledge and a fast-changing, elastic weight which stores temporary knowledge and spontaneously decays towards zero. If a network learns a set of associations and then these associations are "blurred" by subsequent learning, all the original associations can be "deblurred" by rehearsing on just a few of them. The rehearsal allows the fast weights to take on values that temporarily cancel out the changes in the slow weights caused by the subsequent learning.}
}
@article{BaetalCoRR-16,
       author = {Jimmy Ba and Geoffrey Hinton and Volodymyr Mnih and Joel Z. Leibo and Catalin Ionescu},
        title = {Using Fast Weights to Attend to the Recent Past},
      journal = {CoRR},
       volume = {arXiv:1610.06258},
         year = {2016},
     abstract = {Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.}
}
@article{AnonymousICLR-17b,
        title = {Gated Fast Weights for Associative Retrieval},
       author = {Anonymous},
      journal = {International Conference on Learning Representations},
         year = {2017},
     abstract = {We improve previous end-to-end differentiable neural networks (NNs) with fast weight memories. A gate mechanism updates fast weights at every time step of a sequence through two separate outer-product-based matrices generated by slow parts of the net. The system is trained on a complex sequence to sequence variation of the Associative Retrieval Problem with roughly 70 times more temporal memory (i.e. time-varying variables) than similar-sized standard recurrent NNs (RNNs). In terms of accuracy and number of parameters, our architecture outperforms a variety of RNNs, including Long Short-Term Memory, Hypernetworks, and related fast weight architectures.}
}
@article{ZhuandvonderMalsburgNEUROCOMPUTING-02,
       author = {Junmei Zhu and Christoph von der Malsburg},
        title = {Synapto–synaptic interactions speed up dynamic link matching},
      journal = {Neurocomputing},
       volume = {44-46},
         year = {2002},
        pages = {721-728},
     abstract = {An extended dynamic link matching (DLM) model is presented for the fast creation of mappings invariant to position, scale, deformation and orientation, a process underlying various visual tasks that are fast and effortless for humans but are computationally very difficult. Structured interactions between synapses are represented by control units. Each control unit stands for a group of synapses that are consistent with each other in terms of relative position, scale and orientation. Specifically, it is shown in simulations how these high-order interactions between synapses lead to very fast DLM.}
}
@article{HuangetalCoRR-17,
       author = {{Huang}, Q. and {Smolensky}, P. and {He}, X. and {Deng}, L. and {Wu}, D.},
        title = {{Tensor Product Generation Networks}},
      journal = {CoRR},
         year = 2017,
     abstract = {We present a new tensor product generation network (TPGN) that generates natural language descriptions for images. The model has a novel architecture that instantiates a general framework for encoding and processing symbolic structure through neural network computation. This framework is built on Tensor Product Representations (TPRs). We evaluated the proposed TPGN on the MS COCO image captioning task. The exeprimental results show that the TPGN outperforms the LSTM based state-of-the-art baseline with a significant margin. Further, we show that our caption generation model can be interpreted as generating sequences of grammatical categories and retrieving words by their categories from a plan encoded as a distributed representation.}
}
@book{SmolenskyandLegendre2011,
        title = {The Harmonic Mind: From Neural Computation to Optimality-theoretic Grammar. Linguistic and philosophical implications},
       author = {Smolensky, P. and Legendre, G.},
       number = {{II}},
         year = {2011},
    publisher = {MIT Press}
}

⁴⁹ K. L. Parker, Y. C. Kim, R. M. Kelley, A. J. Nessler, K-H Chen, V. A. Muller-Ewald, N. C. Andreasen, N. S. Narayanan. Delta-frequency stimulation of cerebellar projections can compensate for schizophrenia-related medial frontal dysfunction. Molecular Psychiatry, (2017).

The latest findings by Parker and colleagues provide fresh insights into how the cerebellum influences neural networks in the frontal lobes and the role of the cerebellum in cognitive processing. This study also suggests that delta-frequency cerebellar stimulation might help improve cognitive problems in human patients with schizophrenia.

Mark J. Wagner, Tony Hyun Kim, Joan Savall, Mark J. Schnitzer, Liqun Luo. Cerebellar granule cells encode the expectation of reward. Nature (2017)

Along this line, last week, neuroscientists at Stanford University serendipitously discovered (for the first time) that granule cells in the cerebellum encode and predict rewards. The Stanford study, "Cerebellar Granule Cells Encode the Expectation of Reward," was published March 20 online ahead of print in the journal Nature.

Manish Saggar, Eve-Marie Quintin, Nicholas T. Bott, Eliza Kienitz, Yin-hsuan Chien, Daniel W-C. Hong, Ning Liu, Adam Royalty, Grace Hawthorne, Allan L. Reiss. Changes in Brain Activation Associated with Spontaneous Improvization and Figural Creativity After Design-Thinking-Based Training: A Longitudinal fMRI Study. Cerebral Cortex (2016).

Another team of researchers at Stanford University, led by Manish Saggar, reported that optimizing cerebral-cerebellar connectivity increases creative capacity. Saggar's team found that suppressing the executive-control functions of the cerebrum — while encouraging the cerebellum to become the "controller" — increased spontaneous creative capacity.

⁴⁸ Excerpts and page links from Wikipedia on brain structures relevant to attention and executive control. They are provided here as a convenient reference for quick review. Follow the supplied links for the full story and remember that this is Wikipedia and not the latest — 5th edition (October 26, 2012) — edition of Kandel, Schwartz, Jessell, Siegelbaum and Hudspeth [156]:

Tectum — In adult humans, it only consists of the inferior and the superior colliculi.
1. The superior colliculus is involved in preliminary visual processing and control of eye movements. In non-mammalian vertebrates it serves as the main visual area of the brain, functionally analogous to the visual areas of the cerebral cortex in mammals.
2. The inferior colliculus is involved in auditory processing. It receives input from various brain stem nuclei and projects to the medial geniculate nucleus of the thalamus, which relays auditory information to the primary auditory cortex.
Both colliculi also have descending projections to the paramedian pontine reticular formation and spinal cord, and thus can be involved in responses to stimuli faster than cortical processing would allow. Collectively the colliculi are referred to as the corpora quadrigemina.
Thalamus — The thalamus is the large mass of gray matter in the dorsal part of the diencephalon of the brain with several functions such as relaying of sensory signals, including motor signals, to the cerebral cortex, and the regulation of consciousness, sleep, and alertness.
1. The thalamus has multiple functions, generally believed to act as a relay station, or hub, relaying information between different subcortical areas and the cerebral cortex. In particular, every sensory system (with the exception of the olfactory system) includes a thalamic nucleus that receives sensory signals and sends them to the associated primary cortical area.
2. For the visual system, for example, inputs from the retina are sent to the lateral geniculate nucleus of the thalamus, which in turn projects to the visual cortex in the occipital lobe. The thalamus is believed to both process sensory information as well as relay it—each of the primary sensory relay areas receives strong feedback connections from the cerebral cortex.
Basal Ganglia — The basal ganglia (or basal nuclei) is a group of subcortical nuclei, of varied origin, in the brains of vertebrates including humans, which are situated at the base of the forebrain. Basal ganglia are strongly interconnected with the cerebral cortex, thalamus, and brainstem, as well as several other brain areas. The basal ganglia are associated with a variety of functions including: control of voluntary motor movements, procedural learning, routine behaviors or "habits" such as teeth grinding, eye movements, cognition, and emotion.
1. Popular theories implicate the basal ganglia primarily in action selection — in helping to decide which of several possible behaviors to execute at any given time. In more specific terms, the basal ganglia's primary function is likely to control and regulate activities of the motor and premotor cortical areas so that voluntary movements can be performed smoothly.
2. Experimental studies show that the basal ganglia exert an inhibitory influence on a number of motor systems, and that a release of this inhibition permits a motor system to become active. The "behavior switching" that takes place within the basal ganglia is influenced by signals from many parts of the brain, including the prefrontal cortex, which plays a key role in executive functions⁴⁹.
Prefrontal Cortex — This brain region has been implicated in planning complex cognitive behavior, personality expression, decision making, and moderating social behavior. The basic activity of this brain region is considered to be orchestration of thoughts and actions in accordance with internal goals.

The most typical psychological term for functions carried out by the prefrontal cortex area is executive function. Executive function relates to abilities to differentiate among conflicting thoughts, determine good and bad, better and best, same and different, future consequences of current activities, working toward a defined goal, prediction of outcomes, expectation based on actions, and social "control" (the ability to suppress urges that, if not suppressed, could lead to socially unacceptable outcomes).

⁵⁰ The concept of working memory is defined in terms of how it facilitates problem solving. Its relationship to the concepts of short- and long-term memory are described in [49] and summarized in the paper's abstract below. Cowan [49] emphasizes the connection between working memory and attention. O'Reilly [220] emphasizes the active maintenance and rapid updating of information required for planning and decision making in terms of different proposed gating mechanisms, and uses related mechanisms to explain dynamic variable binding. Eliasmith [86] and Stocco et al [271] provide additional detail concerning the role of the basal ganglia.

The exact location of working memory in the brain is still a matter of some controversy due, in no small part, to fuzziness about what constitutes working memory and how we might go about localizing it. Rottschy et al [249] performed a coordinate-based meta-analysis over 189 fMRI experiments involving various tasks relying on the exercise of working memory tasks. The main effects across all of these experiments involving working memory reveal consistent bilateral activity of fronto-parietal networks clearly visible in the fMRI data accompanying the Rottschy et al [249] paper and can explored using the 3-D graphics tools provided here.

@article{CowanPBR-08,
       author = {Cowan, Nelson},
        title = {What are the differences between long-term, short-term and working memory?},
      journal = {Progress in Brain Research},
         year = {2008},
       volume = {169},
        pages = {323-338},
     abstract = {In the recent literature there has been considerable confusion about the three types of memory: long-term, short-term, and working memory. This chapter strives to reduce that confusion and makes up-to-date assessments of these types of memory. Long- and short-term memory could differ in two fundamental ways, with only short-term memory demonstrating (1) temporal decay and (2) chunk capacity limits. Both properties of short-term memory are still controversial but the current literature is rather encouraging regarding the existence of both decay and capacity limits. Working memory has been conceived and defined in three different, slightly discrepant ways: as short-term memory applied to cognitive tasks, as a multi-component system that holds and manipulates information in short-term memory, and as the use of attention to manage short-term memory. Regardless of the definition, there are some measures of memory in the short term that seem routine and do not correlate well with cognitive aptitudes and other measures (those usually identified with the term 'working memory') that seem more attention demanding and do correlate well with these aptitudes. The evidence is evaluated and placed within a theoretical framework depicted in Figure 1.},
}
@book{Eliasmith2013,
        title = {How to Build a Brain: A Neural Architecture for Biological Cognition},
       author = {Eliasmith, Chris},
       series = {Oxford Series on Cognitive Modeling},
         year = {2013},
    publisher = {Oxford University Press {USA}},
     abstract = {One goal of researchers in neuroscience, psychology, and artificial intelligence is to build theoretical models that are able to explain the flexibility and adaptiveness of biological systems. How to build a brain provides a detailed guided exploration of a new cognitive architecture that takes biological detail seriously, while addressing cognitive phenomena. The Semantic Pointer Architecture (SPA) introduced in this book provides a set of tools for constructing a wide range of biologically constrained perceptual, cognitive, and motor models.}
}
@article{OReillySCIENCE-06,
     title = {Biologically Based Computational Models of High-Level Cognition},
    author = {O'Reilly, Randall C.},
   journal = {Science},
    volume = 314,
     issue = 5796,
      year = 2006,
     pages = {91-94},
  abstract = {Computer models based on the detailed biology of the brain can help us understand the myriad complexities of human cognition and intelligence. Here, we review models of the higher level aspects of human intelligence, which depend critically on the prefrontal cortex and associated subcortical areas. The picture emerging from a convergence of detailed mechanistic models and more abstract functional models represents a synthesis between analog and digital forms of computation. Specifically, the need for robust active maintenance and rapid updating of information in the prefrontal cortex appears to be satisfied by bistable activation states and dynamic gating mechanisms. These mechanisms are fundamental to digital computers and may be critical for the distinctive aspects of human intelligence.},
}
@article{RottschyetalNEUROIMAGE-12,
       author = {Rottschy, C. and Langner, R. and Dogan, I. and Reetz, K. and Laird, A. R. and Schulz, J. B. and Fox, P. T. and Eickhoff, S. B.},
        title = {Modelling neural correlates of working memory: A coordinate-based meta-analysis},
      journal = {Neuroimage},
         year = {2012},
       volume = {60},
        issue = {1},
        pages = {830-846},
     abstract = {Working memory subsumes the capability to memorize, retrieve and utilize information for a limited period of time which is essential to many human behaviours. Moreover, impairments of working memory functions may be found in nearly all neurological and psychiatric diseases. To examine what brain regions are commonly and differently active during various working memory tasks, we performed a coordinate-based meta-analysis over 189 fMRI experiments on healthy subjects. The main effect yielded a widespread bilateral fronto-parietal network. Further meta-analyses revealed that several regions were sensitive to specific task components, e.g. Broca's region was selectively active during verbal tasks or ventral and dorsal premotor cortex were preferentially involved in memory for object identity and location, respectively. Moreover, the lateral prefrontal cortex showed a division in a rostral and a caudal part based on differential involvement in task-set and load effects. Nevertheless, a consistent but more restricted core network emerged from conjunctions across analyses of specific task designs and contrasts. This core network appears to comprise the quintessence of regions, which are necessary during working memory tasks. It may be argued that the core regions form a distributed executive network with potentially generalized functions for focusing on competing representations in the brain. The present study demonstrates that meta-analyses are a powerful tool to integrate the data of functional imaging studies on a (broader) psychological construct, probing the consistency across various paradigms as well as the differential effects of different experimental implementations.},
}
@article{StoccoetalPR-10,
       author = {Stocco, Andrea and Lebiere, Christian and Anderson, John R.},
        title = {Conditional Routing of Information to the Cortex: A Model of the Basal Ganglia's Role in Cognitive Coordination},
      journal = {Psychology Review},
         year = {2010},
       volume = {117},
        issue = {2},
        pages = {541-574},
     abstract = {The basal ganglia play a central role in cognition and are involved in such general functions as action selection and reinforcement learning. Here, we present a model exploring the hypothesis that the basal ganglia implement a conditional information-routing system. The system directs the transmission of cortical signals between pairs of regions by manipulating separately the selection of sources and destinations of information transfers. We suggest that such a mechanism provides an account for several cognitive functions of the basal ganglia. The model also incorporates a possible mechanism by which subsequent transfers of information control the release of dopamine. This signal is used to produce novel stimulus-response associations by internalizing transferred cortical representations in the striatum. We discuss how the model is related to production systems and cognitive architectures. A series of simulations is presented to illustrate how the model can perform simple stimulus-response tasks, develop automatic behaviors, and provide an account of impairments in Parkinson's and Huntington's diseases.},
}
@article{TrubutschekBIORXIV-16,
       author = {Tr\"{u}butschek, Darinka and Marti, S\'{e}bastien and Ojeda, Andr\'{e}s and King, Jean-R\'{e}mi and Mi, Yuanyuan and Tsodyks, Misha and Dehaene, Stanislas},
        title = {A theory of working memory without consciousness or sustained activity},
      journal = {Biorxiv},
         year = {2016},
     abstract = {Working memory and conscious perception are thought to share similar brain mechanisms, yet recent reports of non-conscious working memory challenge this view. Combining visual masking with magnetoencephalography, we demonstrate the reality of nonconscious working memory and dissect its neural mechanisms. In a spatial delayed-response task, participants reported the location of a subjectively unseen target above chance-level after a long delay. Conscious perception and conscious working memory were characterized by similar signatures: a sustained desynchronization in the alpha/beta band over frontal cortex, and a decodable representation of target location in posterior sensors. During non-conscious working memory, such activity vanished. Our findings contradict models that identify working memory with sustained neural firing, but are compatible with recent proposals of ‘activity-silent’ working memory. We present a theoretical framework and simulations showing how slowly decaying synaptic changes allow cell assemblies to go dormant during the delay, yet be retrieved above chance-level after several seconds.}
}
@article{ZhaoetalNATURE-16,
       author = {Zhou, Xin and Zhu, Dantong and Qi, Xue-Lian and Li, Sihai and King, Samson G. and Salinas, Emilio and Stanford, Terrence R. and Constantinidis, Christos},
        title = {Neural correlates of working memory development in adolescent primates},
      journal = {Nature Communications},
    publisher = {Nature Publishing Group},
         year = {2016},
       volume = {9},
        issue = {7},
        pages = {13423},
     abstract = {Working memory ability matures after puberty, in parallel with structural changes in the prefrontal cortex, but little is known about how changes in prefrontal neuronal activity mediate this cognitive improvement in primates. To address this issue, we compare behavioural performance and neurophysiological activity in monkeys as they transitioned from puberty into adulthood. Here we report that monkeys perform working memory tasks reliably during puberty and show modest improvement in adulthood. The adult prefrontal cortex is characterized by increased activity during the delay period of the task but no change in the representation of stimuli. Activity evoked by distracting stimuli also decreases in the adult prefrontal cortex. The increase in delay period activity relative to the baseline activity of prefrontal neurons is the best correlate of maturation and is not merely a consequence of improved performance. Our results reveal neural correlates of the working memory improvement typical of primate adolescence.},
}

⁵¹ Two relative recent papers on simulating and reasoning about physical systems with an emphasis of continuous models that can be easily visualized:

% @article{BattagliaetalCoRR-16,
       author = {Peter W. Battaglia and Razvan Pascanu and Matthew Lai and Danilo Jimenez Rezende and Koray Kavukcuoglu},
        title = {Interaction Networks for Learning about Objects, Relations and Physics},
      journal = {CoRR},
       volume = {arXiv:1612.00222},
         year = {2016},
     abstract = {Reasoning about objects, relations, and physics is central to human intelligence, and a key goal of artificial intelligence. Here we introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Our model takes graphs as input, performs object- and relation-centric reasoning in a way that is analogous to a simulation, and is implemented using deep neural networks. We evaluate its ability to reason about several challenging physical domains: n-body problems, rigid-body collision, and non-rigid dynamics. Our results show it can be trained to accurately simulate the physical trajectories of dozens of objects over thousands of time steps, estimate abstract quantities such as energy, and generalize automatically to systems with different numbers and configurations of objects and relations. Our interaction network implementation is the first general-purpose, learnable physics engine, and a powerful general framework for reasoning about object and relations in a wide variety of complex real-world domains.}
}
@article{ChangetalCoRR-16,
       author = {Michael B. Chang and Tomer Ullman and Antonio Torralba and Joshua B. Tenenbaum},
        title = {A Compositional Object-Based Approach to Learning Physical Dynamics},
      journal = {CoRR},
       volume = {arXiv:1612.00341},
         year = {2016},
     abstract = {We present the Neural Physics Engine (NPE), a framework for learning simulators of intuitive physics that naturally generalize across variable object count and different scene configurations. We propose a factorization of a physical scene into composable object-based representations and a neural network architecture whose compositional structure factorizes object dynamics into pairwise interactions. Like a symbolic physics engine, the NPE is endowed with generic notions of objects and their interactions; realized as a neural network, it can be trained via stochastic gradient descent to adapt to specific object properties and dynamics of different worlds. We evaluate the efficacy of our approach on simple rigid body dynamics in two-dimensional worlds. By comparing to less structured architectures, we show that the NPE's compositional representation of the structure in physical interactions improves its ability to predict movement, generalize across variable object count and different scene configurations, and infer latent properties of objects such as mass.},
}

⁵² Maseo Ito's 2012 monograph summarizing 50 years of work on the cerebellum and related cortical structures:

@book{Ito2012,
        title = {The Cerebellum: Brain for an Implicit Self},
       author = {Ito, Masao},
    publisher = {FT Press},
         year = {2012},
}

⁵³ Review articles relating to the control of non-motor activities that involve closed loop connections between the cerebral corted and in particular prefrontal region and the cerebellar cortex via the basal ganglia, dentate gyrus and thalamus:

@article{ItoNRN-08,
       author = {Ito, Masao},
        title = {Control of mental activities by internal models in the cerebellum},
      journal = {Nature Reviews Neuroscience},
    publisher = {Nature Publishing Group},
       volume = 9,
        issue = 4,
         year = 2008,
        pages = {304-313},
     abstract = {The intricate neuronal circuitry of the cerebellum is thought to encode internal models that reproduce the dynamic properties of body parts. These models are essential for controlling the movement of these body parts: they allow the brain to precisely control the movement without the need for sensory feedback. It is thought that the cerebellum might also encode internal models that reproduce the essential properties of mental representations in the cerebral cortex. This hypothesis suggests a possible mechanism by which intuition and implicit thought might function and explains some of the symptoms that are exhibited by psychiatric patients. This article examines the conceptual bases and experimental evidence for this hypothesis.},
}
@article{StricketalARN-09,
       author = {Peter L. Strick and Richard P. Dum and Julie A. Fiez},
        title = {Cerebellum and Nonmotor Function},
      journal = {Annual Review of Neuroscience},
       volume = {32},
       number = {1},
        pages = {413-434},
         year = {2009},
     abstract = {Does the cerebellum influence nonmotor behavior? Recent anatomical studies demonstrate that the output of the cerebellum targets multiple nonmotor areas in the prefrontal and posterior parietal cortex, as well as the cortical motor areas. The projections to different cortical areas originate from distinct output channels within the cerebellar nuclei. The cerebral cortical area that is the main target of each output channel is a major source of input to the channel. Thus, a closed-loop circuit represents the major architectural unit of cerebro-cerebellar interactions. The outputs of these loops provide the cerebellum with the anatomical substrate to influence the control of movement and cognition. Neuroimaging and neuropsychological data supply compelling support for this view. The range of tasks associated with cerebellar activation is remarkable and includes tasks designed to assess attention, executive control, language, working memory, learning, pain, emotion, and addiction. These data, along with the revelations about cerebro-cerebellar circuitry, provide a new framework for exploring the contribution of the cerebellum to diverse aspects of behavior. }
}
@article{WatsonetalFiSN-14,
       author = {Watson, Thomas C. and Becker, Nadine and Apps, Richard and Jones, Matthew W.},
        title = {Back to front: cerebellar connections and interactions with the prefrontal cortex},
      journal = {Frontiers in Systems Neuroscience},
         year = {2014},
    publisher = {Frontiers Media S.A.},
       volume = {8},
        pages = {4},
     abstract = {Although recent neuroanatomical evidence has demonstrated closed-loop connectivity between prefrontal cortex and the cerebellum, the physiology of cerebello-cerebral circuits and the extent to which cerebellar output modulates neuronal activity in neocortex during behavior remain relatively unexplored. We show that electrical stimulation of the contralateral cerebellar fastigial nucleus (FN) in awake, behaving rats evokes distinct local field potential (LFP) responses (onset latency {\textasciitilde}13 ms) in the prelimbic (PrL) subdivision of the medial prefrontal cortex. Trains of FN stimulation evoke heterogeneous patterns of response in putative pyramidal cells in frontal and prefrontal regions in both urethane-anesthetized and awake, behaving rats. However, the majority of cells showed decreased firing rates during stimulation and subsequent rebound increases; more than 90\% of cells showed significant changes in response. Simultaneous recording of on-going LFP activity from FN and PrL while rats were at rest or actively exploring an open field arena revealed significant network coherence restricted to the theta frequency range (5-10 Hz). Granger causality analysis indicated that this coherence was significantly directed from cerebellum to PrL during active locomotion. Our results demonstrate the presence of a cerebello-prefrontal pathway in rat and reveal behaviorally dependent coordinated network activity between the two structures, which could facilitate transfer of sensorimotor information into ongoing neocortical processing during goal directed behaviors.},
}

⁵⁴ Original papers by David Marr and James Albus on theories of learning in the cerebellar cortex plus early related work by Eccles, Itō, and Szentágothai and a more recent overview by Raymond et al in Science:

@article{AlbusMB-71,
       author = {Albus, James S.},
        title = {A Theory of Cerebellar Functions},
      journal = {Mathematical Biology},
       volume = 10,
         year = 1971,
        pages = {25-61},
}
@book{Ecclesetal1967,
        title = {The Cerebellum as a Neuronal Machine},
       author = {Eccles, J.C. and It\={o}, M. and Szent\'{a}gothai, J.},
         year = {1967},
    publisher = {Springer-Verlag}
}
@article{ItoNRN-08,
       author = {Ito, Masao},
        title = {Control of mental activities by internal models in the cerebellum},
      journal = {Nature Reviews Neuroscience},
    publisher = {Nature Publishing Group},
       volume = 9,
        issue = 4,
         year = 2008,
        pages = {304-313},
     abstract = {The intricate neuronal circuitry of the cerebellum is thought to encode internal models that reproduce the dynamic properties of body parts. These models are essential for controlling the movement of these body parts: they allow the brain to precisely control the movement without the need for sensory feedback. It is thought that the cerebellum might also encode internal models that reproduce the essential properties of mental representations in the cerebral cortex. This hypothesis suggests a possible mechanism by which intuition and implicit thought might function and explains some of the symptoms that are exhibited by psychiatric patients. This article examines the conceptual bases and experimental evidence for this hypothesis.},
}
@article{MarrJoP-69,
       author = {Marr, David},
        title = {A Theory of Cerebellar Cortex},
      journal = {Journal of Physiology},
       volume = 202,
         year = 1969,
        pages = {437-470},
}  
@article{RaymondetalSCIENCE-96,
       author = {Jennifer L. Raymond and Stephen G. Lisberger and Michael D. Mauk},
        title = {The Cerebellum: A Neuronal Learning Machine?},
      journal = {Science},
       volume = 722,
        issue = 6158,
         year = 1996,
     abstract = {The comparison of two seemingly quite different behaviors yields a surprisingly consistent picture of the role of the cerebellum in motorlearning. Behavioral and physiological data about classical conditioning of the eye lid response and motor learning in the vestibulo-ocular reflex suggest that (i) plasticity is distributed between the cerebellar cortex and the deep cerebellar nuclei; (i) the cerebellar cortex plays a special role in learning the timing of movement; and (i) the cerebellar cortex guides learning in the deep nuclei, which may for movement coordination allow learning to be transferred from the cortex to the deep nuclei. Because many of the similarities in the data from the two systems typify general features of cerebellar organization, the cerebellar mechanisms of learning in these two systems may represent principles that apply to many motor systems.}
}

⁵⁵ For your convenience, here is the citation including the abstract of the Reed and de Freitas paper that serves as one of the primary inspirations for analyzing execution traces. You might also find convenient these links relating to technologies for analyzing and generating execution traces: software tracing, Unix Kernel function tracers, control flow graph and more conventional data-flow analysis.

@article{ReedandDeFreitasCoRR-15,
       author = {Scott E. Reed and Nando de Freitas},
        title = {Neural Programmer-Interpreters},
      journal = {CoRR},
       volume = {arXiv:1511.06279},
         year = {2015},
     abstract = {We propose the neural programmer-interpreter (NPI): a recurrent and compositional neural network that learns to represent and execute programs. NPI has three learnable components: a task-agnostic recurrent core, a persistent key-value program memory, and domain-specific encoders that enable a single NPI to operate in multiple perceptually diverse environments with distinct affordances. By learning to compose lower-level programs to express higher-level programs, NPI reduces sample complexity and increases generalization ability compared to sequence-to-sequence LSTMs. The program memory allows efficient learning of additional tasks by building on existing programs. NPI can also harness the environment (e.g. a scratch pad with read-write pointers) to cache intermediate results of computation, lessening the long-term memory burden on recurrent hidden units. In this work we train the NPI with fully-supervised execution traces; each program has example sequences of calls to the immediate subprograms conditioned on the input. Rather than training on a huge number of relatively weak labels, NPI learns from a small number of rich examples. We demonstrate the capability of our model to learn several types of compositional programs: addition, sorting, and canonicalizing 3D models. Furthermore, a single NPI learns to execute these programs and all 21 associated subprograms.}
}

⁵⁶ Here is a relatively simple program written in Emacs Lisp that normalizes BibTeX entries for readability and compatibility with the different Emacs modes that I routinely use in my Emacs environment for coding, writing papers and maintaining the repository that I use for my research notes:

;; bibtex-fill-entry = formats fill according to following parameters which are required by the code below  

(defcustom bibtex-align-at-equal-sign t
  "If non-nil, align fields at equal sign instead of field text. If non-nil, the
   column for the equal sign is the value of `bibtex-text-indentation', minus 2."
  :group 'bibtex
  :type 'boolean)

(defcustom bibtex-comma-after-last-field t
  "If non-nil, a comma is put at end of last field in the entry template."
  :group 'bibtex
  :type 'boolean)

;; Offsets are based on longest field name, in this case "organization":

(defcustom bibtex-field-indentation 1
  "Starting column for the name part in BibTeX fields."
  :group 'bibtex
  :type 'integer)

(defcustom bibtex-text-indentation
  (+ bibtex-field-indentation
     (length "organization = "))
  "Starting column for the text part in BibTeX fields."
  :group 'bibtex
  :type 'integer)

(defvar bibtex-field-alignment-adjustments
  '([ " author       = " "       author = " ]
    [ " abstract     = " "     abstract = " ]
    [ " address      = " "      address = " ]
    [ " booktitle    = " "    booktitle = " ]
    [ " comment      = " "      comment = " ]
    [ " crossref     = " "     crossref = " ]
    [ " editor       = " "       editor = " ]
    [ " institution  = " "  institution = " ]
    [ " issue        = " "        issue = " ]
    [ " journal      = " "      journal = " ]
    [ " location     = " "     location = " ]    
    [ " number       = " "       number = " ]
    [ " organization = " " organization = " ]    
    [ " pages        = " "        pages = " ]
    [ " pdf          = " "          pdf = " ]
    [ " publisher    = " "    publisher = " ]
    [ " series       = " "       series = " ]
    [ " title        = " "        title = " ]
    [ " school       = " "       school = " ]
    [ " url          = " "          url = " ]
    [ " volume       = " "       volume = " ]
    [ " year         = " "         year = " ]))

;; /Applications/Emacs.app/Contents/Resources/lisp/textmodes/bibtex.el

(defun bibtex-format-fields-region ()
  (interactive)
  (bibtex-fill-entry)
  (let ((start (bibtex-beginning-of-entry)) (end (bibtex-end-of-entry)))
    (untabify start end)
    (replace-pairs-region start end bibtex-field-alignment-adjustments)))

(defun replace-pairs-region (p1 p2 pairs)
  (let ((mapping-point-seed 1000) index (mapping-points '()))
    (setq index 0)
    (while (< index (length pairs))
      (setq mapping-points (cons (number-to-string (+ mapping-point-seed index)) mapping-points ))
      (setq index (1+ index)))
    (save-excursion
      (save-restriction
        (narrow-to-region p1 p2)
        ;; Replace each find string by corresponding item in mapping-points:
        (setq index 0)
        (while (< index (length pairs))
          (goto-char (point-min))
          (while (search-forward (elt (elt pairs index) 0) nil t)
            (replace-match (elt mapping-points index) t t) )
          (setq index (1+ index)) )
        ;; Replace each mapping-point by the corresponding replacement string:
        (setq index 0)
        (while (< index (length pairs))
          (goto-char (point-min))
          (while (search-forward (elt mapping-points index) nil t)
            (replace-match (elt (elt pairs index) 1) t t) )
          (setq index (1+ index)))
        ;; Replace all double quotes used as delimiters with curly brackets:
        (goto-char (point-min))
        (while (search-forward-regexp "= \"\\([^\"]+\\)\"" nil t) (replace-match "= {\\1}" t nil)) ))))

⁵⁷ A list of BibTeX items relating to automated or assistive programming research. Gulwani et al [122] on program synthesis — foundations and trends — is added to the mix as it provides some useful background on dinfferent methods of both historical and current interest:

@inproceedings{DongandLapataACL-16,
       author = {Li Dong and Mirella Lapata},
        title = {Language to Logical Form with Neural Attention},
    booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics},
        pages = {33-43},
         year = {2016},
    publisher = {Association for Computational Linguistics},
      address = {Stroudsburg, PA, USA},
     abstract = {Semantic parsing aims at mapping natural language to machine interpretable meaning representations. Traditional approaches rely on high-quality lexicons, manually-built templates, and linguistic features which are either domain- or representation-specific. In this paper we present a general method based on an attention-enhanced encoder-decoder model. We encode input utterances into vector representations, and generate their logical forms by conditioning the output sequences or trees on the encoding vectors. Experimental results on four datasets show that our approach performs competitively without using hand-engineered features and is easy to adapt across domains and meaning representations.}
}
@article{LuongetalEMNLP-15,
       author = {Minh{-}Thang Luong and Hieu Pham and Christopher D. Manning},
        title = {Effective Approaches to Attention-based Neural Machine Translation},
    booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
        pages = {1412-1421},
         year = {2015},
     abstract = {An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.}
}
@article{GulwanietalFaTiPL-17,
       author = {Sumit Gulwani and Oleksandr Polozov and Rishabh Singh},
        title = {Program Synthesis},
      journal = {Foundations and Trends in Programming Languages},
       volume = {4},
       number = {1-2},
         year = {2017},
        pages = {1-119},
     abstract = {Program synthesis is the task of automatically finding a program in the underlying programming language that satisfies the user intent expressed in the form of some specification. Since the inception of AI in the 1950s, this problem has been considered the holy grail of Computer Science. Despite inherent challenges in the problem such as ambiguity of user intent and a typically enormous search space of programs, the field of program synthesis has developed many different techniques that enable program synthesis in different real-life application domains. It is now used successfully in software engineering, biological discovery, computeraided education, end-user programming, and data cleaning. In the last decade, several applications of synthesis in the field of programming by examples have been deployed in mass-market industrial products. This survey is a general overview of the state-of-the-art approaches to program synthesis, its applications, and subfields. We discuss the general principles common to all modern synthesis approaches such as syntactic bias, oracle-guided inductive `xbibsearch, and optimization techniques. We then present a literature review covering the four most common state-of-the-art techniques in program synthesis: enumerative search, constraint solving, stochastic search, and deduction-based programming by examples. We conclude with a brief list of future horizons for the field.}
}
@article{GauntetalCoRR-16,
       author = {Alexander L. Gaunt and Marc Brockschmidt and Rishabh Singh and Nate Kushman and Pushmeet Kohli and Jonathan Taylor and Daniel Tarlow},
        title = {{TerpreT}: {A} Probabilistic Programming Language for Program Induction},
      journal = {CoRR},
       volume = {arXiv:1612.00817},
         year = {2016},
     abstract = {We study machine learning formulations of inductive program synthesis; that is, given input-output examples, synthesize source code that maps inputs to corresponding outputs. Our key contribution is TerpreT, a domain-specific language for expressing program synthesis problems. A TerpreT model is composed of a specification of a program representation and an interpreter that describes how programs map inputs to outputs. The inference task is to observe a set of input-output examples and infer the underlying program. From a TerpreT model we automatically perform inference using four different back-ends: gradient descent (thus each TerpreT model can be seen as defining a differentiable interpreter), linear program (LP) relaxations for graphical models, discrete satisfiability solving, and the Sketch program synthesis system. TerpreT has two main benefits. First, it enables rapid exploration of a range of domains, program representations, and interpreter models. Second, it separates the model specification from the inference algorithm, allowing proper comparisons between different approaches to inference. We illustrate the value of TerpreT by developing several interpreter models and performing an extensive empirical comparison between alternative inference algorithms on a variety of program models. To our knowledge, this is the first work to compare gradient-based search over program space to traditional search-based alternatives. Our key empirical finding is that constraint solvers dominate the gradient descent and LP-based formulations.}
}
@article{AnonymousICLR-17,
        title = {Neural Program Search: Solving Programming Tasks from Description and Examples},
       author = {Anonymous},
      journal = {International Conference on Learning Representations},
         year = {2017},
          url = {https://openreview.net/forum?id=B1KJJf-R-},
     abstract = {We present Neural Program Search, an algorithm to generate programs from natural language description and a small number of input / output examples. The algorithm combines methods from Deep Learning and Program Synthesis fields by designing rich domain-specific language (DSL) and defining efficient search algorithm guided by a Seq2Tree model on it. To evaluate the quality of the approach we also present a semi-synthetic dataset of descriptions with test examples and corresponding programs. We show that our algorithm significantly outperforms sequence-to-sequence model with attention baseline.}
}
@article{Alvarez-MelisandJaakkolaCoRR-17,
       author = {David Alvarez{-}Melis and Tommi S. Jaakkola},
        title = {A causal framework for explaining the predictions of black-box sequence-to-sequence models},
      journal = {CoRR},
       volume = {arXiv:1707.01943},
         year = {2017},
     abstract = {We interpret the predictions of any black-box structured input-structured output model around a specific input-output pair. Our method returns an "explanation" consisting of groups of input-output tokens that are causally related. These dependencies are inferred by querying the black-box model with perturbed inputs, generating a graph over tokens from the responses, and solving a partitioning problem to select the most relevant components. We focus the general approach on sequence-to-sequence problems, adopting a variational autoencoder to yield meaningful input perturbations. We test our method across several NLP sequence generation tasks.}
}
@techreport{LinetalUW-CSE-TR-17,
       author = {Xi Victoria Lin and Chenglong Wang and Deric Pang and Kevin Vu and Luke Zettlemoyer and Michael D. Ernst},
        title = {Program synthesis from natural language using recurrent neural networks},
  institution = {University of Washington Department of Computer Science and Engineering},
       number = {UW-CSE-17-03-01},
      address = {Seattle, WA, USA},
         year = {2017},
     abstract = {Oftentimes, a programmer may have difficulty implementing a desired operation. Even when the programmer can describe her goal in English, it can be difficult to translate into code. Existing resources, such as question-and-answer websites, tabulate specific operations that someone has wanted to perform in the past, but they are not effective in generalizing to new tasks, to compound tasks that require combining previous questions, or sometimes even to variations of listed tasks. Our goal is to make programming easier and more productive by letting programmers use their own words and concepts to express the intended operation, rather than forcing them to accommodate the machine by memorizing its grammar. We have built a system that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language for review and approval by the programmer. Our system, Tellina, does the translation using recurrent neural networks (RNNs), a state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements. We evaluated Tellina in the context of shell scripting. We trained Tellina’s RNNs on textual descriptions of file system operations and bash one-liners, scraped from the web. Although recovering completely correct commands is challenging, Tellina achieves top-3 accuracy of 80\% for producing the correct command structure. In a controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina’s predictions were not completely correct, to a statistically significant degree.}
}

⁵⁸ In humans, the cerebellum plays an important role in motor control, and it may also be involved in some cognitive functions such as attention and language as well as in regulating fear and pleasure responses, but its movement-related functions are the most solidly established. The human cerebellum does not initiate movement, but contributes to coordination, precision, and accurate timing: it receives input from sensory systems of the spinal cord and from other parts of the brain, and integrates these inputs to fine-tune motor activity. (SOURCE)

⁵⁹ As this article on Botton rhetorically mentions, "Where else can one find a video on the philosophy of Heidegger with over 350,000 views?" and goes on to suggest that, along with a diverse collection of frequently viewed self help material, they have a dedicated and growing audience as evidenced by the adoption and use of their smartphone applications.

⁶⁰ Here are few examples from The School of Life YouTube Channel that combine clever animated cartoons and Alain de Botton's narrative to encourage emotional intelligence. They are the sort of thing I would have liked to have on hand as a faculty advisor to help distraught graduate students, often living apart from friends and family for the first time in their lives, feeling out of place in a strange culture, under a great deal of academic pressure and questioning their choices. While I don't imagine you being in particular need of such advice, I thought you might enjoy the clever drawings and offered wisdom nonetheless:

How Emotionally Healthy Are You? Emotional health is defined by four markers: our degree of self-love, of openness, of communication and of trust.
Why You Shouldn't Trust Your Feelings It can be very hard to detect just how much our judgement is constantly affected by our feelings. We should — at points — take care to be very skeptical of our first impulses.
How to Process Your Emotions In order to be calm and at ease with ourselves, we need regular periods where we do something rather strange-sounding: process our emotions. Here is a guide to this essential psychological move.
The Dangers of Thinking Too Much; And Thinking Too Little There are dangers associated both with thinking too much — and thinking too little. The trick is to use our minds to access our most sincere, authentic and original thoughts.
Is It Better to Be Polite or Frank? We live in an age that thinks highly of frankness and directness. But there are — nevertheless — a few reasons why politeness remains a hugely important quality.
How to Remain Calm With People Remaining calm around people who annoy us is one of the great life skills. It's also a teachable and learnable skill.
How to Deal With A Crisis of Meaning Many of us are regularly thrown off course by what we might term 'crisis of meaning'; periods when what we are up to seems not to connect up with anything purposeful or properly dignified. It's at such moments that we need to lean on a bigger theory of what meaning is, where it comes from, and how our lives relate to it.

⁶¹ Here are few more examples from The School of Life YouTube Channel covering broader social and political issues. The YouTube channel has plenty fodder for its more intellectually inclined subscribers, e.g., the History of Ideas category including Capitalism, Manners, Rituals and Religion, and the Political Theory category covering Adam Smith, Karl Marx, John Locke, John Maynard Keynes and host of other historically important thinkers. The variety reflects Alain de Botton's eclectic taste, but has expanded over the years to cover a broader set of social issues leveraging different voices and media to reach a wider, perhaps more modern audience:

On the History of Consumerism It’s only very recently in history that we’ve been able to buy more than the bare necessities. Can the history of consumption guide us to a wiser future?
How to Find Fulfilling Work The key to finding fulfilling work is to think a lot, analyse one's fears, understand the market, reflect on capitalism.
How to Find Meaningful Work Contrary to some expectations, it isn’t only money we want from work. We also need our work to feel ‘meaningful’. But what exactly is meaningful work, and where can we find more of it?

⁶² Here is an assortment of papers that use semantic-embedding techniques to encode computer programs for a variety of applications:

@inproceedings{ChistyakovetalICLR-17,
       author = {Alexander Chistyakov and Ekaterina Lobacheva and Arseny Kuznetsov and Alexey Romanenko},
        title = {Semantic embeddings for program behaviour patterns},
    booktitle = {ICLR Workshop},
         year = {2017},
     abstract = {In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behaviour graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.}
}
@article{WangetalCoRR-17,
       author = {Ke Wang and Rishabh Singh and Zhendong Su},
        title = {Dynamic Neural Program Embedding for Program Repair},
      journal = {CoRR},
       volume = {arXiv:1711.07163},
         year = {2017},
     abstract = {Neural program embeddings have shown much promise recently for a variety of program analysis tasks, including program synthesis, program repair, fault localization, etc. However, most existing program embeddings are based on syntactic features of programs, such as raw token sequences or abstract syntax trees. Unlike images and text, a program has an unambiguous semantic meaning that can be difficult to capture by only considering its syntax (i.e. syntactically similar pro- grams can exhibit vastly different run-time behavior), which makes syntax-based program embeddings fundamentally limited. This paper proposes a novel semantic program embedding that is learned from program execution traces. Our key insight is that program states expressed as sequential tuples of live variable values not only captures program semantics more precisely, but also offer a more natural fit for Recurrent Neural Networks to model. We evaluate different syntactic and semantic program embeddings on predicting the types of errors that students make in their submissions to an introductory programming class and two exercises on the CodeHunt education platform. Evaluation results show that our new semantic program embedding significantly outperforms the syntactic program embeddings based on token sequences and abstract syntax trees. In addition, we augment a search-based program repair system with the predictions obtained from our se- mantic embedding, and show that search efficiency is also significantly improved.}
}
@article{XuetalCoRR-17,
       author = {Xiaojun Xu and Chang Liu and Qian Feng and Heng Yin and Le Song and Dawn Song},
        title = {Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection},
      journal = {CoRR},
       volume = {arXiv:1708.06525},
         year = {2017},
     abstract = {The problem of cross-platform binary code similarity detection aims at detecting whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph matching algorithms, which are inevitably slow and sometimes inaccurate, and hard to adapt to a new task. To address these issues, in this work, we propose a novel neural network-based approach to compute the embedding, i.e., a numeric vector, based on the control flow graph of each binary function, then the similarity detection can be done efficiently by measuring the distance between the embeddings for two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini can speed up prior art's embedding generation time by 3 to 4 orders of magnitude and reduce the required training time from more than 1 week down to 30 minutes to 10 hours. Our real world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art, i.e., Genius. Our research showcases a successful application of deep learning on computer security problems.}
}
@article{PiechetalCoRR-15,
       author = {Chris Piech and Jonathan Huang and Andy Nguyen and Mike Phulsuksombati and Mehran Sahami and Leonidas J. Guibas},
        title = {Learning Program Embeddings to Propagate Feedback on Student Code},
      journal = {CoRR},
       volume = {arXiv:1505.05969},
         year = {2015},
     abstract = {Providing feedback, both assessing final work and giving hints to stuck students, is difficult for open-ended assignments in massive online classes which can range from thousands to millions of students. We introduce a neural network method to encode programs as a linear mapping from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm to assessments from the Code.org Hour of Code and Stanford University's CS1 course, where we propagate human comments on student assignments to orders of magnitude more submissions.}
}
@inproceedings{BalogetalICLR-17,
       author = {Matej Balog and Alexander L. Gaunt and Marc Brockschmidt and Sebastian Nowozin and Daniel Tarlow},
        title = {DeepCoder: Learning to Write Programs},
    booktitle = {International Conference on Learning Representations},
         year = 2017,
     abstract = {We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network's predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.}
}
@article{AllamanisetalCoRR-17,
       author = {Miltiadis Allamanis and Marc Brockschmidt and Mahmoud Khademi},
        title = {Learning to Represent Programs with Graphs},
      journal = {CoRR},
       volume = {arXiv:1711.00740},
         year = {2017},
     abstract = {Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code's known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures.\\
                 In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.}
}

⁶³ Here's the BibTeX entry for a recent survey paper Gulwanietal [122] on current trends in program synthesis that Rishabh Singh co-authored:

@article{GulwanietalFaTiPL-17,
       author = {Sumit Gulwani and Oleksandr Polozov and Rishabh Singh},
        title = {Program Synthesis},
      journal = {Foundations and Trends in Programming Languages},
       volume = {4},
       number = {1-2},
         year = {2017},
        pages = {1-119},
     abstract = {Program synthesis is the task of automatically finding a program in the underlying programming language that satisfies the user intent expressed in the form of some specification. Since the inception of AI in the 1950s, this problem has been considered the holy grail of Computer Science. Despite inherent challenges in the problem such as ambiguity of user intent and a typically enormous search space of programs, the field of program synthesis has developed many different techniques that enable program synthesis in different real-life application domains. It is now used successfully in software engineering, biological discovery, computer-aided education, end-user programming, and data cleaning. In the last decade, several applications of synthesis in the field of programming by examples have been deployed in mass-market industrial products. This survey is a general overview of the state-of-the-art approaches to program synthesis, its applications, and subfields. We discuss the general principles common to all modern synthesis approaches such as syntactic bias, oracle-guided inductive search, and optimization techniques. We then present a literature review covering the four most common state-of-the-art techniques in program synthesis: enumerative search, constraint solving, stochastic search, and deduction-based programming by examples. We conclude with a brief list of future horizons for the field.}
}

⁶⁴ Collection of papers relating to the role of the cerebellum and basal ganglia in language, planning and simulating complex activity:

@book{Lieberman02,
     author = {Philip Lieberman},
      title = {Human language and our reptilian brain: The subcortical bases for speech, syntax and thought},
  publisher = {Harvard University Press},
    address = {Cambridge, MA},
       year = 2002,
}
@article{LiebermanAJPT-02,
     author = {Philip Lieberman},
      title = {On the nature and evolution of the neural bases of human language},
    Journal = {American Journal of Physical Anthropology},
      month = {December},
     volume = 119,
      issue = {S35},
      pages = {36-62},
       year = 2002,
   abstract = {The traditional theory equating the brain bases of language with Broca's and Wernicke's neocortical areas is wrong. Neural circuits linking activity in anatomically segregated populations of neurons in subcortical structures and the neocortex throughout the human brain regulate complex behaviors such as walking, talking, and comprehending the meaning of sentences. When we hear or read a word, neural structures involved in the perception or real-world associations of the word are activated as well as posterior cortical regions adjacent to Wernicke's area. Many areas of the neocortex and subcortical structures support the cortical-striatal-cortical circuits that confer complex syntactic ability, speech production, and a large vocabulary. However, many of these structures also form part of the neural circuits regulating other aspects of behavior. For example, the basal ganglia, which regulate motor control, are also crucial elements in the circuits that confer human linguistic ability and abstract reasoning. The cerebellum, traditionally associated with motor control, is active in motor learning. The basal ganglia are also key elements in reward-based learning. Data from studies of Broca's aphasia, Parkinson's disease, hypoxia, focal brain damage, and a genetically transmitted brain anomaly (the putative ``language gene,'' family KE), and from comparative studies of the brains and behavior of other species, demonstrate that the basal ganglia sequence the discrete elements that constitute a complete motor act, syntactic process, or thought process. Imaging studies of intact human subjects and electrophysiologic and tracer studies of the brains and behavior of other species confirm these findings. As Dobzansky put it, ``Nothing in biology makes sense except in the light of evolution'' (cited in Mayr, 1982). That applies with as much force to the human brain and the neural bases of language as it does to the human foot or jaw. The converse follows: the mark of evolution on the brains of human beings and other species provides insight into the evolution of the brain bases of human language. The neural substrate that regulated motor control in the common ancestor of apes and humans most likely was modified to enhance cognitive and linguistic ability. Speech communication played a central role in this process. However, the process that ultimately resulted in the human brain may have started when our earliest hominid ancestors began to walk.},
}
@article{MerkerBBS-07,
       author = {Merker, Bjorn},
        title = {Consciousness without a cerebral cortex: A challenge for neuroscience and medicine},
      journal = {Behavioral and Brain Sciences},
    publisher = {Cambridge University Press},
       volume = {30},
       number = {1},
         year = {2007},
        pages = {63–81},
     abstract = {A broad range of evidence regarding the functional organization of the vertebrate brain - spanning from comparative neurology to experimental psychology and neurophysiology to clinical data - is reviewed for its bearing on conceptions of the neural organization of consciousness. A novel principle relating target selection, action selection, and motivation to one another, as a means to optimize integration for action in real time, is introduced. With its help, the principal macrosystems of the vertebrate brain can be seen to form a centralized functional design in which an upper brain stem system organized for conscious function performs a penultimate step in action control. This upper brain stem system retained a key role throughout the evolutionary process by which an expanding forebrain - culminating in the cerebral cortex of mammals - came to serve as a medium for the elaboration of conscious contents. This highly conserved upper brainstem system, which extends from the roof of the midbrain to the basal diencephalon, integrates the massively parallel and distributed information capacity of the cerebral hemispheres into the limited-capacity, sequential mode of operation required for coherent behavior. It maintains special connective relations with cortical territories implicated in attentional and conscious functions, but is not rendered nonfunctional in the absence of cortical input. This helps explain the purposive, goal-directed behavior exhibited by mammals after experimental decortication, as well as the evidence that children born without a cortex are conscious. Taken together these circumstances suggest that brainstem mechanisms are integral to the constitution of the conscious state, and that an adequate account of neural mechanisms of conscious function cannot be confined to the thalamocortical complex alone.}
}
@article{NottebohmPLoS-05,
       author = {Nottebohm, Fernando},
      journal = {{PLOS} Biology},
    publisher = {Public Library of Science},
        title = {The Neural Basis of Birdsong},
         year = {2005},
       volume = {3},
     abstract = {There is a tradition in biology of using specific animal models to study generalizable basic properties of a system. For example, the giant axon of squid was used for the pioneering work on nerve transmission; the fruit fly (Drosophila) has played a key role in researchers discovering the role of homeobox genes in embryogenesis; the sea slug (Aplysia) is used to study the molecular biology of learning; and the round worm (Caenorhabditis elegans) is used to study programmed cell death. Basic insights gained from these four systems apply widely to other multicellular animals. Here, I will review basic discoveries made by studying birdsong that have helped answer more general questions in vertebrate neuroscience.},
}
@misc{PattaniBRAINLESS-16,
        title = {Preliminary Research -- Living Without Brain Structures},
       author = {Aneri Pattani},
         year = {2016},
 howpublished = {StoryLab Posting},
          url = {http://www.northeastern.edu/storylab/sp16/2016/01/27/preliminary-research-living-without-brain-structures/},
     abstract = {How much of our brain do we actually need? A number of stories have appeared in the news in recent months about people with chunks of their brains missing or damaged. These cases tell a story about the mind that goes deeper than their initial shock factor. It isn't just that we don't understand how the brain works, but that we may be thinking about it in the entirely wrong way.}
}
@article{Stephenson-JonesetalCURRENT-BIOLOGY-11,
       author = {Stephenson-Jones, M. and Samuelsson, E. and Ericsson, J. and Robertson, B. and Grillner, S.},
        title = {Evolutionary conservation of the basal ganglia as a common vertebrate mechanism for action selection},
      journal = {Current Biology},
         year = {2011},
       volume = {21},
       number = {13},
        pages = {1081-1091},
     abstract = {Although the basal ganglia are thought to play a key role in action selection in mammals, it is unknown whether this mammalian circuitry is present in lower vertebrates as a conserved selection mechanism. We aim here, using lamprey, to elucidate the basal ganglia circuitry in the phylogenetically oldest group of vertebrates (cyclostomes) and determine how this selection architecture evolved to accommodate the increased behavioral repertoires of advanced vertebrates.\\ We show, using immunohistochemistry, tract tracing, and whole-cell recordings, that all parts of the mammalian basal ganglia (striatum, globus pallidus interna [GPi] and externa [GPe], and subthalamic nucleus [STN]) are present in the lamprey forebrain. In addition, the circuit features, molecular markers, and physiological activity patterns are conserved. Thus, GABAergic striatal neurons expressing substance P project directly to the pallidal output layer, whereas enkephalin-expressing striatal neurons project indirectly via nuclei homologous to the GPe and STN. Moreover, pallidal output neurons tonically inhibit tectum, mesencephalic, and diencephalic motor regions.\\ These results show that the detailed basal ganglia circuitry is present in the phylogenetically oldest vertebrates and has been conserved, most likely as a mechanism for action selection used by all vertebrates, for over 560 million years. Our data also suggest that the mammalian basal ganglia evolved through a process of exaptation, where the ancestral core unit has been co-opted for multiple functions, allowing them to process cognitive, emotional, and motor information in parallel and control a broader range of behaviors.}
}

⁶⁵ The field of automated planning in AI in the 80's was often criticized for focusing on toy problems such as the blocks world. The criticism was leveled by disciplines like operations research and applied control theory that spanned a wide range of real-world problems including inventory management, vehicle routing, chemical process control and robotic assembly, whereas the blocks world was so general it could be used to solve any — AI Complete and NP-hard — symbolic problem.

⁶⁶ Prior work listed on this slide deck presented by Dan Abolafia joint work with Quoc Le and Mohammad Norouzi. Dan introduced this project as part of the code synthesis moonshot, stating that "[w]e want to build an ML system that can learn to write code".

@article{BalogetalCoRR-16,
       author = {Matej Balog and Alexander L. Gaunt and Marc Brockschmidt and Sebastian Nowozin and Daniel Tarlow},
        title = {{DeepCoder}: Learning to Write Programs},
      journal = {CoRR},
       volume = {arXiv:1611.01989},
         year = {2016},
     abstract = {We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network's predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.}
}
@article{DevlinetalCoRR-17,
       author = {Jacob Devlin and Jonathan Uesato and Surya Bhupatiraju and Rishabh Singh and Abdel{-}rahman Mohamed and Pushmeet Kohli},
        title = {{RobustFill}: Neural Program Learning under Noisy {I/O}},
      journal = {CoRR},
       volume = {arXiv:1703.07469},
         year = {2017},
     abstract = {The problem of automatically generating a computer program from some specification has been studied since the early days of AI. Recently, two competing approaches for automatic program learning have received significant attention: (1) neural program synthesis, where a neural network is conditioned on input/output (I/O) examples and learns to generate a program, and (2) neural program induction, where a neural network generates new outputs directly using a latent program representation. Here, for the first time, we directly compare both approaches on a large-scale, real-world learning task. We additionally contrast to rule-based program synthesis, which uses hand-crafted semantics to guide the program generation. Our neural models use a modified attention RNN to allow encoding of variable-sized sets of I/O pairs. Our best synthesis model achieves 92\% accuracy on a real-world test set, compared to the 34\% accuracy of the previous best neural synthesis approach. The synthesis model also outperforms a comparable induction model on this task, but we more importantly demonstrate that the strength of each approach is highly dependent on the evaluation metric and end-user application. Finally, we show that we can train our neural models to remain very robust to the type of noise expected in real-world data (e.g., typos), while a highly-engineered rule-based system fails entirely.}
}
@article{LiangetalCoRR-16,
       author = {Chen Liang and Jonathan Berant and Quoc Le and Kenneth D. Forbus and Ni Lao},
        title = {Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision},
      journal = {CoRR},
       volume = {arXiv:1611.00020},
         year = {2016},
     abstract = {Harnessing the statistical power of neural networks to perform language understanding and symbolic reasoning is difficult, when it requires executing efficient discrete operations against a large knowledge-base. In this work, we introduce a Neural Symbolic Machine (NSM), which contains (a) a neural "programmer”, i.e., a sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to handle compositionality (b) a symbolic "computer”, i.e., a Lisp interpreter that performs program execution, and helps find good programs by pruning the search space. We apply REINFORCE to directly optimize the task reward of this structured prediction problem. To train with weak supervision and improve the stability of REINFORCE we augment it with an iterative maximum-likelihood training process. NSM outperforms the state-of-the-art on the WEBQUESTIONSSP dataset when trained from question-answer pairs only, without requiring any feature engineering or domain-specific knowledge.}
}
@article{NeelakantanetalCoRR-15,
       author = {Arvind Neelakantan and Quoc V. Le and Ilya Sutskever},
        title = {Neural Programmer: Inducing Latent Programs with Gradient Descent},
      journal = {CoRR},
       volume = {arXiv:1511.04834},
         year = {2015},
     abstract = {Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.}
}
@article{ReedandDeFreitasCoRR-15,
       author = {Scott E. Reed and Nando de Freitas},
        title = {Neural Programmer-Interpreters},
      journal = {CoRR},
       volume = {arXiv:1511.06279},
         year = {2015},
     abstract = {We propose the neural programmer-interpreter (NPI): a recurrent and compositional neural network that learns to represent and execute programs. NPI has three learnable components: a task-agnostic recurrent core, a persistent key-value program memory, and domain-specific encoders that enable a single NPI to operate in multiple perceptually diverse environments with distinct affordances. By learning to compose lower-level programs to express higher-level programs, NPI reduces sample complexity and increases generalization ability compared to sequence-tosequence LSTMs. The program memory allows efficient learning of additional tasks by building on existing programs. NPI can also harness the environment (e.g. a scratch pad with read-write pointers) to cache intermediate results of computation, lessening the long-term memory burden on recurrent hidden units. In this work we train the NPI with fully-supervised execution traces; each program has example sequences of calls to the immediate subprograms conditioned on the input. Rather than training on a huge number of relatively weak labels, NPI learns from a small number of rich examples. We demonstrate the capability of our model to learn several types of compositional programs: addition, sorting, and canonicalizing 3D models. Furthermore, a single NPI learns to execute these programs and all 21 associated subprograms.}
}  
@inproceedings{GuadarramaetalICRAS-13,
       author = {Sergio Guadarrama and Lorenzo Riano and Dave Golland and Daniel Gouhring and Yangqing Jia and Dan Klein and Pieter Abbeel and Trevor Darrell},
        title = {Grounding spatial relations for human-robot interaction},
    booktitle = {2013 {IEEE/RSJ} International Conference on Intelligent Robots and Systems},
         year = 2013,
        pages = {1640-1647},
     abstract = {We propose a system for human-robot interaction that learns both models for spatial prepositions and for object recognition. Our system grounds the meaning of an input sentence in terms of visual percepts coming from the robot’s sensors in order to send an appropriate command to the PR2 or respond to spatial queries. To perform this grounding, the system recognizes the objects in the scene, determines which spatial relations hold between those objects, and semantically parses the input sentence. The proposed system uses the visual and spatial information in conjunction with the semantic parse to interpret statements that refer to objects (nouns), their spatial relationships (prepositions), and to execute commands (actions). The semantic parse is inherently compositional, allowing the robot to understand complex commands that refer to multiple objects and relations such as: "Move the cup close to the robot to the area in front of the plate and behind the tea box". Our system correctly parses 94\% of the 210 online test sentences, correctly interprets 91\% of the correctly parsed sentences, and correctly executes 89\% of the correctly interpreted sentences.}
}  
@inproceedings{TamaretalNIPS-16,
       author = {Aviv Tamar and Yi Wu and Garrett Thomas and Sergey Levine and and Pieter Abbeel},
        title = {Value Iteration Networks},
    booktitle = {Proceedings of the 30th International Conference on Neural Information Processing Systems},
    publisher = {Curran Associates Inc.},
         year = {2016},
     location = {Barcelona, Spain},
     abstract = {We introduce the value iteration network (VIN): a fully differentiable neural network with a ‘planning module’ embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.},
}

⁶⁷ Recent papers from DeepMind on model-based planning that focus on how an agent might construct / imagine an action occuring in a given situation to produce a given outcome:

@article{PascanuCoRR-17,
       author = {Razvan Pascanu and Yujia Li and Oriol Vinyals and Nicolas Heess and Lars Buesing and S{\'{e}}bastien Racani{\`{e}}re and David P. Reichert and Theophane Weber and Daan Wierstra and Peter Battaglia},
        title = {Learning model-based planning from scratch},
      journal = {CoRR},
       volume = {arXiv:1707.06170},
         year = {2017},
     abstract = {Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.}
}  
@article{WeberetalCoRR-17,
       author = {Theophane Weber and  S{\'{e}}bastien Racani{\`{e}}re and David P. Reichert and Lars Buesing and Arthur Guez and Danilo Jimenez Rezende and Adri{\`{a}} Puigdom{\`{e}}nech Badia and Oriol Vinyals and Nicolas Heess and Yujia Li and Razvan Pascanu and Peter Battaglia and David Silver and Daan Wierstra},
        title = {Imagination-Augmented Agents for Deep Reinforcement Learning},
      journal = {CoRR},
       volume = {arXiv:/1707.06203},
         year = {2017},
     abstract = {We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.}
}
@article{SilveretalCoRR-17,
       author = {David Silver and Hado van Hasselt and Matteo Hessel and Tom Schaul and Arthur Guez and Tim Harley and Gabriel Dulac{-}Arnold and David P. Reichert and Neil C. Rabinowitz and Andr{\'{e}} Barreto and Thomas Degris},
        title = {The Predictron: End-To-End Learning and Planning},
      journal = {CoRR},
       volume = {arXiv:1612.08810},
      comment = {ICML 2017 preprint},
         year = {2017},
     abstract = {One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple "imagined" planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.}
}

⁶⁸ Collection of papers on attention, imagination and executive control from various researchers in cognitive neuroscience:

@article{HassabisetalNEURON-17,
       author = {Demis Hassabis and Dharshan Kumaran and Christopher Summerfield and Matthew Botvinick},
        title = {Neuroscience-Inspired Artificial Intelligence},
      journal = {Neuron},
       volume = {95},
       number = {2},
        pages = {245-258},
         year = {2017},
     abstract = {The fields of neuroscience and artificial intelligence (AI) have a long and intertwined history. In more recent times, however, communication and collaboration between the two fields has become less commonplace. In this article, we argue that better understanding biological brains could play a vital role in building intelligent machines. We survey historical interactions between the AI and neuroscience fields and emphasize current advances in AI that have been inspired by the study of neural computation in humans and other animals. We conclude by highlighting shared themes that may be key for advancing future research in both fields.},
}
@article{HassabisandMaguireTiCS-07,
        title = {Deconstructing episodic memory with construction},
       author = {Hassabis, Demis and Maguire, Eleanor A.},
      journal = {Trends in Cognitive Science},
    publisher = {Elsevier Current Trends},
       volume = 11,
        issue = 7
         year = 2007,
        pages = {299-306},
     abstract = {It has recently been observed that the brain network supporting recall of episodic memories shares much in common with other cognitive functions such as episodic future thinking, navigation and theory of mind. It has been speculated that ‘self-projection’ is the key common process. However, in this Opinion article, we note that other functions (e.g. imagining fictitious experiences) not explicitly connected to either the self or a subjective sense of time, activate a similar brain network. Hence, we argue that the process of ‘scene construction’ is better able to account for the commonalities in the brain areas engaged by an extended range of disparate functions. In light of this, we re-evaluate our understanding of episodic memory, the processes underpinning it and other related cognitive functions.} 
}
@article{HassabisandMaguirePTRS_B-09,
        title = {The construction system of the brain},
       author = {Hassabis, Demis and Maguire, Eleanor A.},
      journal = {Philosophical Transactions of the Royal Society London B Biological Science},
    publisher = {The Royal Society},
       volume = 364,
        issue = 1521,
         year = 2009,
        pages = {1263-1271},
     abstract = {The ability to construct a hypothetical situation in one's imagination prior to it actually occurring may afford greater accuracy in predicting its eventual outcome. The recollection of past experiences is also considered to be a reconstructive process with memories recreated from their component parts. Construction, therefore, plays a critical role in allowing us to plan for the future and remember the past. Conceptually, construction can be broken down into a number of constituent processes although little is known about their neural correlates. Moreover, it has been suggested that some of these processes may be shared by a number of other cognitive functions including spatial navigation and imagination. Recently, novel paradigms have been developed that allow for the isolation and characterization of these underlying processes and their associated neuroanatomy. Here, we selectively review this fast-growing literature and consider some implications for remembering the past and predicting the future.},
}
@article{PritzeletalCoRR-17,
       author = {Alexander Pritzel and Benigno Uria and  Sriram Srinivasan and Adri\`{a} Puigdom\`{e}nech Badia and Oriol Vinyals and Demis Hassabis and Daan Wierstra and Charles Blundell},
        title = {Neural Episodic Control},
      journal = {CoRR},
       volume = {arXiv:1703.01988},
         year = {2017},
     abstract = {Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitudes more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general purpose deep reinforcement learning agents.}
}
@article{OReillyNC-96,
       author = {O'Reilly, Randall C.},
        title = {Biologically Plausible Error-driven Learning Using Local Activation Differences: The Generalized Recirculation Algorithm},
      journal = {Neural Computation},
    publisher = {MIT Press},
      address = {Cambridge, MA, USA},
       volume = {8},
       number = {5},
         year = {1996},
        pages = {895-938},
     abstract = {The error backpropagation learning algorithm (BP) is generally considered biologically implausible because it does not use locally available, activation-based variables. A version of BP that can be computed locally using bidirectional activation recirculation (Hinton and McClelland 1988) instead of backpropagated error derivatives is more biologically plausible. This paper presents a generalized version of the recirculation algorithm (GeneRec), which overcomes several limitations of the earlier algorithm by using a generic recurrent network with sigmoidal units that can learn arbitrary input/output mappings. However, the contrastive Hebbian learning algorithm (CHL, also known as DBM or mean field learning) also uses local variables to perform error-driven learning in a sigmoidal recurrent network. CHL was derived in a stochastic framework (the Boltzmann machine), but has been extended to the deterministic case in various ways, all of which rely on problematic approximations and assumptions, leading some to conclude that it is fundamentally flawed. This paper shows that CHL can be derived instead from within the BP framework via the GeneRec algorithm. CHL is a symmetry-preserving version of GeneRec that uses a simple approximation to the midpoint or second-order accurate Runge-Kutta method of numerical integration, which explains the generally faster learning speed of CHL compared to BI. Thus, all known fully general error-driven learning algorithms that use local activation-based variables in deterministic networks can be considered variations of the GeneRec algorithm (and indirectly, of the backpropagation algorithm). GeneRec therefore provides a promising framework for thinking about how the brain might perform error-driven learning. To further this goal, an explicit biological mechanism is proposed that would be capable of implementing GeneRec-style learning. This mechanism is consistent with available evidence regarding synaptic modification in neurons in the neocortex and hippocampus, and makes further predictions.}
}
@article{OReillySCIENCE-06,
     title = {Biologically Based Computational Models of High-Level Cognition},
    author = {O'Reilly, Randall C.},
   journal = {Science},
    volume = 314,
     issue = 5796,
      year = 2006,
     pages = {91-94},
  abstract = {Computer models based on the detailed biology of the brain can help us understand the myriad complexities of human cognition and intelligence. Here, we review models of the higher level aspects of human intelligence, which depend critically on the prefrontal cortex and associated subcortical areas. The picture emerging from a convergence of detailed mechanistic models and more abstract functional models represents a synthesis between analog and digital forms of computation. Specifically, the need for robust active maintenance and rapid updating of information in the prefrontal cortex appears to be satisfied by bistable activation states and dynamic gating mechanisms. These mechanisms are fundamental to digital computers and may be critical for the distinctive aspects of human intelligence.},
}
@article{OReillyTiCS-98,
       author = {O'Reilly, R. C.},
        title = {{{S}ix principles for biologically based computational models of cortical cognition}},
      journal = {Trends Cognitive Science)},
         year = {1998},
       volume = {2},
       number = {11},
        pages = {455-462},
     abstract = {This review describes and motivates six principles for computational cognitive neuroscience models: biological realism, distributed representations, inhibitory competition, bidirectional activation propagation, error-driven task learning, and Hebbian model learning. Although these principles are supported by a number of cognitive, computational and biological motivations, the prototypical neural-network model (a feedforward back-propagation network) incorporates only two of them, and no widely used model incorporates all of them. It is argued here that these principles should be integrated into a coherent overall framework, and some potential synergies and conflicts in doing so are discussed.}
}
@article{OReillyandFrankNC-06,
       author = {O'Reilly, Randall C. and Frank, Michael J.},
        title = {Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia},
      journal = {Neural Computation},
       volume = 18,
        issue = 2,
         year = 2006,
        pages = {283-328},
          url = {http://psych.colorado.edu/~oreilly/papers/OReillyFrank06_pbwm.pdf},
    publisher = {MIT Press},
      address = {Cambridge, MA, USA},
     abstract = {The prefrontal cortex has long been thought to subserve both working memory (the holding of information online for processing) and executive functions (deciding how to manipulate working memory and perform processing). Although many computational models of working memory have been developed, the mechanistic basis of executive function remains elusive, often amounting to a homunculus. This article presents an attempt to deconstruct this homunculus through powerful learning mechanisms that allow a computational model of the prefrontal cortex to control both itself and other brain areas in a strategic, task-appropriate manner. These learning mechanisms are based on subcortical structures in the midbrain, basal ganglia, and amygdala, which together form an actor-critic architecture. The critic system learns which prefrontal representations are task relevant and trains the actor, which in turn provides a dynamic gating mechanism for controlling working memory updating. Computationally, the learning mechanism is designed to simultaneously solve the temporal and structural credit assignment problems. The model's performance compares favorably with standard backpropagation-based temporal learning mechanisms on the challenging 1-2-AX working memory task and other benchmark working memory tasks.}
}
@book{OReillyandMunakata00,
     title = {Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain},
    author = {Randall O'Reilly and Yuko Munakata},
 publisher = {MIT Press},
   address = {Cambridge, Massachusetts},
      year = 2000,
}
@article{OReillyetalCON-10,
     title = {Computational models of cognitive control},
    author = {Randall C. O'Reilly and Seth A. Herd and Wolfgang M. Pauli},
   journal = {Current Opinion in Neurobiology},
    volume = 20,
     issue = 2,
      year = 2010,
     pages = {257-261},
}
@article{RosenfeldetalCoRR-17,
        title = {Priming Neural Networks},
       author = {Amir Rosenfeld, Mahdi Biparva, John K. Tsotsos},
      journal = {CoRR},
       volume = {arXiv:1711.05918},
         year = 2017,
     abstract = {Visual priming is known to affect the human visual system to allow detection of scene elements, even those that may have been near unnoticeable before, such as the presence of camouflaged animals. This process has been shown to be an effect of top-down signaling in the visual system triggered by the said cue. In this paper, we propose a mechanism to mimic the process of priming in the context of object detection and segmentation. We view priming as having a modulatory, cue dependent effect on layers of features within a network. Our results show how such a process can be complementary to, and at times more effective than simple post-processing applied to the output of the network, notably so in cases where the object is hard to detect such as in severe noise. Moreover, we find the effects of priming are sometimes stronger when early visual layers are affected. Overall, our experiments confirm that top-down signals can go a long way in improving object detection and segmentation.}
}

⁶⁹ In a recent discussion, Christof mentioned that AIBS scientists regularly work with both mouse and human samples of neural tissue and that, apart from their size, the cortical tissue from mice and humans are virtually identical to trained anatomists and histologists. This makes him wonder whether mice would develop, say, consciousness or more sophisticated signaling if they had a cortex as large as ours. As far as we know, there is nothing special about spindle / von Economo neurons or the elusive mirror neurons, except that the — the former — are larger and are noteworthy for their size and location. Simple neural network architectures including convolutional networks work efficiently to learn structure, induce sparse feature libraries, construct concept hierarchies, and can even learn value functions to support reinforcement learning and attentional mechanisms.

⁷⁰ This is another example of progress primarily due to improvements in computing hardware. The basic idea gets invented a dozen times and never gains a purchase until Moore’s law catches up enabling lots of researchers to essentially “play” in the conceptual space that opens up due to the confluence of a good idea and adequate computing resources. It's not just the length of time that an engineer has to sit on his or her hands waiting while some mysterious process plays out over an indeterminate interval of time and, at the end, provides an inscrutable and generally unsatisfying answer, it's the anxiety, frustration and helplessness that this induces in the engineer. When you can launch millions of experiments and get millions of answers in the same or a shorter amount of time, everything changes. An anomaly in the result of your only experiment is a problem, while an anomaly in one result out of millions is an outlier or a bug in your experimental design.

⁷¹ Here is a sample of recent review papers relating to the idea of dual-loop models that integrate sensorimotor maps with semantic processing:

@incollection{WeilleretalNL-16,
       author = {Cornelius Weiller and Tobias Bormann and Dorothee Kuemmerer and Mariachristina Musso and Michel Rijntjes},
        title = {The Dual Loop Model in Language},
       editor = {Hickok, Gregory  and Small, Steven L.},
    booktitle = {Neurobiology of Language},
    publisher = {Academic Press},
      address = {San Diego},
         year = {2016},
        pages = {325-337},
     abstract = {A dual loop model for the processing of language in the brain was proposed early in history and has also formed the basis for many neuropsychological models. These models incorporate a (direct) route for sensorimotor mapping and a (indirect) route for "semantic" processing. Dual loop models also emerged in the field of visual processing, motor control, or spatial attention. Thus, a general dual loop system may provide the framework for the interpretation of cognition in human and primate brains independent of the modality and species. Modern imaging techniques like diffusion tensor imaging (DTI)-based fiber tracking identified long human association tracts for ventral and dorsal pathways. The extreme capsule (EmC) and uncinate fascicle (UF) are part of the ventral system, and the superior longitudinal fasciculi (SLF I, II, III) and the arcuate fasciculus (AF) are all dorsal pathways. This chapter reviews language processing in the brain in the context of a domain general dual loop system.}
}
@incollection{RizzolattiandRozziNL-16,
       author = {Giacomo Rizzolatti and Stefano Rozzi},
        title = {Motor Cortex and Mirror System in Monkeys and Humans},
       editor = {Hickok, Gregory  and Small, Steven L.},
    booktitle = {Neurobiology of Language},
    publisher = {Academic Press},
      address = {San Diego},
         year = {2016},
        pages = {59-72},
     abstract = {The view of the functions of the motor system has radically changed in past years. It is now clear that the motor system not only is a "producer" of movements but also is involved in cognitive functions. In this chapter we review the anatomical and functional organization of the cortical motor system in monkeys and humans with particular emphasis on those areas that are endowed of the mirror mechanism and involved in communication. We discuss the possible evolutionary link between basic unintentional communication and voluntary communication. We conclude examining the hypothesis that purports that motor activation is involved in phonemes and understanding of action words.}
}
@article{HamzeietalCC-16,
       author = {Hamzei, Farsin and Vry, Magnus-Sebastian and Saur, Dorothee and Glauche, Volkmar and Hoeren, Markus and Mader, Irina and Weiller, Cornelius and Rijntjes, Michel},
        title = {The Dual-Loop Model and the Human Mirror Neuron System: an Exploratory Combined fMRI and DTI Study of the Inferior Frontal Gyrus},
      journal = {Cerebral Cortex},
       volume = {26},
       number = {5},
        pages = {2215-2224},
         year = {2016},
     abstract = {The inferior frontal gyrus (IFG) is active during both goal-directed action and while observing the same motor act, leading to the idea that also the meaning of a motor act (action understanding) is represented in this "mirror neuron system" (MNS). However, in the dual-loop model, based on dorsal and ventral visual streams, the MNS is thought to be a function of the dorsal steam, projecting to pars opercularis (BA44) of IFG, while recent studies suggest that conceptual meaning and semantic analysis are a function of ventral connections, projecting mainly to pars triangularis (BA45) of IFG. To resolve this discrepancy, we investigated action observation (AO) and imitation (IMI) using fMRI in a large group of subjects. A grasping task (GR) assessed the contribution from movement without AO. We analyzed connections of the MNS-related areas within IFG with postrolandic areas with the use of activation-based DTI. We found that action observation with imitation are mainly a function of the dorsal stream centered on dorsal part of BA44, but also involve BA45, which is dorsally and ventrally connected to the same postrolandic regions. The current finding suggests that BA45 is the crucial part where the MNS and the dual-loop system interact.}
}
@article{RijntjesetalFiEN-12,
       author = {Rijntjes, Michel and Weiller, Cornelius and Bormann, Tobias and Musso, Mariachristina},
        title = {The dual loop model: its relation to language and other modalities},
      journal = {Frontiers in Evolutionary Neuroscience},
       volume = {4},
        pages = {9},
         year = {2012},
     abstract = {The current neurobiological consensus of a general dual-loop system scaffolding human and primate brains gives evidence that the dorsal and ventral connections subserve similar functions, independent of the modality and species. However, most current commentators agree that, although bees dance and chimpanzees grunt, these systems of communication differ qualitatively from human language. So why is language in humans unique? We discuss anatomical differences between humans and other animals, the meaning of lesion studies in patients, the role of inner speech, and compare functional imaging studies in language with other modalities in respect to the dual loop model. These aspects might be helpful for understanding what kind of biological system the language faculty is, and how it relates to other systems in our own species and others.}
}
%

⁷² Examples of qualia include the perceived sensation of pain of a headache, the taste of wine, as well as the redness of an evening sky. As qualitative characters of sensation, qualia stand in contrast to "propositional attitudes", where the focus is on beliefs about experience rather than what is it directly like to be experiencing. (SOURCE)

⁷³ Natural language programming (NLP) is an ontology-assisted way of programming in terms of natural language sentences, e.g. English. A structured document with Content, sections and subsections for explanations of sentences forms a NLP document, which is actually a computer program. Natural languages and natural language user interfaces include Inform7, a natural programming language for making interactive fiction, Ring, a general purpose language, Shakespeare, an esoteric natural programming language in the style of the plays of William Shakespeare, and Wolfram Alpha, a computational knowledge engine, using natural language input. The smallest unit of statement in NLP is a sentence. Each sentence is stated in terms of concepts from the underlying ontology, attributes in that ontology and named objects in capital letters. In an NLP text every sentence unambiguously compiles into a procedure call in the underlying high-level programming language such as MATLAB, Octave, SciLab, Python, etc. (SOURCE)

⁷⁴ Here are the BibTeX references and abstracts for several papers on using natural language to write programs:

@book{Veres2008,
        title = {{Natural Language Programming of Agents and Robotic Devices: Publishing for agents and humans in sEnglish}},
       author = {S\'{a}ndor M. Veres},
    publisher = {SysBrain Ltd},
      address = {London},
         year = {2008},
     abstract = {This paper is an application of ontology theory, conceptual graphs and programming language theories to develop the theoretical foundations of natural language programming (NLP) that has in recent years been used to produce natural language documents for intelligent agents and human readers. The analysis given reveals three benefits of NLP. First, it is “conceptualized” programming that enables developers to write less bug prone programs due to clarity of code presentation and enforced structuring of data. Secondly, NLP can aid programming of the all important abstractions for robots: event, action and world model abstractions can be created by sentences. Thirdly, NLP can be used to publish natural language documents by researchers, i.e. English language documents on control theory and procedures, on the Internet or in printed documents. This theoretical paper also defines a large class of intelligent agents that can read such documents. This enables human users and agents to have shared understanding of how application systems work.}
}
@inproceedings{LeietalACL-13,
       author = {Tao Lei, Fan Long, Regina Barzilay and Martin Rinard},
        title = {From Natural Language Specifications to Program Input Parsers},
    booktitle = {The 51st Annual Meeting of the Association for Computational Linguistics},
     location = {Sofia, Bulgaria},
         year = {2013},
     abstract = {We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0\% F-Score accuracy compared to an F-Score of 66.7\% produced by a state-of-the-art semantic parser on a dataset of input format speci- fications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).}
}
@article{LocascioetalCoRR-16,
       author = {Nicholas Locascio and Karthik Narasimhan and Eduardo DeLeon and Nate Kushman and Regina Barzilay},
        title = {Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge},
      journal = {CoRR},
       volume = {arxiv:1608.03000},
         year = {2016},
     abstract = {This paper explores the task of translating natural language queries into regular expressions which embody their meaning. In contrast to prior work, the proposed neural model does not utilize domain-specific crafting, learning to translate directly from a parallel corpus. To fully explore the potential of neural models, we propose a methodology for collecting a large corpus of regular expression, natural language pairs. Our resulting model achieves a performance gain of 19.6\% over previous state-of-the-art models.}
}
@inproceedings{KushmanandBarzilayACL-13,
       author = {Kushman, Nate and Barzilay, Regina},
        title = {Using Semantic Unification to Generate Regular Expressions from Natural Language},
    booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
         year = {2013},
    publisher = {Association for Computational Linguistics},
        pages = {826-836},
     location = {Atlanta, Georgia},
     abstract = {We consider the problem of translating natural language text queries into regular expressions which represent their meaning. The mismatch in the level of abstraction between the natural language representation and the regular expression representation make this a novel and challenging problem. However, a given regular expression can be written in many semantically equivalent forms, and we exploit this flexibility to facilitate translation by finding a form which more directly corresponds to the natural language. We evaluate our technique on a set of natural language queries and their associated regular expressions which we gathered from Amazon Mechanical Turk. Our model substantially outperforms a stateof-the-art semantic parsing baseline, yielding a 29\% absolute improvement in accuracy}
}
@inproceedings{ErnstSNAPL-17,
       author = {Michael D. Ernst},
        title = {Natural language is a programming language: Applying natural language processing to software development},
    booktitle = {{SNAPL 2017: the 2nd Summit oN Advances in Programming Languages}},
        pages = {4:1-4:14},
      address = {Asilomar, CA, USA},
         year = {2017},
     abstract = {A powerful, but limited, way to view software is as source code alone. Treating a program as a sequence of instructions enables it to be formalized and makes it amenable to mathematical techniques such as abstract interpretation and model checking. A program consists of much more than a sequence of instructions. Developers make use of test cases, documentation, variable names, program structure, the version control repository, and more. I argue that it is time to take the blinders off of software analysis tools: tools should use all these artifacts to deduce more powerful and useful information about the program. Researchers are beginning to make progress towards this vision. This paper gives, as examples, four results that find bugs and generate code by applying natural language processing techniques to software artifacts. The four techniques use as input error messages, variable names, procedure documentation, and user questions. They use four different NLP techniques: document similarity, word semantics, parse trees, and neural networks. The initial results suggest that this is a promising avenue for future work.},
}

⁷⁵ Our current best guess for the beginning of life on Earth is 3.8 billion years ago. The oldest fossils of single-celled organisms date from about about 3.5 billion years ago. Viruses are present by 3 billion years ago, but may be as old as life itself. Oxygen from photosynthetic cyanobacteria starts to build up in the atmosphere around 2.4 billion years ago. Eukaryotic cells with internal organelles come into being 2.0 billion years ago. The eukaryotes divide into three groups: the ancestors of modern plants, fungi and animals 1.5 billion years ago. The first multicellular life develops around 900 million years ago, and the fossil evidence shows that animals were exploring the land by 500 million years ago. Humans diverge from their closest relatives, the chimpanzees and bonobos around 6 million years ago. (SOURCE)

⁷⁶ Andrew Ng, one of the cofounders of the Google Brain project, will be the new chairman of Woebot, a company that operates a chatbot of the same name.

The bot is designed to help people work on their own mental health issues using techniques that originate with cognitive behavioral therapy. It’s a sort of therapy that focuses on helping people manage their moods by developing personal coping strategies for mental illnesses like depression and anxiety.

A study from Stanford University showed that people who used Woebot reported a decrease in anxiety and depression symptoms after two weeks. Ng thinks that machine learning could provide a massive benefit in the mental health space, which is why he’s going to be working with Woebot.

For example, Woebot is available to talk when people are having problems at odd hours, during times when a therapist might not be awake to take a phone call. And while it’s not a replacement for a human therapist (the bot is quite clear about that), it can help augment human therapies by giving people an easier way to externalize their thoughts if nobody else is available.

Ng said in an interview that he’s still going to be working on other projects. His role at Woebot will be to serve on the company’s board and provide support for its work, not take on a full-time job at the startup. For example, he’s still working on finishing up his series of deep learning courses for Coursera, the online learning site that he cofounded.

Health care is one of the focuses of his current work, after leaving his post as Baidu’s head of machine learning. Earlier this year, a team at Stanford that he worked with released a paper showing that they had trained a machine learning system to read electrocardiograms better than a cardiologist could. (SOURCE)

⁷⁷ Defined concisely here as "awareness of an external object or something within oneself", and, with some additional detail and hyperbole by Schneider and Velmans [256] as "[a]nything that we are aware of at a given moment forms part of our consciousness, making conscious experience at once the most familiar and most mysterious aspect of our lives".

⁷⁸ This proposal is based on a regularization term which encourages the top-level representation (meant to be at the most abstract level) to be such that when a sparse attention mechanism focuses on a few elements of the state representation (factors, variables or concepts, i.e. a few axes or dimensions in the representation space), that small set of variables of which the agent is aware at a given moment can be combined to make a useful statement about reality or usefully condition an action or policy [25].

⁷⁹ Apropos our discussion of sharing data and knowledge in field of neuroscience, between 2012 and 2016 Google research scientist published over 700 papers, and 2017 promises to be another record year. ://www.technologyreview.com/s/603984/googles-ai-explosion-in-one-chart/(SOURCE)

⁸⁰ By way of comparison, The IBM TrueNorth chip contains 4,096 cores and 5.4 million transistors, while only drawing 70 milliwatts of power. The chip simulates a million neurons and 256 million synapses. TrueNorth is (very) roughly able to simulate the equivalent of a bee's brain. In contrast to the IBM and Intel focus on hardware, Qualcomm's Zeroth is a platform for brain-inspired computing from Qualcomm. It is based around a — thus far vaporware — neural processing unit (NPU) AI accelerator chip and a software API to interact with the platform. It is advertised as making deep learning available to mobile devices. The software operates locally rather than as a cloud application.

⁸¹ This 2012 article in Science Daily suggests that scientists now know how humans solve the cocktail party problem: "In the experiments, patients listened to two speech samples played to them simultaneously in which different phrases were spoken by different speakers. They were asked to identify the words they heard spoken by one of the two speakers.

The authors then applied new decoding methods to "reconstruct" what the subjects heard from analyzing their brain activity patterns. Strikingly, the authors found that neural responses in the auditory cortex only reflected those of the targeted speaker. They found that their decoding algorithm could predict which speaker and even what specific words the subject was listening to based on those neural patterns. In other words, they could tell when the listener's attention strayed to another speaker.

The authors claim that the algorithm worked so well that they could predict not only the correct responses, but also determine when the patients paid attention to the wrong word. [...] The new findings suggest that the representation of speech in the cortex does not just reflect the entire external acoustic environment but instead just what we really want or need to hear." — Nima Mesgarani, Edward F. Chang. Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 2012.

⁸² Application of their work 3D covolutional networks trained for action recognition [125]:

@article{GyglietalCoRR-17,
        title = {Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs},
       author = {Michael Gygli and Mohammad Norouzi and Anelia Angelova},
      journal = {CoRR},
       volume = {arXiv:1703.04363},
         year = {2017},
     abstract = {We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input. Once the model is trained, we perform inference by gradient descent on the continuous relaxations of the output variables to find outputs with promising scores from the value network. When applied to image segmentation, the value network takes an image and a segmentation mask as inputs and predicts a scalar estimating the intersection over union between the input and ground truth masks. For multi-label classification, the DVN’s objective is to correctly predict the F1 score for any potential label configuration. The DVN framework achieves the state-of-the-art results on multi-label prediction and image segmentation benchmarks.}
}

⁸³ Think of consciousness as the formation of low-dimensional combinations of simple concepts originating in an abstract rather than sensory space:

@article{BengioCoRR-17,
       author = {Yoshua Bengio},
        title = {The Consciousness Prior},
      journal = {CoRR},
       volume = {arxiv:1709.08568},
         year = {2017},
     abstract = {A new prior is proposed for representation learning, which can be combined with other priors in order to help disentangling abstract factors from each other. It is inspired by the phenomenon of consciousness seen as the formation of a low-dimensional combination of a few concepts constituting a conscious thought, i.e., consciousness as awareness at a particular time instant. This provides a powerful constraint on the representation in that such low-dimensional thought vectors can correspond to statements about reality which are either true, highly probable, or very useful for taking decisions. The fact that a few elements of the current state can be combined into such a predictive or useful statement is a strong constraint and deviates considerably from the maximum likelihood approaches to modeling data and how states unfold in the future based on an agent’s actions. Instead of making predictions in the sensory (e.g. pixel) space, the consciousness prior allow the agent to make predictions in the abstract space, with only a few dimensions of that space being involved in each of these predictions. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in the form of facts and rules, although the conscious states may be richer than what can be expressed easily in the form of a sentence, a fact or a rule.}
}

⁸⁴ By introducing the term cognitive apparatus I'm probably being too cautious here. The word "mind" seems pretentious and somewhat presumptive. "Brain" is close but I'm not sure that I want to divide up the autonomic (sympathetic and parasympathetic) and somatic (sensory and motor) nervous system along traditional lines.

⁸⁵ It takes an effort for me to listen to David Chalmers (EDGE) on augmented reality, consciousness or virtual worlds; Nick Bostrom (BIO) on what the chances are we are already living in a virtual world; Jaron Lanier on the myth of AI (EDGE). I'm only now learning to really listen carefully.

Historically, I've disagreed with all three of these philosophy / culture mavens, but we are always always stumbling on interesting lectures and interviews, and, while there are only so many minutes in a day, you could do worse than listen to a smart, thoughtful person talking about a subject he or she has spent a lifetime studying ... even if you disagree with a lot of what they say or the way in which they say it. Here's a suggestion: never actually look at the person when they're speaking and preferably avoid looking at them altogether, before, during or after their speaking. If you think appearances don't matter; think again.

Hidden biases arise from any number of factors besides physical appearance. It's not just whether you are male or female, young or old, your skin tone, race, religion, social standing, voice, etc. Humans instinctively and incorrectly believe that these characteristics tell you a lot about the person, their beliefs, the credence they attach to their claims, the strength of their conviction, the depth of their knowledge regarding whatever subject they are expostulating about, ... expose any of the markers we are programmed to believe reveal these character traits and you have been framed and your audience is now anchored in Kahneman and Tversky's sense of those terms.

⁸⁶ Imagine if you had a Fender Stratocaster physically integrated into your body. You need not be limited to six fingers and two hands but if we were to limit ourselves to a traditional Stratocaster with six strings, perhaps the most useful extension would be if you could press any six string-fret combinations on the fret-board as long as there was at most one combination per string.

You might have similar flexibility in picking with the other hand resulting in a level of dexterity that would put Chet Atkins and Jimi Hendrix to shame. The integration of a wah-wah pedal and tremolo bar could be controlled by twisting your torso or turning your head and you could control a room full of special-effects simply by directing your gaze at the icons in an image of a fully equipped studio sound-stage projected on your retina.

⁸⁷ Here are the BibTeX citations and abstracts from the three papers on learning how to write programs mentioned in the text [16, 15, 189]:

@inproceedings{LongandMartinICSE-16,
       author = {Long, Fan and Rinard, Martin},
        title = {An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems},
    booktitle = {Proceedings of the 38th International Conference on Software Engineering},
     location = {Austin, Texas},
    publisher = {ACM},
      address = {New York, NY, USA},
         year = {2016},
        pages = {702-713},
     abstract = {We present the first systematic analysis of key characteristics of patch search spaces for automatic patch generation systems. We analyze sixteen different configurations of the patch search spaces of SPR and Prophet, two current state-of-the-art patch generation systems. The analysis shows that (i) correct patches are sparse in the search spaces (typically at most one correct patch per search space per defect), (ii) incorrect patches that nevertheless pass all of the test cases in the validation test suite are typically orders of magnitude more abundant, and (iii) leveraging information other than the test suite is therefore critical for enabling the system to successfully isolate correct patches.}
}
@inproceedings{BalogetalICLR-17,
       author = {Matej Balog and Alexander L. Gaunt and Marc Brockschmidt and Sebastian Nowozin and Daniel Tarlow},
        title = {{DeepCoder}: Learning to Write Programs},
    booktitle = {International Conference on Learning Representations},
         year = 2017,
     abstract = {We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network's predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.}
}

⁸⁸ Here are the BibTeX citations and abstracts from three papers on the biology, psychology and related technology for theory-of-mind (ToM) reasoning [236, 34, 187, 87]:

@inbook{PynadathetalCAGI-14,
       author = {Pynadath, David V. and Rosenbloom, Paul S. and Marsella, Stacy C.},
       editor = {Goertzel, Ben and Orseau, Laurent and Snaider, Javier},
        title = {Reinforcement Learning for Adaptive Theory of Mind in the Sigma Cognitive Architecture},
    booktitle = {Proceedings of Artificial General Intelligence: 7th International Conference},
    publisher = {Springer International Publishing},
         year = {2014},
        pages = {143-154},
     abstract = {One of the most common applications of human intelligence is social interaction, where people must make effective decisions despite uncertainty about the potential behavior of others around them. Reinforcement learning (RL) provides one method for agents to acquire knowledge about such interactions. We investigate different methods of multiagent reinforcement learning within the Sigma cognitive architecture. We leverage Sigma's architectural mechanism for gradient descent to realize four different approaches to multiagent learning: (1) with no explicit model of the other agent, (2) with a model of the other agent as following an unknown stationary policy, (3) with prior knowledge of the other agent's possible reward functions, and (4) through inverse reinforcement learning (IRL) of the other agent's reward function. While the first three variations re-create existing approaches from the literature, the fourth represents a novel combination of RL and IRL for social decision-making. We show how all four styles of adaptive Theory of Mind are realized through Sigma's same gradient descent algorithm, and we illustrate their behavior within an abstract negotiation task.},
}
@article{BowmanetalDS-12,
       author = {Bowman, L. C. and Liu, D. and Meltzoff, A. N. and Wellman, H. M.},
        title = {Neural correlates of belief- and desire-reasoning in 7- and 8-year-old children: an event-related potential study},
      journal = {Development Science},
       volume = {15},
       number = {5},
         year = {2012},
        pages = {618-632},
     abstract = {Theory of mind requires belief- and desire-understanding. Event-related brain potential (ERP) research on belief- and desire-reasoning in adults found mid-frontal activations for both desires and beliefs, and selective right-posterior activations only for beliefs. Developmentally, children understand desires before beliefs; thus, a critical question concerns whether neural specialization for belief-reasoning exists in childhood or develops later. Neural activity was recorded as 7- and 8-year-olds (N = 18) performed the same diverse-desires, diverse-beliefs, and physical control tasks used in a previous adult ERP study. Like adults, mid-frontal scalp activations were found for belief- and desire-reasoning. Moreover, analyses using correct trials alone yielded selective right-posterior activations for belief-reasoning. Results suggest developmental links between increasingly accurate understanding of complex mental states and neural specialization supporting this understanding.}
}
@article{LiuetalCD-09,
       author = {Liu, D.  and Meltzoff, A. N.  and Wellman, H. M.},
        title = {Neural correlates of belief- and desire-reasoning},
      journal = {Child Development},
       volume = {80},
       number = {4},
         year = {2009},
        pages = {1163-1171},
     abstract = {Theory of mind requires an understanding of both desires and beliefs. Moreover, children understand desires before beliefs. Little is known about the mechanisms underlying this developmental lag. Additionally, previous neuroimaging and neurophysiological studies have neglected the direct comparison of these developmentally critical mental-state concepts. Event-related brain potentials were recorded as participants (N = 24; mean age = 22 years) reasoned about diverse-desires, diverse-beliefs, and parallel physical situations. A mid-frontal late slow wave (LSW) was associated with desire and belief judgments. A right-posterior LSW was only associated with belief judgments. These findings demonstrate neural overlap and critical differences in reasoning explicitly about desires and beliefs, and they suggest children recruit additional neural processes for belief judgments beyond a common, more general, mentalizing neural system.}
}
@article{EmeretalSN-06,
       author = {Ermer, E. and Guerin, S. A. and Cosmides, L. and Tooby, J. and Miller, M. B.},
        title = {Theory of mind broad and narrow: reasoning about social exchange engages {T}o{M} areas, precautionary reasoning does not},
      journal = {Society of Neuroscience},
       volume = {1},
       number = {3-4},
         year = {2006},
        pages = {196-219},
     abstract = {Baron-Cohen (1995) proposed that the theory of mind (ToM) inference system evolved to promote strategic social interaction. Social exchange--a form of co-operation for mutual benefit--involves strategic social interaction and requires ToM inferences about the contents of other individuals' mental states, especially their desires, goals, and intentions. There are behavioral and neuropsychological dissociations between reasoning about social exchange and reasoning about equivalent problems tapping other, more general content domains. It has therefore been proposed that social exchange behavior is regulated by social contract algorithms: a domain-specific inference system that is functionally specialized for reasoning about social exchange. We report an fMRI study using the Wason selection task that provides further support for this hypothesis. Precautionary rules share so many properties with social exchange rules--they are conditional, deontic, and involve subjective utilities--that most reasoning theories claim they are processed by the same neurocomputational machinery. Nevertheless, neuroimaging shows that reasoning about social exchange activates brain areas not activated by reasoning about precautionary rules, and vice versa. As predicted, neural correlates of ToM (anterior and posterior temporal cortex) were activated when subjects interpreted social exchange rules, but not precautionary rules (where ToM inferences are unnecessary). We argue that the interaction between ToM and social contract algorithms can be reciprocal: social contract algorithms requires ToM inferences, but their functional logic also allows ToM inferences to be made. By considering interactions between ToM in the narrower sense (belief-desire reasoning) and all the social inference systems that create the logic of human social interaction--ones that enable as well as use inferences about the content of mental states--a broader conception of ToM may emerge: a computational model embodying a Theory of Human Nature (ToHN).}
}

⁸⁹ I may be delusional but I believe we can now build a complete, reasonably accurate facsimile of the human brain. If I'm right that means we can build an intelligence that is at least as capable of human beings in terms of its ability to interact with and perform computations similar to human beings but is also extensible allowing us to engineer artificial beings based on a silicon substrate that would be substantially more durable and computationally powerful than unaugmented human beings. I believe that all the clues are available for anyone to see. Clearly it will take substantial computational and engineering resources to deliver on the extrapolation from current technology I just made, but it implies that super-human level intelligence will scale with Moore's law and in general with respect to the law of accelerating exponential return from advanced technology.

In Issue 42 of The Wild and Crazy Guide to the Singularity, I mentioned we could build an AI today comparable in function to a human being. By that I mean engineers could sit down with cognitive and systems neuroscientists and figure out how to put together existing pieces of technology to produce a reasonable simulation of a human brain, I also mentioned that because of the law of accelerating returns we could extend this basic brain design by trading computational capacity and functional capability for power since we still don’t know how to achieve comparable efficiency. What's perhaps scary is that we could add as prosthetics the many powerful tools we use every day as technology for accelerating scientific progress and exploit these as add-ons to create a more powerful and capable machine than those designed by evolution and natural selection.

This claim does not imply we don't have a great deal to learn about the detailed organization, function and pathology of biological brains, crucial in diagnosing neurological disorders. Moreover, at the systems neuroscience level, advances in computer architecture and algorithm design will profit from a better understanding of the details of how biological brains work. It is also worth noting that humans encompass a huge variety of biological diversity in both form and function — including the molecular machinery comprising our immune systems and the extensive biota that populate our bodies and considerably expand our inherent assets, most of which is more complex than what is needed in order to replicate the basic function and architecture of the human brain, but that, as long as we inhabit flesh bodies, will be essential to our health and well-being.

⁹⁰ When it comes to research and development in a commercial setting, you have to establish a reasonable tradeoff between the two. If the problem is completely terra incognita and you have no backup plan, then you're likely to fail. If the problem is tractable with little or no research required — figure out how to put well-understood components together, make the resulting system scale and secure and improve the user interface, then it's "just engineering" as some managers disparagingly quip.

The line between research and development can be thin at times, and, in this case, the question of whether to "hack it" or "go boldly" is often left hanging. Whether a manager should bet on a project or not, depends on the risk profile of the people assessing the R&D tradeoff; a good technical manager is able to work with overly optimistic research scientists and overly pessimistic software engineers to produce a reasonably accurate assessment of risks and rewards.

This aside reminded me of the pitch I made at the beginning of launching an early personal assistant project — here's a video of a presentation I gave a senior VP in early 2014. The VP was definitely of the "hack it" persuasion and almost convinced the technical lead — who was already chomping at the bit to start hiring infrastructure talent — to create a music DJ demo modeled after some locally famous NYC radio personality whom none of us had ever heard of.

In the end, we collaborated with Google Play Music, but when NLP problems started to surface, the program manager focused on infrastructure, duplicating effort elsewhere in the organization, and neglecting the problem of continuous dialog that was core to the product. The PM was unwilling to consider the new recurrent neural network "chatbot" technology being developed in the Google Brain group at the time. Needless to say, the project was discontinued soon after.

⁹¹ Recently we discussed the prospects for building better retinal prosthetics. The problem is plagued by several techncial challenges, including (i) we don't know exactly how the retina works, (ii) technology for interfacing silicon and neural circuitry is in its infancy, and (iii) we don't know how the prosthetic should communicate with the visual system. Since the R&D tradeoffs are complicated and technical challenges daunting, it makes sense to consider simple low-tech alternatives.

I remember our experience with early BrainGate technology. Michael Black was using relatively sophisticated sensing and pattern recognition techniques for the BCI and in the end we found that back-to-basics kalman filtering with no fancy spike sorting or preprocessing was better. The lesson was clear, you provide the brain with meaningful information and it will sort it out and make use of it.

David Eagleman made this even clearer for the layperson in his self-promotional TED talk showing off the VEST (Versatile Extra-Sensory Transducer) technology now being developed at the kickstarter launched Neosensory.

Simple KISS engineering exploiting what the brain does best. The VEST doesn't seem adequate to handling complex sensory experience until you think about how many bits you could actually convey by multiplexing a visual or auditory signal. Think about how body-schema mapped regions of the somatosensory cortex increase in size / density of neurons in response to learning or shrink as in the case of the rat barrel cortex when you pluck or clip the rat's whiskers.

Disclaimer: I did not contribute directly to the early BrainGate technology. I recruited Michael when he was at Xerox PARC and eventually hired him at Brown University where I was Chair of the Computer Science Department and served on the Scientific Advisory Board of the Brown Institute for Brain Science during its formation . John Donoghue who co-founded Cyberkinetics was the Chair of the Neuroscience Department and also chaired the SAB. One of my former students, Arjun Bansal, did his PhD with Donoghue, then joined BrainGate and later worked at Qualcomm before founding Nervana Systems which was acquired by Intel in 2016.

⁹² Whenever we become aware of an unexpected piece of information, the brain suddenly seems to burst into a large-scale activity pattern. My colleagues and I have called this property global ignition. We were inspired by the Canadian neurophysiologist Donald Hebb, who first analyzed the behavior of collective assemblies of neurons in his 1949 best seller The Organization of Behavior. Hebb explained, in very intuitive terms, how a network of neurons that excite one another can quickly fall into a global pattern of synchronized activity—much as an audience, after the first few handclaps, suddenly bursts into broad applause. Like the enthusiastic spectators who stand up after a concert and contagiously spread the applause, the large pyramidal neurons in the upper layers of cortex broadcast their excitation to a large audience of receiving neurons.

Global ignition, my colleagues and I have suggested, occurs when this broadcast excitation exceeds a threshold and becomes self-reinforcing: some neurons excite others that, in turn, return the excitation. The net result is an explosion of activity: the neurons that are strongly interconnected burst into a self-sustained state of high-level activity, a reverberating "cell assembly," as Hebb called it.

This collective phenomenon resembles what physicists call a "phase transition," or mathematicians a "bifurcation": a sudden, nearly discontinuous change in the state of a physical system. Water that freezes into an ice cube epitomizes the phase transition from liquid to solid. Early on in our thinking about consciousness, my colleagues and I noted that the concept of phase transition captures many properties of conscious perception. Like freezing, consciousness exhibits a threshold: a brief stimulus remains subliminal, while an incrementally longer one becomes fully visible. Most physical self-amplifying systems possess a tipping point where global change happens or fails depending on minute impurities or noise. The brain, we reasoned, may be no exception.

⁹³ So how could we separate the genuine conscious code from its accompanying unconscious bells and whistles? In principle, the answer is easy. We need to search the brain for a decodable neural representation whose content correlates 100 percent with our subjective awareness. The conscious code that we are looking for should contain a full record of the subject’s experience, replete with exactly the same level of detail as the person perceives. It should be insensitive to features that she misses, even if they are physically present in the input. Conversely, it should encode the subjective content of conscious perception, even if that perception is an illusion or a hallucination. It should also preserve our subjective sense of perceived similarity: when we see a diamond and a square as two distinct shapes, rather than rotated versions of each other, so should the brain’s conscious representation.

The conscious code should also be highly invariant: it should stay put whenever we feel that the world is stable, but change as soon as we see it moving. This criterion strongly constrains the search for signatures of consciousness, because it almost certainly excludes all our early sensory areas. As we walk down a corridor, the walls project a constantly changing image on our retinas—but we are oblivious to this visual motion and perceive a stable room. Motion is omnipresent in our early visual areas but not in our awareness. Three or four times per second, our eyes jiggle around. As a result, on the retina as well as in most of our visual areas, the entire image of the world slips back and forth.

Fortunately, we remain oblivious to this nauseating swirling: our perception remains steady. Even when we gaze at a moving target, we do not perceive the background scenery gliding in the opposite direction. In the cortex, our conscious code must therefore be similarly stabilized. Somehow, thanks to the motion sensors in our inner ear and to predictions arising from our motor commands, we manage to subtract out our own motion and perceive our environment as an invariant entity. Only when these predictive motor signals are bypassed—for instance, when you move your eye by gently poking it with a finger—does the whole world seem moving.

⁹⁴ Forkhead box protein P2 (FOXP2) is a protein that, in humans, is encoded by the FOXP2 gene is required for proper development of speech and language.

⁹⁵ As an inspiration and relaxing interlude, you may find it worthwhile listening to the documentary entitled "Joshua Bell," produced in 1995 for BBC Television’s "Omnibus" series, on the life of former child prodigy Joshua Bell. At age 26, he was regarded as one of the greatest violinists in the world. The film looks back at Bell's childhood in Indiana, focusing on his early life and introduction to the violin. Despite his early interest in the violin, he led an otherwise normal childhood. His first teacher was rather taciturn and did little to inspire.

However, his second teacher Josef Gingold was warm, supportive and emanated such a deep love of music that Bell blossomed under his tutelage. Gingold introduced Bell to the work of Fritz Kreisler figuring so prominently in Gingold's career and provided Bell with additional inspiration and a deeper appreciation for the art and craft of a classical musician. It is interesting to speculate how Gingold's encouragement shaped the neural basis of Bell's emotional experience of listening to and playing classical music and the physical and intellectual challenges of playing the violin.

As we'll see later, Graziano's theory relies on the idea of attention schema [107]. He writes that "[a] schema is a complex information structure that is used to represent something. One of the first schemas to be extensively studied in psychology and neuroscience was the body schema, first proposed by Head and Holmes in 1911. The body schema is an internal model — an organized set of information that represents the shape, structure, and movement of the body, that distinguishes between objects belonging to the body and objects that are foreign." You might find it interesting to learn about how the body schema changes when someone learns to ride a bike or play a musical instrument [28].

⁹⁶ Attention, physiological attention as it is understood by neuroscientists, is a procedure. It is a competition between signals in the brain. It occurs because of the specific way that neurons interact with each other. One set of signals, carried by one set of neurons, rises in strength and suppresses other, competing signals carried by other neurons. For example, the visual system builds informational models of the objects in a scene. If you are looking at a cluttered scene, which informational model in your brain will win the competition of the moment, rise in signal strength, suppress other models, and dominate the brain’s computations? This competition among signals—the process by which one signal wins and dominates for a moment, then sinks down as another signal dominates—is attention. Attention may be complicated, but it is not mysterious. It is physically understandable. It could be built into a circuit [107].

⁹⁷ The correspondence between awareness and attention is close. In the previous chapter I outlined a list of similarities, and in later chapters I will discuss the relationship in greater detail. The two are so similar that it is tempting to think they might be the same thing with no distinction at all. But there is a fundamental difference. Attention, the competition among signals and the enhancement of signals in the brain, is a mechanistic process. It is not explicit knowledge. Awareness, in contrast, is accessible as explicit knowledge. The brain does attention but knows awareness. The relationship between attention and awareness is therefore exactly the relationship between [...] between a real thing and a representation of it in the brain that is cognitively accessible. Awareness, in this view, is a description, a useful if physically inaccurate sketch of what it means for the brain to focus its attention [107].

⁹⁸ Much has been learned recently about the neuronal basis of decision-making, especially in the relatively simple case of visual motion. Suppose that you are looking at a blurry or flickering image and are asked to decide its direction of motion. It can drift either to the left or the right, but because of the noisy quality of the image, you have trouble determining the direction. By making the task difficult in this way, neuroscientists can slow down the decision process, thereby making it easier to study.

This decision process appears to work as follows. First, the machinery in the visual system constructs signals that represent the motion of the image. Because the visual image is noisy, it may result in conflicting signals indicating motion in a variety of directions. Second, those signals are received elsewhere in the brain by decision integrators. The decision integrators determine which motion signal is consistent enough or strong enough to cross a threshold. Once the threshold is crossed, a response is triggered. In this way, the system decides which direction the image is likely to be moving.

Strictly speaking, the system is not deciding whether the visual image is moving to the right or left. It is deciding between two information streams in the brain: is the left or right motion signal stronger? The decision can even be manipulated by inserting a fine electrode into the brain, into a particular part of the visual system, passing a very small electrical current and thereby boosting one or another of the motion signals.

Since neuroscientists have some notion of the brain’s machinery for decision-making, what can be inferred about awareness? As noted above, you can decide whether you have, inside of you, an awareness of something. Therefore awareness—or at least whatever it is you are deciding you have when you say you have awareness—can be fed into a decision integrator. We can make the task even more comparable to the visual-motion task. By making the images extremely brief or dim, we can make the task difficult. You may struggle for a moment, trying to decide whether any awareness of a stimulus is present. The decision machinery is engaged. This insight that conscious report depends on the machinery of decision-making has been pointed out before.

A crucial property of decision-making is that not only is the decision itself a manipulation of data, but the decision machine depends on data as input. It does not take any other input. Feeding in some res cogitans will not work on this machine. [...] You can’t feed it ectoplasm. You can’t feed it an intangible, ineffable, indescribable thing. You can’t feed it an emergent property that transcends information. You can only feed it information.

In introspecting, in asking yourself whether you have an awareness of something, and in making the decision that you have it, what you are deciding on, what you are assessing, the actual stuff your decision engine is collecting, weighing, and concluding that you have, is information. Strictly speaking, the neuronal machinery is deciding that certain information is present in your brain at a signal strength above a threshold [107].

⁹⁹ Simple pleasant and unpleasant feelings come from an ongoing process inside you called interoception. Interoception is your brain’s representation of all sensations from your internal organs and tissues, the hormones in your blood, and your immune system. Your insides are in motion. Your heart sends blood rushing through your veins and arteries. Your lungs fill and empty. Your stomach digests food. This interoceptive activity produces the spectrum of basic feeling [19].

¹⁰⁰ [W]hen doing insight practices, the only useful gold standard for reality is your own sensate experience. From the conventional point of view, things are usually thought to be there even when you can no longer experience them, and are thus assumed with only circumstantial evidence to be somewhat stable entities. Predictability is used to assume continuity of existence. For our day-to-day lives, this assumption is adequate and often very useful.

However, when doing insight practices, it just happens to be much more useful to assume that things are only there when you experience them and not there when you don’t. Thus, the gold standard for reality when doing insight practices is the sensations that make up your reality in that instant. Sensations not there at that time do not exist, and thus only the sensations arising in that instant do exist. In short, the vast majority of what you usually think of as making up your universe doesn’t exist the vast majority of the time, from a pure sensate point of view [198].

¹⁰¹ Your nervous system sends sensory input to your brain when your muscles or joints are injured, or your body tissues are damaged by excessive heat or cold, or in response to a chemical irritation like a pinch of pepper in your eye. This process is called nociception. Pain is an experience that occurs not only from physical damage but also when your brain predicts damage is imminent.

You might feel the pain even before the nurse pricks your arm with a needle. Your predictions are then corrected by actual nociceptive input from the body—the injection occurs—and once any prediction errors are dealt with, you have categorized the nociception sensations and made them meaningful. The pain you experience as coming from the needle is really in your brain. [19].

¹⁰² Your brain’s 86 billion neurons, which are connected into massive networks, never lie dormant awaiting a jump-start. Your neurons are always stimulating each other, sometimes millions at a time. Given enough oxygen and nutrients, these huge cascades of stimulation, known as intrinsic brain activity, continue from birth until death. This activity is nothing like a reaction triggered by the outside world. It’s more like breathing, a process that requires no external catalyst [21].

¹⁰³ Excerpts from Daniel Ingram's Mastering the Core Teachings of the Buddha, An Unusually Hardcore Dharma Book [198] relating to the experience of arising and passing (A&P) during insight meditation — also called the "4th insight stage" or the "2nd Vipassana jhana" in texts on advanced meditative practice:

Perceptual Thresholds — My favorite of the criteria, particularly found in technical and skilled meditators but also found in many others: people during the arising and passing may have the ability to perceive sensations with a speed, precision, and consistency that may be radically beyond what they were capable of before, such that they may perceive sensations up to maybe 40 times/second arising and vanishing during certain peak perceptual moments, particularly during the middle phase of the breath and in the center of wherever they place their attention. The phase characteristics of the A&P borrows from the 2nd jhana in general and involves the ability to perceive the arising and passing clearly of phenomena in a way that can feel quite effortless, and any vibrations noticed tend to be harmonically simple and change in frequency sinusoidally.
Insights into Selflessness — Some may perceive that all phenomena are arising and passing away, and wherever they turn their attention may notice the transience of sensations. Extrapolating from this clear perception, they may realize: "Ah, this means that there is no permanent self." Further, unitive experiences may have the same effect, and further, in some strange intuitive way the same basic notion of something having changed in the basic notion of self-hood may shift to something less solid. Also, the fact that the A&P experiences tend to happen in a way that is seemingly effortless or even unbidden, this experience of natural occurrences can also reinforce the notion that there is less control of things than one initially suspected, adding to the sense of there being somehow less of a self in things.

Cognitive Abilities — People who are in and have crossed the A&P tend to have an easier time with processes variously called things like "vision logic", "metacognitive processing" and the like. For those who are prone to such things, they will tend to have philosophical talents beyond those who have not crossed the A&P, realizing that things like age, underlying intelligence however defined, exposure to philosophical and related branches of thinking, and education level can significantly effect how this presents. They will also tend to have an increased ability to understand and navigate in terminology that may reluctantly be termed "spiritual", though this may show in other ways, such as an increased appreciation of things like the profundity and beauty of differential equations, the implications of modern physics for questions of Subject-Object non-duality, debates of free will vs super-determinism, and the like.

¹⁰⁴ Here is a speculative evolutionary timeline of consciousness from Consciousness and the Social Brain by Michael S. A. Graziano [107]:

The theory at a glance: from selective signal enhancement to consciousness. About half a billion years ago, nervous systems evolved an ability to enhance the most pressing of incoming signals. Gradually, this attentional focus came under top-down control. To effectively predict and deploy its own attentional focus, the brain needed a constantly updated simulation of attention. This model of attention was schematic and lacking in detail. Instead of attributing a complex neuronal machinery to the self, the model attributed to the self an experience of X — the property of being conscious of something.
Just as the brain could direct attention to external signals or to internal signals, that model of attention could attribute to the self a consciousness of external events or of internal events. As that model increased in sophistication, it came to be used not only to guide one’s own attention, but for a variety of other purposes including understanding other beings. Now, in humans, consciousness is a key part of what makes us socially capable.

In this theory, consciousness emerged first with a specific function related to the control of attention and continues to evolve and expand its cognitive role. The theory explains why a brain attributes the property of consciousness to itself, and why we humans are so prone to attribute consciousness to the people and objects around us.

Timeline: Hydras evolve approximately 550 million years ago (MYA) with no selective signal enhancement; animals that do show selective signal enhancement diverge from each other approximately 530 MYA; animals that show sophisticated top-down control of attention diverge from each other approximately 350 MYA; primates first appear approximately 65 MYA; hominids appear approximately 6 MYA; Homo sapiens appear approximately 0.2 MYA

¹⁰⁵ Notes and Excerpts on Language and Sensory Association Areas:

"I argue that language comprehension in sighted people might best be thought of as a kind of code-directed scene comprehension that draws heavily upon specifically visual, and probably largely prelinguistic processing constraints. The key processes of word-recognition and the assembly of visual word meaning patterns into interacting chains, however, may be mediated in part by species-specific activity patterns in secondary auditory cortex similar to those generated by uninterpreted speech-sound sequences." — (SOURCE: [117] in [269])
"This tendency to ignore the structure of the brain is quite unfortunate in light of the recent progress made in primate neurobiology. Most current texts of physiological psychology, neuropsychology, and cognitive neuroscience (e.g., Caplan, 1987; Ellis and Young, 1988) still implicitly employ a model of the organization and evolution of the cortex that dates to the associationists of the late nineteenth century. In this way of thinking, ’primitive’ mammals like rats start out with primary visual, auditory, and somatosensory areas almost touching.

Next up the rung of an essentially pre-evolutionary scala natura come animals like cats, which have a small amount of ’uncommitted’ space in between. Finally, at the top, are primates and especially humans, where we find a great deal of uncommitted ’association’ cortex, properly situated to integrate and associate the modality-specific information presented to it by visual, auditory and somatosensory cortices (see e.g., Fodor, 1983; Ellis and Young, 1988, on the ’semantic system’ postulated in most models of word processing; Damasio, 1989)." — (SOURCE: [258])

¹⁰⁶ Von Economo neurons (VENs) (spindle cells) are not ubiquitous across animal brains. Only humans, great apes, certain whales, and elephants have been shown to possess VENs thus far. The relative number of cells varies by species, with humans having by far the greatest amount — over twice the number of our nearest evolutionary neighbor. One argument is that some aspect of VEN function is giving rise to the cognitive abilities that make us uniquely human. SOURCE

¹⁰⁷ Notes and Excerpts on Mirror Neurons and Physical Intuition:

Related mimetic concepts, acronyms, terminology and pseudo technical babble: Mirror Neural Network; Reflective Artificial Intelligence Dynamical System; Conscious Embodied and Virtually Embedded Models; association areas & sensory integration; revisiting couch potato; body-centric analogical machines, embodied cognition [28], brain systems supporting allostasis and interoception in humans [162].
"Mirror neurons represent a distinctive class of neurons that discharge both when an individual executes a motor act and when he observes another individual performing the same or similar motor act", "In humans, brain activity consistent with that of mirror neurons has been found in the premotor cortex, the supplementary motor area, the primary so much a sensory cortex, and the inferior parietal cortex", "a recent experiment [...] Shown that, in both humans and monkeys, the mere system also responds to the sound of actions. Functional magnetic resonance imaging (fMRI) can examine the entire brain at once and suggest that a much wider network of brain areas shows mere properties in humans than previously thought. These additional areas include the somatosensory and are thought to make the observer feel what it feels like to move in the observed", "the researchers found a small number of neurons that fired or showed their greatest activity both when the individual performed a task and when he observed a task. Other neurons had anti-mirror properties, that is, they responded when the participant saw an action but were inhibited when the participant perform that action. The mirror neurons found were located in the supplementary motor area and medial temporal cortex."

Mirror neurons are associated with one of the most intriguing aspect of our complex thought process, that is "Intention understanding". There are two distinct processes of information that one can get observing an action done by another individual.

"Thus, the activation of IPL actionconstrained mirror neurons give information not only about, but also on why grasping is done (grasping for eating or grasping for placing). This specificity allowed the observer not only to recognize the observed motor act, but also to code what will be the next motor act of the not—yet—observed action, in other words to understand the intentions of the action's agent."

"The observation of an individual grasping an apple is immediately understood because it evokes the same motor representation in the parieto frontal mirror system of the observer. On the basis of this fundamental property of mirror neurons and the fact that the observation of actions like hand grasping activates the caudal part of IFG (Broca's area), neuroscientists proposed that the mirror mechanism is the basic mechanism from which language evolved."

"It is obvious that the mirror mechanism does not explain by itself the enormous complexity of speech. but, it solves one of the fundamental difficulties for understanding language evolution, that is, how and what is valid for the sender of a message become valid also for the receiver. Hypotheses and speculations on the various steps that have led from the monkey mirror system to language have been recently advanced."

"The ability to make consistent connections across different senses may have initially evolved in lower primates, but it went on developing in a more sophisticated manner in humans through remapping of mirror neurons which then became coopted for other kinds of abstraction that humans excel in, like reasoning metaphors."

"Michael Corballis, an eminent cognitive neuroscientist, argues that what distinguishes us in the animal kingdom is our capacity for recursion, which is the ability to embed our thoughts within other thoughts. "I think, therefore I am" is an example of recursive thought, because the thinker has inserted himself into his thought. Recursion enables us to conceive of our own minds and the minds of others. It also gives us the power of mental "time travel" that is the ability to insert past experiences, or imagined future ones, into present consciousness. Corballis demonstrates how these recursive structures led to the emergence of language and speech, which ultimately enabled us to share our thoughts, plan with others, and reshape our environment to better reflect our creative imaginations. Mirror neurons shape the power of recursive embedding."

"This theory suggests that humans can construct a model in their brains of the thoughts and intentions of others. We can predict the thoughts, actions of others. The theory holds that humans anticipate and make sense of the behavior of others by activating mental processes that, if carried into action, would produce similar behavior. This includes intentional behavior as well as the expression of emotions. The theory states that children use their own emotions to predict what others will do. Therefore, we project our own mental states onto others. Mirror neurons are activated both when actions are executed, and the actions are observed. This unique function of mirror neurons may explain how people recognize and understand the states of others; mirroring observed action in the brain as if they conducted the observed action."

¹⁰⁸ Oliver Selfridge and I. J. Good — often disparaging called Turing's statistician but actually one of the most prolific scientists of his generation — were the only two Bletchley Park veterans I knew personally. Each of them was unique in their particular brand of genius. Each was iconoclastic and scientifically engaged throughout their long professional careers. Good wrote a huge amount about AI [100] and influenced my, Stuart Russell's and Eric Horvitz's research on bounded rationality [101].

¹⁰⁹ One important contribution of Stanislas Dehaene's work is that, building on the philosophical and cognitive-psychology work of Daniel Dennett, Paul and Patricia Churchland and other more enlightened thinkers, the clarity of Dehaene's theories and experiments will put to rest the arguments of Thomas Nagel ("What is it like to be a bat?), Ned Block ("Inverted Earth") and, to a lesser degree, David Chalmers' equivocating ("Facing Up to the Problem of Consciousness") or at the very least put them on a shelf labeled "Qualia and other made-up phenomena that are practically useless, intellectually barren and psychologically disturbing". See this entry in the Stanford Encyclopedia of Philosophy for a discussion of qualia, or, better yet, read Chapter 12 entitled "Qualia Disqualified" in Dan Dennett's Consciousness Explained [76].

¹¹⁰ If you want a quick review of the theories surrounding consciousness, I suggest you watch this 30 minute documentary featuring: Eben Alexander, Susan Blackmore, David Chalmers, Deepak Chopra, Patricia Churchland, Stanislas Dehaene, Daniel Dennett, Stuart Hameroff, Dean Radin, John Searle and Rupert Sheldrake, and, if you can't dismiss all but Dehaene for their lack of precision and all the rest except Dennett, Churchland and Blackmore for their lack of clarity and a profound state of befuddlement, then you have a lot of remedial reading and thinking to do.

¹¹¹ See here for an exercise that might help you to better understand consciousness both viscerally and computationally. Note that, however, while a purely computational account may turn out to be relatively straightforward for a computer scientist to understand intellectually, the implications of such an account could prove uncomfortable if they undermine the stories we tell about ourselves, reflect negatively on the special status we accord our species, or seem at odds with our everyday experience.

¹¹² Here is a small selection of papers published between 2000 and 2010 relating to language development and the evolution of ToM reasoning. The theories are like all scientific theories, subject revision based on subsequent contravening evidence, but the theories represented in this list are particular vulnerable due to the paucity of the evidence for and against at the time of their publications. Unfortunately, at the time of this writing — September 2017 — none of these theories have definitively been shown to be false or convincingly demonstrated to being even contingently true:

@article{PyersandSenghasPSYCHOLOGICAL-SCIENCE-09,
       author = {Pyers, Jennie E. and Senghas, Ann},
        title = {Language Promotes False-Belief Understanding: Evidence From Learners of a New Sign Language},
      journal = {Psychological Science},
       volume = {20},
        issue = {7},
         year = {2009},
        pages = {805-812},
     abstract = {Developmental studies have identified a strong correlation in the timing of language development and false-belief understanding. However, the nature of this relationship remains unresolved. Does language promote false-belief understanding, or does it merely facilitate development that could occur independently, albeit on a delayed timescale? We examined language development and false-belief understanding in deaf learners of an emerging sign language in Nicaragua. The use of mental-state vocabulary and performance on a low-verbal false-belief task were assessed, over 2 years, in adult and adolescent users of Nicaraguan Sign Language. Results show that those adults who acquired a nascent form of the language during childhood produce few mental-state signs and fail to exhibit false-belief understanding. Furthermore, those whose language developed over the period of the study correspondingly developed in false-belief understanding. Thus, language learning, over and above social experience, drives the development of a mature theory of mind.},
}
@article{BednyetalPNAS-09,
       author = {Bedny, Marina and Pascual-Leone, Alvaro and Saxe, Rebecca R.},
        title = {Growing up blind does not change the neural bases of Theory of Mind},
      journal = {Proceedings of the National Academy of Sciences},
       volume = {106},
       number = {27},
        pages = {11312-11317},
         year = {2009},
     abstract = {Humans reason about the mental states of others; this capacity is called Theory of Mind (ToM). In typically developing adults, ToM is supported by a consistent group of brain regions: the bilateral temporoparietal junction (TPJ), medial prefrontal cortex (MPFC), precuneus (PC), and anterior temporal sulci (aSTS). How experience and intrinsic biological factors interact to produce this adult functional profile is not known. In the current study we investigate the role of visual experience in the development of the ToM network by studying congenitally blind adults. In experiment 1, participants listened to stories and answered true/false questions about them. The stories were either about mental or physical representations of reality (e.g., photographs). In experiment 2, participants listened to stories about people's beliefs based on seeing or hearing; people's bodily sensations (e.g., hunger); and control stories without people. Participants judged whether each story had positive or negative valance. We find that ToM brain regions of sighted and congenitally blind adults are similarly localized and functionally specific. In congenitally blind adults, reasoning about mental states leads to activity in bilateral TPJ, MPFC, PC, and aSTS. These brain regions responded more to passages about beliefs than passages about nonbelief representations or passages about bodily sensations. Reasoning about mental states that are based on seeing is furthermore similar in congenitally blind and sighted individuals. Despite their different developmental experience, congenitally blind adults have a typical ToM network. We conclude that the development of neural mechanisms for ToM depends on innate factors and on experiences represented at an abstract level, amodally.},
}
@article{AnanthaswamyNEW-SCIENTIST-09,
        title = {Language may be key to theory of mind},
       author = {Anil Ananthaswamy},
      journal = {New Scientist},
         year = 2009,
      comment = {This is a general-science survey article that references the two, more-rigorous papers mentioned above.}, 
     abstract = {How blind and deaf people approach a cognitive test regarded as a milestone in human development has provided clues to how we deduce what others are thinking. Understanding another person's perspective, and realising that it can differ from our own, is known as theory of mind.},
}
@article{deVilliersLINGUA-07,
       author = {de Villiers, Jill},
        title = {The Interface of Language and Theory of Mind},
      journal = {Lingua: International Review of General Linguistics},
         year = {2007},
       volume = {117},
        issue = {11},
        pages = {1858-1878},
     abstract = {The proposal is made that the interface between language and theory of mind is bidirectional. It seems probable that the conceptual developments of early Theory of Mind form an essential basis for helping to fix at least word reference. In development from two to four years, no basis exists in research for conclusions about the direction of influence between language and Theory of Mind. At the stage of false belief reasoning, after age four, the role of the mastery of syntactic complementation is highlighted as a representational tool, that is, language development assists reasoning. The paper presents a brief summary of Theory of Mind, ranging from its earliest beginnings in infancy to the appreciation around age four years that others might hold false beliefs and act according to them. For each development, the parallel language developments are described, and questions are raised about the interface between the two. In particular, research that might determine the direction of influence from one to the other is discussed. More work is called for, especially with nonverbal tasks, good experimental linguistic work, and other special populations, that might allow a more precise delineation of how language and Theory of Mind interrelate at the interface.},
}
@incollection{MalleLANGUAGE-THEORY-OF-MIND-02,
       author = {Malle, B. F.},
        title = {The relation between language and theory of mind in development and evolution},
       editor = {T. Giv\'{o}n and B. F. Malle},
    booktitle = {The evolution of language out of pre-language},
    publisher = {Benjamins},
      address = {Amsterdam},
         year = {2002},
        pages = {265-284},
     abstract = {There is reason to believe that language and theory of mind have co- evolved, given their close relation in development and their tight connection in social behavior. However, they are clearly not inseparable—neurologically, cognitively, or functionally. So the question becomes, "What is the exact relation between language and theory of mind, in evolution, development, and social behavior?" To answer this question is a daunting task; I will try merely to clear a path toward an answer. I will consider several possible relations between the two faculties, bring conceptual arguments and empirical evidence to bear on them, and end up arguing for an escalation process in which language and theory of mind have fueled each other's evolution.}
}
@article{CharmanetalCOGNITIVE-DEVELOPMENT-00,
        title = {Testing joint attention, imitation, and play as infancy precursors to language and theory of mind},
      journal = {Cognitive Development},
       volume = {15},
       number = {4},
        pages = {481-498},
         year = {2000},
       author = {Tony Charman and Simon Baron-Cohen and John Swettenham and Gillian Baird and Antony Cox and Auriol Drew},
     abstract = {Various theoretical accounts propose that an important developmental relation exists between joint attention, play, and imitation abilities, and later theory of mind ability. However, very little direct empirical evidence supports these claims for putative "precursor" theory of mind status. A small sample (N=13) of infants, for whom measures of play, joint attention, and imitation had been collected at 20 months of age, was followed-up longitudinally at 44 months and a battery of theory of mind measures was conducted. Language and IQ were measured at both timepoints. Imitation ability at 20 months was longitudinally associated with expressive, but not receptive, language ability at 44 months. In contrast, only the joint attention behaviours of gaze switches between an adult and an active toy and looking to an adult during an ambiguous goal detection task at 20 months were longitudinally associated with theory of mind ability at 44 months. It is argued that joint attention, play, and imitation, and language and theory of mind, might form part of a shared social–communicative representational system in infancy that becomes increasingly specialised and differentiated as development progresses.}
}
@article{FranketalCOGNITION-08,
       author = {Frank, M. C.  and Everett, D. L.  and Fedorenko, E.  and Gibson, E. },
        title = {Number as a cognitive technology: evidence from {P}irahã language and cognition},
      journal = {Cognition},
         year = {2008},
       volume = {108},
       number = {3},
        pages = {819-824},
     abstract = {Does speaking a language without number words change the way speakers of that language perceive exact quantities? The Pirahã are an Amazonian tribe who have been previously studied for their limited numerical system [Gordon, P. (2004). Numerical cognition without words: Evidence from Amazonia. Science 306, 496-499]. We show that the Pirahã have no linguistic method whatsoever for expressing exact quantity, not even "one." Despite this lack, when retested on the matching tasks used by Gordon, Pirahã speakers were able to perform exact matches with large numbers of objects perfectly but, as previously reported, they were inaccurate on matching tasks involving memory. These results suggest that language for exact number is a cultural invention rather than a linguistic universal, and that number words do not change our underlying representations of number but instead are a cognitive technology for keeping track of the cardinality of large sets across time, space, and changes in modality.}
}
@article{HesposandSpelkeNATURE-04,
       author = {Hespos, S. J. and Spelke, E. S.},
        title = {Conceptual precursors to language},
      journal = {Nature},
         year = {2004},
       volume = {430},
       number = {6998},
        pages = {453-456},
     abstract = {Because human languages vary in sound and meaning, children must learn which distinctions their language uses. For speech perception, this learning is selective: initially infants are sensitive to most acoustic distinctions used in any language, and this sensitivity reflects basic properties of the auditory system rather than mechanisms specific to language; however, infants' sensitivity to non-native sound distinctions declines over the course of the first year. Here we ask whether a similar process governs learning of word meanings. We investigated the sensitivity of 5-month-old infants in an English-speaking environment to a conceptual distinction that is marked in Korean but not English; that is, the distinction between 'tight' and 'loose' fit of one object to another. Like adult Korean speakers but unlike adult English speakers, these infants detected this distinction and divided a continuum of motion-into-contact actions into tight- and loose-fit categories. Infants' sensitivity to this distinction is linked to representations of object mechanics that are shared by non-human animals. Language learning therefore seems to develop by linking linguistic forms to universal, pre-existing representations of sound and meaning.}
}
@book{SchallerandSacks1991,
        title = {A Man Without Words},
       author = {Schaller, S. and Sacks, O.W.},
    publisher = {University of California Press},
         year = {1991},
}