# Research Discussions

The following log contains entries starting several months prior to the first day of class, involving colleagues at Brown, Google and Stanford, invited speakers, collaborators, and technical consultants. Each entry contains a mix of technical notes, references and short tutorials on background topics that students may find useful during the course. Entries after the start of class include notes on class discussions, technical supplements and additional references. The entries are listed in reverse chronological order with a bibliography and footnotes at the end.

# Class Discussions

Welcome to the 2019 class discussion list. Preparatory notes posted prior to the first day of classes are available here. Introductory lecture material for the first day of classes is available here, a sample of final project suggestions here and last year's calendar of invited talks here. Since the class content for this year builds on that of last year, you may find it useful to search the material from the 2018 class discussions available here. Several of the invited talks from 2018 are revisited this year and, in some cases, are supplemented with new reference material provided by the list moderator.

## April 20, 2019

%%% Sat Apr 20  3:45:33 PDT 2019


I'm counting on Adam, Loren and Michael to provide the class with a perspective on the hippocampus and basal ganglia and their possible function in cognitive architectures of the sort we are attempting to understand in this class. The expectation is that we can learn something from recent advances in each of (i) artificial neural networks, (ii) cellular and molecular neuroscience and (iii) systems and cognitive neuroscience. We hope to apply what we learn toward developing computational models that might serve as a bridge or common language for characterizing the computational properties of such systems. Practically speaking, this implies that the resulting architectures, while they exhibit novel behavior, are composed of familiar components and employ conventional methods of training, without need of arbitrary hacks to assemble the components into working models. This is an exercise designed to explore possibilities, not a hypothesis we are willing to die for, as Karl Popper would say.

Adam led off with an analysis of two important challenges to successfully integrating some form of symbolic reasoning within a (primarily) connectionist architecture. Earlier, Randall O'Reilly underscored the advantages of such hybrid architectures, and, in his talk, Adam discussed his work with Greg Wayne at DeepMind and Ken Hayworth at HHMI Janelia Research Campus on stable, addressable locations in working memory and the related variable-binding problem. Next Tuesday, we will review Greg Wayne's talk from last year, which Adam referred to in his presentation, on memory, prediction and learning in decision problems characterized as partially observable Markov decision processes (POMDPs).

In preparation for Loren and Michael's participation, I will suggest some questions that we might tackle in class and then describe a straw-man architecture to help ground the discussion of modeling with artificial neural networks. For Loren1, what do we know from directly observing neural activity in mouse models during repeated trials of maze learning? For Michael, what further have we learned from fMRI studies of human subjects solving memory-related problems and postmortem studies of patients with lesions in relevant cortical and subcortical regions? For all of us, what is the state-of-the-art in developing computational models of mouse navigation-related cognition and human cognitive capabilities involving both the hippocampal-entorhinal-cortex complex and the cortico-thalamocortical plus basal-ganglia circuits ostensibly responsible for executive control and variable binding in human symbolic reasoning?

Finally, relating to the previous item and relevant to the practical application of these ideas in building AI systems from existing ANN architectural components: how might we design systems that, while inspired by biological neural networks, deviate from those biological architectures by employing second-generation components like neural Turing machines and novel training strategies to coerce these component networks to mimic our reimagining of how advanced cognitive strategies, and symbolic reasoning in particular, can be efficiently implemented?

This last strategy is related to Terrence Deacon’s argument for the co-evolution of language and neural circuits that can be trained by developmentally staged exposure to ambient language processing. Essentially, natural selection need only engineer the basic circuitry such that it is within the capacity of a young child to coerce this general circuitry to serve the needs of language processing, where the language itself is shaped by selection pressures imposed by adults who both rely on language to survive and benefit from learning new ways in which language can assist survival while at the same time ensuring that it can easily be passed on to the next generation via some combination of unsupervised and reinforcement-based learning.

Figure 67:  This diagram shows a cognitive architecture assembled entirely from conventional, off-the-shelf neural network components, including multi-layer convolutional networks for uni-modal primary sensory and multi-modal association areas, an attentional network configured with an objective function based on a version of Yoshua Bengio's consciousness prior [18], and two differentiable neural computer networks — one capable of storing a relatively small number of key-value pairs and a second storing a much larger set of pairs — to model, respectively, working memory and episodic memory.

As an example, consider the architecture shown in Figure 67 featuring a stack of standard neural network components including both attentional circuits and differentiable neural computing elements, i.e., stock Neural Turing Machines with configurable controllers. In the architecture shown, there is no (explicit) provision made for many of the critical cognitive capabilities attributed to the circuitry of the prefrontal cortex, cortico-thalamic-basal-ganglia or hippocampus-entorhinal-perirhinal-and-parahippocampal systems. The conceit is that various hidden layers, for example, in the boxes shown with dashed borders, are supplied specifically to encourage these capabilities by teaching the system how to control the machinery of the supplied memory and routing systems strategically placed to serve such roles. Given what we know about how difficult it is for children to learn basic mathematical skills involving symbolic reasoning, this learning may require some form of curriculum-based, staged developmental cultivation2.
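At bottom, the two DNC-style stores in the figure are content-addressable key-value memories. Here is a toy sketch of the addressing scheme only (plain Python, no learning; all class and variable names are my own, purely illustrative): a read blends stored values weighted by a softmax over cosine similarity to a query key, the operation that an NTM / DNC read head makes differentiable.

```python
import math

class KeyValueMemory:
    """Toy content-addressable store in the spirit of an NTM / DNC read head."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys, self.values = [], []

    def write(self, key, value):
        if len(self.keys) >= self.capacity:     # evict the oldest slot
            self.keys.pop(0); self.values.pop(0)
        self.keys.append(key); self.values.append(value)

    @staticmethod
    def _cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    def read(self, query, sharpness=10.0):
        # softmax over cosine similarity yields read weights; the result is
        # a similarity-weighted blend of the stored values
        sims = [self._cos(query, k) for k in self.keys]
        exps = [math.exp(sharpness * s) for s in sims]
        z = sum(exps)
        w = [e / z for e in exps]
        dim = len(self.values[0])
        return [sum(w[i] * self.values[i][d] for i in range(len(w)))
                for d in range(dim)]

# the two stores from the figure: a small working-memory buffer and a
# much larger episodic store, differing only in capacity in this sketch
working = KeyValueMemory(capacity=4)
episodic = KeyValueMemory(capacity=4096)

working.write([1.0, 0.0], [0.9, 0.1])
working.write([0.0, 1.0], [0.2, 0.8])
v = working.read([1.0, 0.05])   # query near the first key
```

With a sharp softmax the read weight concentrates on the best-matching key, so `v` lands close to the first stored value; lowering `sharpness` blends the two.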

## April 19, 2019

%%% Fri Apr 19  5:02:46 PDT 2019


Slide 11 in Adam's slides relating to Solari and Stoner [180] (HTML) is the start of his description of Hayworth and Marblestone [93] "How thalamic relays might orchestrate supervised deep training and symbolic computation in the brain". Adam and Ken provide a coherent account of how thalamic latches and working memory buffers might arise naturally in human brains. The following is an excerpt from an email message to Adam concerning the part of his presentation at 00:34:20 into the recorded video of his talk:

I enjoyed the description of the model that you and Ken developed and especially your theoretical explanation of how cortico-thalamic-basal-ganglia circuitry might have evolved thalamic latches to support working memory buffers. It reminded me of Terrence Deacon's The Symbolic Species: The Co-evolution of Language and the Brain [39] in which he takes Chomsky to task by first making the case that syntax and semantics together are needed to bootstrap language, observing that the foundation of symbolic processing as it relates to meaning comes down to the semiotic theory of Charles Sanders Peirce with its focus on three basic types of sign — icons, indices and symbols — and noting that learning a grammar built on such a foundation is rather simple, and therefore there is no need for an innate universal grammar.

He then explains how language and symbolic processing co-evolved, drawing on Jeffrey Elman's recurrent neural network model — the so-called simple recurrent networks (SRN) of Elman and Jordan — to explain (a) the need for recurrent connections in processing combinatorial symbolic representations and (b) how the problem of local minima can be solved using an early form of curriculum learning that involves learning layer by layer and strategically clamping weights to shape the energy landscape as the network's symbolic manipulation circuitry gradually becomes more and more capable. He spends an entire section describing how language and the brain's symbol processing ability co-evolved, with the former an emergent phenomenon arising out of human social intercourse and providing the essential scaffolding on which to erect civilization.

It's a wonderful treatise and convincing argument. I've included two excerpts in the footnote at the end of this sentence to give you a flavor of his argument3. The book is out of print, but there's a PDF here and if you open it in the Chrome browser — rather than Apple Preview — then it is easy to search and select text since Google knows how to do search in documents with uncommon fonts, kerning and other typesetting intricacies — whereas Apple either doesn't know how or doesn't care enough to get it right. Deacon's latest book Incomplete Nature: How Mind Emerged from Matter [40] is even more ambitious and complex — possibly completely off base but interesting nonetheless — and you can check out my short summary here if you are tempted.

## April 17, 2019

%%% Wed Apr 17 04:17:31 PDT 2019


Relating to some conversations right after class on Tuesday: understanding how the hippocampus works in memory retrieval and subsequent reconsolidation is key to developing powerful episodic memory systems that support imaginative recall (what researchers at DeepMind call imagination-based planning and optimization) and various means of organizing episodic memory to support replay for training reinforcement-learning models and subroutine construction (what psychologists and cognitive neuroscientists going back to Tulving call chunking [197]).
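As a concrete, if cartoonish, rendering of the replay idea: the sketch below stores whole trajectories and replays contiguous chunks of them for offline training, with the chunk sampling serving as a loose stand-in for extracting reusable subroutines. All names here are my own invention, not from any cited paper.

```python
import random
from collections import deque

class EpisodicReplay:
    """Episode-level replay: whole trajectories are stored and later
    replayed in contiguous chunks, loosely analogous to hippocampal replay
    feeding offline reinforcement-learning updates."""

    def __init__(self, max_episodes=1000, seed=0):
        self.episodes = deque(maxlen=max_episodes)  # oldest episodes fall off
        self.rng = random.Random(seed)

    def record(self, trajectory):
        """trajectory: list of (state, action, reward) steps."""
        self.episodes.append(list(trajectory))

    def sample_chunk(self, length):
        """Replay a contiguous chunk of one stored episode."""
        ep = self.rng.choice(self.episodes)
        if len(ep) <= length:
            return ep
        start = self.rng.randrange(len(ep) - length + 1)
        return ep[start:start + length]

traj = [("s0", "a0", 0.0), ("s1", "a1", 0.0), ("s2", "a2", 1.0)]
buf = EpisodicReplay()
buf.record(traj)
chunk = buf.sample_chunk(2)   # a contiguous 2-step slice of the episode
```

Keeping episodes intact, rather than scattering individual transitions, is what makes imaginative recall possible: a chunk preserves the temporal structure a planner needs.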

One of the big mysteries has to do with the reciprocal nature of hippocampal-cortical interaction and the patterns of connectivity between hippocampus and the perirhinal and parahippocampal cortex and between these regions and association areas in the parietal and temporal cortex. Patch-clamp electrophysiology and macroscale tractography provide tantalizing hints but better detail and coverage would substantially accelerate our understanding.

See the attached introduction to the special issue of Hippocampus on hippocampal-cortical interactions for a nice survey of the state of the art in 2000, and the recent articles citing [125] on PubMed for more up-to-date research. If, as suggested in [125], the feedback efferent projections from the perirhinal and parahippocampal cortices to the neocortex largely reciprocate the afferent projections from the neocortex to these areas, this would simplify designing an architecture recapitulating hippocampal function. The Maguire et al. [132] paper points out some of the diagnostic implications4.

## April 15, 2019

%%% Mon Apr 15 03:40:11 PDT 2019


Thanks again to Brian and Michael for their presentation last week. Tomorrow, Elizabeth, Jing, Olina, Orien, Riley and Vidya will be leading the discussion of Randall O'Reilly's presentation from last year. There are three primary topics covered in Randy's talk:

1. The distinction between symbolic computing characterized as slow, essentially-serial and combinatorial on the one hand, and connectionist computing characterized as fast, highly-parallel, distributed and context-sensitive on the other. Randy's talk begins with this topic and the "limited systematicity" paper [155] provides the necessary detail. This may seem esoteric but, in fact, it is central to both the applied development of AI and our understanding of the merits and limitations of biological computation. We will return to this issue in our discussions with Adam Marblestone, Loren Frank and Michael Frank, all three of whom will be participating in classes. Adam Marblestone, who will be joining us on Thursday and is currently working with Greg Wayne at DeepMind, has written extensively about related issues [133, 134, 135].

2. The role of the hippocampal-entorhinal-cortex complex system (HPC-EHC) that underlies episodic memory and imaginative recall. In the first two lectures, I described how one might map the neural correlates of this fundamental cognitive function onto a relatively simple artificial neural network utilizing a Differentiable Neural Computer / Neural Turing Machine as the basis for associative encoding of experience and subsequent recall. Loren Frank — participating on Thursday, April 25 — is an expert on the associated biological networks, and Michael Frank contributed substantially to the Leabra System and corresponding computational models of long-term memory and executive control.

3. The role of the basal-ganglia, thalamus and prefrontal cortex (BG-PFC) in action selection and executive control as well as the interaction between the BG-PFC and HPC-EHC systems. Michael Frank — participating Tuesday, April 30 — has made several important contributions to our understanding of this system and Randy's overview does a great job presenting the relevant concepts. As a visual supplement you might look at the series of short videos created by Matthew Harvey with help from Felix May and available on the Brain Explained YouTube channel.

Upcoming presentations / talks listed on the class calendar include:

1. Randall O'Reilly, University of Colorado, Boulder, Tuesday, April 16 [HTML] — discussion led by Elizabeth, Jing, Riley, Orien, Olina and Vidya;

2. Greg Wayne, Google DeepMind, Tuesday, April 23 [HTML] — discussion led by Albert, Ben, Julia, Lucas, Manon and Stephen;

3. Loren Frank, University of California, San Francisco, Thursday, April 25 [HTML] — Loren will be presenting in class;

4. Michael Frank, Brown University, Tuesday, April 30 [HTML] — Michael will be presenting in class.

## April 11, 2019

%%% Thu Apr 11 10:07:33 PDT 2019


Some of my recent correspondence included ideas, commentary and recommendations that I normally would have cleaned up and included in the class discussion log had I the time — and so I'll take the expedient of including relevant excerpts that were partly inspired by conversations in class or in email exchanges with students.

To: Oriol Vinyals

%%% Thu Apr 11  3:46:31 PDT 2019


We've been discussing class projects that involve agent architectures that employ variations on the network illustrated in Figure 66 and composed of the following subnetworks:

1. Semantic Memory — refers to circuits corresponding to multi-modal association areas primarily in the posterior cortex that encode abstract representations grounded in various sensory modalities — including somatosensory and auditory language-related input [197, 196];

2. Episodic Memory — implemented using DNC networks inspired by the hippocampal complex including frontal and entorhinal cortex — the connection to pattern separation and pattern completion in the hippocampus was mentioned in the Nature paper [78];

3. Conscious Attention — a narrow characterization of conscious awareness involving circuits in the frontal cortex that use reinforcement learning to learn to survey activity in semantic memory and attend to exactly one thing at a time, thereby serializing conscious thought [49, 194, 45];

Figure 66:  Here is a simple schematic illustrating the primary reciprocal connections linking the three subnetworks described in the text. Information enters via the peripheral nervous system as sensory data that serves to ground the agent in the physics of its body and accessible environment. The data — including the obvious sensory systems plus the somatosensory and subcortical nuclei comprising the limbic system — is processed by the same highly-parallel machinery that underlies all forms of perception, progressing through multiple layers of abstraction and increasing integration of sensory modalities until activating highly-abstract patterns in semantic memory. These patterns are surveyed and selectively and serially activated, enabling the hippocampal-entorhinal complex to probe episodic memory and recover related patterns from prior experience to compare with and possibly (imaginatively) enrich current experience.

The full architecture also includes an action-selection component derived from the work of Matt Botvinick and Randy O'Reilly described in their respective presentations from last year here and here, but for now we're primarily interested in a simpler agent that just observes, selectively commits patterns of activity in semantic memory to episodic memory and learns to predict simple dynamic phenomena the agent is biased to be interested in. We were interested in your comments concerning strategies for training imagination-augmented agents from your recorded talk in class last year. How hard would it be to train such a system given the recurrent patterns of activity between the three major subnetworks as illustrated in the attached? The attentional system would be trained by reinforcement learning using an objective based on Yoshua Bengio's "consciousness" prior [18]. The primary and secondary sensory components would use some sort of semi-supervised learning based on samples collected from the target environment. The DNC as shown is divided into a write-only long-term memory and a short-term memory for the active maintenance of patterns of activity in semantic memory highlighted by conscious attention, and could be replaced with a variant of Geoff Hinton's fast weights [174, 11].
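A minimal, non-differentiable sketch of the control scheme described above: attend to one pattern at a time, actively maintain a few recent patterns in a limited-capacity short-term buffer, and commit everything attended to an append-only long-term log. In the trained system the salience scores would come from the learned attentional network; here they are simply given, and all names are illustrative.

```python
def serialize_attention(patterns, salience, stm_size=3):
    """Greedy serialization of attention: repeatedly attend to the most
    salient not-yet-attended pattern, maintain it in a bounded short-term
    buffer, and append it to a write-only episodic log."""
    stm, episodic_log, order = [], [], []
    # attend to one pattern at a time, most salient first
    for i in sorted(range(len(patterns)), key=lambda i: -salience[i]):
        order.append(i)
        stm.append(patterns[i])
        if len(stm) > stm_size:
            stm.pop(0)                    # limited-capacity active maintenance
        episodic_log.append(patterns[i])  # append-only long-term store
    return order, stm, episodic_log

order, stm, log = serialize_attention(["a", "b", "c", "d"],
                                      [0.1, 0.9, 0.4, 0.7])
```

The point of the exercise is the serialization itself: however parallel the upstream perception, what reaches the episodic log is a strictly ordered stream, one item per attentional step.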

To: Adam, Michael, Loren and Jessica

%%% Fri Apr 12 11:52:53 PDT 2019


I'm not nearly as fussy as Adam and Greg. In my work with Rishabh on automated code synthesis, I developed a neural-network representation of procedures that allows one to embed a collection of procedures that can call one another and (recursively) themselves in a Neural Turing Machine [77] (NTM) / Differentiable Neural Computer [78] (DNC) in order to represent structured programs. To complement this framework and provide tools for automatic programming, I described how one might build a program emulator — essentially an interactive read-eval-print interpreter — that would enable an ANN to analyze and execute such programs. The computations performed by the NN emulator only approximately simulate the actual program, but the distributed (connectionist) representations of the intermediate and final results of emulating the NTM representation of a program could be very useful for code synthesis if combined with the sort of imagination-based planning and optimization discussed in several recent DeepMind papers [158, 90].

Since my rough-and-ready model of the hippocampal-entorhinal-cortex (HPC-EHC) complex is essentially an NTM with a specialized controller that supports a differentiable version of C pointers, it naturally occurred to me to think of the key-value pairs as a neural-network model of place- and / or grid-cells that could be used to follow paths through computational space, i.e., execution traces, in order to analyze, alter and debug programs. And, as long as I'm spinning speculative tales, such key-value pairs might serve as the conceptual basis for storing, applying and adapting everyday plans in episodic memory, with the basal-ganglia-prefrontal-cortex (BG-PFC) executive-controller serving as the component responsible for figuring out when to initiate, terminate / abort, adapt or repair programs / subroutines. It might be interesting to the neuroscience community if we were able to demonstrate such a model observing traces of executing programs, encoding traces as episodic memory in an NTM model of the HPC-EHC complex, and subsequently using the BG-PFC controller to imaginatively adapt old traces to work in new circumstances.
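To make the trace idea concrete, here is a toy and decidedly non-neural version: run a tiny program, record each (program counter, state) pair as the analogue of the key-value pairs above, then resume from a stored intermediate state with a patched final step, i.e., adapt an old trace to new circumstances. Everything here, names included, is illustrative rather than an implementation of the model described.

```python
def run_and_trace(program, state):
    """Execute a tiny 'program' (a list of ops on a state) and record an
    execution trace as {program counter: state} key-value pairs."""
    trace = {}
    for pc, op in enumerate(program):
        state = op(state)
        trace[pc] = state     # episodic record of this step
    return state, trace

prog = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
final, trace = run_and_trace(prog, 5)

# 'imaginatively adapt' the old trace: resume from the stored state after
# step 1 but substitute a different final operation
patched_last = lambda x: x - 1
resumed, _ = run_and_trace([patched_last], trace[1])
```

The stored trace plays the role of episodic memory keyed by position in computational space; the resume-with-a-patch step is the crude analogue of the executive controller deciding to repair a subroutine mid-flight.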

To: David Chalmers

%%% Sat Apr 13 03:54:23 PDT 2019


Here's a note I sent David Chalmers after we exchanged email following his presentation at Google on April 2. You can watch the video or read the paper [29] — the questions from the audience and Chalmers's responses were my favorite part, and below is my effort to summarize my thoughts in less than a thousand words — it weighs in around 750:

I don't want to sound too cavalier in my dismissal of the nuances regarding consciousness that philosophers seem concerned about. I've always liked Dan Dennett's understanding of free will worth having [Dennett, 2015] and so it was natural for me to read his books about consciousness, but I got tired of the seemingly endless wrangling over the details and the tendency to draw in the opinion of every philosopher since Socrates in order to buttress or defend their theories — the problem is that the nuggets of wisdom are buried in (putatively) didactic prose. A couple of years ago I ran across your adviser's "I Am a Strange Loop" [Hofstadter, 2007], and instantly felt enlightened — I ended up never reading the book but the inspiration I got from the title alone seemed profound at the time and remains so for me. I read and for the most part agreed with Dehaene and Graziano, but they did little to increase my understanding of the larger, evolutionary puzzle of consciousness.

Then I read about research relating to the so-called dual-streams hypothesis as applies to language processing and it clicked for me that there was another (reciprocal) pathway completing the language-production-and-understanding loop, and that this bidirectional path was essential to understanding how being conscious might feel — or at least how people report it to feel. And then finally, I read — or mainly skimmed — Nick Chater's "The Mind is Flat" [Chater, 2018], and his account of human subconscious thinking made the rest seem plausible. If you haven't read Chater's book, the main idea is that all of our subconscious thinking is the result of interpreting sensory input — broadly construed, using the same neural circuitry we employ in perception. Everything that enters the brain through the peripheral nervous system is fodder for constructing abstract representations grounded in the physics of our bodies and the environment in which we live.

These abstract representations form what systems neuroscientists refer to as semantic memory, and they are constructed by highly-parallel neural processing. That processing is subject to the same sort of perceptual / conceptual error we make when we look at a complex scene or listen to a polyphonic recording and believe we are simultaneously attending to all the details; in fact the brain fools us into thinking this by a combination of imaginatively filling in the details and literally bringing details into focus only when we shift our attention to emphasize them. Hence it feels to us as though we are both deep — in terms of our interpretations of what we perceive — and comprehensive — in that we take in the entire sensory experience in one fell swoop — whereas in reality our apprehension and analysis are shallow and piecemeal.

The dual-streams hypothesis suggests that the basic circuitry of the primate visual system, with its ventral and dorsal streams constructing complementary representations of visual scenes through many layers of primary, secondary and association areas, is recapitulated in the auditory sensory cortex with one detail of particular importance in this discussion, namely that these two streams form a loop such that two important brain regions — Broca's and Wernicke's areas — are reciprocally connected by fast, myelinated tracts that facilitate unsupervised learning and, importantly, inner speech [Fernyhough, 2015; Hickok and Poeppel, 2007]. The obvious implication being that when we talk to ourselves we not only hear ourselves but what we say modifies what we think, and just as in other areas of perception, even as we change our focus of attention and serialize our thoughts and vocal expression, we create a narrative that in some very real sense becomes the story of our lives.

Moreover, since our thoughts are primarily the result of exploiting patterns of activity highlighted by conscious attention in order to recall from our past experience episodes that might help in shaping our future experience, this recursive storytelling — what Hofstadter calls a "strange loop" — produces a narrative that includes a play-by-play account that personalizes everything we see, hear, feel and indeed what we think. Moreover, this personalization is pervasive since, to the mind hidden away in — or, as I like to say, "haunting" — the dark recesses of our brains, the distinction between and origins of perceptual events are largely inaccessible, and, according to many accounts, engineered by natural selection so that we don't notice and become alarmed by just how shallow and incomplete our understanding and apprehension of the world actually are.

## April 7, 2019

%%% Sun Apr  7 04:24:06 PDT 2019


Prepare for class on Tuesday (April 9) by reading one or more of the three assigned papers and viewing at least the first part of Oriol Vinyals' talk here. You don't learn complex concepts by osmosis. You have to read the material for an initial exposure and then review it in order to consolidate your understanding — class on Tuesday will help with the review and Oriol and your fellow students will be able to answer your questions. You're studying machines that learn and so you have no excuse for not following what we know about how humans learn. I won't apologize for pointing out what should be obvious, but I promise I won't do it again.

We need volunteers for the next few speakers / presentations. Note that I'm giving you a chance to volunteer since you made it crystal clear that you would prefer this to my making random assignments. Jessica Hamrick will be joining us on Thursday (April 11). Jessica's recommended papers, presentation and video are all available on her calendar page here. I've listed Jessica's four suggested papers below along with BibTeX references and links to PDF provided on the calendar page; if you're interested in volunteering, send me email ASAP with a list, e.g., 1.C, 2.E, 1.D, sorted by preference, starting with your most preferred:

1. Analogues of mental simulation and imagination in deep learning [88] — 1.A

2. Metacontrol for adaptive imagination-based optimization [15] — 1.B

3. Relational inductive biases, deep learning, and graph networks [90] — 1.C

4. Relational inductive bias for physical construction in humans and machines [89] — 1.D

Next Tuesday (April 16) we will go over Randall O'Reilly's 2018 presentation in the same way we'll go over Oriol's presentation this coming week. Note that both Oriol and Randy have agreed to answer questions via email and help out presenters. Randy's presentation is directly relevant to our understanding of fundamental computational tradeoffs in biological intelligence. He covers (a) the hippocampus (episodic memory), (b) the basal ganglia (action selection) and (c) the key issue of how brains integrate parallel, contextual, distributed (connectionist) representations with largely-serial, systematic, combinatorial (symbolic) representations. Below are the papers I will be assigning — you can find BibTeX, PDF, slides and video here:

1. Making Working Memory Work: A Model of Learning in the Prefrontal Cortex and Basal Ganglia [152] — 2.A

2. Six principles for biologically based computational models of cortical cognition [149] — 2.B

3. Towards an executive without a homunculus: models of prefrontal cortex/basal ganglia system [94] — 2.C

4. Biologically Based Computational Models of High-Level Cognition [151] — 2.D

5. How Limited Systematicity Emerges: A Computational Cognitive Neuroscience Approach [155] — 2.E

Finally, next Thursday (April 18) Adam Marblestone will be joining us on VC from London where he is currently working with Greg Wayne at DeepMind. Adam is a polymath and incredibly prolific and diverse in his interests as you can see from this description. He majored in quantum physics at Harvard, did his PhD with George Church on DNA origami and molecular labeling using in situ hybridization, has done a lot of research with Ed Boyden and is still affiliated with Ed's lab, worked for a while at Bryan Johnson's Kernel startup and now is at DeepMind. We've got something special planned for April 18th that I'll tell you about as soon as we get things a little more organized.

## April 5, 2019

%%% Fri Apr  5  2:44:05 PDT 2019


Note to Tyler, Julia, Jerry, John, Meg and Paul regarding the papers for Tuesday's class discussion featuring Oriol Vinyals' 2018 presentation. Thanks again for agreeing to do this on relatively short notice. Here's a table of names and email addresses in case you want to coordinate — which is probably a good idea just so everyone is on the same page on Tuesday:

Tyler Stephen Benster, Julia Gong, Jerry Meng, John Michael Mern, Megumi Sano, Paul Warren

The slides and videos are in the course archives here and your paper assignments are listed below along with links to the assigned papers:

1. Learning model-based planning from scratch [158] [John, Tyler]

Here's what I expect on Tuesday: Shreya will cue up the videos and slides and have the overhead projector and audio working. Each of you will summarize your assigned paper in the order they appear in Oriol's presentation. I don't expect you to create any new slides unless you want to, but I hope you will make good use of Oriol's slides and video. Shreya will cue up and advance the slides as you request. If you show parts of one of the four video segments with Oriol speaking, that's fine ... indeed having Oriol speak for himself is encouraged ... it simplifies your life and allows you to interject commentary, pose questions and generally add value to the presentation from your careful reading.

Don't expect the students to have watched the videos or read the papers. I'm a realist; class participation is the bane of seminar courses, hence the approach I'm trying out this year with your three teams of two being the first guinea pigs. You don't have to and really shouldn't try to present every detail in your paper. Summarize the highlights with enough coverage that you would want if you were listening rather than presenting. If you want to show an excerpt of one of the videos, write down which of the four videos and an offset HH:MM:SS so that Shreya can cue it up without a lot of searching for the right spot.

Since there are two of you assigned to each paper, I expect you to figure out a presentation strategy that you feel comfortable with and try it out — split it up and present as a tag team or divide the labor however you see fit. We'll all learn from the experience and subsequent presentations will benefit and can either follow your lead or come up with an alternative. Since I don't expect you to go crazy and practice — much less polish — your presentations, it can be as casual and experimental as you like. Thanks again and if you have any questions for us, please don't hesitate to reach out.

## April 4, 2019

%%% Thu Apr  4 03:57:54 PDT 2019


We'll talk about the following listing in class this afternoon in the context of creating a method of encouraging class participation that is fair, flexible, not punitive, and, if approached in the right spirit, will turn out to be pleasingly educational. This is a reboot following up on several conversations after class on Tuesday regarding an earlier proposal mentioned briefly in the first class of the quarter. For those of you reading this who didn't attend class today, I'll follow up with details of what transpired when I get a chance tomorrow.

1. There's nothing magical about consciousness [484547] Stanislas Dehaene VIDEO

2. What's consciousness and how can we build it [8079108] Michael Graziano VIDEO

3. There is no "hard problem" of consciousness [51] Daniel Dennett; see Felipe de Brigard, one of Dennett's students, at 01:03:15 into VIDEO

4. Theory of Mind Reasoning is Easy and Hard [164] Neil Rabinowitz VIDEO

5. Strange loops and what it's like to be you [101] Douglas Hofstadter VIDEO

6. False memories and counterfactual reasoning [15638] Felipe de Brigard VIDEO

7. Thinking Shallow or Deep and Fast or Slow [30116] Nick Chater VIDEO

8. Meta-Control of Reinforcement Learning Matt Botvinick Slides and video in the class archives VIDEO

1. Prefrontal cortex as a meta-reinforcement learning system [203]

2. Episodic control as a meta-reinforcement learning system [167]

9. Computational Models of High-Level Cognition Randal O'Reilly Slides and video in the class archives VIDEO

1. The Leabra Cognitive Architecture: How to Play 20 Principles with Nature and Win! [153]

2. Complementary Learning Systems [154]

3. Making Working Memory Work: Model of Learning in the Prefrontal Cortex and Basal Ganglia [152]

4. Towards an Executive Without a Homunculus: Models of the Prefrontal Cortex and Basal Ganglia [94]

10. Prediction, Planning and Partial Observability [204] Greg Wayne Slides and video in the class archives VIDEO

11. Predictron: End-To-End Learning and Planning [178] David Silver Slides and video available on Vimeo VIDEO

12. Imagination, Model-based and Model-free learning Oriol Vinyals Slides and video in the class archives VIDEO

1. Learning model-based planning from scratch [158]

2. Imagination-Augmented Agents for Deep Reinforcement Learning [205]

3. Learning to Search with Monte Carlo Tree Search (MCTS) networks [84]

Miscellaneous Loose Ends: I have at least two faults as a lecturer and public speaker: First, I try to pack too much into too short a time. Second, I don't repeat myself often enough, even though I know both from experience and my understanding of the research on learning that repetition, replay, rephrasing and having students repeat what they just heard is key to learning — note that I just did it.

My lecture on Tuesday was a good example. I admit that I did make an attempt to repeat some of the key ideas, but unfortunately I couldn't help changing the examples, a consequence of my misguided intuition that nobody likes to hear the same example twice. In addition, I didn't really motivate why I introduced the quote by Max Planck — as if it needed clarification [...] you can add "wry sense of humor" to my faults.

Apart from those pedagogical failings, I also didn't realize how much people don't want to believe the point I was trying to make. I understand that it is hard coming to grips with the fact that knowledge is not static. Over time some propositions we believe to be fact turn out to be false, and some propositions we adamantly believe to be false turn out to be among the fundamental truths that govern the universe — until they are overturned by even more fundamental truths.

There was another point that I meant to make but forgot in trying to cram too much information into too little time: Planck's funereal prediction continues to be true and, moreover, the consequences of its impact are exacerbated by the fact that we continue to increase our average lifespan. If I had had the time, I would've added another list of scientists alongside the neuroscientists and computer scientists that I listed, and that would be biologists and geneticists — if I had had the time I would have gone off on other tangents.

One of my favorite examples concerns one of my least favorite scientists, Carl von Linné — also known as Linnaeus — who in the 10th edition of his magnum opus on biological taxonomy included "wild children" as a separate subspecies of humanity characterized by mental and physical disabilities. Johann Friedrich Blumenbach ultimately set the record straight and went on to further annoy various of his contemporaries by writing that "man is a domesticated animal ... born and appointed by nature the most completely domesticated animal" — puzzle that out if you can5.

And as if the taxonomy of the natural world were not fraught with enough controversy in its long history, you only have to read David Quammen's "The Tangled Tree: A Radical New History of Life" for an account of how the discovery of a new "third kingdom" of life changed our understanding of evolution — by identifying the widespread evidence of horizontal gene transfer — and of the lives of the scientists — including Carl Woese and Lynn Margulis — who fought against entrenched interests to bring this discovery to the attention of the scientific community and convince their peers of this unsettling truth — did it again, as if Stanford students have time to read potboiler histories of science that chronicle heroic human endeavors that changed our fundamental understanding of life.

## April 3, 2019

%%% Wed Apr  3  4:49:10 PDT 2019


### A Personalized History of Artificial Intelligence

I've been working in the field of artificial intelligence for nearly 40 years. I was a graduate student during the halcyon days of the early 80s when IJCAI, the premier international conference on AI, was held in Los Angeles with nearly 5,000 attendees. That year the convention center floor space for vendors was packed with small companies demoing their AI systems to attract customers. Larger companies rented lavish hotel suites and booked fancy yachts complete with catered dining to lure talented graduate students. Pretty heady times for young graduate students unaware their field of study was so popular — and apparently lucrative.

When I co-chaired IJCAI 1999 in Stockholm, the attendance was less than 2,000 and we had trouble filling the conference hotel. I had watched as the field became enamored of so-called expert systems, started believing its own hype and entered a period of decline generally referred to as the AI Winter that spanned the late 80s and early 90s. During that time the field splintered into specialized subfields and dispersed to favor more focused conferences and academic journals.

Throughout my career I've contributed to research in several core areas of AI including automated planning, computer vision, robotics, machine learning and probabilistic graphical models. I've also published papers in journals that focus on control theory, computational neuroscience, decision support systems, operations research and stochastic optimal control, but I was unusual in having such eclectic taste. The 80s were a time when it was unwise to deviate much from the mainstream: there was less tolerance for embracing other disciplines, and graduate students were discouraged from straying too far afield. AI was trying to define itself.

I grew up in the 60s, dropped out of Marquette University within a few weeks of registering as a freshman and ended up hitchhiking around the country, living in communes and participating in many of the political and countercultural activities of that era. After a few years of homesteading in rural Virginia while trying to make a living building houses and refurbishing industrial machine tools, I got interested in computers and enrolled at Virginia Polytechnic Institute in 1968. I majored in mathematics but also took classes in computer science and worked with a CS professor writing code for a Prolog-based automated robot planning system. That was my first exposure to AI.

When I started graduate work at Yale, the field of AI was in thrall to first-order predicate logic. John McCarthy and his colleagues at Stanford University had wrested the intellectual focus of AI from MIT and CMU and symbolic reasoning was king. This was important to my career for a couple of reasons. It would occasion my making several trips to the West Coast and spending a summer at SRI7 in Menlo Park, California, next door to Stanford and right in the center of Silicon Valley. Graduate students from all over the world participated in an SRI-sponsored effort codenamed Common Sense Summer to represent all of commonsense reasoning in first-order logic. I was absolved of having to work on the project so I could complete my Ph.D. thesis in time to join the faculty of Brown University starting in January of 1986.

In hindsight, first-order logic was not a good tool for modeling naïve reasoning about everyday physics. Some of that work has continued but it is no longer central to the main thrust of current research in artificial intelligence with its focus on neural networks. Part of the motivation for exploring first-order logic was to formalize and make rigorous the sort of symbolic processing that was emblematic of the work of Allen Newell and Herbert Simon at Carnegie Mellon and Terry Winograd at Stanford. I admired Marvin Minsky's broad range of interests and was fascinated with his early work at Princeton in developing an analog artificial neural network in hardware8, but Minsky was to have a devastating impact on artificial neural network research.

Concerned with what they saw as a lack of rigor and focus, Minsky and Seymour Papert, Minsky's colleague at MIT, were successful in convincing many researchers that work on neural networks was misguided and that they should instead focus their efforts on the new field of artificial intelligence that Minsky, McCarthy and Claude Shannon had helped to launch at the Dartmouth Summer Research Project on Artificial Intelligence held in 19569. While many heeded the warnings of Minsky and Papert, an assemblage of researchers primarily located at the University of California, San Diego calling themselves the PDP group and led by the psychologists David Rumelhart and Jay McClelland along with computer scientists, physicists and cognitive scientists including Geoffrey Hinton, Ronald Williams and Terry Sejnowski made important contributions to our understanding of artificial neural networks. PDP stands for parallel distributed processing and it is the predominant approach to what is now known as connectionism10, emphasizing the parallel nature of neural processing and the distributed nature of neural representations.

In the late 80s, Jerry Fodor and Zenon Pylyshyn, cognitive scientists at Rutgers University, made the claim that, while both connectionist distributed-processing and traditional symbolic-processing models postulate representational mental states, only the latter is committed to symbol-level representational states that have combinatorial syntactic and semantic structure. They saw the existing evidence for this claim as making a strong case that the architecture of the brain is not connectionist at the cognitive level [63]. Since then there has been a good deal of work showing that the context sensitivity of connectionist architectures — objects and activities are defined by the context in which they appear — and the systematicity of symbolic architectures — the ability to process the form or structure of something independent of its specific contents — are complementary from a functional perspective and separately realized in the brain from an anatomical perspective [155].

Simply put, these two complementary information processing systems enable us to recognize that a pony is a horse even though we've never seen such a small equine and parse a sentence in our native tongue even if we don't understand the meaning of most of the words. Context sensitivity allows us to deal with concepts and analogies that lack clear boundaries or allow for some degree of ambiguity. Systematicity makes it possible for us to do mathematics and design complex machines. In subsequent chapters we explore cognitive architectures modeled after the human brain that exhibit these characteristics in their ability to employ language to communicate and collaborate, formulate and carry out complex plans, recall past experience from episodic memory and adapt that experience to present circumstances, and perform at a superhuman level such tasks as writing computer programs and proving mathematical theorems. We'll also investigate the possible advantages of machines that are consciously aware, that create and maintain a sense of self, that can reason about their own mental state as well as the mental states of others — characteristics considered by many as uniquely human.

## April 2, 2019

%%% Tue Apr  2  2:43:58 PDT 2019


### Course Organization

The class is organized along two dimensions: cognitive architecture — how to design and implement systems that operate in complex environments, using episodic memory to adapt past experience in the process of planning for and predicting future events, and embodied cognition — disembodied minds are the stuff of science fiction, but minds integrated with powerful physical and cognitive enhancements that facilitate interacting with complex environments are the future.

Tom Dean will take primary responsibility for the discussions and invited talks related to topics in cognitive architecture, emphasizing the design of artificial neural networks implementing memory, perception and attention models. Rishabh Singh will handle the discussions and talks related to embodied cognition, with an emphasis on synthesizing, repairing and refactoring source code, and designing digital assistants that, among other skills, support interactive code completion.

Primary References and Related Talks and Presentations:

Concept: Conscious Attention
Resources: Nick Chater [30], Stanislas Dehaene [45], Michael Graziano [80]
Lectures: Stanislas Dehaene, Michael Graziano, Felipe de Brigard*

Concept: Episodic Memory
Resources: Dere et al [52], Frank et al [65], O'Keefe and Nadel [147]
Lectures: Randall O'Reilly+, Loren Frank*, Michael Frank*, Greg Wayne+

Concept: Action Selection
Resources: Frank et al [65], Ito [111], O'Reilly and Munakata [150]
Lectures: Matt Botvinick+, Jessica Hamrick*, Peter Battaglia*, Adam Marblestone*

Concept: Code Synthesis
Resources: Gulwani et al [86]
Lectures: Vijayaraghavan Murali*, Miltos Allamanis*, Tejas Kulkarni*

[*] Participate in discussions
[+] Welcome class questions

Note: I've put several reference books11 on reserve in the Engineering library. BibTeX references, including many abstracts, for all the papers cited in the class materials as of this writing are available for reference here. Two supplementary readings for the first week of classes are available on the course website: Stanford students can access the prologue to Nick Chater's The Mind is Flat and a chapter from Helen Thomson's Unthinkable about a woman whose ability to navigate in everyday life is severely compromised — in the process of telling this woman's fascinating personal story, Thomson does an excellent job of describing how the hippocampus, entorhinal cortex and retrosplenial cortex support navigation and landmark recognition using a collection of specialized neurons, including place cells, grid cells, border cells and head-direction cells.

## April 1, 2019

%%% Mon Apr  1  4:32:56 PDT 2019


The grading breakdown for the class is as follows: 30% for class participation, 20% for project proposals and 50% for final projects. Project proposals are due around midterm and final projects by the end of the exam period. There will be plenty of opportunity for feedback on proposals and advising and brainstorming on projects. We'll spend most of the class on Thursday, April 4 discussing example projects, including the length and content of proposals, the scope of projects, working in teams and what you'll have to turn in for grading.

In the remainder of this post, I'll describe what's expected for class participation. This is a seminar course and so I expect students to attend class having prepared by reading the assigned papers. I realize that you might have job interviews, family emergencies and other exigencies that will prevent you from attending one or more classes, but I expect you to do your best. Obviously you could read the papers on your own. The real benefits of this class derive from participating in discussion, interacting with speakers and student presenters.

I expect every student to participate in leading at least one class discussion. To facilitate participation, Rishabh, Shreya and I will provide a set of topics and assign students to small teams responsible for each topic. There are eighteen (18) class slots on the calendar. I'm lecturing in the first two classes, Rishabh is lecturing in another when we shift the focus to automatic code synthesis and embodied cognition, and we'll be setting aside two or three slots for project discussions once we finalize the invited speakers. So there are at most twelve (12) slots that feature an invited speaker / discussion leader.

Depending on the final size (s) of the class following the normal shopping and shakeout period, Rishabh, Shreya and I will select some number (t) of topics that align with one or more of the featured presentations. For each of the t topics we will randomly assign approximately s/t students to research and produce a study guide and presentation that all of the students can use to prepare for the corresponding class.
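The assignment procedure above is concrete enough to script. Here is a minimal sketch (the function and variable names are my own, purely illustrative; only the random, roughly s/t-sized split comes from the description) of one way to partition s students into t topic teams:

```python
import random

def assign_teams(students, t, seed=0):
    """Randomly partition students into t topic teams of roughly s/t members each."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = students[:]
    rng.shuffle(shuffled)
    # Deal the shuffled list round-robin so team sizes differ by at most one.
    return [shuffled[i::t] for i in range(t)]

# Example: s = 14 students, t = 4 topics -> teams of size 4, 4, 3, 3.
students = [f"student_{i}" for i in range(14)]
teams = assign_teams(students, t=4)
for topic, team in enumerate(teams):
    print(f"topic {topic}: {team}")
```

Dealing round-robin rather than slicing contiguous chunks guarantees the team sizes stay within one of each other regardless of whether t divides s evenly.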

In some cases, the students responsible for a given topic will use their presentation to lead a discussion in class — for instance in the case where the invited speaker can't easily participate either remotely or in the flesh due to their being in an inconvenient time zone. In all cases, the study guide is due two days before the designated class time, and the t teams are expected to self-organize and share the effort. My lecture tomorrow, Tuesday, April 2, will provide examples of what I'd like to see in your study guides.

Rishabh, Shreya, and I will be available to assist in suggesting technical papers and recorded presentations, introducing you to local experts at Google and Stanford and some of the remote labs we collaborate with, as well as brainstorming and generally being available to help out however we can. You can use any of the past class presentations as templates for your presentation and the study guide can use your favorite notebook or weblog format to organize your content.

My research notes for class are probably more discursive than your fellow students would prefer for a topic-focused study guide, but several of the entries in the class discussion list might serve as a starting point for your efforts. You can also think of these documents as a starting point for final project proposals. The following sequence of linked documents and suggestions might serve as an outline for a study guide on designing hippocampus-inspired memory systems:

The "Seven Deadly Sins" piece by Daniel Schacter highlights what episodic memory is good for. The Brain Atlas shows that each hemisphere has an associated hippocampus. The "Two-Minute Neuroscience" video describes the structure of the hippocampal formation and "Brains Explained" describes its function. The TED-Ed piece discusses HM, the most famous patient in the history of memory research, and the following Wikipedia page answers the related question of what happens when the damage is localized in one hemisphere.

Loren Frank's talk at the Simons Institute is too long and too technical to include the whole video, so you'll have to pick out one or two interesting or useful points to focus on. The DeepMind blog article and Alex Graves' NIPS talk are full of interesting snippets to feature or summarize in discussing how one might build an episodic memory modeled after the hippocampus. Fleshed out a little more, this would make an excellent way to prepare for Loren Frank's discussion in class on Thursday, April 23.

# Class Preparation

## March 31, 2019

%%% Sun Mar 31  4:21:31 PDT 2019


Science and technology are moving quickly, some would say that the pace is accelerating, but when you're standing in the knee of the exponential13 it is hard to get a feeling for how quickly we are careening into the future. The big breakthroughs probably won't seem so at the time if you're in the middle of pushing the technology. Despite what some say, I believe that most of what has to be done to achieve AI relies on good engineering. This is true of a lot of disciplines. Along similar lines, I've heard scientists lament that the Nobel laureates of the last few decades don't seem to be in the same league as the likes of Isaac Newton, James Clerk Maxwell or Albert Einstein. Even our recent popular-press-anointed geniuses like Richard Feynman and Murray Gell-Mann pale in comparison to a Maxwell or Einstein. Perhaps I will change my tune if Edward Witten succeeds in using some variant of string theory to unify gravity and quantum field theory and thereby succeeds where Einstein failed after decades of trying to come up with a grand unified theory or GUT.

In discussing navigation and landmark recognition in the hippocampus and entorhinal cortex, Helen Thomson writes [192] that "we fill our mental maps with things that are meaningful to us" — how would we know? It might seem reasonable that, if we work on the "right" things, e.g., things that we like to do or things that we need to do in order to get along with our lives, then what is useful to us in learning how to actually accomplish these things will be meaningful to us. That last sentence was either tautological or inconsistent! Nick Chater seems to imply that all of our sensations — visual, auditory, motor, somatosensory, etc. — are meaningful in the sense that they arise from our engagement — whether actively or passively — with our environment, i.e., they are grounded in our bodies and the extended physical environment that exists outside our bodies and in which we are embedded. The term semantic memory is often used to refer to all the products of our primary, secondary and multi-modal association areas, despite the fact that these products of interpretation and inference can only be linked back to the physical world through multiple layers of abstraction.

Greg Wayne's predictive memory architecture, MERLIN, for solving POMDP (Partially Observable Markov Decision Process) problems provides a possible solution to this quandary [204]. If we assume that everything we'll ever need to know is potentially either directly observable or predictable from other things that we can observe, then all we have to do is determine the values of the state variables that are preconditions for executing the optimal policy and then figure out how to observe or predict them from state variables that are observable — which may require, for example, that we move in order to perform some of those observations. MERLIN learns both a forward and an inverse model to figure out what observations are required to solve the POMDP problem. The resulting combined forward and inverse model is referred to as an internal model; such models are studied in the field of adaptive stochastic optimal control and applied in many disciplines including machine learning and computational neuroscience, as well as in diverse applications including the automatic control of manipulators for robotic assembly.
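MERLIN itself is a learned, memory-based recurrent architecture, but the computation it approximates is the classical belief-state update at the heart of tabular POMDP solvers: a forward (transition) model predicts how the hidden state evolves, and the observation model corrects that prediction. The following minimal sketch is my own illustrative example, not DeepMind's code; all names and the toy 2-state model are hypothetical:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One step of the Bayes filter for a discrete POMDP.

    b: prior belief over states, shape (S,)
    a: action index taken
    o: observation index received
    T: transition model, T[a, s, s'] = P(s' | s, a)
    Z: observation model, Z[a, s', o] = P(o | s', a)
    Returns the normalized posterior belief over states.
    """
    predicted = b @ T[a]                 # predict: P(s' | b, a) via the forward model
    posterior = Z[a][:, o] * predicted   # correct: weight by observation likelihood
    return posterior / posterior.sum()   # renormalize to a probability distribution

# Toy 2-state example: one "stay" action and a noisy sensor that
# gradually localizes the agent as evidence accumulates.
T = np.array([np.eye(2)])                 # the single action leaves the state unchanged
Z = np.array([[[0.8, 0.2], [0.3, 0.7]]])  # state-dependent observation probabilities
b = np.array([0.5, 0.5])                  # start maximally uncertain
b = belief_update(b, a=0, o=0, T=T, Z=Z)
print(b)  # belief shifts toward state 0 after observing o=0
```

In this framing, "figuring out how to observe the preconditions of the optimal policy" amounts to choosing actions whose resulting observations most sharpen this posterior — which MERLIN learns to do implicitly rather than by explicit Bayesian filtering.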

Miscellaneous Loose Ends: Daniel Schacter is Professor of Psychology at Harvard University focusing on cognitive and neural aspects of human memory. If you're interested in psychology and cognitive science as it pertains to memory and, in particular, the details concerning specific experiments involving human subjects, you might be interested in Schacter speaking on Episodic Retrieval and Constructive Memory & Imagination at UC Irvine in 2018. This lecture was presented at the 2018 International Conference on Learning and Memory. For more information on Schacter's research visit the Schacter Memory Lab. For a relaxed, conversational introduction to his research relevant to the topics covered in CS379C you might find this interview more appropriate to your interests.

## March 29, 2019

%%% Fri Mar 29  3:24:13 PDT 2019


In the final chapter entitled "The Secret of Intelligence", Chater suggests that there is good news to offset any disappointment his readers may feel at the loss of their — misattributed — unconscious cognitive depth. Earlier in the book he has made the case that all of our unconscious thought is due to the interpretative power of our perceptual system, and in this chapter he suggests that's not to be dismissed: "[T]he fact that we can make such remarkable perceptual leaps is a reminder of the astonishing flexibility and, one might even say, creativity of the perceptual system."

The take home message is summarized in this passage: "These imaginative jumps are, I believe, at the very core of human intelligence. The ability to select, recombine and modify past precedents to deal with present experience is what allows us to be able to cope with an open-ended world we scarcely understand. The cycle of thought does not merely refer passively to past precedents — we imaginatively create the present using the raw materials of the past." He draws on earlier examples of how humans can make sense of a barely recognizable sketch of a face [141], an ambiguous painting of an interrupted dinner [212] and a stylized black and white image of a Dalmatian almost impossible to see against a background of fall leaves [193].

Characterizing metaphor as seeing one thing as another by drawing (creatively) upon our experience [Page 210], he talks about how we can decode the often incomplete, often incomprehensible present by inventive transformation of the past, and suggests that while such "exuberant mental leaps" may seem at first frivolous they are essential for perceiving the world as it is [Page 212]. He concludes that the human brain's secret is imaginative interpretation and not "cold logic" ... best when followed by more disciplined analysis [Page 215], as in the case of making sense of unexpected predictions and using them as jumping-off places for constructing theories of all sorts, using the kinetic theory of gases as an example.

In the section entitled "The Distant Prospect of Intelligent Machines" he predicts that artificial intelligence on a par with human intelligence is some way off. He correctly assesses the failure of earlier attempts in the 1970s and 1980s to build AI systems by encoding rules, and notes that "[a]rtificial intelligence since the 1980s has been astonishingly successful in tackling these specialized problems. This success has come, though, from completely bypassing the extraction of human knowledge into common-sense theories."

If our spectacular mental elasticity — our ability to imaginatively interpret complex, open-ended information into rich and varied patterns — is the secret of human intelligence, what does this imply for the possibility of artificial intelligence? [...]

My suspicion is that the implications are far-reaching. As we saw, the early attempts to extract and codify human 'reasoning' and knowledge into a computer database failed comprehensively. The hoped-for hidden inner principles from which our thoughts and behaviour supposedly flow turned out to be illusory.

Instead, human intelligence is based on precedents — and the ability to stretch, mash together and re-engineer such precedents to deal with an open-ended and novel world. The secret of intelligence is the astonishing flexibility and cleverness by which the old is re-engineered to deal with the new. Yet the secret of how this is done has yet to be cracked. [Page 217]

However, his characterization of the current state of the art and assessment of the future is not well informed despite his access to some of the most forward-thinking researchers working in AI today. In particular, he correctly summarizes the inadequacy of methods that rely entirely on large amounts of carefully annotated and curated ground truth,

But my suspicion is that it is our mental elasticity that is one of the keys to what makes human intelligence so remarkable and so distinctive. The creative and often wildly metaphorical interpretations that we impose on the world are far removed from anything we have yet been able to replicate by machine.

To those who, like me, are fascinated by the possibilities of artificial intelligence, the moral is that we should expect further automation of those mental activities that can be solved by ‘brute force’ rather than mental elasticity — the routine, the repetitive, the well defined. [Page 218]

but misses the ascendancy of more sophisticated memory models, new methods for unsupervised learning, improvements in reinforcement learning and the increasing emphasis on embodied cognition involving robotics and simulated environments leveraging powerful physics engines developed for interactive computer games. See, for example, recent research work by Maguire, Hassabis, Schacter and others focusing on constructive memory, imagination and future thinking, episodic memory and goal-based planning [148214172921739111927], as well as demonstrations of working models for imagination-based planning, optimization, relational and code synthesis [8972151589020514].

## March 27, 2019

%%% Wed Mar 27 15:08:22 PDT 2019


Adapted from the section entitled "The Four Principles of the Cycle of Thought" in [30] with additional commentary here14:

I. The first principle is that attention is the process of interpretation. At each moment, the brain selects and attends to a target pattern of activity that it then attempts to organize and interpret. The target might consist of parts of our sensory experience, a fragment of language or a memory. The brain always attends to exactly one target at a time. The neural circuitry responsible for attention has reciprocal connections to circuits throughout the cortex, allowing for a wide range of analysis and imaginative re-interpretation to find meaning in our experience of the world.

II. The second principle concerns the nature of consciousness and states that our only conscious experience is our interpretation of sensory information. We are aware of each such interpretation, but the neural correlates from which this interpretation is derived and the process by which it is constructed are not consciously accessible. Perception produces interpretations in the form of patterns of activity that are derived from activity originating in peripheral sensory systems. The interpretive machinery has no way of identifying where these interpretations come from.

III. The third principle is that we are conscious of nothing else: all conscious thought concerns the meaningful interpretation of sensory information. The claim that we are aware of nothing more than our meaningful organization of sensory experience isn't quite as restrictive as it sounds. Sensory information need not be gathered by our senses, but may be invented in our dreams or by active imagery. Much of our sensory information comes not from the external world but from our own bodies — including many aspects of our discomfort, pleasure and sensation of effort or boredom.

IV. Conscious thought is the process of meaningfully organizing our sensory experience. The fourth principle is that the stream of consciousness is nothing more than a succession of thoughts, an irregular series of experiences that are the results of our sequential organization of different aspects of sensory input. We attend to and impose meaning on one set of information at a time. The involuntary, unconscious autonomic nervous system (ANS) controls breathing, heart-rate and balance independent of conscious thought, but central nervous system (CNS) activities beyond the sequential cycle of thought are limited.

## March 25, 2019

%%% Mon Mar 25  3:24:13 PDT 2019


Stanislas Dehaene is an accomplished systems neuroscientist and a first-rate science writer. His The Number Sense: How the Mind Creates Mathematics [43] and subsequent Reading in the Brain: The Science and Evolution of a Human Invention [44] were two of the most influential books attracting me to pursue systems neuroscience more comprehensively — in addition to my existing avid interest in cellular and molecular neuroscience. His Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts [45] is the clearest account of consciousness I've read so far and the one with the most convincing empirical basis.

In talking with Rishabh Singh and Sarah Loos at different times over the last two months, I have suggested that natural language might serve as both a declarative and a procedural language for, respectively, specifying what programs do and reasoning about how they do it. We've also explored the possible role of natural language as the language of thought, but that strikes me as presumptive and entirely too parochial. In The Number Sense, Dehaene [43] had some interesting things to say about the relationship between mathematical and natural language including relevant comments by John von Neumann.

The other day, my wife pointed me to a series of YouTube videos from a public symposium, "The Amazing Brain", held at Lunds Universitet on September 6, 2017. In his invited talk entitled "A Close Look at the Mathematician's Brain", Dehaene considers a wide range of specialized graphical languages going back to ancient cave paintings and featuring both highly stylized animals and symbolic notations and extending to the present era in which the language of mathematics has myriad dialects and specialized applications.

Dehaene and his colleagues have used various neuroimaging technologies to identify how these specialized languages map to systems-level circuits involving different collections of component functional brain areas. Surprisingly, both professional and amateur mathematicians seldom employ the familiar circuits and functional areas, such as Broca's and Wernicke's, associated with everyday natural language discourse, and there appears to be reproducible circuit-level agreement across subjects working with the same mathematical objects including, for example, trees, directed graphs, geometric shapes and algebraic equations.

These neuroimaging studies lead Dehaene to conjecture that perhaps, while different sensory modalities often employ similar computational strategies, they also employ different specialized features and exhibit different preferences for downstream associations depending on the peculiarities of their defining characteristics. Depending on when during development a given mathematical facility is typically acquired, the functional locus of the corresponding neural substrate might tend to gravitate to different brain areas.

Just as Terrence Deacon suggests that language, along with the unique human capacity for symbolic thought, co-evolved with the brain, perhaps, in accord with the purported generality of the dual-streams solution to neural signal processing, the arcuate fasciculus pathway connecting Broca's and Wernicke's areas is just the obvious selective response to a convenient nascent speaking-and-hearing technology for communication. Evolution might have taken a different turn if we had evolved with the ability to modulate and demodulate RF signals.

In Dehaene's experiments, much of the variation is exhibited in the frontal lobe which makes sense since the base input modality might correspond to speaking, writing, signing, drawing or some combination of these. It would only be later when the input is parsed into a suitable specialized internal language that we would see variation related to the specific area of mathematics, requiring specialized circuitry or proximity to other related circuits. It might be worth picking up a copy of Space, Time and Number in the Brain: Searching for the Foundations of Mathematical Thought edited by Dehaene and Elisabeth Brannon [46].

## March 23, 2019

%%% Sat Mar 23  3:44:41 PDT 2019


Here are some of the notes I compiled in preparing for the first class lecture held on Tuesday, April 2nd:

### Deep Reasoning About Shallow Thinking

I am going to begin and end this lecture with quotations from Nick Chater's latest book The Mind is Flat ... not because he is the only person to suggest such a theory of mind — we'll look at several related theories later in the coming weeks, but because he is, in my opinion, an articulate and competent cognitive scientist with a plausible theory of human cognition supported by a good deal of compelling experimental results and, since this lecture is all about building systems based on ideas from cognitive science, Chater's ideas are particularly relevant to our endeavor. In a recent lecture, Chater likens the mind to an iceberg:

The computations that the brain performs are very complicated, but they are nothing like the computations we are consciously aware of. So I think using our conscious mind as a window into the brain is a disastrous error. It is as though we imagine the mind as the tip of an iceberg. We can see the iceberg poking out of the sea, and the illusion is that we think "I can see the tip of the iceberg which is my flow of conscious experience, but I bet that the whole iceberg is doing the same sort of thing." In fact, the machinery that is operating at the unconscious level is a complicated memory retrieval and reuse system that is searching our past experience and using that past experience to understand the present. It performs tasks like scanning all the faces you've seen in order to recognize a new face. All of that perceptual processing is fast and runs in parallel, but it is nothing like the sequential computations performed by the conscious mind. The basic operations performed in the different sensory and motor areas of the brain are pretty much the same. They all involve perceptual processing of one sort or another. Even abstract thought involving, say, mathematics, physics or formal logic is carried out on a similar neural substrate by performing similar operations. I think of these calculations as being similar to those we employ in thinking about and recognizing objects. They are just more abstract versions of our innate perceptual capabilities. There are some who believe there are a number of specialized systems handling different types of problems — consider, for example, Jerry Fodor's [62] original modularity of mind hypothesis [REFERENCE] and the extended massive modularity hypothesis advocated by evolutionary psychologists including the Swiss Army knife metaphor of Cosmides and Tooby [33]. But I don't agree and resist making such strong modularity assumptions15.

In making decisions, you have the impression that your reasoning is based on some well-thought-out analysis occurring at a deeper subconscious level — you can feel the strength of your convictions; your judgments seem borne out by the fact of your emotional responses. Unfortunately, those feelings, that strong impression, your enthusiasm or repulsion ... these are not the consequence of some deeper analysis or carefully reasoned argument. They aren't propositions arrived at by sound inference procedures that serve to generate antecedent preconditions in support of subsequent forward chaining ... rock solid facts built upon a foundation of other facts to construct an impregnable edifice of rationality — Chater tells us that it is nothing of the sort.

Moran Cerf is an Israeli neuroscientist and hacker — his use of the term — who has worked in cybersecurity. Cerf has studied human subjects undergoing surgeries that require they spend several days walking around with an electrode array implanted in their brains. The purpose of the implanted array is to collect data so as to accurately position a more permanent electrode for medical purposes. Cerf and his colleagues have gained their consent to study their brains and carry out experiments along the lines of those described in Chater's book, but, in Cerf's case, allowing the experimenters to observe changes in populations of spiking neurons. For more information, check out his Wikipedia page and this YouTube video in which he describes his work related to Chater's theory.

Chater's work has many antecedents and draws upon research in cognitive psychology, behavioral economics, systems neuroscience and sociobiology — among other disciplines — to support his thesis. In particular, Chater cites work by Daniel Kahneman and Amos Tversky and their colleagues, including Paul Slovic and Richard Thaler, on the psychology of judgment and decision-making, on behavioral economics, and on the cognitive basis of common human errors that arise from heuristics and biases. Daniel Kahneman's Thinking, Fast and Slow [116] describes a related theory and body of experimental research results, and draws some of the same conclusions. While the primary conclusions of both Chater and Kahneman may seem to some an insult to our intellect and an indictment of our rationality, Chater and Kahneman seem more sanguine. If anything, our ability to rise above our base instincts and make use of our antediluvian computing machinery to achieve incredible accomplishments in the arts and sciences should encourage our efforts to further ourselves.

### Science Advances One Funeral at a Time

The German physicist Max Planck is quoted as saying: "A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it." You may know that Planck, of the eponymous constant — the so-called Planck constant h in the formula E = hν, where E is the energy of a photon and ν is its frequency — felt compelled to introduce the constant named for him in his revised formula for the observed spectrum of black body radiation. This move required that he rely on Boltzmann's statistical interpretation of the second law of thermodynamics. It was, he said, "an act of despair ... I was ready to sacrifice any of my previous convictions about physics." Boltzmann's career ended in suicide after decades of abuse by his colleagues who derided his theory of statistical mechanics, now heralded as one of the most important achievements in physics along with those of Isaac Newton, James Clerk Maxwell and Albert Einstein. Here is a short list of other scientists whose careers were thwarted and work derided by more senior scientists whose theories they challenged:

- Camillo Golgi — reticular theory, Golgi (stain) method
- Santiago Ramón y Cajal — neuron doctrine
- Thomas Hunt Morgan — chromosome hereditary role
- Seymour Benzer — single-gene behavioral traits
- Vernon Mountcastle — columnar organized cortex
- John Eccles — electrical synaptic transmission
- Paul Greengard — chemical neuromodulation
- Frederick Griffith — horizontal gene transfer
- Galarreta and Hestrin — electrical gap junctions
- Barres, Bilbo and Stevens — dual role of microglia

In the same vein — tales from the history of science relating to death — Karl Popper was quoted as saying that in science we propose theories and then seek to undermine them, so that scientists, in order to eliminate their false theories, can let those theories die in their stead16. I expect we would all like to think of the scientists whom we most respect as truth seekers and myth busters who are relentlessly vigilant in their quest for answers to nature's puzzles. Alas, we are all too human to be accorded such respect. I am particularly drawn to instrument builders like Michael Faraday and Ernest Rutherford, careful observers like Tycho Brahe, Santiago Ramón y Cajal and Edwin Hubble, and experimentalists like Enrico Fermi and Rosalind Franklin. Ed Boyden, George Church, Karl Deisseroth, Winfried Denk, Mario Galarreta and Mark Ellisman are among my living heroes. They provide the data necessary to come up with new theories and the evidence to falsify existing ones.

It wasn't always the case that scientists thought of themselves as responsible for actively falsifying their theories and those of their peers. Popper, writing again: "As opposed to this, traditional epistemology is interested in the second world [REFERENCE] in knowledge as a certain kind of belief—justifiable belief, such as belief based upon perception. As a consequence, this kind of belief philosophy cannot explain (and does not even try to explain) the decisive phenomenon that scientists criticize their theories and so kill them. Scientists try to eliminate their false theories, they try to let them die in their stead. The believer — whether animal or man — perishes with his false beliefs." [REFERENCE]

### Role of Inner Voices and Hallucinations

Most of us are cognizant of voices chattering endlessly in our heads. In most cases, we associate those voices with ourselves, but many of us hear other voices both familiar and unfamiliar. Hearing and relating to inner voices is often associated with schizophrenia — a serious brain disorder characterized by thoughts or experiences that seem out of touch with reality, disorganized speech or behavior and decreased participation in daily activities. Difficulty with concentration and memory is also common.

In most of us these inner voices appear to serve useful purposes; they can help us to organize our thoughts and consolidate episodic memory by rehearsing and re-imagining our past experience in order to deal with the present and plan for the future. The ability to construct narratives, sub-vocally describe our hopes and plans in a rich, shared language and privately listen to ourselves as we refine and apply these narratives in everyday life appears to be uniquely human — at least given the extent to which we do so.

Our facility with inner speech appears to be physically mediated by specific cortical structures. The arcuate fasciculus is a bundle of axons that forms part of the superior longitudinal fasciculus, an association fiber tract. The arcuate bidirectionally connects caudal temporal cortex and inferior parietal cortex to locations in the frontal lobe. The arcuate fasciculus connects Broca's area and Wernicke's area.

This Broca-to-Wernicke pathway implies there is a direct connection — via an association fiber tract — from circuits ostensibly involved in the production of speech, i.e., Broca's area, adjacent to areas in the frontal lobe responsible for motor planning, to areas involved in understanding speech, i.e., Wernicke's area, located in the posterior section of the superior temporal gyrus encircling the auditory cortex on the lateral sulcus, where the temporal lobe and parietal lobe meet.

There is growing consensus that a general dual-loop scaffolding exists in human and primate brains, with evidence that the dorsal and ventral connections subserve similar functions independent of modality and species. The Broca-to-Wernicke pathway is but one instance of such a dual loop. As visual information exits the occipital lobe, and as sound leaves the phonological network, it follows two main pathways, or "streams". The ventral stream (also known as the "what pathway") is involved with object and visual identification and recognition. The dorsal stream (or "where pathway") is involved with processing the object's spatial location relative to the viewer and with speech repetition17.

Could these dual-stream-mediated sensory-motor systems serve as a feedback loop for unsupervised predictive or reconstructive learning, similar to the reciprocal EHC-CA1 connections in the hippocampal-entorhinal complex? Could this be the means by which we construct the narratives that produce hippocampal traces encoding the episodic memories that allow us to acquire procedural knowledge and learn how to perform long division, that allow us to perform the amazing feats of memory recall that professional mnemonists are able to demonstrate, and that serve as the basis for Nick Chater's shallow model of human cognition?
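To make "unsupervised predictive or reconstructive learning" over a paired forward/backward pathway concrete, here is a purely illustrative sketch. Every name, dimension and learning rule below is invented for illustration: this is just a linear autoencoder trained on reconstruction error, not a model of the arcuate fasciculus or of EHC-CA1 circuitry.

```python
import numpy as np

# Toy "comprehension" (encode) and "production" (decode) pathways joined
# in a loop; the mismatch between input and its re-generated version is
# the error signal that trains both halves, with no labels required.
rng = np.random.default_rng(1)
n_in, n_hidden, lr = 20, 5, 0.01

W_enc = 0.2 * rng.normal(size=(n_hidden, n_in))   # input -> interpretation
W_dec = 0.2 * rng.normal(size=(n_in, n_hidden))   # interpretation -> reconstruction

# Surrogate "sensory" data with low-dimensional structure.
latent = rng.normal(size=(200, n_hidden))
mixing = rng.normal(size=(n_hidden, n_in)) / np.sqrt(n_hidden)
data = latent @ mixing

losses = []
for epoch in range(50):
    total = 0.0
    for x in data:
        h = W_enc @ x              # "comprehend" the input
        x_hat = W_dec @ h          # "produce" a reconstruction of it
        err = x_hat - x            # prediction/reconstruction error
        total += float(err @ err)
        # Gradient descent on squared reconstruction error.
        g_dec = np.outer(err, h)
        g_enc = np.outer(W_dec.T @ err, x)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    losses.append(total / len(data))

print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The only point of the sketch is that a reciprocal pathway plus an error signal suffices for unsupervised learning of an internal representation; nothing here commits to any particular neural substrate.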

## March 21, 2019

%%% Thu Mar 21  3:24:13 PDT 2019


Nick Chater and Stanislas Dehaene, in their most recent books, are trying to convince their respective readers that their particular theories [30, 45] explain how the human brain functions. In contrast, Helen Thomson [192] is a science writer in the same tradition as Oliver Sacks [170, 171] and Horace Freeland Judson [114]. Her goal is to tell you interesting stories as a way of educating you about how your brain works and what happens when it goes awry.

The accolades on her book cover from Ed Yong and Robert Sapolsky, along with reviews in venues including the journal Nature, are well deserved. The chapter on how the hippocampus and various subregions of the frontal cortex enable us to store episodic memories and navigate and recognize landmarks in our physical environment is a wonderful supplement to related work in scientific journals.

Her story about a man whose brain is significantly altered by an aneurysm that produced a subarachnoid hemorrhage resulting in burst blood vessels and blood flooding into and around his brain is both excellent scientific reporting and an extraordinary real-world instance of how the human brain can change for the better — at least from the patient’s perspective — even when that change is precipitated by a traumatic insult to our arguably most cherished organ.

Tommy McHugh was a petty criminal, distant parent and occasionally abusive husband living in poverty in Liverpool. His personality was almost completely altered — by most accounts for the better — by the aneurysm that damaged his brain. This particular method of improving oneself is obviously not one that anyone would intentionally endure, but the transformation is so dramatic and so positive that it makes you wonder if we could build technology or perform precise surgeries that could, with high reliability, purposefully produce such transformations.

Just after finishing this chapter in Thomson's book, my wife Jo read me excerpts from Vincent van Gogh's letters [198, 146] to his brother Theo during his time in Arles, when he painted several hundred works of art in a frenzy of manic activity. Van Gogh was articulate and emotional in describing his psychological distress; his accounts are painful to read, and at the same time one can't help but empathize with Theo, who patiently listened to, cared for and supported his brother despite being financially and psychologically burdened by Vincent's relentlessly insensitive and intrusive treatment.

While participating in a recent neurotech conference, I made a comment concerning the prospects for developing possibly-invasive, potentially-risky technology to eliminate chronic depression and related debilitating brain disorders. One person in the audience suggested van Gogh's contributions to modern art would have suffered if his psychological infirmities were alleviated or eliminated. I thought his suggestion profoundly ignorant or grossly insensitive, but simply offered that he might think differently if he or one of his loved ones suffered from bipolar disorder or one of the more severe variants of autism spectrum disorder.

## March 19, 2019

%%% Tue Mar 19  3:17:48 PDT 2019


You are the confluence of a set of emergent properties of an evolved arrangement of carbon-based molecules. In some respects, you are more an ephemeral than corporeal phenomenon. Your conscious awareness is as ephemeral as a process running on a cell phone, and, while you are undoubtedly embodied, the essential you is manifest as a thread of computation, fragile as a snowflake and potentially as powerful as any force of nature.

In this class we continue with a grand quest that is as old as language: to re-create and reimagine ourselves as subtle machines of incredible complexity and potential. Our immediate objective is to model a human being — or at least those aspects of being human that we believe to be the most efficacious in building artificially intelligent agents with whom we can interact and collaborate naturally.

Three years ago I was focusing on computational models attempting to account for the function of complex neural circuits on the scale of entire organisms. I concluded it would be some time before we could record from a large fraction of the neurons in the brain of any organism as complicated as a fruit fly. I simply didn't see any possibility of whole brain recording within the next two years and possibly longer.

Since coming to these conclusions and abandoning the idea of starting a project at Google to construct functional models of whole brains, my interest has changed to modeling human cognitive behavior, which poses different challenges and offers a richer perspective on cognition than is possible by studying flies or mice. My scientific focus changed from that of a cellular and molecular neuroscientist to that of a systems and cognitive neuroscientist, and required a shift from studying simpler organisms at the level of individual neurons to studying humans at the level of relatively large, anatomically well defined but internally opaque regions of neural tissue.

I am still focused on learning models of animal behavior, but the class of models I employ as an inductive bias has changed considerably. In particular, whereas in the case of studying flies and mice at the level of neurons I employed a prior that sought to define and model the composite function of small neuronal circuits, now I use a prior that leverages what we know about the function of relatively large regions of the brain thought to implement specific functions relating to the sort of behavior I want to model. That behavior is infinitely more complex and varied than we observe in simpler animals, especially given that we focus on human behavior involving complex problem solving and natural language communication.

It might seem audaciously conceited to think that we could somehow construct a model of the human brain that even begins to account for the diverse behavior of human beings and their most successful invention yet, that of human language. However, I don't believe this is necessarily the case and I will spend the remainder of this quarter trying to convince you that what we are attempting to do is within the range of what's possible and now is a good time to be doing so. I should mention however that what I am suggesting we work on, while of considerable practical value, does not begin to address the more complicated problem of understanding the underlying biology at the scale and level of detail required to diagnose and find cures for human brain disorders, though I hold out some hope that the sort of high-level computational models of human cognition we are pursuing will one day contribute to that effort.

## March 11, 2019

%%% Mon Mar 11 04:19:48 PDT 2019


This entry serves as a parking spot for my notes on Nick Chater's The Mind is Flat. As an introduction and a test to see if you are interested in his theory of human cognition, I suggest you start with his Google Talks book tour presentation. If you find that interesting but are still not convinced that you want to read the book [30], you might get a better idea by watching Episode #111 of The Dissenter podcast hosted by Ricardo Lopes. Here is an excerpt relating to Chater's main thesis that we are misled by introspection into believing that below the threshold of our conscious thoughts there is a great deal of supporting unconscious thinking going on — unconscious, but of the same general depth and complexity as our conscious thoughts:

The things the brain does are very complicated, but they are nothing like the things we are consciously aware of. So I think using our conscious mind as a window into the brain is a disastrous error. It's like we think the mind is the tip of an iceberg. We see the iceberg poking out of the sea, and the illusion is that we think "I got the tip of the iceberg which is my flow of conscious experience, but I bet that the whole iceberg is just the same stuff". [...] The machinery that is operating is this incredibly complicated memory retrieval and reuse system which is searching our past experience and using that past experience to understand the present. [...] Like scanning all the faces you've seen in order to recognize a new face. All of that is fast and parallel, but it's nothing like the sequential nature of the mind. I think the basic operation performed in these different areas [of the brain] is pretty much the same from one area to the next. They all involve perception or memory in one way or another. Abstract thought whether mathematics, physics or the law [...] I think of these as all pretty much similar in spirit to thinking about and recognizing objects. [...] They are just more abstract versions of that. [There are some who believe that there are a number of specialized systems handling different types of problems] The Tooby and Cosmides Swiss Army knife model [59] [But I don't agree.] So I want to push against this [strong] modularity assumption.

Ricardo provides the following convenient bookmarks that take you directly to the relevant location in the podcast:

00:01:06 The basic premise of "The Mind is Flat"

00:05:33 We are like fictional characters

00:09:59 The problem with stories and narratives

00:13:58 The illusions our minds create (about motives, desires, goals, etc.)

00:17:44 The distinction between the conscious mind and brain activity

00:22:34 Does dualism make sense?

00:27:11 Is modularity of mind a useful approach?

00:31:21 How our perceptual systems work

00:41:49 How we represent things in our minds

00:44:57 The Kuleshov effect, and the interpretation of emotions

00:55:42 Why do we need our mental illusions?

00:59:10 The importance of our imagination

01:01:31 Can AI systems produce the same illusions (emotions, consciousness)?

### Lament Over Sloshed Milk

Here are the last few lines of Chapter 2 entitled "Anatomy of a Hoax" in which Chater commiserates with himself and the reader over the fact — actually a presupposition — that scientists (might) have deluded themselves regarding some of the most basic facts about human cognition. I will certainly admit that he makes a good case for his view of how we experience and make sense of the world around us. His theory explains some of the predictions one could make concerning the models I've been working on and so I will have little reason to complain if he is proved right. But I will hold out for a while and watch for more experimental evidence before celebrating my modeling choices or adopting his theory wholesale.

From time to time, I have found myself wondering, somewhat despairingly, how much the last hundred and fifty years or so of psychology and neuroscience has really revealed about the secrets of human nature. How far have we progressed beyond what we can gather from philosophical reflection, the literary imagination, or from plain common sense? How much has the scientific study of our minds and brains revealed that really challenges our intuitive conception of ourselves?

The gradual uncovering of the grand illusion through careful experimentation is a wonderful example of how startlingly wrong our intuitive conception of ourselves can be. And once we know the trick, we can see that it underlies the apparent solidity of our verbal explanations too. Just as the eye can dash into action to answer whatever question about the visual world I happen to ask myself, so my inventive mind can conjure up a justification for my actions, beliefs and motives, just as soon as I wonder about them. We wonder why puddles form or how electricity works, and immediately we find explanations springing into our consciousness. And if we query any element of our explanation, more explanations spring into existence, and so on. Our powers of invention are so fluent that we can imagine that these explanations were pre-formed within us in all their apparently endless complexity. But, of course, each answer was created in the moment.

So whether we are considering sensory experience or verbal explanations, the story is the same. We are, it turns out, utterly wrong about a subject on which we might think we should be the ultimate arbiter: the contents of our own minds. Could we perhaps be equally or even more deluded when we turn to consider the workings of our imagination?

### Collective Decision Making

Here is an extended thought experiment inspired by my reading of Chater's The Mind is Flat [30] that explores how Chater's theory of human cognition might play out in a collective endeavor:

When we engage in a group undertaking whether that be evaluating candidates for a job position or deciding upon a strategy for investing in new markets, we are collectively creating a shared illusion that serves as the basis of our own individual thinking as well as any possible consensus regarding, for example, specific actions being contemplated.

Think about what happens when one of us makes some contribution to the discussion whether it be a comment or criticism or an addition or modification to some possible outcome of our joint focus, say a job offer, contract or new species of investment. In voicing an opinion, we influence one another's mind state by how our contribution is individually and jointly perceived. Given what Nick Chater tells us about human behavior, it is highly likely that our contribution will be misunderstood and our resulting thoughts and those of others thinly apprehended but richly fantasized.

It makes sense to think of this shared space as a collection of thought clouds in the sense that Geoff Hinton uses the term. Each thought cloud is no more than a sparse representation of an individual’s state vector. It includes, among many other things, the activation of state variables that correspond to our internal representation of the mental states of those sitting around the table — a representation that is no doubt poorly informed and incredibly misleading.

These individual idiosyncratic representations of the evolving joint space are tied together very loosely by our notoriously error-prone efforts to read one another's thoughts, but, whether or not we are able to read "minds", there is still the possibility of interpreting what each contributor actually says or how they act. Alas, we are just as fallible in reading body language and interpreting intent from what is explicitly said or acted out in pose, gesture or facial tick.

As each participant voices their opinion, makes their case, and expresses their support for or opposition to what was said earlier, all of the individual thought clouds are separately altered according to the inscrutable adjustments of diverse hormonal secretions and neuromodulatory chemical gradients. The individuals may believe — or possibly hope — that some consensus will eventually be reached; however, unless carefully led by a very skilled facilitator, the separate thought clouds will be cluttered, full of contradictions and misunderstandings and yet by some independent measure oddly aligned — which could be simply due to the length of time the meeting was scheduled for or the perceived duration necessary for this particular group to reach consensus.

There will very likely be a good deal of wishful thinking among those who either want the meeting to end quickly regardless of the outcome, hope that a consensus can be amicably reached, or have already reached their final opinion and will become increasingly strident in support of their views as the meeting drags on. There will be those who will want — or pretend — to hear their colleagues voice support for their ideas, but will interpret whatever they say to suit their own selfish interests and expectations.

In Chater’s theory, each one of us is a single thread of conscious thought informed by and constructed upon a history of memories dredged up in response to sensory input, in this case, resulting from what was seen and heard in the meeting. This means that, in particular, each one of us will have a different context depending on our own stored memories and the degree to which we have attended to the discussion in the meeting. This will result in a collection of state vectors that in the best of circumstances are only roughly aligned, and, in the more realistic case, significantly discordant.

It would be interesting to know what sorts of dimensions are more likely to appear in some semblance of their current grounding in fact, or that, while they may have a different valence, at least represent an emotional trajectory accounting for roughly the same physiological state across some fraction of the individuals present in the discussion. While I don't believe this sort of dimensional alignment is likely for dimensions of a more abstract sort, I wouldn't be surprised if, were one able to do a factor analysis on all the different thought vectors represented in a given collective, we might be able to identify factors representing some alignment that translates across individuals — one that plays an important role in our evolution as successful social organisms.
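A toy numerical sketch may make the factor-analysis idea concrete. Everything below is invented for illustration (the group size, the dimensionality, and the assumption that each participant's "thought vector" is a couple of shared, say emotionally mediated, factors plus idiosyncratic noise); a PCA-style analysis then recovers the shared component:

```python
import numpy as np

# Invented setup: 12 participants, each holding a 50-dimensional
# "thought vector" generated from 2 shared factors plus private noise.
rng = np.random.default_rng(0)
n_people, n_dims, n_factors = 12, 50, 2

loadings = rng.normal(size=(n_dims, n_factors))     # how factors express in dims
scores = rng.normal(size=(n_people, n_factors))     # per-person factor scores
noise = 0.5 * rng.normal(size=(n_people, n_dims))   # idiosyncratic content
thoughts = scores @ loadings.T + noise

# PCA via SVD of the mean-centered matrix: the leading components are
# the candidate "aligned" factors shared across individuals.
centered = thoughts - thoughts.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)

print("variance explained by top components:", np.round(explained[:4], 3))
```

With only two genuinely shared factors in the generative assumption, the first two components absorb most of the variance; the long tail is the idiosyncratic residue, which is roughly the picture of loose alignment sketched above.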

The picture I have in my head is of a collection of thought clouds with some dimensional alignment across individuals with respect to perceptually — and in particular emotionally — mediated factors but very little alignment across abstract dimensions that capture more of the concrete aspects of the collective-focus intended by those who organized the meeting in the first place. All of the usual cognitive biases are likely to be at play in the interactions going on during the meeting. Individual opinions will not be obviously manifest in behavior and will likely be repressed and prettified to make them more palatable to the group as a whole.

Moreover, many if not most of the individuals will likely misinterpret the opinions and other hidden state of their co-contributors, and also likely adjust the valence and magnitude of related dimensions to suit their own idiosyncratic beliefs and desires with respect to the outcome of the collective effort.

It would be instructive to imagine this sort of collective enterprise playing out in a protracted meeting and how, for example, participants might align their viewpoints based upon a particularly articulate opinion rendered by someone thought highly — or perhaps fondly — of, versus some sort of discordant alignment resulting from an incoherent but forcefully rendered opinion by someone not well thought of. The exercise would not necessarily be to figure out a strategy for effectively coming to a joint understanding so much as to see how cognition would play out given sparse serialized thought processes operating on internal representations that only thinly capture the collective experience and ground much of what is heard or seen in idiosyncratic, suspiciously self-promoting or self-effacing episodic memory18.

As a logical next step along these lines, it would be interesting to ask how the outcome might differ for a group of very smart, highly motivated, deeply engaged individuals with a history of working together, overseen by a facilitator of particularly sharp intellect and unusually well-calibrated emotional and social intelligence, highly motivated to do the right thing and fully invested in guiding the participants toward a consensus worth the effort.

In this case, the focus would be on how the participants might rise above their (instinctual) predilections by using the same cognitive substrate with the same energy and focus as they would devote to something they might find intellectually more satisfying such as writing code and solving interesting programming problems. Specifically, how can the basic cycle of apprehend (sense), attend (select), integrate personal experience (recall), and respond to an internal model of the present circumstances (act) be successfully applied to effectively make difficult decisions given what Chater refers to as the essentially "flat" character of our world view / internal model of present circumstances.
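The sense, select, recall, act cycle can be sketched as a short Python loop. Every function and data structure below is a hypothetical placeholder, intended only to show a single serial thread of control stepping through the four stages while updating episodic memory as it goes.

```python
# A minimal, purely illustrative sketch of the apprehend-attend-recall-act
# cycle. Nothing here is a real cognitive model; the point is the single
# serial thread operating over a sparse internal model.

def sense(environment):
    """Apprehend: take in the raw input for this cycle."""
    return environment.pop(0)

def select(percept, focus):
    """Attend: keep only the parts of the percept matching the current focus."""
    return {k: v for k, v in percept.items() if k in focus}

def recall(cue, episodic_memory):
    """Integrate personal experience: retrieve episodes sharing any cue key."""
    return [ep for ep in episodic_memory if cue.keys() & ep.keys()]

def act(model):
    """Respond to the internal model of the present circumstances."""
    return max(model, key=model.get) if model else "wait"

episodic_memory = [{"speaker": "A", "tone": "strident"}]
focus = {"speaker", "claim"}
environment = [{"speaker": "B", "claim": "consensus", "noise": 0.7}]

while environment:
    percept = sense(environment)
    attended = select(percept, focus)              # attention filters the percept
    episodes = recall(attended, episodic_memory)   # memory shapes the context
    model = {k: 1.0 for ep in episodes for k in ep} | {k: 2.0 for k in attended}
    response = act(model)
    episodic_memory.append(attended)               # the cycle updates memory as it goes
```

Note how thin the internal model is at each step: the response depends only on the attended slice of the input plus whatever episodes the cue happened to retrieve, which is one way to read Chater's "flat" characterization.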

P.S. The original name of this file — Laughably_Sparse_Embarrassingly_Serial_Chatter.txt — is a tongue-in-cheek reference to a model of parallel / distributed computation (https://en.wikipedia.org/wiki/Embarrassingly_parallel) that describes much of the parallelism available in modern industrial cloud services and corporate data and computing centers.

## March 9, 2019

%%% Sat Mar  9 14:57:23 PST 2019


I read a recent paper by researchers at Arizona State and Stanford University — see the Press Release from ASU — making a case that hippocampal pattern separation supports reinforcement learning by "forming conjunctive representations that are dissociable from feature components and that these representations, along with those of cortex, influence striatal prediction errors" — see Ballard et al [13]. A colleague sent me this link to a recent TED Talk by Ed Boyden — he has a great technology story to tell, one that is exciting from a scientific and technological perspective and inspiring from an aspirational one.

My latest readings for class include journal articles relating to inner speech and the two-streams hypothesis as it relates to hearing and speaking [1211188113], and articles relating to episodic memory, including selected chapters19 in the Handbook of Episodic Memory [53] and journal articles cited in recent books by Trevor Cox [34] and Charles Fernyhough [60] relevant to these topics [1841795710]. Chapters in the handbook are accessible to students online from the Stanford libraries website. Stanford doesn't have a print version, but you can use the chapter title to search the web since many of the chapters were previously published as journal articles.

## March 7, 2019

%%% Thu Mar  7 09:19:37 PST 2019


Lisa Giocomo (Giocomo Lab) suggested that I ask Loren Frank from UCSF to participate in class given his focus on episodic memory, the hippocampal complex, and making predictions and decisions: Loren Frank. "Continuous 8 Hz Alternation between Divergent Representations in the Hippocampus" — a presentation at the Simons Institute, Berkeley, Monday, February 12th, 2018 2:00 pm. The Simons website also includes his original and corrected slides.

Pages 160-162 of Chapter 11: A Brain Listening to Itself, in The Voices Within by Charles Fernyhough [60]:

Despite the limitations of existing research, the inner speech model has provided a useful framework for making sense of neuroscientific findings concerning voice hearing. Some of the most impressive evidence has come from findings of structural differences between people who hear voices and those who don't. The inner speech model has often been translated into neuroscientific language in terms of the connection between the part of the brain that generates an inner utterance (particularly the left inferior frontal gyrus or Broca's area) and the region that perceives it (part of the superior temporal gyrus20 or Wernicke's area). Recall that, in the model of action monitoring put forward by Chris Frith and colleagues [66], a signal is sent from the system that produces inner speech to the speech detection areas of the brain, effectively saying, "Don't pay any attention to this; this is you speaking." In schizophrenia, Frith argued, something goes wrong with the transmission of the signal. The "listening" part of the brain does not respect the signal that is coming, and so it processes the signal as an external voice21.

Studying connectivity between these areas of the brain should allow us to see whether this kind of transmission error is occurring. Neuroscientists make a broad distinction between two kinds of brain material: gray matter, which takes its name from the cell bodies of the neurons, or nerve cells, that color it; and white matter, which consists of the parts of the nerve cells that communicate with other nerve cells (roughly, the brain's wiring). Studying the integrity of white matter can tell you something about how different parts of the brain are talking to each other, or at least how they are wired up to talk. To make an analogy, you can learn a lot about the structure of a communication system — a telephone exchange, for example — simply by studying how it is connected up, even if no signals are actually flowing through the system.

For the inner speech model of voice hearing, one tract of white matter has been of particular interest. It's the stretch of neural wiring that (very roughly) joins Broca's area to Wernicke's area, the area in the superior temporal gyrus that perceives speech. This group of fibers is called the arcuate fasciculus. Recall that an utterance in inner speech is supposedly generated, but the speech perception area doesn't get the usual tip-off about it. In Frith's theory this happens because Broca's area usually sends a copy of the instruction to Wernicke's area, effectively telling it not to listen for what's about to happen. That so-called "efference" copy is sent along the arcuate fasciculus.

The integrity of this tract of white matter has indeed been linked to auditory verbal hallucinations. Along with looking at the physical structure of the pathway, researchers have used neurophysiological methods such as electroencephalography (EEG) to find out whether communication between these brain regions is disturbed. Judith Ford and her colleagues at Yale University showed that the usual "dampening" that occurs in Wernicke's area as a result of receiving the efference copy does not occur so markedly in patients with schizophrenia. That interpretation gained support from an fMRI study looking at how patients' brains responded when they were perceiving external speech in comparison to when they were generating inner speech. The listening areas of the control participants' brains activated less when they were imagining sentences than when they were hearing sentences spoken out loud. The difference was significantly less noticeable in the schizophrenia patients' brains, suggesting a problem with the transmission of the efference copy between Broca's and Wernicke's areas.

## March 6, 2019

%%% Wed Mar  6 04:56:32 PST 2019


History of Science: John O'Keefe — Place Cells in the Hippocampus, Past and Present (YouTube) speaking at the SUNY (Downstate) Medical Center honoring the work of Dr. James Ranck. James B. Ranck, Jr. MD is distinguished teaching professor emeritus of physiology and pharmacology at SUNY Downstate Medical Center, where he taught from 1975 until his retirement in 2014. In 1968, he was one of the first to record electrical activity in single neurons, opening up a new direction of brain study. In 1984, he discovered head-direction cells in the brain, which, along with place cells and grid cells, underlie the neural basis of navigation and spatial behavior. Dr. Ranck founded the hippocampal laboratory at Downstate, which became widely known in neuroscience circles as "The Brooklyn Group," working on memory and navigation.

John O'Keefe's discovery of hippocampal "place cells" launched the notion that the hippocampus was the brain's "cognitive map" (O'Keefe and Nadel, 1978). Starting in the early days Jim Ranck made fundamental contributions including the discovery of "head-direction cells" — cells that are the basis of a "sense of direction". The field grew rapidly and flourished. There are now dozens of labs and broad recognition that this line of research is yielding fundamental insights into the neuronal mechanisms that produce cognition. The symposium will summarize Jim Ranck's contributions and survey current work that begins to reveal how mind is produced in the brain.

## March 5, 2019

%%% Tue Mar  5 04:28:58 PST 2019


Having watched one interview — with Terrence Deacon — conducted by Ricardo Lopes and considered it thought-provoking, I tried another. Here is Ricardo interviewing Michael Graziano on his theories of consciousness. If you haven't already, check out Michael's presentation in class here. I found the contrast between Michael's presentation and Stanislas Dehaene discussing his work fascinating, in particular, what each of them took for granted. Michael probably had some idea of his audience from looking at some of Lopes' other interviews and their comments. Dehaene had the advantage that his presentation was professionally edited and scripted, so I won't comment on their respective deliveries, but rather on what they included and what they left out. Michael was careful to target a general audience.

In particular, Michael didn't assume that listeners would know what a "representation" is, much less what a representation of one's self might look like, or what it would mean to construct a representation of one's self. Getting it right makes a big difference, and I know from experience that students without a computer science or cognitive science background don't deeply understand the differences between a scene (what one sees when viewing, say, a garden or mountain range), a photo of that scene, a cave drawing or painting of it, an icon signifying the scene, and a word or phrase used to refer to the scene. Charles Sanders Peirce's Semiotic Theory has a great deal of interesting things to say about representations.

## March 4, 2019

%%% Mon Mar  4 03:37:39 PST 2019


Last night I finally got around to skimming through the rest of Terrence Deacon's book, Incomplete Nature, and this morning Jo and I listened to an interview with Deacon that spent most of the time on Incomplete Nature and the remainder on The Symbolic Species: The Co-evolution of Language and the Brain. Between the book, the interview and a convenient Wikipedia page, I was able to get what I wanted out of those books, with Incomplete Nature being the real challenge. I've included my Cliff's Notes summary at the bottom in case you're curious and would like to know what secrets were revealed.

It's interesting how each complex concept that I've been working to understand over the last few months has its own arc of discovery. The trajectory of seeking out appropriate authors and relevant books and papers generally takes the same form: initially the material seems hopelessly impenetrable and the authors deviously dissembling, but if you persevere, eventually you start finding little footholds that you can use to gain some purchase on the abstract concepts. The last stage is the most intriguing, or perhaps unsettling, as it generally involves going to sleep one night thinking that you are dense and clueless, and waking the next morning to find the concepts simple to understand and straightforward to communicate.

I've had that experience in reading both of Deacon's books, as well as with scores of other topics that I am researching for the apprentice project and the related Stanford class. For example, how do pattern separation and completion work in the hippocampal-entorhinal-cortex complex? How is the value function learned, and where does it reside in the basal ganglia circuits responsible for action selection? And what goes wrong when the indirect-path inhibitory neurons behave erratically, giving rise to the debilitating symptoms of Parkinson's disease?

The August 31 and August 29 entries in the 2018 class discussion list contain my initial impressions from reading Deacon. If you're curious, you can find my Cliff's Notes summary of Deacon's Incomplete Nature: How Mind Emerged from Matter, assembled with a little help from Wikipedia and Ricardo Lopes (also known as "The Dissenter"), in the footnote at the end of this paragraph22.

Here is Deacon speaking at the University of Trento, Center for Mind / Brain Sciences, on October 28, 2016 on the topic of "Neither nature nor nurture: The semiotic basis of language universals". Deacon summarizes his views on Chomsky's theory that humans possess an innate language faculty in the form of a universal grammar, and then launches into a departure / extrapolation from his two earlier books [39, 40] that promises to be very interesting. I've summarized Deacon's perspective on Chomsky in this footnote23.

Here is an interview with John Krakauer, in which Krakauer and host Paul Middlebrooks talk about the intellectual value of scientists writing book-length treatises, the importance of the long-form expression of ideas, and how our understanding of an idea or our ability to express it evolves over time. It reminded me of the aphorism24 that fortune favors the prepared mind. Note that "preparation" in this context means something different from scanning through a book without comprehension, reading the Cliff's Notes without making an effort to reconstruct the arguments, or binge-watching popular science shows from PBS, NPR or BBC. Aside from that thought, the interview had little to recommend it.

## December 7, 2018

%%% Fri Dec  7 04:12:22 PST 2018


Here are some example courses and related course materials from my colleagues at Brown University aimed at students in computer science, computational linguistics, natural language processing and neuroscience wanting a better foundation in statistics and probability theory:

DATA 1010 — Stuart Geman (Applied Mathematics) designed this course for students in engineering, cognitive science, etc., who want to understand the math but don't need the theory, e.g., the mathematical foundations in terms of Borel sets, etc. Mark Johnson (Cognitive Science) co-taught the course with Stu and co-authored this related paper on probability and linguistics.

CSCI 0241 — Eugene Charniak (Computer Science) wrote one of the first dissertations linking artificial intelligence and natural language processing and was a pioneer in statistical language learning. His monograph is still required reading for many courses on statistical NLP and I recommend it highly — AI and NLP students alike appreciate its introduction to Bayesian statistics.

CSCI 1550 — Eli Upfal (Computer Science) designed this course for computer scientists and students interested in understanding the computational and algorithmic issues pertaining to statistics and probability. Eli and Michael Mitzenmacher wrote an excellent textbook that is used in this course. The expanded second edition provides a great complement to courses in machine learning.

## November 23, 2018

%%% Fri Nov 23  4:00:25 PST 2018


While satisfied with our implementation of conscious awareness from a purely engineering standpoint, it is revealing to consider what Stanislas Dehaene and his colleagues [48, 45] have to say about the nature of conscious and unconscious thought. Chapter 2 of [45] summarizes a large body of experimental work addressing this question. For example, in the following excerpt Dehaene discusses one particular experiment aimed at better understanding the efficacy of unconscious incubation in decision making:

Having a hunch is not exactly the same as resolving a mathematical problem. But an experiment by Ap Dijksterhuis comes closer to Hadamard's taxonomy [87] and suggests that genuine problem solving may indeed benefit from an unconscious incubation period [54]. The Dutch psychologist presented students with a problem in which they were to choose from among four brands of cars, which differed by up to twelve features. The participants read the problem, then half of them were allowed to consciously think about what their choice would be for four minutes; the other half were distracted for the same amount of time (by solving anagrams). Finally, both groups made their choice. Surprisingly, the distracted group picked the best car much more often than the conscious-deliberation group (60 percent versus 22 percent, a remarkably large effect given that choosing at random would result in 25 percent success). — excerpt from Page 82 of Dehaene [45]

Dehaene goes on to consider whether and to what extent conscious and unconscious thought are different in their ability to discern subtle properties of the phenomena they are called upon to interpret. In so doing, he comments on the important role of symbolic, combinatorial thinking in resolving ambiguity in sense data, and the serial versus parallel processing distinction made elsewhere in these notes:

Henri Poincaré, in Science and Hypothesis [161], anticipated the superiority of unconscious brute-force processing over slow conscious thinking:

The subliminal self is in no way inferior to the conscious self; it is not purely automatic; it is capable of discernment; it has tact, delicacy; it knows how to choose, to divine. What do I say? It knows better how to divine than the conscious self, since it succeeds where that has failed. In a word, is not the subliminal self superior to the conscious self?

Contemporary science answers Poincaré's question with a resounding yes. In many respects, our mind's subliminal operations exceed its conscious achievements. Our visual system routinely solves problems of shape perception and invariant recognition that boggle the best computer software. And we tap into this amazing computational power of the unconscious mind whenever we ponder mathematical problems.

But we should not get carried away. Some cognitive psychologists go as far as to propose that consciousness is a pure myth, a decorative but powerless feature, like frosting on a cake. All the mental operations that underlie our decisions and behavior, they claim, are accomplished unconsciously. In their view, our awareness is a mere bystander, a backseat driver that contemplates the brain's unconscious accomplishments but lacks effective powers of its own. As in the 1999 movie The Matrix, we are prisoners of an elaborate artifice, and our experience of living a conscious life is illusory; all our decisions are made in absentia by the unconscious processes within us.

The next chapter will refute this zombie theory. Consciousness is an evolved function, I argue — a biological property that emerged from evolution because it was useful. Consciousness must therefore fill a specific cognitive niche and address a problem that the specialized parallel systems of the unconscious mind could not.

Ever insightful, Poincaré noted that in spite of the brain's subliminal powers, the mathematician's unconscious cogs did not start clicking unless he had made a massive initial conscious attack on the problem during the initiation phase. And later on, after the "aha" experience, only the conscious mind could carefully verify, step by step, what the unconscious seemed to have discovered. Henry Moore made exactly the same point in The Sculptor Speaks [142]:

Though the non-logical, instinctive, subconscious part of the mind must play its part in [the artist's] work, he also has a conscious mind which is not inactive. The artist works with a concentration of his whole personality, and the conscious part of it resolves conflicts, organizes memories, and prevents him from trying to walk in two directions at the same time. — excerpt from Chipp and Correia [31]

In Chapter 3 of [45], Dehaene asks what purpose does conscious thought serve and why did it evolve. In this excerpt at the beginning of the chapter, he succinctly summarizes his conclusions:

Why did consciousness evolve? Can some operations be carried out only by a conscious mind? Or is consciousness a mere epiphenomenon, a useless or even illusory feature of our biological makeup? In fact, consciousness supports a number of specific operations that cannot unfold unconsciously. Subliminal information is evanescent, but conscious information is stable — we can hang on to it for as long as we wish. Consciousness also compresses the incoming information, reducing an immense stream of sense data to a small set of carefully selected bite-size symbols. The sampled information can then be routed to another processing stage, allowing us to perform carefully controlled chains of operations, much like a serial computer. This broadcasting function of consciousness is essential. In humans, it is greatly enhanced by language, which lets us distribute our conscious thoughts across the social network. — excerpt from Page 89 of Dehaene [45]

%%% Sun Nov 25 04:53:54 PST 2018


Miscellaneous Loose Ends: There has been a good deal of controversy surrounding the use of functional Magnetic Resonance Imaging25 (fMRI) to infer brain function. Some of the controversy stems from the incorrect use of common measures of statistical significance in claiming conclusive research findings [109]. In the case of fMRI, it has also taken some time to understand how to apply the technology for brain imaging and interpret the resulting images. If you want to dive a little deeper, I suggest the first four chapters of Russell Poldrack's The New Mind Readers: What Neuroimaging Can and Cannot Reveal about Our Thoughts (EXCERPT) describing how cognitive neuroscience has harnessed the power of fMRI to yield new insights into human cognition [162].

%%% Thu Nov 29  4:03:57 PST 2018


While researching Douglas Hofstadter's work on analogy for an earlier entry in this log, I ran across his book "I Am A Strange Loop" [101] in which he explores the idea of consciousness and the concept of "self". In comparing his account of consciousness with that of Dehaene, Graziano, Dennett and others we've discussed in these notes, I find Hofstadter's account liberating in the way it avoids many of the earlier philosophical, psychological, biophysical and metaphysical conundrums that make this topic so confusing and fraught with controversy for the uninitiated. That said, I think some of you may find that this video retelling by Will Schoder entitled You Are A Strange Loop does Hofstadter's account one better and achieves its weight and clarity in a scant twenty minutes.

## November 21, 2018

%%% Wed Nov 21 03:42:65 PST 2018


As mentioned in an earlier note, Weston et al [207] have developed a set of tasks — Facebook's bAbI dataset — to test reading comprehension and question answering. These tasks require chaining facts, simple induction and deduction, all of which are required for analogical and relational modeling. In this entry, we consider several proposed approaches that have been evaluated on the bAbI dataset.

We've already discussed one such proposal: Daniel Johnson's Gated Graph Transformer Neural Network model [112] can learn to construct and modify graphs in sophisticated ways based on natural language input and has been shown to successfully learn to solve almost all of the bAbI tasks. In the same paper introducing bAbI, Weston et al [207] describe their proposal based on memory networks [208] and evaluate it on bAbI.

All the bAbI tasks [207] are generated using a simulator that operates like a text adventure game, as described here26. The authors implement five models including an LSTM-based model, a vanilla Memory Network (MemNN) model and a second MemNN model that depends on extensions to the original MemNN framework [208], plus two baseline models, one using an N-gram classifier and the other a structured SVM.

In the simpler MemNN, the controller "performs inference" over the stored memories that consist of the statements in a story or entries in a knowledge base. The simpler MemNN model performs two hops of inference: finding the first supporting fact with the maximum match score with the question, and then the second supporting fact with the maximum match score with both the question and the first fact that was found. The extended MemNN model uses a meta-controller trained to perform a variable number of hops rather than just two, as in the case of the simpler MemNN. Compare Table 3 in [207] with Table 2 in [112] and Table 1 in [118].
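The two-hop procedure can be sketched as follows. This is not the learned model from [207]: the embedding-based match score is replaced with simple word overlap, purely to illustrate the hop-by-hop control flow on a bAbI-style story, and the example facts are invented.

```python
def score(query_words, fact_words):
    """Stand-in match score: word overlap. The real MemNN learns an
    embedding-based scoring function instead."""
    return len(set(query_words) & set(fact_words))

def two_hop(question, facts):
    """Hop 1: best-matching fact for the question alone.
    Hop 2: best-matching remaining fact for question + first fact."""
    q = question.lower().split()
    f1 = max(facts, key=lambda f: score(q, f.lower().split()))
    rest = [f for f in facts if f is not f1]
    f2 = max(rest, key=lambda f: score(q + f1.lower().split(), f.lower().split()))
    return f1, f2

facts = [
    "john moved to the kitchen",
    "john picked up the apple",
    "sandra went to the garden",
]
f1, f2 = two_hop("where is the apple", facts)
print(f1)  # "john picked up the apple"
print(f2)  # "john moved to the kitchen"
```

The second hop finds the kitchen fact only because the first fact injected "john" into the query — the chaining of supporting facts is exactly what makes these tasks hard for single-shot retrieval.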

Kumar et al [118] introduce Dynamic Memory Networks (DMN) as a general framework for answering questions given a semantic memory consisting of general knowledge about concepts and facts, a set of inputs corresponding to recent history complementing the semantic memory, and an episodic memory that focuses on different parts of the input, updating its state, and finally generates an answer to the initially posed question27.

Figure 65:  The basic Key-Value Memory Network architecture developed for question answering and described in [138] is shown here. Several variants are described in the paper and additional variants are covered in Jason Weston's tutorial at ICML 2016. — the graphic shown here is reproduced from Figure 1 in [138]. The basic architecture is similar to a Neural Turing Machine with a key-value memory. Memory slots are designated as key-value pairs $$(k_{h_{1}}, v_{h_{1}})$$, ..., $$(k_{h_{N}}, v_{h_{N}})$$ and the target question is notated as $$x$$. Assume for simplicity that $$A$$, $$B$$ and $$K$$ are all $$d \times{} D$$ matrices and $$\Phi$$ is a feature map of dimension $$D$$. The initial memory addressing step uses the access query $$q_{1} = A\Phi_{X}(x)$$, and each memory slot is assigned a relevance probability $$p_{h_{i}} =$$ Softmax$$(q_{1} \cdot{} A\Phi_{K}(k_{h_{i}}))$$. The output of the first hop is $$o = \Sigma_{i} p_{h_{i}}A\Phi_{V}(v_{h_{i}})$$; note that relevance is computed over the keys, while the output is read from the corresponding values. Subsequent addressing steps use the updated access query $$q_{j+1} = R_{j}(q_{j} + o)$$ for $$1 \leq{} j \leq{} H$$, where $$H$$ is the maximum number of hops, and compute the relevance probability as $$p_{h_{i}} =$$ Softmax$$(q^{T}_{j+1} A\Phi_{K}(k_{h_{i}}))$$; the matrix $$B$$ embeds the candidate answers scored against the final query. Backpropagation and stochastic gradient descent are used to learn the matrices $$A$$, $$B$$, and $$R_{1}, \dots{}, R_{H}$$.

Jason Weston's tutorial on Memory Networks at ICML 2016 covers the original model described in [207] plus the following variants: Key-Value Memory Networks [138], End-to-End Memory Networks [185], Dynamic Memory Networks [211] and Weakly Supervised Memory Networks [186]. The Weston et al and Kumar et al approaches to solving the tasks in the bAbI dataset are similar to the contextual method of integrating episodic memory and attentional networks described in Figure 63. Figure 65 reproduced from Figure 1 in [138] provides some insight into the power of these methods.

At each step, the collected information from the memory is cumulatively added to the original target question to build the context for the next round of inference. In principle, the system can perform multiple inductive or deductive inferential steps, incorporating intermediate results as additional context, and adjusting the weight of each item in memory in anticipation of the next step. As per our earlier discussion about shaping context by recalling items from episodic memory — see Figure 52 — the challenge is to organize memory so that our encodings of the present circumstances provide the keys necessary to access relevant past experience.
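The iterative addressing loop just described, following the equations in the caption of Figure 65, can be rendered as a short numpy sketch. This is a forward pass only, with random untrained matrices and made-up dimensions, and the feature maps are taken as given vectors rather than computed from text; in the real model $$A$$ and the $$R_{j}$$ would be learned by backpropagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kv_memnn_read(phi_x, phi_k, phi_v, A, R, hops=2):
    """One forward pass of the key-value addressing loop:
    q1 = A phi(x); at each hop, p_i = softmax(q . A phi(k_i)),
    o = sum_i p_i A phi(v_i), then q_{j+1} = R_j (q_j + o)."""
    q = A @ phi_x                          # initial access query
    keys = (A @ phi_k.T).T                 # embedded memory keys
    vals = (A @ phi_v.T).T                 # embedded memory values
    for j in range(hops):
        p = softmax(keys @ q)              # relevance over memory slots
        o = p @ vals                       # weighted read of the values
        q = R[j] @ (q + o)                 # updated query for the next hop
    return q

# Toy dimensions: D raw features, d embedding dims, N memory slots, H hops.
rng = np.random.default_rng(0)
D, d, N, H = 10, 4, 6, 2
A = rng.normal(size=(d, D))
R = [rng.normal(size=(d, d)) for _ in range(H)]
phi_x = rng.normal(size=D)                 # feature map of the question
phi_k = rng.normal(size=(N, D))            # feature maps of the keys
phi_v = rng.normal(size=(N, D))            # feature maps of the values
q_final = kv_memnn_read(phi_x, phi_k, phi_v, A, R, hops=H)
print(q_final.shape)  # (4,)
```

Each pass through the loop folds the value read-out back into the query, which is the mechanism by which the context for the next round of inference is cumulatively built.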

Memory Network models have also been applied to dialogue management and demonstrated state-of-the-art performance on several challenging datasets — see "Learning End-to-End Goal-Oriented Dialog" from Bordes and Weston [24] and "Dialog-based Language Learning" from Weston [206], with credit for the performance gains touted in the latter assigned to a method for predictive lookahead that enables the system to "learn to answer questions correctly without any reward-based supervision at all." See Fisac et al [61] and Hamrick et al [90] for related models.

Miscellaneous Loose Ends: Andrade et al [8] propose using Memory Network models and analogy-based reasoning for future prediction, but the algorithmic details are sketchy and their evaluation unconvincing. Foster and Jones [64] suggest that analogical reasoning and reinforcement learning complement one another synergistically and describe a method for schema induction that is more conceptual than practical. This paper is primarily motivated by issues of interest to cognitive neuroscientists. If you are interested in pursuing this line of inquiry, Holyoak [104] surveys the related work on analogy and relational reasoning, and Gick and Holyoak [75] provide an overview of schema induction and analogical transfer from the perspective of cognitive psychology.

## November 19, 2018

%%% Mon Nov 19 4:55:47 PST 2018


I've been reading cognitive science papers on analogical and relational modeling trying to get a handle on what an architecture efficiently supporting this type of reasoning might look like. It seems inevitable that such an architecture will be a hybrid system combining aspects of connectionist and symbolic processing — at least if we are to leverage what is known about human analogical and relational reasoning:

Analogy is an inductive mechanism based on structured comparisons of mental representations. It is an important special case of role-based relational reasoning, in which inferences are generated on the basis of patterns of relational roles. Analogical reasoning is a complex process involving retrieval of structured knowledge from long-term memory, representing and manipulating role-filler bindings in working memory, identifying elements that play corresponding roles, generating new inferences, and learning abstract schemas. [...] Human analogical reasoning is heavily dependent on working memory and other executive functions supported by the prefrontal cortex, with the frontopolar subregion being selectively activated when multiple relations must be integrated to solve a problem. — excerpt from Page 234 of Holyoak [104] (PDF)

This description is certainly consistent with the discussion found in Battaglia et al [15] and conversations we have had with Randall O'Reilly. It is also clear that the underlying cognitive infrastructure that supports human-level analogical and relational reasoning is substantially more complicated than we have encountered so far in our discussion of the programmer's apprentice application. Most of the proposed solutions are either too simple to be taken seriously as engineering solutions or too baroque to easily integrate into existing architectures that have already been demonstrated to scale and provide solutions to other key problems.

In reading the literature, it soon becomes obvious that the term "analogy" has broad application and little agreement on what it means. In the context of multi-task learning, Lampinen et al [124] note that, "if you train a neural network to solve two identical tasks, using separate sets of inputs and outputs but sharing the hidden units, in some cases it will generate representations that reflect the analogy [...] leading to the ability to correctly make analogical inferences about items not explicitly taught" [168]. Transfer learning is often cast as a structural analogy problem and tackled by a wide range of methods including graph-based label propagation [95] and reproducing kernel Hilbert spaces [202]28.

Battaglia et al [15] use the term combinatorial generalization to describe one of the most important advantages of working with compositional models. The following excerpt from their paper — parts of which appear elsewhere in these notes — illustrates the basic idea and links it to analogy and relational structure:

Humans' capacity for combinatorial generalization depends critically on our cognitive mechanisms for representing structure and reasoning about relations. We represent complex systems as compositions of entities and their interactions. [...] We use hierarchies to abstract away from fine-grained differences, and capture more general commonalities between representations and behaviors, such as parts of an object, objects in a scene, neighborhoods in a town, and towns in a country. We solve novel problems by composing familiar skills and routines, for example traveling to a new location by composing familiar procedures and objectives, such as "travel by airplane", "to San Diego", "eat at", and "an Indian restaurant". We draw analogies by aligning the relational structure between two domains and drawing inferences about one based on corresponding knowledge about the other.

Kenneth Craik's The Nature of Explanation (1943) connects the compositional structure of the world to how our internal mental models are organized:

...[a human mental model] has a similar relation-structure to that of the process it imitates. By 'relation-structure' I do not mean some obscure non-physical entity which attends the model, but the fact that it is a working physical model which works in the same way as the process it parallels... physical reality is built up, apparently, from a few fundamental types of units whose properties determine many of the properties of the most complicated phenomena, and this seems to afford a sufficient explanation of the emergence of analogies between mechanisms and similarities of relation-structure among these combinations without the necessity of any theory of objective universals. — excerpt from Pages 51-55 in [35] (PDF)

That is, the world is compositional, or at least, we understand it in compositional terms. When learning, we either fit new knowledge into our existing structured representations, or adjust the structure itself to better accommodate (and make use of) the new and the old.

O'Reilly et al [155] point out, in their retrospective look at Fodor and Pylyshyn's [63] 1988 critique of connectionism, that combinatorial generalization is a challenge in connectionist models, but that it is possible to achieve a type of limited systematicity by using the same prefrontal-cortex / basal-ganglia circuitry [152] that they employ to explain executive function and rule-based reasoning in humans:

Despite a bias toward context-sensitivity, it is possible for simple neural networks to learn a basic form of combinatoriality — to simply learn to process a composite input pattern in terms of separable, independent parts. These models develop "slot-based" processing pathways that learn to treat each separable element separately and can thus generalize directly to novel combinations of elements. However, they are strongly constrained in that each processing slot must learn independently to process each of the separable elements, because as described above, neurons cannot communicate symbolically, and each set of synapses must learn everything on its own from the ground up. Thus, such systems must have experienced each item in each "slot" at least a few times to be able to process a novel combination of items. Furthermore, these dedicated processing slots become fixed architectural features of the network and cannot be replicated ad hoc — they are only applicable to well-learned forms of combinatorial processing with finite numbers of independent slots. In short, there are strong constraints on this form of combinatorial systematicity, which we can partially overcome through the PFC-based indirection mechanism described below. Nevertheless, even within these constraints, combinatorial generalization captures a core aspect of the kind of systematicity envisioned by [63], which manifests in many aspects of human behavior. — see Section 3.3 entitled "Combinatorial Generalization (Neocortex)" in [155] (PDF)

The take home message from these observations is that if we want to achieve the advantages of analogical and relational modeling in addition to the benefits of fully-differentiable deep neural networks, we must necessarily design hybrid systems that embrace a trade-off between context-sensitivity and combinatoriality [155]. The real challenge is to design hybrid systems that combine the best of symbolic and connectionist architectures. While such hybrids do not yet exist, there have been some interesting attempts to address the trade-off by using different variants of differentiable memory [78, 77, 208] that we'll look at in the next entry in this discussion list.

## November 17, 2018

%%% Sat Nov 17 03:48:40 PST 2018


Several researchers have suggested that the term graph networks distracts from the focus on relationships and the notion of a relational inductive bias as articulated in Battaglia et al [15]. Carlos Perez notes that "Hofstadter argues that the more common knowledge structuring mechanism known as categorization (or classification) is the same as the generation of analogies [which are essentially] relationships between concepts".

Anthony Repetto characterizes analogies in terms of "relating disparate inputs and processing them using the same heuristic". He understands analogies "as a form of compression, allowing the brain to simulate the dynamics of many different systems by allocating minimal resources to each system such that if two systems behave similarly, a single analogy can be used to describe both" — excerpted from [165].

Concerning relevant datasets, Weston et al [207] have developed a "set of proxy tasks to evaluate reading comprehension via question answering [...] that measure understanding in [terms of] whether a system is able to answer questions via chaining facts, simple induction, deduction [...] these tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human". The data is available here.

## November 11, 2018

%%% Sun Nov 11 03:33:52 PST 2018


In the previous entry, I suggested that Douglas Hofstadter's work on analogy might be relevant to Yoshua Bengio's paper on deep learning and cultural evolution [16]. Indeed, Hofstadter has written a good deal arguing that analogy is the core of cognition, starting with his popular Gödel, Escher, Bach: An Eternal Golden Braid published in 1979 [98]. Later, in his 2001 book chapter [100] entitled Epilogue: Analogy as the core of cognition, he set out the main premise and continued to develop it over the following decade. His presidential lecture at Stanford on February 6, 2006 and subsequent invited talk at the University of Illinois at Urbana-Champaign on September 14, 2006 provide accessible introductions to the basic ideas. Then in 2013, Hofstadter and his co-author, Emmanuel Sander, published a book-length exposition [103] arguing that analogy plays a key role in cognition. From the prologue:

In this book about thinking, analogies and concepts will play the starring role, for without concepts there can be no thought, and without analogies there can be no concepts. This is the thesis that we will develop and support throughout the book. What we mean by this thesis is that each concept in our mind owes its existence to a long succession of analogies made unconsciously over many years, initially giving birth to the concept and [then] continuing to enrich it over the course of our lifetime.

In the Stanford lecture and UIUC talk, Hofstadter's presentation is informal and dialectical as befits a talk for a general audience. At the end of the UIUC talk, the audience questions touch upon the question of what sort of computational abstraction would best support pervasive analogy as he imagines it. Hofstadter does not think that connectionist architectures are well suited to analogy since, according to him, they aren't able to create and manipulate large-scale structures — particularly hierarchical structures — and they lack a suitably powerful method of chunking, since he characterizes human thought as a "relentless, lifelong process of chunking — taking small concepts and putting them together into bigger and bigger ones"29. He points to his book Fluid Concepts and Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought for a discussion of computational models and their implementation details [99].

Miscellaneous Loose Ends: Two recent books worth reading, David Quammen's "The Tangled Tree" [163] on the origins of life and horizontal gene transfer and Carl Zimmer's "She Has Her Mother's Laugh" [215] on the history and future of heredity. I enjoyed both, but already knew much of the science and many of the anecdotes described in Zimmer's book. I really learned a lot from Quammen's book on horizontal gene transfer, endosymbiosis and the history concerning the many different manifestations of the "tree of life" concept. Quammen's writing reminds me of Horace Freeland Judson's The Eighth Day of Creation. The science is painstakingly researched and the stories of the scientists benefit from personal interviews and detail-oriented reporting. Among the many interesting examples of HGT covered in his book, Quammen chronicles the discovery of syncytin, a captive retroviral envelope protein that plays an important role in mammalian placental morphogenesis [32, 136].

## November 9, 2018

%%% Fri Nov  9  4:51:55 PST 2018


I was rereading Yoshua Bengio's consciousness prior paper [18] and noticed a reference [17] in the bibliography with the title "Deep learning and cultural evolution". The reference appears to be an invited talk that mentions a wide range of recent work but is most clearly summarized in [16] where it is organized in terms of a set of hypotheses reproduced in the following account.

Optimization Hypothesis: When the brain of a single biological agent learns without the aid of an auxiliary teacher, it performs an approximate optimization with respect to some endogenous objective function.

The paper starts by reviewing the analysis presented in [157] on the saddle point problem for non-convex optimization, noting that training a deep net from end-to-end is difficult but gets easier if an auxiliary training signal can be used to guide the training of the intermediate layers.

Local Descent Hypothesis: When the brain of a single biological agent learns, it relies on a form of approximate local descent by making small, gradual adjustments which on average tend to reduce the expected error.

Bengio mentions curriculum learning as one source of such auxiliary training signals [19] along with the application of unsupervised pre-training as a means of regularization [85, 58], but these papers are prologue to a more intriguing possibility concerning human intelligence.

Deeper the Harder Hypothesis: Higher-level abstractions in brains are represented by deeper computations going through more areas — localized brain regions — or more computational steps in sequence over the same areas.

It is claimed that a single human learner is unlikely to discover the high-level abstractions required to simplify learning by chance because these abstractions are represented by deep sub-networks in the brain.

Local Minima Hypothesis: The ability of any single human to learn is limited by the ability of the learner to determine if it is near to a local minimum — what Bengio refers to as effective local minima.

Bengio asks the reader to consider a hierarchy of gradually more complex features, constructing detectors for very abstract concepts that are activated whenever a stimulus within a very large set of possible input stimuli is presented. For a higher-level abstraction, this set of stimuli represents a highly-convoluted set of points, resulting in a highly curved manifold that is likely to be hard to learn.

Deep Abstractions Harder Hypothesis: A single human learner is unlikely to discover by chance the high-level abstractions necessary for efficient inference since these are represented by deep sub-networks in the brain.

Bengio suggests it may be possible for individuals to exchange information providing insight into the representations in the deep layers of a multi-layer network that could serve to simplify learning by altering the energy landscape. He points out how various forms of reference — e.g., the iconic, indexical and symbolic modes in C. S. Peirce's semiotic theory [159] — might facilitate such insight30.

Guided Learning Hypothesis: A human brain can much more easily learn high-level abstractions if guided by the signals produced by other humans that serve as hints or indirect supervision for these high-level abstractions31.

The intellectual bounty resulting from centuries of civilization and culture, including the efforts of a great many gifted savants and diligent scribes, provides anyone possessing the requisite interest and resolve with the opportunity to reconstruct the networks necessary to harness this bounty and exploit the combinatorial and indexical properties of language to "stand on the shoulders of giants" as Isaac Newton grudgingly acknowledged.

Memes Divide-and-Conquer Hypothesis: Language, individual learning and the recombination of memes32 constitute an efficient evolutionary recombination operator, and this gives rise to rapid search in the space of memes, that helps humans build up better high-level internal representations of their world.

Bengio opines that deep learning research suggests that "cultural evolution helps to collectively deal with a difficult optimization problem that single humans could not alone solve" and that moreover:

• This has social and political implications for organizing our societies towards maximum efficiency, thereby encouraging growth that engenders cultural wealth: brains that better understand the world around us.

• The implications for AI research suggest collections of learning agents building on each other's discoveries to build up towards higher-level abstractions — guiding computers just like we guide children.

As suggested earlier in these notes, the idea of natural language as a lingua franca for intra-cortical communication perversely appeals to me. Figure 5 in Bengio [16] alludes to this mode of communication, even allowing for the possibility of communication between brains, typically through natural language, in a way that provides suggestions as to how the fundamental structure or topology of the higher levels in the representation of a concept in one brain might be reproduced in the higher levels in the representation of the same concept in another brain.

Figure 64 provides a rendering of one possible method of using language to address the problem of discovering high-level abstractions of the sort that fundamentally alter the search space by revealing a relatively-smooth, low-dimensional manifold projection so as to accelerate learning concepts that would be difficult otherwise. In Figure 64 the learner observes an instance in support of the general claim: "all birds have wings". The teacher in this case either makes the claim explicit by uttering a suitable statement or underscores the implied relationship by pointing to the apparent instance, and identifying the related entities by their common names.

Figure 64:  This drawing depicts a graphical rendering of Yoshua Bengio's guided learning hypothesis [16]. The primary sensory areas responsible for speech (A) and vision (B) process the sensory input corresponding to, respectively, the sentence "all birds have wings" and the image of a bird. This results in the activation of several abstract conceptual representations, shown here as cartoon thought vectors corresponding to the concept of "wing" labeled C, the concept of "bird" labeled D, and the embedding of the sentence labeled E which is related to C and D by recurrent connections.

The auditory and visual stimuli are initially processed in the primary sensory areas — labeled A and B — identifying low-level spatially-localized features that are subsequently combined in several higher layers — labeled C, D and E — to construct increasingly abstract, spatially-diffuse composite representations that are eventually (reciprocally) connected in the association areas situated in the parietal, temporal and occipital lobes of the posterior portion of the cortex where they are integrated and made available to higher executive-function and memory systems in the prefrontal cortex.

The assumption is that our understanding of the words "wing" and "bird" is informed both by our highly contextualized understanding of the symbols as they occur in written and spoken language and our visual experience of corresponding instances encountered in nature. These two sources of meaning are related through the rich set of learned associations between the corresponding representations. The question is whether we could somehow exploit these associations to serve as a regularization term in learning a model encompassing the many ways that physically realized wings and birds relate to one another. The basic idea reminds me of the Deep Visual-Semantic Embedding (DEVISE) model developed by Frome et al [67] to leverage semantic similarity in order to improve image classification and enable a form of zero-shot learning.
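
To make the DEVISE idea concrete, here is a toy sketch of the hinge ranking loss Frome et al [67] use to pull an image embedding toward its label embedding and away from other labels; the two-dimensional vectors and the omission of the learned projection matrix are my own simplifications for illustration:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def devise_rank_loss(image_vec, true_label_vec, wrong_label_vecs, margin=0.1):
    # Hinge ranking loss: penalize any wrong label whose similarity to
    # the image embedding comes within `margin` of the true label's.
    s_true = dot(image_vec, true_label_vec)
    return sum(max(0.0, margin - s_true + dot(image_vec, w))
               for w in wrong_label_vecs)

# Toy embeddings: the image projects close to "bird" and far from "car",
# so the loss is zero at the default margin.
img, bird, car = [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]
loss = devise_rank_loss(img, bird, [car])
```

Because the loss is defined purely in terms of the label-embedding space, an unseen label with a sensible embedding can still be ranked, which is what enables the zero-shot behavior mentioned above.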

A more interesting challenge would be to explore how one might learn a predictive model of a dynamical system using a similar linguistic-embedding approach to simplify and accelerate learning. Perhaps the sort of "hint" that Yoshua has in mind might take the form of an analogy, e.g., using our commonsense understanding of personal relationships to learn a simple model of the physical interactions that occur between electrically charged particles. It might be worth asking Peter Battaglia and Jessica Hamrick if they know of any related work from their research on learning interaction networks to model dynamical systems [14] and exploiting a relational inductive bias in solving structured reasoning problems [89]. For your convenience, I've included a set of bibliographical references including the abstracts that relate to the ideas covered in this log entry33.

To work well, this approach would require an extensive semantic embedding space leveraging the rich latent semantic structure inherent in natural language [9123169]. Intuitively, taking advantage of someone telling you that "particle physics is like interpersonal relationships" would require an alignment of two very different domains. In principle, this might be done by having a knowledgeable teacher supply a naïve student with a narrative that includes excerpts like "particles are like individual persons", "attraction works when opposites become emotionally entangled", "attraction is subject to an inverse square law similar to gravitational force", etc.

The idea is to align the two models — one involving electrically charged particles and the other emotionally entangled individuals — so that the embedded relationships governing the latter provide a structural prior in guiding the formation of new networks explaining the former. Douglas Hofstadter's work on analogy is worth a look [103, 102, 98]. Daniel Johnson's paper on gated graph transformer networks mentioned earlier in these notes provides examples of how such graph networks can be trained to generate graphical models of the underlying dynamics from stories [112].

## November 5, 2018

%%% Mon Nov  5 03:55:14 PST 2018


Context is everything in language and problem solving. When we converse with someone or read a book we keep in mind what was said or written previously. When we attempt to understand what was said in a conversation or formulate what to say next we draw upon our short-term memories of earlier mentioned people and events, but we also draw upon our long-term episodic memories involving the people, places and events related to those explicitly mentioned in the conversation. In solving complex design problems, it is often necessary to keep in mind a large number of specific facts about the different components that go into the design as well as general knowledge pertaining to how those components might be adapted and assembled to produce the final product.

Much of a programmer's procedural knowledge about how to write code is baked into various cognitive subroutines that can be executed with minimal thinking. For example, writing a simple FOR loop in Python to iterate through a list is effortless for an experienced Python programmer, but may require careful thought for an analogous code block in a less familiar programming language like C++. In thinking about how the apprentice's knowledge of programming is organized in memory, routine tasks would likely be baked into value functions trained by reinforcement learning. When faced with a new challenge involving unfamiliar concepts or seldom used syntax, we often draw upon less structured knowledge stored in episodic memory. The apprentice uses this same strategy.

The neural network architecture for managing dialogue and writing code involves encoder-decoder pairs comprised of gated recurrent networks that are augmented with attention networks. We'll focus on dialogue to illustrate how context is handled in the process of ingesting (encoding) fragments of an ongoing conversation and generating (decoding) appropriate responses, but the basic architecture is similar for ingesting fragments of code and generating modified fragments that more closely match a specification. The basic architecture employs three attention networks, each of which is associated with a separate encoder network specialized to handle a different type of context. The outputs of the three attention networks are combined and then fed to a single decoder.

The (user response) encoder ingests the most recent utterance produced by the programmer and corresponds to the encoder associated with the encoder-decoder architectures used in machine translation and dialogue management. The (dialogue context) encoder ingests the N words prior to the last utterance. The (episodic memory) encoder ingests older dialogue selected from episodic memory. The attentional machinery responsible for the selection and active maintenance of relevant circuits in the global workspace (GWS) will likely notice and attend to every utterance produced by the programmer. Attentional focus and active maintenance of such circuits in the GWS will result in the corresponding thought vector being added to the NTM partition responsible for short-term memory.

The controller for the NTM partition responsible for short-term (active) memory then generates keys from the newly added thought vectors and transmits these keys to the controller of the NTM partition responsible for long-term (episodic) memory. The episodic memory controller uses these keys to select episodic memories relevant to the current discourse, combining the selected memories into a fixed-length composite thought vector that serves as input for the corresponding encoder. Figure 63 depicts the basic architecture showing only two of the three encoders and their associated attention networks, illustrating how the outputs of the attention networks are combined prior to being used by the decoder to generate the next word or words in the assistant's next utterance.

Figure 63:  In the programmer's assistant, the dialogue management and program-transformation systems are implemented using encoder-decoder sequence-to-sequence networks with attention. We adapt the pointer-generator network model developed by See et al [175] to combine and bring to bear contextual information from multiple sources including short- and long-term memory systems implemented as Neural Turing Machines as summarized in Figures 52 and 53. This graphic illustrates two out of the three contextual sources of information employed by the apprentice. Each source is encoded separately, the relevance of its constituent elements is represented as a probability distribution, and the resulting distributions are combined to guide the decoder in generating output.
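
As a concrete stand-in for the key-based lookup performed by the episodic memory controller, here is a minimal content-based addressing sketch: score each stored thought vector against the key and return a fixed-length convex combination of the memories. The function names and the softmax temperature are my own assumptions, not details of any of the cited architectures:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def retrieve_episodic(key, memories, temperature=0.5):
    # Softmax over key/memory similarities, then a convex combination
    # of the stored vectors: a fixed-length composite thought vector
    # suitable as input to the episodic-memory encoder.
    scores = [cosine(key, m) / temperature for m in memories]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(memories[0])
    return [sum(w * m[i] for w, m in zip(weights, memories))
            for i in range(dim)]
```

Lowering the temperature makes the lookup behave more like a hard nearest-neighbor read; raising it blends more memories into the composite.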

One step of the decoder could add zero, one, or more words, i.e., a phrase, to the current utterance under construction. Memories — both short- and long-term — are stored as thought vectors or as word sequences that can be used to reconstruct the original thought vectors, whether for embedding or for constructing composites by adding context or conditioning to emphasize relevant dimensions. The dialogue manager — a Q-function network trained by reinforcement learning — can also choose not to respond at all or could respond at some length, perhaps incorporating references to code, explanations for design choices and demonstrations showing the results of executing code in the IDE.

To control generation, we adapt the pointer-generator network framework developed by See et al for document summarization [175]. In the standard sequence-to-sequence machine-translation model, a weighted average of encoder states becomes the decoder state and attention is just the distribution of weights. In See et al, attention is simpler: instead of weighting input elements, it points at them probabilistically. It isn't necessary to use all the pointers; such networks can mark excerpts by pointing to their start and end constituents. We apply their approach here to digest and integrate contextual information originating from multiple sources.
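
Here is a toy sketch of the combination step: each context source yields an attention distribution over a shared set of positions, a mixture gate blends them, and the decoder can either sample from the blend or simply point at its mode. The gate values and distributions below are invented for illustration; in the pointer-generator model the gate is learned:

```python
def combine_attention(dists, mix):
    # Convex combination of per-source attention distributions;
    # `mix` is the gate deciding how much each source matters.
    assert abs(sum(mix) - 1.0) < 1e-9
    n = len(dists[0])
    return [sum(m * d[i] for m, d in zip(mix, dists)) for i in range(n)]

def point(dist):
    # Pointing = committing to the most probable position.
    return max(range(len(dist)), key=dist.__getitem__)

dialogue_ctx = [0.7, 0.1, 0.1, 0.1]   # from the dialogue-context encoder
episodic     = [0.1, 0.1, 0.1, 0.7]   # from the episodic-memory encoder
combined = combine_attention([dialogue_ctx, episodic], [0.7, 0.3])
```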

In humans, memory formation and consolidation involves several systems, multiple stages and can span hours, weeks or months depending on the stage and associated neural circuitry34. Our primary interest relates to the earliest stages of memory formation and the role of the hippocampus and the entorhinal region of the medial temporal lobe, along with several ancillary subcortical circuits including the basal ganglia (BG). Influenced by the work of O'Reilly and Frank [152], we focus on the function of the dentate gyrus (DG) in the hippocampal formation and encode thought vectors using a sparse, invertible mapping, thereby providing a high degree of pattern separation in encoding new information while avoiding interference with existing memories.

We finesse the details of what gets stored and when by simply storing everything. We could store the sparse representation provided by the DG, but prefer to use this probe as the key in a key-value pair in the NTM partition dedicated to episodic memory and store the raw data as the value. This means we have to reconstruct the original encoding produced when initially ingesting the text of an utterance. This is preferable for two reasons: (i) we need the words — or tokens of an abstract syntax tree in the case of ingesting code fragments — in order for the decoder to generate the apprentice's response, and (ii) the embeddings of the symbolic entities that constitute their meaning are likely to drift during ongoing training.
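
A minimal sketch of the store-everything scheme: a k-winners-take-all operation plays the role of the DG's sparse encoding, its output serves as the key, and the raw utterance is kept as the value. The class and parameter names are mine, and overlap counting stands in for whatever similarity the NTM controller actually computes:

```python
def dg_sparse_key(thought_vector, k=2):
    # k-winners-take-all: keep the indices of the k largest components,
    # a sparse key offering a crude form of pattern separation.
    order = sorted(range(len(thought_vector)),
                   key=lambda i: thought_vector[i], reverse=True)
    return frozenset(order[:k])

class EpisodicStore:
    def __init__(self):
        self.pairs = []                      # (sparse_key, raw_value)

    def write(self, vec, raw):
        self.pairs.append((dg_sparse_key(vec), raw))

    def read(self, vec):
        # Return the value whose key shares the most active units
        # with the probe generated from `vec`.
        probe = dg_sparse_key(vec)
        return max(self.pairs, key=lambda kv: len(kv[0] & probe))[1]
```

Storing the raw words as the value, as described above, lets the decoder re-ingest and re-encode them even after the embeddings have drifted during ongoing training.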

## October 27, 2018

%%% Sat Oct 27 04:34:51 PDT 2018


Graph Networks is a neural network framework for constructing, modifying and performing inference on differentiable encodings of graphical structures. Battaglia et al [15] describe Graph Networks as a "new building block for the AI toolkit with a strong relational inductive bias35, the graph network, which generalizes and extends various approaches for neural networks that operate on graphs" by constraining the rules governing the composition of entities and their relationships.

In related work, Li et al [128] describe a model they refer to as a Gated Graph Sequence Neural Network (GGS-NN) that operates on graph networks to produce sequences from graph-structured input. Johnson [112] introduced the Gated Graph Transformer Neural Network (GGT-NN), an extension of GGS-NNs that uses graph-structured data as an intermediate representation. The model can learn to construct and modify graphs in sophisticated ways based on textual input, and also to use the graphs to produce a variety of outputs. The Graph Network (GN) Block described in Section 3.2 of Battaglia et al [15] provides a similar set of capabilities.
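
The following is a minimal numerical rendering of one GN block update in the style of Section 3.2 of Battaglia et al [15]: update each edge from its endpoints, aggregate incoming edges per node, update the nodes, then update the global attribute. Scalar attributes and hand-written update functions stand in for the learned (typically neural-network) functions used in practice:

```python
def gn_block(nodes, edges, global_attr, edge_fn, node_fn, global_fn):
    # edges: list of (sender, receiver, attribute) triples.
    # 1. Per-edge update from edge, endpoint and global attributes.
    new_edges = [(s, r, edge_fn(e, nodes[s], nodes[r], global_attr))
                 for (s, r, e) in edges]
    # 2. Aggregate (sum) incoming edge attributes for each node.
    incoming = {i: 0.0 for i in range(len(nodes))}
    for (s, r, e) in new_edges:
        incoming[r] += e
    # 3. Per-node update from aggregate, node and global attributes.
    new_nodes = [node_fn(incoming[i], v, global_attr)
                 for i, v in enumerate(nodes)]
    # 4. Global update from aggregated edges, nodes and old global.
    new_global = global_fn(sum(e for (_, _, e) in new_edges),
                           sum(new_nodes), global_attr)
    return new_nodes, new_edges, new_global

# Toy example: two nodes, one edge; the update functions are simple sums.
nn, ne, ng = gn_block([1.0, 2.0], [(0, 1, 0.5)], 0.0,
                      lambda e, vs, vr, u: vs + vr,
                      lambda agg, v, u: v + agg,
                      lambda e_sum, v_sum, u: e_sum + v_sum + u)
```

Because the same three update functions are applied at every edge and node, the block is indifferent to graph size and ordering, which is the source of the relational inductive bias.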

The network shown in Figure 61 demonstrates how to package the five general transformations described in Johnson’s paper to provide a Swiss-army-knife utility that can be used to manipulate abstract syntax trees in code synthesis, simplifying the construction of the differentiable neural programs introduced earlier. This graph-networks utility could be integrated into a reinforcement-learning code synthesis module that would learn how to repair programs or perform other forms of synthesis by learning how to predict the best alterations to the program under construction. The Graph Network Block provides many of the same operations.

Figure 61:  The above graphic depicts a utility module that takes a graph in the Graph Networks representation and a command corresponding to one of the transformations described in [112], carries out the indicated transformation and produces the transformed graph in a recurrent output layer. See the definition of Graph Network Block in Section 3.2 of Battaglia et al [15] for an alternative formulation.

The imagination-based planning (IBP) for reinforcement learning framework [158] serves as an example for how the code synthesis module might be implemented. The IBP architecture combines three separate adaptive components: (a) the CONTROLLER + MEMORY system maps a state s ∈ S and history h ∈ H to an action a ∈ A; (b) the MANAGER maps a history h ∈ H to a route u ∈ U that determines whether the system performs an action in the COMPUTE environment, e.g., single-step the program in the FIDE, or performs an imagination step, e.g., generates a proposal for modifying the existing code under construction; and (c) the IMAGINATION MODEL is a form of dynamical systems model that maps a pair consisting of a state s ∈ S and an action a ∈ A to an imagined next state s′ ∈ S and scalar-valued reward r ∈ R.

The IMAGINATION MODEL is implemented as an interaction network [14] that could also be represented using the graph-networks framework introduced here. The three components are trained by three distinct, concurrent, on-policy training loops. The IBP framework shown in Figure 62 allows code synthesis to alternate between exploiting by modifying and running code, and exploring by using the model to investigate and analyze what would happen if you actually did act. The MANAGER chooses whether to execute a command or predict (imagine) its result and can generate any number of trajectories to produce a tree ht of imagined results. The CONTROLLER takes this tree plus the compiled history and chooses an action (command) to carry out in the FIDE.
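
Here is a deliberately tiny caricature of that loop, reduced to a single imagination step per candidate action. The dynamics model and reward below are invented for illustration; in IBP all three components are learned:

```python
def imagination_model(state, action):
    # Stand-in dynamics model: imagined next state and reward.
    # (In IBP this role is played by a learned interaction network [14].)
    nxt = state + action
    return nxt, -abs(10 - nxt)      # reward peaks when the state reaches 10

def plan_step(state, actions):
    # MANAGER: route every candidate action through an imagination step.
    # CONTROLLER: execute the action with the best imagined reward.
    imagined = [(imagination_model(state, a)[1], a) for a in actions]
    return max(imagined)[1]

best = plan_step(8, [1, 2, 3])      # imagination favors the action landing on 10
```

The real framework is richer: the MANAGER may chain many imagination steps into a tree of hypothetical futures before the CONTROLLER commits to an action in the FIDE.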

Figure 62:  The above graphic illustrates how we might adapt the imagination-based planning (IBP) for reinforcement learning framework [158] for use as the core of the apprentice code synthesis module. Actions in this case correspond to transformations of the program under development. States incorporate the history of the evolving partial program. Imagination consists of exploring sequences of program transformations.
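The act-versus-imagine loop described above can be sketched in a few lines. This is a toy illustration, not the trained networks of [158]: the routing rule, dynamics model and controller policy below are all invented stand-ins for the learned MANAGER, IMAGINATION MODEL and CONTROLLER.

```python
import random

def manager(history):
    # Stand-in routing policy: keep imagining until three rollouts exist, then act.
    return "imagine" if len(history["imagined"]) < 3 else "act"

def imagination_model(state, action):
    # Stand-in dynamics model mapping (state, action) to (next state, reward).
    return state + action, 1.0

def controller(state, history):
    # Stand-in policy: choose the action whose imagined outcome scored best.
    best = max(history["imagined"], key=lambda entry: entry[2])
    return best[1]

def ibp_step(state):
    history = {"imagined": []}
    while manager(history) == "imagine":
        action = random.choice([-1, 0, 1])           # a proposed program edit
        nxt, reward = imagination_model(state, action)
        history["imagined"].append((nxt, action, reward))
    return controller(state, history)                # command executed for real

action = ibp_step(0)
```

In the apprentice setting the integers standing in for actions would be replaced by program transformations, and the imagined next states by partial-program embeddings.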

## October 25, 2018

%%% Thu Oct 25 03:21:56 PDT 2018


Graph Networks [15] is a natural fit for encoding AST representations, so I came up with a simple utility module for exercising Daniel Johnson's Gated Graph Transformer Neural Networks framework [112]. In reading and thinking about the Weber et al imagination-augmented agent architecture [205], it seemed a good framework for leveraging some of Rishabh's suggested workflows, so I added that to the mix for my talk at Stanford next week.

If you're not familiar with this work, Oriol Vinyals did a great job explaining both the Weber et al paper and the DeepMind imagination-based planner framework in his lecture for my class, and Anusha Nagabandi and Gregory Kahn at Berkeley applied the framework in this robotics project.

I read Battaglia et al twice in the last few days. There are a bunch of reasons why, but the most concrete came up when I started thinking about apprentice workflows for various programming tasks, which got me thinking about the ecosystem in which the apprentice operates. Reading Battaglia et al got me interested in the file system and version control system and how these might be implemented. It struck me just how important these technologies are and how much we take them for granted.

It's not exactly rocket science, but I also found it useful to think about how they might work in the case of the apprentice, given the requirement for maintaining multiple representations. Once you have this puzzled out, it is a lot easier to imagine different workflows that serve different use cases. The idea is to design the ecosystem with these different use cases in mind and put it to the test by imagining how the corresponding workflows would actually work.

## October 23, 2018

%%% Tue Oct 23 04:46:06 PDT 2018


One reason that researchers in the 1980s focused primarily on symbolic systems is that the alternative was impractical. We had neither the computing power nor the access to training data that has fueled the recent renaissance in connectionist models. As Battaglia et al [15] note, we have become dependent on these resources and that dependence is impeding our progress on a number of critical dimensions.

Most deep learning models are dominated by their connectionist components, with symbolic components, such as Neural Turing Machines, being clumsily bolted on the side. In the next generation of systems, we can expect the connectionist and symbolic components to be on a more equal footing. In thinking about how to implement some new capability, it will be possible to choose between these two paradigms or some hybrid of the two.

Biological brains currently don’t have such flexibility given the degree to which natural selection has exploited the advantages of highly-parallel distributed representations. While this could change in principle, at least for now, we don't understand how to directly interface the human brain to a laptop and its file system in order to enhance our ability to efficiently store and reliably access large amounts of information.

There are practical problems that come up in purely connectionist approaches to code synthesis such as the organizational structure of modern hierarchical file systems. File systems, blackboard models and theorem provers are but a few of the technologies that complement connectionist models. A simple way to extend a rule-based system is to add rules that call programs using variable bindings to pass arguments and return results.
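As a toy illustration of that last point, here is a minimal sketch of a rule whose action calls an ordinary Python function, with the matched fact supplying the variable bindings. The salary facts and the `raise_salary` routine are invented purely for the example.

```python
# Working memory: (attribute, entity) -> value
facts = {("salary", "alice"): 50000, ("salary", "bob"): 60000}

def raise_salary(amount, person):
    # The external "program" invoked by the rule via bound variables.
    return facts[("salary", person)] + amount

rules = [
    # (condition on a fact key, action that calls a program and stores the result)
    (lambda key: key[0] == "salary",
     lambda key: facts.__setitem__(key, raise_salary(1000, key[1]))),
]

# Forward-chain once over working memory, firing every rule that matches.
for key in list(facts):
    for condition, action in rules:
        if condition(key):
            action(key)
```

The rule's right-hand side passes the bound entity to the program and writes the returned value back into working memory, which is all the integration the rule engine needs to know about.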

With the advent of Differentiable Neural Computers, it's become easier to harness more conventional computing resources from within a connectionist system using an appropriate interface and get the best characteristics of both paradigms. Many of the information-managing tools we rely upon depend on trees and graphs for their implementation, e.g., taxonomies, dictionaries, calendars, hierarchical task networks, PERT and GANTT charts.

When we think about a biological system managing its activity much of the relevant information will be stored in its episodic memory, but the apprentice will also benefit from having explicit descriptions of plans represented within a connectionist data structure to simplify keeping track of what it has to do — this applies to everything from managing multiple topics in conversation to keeping up with pending code reviews.

We can ingest programs by representing them as vectors in an embedding space and subsequently search for similar programs using nearest-neighbor search, maintaining a lookup table that indexes programs by their embedding vectors. This allows us to maintain programs in two formats and quickly alternate between the representations. The file system is itself a tree that the apprentice shares with the programmer.
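A toy sketch of that two-format scheme, with a trivial token-count "embedding" standing in for a learned program encoder:

```python
import math

def embed(source):
    # Stand-in for a learned program encoder: crude token-count features.
    tokens = source.split()
    return (tokens.count("def"), tokens.count("return"), len(tokens))

def nearest(query_vec, table):
    # Nearest-neighbor search over the lookup table keyed by embedding vectors.
    return min(table.items(), key=lambda kv: math.dist(kv[0], query_vec))[1]

programs = ["def f(x): return x", "def g(x): return x + 1", "x = 1"]
table = {embed(p): p for p in programs}     # embedding vector -> source text

similar = nearest(embed("def h(y): return y * 2"), table)
```

The lookup table is what lets the system move back from the vector representation to the source text; in a real system the table would be an approximate-nearest-neighbor index rather than an exact scan.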

Battaglia et al [15] mention that graph networks have limitations that may restrict their application in reasoning about conventional computer programs:

More generally, while graphs are a powerful way of representing structured information, they have limits. For example, notions like recursion, control flow, and conditional iteration are not straightforward to represent with graphs, and, minimally, require additional assumptions (e.g., in interpreting abstract syntax trees). Programs and more "computer-like" processing can offer greater representational and computational expressivity with respect to these notions, and some have argued they are an important component of human cognition. — excerpt from Page 23 in [15]

I expect it will be relatively easy to extend graph networks or use the basic framework as is for one component of a more expressive representational framework that handles computer programs. In particular, graph networks will make it relatively easy to represent — as opposed to interpret — programs as abstract syntax trees. The DNP approach illustrated in Figure 54 can dispense with the need for NTM program memory, while continuing to rely on an NTM to implement the highly dynamic call stack. Moreover, graph networks will considerably simplify the manipulation of differentiable neural programs in code synthesis, as well as simplify the design of a DNP emulator.

The generality of graph networks will certainly make it easier to create new abstractions, and I can easily imagine applications for graph algorithms like Dijkstra's shortest path, Floyd–Warshall shortest paths in a weighted graph, graph homomorphism and isomorphism algorithms, etc. The DeepMind graph-nets Python library will also make it much easier for ML researchers to build graph-based abstractions on a solid foundation and then easily share and compare the results. I'm less sanguine about the extent to which the graph-networks representation will facilitate analogical reasoning and mental models, but reserve judgment until I've read the cited papers [12219135].

Battaglia et al mention Kenneth Craik [35] writing in 1943 on how the compositional structure of the world relates to how internal mental models are organized — see this 1983 retrospective review [120] of Craik's The Nature of Explanation for a short synopsis (PDF) and the following excerpt that was quoted in the introduction of Battaglia et al [15]:

[A human mental model] has a similar relation-structure to that of the process it imitates. By 'relation-structure' I do not mean some obscure non-physical entity which attends the model, but the fact that it is a working physical model which works in the same way as the process it parallels [...] physical reality is built up, apparently, from a few fundamental types of units whose properties determine many of the properties of the most complicated phenomena, and this seems to afford a sufficient explanation of the emergence of analogies between mechanisms and similarities of relation-structure among these combinations without the necessity of any theory of objective universals. — excerpt from Pages 51-55 in [35] (PDF)

Miscellaneous Loose Ends: Sean Carroll interviews David Poeppel [97] on the topic of "Thought, Language and How to Understand the Brain" in Episode 15 of Carroll's Mindscape podcast. Poeppel is the Director at the Max Planck Institute for Empirical Aesthetics, which sounds somewhat odd but makes more sense when you hear Poeppel explain his research program early in the Carroll interview. The YouTube link is set to start 48 minutes into the podcast for those of you who understand the basic neurophysiology and have some familiarity with the dual-stream model of visual processing, since Poeppel will focus on the somewhat more controversial dual-stream model of speech processing in the interview.

Apropos of Poeppel's discussion, language production has been linked to a region in the frontal lobe of the dominant hemisphere since Pierre Paul Broca (1824-1880) reported language-related impairments in this region in two patients in a paper published in 1861. In 1874, Karl Wernicke (1848-1905) suggested a connection between the left posterior region of the superior temporal gyrus and the reflexive mimicking of words and their syllables leading some neurophysiologists to hypothesize that these areas are linked to language comprehension. It is interesting to note that these sketchy theories have prevailed to this day — nearly 150 years — despite their meager supporting evidence.

## October 21, 2018

%%% Sun Oct 21 03:47:08 PDT 2018


I read Li et al [128] and Johnson [112] this morning, in that order. My original impetus was to catch up on some of the background references before tackling Battaglia et al [15], the latest version — 3 — of which was just released — 17 October — on arXiv and promises in the abstract to generalize and extend various approaches for neural networks that operate on graphs. Rishabh Singh suggested that the representations and tools described in these three papers might provide a good foundation for implementing Differentiable Neural Programs — see Figure 54 — and their corresponding emulator architecture — see Figure 55. I like the clarifying graphics in Johnson's paper and his presentation at ICLR (PDF). In the following, I've included three of the figures in Johnson [112] to simplify my own brief account in these discussion notes.

Li et al [128] use the general term graph nets to refer to neural network representations of graphs. The nomenclature will eventually get sorted out — perhaps with the recent Battaglia et al paper providing the definitive reference [15]. We've already mentioned some earlier work on graph networks in these notes including [1174256], but the current crop of papers mentioned in the previous paragraph attempt to tackle the general problem of learning generative models of graphs from a dataset of graphs of interest. To facilitate research and technology development on graph networks, researchers at DeepMind have provided the research community with a Python library for experimenting with graph networks in Tensorflow and Sonnet.

Figure 58:  Diagram of the differentiable encoding of a graphical structure, as described in Section 3 of Johnson [112]. On the left, the desired graph we wish to represent, in which there are 6 node types (shown as blue, purple, red, orange, green, and yellow) and two edge types (shown as blue/solid and red/dashed). Node 3 and the edge between nodes 6 and 7 have a low strength. On the right, depictions of the node and edge matrices: annotations, strengths, state, and connectivity correspond to x_v, s_v, h_v, and C, respectively. Saturation represents the value in each cell, where white represents 0, and fully saturated represents 1. Note that each node’s annotation only has a single nonzero entry, corresponding to each node having a single well-defined type, with the exception of node 3, which has an annotation that does not correspond to a single type. State vectors are shaded arbitrarily to indicate that they can store network-determined data. The edge connectivity matrix C is three dimensional, indicated by stacking the blue-edge cell on top of the red-edge cell for a given source-destination pair. Also notice the low strength for cell 3 in the strength vector and for the edge between node 6 and node 7 in the connectivity matrix — adapted from Figure 1 in Johnson [112].

Figure 58 from Johnson [112] depicts the various components that comprise a differentiable encoding of a graphical structure. A similar representation is provided in Section 3.2 of Li et al [128]. What's important to note here is that these graph representations are expressive and straightforward to manipulate. Li et al [128] go on to describe how to learn and represent unconditional or conditional densities on a space of graphs given a representative sample of graphs, whereas Johnson [112] is primarily interested in using graphs as intermediate representations in reasoning tasks. As an example of the former, Li et al [128] describe how to create and sample from conditional generative graph models.
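To make the encoding concrete, the components of Figure 58 can be written down as plain NumPy arrays. The dimensions and particular entries below are chosen arbitrarily for illustration; a real model would learn the states and manipulate everything differentiably.

```python
import numpy as np

num_nodes, num_types, num_edge_types, state_dim = 4, 3, 2, 8

# Annotations x_v: one-hot node types, except an ambiguous node 3.
x = np.zeros((num_nodes, num_types))
x[0, 0] = x[1, 1] = x[2, 2] = 1.0
x[3] = [0.5, 0.5, 0.0]                 # no single well-defined type

# Strengths s_v in [0, 1]: node 3 is only weakly present, as in the figure.
s = np.array([1.0, 1.0, 1.0, 0.3])

# States h_v: network-determined data, initialized to zero here.
h = np.zeros((num_nodes, state_dim))

# Connectivity C: (source, destination, edge type); values are edge strengths.
C = np.zeros((num_nodes, num_nodes, num_edge_types))
C[0, 1, 0] = 1.0                       # strong edge 0 -> 1 of the first type
C[1, 2, 1] = 0.2                       # weak edge 1 -> 2 of the second type
```

Because everything is a real-valued array, nodes and edges can exist "partially", which is what makes the representation differentiable and hence trainable end to end.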

Figure 59:  Summary of the graph transformations. Input and output are represented as gray squares. (a) Node addition (T_add), where the input is used by a recurrent network (white box) to produce new nodes, of varying annotations and strengths. (b) Node state update (T_h), where each node receives input (dashed line) and updates its internal state. (c) Edge update (T_C), where each existing edge (colored) and potential edge (dashed) is added or removed according to the input and states of the adjacent nodes (depicted as solid arrows meeting at circles on each edge). (d) Propagation (T_prop), where nodes exchange information along the current edges, and update their states. (e) Aggregation (T_repr), where a single representation is created using an attention mechanism, by summing information from all nodes weighted by relevance (with weights shown by saturation of arrows) — adapted from Figure 2 in Johnson [112].

In service to the latter — performing reasoning tasks — Johnson [112] demonstrates how his graph structure model can be made fully differentiable, and provides a set of graph transformations that can be applied to a graph structure to transform it in various ways. For example, the propagation transformation, T_prop, allows nodes to trade information across the existing edges and then update their internal states on the basis of the information received. Figure 59 summarizes the five differentiable graph transformations described in Johnson [112] and Figure 60 describes the set of operations performed for each class of transformations.

Figure 60:  Diagram of the operations performed for each class of transformation. Graph state is shown in the format given by Figure 58. Input and output are shown as gray boxes. Black dots represent concatenation, and + and × represent addition and multiplication, respectively. The notation 1 − # represents taking the input value and subtracting it from 1. Note that for simplicity, operations are only shown for single nodes or edges, although the operations act on all nodes and edges in parallel. In particular, the propagation section focuses on information sent and received by the first node only. In that section the strengths of the edges in the connectivity matrix determine what information is sent to each of the other nodes. Light gray connections indicate the value zero, corresponding to situations where a given edge is not present — adapted from Figure 4 in Johnson [112].
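A minimal sketch of the propagation transformation T_prop over the matrix encoding: each node sends its state along outgoing edges weighted by edge strength, and each destination updates its state from the aggregated messages. The averaging update below is an assumption made for brevity; Johnson [112] uses learned, GRU-style updates instead.

```python
import numpy as np

def propagate(h, C):
    # h: (nodes, dim) node states; C: (src, dst, edge_type) edge strengths.
    weights = C.sum(axis=2)            # collapse edge types: (src, dst)
    messages = weights.T @ h           # each destination sums sender states
    return 0.5 * h + 0.5 * messages    # toy state-update rule

h = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])
C = np.zeros((3, 3, 1))
C[0, 1, 0] = 1.0                       # single edge 0 -> 1
h2 = propagate(h, C)                   # node 1 now carries half of node 0's state
```

Note that every step is a differentiable array operation, so gradients flow through the message passing to whatever produced `h` and `C`.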

## October 19, 2018

%%% Fri Oct 19 06:16:58 PDT 2018


The ability to maintain an appropriate context by drawing upon episodic memory in order to deal with long-term dependencies in deep learning models is an important requirement for conversational agents. Young et al [213] introduce a method based on reservoir sampling focusing on the ability to easily assign credit to a state when the retrieved information is determined to be useful. Their approach maintains a fixed number of past states and "preferentially remember[s] those states [that] are found to be useful to recall later on. Critically this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. [...] At each time step a single item S_t^i is drawn from the memory to condition the policy according to:

P(m_t = S_t^i) = exp(q(S_t)_i / τ) / Σ_j exp(q(S_t)_j / τ)     (1)

where τ is a positive learnable temperature parameter [and] the query network q(S_t) outputs a vector of size equal to the number of states. The state, m_t, selected from memory is given as input to the policy network along with the current state, both of which condition the resulting policy" — see Figure 56 for more detail. The underlying sampling problem has several efficient algorithms including one by Jeff Vitter [201] that is constant space and O(n(1 + log(N/n))) expected time, which is optimal up to a constant. For practical purposes, the approach described in Young et al [213] might work as well as the DNC solutions [78] we considered here.
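The selection rule amounts to drawing one memory index from a temperature-controlled softmax over the query network's scores. A minimal sketch, with toy scores standing in for the output of q(S_t):

```python
import numpy as np

def select_memory(scores, tau, rng):
    # Draw the index of one stored state with probability softmax(scores / tau).
    p = np.exp(scores / tau)
    p = p / p.sum()
    return rng.choice(len(scores), p=p)

rng = np.random.default_rng(0)
scores = np.array([0.1, 2.0, 0.3])    # one relevance score per stored state
idx = select_memory(scores, tau=0.5, rng=rng)
```

Lowering τ sharpens the distribution toward the single best-matching state; raising it makes recall more exploratory, and because τ is learnable the model can tune this itself.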

Figure 56:  Episodic memory architecture, each gray circle represents a neural network module. Input state (S) is given separately to the query (q), write (w), value (V) and policy (π) networks at each time step. The query network outputs a vector of size equal to the input state size which is used (via Equation 1) to choose a past state from the memory (m1, m2 or m3 in the above diagram) to condition the policy. The write network assigns a weight to each new state determining how likely it is to stay in memory. The policy network assigns probabilities to each action conditioned on current state and recalled state. The value network estimates expected return (value) from the current state — adapted from Young et al [213]

Figure 57 represents a group of papers relating to managing context for dialogue management that came up in a recent search for related work. The Hybrid Code Networks described in Williams et al [209] reminded me of early assistant architectures in that the approach identifies a collection of relevant content sources that can be deployed to answer a particular class of queries and then integrates and optimizes them with supervised learning, reinforcement learning or a mixture of both to accelerate training.

Su et al [182] is another approach to optimizing a dialogue policy via reinforcement learning in cases where reward signal is unreliable and it is too costly to pre-train a task success-predictor off-line. The authors "propose an on-line learning framework whereby the dialogue policy is jointly trained alongside the reward model via active learning with a Gaussian process model. This Gaussian process operates on a continuous space dialogue representation generated in an unsupervised fashion using a recurrent neural network encoder-decoder." I found the paper interesting primarily for its use of Gaussian processes.

Sordoni et al [181] describe a model architecture for generative context-aware query suggestion called Hierarchical Recurrent Encoder-Decoder (HRED). Their model is similar to our Contextual LSTM (CLSTM) model for large scale NLP tasks [73] which was originally conceived as a possible solution to the problem of maintaining a suitable context for dialogue management. The paper by Serban et al [176] came out a year later focusing on the dialogue management system and using the HRED architecture. They evaluate their approach on a large corpus of dialogue from movies and compare favorably with existing state-of-the-art methods. Truong et al [195] describe a more complicated system employing the HRED architecture, but the paper also describes the Modular Architecture for Conversational Agents (MACA) — a framework for rapid prototyping and plug-and-play modular design.

The paper by See et al [175] describes an extension of pointer networks, building on the work of Vinyals et al [199], to apply sequence-to-sequence attentional models to abstractive text summarization. The paper is relevant as it offers yet another way to selectively manage context that applies to a wide range of applications by enabling the decoder to select from multiple sources for word selection, accounting for both recency and meaning. Young et al [213], which we discussed earlier, is represented here as yet another practical tool for achieving scale.

Figure 57:  Here is a sample of recent papers relating to managing conversational context for dialogue management that came up in a recent search for related work. The papers are summarized briefly in the text proper.

Battaglia et al [15] argue that combinatorial generalization is necessary for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. As O'Reilly and Botvinick have pointed out, the context sensitivity of highly-parallel connectionist approaches contrasts with the systematic, combinatorial nature of largely-serial symbolic systems; Battaglia et al underscore the importance of the latter in distinguishing human intelligence. The paper promises to review and unify related work including two important papers, one by Li et al [128] on learning deep generative models of graphs, and a second by Daniel Johnson [112] that builds on Li et al by describing a set of differentiable graph transformations and then applying them to build a model with internal state that can extract structured data from text and use it to answer queries.

## October 17, 2018

%%% Wed Oct 17 04:47:11 PDT 2018


Our objective in developing systems that incorporate characteristics of human intelligence is threefold: First, humans provide a compelling solution to the problem of building intelligent systems that we can use as a basic blueprint and then improve upon. Second, the resulting AI systems are likely to be well suited to developing assistants that complement and extend human intelligence while operating in a manner comprehensible to human understanding. Finally, cognitive and systems neuroscience provide clues to engineers interested in exploiting what we know concerning how humans think about and solve problems. In this appendix, we demonstrate one attempt to concretely realize what we've learned from these disciplines in an architecture constructed from off-the-shelf neural networks.

The programmer's apprentice relies on multiple sources of input, including dialogue in the form of text utterances, visual information from an editor buffer shared by the programmer and apprentice and information from a specially instrumented integrated development environment designed for analyzing, writing and debugging code adapted to interface seamlessly with the apprentice. This input is processed by a collection of neural networks modeled after the primary sensory areas in the primate brain. The output of these networks feeds into a hierarchy of additional networks corresponding to uni-modal secondary and multi-modal association areas that produce increasingly abstract representations as one ascends the hierarchy — see Figure 50.

Figure 50:  The architecture of the apprentice sensory cortex including the layers corresponding to abstract, multi-modal representations handled by the association areas can be realized as a multi-layer hierarchical neural network model consisting of standard neural network components whose local architecture is primarily determined by the sensory modality involved. This graphic depicts these components as encapsulated in thought bubbles of the sort often employed in cartoons to indicate what some cartoon character is thinking. Analogously, the technical term "thought vector" is used to refer to the activation state of the output layer of such a component.

Stanislas Dehaene and his colleagues at the Collège de France in Paris developed a computational model of consciousness that provides a practical framework, sufficiently detailed for much of what an engineer might care about in designing digital assistants [45]. Dehaene’s work extends the Global Workspace Theory of Bernard Baars [12]. Dehaene’s version of the theory combined with Yoshua Bengio’s concept of a consciousness prior and deep reinforcement learning [140, 144] suggests a model for constructing and maintaining the cognitive states that arise and persist during complex problem solving [18].

Global Workspace Theory accounts for both conscious and unconscious thought with the primary distinction for our purpose being that the former has been selected for attention and the latter has not been so selected. Sensory data arrives at the periphery of the organism. The data is initially processed in the primary sensory areas located in posterior cortex, propagates forward and is further processed in increasingly-abstract multi-modal association areas. Even as information flows forward toward the front of the brain, the results of abstract computations performed in the association areas are fed back toward the primary sensory cortex. This basic pattern of activity is common in all mammals.

The human brain has evolved to handle language. In particular, humans have a large frontal cortex that includes machinery responsible for conscious awareness. That machinery depends on an extensive network of specialized neurons called spindle cells spanning a large portion of the posterior cortex, allowing circuits in the frontal cortex to sense relevant activity throughout this area and then manage it by creating and maintaining the persistent state vectors that are necessary when inventing extended narratives or working on complex problems that require juggling many component concepts at once. Figure 51 suggests a neural architecture combining the idea of a global workspace with that of an attentional system for identifying relevant input.

Figure 51:  The basic capabilities required to support conscious awareness can be realized in a relatively simple computational architecture that represents the apprentice’s global workspace and incorporates a model of attention that surveys activity throughout somatosensory and motor cortex, identifies the activity relevant to the current focus of attention and then maintains this state of activity so that it can readily be utilized in problem solving. In the case of the apprentice, new information is ingested into the model at the system interface, including dialog in the form of text, visual information in the form of editor screen images, and a collection of programming-related signals originating from a suite of software development tools. Single-modality sensory information feeds into multi-modal association areas to create rich abstract representations. Attentional networks in the prefrontal cortex take as input activations occurring throughout the posterior cortex. These networks are trained by reinforcement learning to identify areas of activity worth attending to and the learned policy selects a set of these areas to attend to and sustain. This attentional process is guided by a prior that prefers low-dimensional thought vectors corresponding to statements about the world that are either true, highly probable or very useful for making decisions. Humans can sustain only a few such activations at a time. The apprentice need not be so constrained.
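One crude way to picture the selection step in Figure 51: score each candidate activation with a relevance function and sustain only the top-k in the workspace. The fixed linear scorer below is an invented stand-in for the reinforcement-learned attentional policy, and k plays the role of the (for the apprentice, relaxable) human capacity limit.

```python
import numpy as np

def attend(activations, w, k=2):
    # Score each candidate thought vector and keep the k most relevant.
    scores = activations @ w
    keep = np.argsort(scores)[-k:]
    return sorted(keep.tolist())

candidates = np.array([[0.1, 0.0],    # activations from posterior areas
                       [0.9, 0.2],
                       [0.0, 0.8],
                       [0.3, 0.2]])
w = np.array([1.0, 1.0])              # stand-in for a learned Q-network
workspace = attend(candidates, w)     # indices of sustained activations
```

The consciousness-prior idea would additionally bias the scorer toward low-dimensional candidates that encode crisp, decision-relevant statements rather than raw sensory detail.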

Fundamental to our understanding of human cognition is the essential tradeoff between fast, highly-parallel, context-sensitive, distributed connectionist-style computations and slow, serial, systematic, combinatorial symbolic computations. In developing the programmer's apprentice, symbolic computations of the sort common in conventional computing are realized using extensions that provide a differentiable interface to conventional memory and information processing hardware and software. Such interfaces include the Neural Turing Machine [77] (NTM), Memory Network Model [208, 186] and Differentiable Neural Computer [78] (DNC).

The global workspace summarizes recent experience in terms of sensory input, its integration, abstraction and inferred relevance to the context in which the underlying information was acquired. To exploit the knowledge encapsulated in such experience, the apprentice must identify and make available relevant experience. The apprentice’s experiential knowledge is encoded as tuples in a Neural Turing Machine (NTM) memory that supports associative recall. We’ll ignore the details of the encoding process to focus on how episodic memory is organized, searched and applied to solving problems.

In the biological analog of an NTM, the hippocampus and the entorhinal region of the medial temporal cortex play the role of episodic memory and several subcortical circuits including the basal ganglia comprise the controller [153, 151]. The controller employs associative keys in the form of low-dimensional vectors generated from activations highlighted in the global workspace to access related memories that are then actively maintained in the prefrontal cortex and serve to bias processing throughout the brain but particularly in those circuits highlighted in the global workspace. Figure 52 provides a sketch of how this is accomplished in the apprentice architecture.

Figure 52:  You can think of the episodic memory encoded in the hippocampus and entorhinal cortex as RAM and the actively maintained memories in the prefrontal cortex as the contents of registers in a conventional von Neumann architecture. Since the activated memories have different temporal characteristics and functional relationships with the contents of the global workspace, we implement them as two separate NTM memory systems each with its own special-purpose controller. Actively maintained information highlighted in the global workspace is used to generate keys for retrieving relevant memories that augment the highlighted activations. In the DNC paper [78] appearing in Nature, the authors point out that "an associative key that only partially matches the content of a memory location can still be used to attend strongly to that location [allowing] the content of one address [to] effectively encode references to other addresses". The contents of memory consist of thought vectors that can be composed with other thought vectors to shape the global context for interpretation.
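The partial-match behavior quoted from [78] can be illustrated with a cosine-similarity read over memory rows. The sharpening factor beta and the tiny two-row memory below are assumptions made for the sketch, not the DNC's full addressing machinery.

```python
import numpy as np

def read(memory, key, beta=5.0):
    # Cosine similarity between the key and each memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))
    w = np.exp(beta * sims)
    w = w / w.sum()                    # sharpened, normalized read weighting
    return w @ memory                  # blended read vector

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
key = np.array([0.9, 0.1, 0.0])       # only partially matches the first row
r = read(memory, key)                 # yet attends strongly to that row
```

Because a stored row can itself serve as a key into other rows, content of one address can encode soft references to other addresses, which is the linked-structure trick the quote alludes to.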

Figure 53 combines the components that we've introduced so far in a single neural network architecture. The empty box on the far right includes both the language processing and dialogue management systems as well as the networks that interface with FIDE and the other components involved in code synthesis. There are several classes of programming tasks that we might tackle in order to show off the apprentice, including commenting, extending, refactoring and repairing programs. We could focus on functional languages like Scheme or Haskell, strongly typed languages like Pascal and Java or domain specific languages like HTML or SQL.

However, rather than emphasize any particular programming language or task, in the remainder of this appendix we focus on how one might represent structured programs consisting of one or more procedures in a distributed connectionist framework so as to exploit the advantages of this computational paradigm. We believe the highly-parallel, contextual, connectionist computations that dominate in human information processing will complement the primarily-serial, combinatorial, symbolic computations that characterize conventional information processing and will have a considerable positive impact on the development of practical automatic programming methods.

Figure 53:  This slide summarizes the architectural components introduced so far in a single model. Data in the form of text transcriptions of ongoing dialogue, source code and related documentation and output from the integrated development environment are the primary input to the system and are handled by relatively standard neural network models. The Q-network for the attentional RL system is realized as a multi-layer convolutional network. The two DNC controllers are straightforward variations on existing network models with a second controller responsible for maintaining a priority queue of encodings of relevant past experience retrieved from episodic memory. The nondescript box labeled "motor cortex" serves as a placeholder for the neural networks responsible for managing dialogue and handling tasks related to programming and code synthesis.

The integrated development environment and its associated software engineering tools constitute an extension of the apprentice’s capabilities in much the same way that a piano or violin extends a musician or a prosthetic limb extends someone who has lost an arm or leg. The extension becomes an integral part of the person possessing it and over time their brain creates a topographic map that facilitates interacting with the extension36.

As engineers designing the apprentice, part of our job is to create tools that enable the apprentice to learn its trade and eventually become an expert. Conventional IDE tools simplify the job of software engineers in designing software. The fully instrumented IDE (FIDE) that we engineer for the apprentice will be integrated into the apprentice’s cognitive architecture so that tasks like stepping a debugger or setting breakpoints are as easy for the apprentice as balancing parentheses and checking for spelling errors in a text editor is for us.

As a first step in simplifying the use of FIDE for coding, the apprentice is designed to manipulate programs as abstract syntax trees (ASTs) and easily move back and forth between the AST representation and the original source code while collaborating with the programmer. Both the apprentice and the programmer can modify or make references to text appearing in the FIDE window by pointing to items or highlighting regions of the source code. The text and AST versions of the programs represented in the FIDE are automatically synchronized so that the program under development is forced to adhere to certain syntactic invariants.
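To make the text/AST synchronization concrete, here is a minimal sketch using Python's standard `ast` module as a stand-in for the hypothetical FIDE machinery: the tree is parsed from source, edited directly, and unparsed back to text, so the regenerated program is syntactically valid by construction.

```python
import ast

# Parse source text into an AST; edits to either representation can be
# resynchronized by re-parsing the text or unparsing the modified tree.
source = "def square(x):\n    return x * x\n"
tree = ast.parse(source)

# Modify the AST directly: rename the function node.
tree.body[0].name = "sq"

# Regenerate source text from the tree; the result is syntactically valid
# by construction, which is the invariant the FIDE would enforce.
print(ast.unparse(tree))
```

Note that `ast.unparse` requires Python 3.9 or later; a production FIDE would also need to preserve comments and formatting, which the standard `ast` module discards.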

Figure 54:  We use pointers to represent programs as abstract syntax trees and partition the NTM memory, as in a conventional computer, into program memory and a LIFO execution (call) stack to support recursion and reentrant procedure invocations, including call frames for return addresses, local variable values and related parameters. The NTM controller manages the program counter and LIFO call stack to simulate the execution of programs stored in program memory. Program statements are represented as embedding vectors and the system learns to evaluate these representations in order to generate intermediate results that are also embeddings. It is a simple matter to execute the corresponding code in the FIDE and incorporate any of the results as features in embeddings.
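The execution model described in the caption can be sketched concretely. In this toy sketch, symbolic instruction tuples stand in for the embedding vectors the NTM controller would actually manipulate, and each call frame holds only a return address rather than the full frame of local variables and parameters:

```python
# Program memory holds instructions at addresses; a LIFO call stack of
# frames supports reentrant procedure invocations (Figure 54 layout).
program = {
    0: ("push_arg", 4), 1: ("call", 10), 2: ("halt", None),
    10: ("dup", None), 11: ("mul", None), 12: ("ret", None),  # square(x)
}

def run(program, pc=0):
    data, frames = [], []          # operand stack and LIFO call stack
    while True:
        op, arg = program[pc]
        if op == "push_arg": data.append(arg); pc += 1
        elif op == "call":   frames.append(pc + 1); pc = arg  # push return address
        elif op == "ret":    pc = frames.pop()                # pop return address
        elif op == "dup":    data.append(data[-1]); pc += 1
        elif op == "mul":    data.append(data.pop() * data.pop()); pc += 1
        elif op == "halt":   return data.pop()

print(run(program))  # → 16
```

In the actual architecture, each step would additionally evaluate the corresponding code in the FIDE and fold the results back into the embeddings, as the caption describes.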

To support this hypothesis, we are developing distributed representations for programs that enable the apprentice to efficiently search for solutions to programming problems by allowing the apprentice to easily move back and forth between the two paradigms, exploiting both conventional approaches to program synthesis and recent work on machine learning and inference in artificial neural networks. Neural Turing Machines coupled with reinforcement learning are capable of learning simple programs. We are interested in representing structured programs expressed in modern programming languages. Our approach is to alter the NTM controller and impose additional structure on the NTM memory designed to support procedural abstraction.

What could we do with such a representation? It is important to understand why we don’t work with some intermediate representation like bytecodes. By working in the target programming language, we can take advantage of both the abstractions afforded by the language and the expert knowledge of the programmer about how to exploit those abstractions. The apprentice is bootstrapped with two statistical language models: one trained on a natural language corpus and the other on a large code repository. Using these resources and the means of representing and manipulating program embeddings, we intend to train the apprentice to predict the next expression in a partially constructed program by using a variant of imagination-based planning [158]. As another example, we will attempt to leverage NLP methods to generate proposals for substituting one program fragment for another as the basis for code completion.

Figure 55:  This slide illustrates how we make use of input / output pairs as program invariants to narrow search for the next statement in the evolving target program. At any given moment the call stack contains the trace of a single conditioned path through the developing program. A single path is unlikely to provide sufficient information to account for the constraints implicit in all of the sample input / output pairs, and so we intend to use a limited lookahead planning system to sample multiple execution traces in order to inform the prediction of the next program statement. These so-called imagination-augmented agents implement a novel architecture for reinforcement learning that balances exploration and exploitation using imperfect models to generate trajectories from some initial state using actions sampled from a rollout policy [158]. These trajectories are then combined and fed to an output policy along with the action proposed by a model-free policy to make better decisions. There are related reinforcement learning architectures that perform Markov chain Monte Carlo search to apply and collect the constraints from multiple input / output pairs.

The Differentiable Neural Program (DNP) representation and associated NTM controller for managing the call stack and single-stepping through such programs allow us to exploit the advantages of distributed vector representations to predict the next statement in a program under construction. This model makes it easy to take advantage of supplied natural language descriptions and example input / output pairs plus incorporate semantic information in the form of execution traces generated by utilizing the FIDE to evaluate each statement and encoding information about local variables on the stack.

## October 9, 2018

%%% Tue Oct  9 02:48:10 PDT 2018


Read a couple of papers on integrating episodic memory in reinforcement learning dialogue systems from Sieber and Krenn [177] and Young et al [213]. Along similar lines as Williams et al [209], Su et al [182] explore a method for active reward learning that significantly reduces the amount of annotated data required for dialogue policy learning37.

Miscellaneous Loose Ends: I learned a bit about early work on the chemical transmission of nerve impulses listening to an interview with Paul Greengard. Greengard shared the 2000 Nobel Prize in Physiology or Medicine with Arvid Carlsson and Eric Kandel for their discoveries concerning signal transduction in the nervous system. Greengard's autobiographical video on the Society for Neuroscience website is an excellent introduction to the history surrounding the discovery of the chemical pathways in neural circuits and the related biochemistry, as well as the controversy and cast of interesting characters who encouraged or discouraged work in this direction, including Eric Kandel and Rodolfo Llinás in their roles in working out the underlying electrophysiology, and John Eccles, who shared the 1963 Nobel Prize in Physiology or Medicine with Andrew Huxley and Alan Lloyd Hodgkin for his work on the synapse and who believed that synaptic transmission was purely electrical. In fact, Henry Hallett Dale was the first to identify acetylcholine as an agent in the chemical transmission of nerve impulses (neurotransmission), for which he shared the 1936 Nobel Prize in Physiology or Medicine with Otto Loewi. Eccles was Dale's most prominent adversary at the time and continued his opposition through the 1940s and much of the 1950s, in what became one of the most significant debates in the history of twentieth-century neuroscience.

## October 7, 2018

%%% Sun Oct  7 04:35:26 PDT 2018


Here is a short email exchange that includes some ideas about distributed representations of programs along with references to recent papers on code embedding:

TLD: I've been talking about the possibility of embedding code fragments using variants of the skip-gram and CBOW models for some time now [126, 137]. I assumed that someone else must have thought of this and implemented an instantiation of the basic idea. However, when I actually started looking, I was unable to come up with more than a few promising directions [26, 143]. I'm not counting these papers [25, 43] first authored by Miltiadis Allamanis, which I think you pointed me to earlier. The only really concrete example I could find is this recent paper [7] on predicting method names from continuous distributed vector representations of code. Are you aware of any more interesting work along these lines that I'm overlooking?

I'm most interested in using some version of the continuous bag of words (CBOW) model to predict a fragment corresponding to a target node in the abstract syntax tree from all the nodes in its immediate context, e.g., adjacent nodes in the AST or all nodes within a fixed path length of the target. I'm also considering the possibility of using input-output pairs as invariants to constrain statements in defining a procedure or method to meet some specification — especially the statements at the beginning or end of the definition, or, as I think you mentioned in an earlier meeting, to provide suggestions for applicable data types that might hint at an implementation.
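The CBOW-style prediction just described can be sketched in a few lines. In this toy model (the node-type vocabulary, dimensions, and training pair are all hypothetical), a target AST node type is predicted from the average of its context-node embeddings, trained with a few SGD steps on a single (context, target) pair:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["if", "compare", "assign", "call", "return", "name", "const"]
V, D = len(vocab), 16
E = rng.normal(0, 0.1, (V, D))    # input (context) embeddings
W = rng.normal(0, 0.1, (D, V))    # output projection

def predict(context_ids):
    h = E[context_ids].mean(axis=0)              # average context embeddings
    z = h @ W
    p = np.exp(z - z.max())
    return p / p.sum()                           # softmax over node types

def train_step(context_ids, target_id, lr=0.5):
    # One SGD step on the cross-entropy loss for a (context, target) pair.
    h = E[context_ids].mean(axis=0)
    p = predict(context_ids)
    grad_z = p.copy()
    grad_z[target_id] -= 1.0
    grad_h = W @ grad_z
    W[:, :] -= lr * np.outer(h, grad_z)               # update output projection
    E[context_ids] -= lr * grad_h / len(context_ids)  # update context embeddings

# Hypothetical context: neighbors of an `if` node; target: `compare`.
ctx = [vocab.index("if"), vocab.index("name"), vocab.index("const")]
for _ in range(50):
    train_step(ctx, vocab.index("compare"))
print(vocab[int(np.argmax(predict(ctx)))])  # → compare
```

A real instantiation would of course train over a large corpus of ASTs and use subtree or path encodings rather than bare node types, but the predict-from-averaged-context structure is the same.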

For example, this expression, (equal? output ((lambda (parameters) definition) input)), might serve as a probe that could be used to generate proposals for code fragments or parameter data types appearing in the body of the lambda expression corresponding to this description:

(define describe_A "Given a list consisting of three words represented as
strings, find and replace the first occurrence of the second word with
the third word in a given string if the occurrence of the second word
is preceded by the first word by no more than two separating words.")

(lambda (triple document separation)
  (let ((given (first triple)) (maiden (second triple)) (married (third triple)))
    (do ((dist (+ 1 separation) (+ 1 dist))
         (words (string-split document) (rest words))
         (out (list)))
        ((null? words) (print (string-join out)))
      (cond ((equal? given (first words))
             (set! dist 0)
             (set! out (append out (list given))))
            ((and (equal? maiden (first words)) (< dist separation))
             (set! out (append out (list married))))
            (else
             (set! out (append out (list (first words)))))))))


I just borrowed this example from my 2018 course notes and so it's not representative of the use case I really have in mind, but it reminded me of how important it will be to use a graded set of examples, starting out with much simpler examples than those illustrated in the above description. Thanks in advance for any suggestions.

RIS: Yes, I also don't think there has been a lot of interesting work on learning good program embeddings other than trying to use traditional word2vec style program embeddings using sequence models. There were a couple of recent papers I was reading that may also be related, but still seemed a bit preliminary: (1) an extension to the code2vec paper entitled "code2seq: Generating Sequences from Structured Representations of Code" [6], where the authors made the embedding network recurrent to learn path embeddings in an AST between two leaf nodes as a sequence of nodes, and (2) "Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces" [96]: this paper abstracts program source code into a set of paths, where each path is abstracted using only a sequence of function calls. The paths are then embedded using a word2vec style approach.

Thanks for the example — this looks quite interesting. For this particular example, an input-output based specification would require a few corner cases to precisely cover all the cases where the first word precedes the second word by one word, two words, or three or more words. On the other hand, the natural language description might specify the intent more concisely and clearly.

Predicting code fragments as a target node in an AST given some context AST nodes sounds like a great application domain to train and evaluate the embeddings. There are a few papers looking at code completion like problems, where the goal is to predict an expression given the surrounding program context: "PHOG: Probabilistic Model for Code" [21], "Learning Python Code Suggestion with a Sparse Pointer Network" [20], "Code Completion with Neural Attention and Pointer Networks" [127], "Neural Code Completion" [131].

I don't think any of these papers explicitly consider the problem of embedding program nodes, which is a central problem and would be an important one to solve for performing any downstream task like code completion or synthesis.

TLD: Thanks for the additional references. I have been toying with the idea of using distinct execution traces as a basis for learning vector program representations for the purpose of automated reconstruction but was — and still am, to a certain extent — concerned about the inevitable exponential explosion. That said, one can imagine sampling traces to different depths in order to iteratively construct a program with branches. Two of the papers you mentioned suggest similarly motivated strategies: Henkel et al [96] employ "abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings" and Alon et al [6] represent code snippets as "the set of paths in [the corresponding] abstract syntax tree[s] and [use] attention to select the relevant paths during decoding, much like contemporary [neural machine translation] models." I've got some ideas about using WaveNet and the differentiable neural program representation that I sketched in Slide 19 of the draft presentation to create embeddings that include both semantic and syntactic information. The idea is not well developed but I hope to do so later this week.
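The input/output probe discussed in the exchange above can be sketched in Python: a candidate program fragment survives only if it maps every sample input to its paired output, which is the role the (equal? output ((lambda ...) input)) probe plays in the Scheme formulation. The candidate pool here is a hypothetical stand-in for proposals generated by a learned model:

```python
# Hypothetical candidate bodies proposed for a specification; in the
# apprentice these would come from the embedding-based proposal network.
candidates = [
    lambda xs: xs,                  # identity
    lambda xs: list(reversed(xs)),  # reverse
    lambda xs: sorted(xs),          # sort
]

io_pairs = [([3, 1, 2], [1, 2, 3]), ([2, 2, 1], [1, 2, 2])]

def consistent(f, pairs):
    # The probe: (equal? output (f input)) must hold for every sample pair.
    return all(f(i) == o for i, o in pairs)

survivors = [f for f in candidates if consistent(f, io_pairs)]
print(len(survivors))  # → 1 (only the sorting candidate satisfies both pairs)
```

In practice each surviving candidate would seed further search rather than end it, with additional input/output pairs pruning the pool at each step.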

Miscellaneous Loose Ends: A series of entries in the 2018 class notes focused on Terrence Deacon's theories about language as laid out in his book The Symbolic Species: The Co-evolution of Language and the Brain [39]. This past weekend I read Deacon's more recent book [40], entitled Incomplete Nature: How Mind Emerged from Matter, and wrote up a short synopsis that you can read here38 if you're interested in the possible relationships between consciousness and the origins of life.

## September 27, 2018

This morning I thought more about implementing subroutines in the latest version of the Programmer's Apprentice architecture. I was swimming at Stanford, trying to re-program my outlook to enjoy exercise and mostly failing to make any headway. I had posed two questions to focus on during the swim in order to take my mind off the cold and tedium of swimming laps:

1. How do you exploit conscious awareness — highlighting activation in the global workspace?

2. How do you exploit automatic episodic recall — making related content available for access?

About half way through my routine, I started thinking about how to build a neural analog of a simple register machine: processor clock, program counter, read from memory, write to memory, etc. I had initially dismissed the idea as infeasible despite the fact that there have been a number of papers describing how you can use differentiable memory models and reinforcement learning to implement simple algorithms, e.g., Neural Turing Machines (NTM/DNC) [Graves et al [78, 77]], Memory Networks [Weston et al [208]] and differentiable versions of Stacks, Queues and DeQues (double-ended queues) [Grefenstette et al [81]].

The hardware in a conventional CPU includes an instruction sequencer, program counter (instruction register), memory address register, instruction address register and branch address register. My first inclination was to simplify the addressing by using a variant of Greg Wayne's unsupervised predictive memory for goal-directed agents in partially observable Markov decision processes as a branch predictor, assuming every step is a branch and avoiding sequential memory addresses altogether [204].

At some point, it dawned on me that there was nothing wrong with the idea of a register machine that pointers and associative indexing couldn't cure. I came up with several plausible architectures. In the simplest case, the attentional circuit that implements conscious awareness in the frontal cortex serves as a processor clock and program counter, and the (episodic) memory system composed of the hippocampus, basal ganglia and prefrontal cortex serves as a next instruction / branch predictor. Since episodic memory is implemented as an NTM we can easily encode an associative key, pointer (a) to the current instruction, and pointer (b) to the next instruction.

Once you get the basic idea there are all sorts of alternatives to consider, e.g., pointer (a) could be a thought vector or sparse hippocampal probe providing an extra level of indirection, pointer (b) could be another associative key, or more generally you could feed pointers (a) and (b) to a microcode sequencer that handles microcode execution and branching. The neural register machine could be used to run native programs in the IDE, or execute hierarchical task networks as a form of microcode that runs in a separate processor, serving a role similar to that played by the cerebellar cortex in fine-motor and cognitive control [25, 111, 110].
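The associative-indexing scheme can be sketched as follows: episodic memory stores (key, instruction, next-key) triples, and the "program counter" advances by nearest-neighbor lookup rather than by sequential addressing. The toy two-dimensional keys below stand in for thought vectors or sparse hippocampal probes:

```python
import math

# Episodic memory as (key, instruction, pointer-to-next-key) triples;
# the keys and instruction names are hypothetical placeholders.
memory = [
    ((1.0, 0.0), "load",  (0.0, 1.0)),
    ((0.0, 1.0), "add",   (0.7, 0.7)),
    ((0.7, 0.7), "store", None),
]

def nearest(probe):
    # Associative recall: cosine similarity against stored keys.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return max(memory, key=lambda entry: cos(probe, entry[0]))

def execute(start):
    # The attentional "clock" repeatedly probes memory; pointer (b) from
    # each retrieved triple becomes the next probe, i.e., the branch target.
    probe, trace = start, []
    while probe is not None:
        key, instr, nxt = nearest(probe)
        trace.append(instr)
        probe = nxt
    return trace

print(execute((0.9, 0.1)))  # → ['load', 'add', 'store']
```

Note that the initial probe is deliberately noisy; associative lookup still retrieves the intended first instruction, which is the point of replacing sequential addresses with keys.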

## September 15, 2018

Almost everyone wants to improve themselves in some way and that often requires reprogramming yourself. How to accomplish this is the \$64,000 question and the topic of our class discussion with Randall O'Reilly. There are self-help books for learning how to do almost anything including improving personal relationships [28], controlling your emotions [76], practicing mindfulness meditation [115], and improving your tennis game [71]. However, as far as I know, none of them tell you how to write programs that will run on your innate computational substrate.

Randy and I discussed how you might possibly teach someone how to load a list of parameter values into memory and call a procedure on those values as arguments. This would certainly facilitate learning how to execute an algorithm for performing long division or simplifying algebraic equations in your head as children are taught to do in the fourth grade. Of course, we want the apprentice to learn to synthesize programs from specifications or from input/output pairs. That's not the same thing as learning a program for writing programs from scratch, but it suggests a different way of approaching the problem.

In terms of taking stock of where we are at this stage in our investigations, it seems we are expecting parallel distributed (connectionist) representations to provide some advantage over or to fundamentally and usefully complement serial (combinatorial) symbolic representations39. A related issue concerns whether natural language is an advantage or liability in developing automated programming systems. It certainly plays a crucial role if we are set on building an assistant as opposed to a fully-capable, standalone software engineering savant that achieves or surpasses the competence of an expert human programmer.

One problem concerns the content and structure of thought vectors. It is already complicated to compose and subsequently decode thought vectors that are composed of two or more contributing thought vectors of the same sort, e.g., derived from natural language utterances. If we take the idea of multi-modal association areas seriously then we will be creating thought vectors that combine vectors constructed from natural language and programming language fragments. Of course, we can rely on context to avoid confusing the two, e.g., "while" as a word in natural language versus a token in a programming language, as in [175].

The idea of natural language as a lingua franca for intra-cortical communication perversely appeals to me. There is something seductive about learning to translate natural language specifications into programs written in a specialized domain-specific-language (DSL) that can be directly executed using some control policy to write, debug, refactor or repair code. Perhaps I will simply have to abandon this idea for now, given that my current conception just trades one complicated algorithmic protocol for another. The other option that I keep returning to relies entirely on the composition of existing program fragments.

Miscellaneous Loose Ends: The dialogue between technical collaborators and pair-programmers in particular is full of interruptions, negotiations, suggestions, pregnant pauses, unsolicited restatements, thinking out loud, counterfactuals, off-the-cuff reactions, carefully formulated analyses, wild guesses, confusions over ambiguous references, accommodations of different strategies for engagement, requests for alternative formulations, sustained misunderstandings, queries concerning ambiguous references, requests for self-retraction, e.g., "forget about that" or "that didn't make any sense; ignore the last thing I said", declarations of fact, imperative statements, e.g., "insert a new expression at the cursor" or "add a comment after the assignment statement", etc.

All of these can be thought of as interlocutory plans, conversational gambits, speech acts. They can be internalized and generalized using an intermediate language derived from hierarchical goal-based planning and formalized as hierarchical task networks. You can think of this as a specialized intra-cortical lingua franca that enables the apprentice to formulate sophisticated plans: plans that allow for conditional branching, decompose complicated tasks into partially ordered sequences of subtasks, and support executive control over plan deployment and execution, including the termination of entire subnetworks, the ability to replan on the fly, and the use of continuations representing partially expanded plans awaiting additional input.
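The decomposition at the heart of a hierarchical task network can be sketched in a few lines: a compound task expands into an ordered sequence of subtasks, recursively, until only primitive actions remain. The task names and methods here are hypothetical illustrations, not part of the apprentice's actual plan library:

```python
# Hypothetical HTN methods: each compound task maps to an ordered
# sequence of subtasks; tasks without a method are primitive actions.
methods = {
    "fix-bug": ["locate-fault", "edit-code", "verify"],
    "verify":  ["run-tests", "inspect-output"],
}

def expand(task):
    if task not in methods:          # primitive action: emit as-is
        return [task]
    plan = []
    for subtask in methods[task]:    # decompose compound task in order
        plan.extend(expand(subtask))
    return plan

print(expand("fix-bug"))
# → ['locate-fault', 'edit-code', 'run-tests', 'inspect-output']
```

A full HTN executor would add the conditional branching, subnetwork termination, replanning, and continuations mentioned above; this sketch shows only the recursive decomposition step.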

Buried in such exchanges is a great deal of information, the imparting of knowledge, the creation of new technology, the emergence of joint ideation and serendipitous discovery. In the best of collaborations, there is a merging of complementary intellects in which individuals become joined in their efforts to create something of value. Francis Crick and Jim Watson, David Hubel and Torsten Wiesel, Alan Hodgkin and Andrew Huxley, Daniel Kahneman and Amos Tversky were not obviously made for one another and yet in each case they forged a collaboration in which they complemented one another and often became so engaged that they completed one another's sentences. We think it is possible to create artificial assistants that could provide a similar level of intellectual intimacy and technical complementarity.

## September 7, 2018

In this log entry, we continue where we left off in the previous class discussion notes on developing an end-to-end training protocol for bootstrapping a variant of the programmer's apprentice application. We begin with the analogous stages in early child development. Each of the following four stages is briefly introduced with additional technical details provided in the accompanying footnote:

• Basic cognitive bootstrapping and linguistic grounding40:

• modeling language: statistical n-gram language model trained on programming corpus;

• hierarchical planning: automated tutor generates lessons using curriculum training;

• Simple interactive behavior for signaling and editing42:

• following instruction: learning to carry out simple plans one instruction at a time;

• explaining behavior: providing short explanations of behavior, acknowledging failure;

• Mixed dialogue interleaving instruction and mirroring45:

• classifying intention: learning to categorize tasks and summarize intentions to act;

• confirming comprehension: conveying practical understanding of specific instructions;

• Composite behaviors corresponding to simple repairs46:

• executing complex plans: generating and executing multi-step plans with contingencies;

• recovering from failure: backtracking, recovering, retracting steps on failed branches;

# References

1 That wasn't exactly what I intended [...] it needs a qualification: "For Loren, what have we learned from directly observing neural activity in mouse models during repeated trials of maze learning [[ about the wider range of animal episodic memory, e.g., the memory of eating some tasty fruit with a particular smell (the mouse version of Proust's memory of eating a madeleine) or the time the mouse was almost caught by a cat, dog or hawk [...] seems likely that some mouse memories are constructed from abstract combinations of sight, sound, tickled whiskers, temperature and scent [...] so what I am really after is, what have we learned from mice about how human memories — especially the abstract sort that might correspond to working on a mathematical proof or computer program [...] in general, it seems like it shouldn't matter since the stimulus is just a pattern of activity [...] the fact that it arrives from the perirhinal or parahippocampal region likely makes some difference [...] and then there's this talk about "spacetime coordinates" versus "spatial and temporal coordinates" ]]

2 This is an excerpt from the supporting online materials produced for the PBS special Misunderstood Minds that profiles a variety of learning problems and expert opinions. These materials are designed to give parents and teachers a better understanding of learning processes, insights into difficulties and strategies for responding. This material is particularly relevant in understanding the challenges that arise in developing hybrid — symbolic-plus-connectionist — neural-network architectures that emulate various aspects — and, likely as not, deficits — of human cognition and are expected to acquire skills such as computer programming and mathematical thinking by some combination of machine learning and automated tutoring:

Math disabilities can arise at nearly any stage of a child's scholastic development. While very little is known about the neurobiological or environmental causes of these problems, many experts attribute them to deficits in one or more of five different skill types. These deficits can exist independently of one another or can occur in combination. All can impact a child's ability to progress in mathematics.
• Incomplete Mastery of Number Facts: Number facts are the basic computations (9 + 3 = 12 or 2 x 4 = 8) students are required to memorize in the earliest grades of elementary school. Recalling these facts efficiently is critical because it allows a student to approach more advanced mathematical thinking without being bogged down by simple calculations.

• Computational Weakness: Many students, despite a good understanding of mathematical concepts, are inconsistent at computing. They make errors because they misread signs or carry numbers incorrectly, or may not write numerals clearly enough or in the correct column. These students often struggle, especially in primary school, where basic computation and "right answers" are stressed. Often they end up in remedial classes, even though they might have a high level of potential for higher-level mathematical thinking.

• Difficulty Transferring Knowledge: One fairly common difficulty experienced by people with math problems is the inability to easily connect the abstract or conceptual aspects of math with reality. Understanding what symbols represent in the physical world is important to how well and how easily a child will remember a concept. Holding and inspecting an equilateral triangle, for example, will be much more meaningful to a child than simply being told that the triangle is equilateral because it has three equal sides. And yet children with this problem find connections such as these painstaking at best.

• Connections Between Separate Concepts: Some students have difficulty making meaningful connections within and across mathematical experiences. For instance, a student may not readily comprehend the relation between numbers and the quantities they represent. If this kind of connection is not made, math skills may not be anchored in any meaningful or relevant manner. This makes them harder to recall and apply in new situations.

• Understanding of the Language of Math: For some students, a math disability is driven by problems with language. These children may also experience difficulty with reading, writing, and speaking. In math, however, their language problem is confounded by the inherently difficult terminology, some of which they hear nowhere outside of the math classroom. These students have difficulty understanding written or verbal directions or explanations, and find word problems especially difficult to translate.

• Comprehending Visual and Spatial Aspects: A far less common problem — and probably the most severe — is the inability to effectively visualize math concepts. Students who have this problem may be unable to judge the relative size among three dissimilar objects. This disorder has obvious disadvantages, as it requires that a student rely almost entirely on rote memorization of verbal or written descriptions of math concepts that most people take for granted. Some mathematical problems also require students to combine higher-order cognition with perceptual skills, for instance, to determine what shape will result when a complex 3-D figure is rotated.

3 Excerpt from Pages 99-101 of The Symbolic Species: The Co-evolution of Language and the Brain [39] by Terrence Deacon:

In summary, then, symbols cannot be understood as an unstructured collection of tokens that map to a collection of referents because symbols don't just represent things in the world, they also represent each other. Because symbols do not directly refer to things in the world, but indirectly refer to them by virtue of referring to other symbols, they are implicitly combinatorial entities whose referential powers are derived by virtue of occupying determinate positions in an organized system of other symbols. Both their initial acquisition and their later use requires a combinatorial analysis. The structure of the whole system has a definite semantic topology that determines the ways symbols modify each other's referential functions in different combinations. Because of this systematic relational basis of symbolic reference, no collection of signs can function symbolically unless the entire collection conforms to certain overall principles of organization. Symbolic reference emerges from a ground of nonsymbolic referential processes only because the indexical relationships between symbols are organized so as to form a logically closed group of mappings from symbol to symbol. This determinate character allows the higher-order system of associations to supplant the individual (indexical) referential support previously invested in each component symbol. This system of relationships between symbols determines a definite and distinctive topology that all operations involving those symbols must respect in order to retain referential power. The structure implicit in the symbol-symbol mapping is not present before symbolic reference, but comes into being and affects symbol combinations from the moment it is first constructed. The rules of combination that are implicit in this structure are discovered as novel combinations are progressively sampled. 
As a result, new rules may be discovered to be emergent requirements of encountering novel combinatorial problems, in much the same way as new mathematical laws are discovered to be implicit in novel manipulations of known operations.

Symbols do not, then, get accumulated into unstructured collections that can be arbitrarily shuffled into different combinations. The system of representational relationships, which develops between symbols as symbol systems grow, comprises an ever more complex matrix. In abstract terms, this is a kind of tangled hierarchic network of nodes and connections that defines a vast and constantly changing semantic space. Though semanticists and semiotic theorists have proposed various analogies to explain these underlying topological principles of semantic organization (such as +/- feature lists, dictionary analogies, encyclopedia analogies), we are far from a satisfactory account. Whatever the logic of this network of symbol-symbol relationships, it is inevitable that it will be reflected in the patterns of symbol-symbol combinations in communication.

Abstract theories of language, couched in terms of possible rules for combining unspecified tokens into strings, often implicitly assume that there is no constraint on theoretically possible combinatorial rule systems. Arbitrary strings of uninterpreted tokens have no reference and thus are unconstrained. But the symbolic use of tokens is constrained both by each token's use and by the use of other tokens with respect to which it is defined. Strings of symbols used to communicate and to accomplish certain ends must inherit both the intrinsic constraints of symbol-symbol reference and the constraints imposed by external reference.

Some sort of regimented combinatorial organization is a logical necessity for any system of symbolic reference. Without an explicit syntactic framework and an implicit interpretive mapping, it is possible neither to produce unambiguous symbolic information nor to acquire symbols in the first place. Because symbolic reference is inherently systemic, there can be no symbolization without systematic relationships. Thus syntactic structure is an integral feature of symbolic reference, not something added and separate. It is the higher-order combinatorial logic, grammar, that maintains and regulates symbolic reference; but how a specific grammar is organized is not strongly restricted by this requirement. There need to be precise combinatorial rules, yet a vast number are possible that do not ever appear in natural languages. Many other factors must be taken into account in order to understand why only certain types of syntactic systems are actually employed in natural human languages and how we are able to learn the incredibly complicated rule systems that result.

So, before turning to the difficult problem of determining what it is about human brains that makes the symbolic recoding step so much easier for us than for the chimpanzees Sherman and Austin (and members of all other nonhuman species as well), it is instructive to reflect on the significance of this view of symbolization for theories of grammar and syntax. Not only does this analysis suggest that syntax and semantics are deeply interdependent facets of language—a view at odds with much current linguistic theory—it also forces us entirely to rethink current ideas about the nature of grammatical knowledge and how it comes to be acquired.

Excerpt from Pages 135-136 of The Symbolic Species: The Co-evolution of Language and the Brain [39] by Terrence Deacon:

The co-evolutionary argument that maps languages onto children's learning constraints can be generalized one step further to connect to the most basic problem of language acquisition: decoding symbolic reference. Symbolic associations are preeminent examples of highly distributed relationships that are only very indirectly reflected in the correlative relationships between symbols and objects. As was demonstrated in the last chapter, this is because symbolic reference is indirect, based on a system of relationships among symbol tokens that recodes the referential regularities between their indexical links to objects. Symbols are easily utilized when the system-to-system coding is known, because at least superficial analysis can be reduced to a simple mapping problem, but it is essentially impossible to discover the coding given only the regularities of word-object associations. As in other distributed pattern-learning problems, the problem in symbol learning is to avoid getting attracted to learning potholes: tricked into focusing on the probabilities of individual sign-object associations and thereby missing the nonlocal marginal probabilities of symbol-symbol regularities. Learning even a simple symbol system demands an approach that postpones commitment to the most immediately obvious associations until after some of the less obvious distributed relationships are acquired. Only by shifting attention away from the details of word-object relationships is one likely to notice the existence of superordinate patterns of combinatorial relationships between symbols, and only if these are sufficiently salient is one likely to recognize the buried logic of indirect correlations and shift from a direct indexical mnemonic strategy to an indirect symbolic one.

In this way, symbol learning in general has many features that are similar to the problem of learning the complex and indirect statistical architecture of syntax. This parallel is hardly a coincidence, because grammar and syntax inherit the constraints implicit in the logic of symbol-symbol relationships. These are not, in fact, separate learning problems, because systematic syntactic regularities are essential to ease the discovery of the combinatorial logic underlying symbols. The initial stages of the symbolic shift in mnemonic strategies almost certainly would be more counterintuitive for a quick learner, who learns the details easily, than for a somewhat impaired learner, who gets the big picture but seems to lose track of the details. In general, then, the initial shift to reliance on symbolic relationships, especially in a species lacking other symbol-learning supports, would be most likely to succeed if the process could be shifted to as young an age as possible. The evolution of symbolic communication systems has therefore probably been under selection for early acquisition from the beginning of their appearance in hominid communication. So it is no surprise that the optimal time for beginning to discover grammatical and syntactic regularities in language is also when symbolic reference is first discovered. However, the very advantages that immature brains enjoy in their ability to make the shift from indexical to symbolic referential strategies also limit the detail and complexity of what can be learned. Learning the details becomes possible with a maturing brain, but one that is less spontaneously open to such "insights." This poses problems for brain-language co-evolution that will occupy much of the rest of the book in one form or other. How do symbolic systems evolve structures that are both capable of being learned and yet capable of being highly complex?
And how have human learning and language-use predispositions evolved to support these two apparently contrary demands?

Elissa Newport was one of the first to suggest that we should not necessarily think of children's learning proficiency in terms of the function of a special language-learning system. She suggests that the relationship might be reversed. Language structures may have preferentially adapted to children's learning biases and limitations because languages that are more easily acquired at an early age will tend to replicate more rapidly and with greater fidelity from generation to generation than those that take more time or neurological maturity to be mastered. As anyone who has tried to learn a second language for the first time as an adult can attest, one's first language tends to monopolize neural-cognitive resources in ways that make it more difficult for other languages to "move in" and ever be quite as efficient. Consequently, strong social selection forces will act on language regularities to reduce the age at which they can begin to be learned. Under constant selection pressure to be acquirable at ever earlier and earlier stages in development, the world's surviving languages have all evolved to be learnable at the earliest age possible. Languages may thus be more difficult to learn later in life only because they evolved to be easier to learn when immature. The critical period for language learning may not be critical or time limited at all, but a mere "spandrel," or incidental feature of maturation, that just happened to be coopted in languages' race to colonize ever younger brains.

Kanzi's immaturity made it easier to make the shift from indexical to symbolic reference and to learn at least the global grammatical logic hidden behind the surface structure of spoken English. But equally important is the fact that both the lexigram training paradigms used with his mother and the structure of English syntax itself had evolved in response to the difficulties this imposes, and so had spontaneously become more conducive to the learning patterns of immature minds. The implications of Kanzi's advantages are relevant to human language acquisition as well, because if his prodigious abilities are not the result of engaging some special time-limited language acquisition module in his nonhuman brain, then such a critical period mechanism is unlikely to provide the explanation for the language prescience of human children either. The existence of a critical period for language learning is instead the expression of the advantageous limitations of an immature nervous system for the kind of learning problem that language poses. And language poses the problem this way because it has specifically evolved to take advantage of what immaturity provides naturally. Not being exposed to language while young deprives one of these learning advantages and makes both symbolic and syntactic learning far more difficult. Though older animals and children may be more cooperative, more attentive, have better memories, and in general may be better learners of many things than are toddlers, they gain these advantages at the expense of symbolic and syntactic learning predispositions. This is demonstrated by many celebrated "feral" children who, over the years, have been discovered after they grew up isolated from normal human discourse. Their persistent language limitations attest not to the turning off of a special language instinct but to the waning of a nonspecific language-learning bias.

4 Here are a few papers in the 2000 special issue on hippocampal-cortical interactions — use the Lavenex and Amaral [125] paper as a starting place to follow the citation trail forward in search of more recent papers:

@article{MaguireetalHIPPOCAMPUS-00,
author = {Maguire, Eleanor A. and Mummery, Catherine J. and B\"{u}chel, Christian},
title = {Patterns of hippocampal-cortical interaction dissociate temporal lobe memory subsystems},
journal = {Hippocampus},
volume = {10},
number = {4},
year = {2000},
pages = {475-482},
abstract = {A distributed network of brain regions supports memory retrieval in humans, but little is known about the functional interactions between areas within this system. During functional magnetic resonance imaging (fMRI), subjects retrieved real-world memories: autobiographical events, public events, autobiographical facts, and general knowledge. A common memory retrieval network was found to support all memory types. However, examination of the correlations (i.e., effective connectivity) between the activity of brain regions within the temporal lobe revealed significant changes dependent on the type of memory being retrieved. Medially, effective connectivity between the parahippocampal cortex and hippocampus increased for recollection of autobiographical events relative to other memory types. Laterally, effective connectivity between the middle temporal gyrus and temporal pole increased during retrieval of general knowledge and public events. The memory types that dissociate the common system into its subsystems correspond to those that typically distinguish between patients at initial phases of Alzheimer's disease or semantic dementia. This approach, therefore, opens the door to new lines of research into memory degeneration, capitalizing on the functional integration of different memory-involved regions. Indeed, the ability to examine interregional interactions may have important diagnostic and prognostic implications.},
}
@article{LarocheetalHIPPOCAMPUS-00,
author = {Laroche, Serge and Davis, Sabrina and Jay, Th\'{e}r\`{e}se M.},
title = {Plasticity at hippocampal to prefrontal cortex synapses: Dual roles in working memory and consolidation},
journal = {Hippocampus},
volume = {10},
number = {4},
year = {2000},
pages = {438-446},
abstract = {The involvement of the hippocampus and the prefrontal cortex in cognitive processes and particularly in learning and memory has been known for a long time. However, the specific role of the projection which connects these two structures has remained elusive. The existence of a direct monosynaptic pathway from the ventral CA1 region of the hippocampus and subiculum to specific areas of the prefrontal cortex provides a useful model for conceptualizing the functional operations of hippocampal-prefrontal cortex communication in learning and memory. It is known now that hippocampal to prefrontal cortex synapses are modifiable synapses and can express different forms of plasticity, including long-term potentiation, long-term depression, and depotentiation. Here we review these findings and focus on recent studies that start to relate synaptic plasticity in the hippocampo-prefrontal cortex pathway to two specific aspects of learning and memory, i.e., the consolidation of information and working memory. The available evidence suggests that functional interactions between the hippocampus and prefrontal cortex in cognition and memory are more complex than previously anticipated, with the possibility for bidirectional regulation of synaptic strength as a function of the specific demands of tasks.},
}
@article{OReillyandRudyHIPPOCAMPUS-00,
author = {O'Reilly, Randall C. and Rudy, Jerry W.},
title = {Computational principles of learning in the neocortex and hippocampus},
journal = {Hippocampus},
volume = {10},
number = {4},
year = {2000},
pages = {389-397},
abstract = {We present an overview of our computational approach towards understanding the different contributions of the neocortex and hippocampus in learning and memory. The approach is based on a set of principles derived from converging biological, psychological, and computational constraints. The most central principles are that the neocortex employs a slow learning rate and overlapping distributed representations to extract the general statistical structure of the environment, while the hippocampus learns rapidly, using separated representations to encode the details of specific events while suffering minimal interference. Additional principles concern the nature of learning (error-driven and Hebbian), and recall of information via pattern completion. We summarize the results of applying these principles to a wide range of phenomena in conditioning, habituation, contextual learning, recognition memory, recall, and retrograde amnesia, and we point to directions of current development.},
}
@article{RollsHIPPOCAMPUS-00,
author = {Rolls, Edmund T.},
title = {Hippocampo-cortical and cortico-cortical backprojections},
journal = {Hippocampus},
volume = {10},
number = {4},
year = {2000},
pages = {380-388},
abstract = {First, the information represented in the primate hippocampus, and what is computed by the primate hippocampus, are considered. Then a theory is described of how the information represented in the hippocampus is able to influence the cerebral cortex by a hierarchy of hippocampo-cortical and cortico-cortical backprojection stages. The recalled backprojected information in the cerebral neocortex could then be used by the neocortex as part of memory recall, including that required in spatial working memory; to influence the processing that each cortical stage performs based on its forward inputs; to influence the formation of long-term memories; and/or in the selection of appropriate actions.},
}
@article{LavenexandAmaralHIPPOCAMPUS-00,
author = {Lavenex, Pierre and Amaral, David G.},
title = {Hippocampal-neocortical interaction: A hierarchy of associativity},
journal = {Hippocampus},
volume = {10},
number = {4},
year = {2000},
pages = {420-430},
abstract = {The structures forming the medial temporal lobe appear to be necessary for the establishment of long-term declarative memory. In particular, they may be involved in the “consolidation” of information in higher-order associational cortices, perhaps through feedback projections. This review highlights the fact that the medial temporal lobe is organized as a hierarchy of associational networks. Indeed, associational connections within the perirhinal, parahippocampal, and entorhinal cortices enable a significant amount of integration of unimodal and polymodal inputs, so that only highly integrated information reaches the remainder of the hippocampal formation. The feedback efferent projections from the perirhinal and parahippocampal cortices to the neocortex largely reciprocate the afferent projections from the neocortex to these areas. There are, however, noticeable differences in the degree of reciprocity of connections between the perirhinal and parahippocampal cortices and certain areas of the neocortex, in particular in the frontal and temporal lobes. These observations are particularly important for models of hippocampal-neocortical interaction and long-term storage of information in the neocortex. Furthermore, recent functional studies suggest that the perirhinal and parahippocampal cortices are more than interfaces for communication between the neocortex and the hippocampal formation. These structures participate actively in memory processes, but the precise role they play in the service of memory or other cognitive functions is currently unclear.},
}
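The complementary-learning-systems picture summarized in the O'Reilly and Rudy abstract (slow statistical learning in neocortex, rapid pattern-separated storage and pattern completion in hippocampus) can be caricatured in a few lines of code. This is purely my illustrative sketch, not their model; all names and constants (`slow_lr`, `pattern_complete`, the noise level) are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# "Neocortex": a single prototype vector updated with a slow learning
# rate, extracting the regularity shared across many episodes.
cortex = np.zeros(dim)
slow_lr = 0.05

# "Hippocampus": fast one-shot storage of separated episode traces.
episodes = []

prototype = rng.normal(size=dim)  # hidden regularity in the environment
for _ in range(200):
    event = prototype + 0.3 * rng.normal(size=dim)  # noisy specific event
    cortex += slow_lr * (event - cortex)            # slow statistical learning
    episodes.append(event)                          # rapid episodic storage

def pattern_complete(cue):
    """Retrieve the stored episode closest to a (possibly partial) cue."""
    dists = [np.linalg.norm(cue - e) for e in episodes]
    return episodes[int(np.argmin(dists))]
```

After training, `cortex` sits near the hidden prototype (the gist), while `pattern_complete` recovers a particular stored event from a cue, mirroring the division of labor the abstract describes.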


5 Blumenbach had one big idea that no one took seriously. He was convinced about a special feature of humans, about which he could not have been more explicit. "Man," he wrote in 1795, "is far more domesticated and far more advanced from his first beginnings than any other animal." In 1806 he explained that our species' domestication was due to biology: "There is only one domestic animal — domestic in the true sense, if not in the ordinary acceptation of this word — that also surpasses all others in these respects, and that is man. The difference between him and other domestic animals is only this, that they are not so completely born to domestication as he is, having been created by nature immediately a domestic animal." There was one big obstacle to the adoption of his thesis, so great that for a century Blumenbach's big idea was not taken seriously. The question was: how could the domestication of humans have happened? In the case of farmyard animals, humans were obviously responsible for the domesticating. But if humans were domesticated, who was responsible? Who could have domesticated our ancestors?

In his 1871 book on human evolution, The Descent of Man, and Selection in Relation to Sex, Darwin contemplated Blumenbach's proposition. If humans really were domesticated, he wanted to know how and why, but alas he too got caught up in an infinite regress. Subsequently, human self-domestication in some form or other has been explored by evolutionarily minded scholars from an astonishing range of perspectives, including archaeology, social anthropology, biological anthropology, paleoanthropology, philosophy, psychiatry, psychology, ethology, biology, history, and economics. Everywhere, the essential rationale is the same. Our docile behavior recalls that of a domesticated species, and since no other species can have domesticated us, we must have done it ourselves. We must be self-domesticated. But how could that have happened? — If you want to learn the answer, read Richard Wrangham's excellent The Goodness Paradox: How Evolution Made Us Both More and Less Violent from which this excerpt was adapted [210].

6 Ethology is the scientific and objective study of animal behavior, usually with a focus on behavior under natural conditions, and viewing behavior as an evolutionarily adaptive trait. Behaviorism is a term that also describes the scientific and objective study of animal behavior, usually referring to measured responses to stimuli or trained behavioral responses in a laboratory context, without a particular emphasis on evolutionary adaptivity (SOURCE)

7 SRI International (SRI) is an American nonprofit scientific research institute and organization headquartered in Menlo Park, California. The trustees of Stanford University established SRI in 1946 as a center of innovation to support economic development in the region. SRI performs client-sponsored research and development for government agencies, commercial businesses, and private foundations. It also licenses its technologies, forms strategic partnerships and creates spin-off companies. (SOURCE)

8 SNARC (Stochastic Neural Analog Reinforcement Calculator) is a neural net machine designed by Marvin Lee Minsky. George Miller gathered the funding for the project from the Air Force Office of Scientific Research in the summer of 1951. This machine is considered one of the first pioneering attempts at the field of artificial intelligence. Minsky went on to be a founding member of MIT's Project MAC, which split to become the MIT Laboratory for Computer Science and the MIT Artificial Intelligence Lab, and is now the MIT Computer Science and Artificial Intelligence Laboratory. (SOURCE)

9 The Dartmouth Summer Research Project on Artificial Intelligence was the name of a 1956 summer workshop now considered by many (though not all) to be the seminal event for artificial intelligence as a field. The project lasted approximately six to eight weeks, and was essentially an extended brainstorming session. Eleven mathematicians and scientists were originally planned to be attendees, and while not all attended, more than ten others came for short times. (SOURCE)

10 The prevailing connectionist approach today was originally known as parallel distributed processing (PDP). It was an artificial neural network approach that stressed the parallel nature of neural processing, and the distributed nature of neural representations. It provided a general mathematical framework for researchers to operate in. The framework involved eight major aspects (SOURCE):

1. A set of processing units, represented by a set of integers.

2. An activation for each unit, represented by a vector of time-dependent functions.

3. An output function for each unit, represented by a vector of functions on the activations.

4. A pattern of connectivity among units, represented by a matrix of real numbers indicating connection strength.

5. A propagation rule spreading the activations via the connections, represented by a function on the output of the units.

6. An activation rule for combining inputs to a unit to determine its new activation, represented by a function on the current activation and propagation.

7. A learning rule for modifying connections based on experience, represented by a change in the weights based on any number of variables.

8. An environment that provides the system with experience, represented by sets of activation vectors for some subset of the units.
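The eight aspects above can be sketched as a minimal toy network. This is my own hypothetical illustration of the framework, not code from the PDP volumes; the tanh output function, decay constant, and simple Hebbian rule are arbitrary choices standing in for the general components:

```python
import numpy as np

rng = np.random.default_rng(0)

n_units = 4                                 # 1. processing units, indexed 0..n-1
activation = np.zeros(n_units)              # 2. activation state of each unit
output = np.tanh                            # 3. output function on activations
W = rng.normal(0, 0.1, (n_units, n_units))  # 4. connectivity: weight matrix

def propagate(act):
    # 5. propagation rule: spread unit outputs through the connections
    return W @ output(act)

def activate(act, net, decay=0.5):
    # 6. activation rule: combine current activation with net input
    return (1 - decay) * act + decay * net

def hebbian_update(act, lr=0.01):
    # 7. learning rule: strengthen connections between co-active units
    global W
    W += lr * np.outer(output(act), output(act))

# 8. environment: a set of input activation patterns providing experience
environment = [rng.normal(size=n_units) for _ in range(3)]

for pattern in environment:
    activation = pattern.copy()
    for _ in range(5):                      # let the network settle briefly
        activation = activate(activation, propagate(activation))
    hebbian_update(activation)
```

Each numbered component of the framework maps onto one line or function, which is the main point: PDP is a vocabulary for describing networks, not a single architecture.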

11 I've purchased three copies each of [Chater, 2018] and [Dehaene, 2014] that you are welcome to borrow from me, and the engineering library has one physical copy of each of the first four titles listed below plus an electronic copy of the fifth [Dere et al] held on reserve for students taking CS379C:

@book{Chater2018,
title = {The Mind is Flat: The Illusion of Mental Depth and The Improvised Mind},
author = {Chater, Nick},
year = {2018},
publisher = {Penguin Books Limited}
}
@book{Dehaene2014,
title = {Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts},
author = {Stanislas Dehaene},
publisher = {Viking Press},
year = 2014,
}
@book{Deacon2012incomplete,
title = {Incomplete Nature: How Mind Emerged from Matter},
author = {Deacon, Terrence W.},
year = {2012},
publisher = {W. W. Norton}
}
@book{Deacon1998symbolic,
title = {The Symbolic Species: The Co-evolution of Language and the Brain},
author = {Deacon, Terrence W.},
year = {1998},
publisher = {W. W. Norton}
}
@book{Dereetal2008,
title = {Handbook of Episodic Memory},
editor = {Ekrem Dere and Alexander Easton and Lynn Nadel and Joseph P. Huston},
series = {Handbook of Behavioral Neuroscience},
publisher = {Elsevier},
volume = {18},
year = {2008}
}


12 Due to bilateral symmetry the brain has a hippocampus in each cerebral hemisphere. If damage to the hippocampus occurs in only one hemisphere, leaving the structure intact in the other hemisphere, the brain can retain near-normal memory functioning.

13 Technically, an exponential curve has similar properties along its entire length. There is no specific point we can define as the moment things start going "really" fast, and in fact your perception of how fast the graph grows depends on the axes you choose: the more you compress the y-axis, the faster the graph will appear to grow. However, a cursory glance at some exponential functions shows that humans have a tendency to place the "knee" of the curve around the point at which a line that appears to have slope one begins to grow more slowly than the exponential curve. Again, a line that appears to have slope one will in reality have a slope that depends on the scale you use for each of the axes. — Posted by Austin Garrett, Research Assistant at MIT Computer Science and Artificial Intelligence Laboratory (SOURCE)
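The scale dependence of the "knee" can be made concrete. If y = e^x is drawn with the y-axis compressed by a factor k, the drawn curve is k·e^x, so the point where the drawn slope equals one sits at x = -ln(k): compressing the axis further (smaller k) pushes the apparent knee to the right. A quick sketch (the function name is my own):

```python
import math

def apparent_knee(k):
    """x at which the drawn slope of y = e^x equals 1 when the
    y-axis is compressed by a factor k (drawn curve: k * e^x)."""
    # Solve d/dx [k * e^x] = k * e^x = 1  =>  x = -ln(k)
    return -math.log(k)

# The harder we compress the y-axis, the further right the knee appears.
for k in [1.0, 0.1, 0.01]:
    print(f"k = {k}: apparent knee at x = {apparent_knee(k):.3f}")
```

The knee moves from x = 0 to roughly x = 4.6 as k drops from 1 to 0.01, even though the underlying function never changed, which is exactly Garrett's point.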

14 Excerpts from [30]. We have 4 copies (3 hardcover and 1 Kindle) that can be checked out on loan:

• The brain is continually scrambling to link together scraps of sensory information (and has the ability, of course, to gather more information, with a remarkably quick flick of the eye). We 'create' our perception of an entire visual world from a succession of fragments, picked up one at a time (see Chapter 2). Yet our conscious experience is merely the output of this remarkable process; we have little or no insight into the relevant sensory inputs or how they are combined.

• As soon as we query some aspects of the visual scene (or, equally, of our memory), then the brain immediately locks onto relevant information and attempts to impose meaning upon it. The process of creating such meaning is so fluent that we imagine ourselves merely to be reading off pre-existing information, to which we already have access, just as, when scrolling down the contents of a word processor, or exploring a virtual reality game, we have the illusion that the entire document, or labyrinth, pre-exist in all their glorious pixel-by-pixel detail (somewhere 'off-screen'). But, of course, they are created for us by the computer software at the very moment they are needed (e.g. when we scroll down or 'run' headlong down a virtual passageway). This is the sleight of hand that underlies the grand illusion (see Chapter 3).

• In perception, we focus on fragments of sensory information and impose what might be quite abstract meaning: the identity, posture, facial expression, intentions of another person, for example. But we can just as well reverse the process. We can focus on an abstract meaning, and create a corresponding sensory image: this is the basis of mental imagery. So just as we can recognize a tiger from the slightest of glimpses, we can also imagine a tiger — although, as we saw in Chapter 4, the sensory image we reconstruct is remarkably sketchy.

• Feelings are just one more thing we can pay attention to. An emotion is, as we saw in Chapter 5, the interpretation of a bodily state. So experiencing an emotion requires attending to one's bodily state as well as relevant aspects of the outer world: the interpretation imposes a 'story' linking body and world together. Suppose, for example, that Inspector Lestrade feels the physiological traces of negativity (perhaps he draws back, hunches his shoulders, his mouth turns down, he looks at the floor) as Sherlock Holmes explains his latest triumph. The observant Watson attends, successively, to Lestrade's demeanour and Holmes's words, searching for the meaning of these snippets, perhaps concluding: 'Lestrade is jealous of Holmes's brilliance.' But Lestrade's reading of his own emotions works in just the same way: he too must attend to, and interpret his own physiological state and Holmes's words in order to conclude that he is jealous of Holmes's brilliance. Needless to say, Lestrade may be thinking nothing of the kind — he may be trying (with frustratingly little success) to find flaws in Holmes's explanation of the case. If so, while Watson may interpret Lestrade as being jealous, Lestrade is not experiencing jealousy (of Holmes's brilliance, or anything else) — because experiencing jealousy results from a process of interpretation, in which jealous thoughts are the 'meaning' generated, but Lestrade's mind is attending to other matters entirely, in particular, the details of the case.

• Finally, consider choices (see Chapter 6). Recall how the left hemisphere of a split-brain patient fluently, though often completely spuriously, 'explains' the mysterious activity of the left hand — even though that hand is actually governed by the brain's right hemisphere. This is the left, linguistic brain's attempt to impose meaning on the left-hand movements: to create such meaningful (though, in the case of the split-brain patient, entirely illusory) explanation requires locking onto the activity of the left hand in order to make sense of it. It does not, in particular, involve locking onto any hidden inner motives lurking within the right hemisphere (the real controller of the left hand) because the left and right hemispheres are, of course, completely disconnected. But notice that, even if the hemispheres were connected, the left hemisphere would not be able to attend to the right hemisphere's inner workings — because the brain can only attend to the meaning of perceptual input (including the perception of one's own bodily state), not to any aspect of its own inner workings.

We are, in short, relentless improvisers, powered by a mental engine which is perpetually creating meaning from sensory input, step by step. Yet we are only ever aware of the meaning created; the process by which it arises is hidden. Our step-by-step improvisation is so fluent that we have the illusion that the 'answers' to whatever 'questions' we ask ourselves were 'inside our minds all along'. But, in reality, when we decide what to say, what to choose, or how to act, we are, quite literally, making up our minds, one thought at a time.

15 This is a cleaned-up version of my rough transcription of Chater's comments from a talk he gave at Google posted on the Talks at Google YouTube channel on May 22, 2018. My rough transcription doesn't do justice to his prepared talk since it was drawn from my scattered notes and this cleaned-up version doesn't do justice to his writing. I encourage you to purchase a copy of The Mind is Flat or check out the book from your local public library. You can find the relevant passage in the section of his book entitled "Rethinking the Boundary of Consciousness". For those of you taking the class at Stanford, I've purchased three copies that I can lend to students and I've asked the engineering library to put one or more copies on reserve.

16 I couldn't find an unassailable source for either citation or attribution, but here are two pieces of evidence — one quote from Popper's writing and one attribution by a reputable historian — that will have to do:

• "Good tests kill flawed theories; we remain alive to guess again." As quoted in My Universe : A Transcendent Reality (2011) by Alex Vary, Part II [REFERENCE]

• "If we are uncritical, we shall always find what we want: we shall look for, and find, confirmations, and we shall look away from, and not see, whatever might be dangerous to our pet theories. In this way it is only too easy to obtain what appears to be overwhelming evidence in favor of a theory which, if approached critically, would have been refuted." The Poverty of Historicism (1957) Ch. 29 The Unity of Method [REFERENCE]

17 Conduction aphasia is a rare form of aphasia, often specifically related to damage in the arcuate fasciculus. An acquired language disorder, it is characterized by intact auditory comprehension and fluent speech production, but poor speech repetition. Patients are fully capable of understanding what they are hearing, but fail to encode phonological information for production:

Studies have suggested that conduction aphasia is a result of damage specifically to the left superior temporal gyrus and/or the left supramarginal gyrus. The classical explanation for conduction aphasia is that of a disconnection between the brain areas responsible for speech comprehension (Wernicke's area) and speech production (Broca's area), due specifically to damage to the arcuate fasciculus, a deep white matter tract. Patients are still able to comprehend speech because the lesion does not disrupt the ventral stream pathway. SOURCE

18 This footnote serves as a parking spot for my notes on Nick Chater's The Mind is Flat. As an introduction and a test to see if you are interested in his theory of human cognition, I suggest you start with his Talks at Google book tour presentation. If you find that interesting but are still not convinced you want to read the book [30], you might get a better idea by watching Episode #111 of The Dissenter podcast hosted by Ricardo Lopes. Here is an excerpt relating to Chater's main thesis that we are misled by introspection into believing that below the threshold of our conscious thoughts there is a great deal of supporting unconscious thinking going on — unconscious, but of the same general depth and complexity as our conscious thoughts:

The things the brain does are very complicated, but they are nothing like the things we are consciously aware of. So I think using our conscious mind as a window into the brain is a disastrous error. It's like we think the mind is the tip of an iceberg. We see the iceberg poking out of the sea, and the illusion is that we think "I got the tip of the iceberg which is my flow of conscious experience, but I bet that the whole iceberg is just the same stuff". [...] The machinery that is operating is this incredibly complicated memory retrieval and reuse system which is searching our past experience and using that past experience to understand the present. [...] Like scanning all the faces you've seen in order to recognize a new face. All of that is fast and parallel, but it's nothing like the sequential nature of the mind. I think the basic operation performed in these different areas [of the brain] is pretty much the same from one area to the next. They all involve perception and memory in one way or another. Abstract thought, whether mathematics, physics or the law [...] I think of these as all pretty much similar in spirit to thinking about and recognizing objects. [...] They are just more abstract versions of that. [There are some who believe that there are a number of specialized systems handling different types of problems] The Tooby and Cosmides Swiss Army knife model [59] [But I don't agree.] So I want to push against this [strong] modularity assumption.

Ricardo provides the following convenient bookmarks that take you directly to the relevant location in the podcast:

00:01:06 The basic premise of "The Mind is Flat"

00:05:33 We are like fictional characters

00:09:59 The problem with stories and narratives

00:13:58 The illusions our minds create (about motives, desires, goals, etc.)

00:17:44 The distinction between the conscious mind and brain activity

00:22:34 Does dualism make sense?

00:27:11 Is modularity of mind a useful approach?

00:31:21 How our perceptual systems work

00:41:49 How we represent things in our minds

00:44:57 The Kuleshov effect, and the interpretation of emotions

00:55:42 Why do we need our mental illusions?

00:59:10 The importance of our imagination

01:01:31 Can AI systems produce the same illusions (emotions, consciousness)?

## Lament Over Sloshed Milk

Here are the last few lines of Chapter 2 entitled "Anatomy of a Hoax" in which Chater commiserates with himself and the reader over the fact — actually a presupposition — that scientists (might) have deluded themselves regarding some of the most basic facts about human cognition. I will certainly admit that he makes a good case for his view of how we experience and make sense of the world around us. His theory explains some of the predictions one could make concerning the models I've been working on and so I will have little reason to complain if he is proved right. But I will hold out for a while and watch for more experimental evidence before celebrating my modeling choices or adopting his theory wholesale.

From time to time, I have found myself wondering, somewhat despairingly, how much the last hundred and fifty years or so of psychology and neuroscience has really revealed about the secrets of human nature. How far have we progressed beyond what we can gather from philosophical reflection, the literary imagination, or from plain common sense? How much has the scientific study of our minds and brains revealed that really challenges our intuitive conception of ourselves?

The gradual uncovering of the grand illusion through careful experimentation is a wonderful example of how startlingly wrong our intuitive conception of ourselves can be. And once we know the trick, we can see that it underlies the apparent solidity of our verbal explanations too. Just as the eye can dash into action to answer whatever question about the visual world I happen to ask myself, so my inventive mind can conjure up a justification for my actions, beliefs and motives, just as soon as I wonder about them. We wonder why puddles form or how electricity works, and immediately we find explanations springing into our consciousness. And if we query any element of our explanation, more explanations spring into existence, and so on. Our powers of invention are so fluent that we can imagine that these explanations were pre-formed within us in all their apparently endless complexity. But, of course, each answer was created in the moment.

So whether we are considering sensory experience or verbal explanations, the story is the same. We are, it turns out, utterly wrong about a subject on which we might think we should be the ultimate arbiter: the contents of our own minds. Could we perhaps be equally or even more deluded when we turn to consider the workings of our imagination?

## Collective Decision Making

Here is an extended thought experiment inspired by my reading of Chater's The Mind is Flat [30] that explores how Chater's theory of human cognition might play out in a collective endeavor:

When we engage in a group undertaking, whether that be evaluating candidates for a job position or deciding upon a strategy for investing in new markets, we are collectively creating a shared illusion that serves as the basis of our own individual thinking as well as of any possible consensus regarding, for example, specific actions being contemplated.

Think about what happens when one of us makes some contribution to the discussion, whether it be a comment or criticism, or an addition or modification to some possible outcome of our joint focus, say a job offer, contract or new species of investment. In voicing an opinion, we influence one another's mind state by how our contribution is individually and jointly perceived. Given what Nick Chater tells us about human behavior, it is highly likely that our contribution will be misunderstood and our resulting thoughts and those of others thinly apprehended but richly fantasized.

It makes sense to think of this shared space as a collection of thought clouds in the sense that Geoff Hinton uses the term. Each thought cloud is no more than a sparse representation of an individual’s state vector. It includes, among many other things, the activation of state variables that correspond to our internal representation of the mental states of those sitting around the table — a representation that is no doubt poorly informed and incredibly misleading.

These individual idiosyncratic representations of the evolving joint space are tied together very loosely by our notoriously error-prone efforts to read one another's thoughts, but, whether or not we are able to read "minds", there is still the possibility of interpreting what each contributor actually says or how they act. Alas, we are just as fallible in reading body language and interpreting intent from what is explicitly said or acted out in pose, gesture or facial tic.

As each participant voices their opinion, makes their case, and expresses support for or opposition to what was said earlier, all of the individual thought clouds are separately altered according to the inscrutable adjustments of diverse hormonal secretions and neuromodulatory chemical gradients. The individuals may believe — or possibly hope — that some consensus will eventually be reached; however, unless carefully led by a very skilled facilitator, the separate thought clouds will be cluttered, full of contradictions and misunderstandings, and yet by some independent measure oddly aligned — which could be due simply to the length of time the meeting was scheduled for or the perceived duration necessary for this particular group to reach consensus.

There will very likely be a good deal of wishful thinking among those who either want the meeting to end quickly regardless of the outcome, hope that a consensus can be amicably reached, or have already reached their final opinion and will become increasingly strident in support of their views as the meeting drags on. There will be those who will want — or pretend — to hear their colleagues voice support for their ideas, but will interpret whatever they say to suit their own selfish interests and expectations.

In Chater’s theory, each one of us is a single thread of conscious thought informed by and constructed upon a history of memories dredged up in response to sensory input, in this case, resulting from what was seen and heard in the meeting. This means that, in particular, each one of us will have a different context depending on our own stored memories and the degree to which we have attended to the discussion in the meeting. This will result in a collection of state vectors that in the best of circumstances are only roughly aligned, and, in the more realistic case, significantly discordant.

It would be interesting to know which sorts of dimensions are more likely to appear with some semblance of their current grounding in fact, or that, while they may carry a different valence, at least represent an emotional trajectory accounting for roughly the same physiological state across some fraction of the individuals present in the discussion. While I don't believe this sort of dimensional alignment is likely for dimensions of a more abstract sort, I wouldn't be surprised if, by doing a factor analysis on all the different thought vectors represented in a given collective, we were able to identify factors representing some alignment that translates across individuals — one that plays an important role in our evolution as successful social organisms.
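The factor-analysis speculation above can be sketched numerically. This is a toy illustration, not a model of anything measured: the participant count, dimensionality, noise level and seed are all arbitrary assumptions of mine, and a plain SVD stands in for a full factor-analysis fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 participants, each summarized by a 50-dimensional
# "thought vector". A couple of shared latent factors (standing in for,
# say, emotionally mediated state) drive part of every vector; the rest
# is idiosyncratic noise.
n_people, n_dims, n_factors = 8, 50, 2
loadings = rng.normal(size=(n_factors, n_dims))   # shared structure
scores = rng.normal(size=(n_people, n_factors))   # per-person factor scores
vectors = scores @ loadings + 0.5 * rng.normal(size=(n_people, n_dims))

# Center and take the SVD: the leading singular directions play the role
# of common factors across participants.
centered = vectors - vectors.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s**2) / (s**2).sum()

# If cross-individual alignment exists, a few factors account for most
# of the variance; with no alignment, variance spreads across all of them.
print(explained[:3])
```

The diagnostic is the shape of `explained`: sharply concentrated mass in the first few entries is the "oddly aligned" case, a flat profile is the discordant one.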

The picture I have in my head is of a collection of thought clouds with some dimensional alignment across individuals with respect to perceptually — and in particular emotionally — mediated factors but very little alignment across abstract dimensions that capture more of the concrete aspects of the collective focus intended by those who organized the meeting in the first place. All of the usual cognitive biases are likely to be at play in the interactions going on during the meeting. Individual opinions will not be obviously manifest in behavior and will likely be repressed and prettified to make them more palatable to the group as a whole.

Moreover, many if not most of the individuals will likely misinterpret the opinions and other hidden state of their co-contributors, and also likely adjust the valence and magnitude of related dimensions to suit their own idiosyncratic beliefs and desires with respect to the outcome of the collective effort.

It would be instructive to imagine this sort of collective enterprise as playing out in a protracted meeting and how, for example, participants might align their viewpoints based upon a particularly articulate opinion rendered by someone thought highly — or perhaps fondly — of, versus some sort of discordant alignment resulting from an incoherent but forcefully rendered opinion by someone not well thought of. The exercise would not necessarily be to figure out a strategy for effectively coming to a joint understanding so much as to trace how cognition would play out given sparse, serialized thought processes operating on internal representations that only thinly capture the collective experience and ground much of what is heard or seen in idiosyncratic, suspiciously self-promoting or self-effacing episodic memory.

As a logical next step along these lines, it would be interesting to ask how the outcome might differ for a group of very smart, highly motivated, deeply engaged individuals with a history of working together, overseen by a facilitator of particularly sharp intellect and unusually well-calibrated emotional and social intelligence, highly motivated to do the right thing and fully invested in guiding the participants toward a consensus worth the effort.

In this case, the focus would be on how the participants might rise above their (instinctual) predilections by using the same cognitive substrate with the same energy and focus as they would devote to something they might find intellectually more satisfying, such as writing code and solving interesting programming problems. Specifically, how can the basic cycle of apprehend (sense), attend (select), integrate personal experience (recall), and respond to an internal model of the present circumstances (act) be successfully applied to making difficult decisions, given what Chater refers to as the essentially "flat" character of our world view / internal model of present circumstances?
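To make the cycle concrete, here is a deliberately minimal caricature of the sense / select / recall / act loop in code. Every name and rule in it (longest-word attention, substring-overlap recall, a bounded episodic store) is an illustrative assumption of mine, not anything Chater proposes; the point is only the serial, one-thought-at-a-time shape of the loop.

```python
from collections import deque

def run_cycle(stimuli, episodic_memory, steps=3):
    """One serial pass of apprehend -> attend -> recall -> act per stimulus."""
    memory = deque(episodic_memory, maxlen=100)  # bounded episodic store
    responses = []
    for stimulus in stimuli[:steps]:
        percept = stimulus.lower()                    # apprehend (sense)
        focus = max(percept.split(), key=len)         # attend: pick one cue
        recalled = [m for m in memory if focus in m]  # recall by overlap
        response = recalled[0] if recalled else f"no precedent for '{focus}'"
        responses.append(response)                    # act: emit one response
        memory.append(percept)                        # store the new episode
    return responses

out = run_cycle(
    ["Offer accepted", "Budget rejected"],
    ["the previous offer was accepted"],
)
print(out)
```

Note that each response is constructed only when its stimulus arrives, from whatever the store happens to yield: the "flat", improvised character the chapter describes, rather than a pre-formed answer waiting inside the system.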

P.S. The original name of this file — Laughably_Sparse_Embarrassingly_Serial.txt — is a tongue-in-cheek reference to a model of parallel / distributed computation (https://en.wikipedia.org/wiki/Embarrassingly_parallel) that describes much of the parallelism available in modern industrial cloud services and corporate data and computing centers.
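For readers unfamiliar with the term, an embarrassingly parallel workload is one that splits into tasks sharing no state and needing no communication, so they can be mapped across workers with no coordination beyond collecting results. A minimal sketch (the checksum function is just a stand-in workload):

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    # Each task depends only on its own input: no shared state,
    # no communication between tasks.
    return sum(ord(c) for c in chunk) % 251

chunks = ["alpha", "beta", "gamma", "delta"]

# Map the independent tasks across a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(checksum, chunks))

# Because the tasks are independent, the parallel map is exactly
# equivalent to the serial one, and Executor.map preserves input order.
serial = [checksum(c) for c in chunks]
assert parallel == serial
print(parallel)
```

This is the pattern the old filename pokes fun at: cloud and data-center workloads often parallelize this trivially, whereas conscious thought, on Chater's account, is laughably sparse and embarrassingly serial.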

@article{ChristiansenandChaterBBS-08,
author = {Christiansen, M. H. and Chater, N.},
title = {Language as shaped by the brain},
journal = {Behavioral and Brain Sciences},
year = {2008},
volume = {31},
number = {5},
pages = {489-508},
abstract = {It is widely assumed that human learning and the structure of human languages are intimately related. This relationship is frequently suggested to derive from a language-specific biological endowment, which encodes universal, but communicatively arbitrary, principles of language structure (a Universal Grammar or UG). How might such a UG have evolved? We argue that UG could not have arisen either by biological adaptation or non-adaptationist genetic processes, resulting in a logical problem of language evolution. Specifically, as the processes of language change are much more rapid than processes of genetic change, language constitutes a "moving target" both over time and across different human populations, and, hence, cannot provide a stable environment to which language genes could have adapted. We conclude that a biologically determined UG is not evolutionarily viable. Instead, the original motivation for UG--the mesh between learners and languages--arises because language has been shaped to fit the human brain, rather than vice versa. Following Darwin, we view language itself as a complex and interdependent "organism," which evolves under selectional pressures from human learning and processing mechanisms. That is, languages themselves are shaped by severe selectional pressure from each generation of language users and learners. This suggests that apparently arbitrary aspects of linguistic structure may result from general learning and processing biases deriving from the structure of thought processes, perceptuo-motor factors, cognitive limitations, and pragmatics.}
}
@incollection{ChaterandChristiansenHLB-11,
title = {A solution to the logical problem of language evolution: language as an adaptation to the human brain},
author = {Nick Chater and Morten H. Christiansen},
booktitle = {The Oxford Handbook of Language Evolution},
editor = {Kathleen R. Gibson and Maggie Tallerman},
publisher = {Oxford University Press},
year = {2011},
abstract = {This article addresses the logical problem of language evolution that arises from a conventional universal grammar (UG) perspective and investigates the biological and cognitive constraints that are considered when explaining the cultural evolution of language. The UG perspective states that language acquisition should not be viewed as a process of learning at all but it should be viewed as a process of growth, analogous to the growth of the arm or the liver. UG is intended to characterize a set of universal grammatical principles that hold across all languages. Language has the same status as other cultural products, such as styles of dress, art, music, social structure, moral codes, or patterns of religious beliefs. Language may be particularly central to culture and act as the primary vehicle through which much other cultural information is transmitted. The biological and cognitive constraints help to determine which types of linguistic structure tend to be learned, processed, and hence transmitted from person to person, and from generation to generation. The communicative function of language is likely to shape language structure in relation to the thoughts that are transmitted and the processes of pragmatic interpretation that people use to understand each other's behavior. A source of constraints derives from the nature of cognitive architecture, including learning, processing, and memory. Language processing involves generating and decoding regularities from highly complex sequential input, indicating a connection between general-purpose cognitive mechanisms for learning and processing sequential material, and the structure of natural language.}
}
@article{ChateretalJML-16,
title = {Language as skill: Intertwining comprehension and production},
author = {Nick Chater and Stewart M. McCauley and Morten H. Christiansen},
journal = {Journal of Memory and Language},
volume = {89},
pages = {244-254},
year = {2016},
abstract = {Are comprehension and production a single, integrated skill, or are they separate processes drawing on a shared abstract knowledge of language? We argue that a fundamental constraint on memory, the Now-or-Never bottleneck, implies that language processing is incremental and that language learning occurs on-line. These properties are difficult to reconcile with the ‘abstract knowledge’ viewpoint, and crucially suggest that language comprehension and production are facets of a unitary skill. This viewpoint is exemplified in the Chunk-Based Learner, a computational acquisition model that processes incrementally and learns on-line. The model both parses and produces language; and implements the idea that language acquisition is nothing more than learning to process. We suggest that the Now-or-Never bottleneck also provides a strong motivation for unified perception-production models in other domains of communication and cognition.}
}


19 Title: Handbook of Episodic Memory
Publication date: 2008
Keywords: Memory, Episodic; Cognitive Neuroscience.
Series: Handbook of Behavioral Neuroscience, Volume 18.
Editors: Dere, Ekrem; Easton, Alexander; Huston, Joseph; Nadel, Lynn.
Bibliography: Includes bibliographical references and index.
Perspectives on episodic and semantic memory retrieval / Lee Ryan, Siobhan Hoscheidt and Lynn Nadel
Exploring episodic memory / Martin A. Conway
Episodic memory and mental time travel / Thomas Suddendorf and Michael C. Corballis
Episodic memory: Reconsolidation / Lynn Nadel ... [et al.]
Attributes of episodic memory processing / Michael R. Hunsaker and Raymond P. Kesner
Cognitive and neural bases of flashbulb memories / Patrick S.R. Davidson
From the past into the future: The developmental origins and trajectory of episodic future thinking / Cristina M. Atance
Emotion and episodic memory / Philip A. Allen, Kevin P. Kaut and Robert R. Lord
Current status of cognitive time travel research in animals / William A. Roberts
Animal episodic memory / Ekrem Dere ... [et al.]
New working definition of episodic memory: Replacing "when" and "which" / Alexander Easton and Madeline J. Eacott
Episodic-like memory in food-hoarding birds / Gesa Feenders and Tom V. Smulders
Representing past and future events / Thomas R. Zentall
Functional neuroanatomy of remote, episodic memory / Morris Moscovitch ... [et al.]
Medial temporal lobe: Visual perception and recognition memory / Yael Shrager and Larry R. Squire
Toward a neurobiology of episodic memory / Howard Eichenbaum ... [et al.]
Spatio-temporal context and object recognition memory in rodents / Mark Good
Role of the prefrontal cortex in episodic memory / Matthias Brand and Hans J. Markowitsch
Basal forebrain and episodic memory / Toshikatsu Fujii
Role of the precuneus in episodic memory / Michael R. Trimble and Andrea E. Cavanna
Multiple roles of dopaminergic neurotransmission in episodic memory / Björn H. Schott and Emrah Düzel
Neural coding of episodic memory / Joe Z. Tsien
Primate hippocampus and episodic memory / Edmund T. Rolls
Hippocampal neuronal activity and episodic memory / Emma R. Wood and Livia de Hoz
Hippocampus, context processing and episodic memory / David M. Smith
Memory and perceptual impairments in amnesia and dementia / Kim S. Graham, Andy C.H. Lee and Morgan D. Barense
Using hippocampal amnesia to understand the neural basis of diencephalic amnesia / John P. Aggleton ... [et al.]
Structure-function correlates of episodic memory in aging / Jonas Persson and Lars Nyberg
Memory and cognitive performance in preclinical Alzheimer's and Vascular disease / Brent J. Small [et al.]
Transgenic mouse models of Alzheimer's disease and episodic-like memory / David R. Borchelt and Alena V. Savonenko
Episodic memory in the context of cognitive control dysfunction: Huntington's disease / Francois Richer [et al.]
Adrenal steroids and episodic memory: Relevance to mood disorders / Hamid A. Alhaj and R. Hamish McAllister-Williams

20 Note that the temporal lobe has a lot going on in it that relates to the meaning of what we see and hear:

The inferior temporal gyrus is placed below the middle temporal gyrus, and is connected behind with the inferior occipital gyrus; it also extends around the infero-lateral border on to the inferior surface of the temporal lobe, where it is limited by the inferior sulcus. This region is one of the higher levels of the ventral stream of visual processing, associated with the representation of complex object features, such as global shape. It may also be involved in face perception, and in the recognition of numbers.

The inferior temporal gyrus is the anterior region of the temporal lobe located underneath the central temporal sulcus. The primary function of the occipital temporal gyrus — otherwise referenced as IT cortex — is associated with visual stimuli processing, namely visual object recognition, and has been suggested by recent experimental results as the final location of the ventral cortical visual system. The IT cortex in humans is also known as the Inferior Temporal Gyrus since it has been located to a specific region of the human temporal lobe. The IT processes visual stimuli of objects in our field of vision, and is involved with memory and memory recall to identify that object; it is involved with the processing and perception created by visual stimuli amplified in the V1, V2, V3, and V4 regions of the occipital lobe. This region processes the color and form of the object in the visual field and is responsible for producing the "what" from this visual stimuli, or in other words identifying the object based on the color and form of the object and comparing that processed information to stored memories of objects to identify that object.

The IT cortex's neurological significance lies not only in its contribution to the processing of visual stimuli in object recognition; it has also been found to be a vital area with regard to simple processing of the visual field, difficulties with perceptual tasks and spatial awareness, and the location of unique single cells that may explain the IT cortex's relation to memory. (https://en.wikipedia.org/wiki/Inferior_temporal_gyrus)

21 Page 126, Chapter 9: Different Voices. In The Voices Within by Charles Fernyhough [60]:

Yes and no. The inner speech theory of voice hearing really became established in the 1990s with the work of Chris Frith and Richard Bentall, who, working independently, developed Feinberg's ideas in slightly different directions. In one research group, Frith and his colleagues at University College London were developing a theory that the symptoms of schizophrenia stemmed from a problem in monitoring one's own actions. In an early study from their group, patients with the diagnosis did not do so well as a control group in a task that involved correcting the errors they made when moving a joystick. The idea was that, if you had a problem in monitoring your own behavior, you might fail to recognize some of the actions that you yourself produced as being your own work. And that could include inner speech: the words you produced, for yourself, in your own head.

22 Here is a summary of Terrence Deacon's Incomplete Nature: How Mind Emerged from Matter borrowing liberally from the Wikipedia page and recent interviews:

Ententional Phenomena:

Broadly, the book seeks to naturalistically explain aboutness, that is, concepts like intentionality, meaning, normativity, purpose, and function; which Deacon groups together and labels as ententional phenomena.

Constraints:

A central thesis is that absence can still be efficacious. Deacon claims that just as the concept of zero revolutionized mathematics, so thinking about life, mind, and other ententional phenomena in terms of constraints (i.e., what is absent) can transform how we understand them. A good example of this concept is the hole that defines the hub of a wagon wheel. The hole itself is not a physical thing, but rather a source of constraint that helps to restrict the conformational possibilities of the wheel's components, such that, on a global scale, the property of rolling emerges.

The constraints that produce emergent phenomena may not be understandable by examining the make-up of a pattern's constituents. Emergent phenomena are difficult to study because their complexity does not necessarily decompose into parts. When a pattern is broken down, the constraints are no longer at work; there is no hole, no absence to notice. Imagine a hub, the hole for an axle, that exists only while the wheel is rolling; breaking the wheel apart will not show you how the hub emerges.

Homeodynamic Systems:

Homeodynamic systems are essentially equivalent to classical thermodynamic systems like a gas under pressure or solute in solution, but the term serves to emphasize that homeodynamics is an abstract process that can be realized in forms beyond the scope of classic thermodynamics. For example, the diffuse brain activity normally associated with emotional states can be considered to be a homeodynamic system because there is a general state of equilibrium which its components (neural activity) distribute towards. In general, a homeodynamic system is any collection of components that will spontaneously eliminate constraints by rearranging the parts until a maximum entropy state (disorderliness) is achieved.

Morphodynamic Systems:

A morphodynamic system consists of a coupling of two homeodynamic systems such that the constraint dissipation of each complements the other, producing macroscopic order out of microscopic interactions. Morphodynamic systems require constant perturbation to maintain their structure, so they are relatively rare in nature. Common examples are snowflake formation, whirlpools and the stimulated emission of laser light.

Teleodynamic Systems:

A teleodynamic system consists of coupling two morphodynamic systems such that the self-undermining quality of each is constrained by the other. Each system prevents the other from dissipating all of the energy available, and so long-term organizational stability is obtained. Deacon claims that we should pinpoint the moment when two morphodynamic systems reciprocally constrain each other as the point when ententional qualities like function, purpose and normativity emerge.

23 Terrence Deacon offers his view on Chomsky's universal grammar in this excerpt from The Symbolic Species [39]. If you have access to a copy (PDF) of Deacon's book, I recommend the clarity and conciseness of Figure 1.1 and its associated caption, showing four cartoon depictions of some of the major theoretical paradigms proposed to explain the basis of human language:

Interpreting the discontinuity between linguistic and nonlinguistic communication as an essential distinction between humans and nonhumans, however, has led to an equally exaggerated and untenable interpretation of language origins: the claim that language is the product of a unique one-of-a-kind piece of neural circuitry that provides all the essential features that make language unique (e.g., grammar). But this does not just assume that there is a unique neurological feature that correlates with this unique behavior, it also assumes an essential biological discontinuity. In other words, that language is somehow separate from the rest of our biology and neurology. It is as though we are apes plus language - as though one handed a language computer to a chimpanzee.

The single most influential "hopeful monster" theory of human language evolution was offered by the linguist Noam Chomsky, and has since been echoed by numerous linguists, philosophers, anthropologists, and psychologists. Chomsky argued that the ability of children to acquire the grammar of their first language, and the ability of adults effortlessly to use this grammar, can only be explained if we assume that all grammars are variations of a single generic "Universal Grammar," and that all human brains come with a built-in language organ that contains this language blueprint. This is offered as the only plausible answer to an apparently insurmountable learning problem.

Grammars appear to have an unparalleled complexity and systematic logical structure, the individual grammatical "rules" aren't explicitly evident in the information available to the child, and when they acquire their first language children are still poor at learning many other things. Despite these limitations children acquire language knowledge at a remarkable rate. This leads to the apparently inescapable conclusion that language information must already be "in the brain" before the process begins in order for it to be successful. Children must already "know" what constitutes an allowable grammar, in order to be able to ignore the innumerable false hypotheses about grammar that their limited experience might otherwise suggest.

Bruno Dubuc summarizes the views of Philip Lieberman and Deacon on the plausibility of an innate language faculty in the form of a universal grammar as follows:

Since Chomsky first advanced these theories, however, evolutionary biologists have undermined them with the proposition that it may be only the brain’s general abilities that are "pre-organized". These biologists believe that to try to understand language, we must approach it not from the standpoint of syntax, but rather from that of evolution and the biological structures that have resulted from it. According to Philip Lieberman [129, 130], for example, language is not an instinct encoded in the cortical networks of a "language organ", but rather a learned skill based on a "functional language system" distributed across numerous cortical and subcortical structures.

Though Lieberman does recognize that human language is by far the most sophisticated form of animal communication, he does not believe that it is a qualitatively different form, as Chomsky claims. Lieberman sees no need to posit a quantum leap in evolution or a specific area of the brain that would have been the seat of this innovation. On the contrary, he says that language can be described as a neurological system composed of several separate functional abilities.

For Lieberman and other authors, such as Terrence Deacon, it is the neural circuits of this system, and not some "language organ", that constitute a genetically predetermined set that limits the possible characteristics of a language. In other words, these authors believe that our ancestors invented modes of communication that were compatible with the brain’s natural abilities. And the constraints inherent in these natural abilities would then have manifested themselves in the universal structures of language.

24 It appears to have been Louis Pasteur who said — modulo the translator's choice of wording, "Where observation is concerned, chance favors only the prepared mind."

25 Here is a concise explanation of how fMRI works: The human body is mostly water. Water molecules (H2O) contain hydrogen nuclei (protons), that become aligned in a magnetic field. An MRI scanner applies a very strong magnetic field (about 0.2 to 3 teslas, or roughly a thousand times the strength of the small magnets used to post reminders and grocery lists on the door of a refrigerator), which aligns the proton "spins."

The scanner also produces a radio frequency current that creates a varying magnetic field. The protons absorb the energy from this field and flip their spins. When the field is turned off, the protons gradually return to their equilibrium alignment, a process called relaxation. The return process produces a radio signal that can be measured by receivers in the scanner and made into an image.

This professionally produced video from the National Institute of Biological Imaging and Bioengineering (NIBIB) provides much the same explanation along with a visual accompaniment. The Center for Functional MRI at UCSD has a more comprehensive introduction to fMRI including an explanation of the blood oxygenation level dependent (BOLD) effect and its application in brain imaging.

26 The bAbI tasks [207] are generated using a simulator described by the authors as follows: "The idea is that generating text within this simulation allows us to ground the language used into a coherent and controlled (artificial) world. [...] To produce more natural looking text with lexical variety from statements and questions we employ a simple automated grammar. Each verb is assigned a set of synonyms, e.g., the simulation command get is replaced with either picked up, got, grabbed or took, and drop is replaced with either dropped, left, discarded or put down."
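The lexical-variety step the authors describe amounts to a synonym substitution over simulation commands. The two synonym sets below come from the quoted description; the `render` helper and its signature are hypothetical, a minimal sketch rather than the authors' actual generator:

```python
import random

# Synonym table in the spirit of the bAbI simulator's lexical-variety
# grammar; "get" and "drop" and their synonyms are from the quoted
# description, everything else is illustrative.
SYNONYMS = {
    "get": ["picked up", "got", "grabbed", "took"],
    "drop": ["dropped", "left", "discarded", "put down"],
}

def render(actor, command, obj, rng=random):
    """Turn a simulation command into one of several surface forms."""
    verb = rng.choice(SYNONYMS[command])
    return f"{actor} {verb} the {obj}."
```

For example, `render("Mary", "get", "football")` yields one of "Mary picked up the football.", "Mary got the football.", "Mary grabbed the football." or "Mary took the football.", grounding varied surface text in a single controlled world state.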

27 Dynamic Memory Networks (DMN) are a general framework for question answering over inputs. Conceptually, a distinction is made between inputs and questions. The DMN takes a sequence of inputs and a question and then employs an iterative attention process to compute the answer. The sequence of inputs can be seen as the history, which complements the general world knowledge (see semantic memory module). The DMN framework consists of five components: (i) input module: processes raw input and maps it to a useful representation, (ii) semantic memory module: stores general knowledge about concepts and facts. It can be instantiated by word embeddings or knowledge bases, (iii) question module: maps a question into a representation, (iv) episodic memory module: an iterative component that in each iteration focuses on different parts of the input, updates its internal state and finally outputs an answer representation, (v) answer module: generates the answer to return. See [211] and [118] for DMN applications in computer vision and natural language processing.
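As a rough illustration of the episodic memory module's iterative attention, here is a minimal numerical sketch. The dot-product attention scores and the tanh update are simplified stand-ins for the learned gating and update networks described in [118]; the function names and the fixed number of passes are assumptions made for illustration only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episodic_memory(facts, question, passes=3):
    """Toy sketch of the DMN episodic memory module (component (iv)).

    In each pass, attention over the input facts is conditioned on both
    the question and the current memory state; the memory is then
    updated from the attended summary.
    """
    memory = question.copy()                 # initialize memory with the question
    for _ in range(passes):
        # attention score for each fact given question and current memory
        scores = np.array([f @ question + f @ memory for f in facts])
        gates = softmax(scores)
        context = (gates[:, None] * facts).sum(axis=0)  # attended summary
        memory = np.tanh(memory + context)   # simplified memory update
    return memory                            # passed on to the answer module (v)
```

The key point the sketch captures is that attention is recomputed on every pass, so later passes can focus on facts that only become relevant once earlier facts have been absorbed into the memory state.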

28 See Gretto [82] for a brief introduction to reproducing kernel Hilbert spaces and related kernel functions.

29 In [100], Hofstadter asks two questions "Why do babies not remember events that happen to them?" and "Why does each new year seem to pass faster than the one before?". He then answers them as follows:

I wouldn't swear that I have the final answer to either one of these queries, but I do have a hunch, and I will here speculate on the basis of that hunch. And thus: the answer to both is basically the same, I would argue, and it has to do with the relentless, lifelong process of chunking — taking "small" concepts and putting them together into bigger and bigger ones, thus recursively building up a giant repertoire of concepts in the mind. How, then, might chunking provide the clue to these riddles? Well, babies' concepts are simply too small. They have no way of framing entire events whatsoever in terms of their novice concepts. It is as if babies were looking at life through a randomly drifting keyhole, and at each moment could make out only the most local aspects of scenes before them. It would be hopeless to try to figure out how a whole room is organized, for instance, given just a keyhole view, even a randomly drifting keyhole view.

Or, to trot out another analogy, life is like a chess game, and babies are like beginners looking at a complex scene on a board, not having the faintest idea how to organize it into higher-level structures. As has been well known for decades, experienced chess players chunk the setup of pieces on the board nearly instantaneously into small dynamic groupings defined by their strategic meanings, and thanks to this automatic, intuitive chunking, they can make good moves nearly instantaneously and also can remember complex chess situations for very long times. Much the same holds for bridge players, who effortlessly remember every bid and every play in a game, and months later can still recite entire games at the drop of a hat.

All of this is due to chunking, and I speculate that babies are to life as novice players are to the games they are learning — they simply lack the experience that allows understanding (or even perceiving) of large structures, and so nothing above a rather low level of abstraction gets perceived at all, let alone remembered in later years. As one grows older, however, one's chunks grow in size and in number, and consequently one automatically starts to perceive and to frame ever larger events and constellations of events; by the time one is nearing one's teen years, complex fragments from life's stream are routinely stored as high-level wholes — and chunks just keep on accreting and becoming more numerous as one lives. Events that a baby or young child could not have possibly perceived as such — events that stretch out over many minutes, hours, days, or even weeks — are effortlessly perceived and stored away as single structures with much internal detail (varying amounts of which can be pulled up and contemplated in retrospect, depending on context). Babies do not have large chunks and simply cannot put things together coherently. Claims by some people that they remember complex events from when they were but a few months old (some even claim to remember being born!) strike me as nothing more than highly deluded wishful thinking.

So much for question number one. As for number two, the answer, or so I would claim, is very similar. The more we live, the larger our repertoire of concepts becomes, which allows us to gobble up ever larger coherent stretches of life in single mental chunks. As we start seeing life's patterns on higher and higher levels, the lower levels nearly vanish from our perception. This effectively means that seconds, once so salient to our baby selves, nearly vanish from sight, and then minutes go the way of seconds, and soon so do hours, and then days, and then weeks [...] (SOURCE)

30 In his theory for how humans might transfer knowledge that could serve to reconfigure the learning space by altering the energy landscape, Bengio [18] notes that:

A richer form of communication, which would already be useful, would require simply naming objects in a scene. Humans have an innate understanding of the pointing gesture that can help identify which object in the scene is being named. In this way, the learner could develop a repertoire of object categories which could become handy (as intermediate concepts) to form theories about the world that would help the learner to survive better. Richer linguistic constructs involve the combination of concepts and allow the agents to describe relations between objects, actions and events, sequences of events (stories), causal links, etc., which are even more useful to help a learner form a powerful model of the environment.

31 Bengio points out that this hypothesis is related to much previous work in cognitive science, such as for example cognitive imitation [183], which has been observed in monkeys, and where the learner imitates not just a vocalization or a behavior but something more abstract that corresponds to a cognitive rule.

32 A meme is an idea, behavior, or style that spreads from person to person within a culture — often with the aim of conveying a particular phenomenon, theme, or meaning represented by the meme [37]. A meme acts as a unit for carrying cultural ideas, symbols, or practices, that can be transmitted from one mind to another through writing, speech, gestures, rituals, or other imitable phenomena with a mimicked theme. SOURCE

33 Here is a sample of references related to Yoshua Bengio's work on using language to expedite learning high-level abstractions by regularizing the deep architectures required to represent such abstractions:

@article{BengioCoRR-12,
author = {Yoshua Bengio},
title = {Evolving Culture vs Local Minima},
journal = {CoRR},
volume = {arXiv:1203.2990},
year = {2012},
abstract = {We propose a theory that relates difficulty of learning in deep architectures to culture and language. It is articulated around the following hypotheses: (1) learning in an individual human brain is hampered by the presence of effective local minima; (2) this optimization difficulty is particularly important when it comes to learning higher-level abstractions, i.e., concepts that cover a vast and highly-nonlinear span of sensory configurations; (3) such high-level abstractions are best represented in brains by the composition of many levels of representation, i.e., by deep architectures; (4) a human brain can learn such high-level abstractions if guided by the signals produced by other humans, which act as hints or indirect supervision for these high-level abstractions; and (5), language and the recombination and optimization of mental concepts provide an efficient evolutionary recombination operator, and this gives rise to rapid search in the space of communicable ideas that help humans build up better high-level internal representations of their world. These hypotheses put together imply that human culture and the evolution of ideas have been crucial to counter an optimization difficulty: this optimization difficulty would otherwise make it very difficult for human brains to capture high-level knowledge of the world. The theory is grounded in experimental observations of the difficulties of training deep artificial neural networks. Plausible consequences of this theory for the efficiency of cultural evolutions are sketched.}
}
@article{BengioCoRR-17,
author = {Yoshua Bengio},
title = {The Consciousness Prior},
journal = {CoRR},
volume = {arXiv:1709.08568},
year = {2017},
abstract = {A new prior is proposed for representation learning, which can be combined with other priors in order to help disentangling abstract factors from each other. It is inspired by the phenomenon of consciousness seen as the formation of a low-dimensional combination of a few concepts constituting a conscious thought, i.e., consciousness as awareness at a particular time instant. This provides a powerful constraint on the representation in that such low-dimensional thought vectors can correspond to statements about reality which are either true, highly probable, or very useful for taking decisions. The fact that a few elements of the current state can be combined into such a predictive or useful statement is a strong constraint and deviates considerably from the maximum likelihood approaches to modeling data and how states unfold in the future based on an agent's actions. Instead of making predictions in the sensory (e.g. pixel) space, the consciousness prior allow the agent to make predictions in the abstract space, with only a few dimensions of that space being involved in each of these predictions. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in the form of facts and rules, although the conscious states may be richer than what can be expressed easily in the form of a sentence, a fact or a rule.}
}
@inproceedings{BengioCOGSCI-14,
author = {Bengio, Yoshua},
title = {Deep learning, Brains and the Evolution of Culture},
booktitle = {Proceedings of the 36th Annual Conference of the Cognitive Science Society Workshop on Deep Learning and the Brain},
publisher = {Cognitive Science Society},
location = {Quebec City, Quebec, Canada},
year = {2014},
}
@inproceedings{BengioCGEC-14,
author = {Bengio, Yoshua},
title = {Deep learning and cultural evolution},
booktitle = {Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation},
publisher = {ACM},
location = {New York, NY, USA},
year = {2014},
abstract = {We propose a theory and its first experimental tests, relating difficulty of learning in deep architectures to culture and language. The theory is articulated around the following hypotheses: learning in an individual human brain is hampered by the presence of effective local minima, particularly when it comes to learning higher-level abstractions, which are represented by the composition of many levels of representation, i.e., by deep architectures; a human brain can learn such high-level abstractions if guided by the signals produced by other humans, which act as hints for intermediate and higher-level abstractions; language and the recombination and optimization of mental concepts provide an efficient evolutionary recombination operator for this purpose. The theory is grounded in experimental observations of the difficulties of training deep artificial neural networks and an empirical test of the hypothesis regarding the need for guidance of intermediate concepts is demonstrated. This is done through a learning task on which all the tested machine learning algorithms failed, unless provided with hints about intermediate-level abstractions.}
}
@article{PascanuetalCoRR-14,
author = {Razvan Pascanu and Yann N. Dauphin and Surya Ganguli and Yoshua Bengio},
title = {On the saddle point problem for non-convex optimization},
journal = {CoRR},
volume = {arXiv:1405.4604},
year = 2014,
abstract = {A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for the ability of these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, and neural network theory, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new algorithm, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep neural network training, and provide preliminary numerical evidence for its superior performance.}
}
@inproceedings{HamricketalCSS-18,
author = {Jessica B. Hamrick and Kelsey R. Allen and Victor Bapst and Tina Zhu and Kevin R. McKee and Joshua B. Tenenbaum and Peter W. Battaglia},
title = {Relational inductive bias for physical construction in humans and machines},
booktitle = {Proceedings of the 40th Annual Conference of the Cognitive Science Society},
year = {2018},
abstract = {While current deep learning systems excel at tasks such as object classification, language processing, and gameplay, few can construct or modify a complex system such as a tower of blocks. We hypothesize that what these systems lack is a "relational inductive bias": a capacity for reasoning about inter-object relations and making choices over a structured description of a scene. To test this hypothesis, we focus on a task that involves gluing pairs of blocks together to stabilize a tower, and quantify how well humans perform. We then introduce a deep reinforcement learning agent which uses object- and relation-centric scene and policy representations and apply it to the task. Our results show that these structured representations allow the agent to outperform both humans and more naive approaches, suggesting that relational inductive bias is an important component in solving structured reasoning problems and for building more intelligent, flexible machines.}
}
@inproceedings{BattagliaetalNIPS-16,
author = {Battaglia, Peter and Pascanu, Razvan and Lai, Matthew and Rezende, Danilo Jimenez and kavukcuoglu, Koray},
title = {Interaction Networks for Learning About Objects, Relations and Physics},
booktitle = {Proceedings of the 30th International Conference on Neural Information Processing Systems},
publisher = {Curran Associates Inc.},
location = {Barcelona, Spain},
year = {2016},
pages = {4509-4517},
abstract = {Reasoning about objects, relations, and physics is central to human intelligence, and a key goal of artificial intelligence. Here we introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Our model takes graphs as input, performs object- and relation-centric reasoning in a way that is analogous to a simulation, and is implemented using deep neural networks. We evaluate its ability to reason about several challenging physical domains: n-body problems, rigid-body collision, and non-rigid dynamics. Our results show it can be trained to accurately simulate the physical trajectories of dozens of objects over thousands of time steps, estimate abstract quantities such as energy, and generalize automatically to systems with different numbers and configurations of objects and relations. Our interaction network implementation is the first general-purpose, learnable physics engine, and a powerful general framework for reasoning about object and relations in a wide variety of complex real-world domains.},
}


34 Here is a very brief summary of the different processes involved in human memory consolidation:

Memory consolidation is a category of processes that stabilize a memory trace after its initial acquisition.[1] Consolidation is distinguished into two specific processes, synaptic consolidation, which is synonymous with late-phase long-term potentiation and occurs within the first few hours after learning, and systems consolidation, where hippocampus-dependent memories become independent of the hippocampus over a period of weeks to years. Recently, a third process has become the focus of research, reconsolidation, in which previously-consolidated memories can be made labile again through reactivation of the memory trace. (SOURCE)

35 An inductive bias allows a learning algorithm to prioritize one solution (or interpretation) over another, independent of the observed data [139]. The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered. In a Bayesian model, inductive biases are typically expressed through the choice and parameterization of the prior distribution. In other contexts, an inductive bias might be a regularization term added to avoid overfitting, or it might be encoded in the architecture of the algorithm itself. SOURCE
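A concrete instance of an inductive bias expressed as a regularization term is ridge regression, where an L2 penalty on the weights corresponds, in the Bayesian view, to a zero-mean Gaussian prior. This is a generic textbook sketch, not tied to any particular system discussed above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: argmin_w ||Xw - y||^2 + lam ||w||^2.

    The penalty lam ||w||^2 encodes the prior belief that small-norm
    weight vectors are preferable, independent of the observed data.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With `lam = 0` the bias vanishes and we recover ordinary least squares; increasing `lam` shrinks the weights toward zero, trading fit to the data against the prior, which is exactly the sense in which the bias prioritizes one solution over another.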

36 In the mammalian brain, information pertaining to sensing and motor control is topographically mapped to reflect the intrinsic structure of that information required for interpretation. This was recognized early in the work of Hubel and Wiesel [106, 107] on the striate cortex of the cat and macaque monkey and in the work of Wilder Penfield [160] developing the idea of a cortical homunculus in the primary motor and somatosensory areas of the brain located between the parietal and frontal lobes of the primate cortex. Such maps have become associated with the theory of embodied cognition.

37 Papers on incorporating episodic memory in dialogue and recent techniques for augmenting dialogue data via active learning:

@inproceedings{SieberandKrennACL-10,
author = {Gregor Sieber and Brigitte Krenn},
title = {Episodic Memory for Companion Dialogue},
booktitle = {Proceedings of the 2010 Workshop on Companionable Dialogue Systems},
publisher = {Association for Computational Linguistics},
pages = {1–6},
year = {2010},
abstract = {We present an episodic memory component for enhancing the dialogue of artificial companions with the capability to refer to, take up and comment on past interactions with the user, and to take into account in the dialogue long-term user preferences and interests. The proposed episodic memory is based on RDF representations of the agent’s experiences and is linked to the agent’s semantic memory containing the agent’s knowledge base of ontological data and information about the user’s interests.}
}
@article{SuetalCoRR-16,
author = {Pei{-}Hao Su and Milica Gasic and Nikola Mrksic and Lina Maria Rojas{-}Barahona and Stefan Ultes and David Vandyke and Tsung{-}Hsien Wen and Steve J. Young},
title = {On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems},
journal = {CoRR},
volume = {arXiv:1605.07669},
year = {2016},
abstract = {The ability to compute an accurate reward function is essential for optimising a dialogue policy via reinforcement learning. In real-world applications, using explicit user feedback as the reward signal is often unreliable and costly to collect. This problem can be mitigated if the user's intent is known in advance or data is available to pre-train a task success predictor off-line. In practice neither of these apply for most real world applications. Here we propose an on-line learning framework whereby the dialogue policy is jointly trained alongside the reward model via active learning with a Gaussian process model. This Gaussian process operates on a continuous space dialogue representation generated in an unsupervised fashion using a recurrent neural network encoder-decoder. The experimental results demonstrate that the proposed framework is able to significantly reduce data annotation costs and mitigate noisy user feedback in dialogue policy learning.},
}
@article{YoungetalCoRR-18,
title = {Integrating Episodic Memory into a Reinforcement Learning Agent Using Reservoir Sampling},
author = {Kenny J. Young and Richard S. Sutton and Shuo Yang},
journal = {CoRR},
volume = {arXiv:1806.00540},
year = {2018},
abstract = {Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. Critically this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. Thus unlike most prior mechanisms for external memory it is feasible to use in an online reinforcement learning setting.}
}
@inproceedings{ZhangetalCASA-16,
author = {Zhang, Juzheng and Thalmann, Nadia Magnenat and Zheng, Jianmin},
title = {Combining Memory and Emotion With Dialog on Social Companion: A Review},
booktitle = {Proceedings of the 29th International Conference on Computer Animation and Social Agents},
publisher = {ACM},
location = {Geneva, Switzerland},
year = {2016},
pages = {1-9},
abstract = {In the coming era of social companions, many researchers have been pursuing natural dialog interactions and long-term relations between social companions and users. With respect to the quick decrease of user interest after the first few interactions, various emotion and memory models have been developed and integrated with social companions for better user engagement. This paper reviews related works in the effort of combining memory and emotion with natural language dialog on social companions. We separate these works into three categories: (1) Affective system with dialog, (2) Task-driven memory with dialog, (3) Chat-driven memory with dialog. In addition, we discussed limitations and challenging issues to be solved. Finally, we also introduced our framework of social companions.}
}
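The reservoir sampling procedure mentioned in the Young, Sutton and Yang abstract is, at its core, the classic Algorithm R for maintaining a uniform fixed-size sample of a stream. Their contribution is a learned, differentiable write process; the sketch below shows only the uniform baseline:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Classic reservoir sampling (Algorithm R).

    After seeing t items, each item is in the k-slot reservoir with
    probability k / t, so memory stays fixed no matter how long the
    stream of past states grows.
    """
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(t)        # uniform index in [0, t)
            if j < k:
                reservoir[j] = item     # replace with probability k / t
    return reservoir
```

In the reinforcement learning setting of the paper, the items would be past agent states, and the uniform replacement rule is what gets replaced by a learned preference for states that prove useful to recall later.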


38 I've been reading Terrence Deacon's latest book "Incomplete Nature: How Mind Emerged from Matter". I had started a month ago thinking it was relevant to some ideas I had about the origins of language. I found it frustrating and several of the reviews agreed with my observation that the first chapter was almost incomprehensible. I returned to it a week ago when I was reading a biography of Ludwig Boltzmann and thinking about thermal equilibrium and entropy while simultaneously "watching" Boltzmann's tragic life unravel leading up to his final — and successful — suicide attempt. I still find Deacon's book difficult reading, but I think I understand his theory and how it relates to consciousness.

If you're interested, check out my attempt to reconstruct and summarize Deacon's theory and see if it makes sense to you: The second law of thermodynamics tells us that the entropy of an isolated system either remains constant or increases with time. We can use this to argue that a system consisting of a container of water at one temperature containing a second container of water at a higher temperature has lower entropy than an otherwise identical system in which the water in both containers is the same temperature. Barring external intervention, the two bodies of water are said to achieve thermal equilibrium.

Now imagine a system consisting of an organism and the environment in which it evolved. Considered as a closed system, the laws of thermodynamics tell us that the entropy of the system as a whole will increase over time. However, the organism has evolved so as to prevent, or at least slow down, this process. Part of its strategy for achieving this is to produce a barrier enclosing the organism and separating it from its surrounding environment. The laws of thermodynamics tell us that it must do work to maintain the integrity of this barrier as well as the essential properties that define its function.

In single-celled organisms, the barrier is the cell membrane along with the molecules that allow it to absorb nutrients, excrete waste, exclude poisons and repel pathogens. Ask any molecular biologist and she'll tell you that this is an unrelenting, incredibly challenging burden for the organism, especially given that the environment of the cell is dynamic, quick to exacerbate any weakness the organism might exhibit and natural selection has had a long time to evolve sophisticated highly-adaptive pathogens. Multi-cellular life is faced with an even more daunting set of challenges in managing the selectively-permeable interface between self and not-self.

Our immune system, the complex neural circuits of our gastrointestinal tract and our gut microbiome are but a few of the many systems our bodies employ in maintaining our physical autonomy. Deacon includes the systems involved in consciousness as playing a key role in directing those activities designed to secure and maintain the autonomy of the self. These activities include monitoring and controlling our psychological and social state with much the same sort of challenges as those of single-celled organisms, albeit with very different dynamics and modes of interaction. So consciousness is an evolved response to a threatened loss of autonomy.

However, Deacon goes much further in relating these phenomena to Shannon's information theory, Boltzmann entropy and thermodynamic work, emphasizing their emergent properties and arguing that the problem of consciousness and the problem of the origin of life are inextricably linked. The arguments based on thermodynamics concerning how systems persistently far from equilibrium can interact and combine to produce novel emergent properties are well done. I got lost in the discussion of homeodynamics, an abstract process that can be realized in forms beyond the scope of classic thermodynamics. It's an interesting read if you're willing to temporarily suspend judgment and entertain his mix of insight and complex flights of fancy.

If you've ever listened to one of his book talks, you know that he often leads with a discussion concerning the invention of zero. He posits that the study of constraints leads to a better understanding of nature and that, in particular, ententions — concepts like zero that refer to or are, in some other way, "about" something not present — are key in understanding the emergent properties of living things. It's an intriguing conceit and useful example of what Daniel Dennett calls an intuition pump. Despite my cynicism, I am somewhat positively disposed toward Deacon's ideas, but then I'm often drawn to complicated mechanisms.

Miscellaneous loose ends:

I had not previously encountered the notion of consciousness as (one way) to deal with entropy and it's a really intriguing idea. It almost seems like a hypothesis that one could test ... As a thought experiment, do you think that Deacon would say that an animal who is unconscious is doing a worse job at staving off entropy than when it is conscious?

I think he would agree with that, noting however that the organism is taking a calculated risk by assuming that the restorative benefits of taking your brain off line to consolidate what you've learned outweigh the potential negative consequences of being in an unprotected or less-readily defended state. I expect Sean Carroll would say consciousness is an emergent property of the laws of physics and the boundary conditions imposed by our planet.

It's interesting that Deacon also views consciousness as playing a key role in modulating the complex neural circuitry involved in things like the microbiome — most people would say that we have been largely unaware/unconscious of the important role that the microbiome plays in our lives, so how could consciousness be involved deeply in it?

I'm not exactly sure how Deacon would answer your question. I didn't read all the later chapters of the book with the same thoroughness as the earlier ones — it's one of those books in which almost every page requires the full concentration of the reader and hence it's more than a little exhausting to read for long periods of time. That said, I'll make an attempt to channel Deacon to answer your question. Beware of speech-recognition transcription errors in the following.

Complex life — even the simplest predecessors of the first fully functional, self-reproducing, single-cell organisms — is fragile, and it had to be a hit-or-miss struggle to bring all the pieces together in a functioning organism. The development of a cell membrane enclosing and protecting the coalition of molecules that would eventually evolve into a cell was clearly an important innovation and a remarkable tour de force by natural selection.

Very early on these proto-cells had to develop some means of distinguishing parts of themselves from the other complex molecules that surrounded them or, worse yet, that found some means of entering the cell, a trick that viruses have refined to a high art. Distinguishing self from non-self was difficult even at this very primitive stage in the evolution of life, and it became even more difficult as multitudes of different life forms developed, all competing for the same scarce resources and naturally seizing upon opportunities to take what they needed from other life forms.

Initially the molecular tools for maintaining autonomy were likely in the form of separate uncoordinated bespoke molecules, but the threats to autonomy evolved quickly to foil the efforts of uncoordinated defenders and so over time it was necessary to deploy ever more complicated mechanisms to fend off attacks. In the arms race that ensued, these defensive mechanisms became too complicated to be implemented with ever larger molecules, however sophisticated their molecular structure. Multi-cellular life was able to differentiate cells to perform specific defensive strategies often involving the coordination of several different cell types.

It became necessary for cells to communicate with one another, and so even more complicated protocols evolved for carrying out complex defensive strategies. Communication and collaboration of cell types within a single organism evolved alongside communication and collaboration between multiple organisms, and the mechanisms for conveying and interpreting information became more complicated in order to cope with the growing complexity of life. Communication via shared molecules was often superseded by other methods of signaling, but some of the most powerful strategies for exploiting another organism involved different forms of deceit, including camouflage and mimicry.

Recall that semiotics — referential signaling as the foundation for symbolic communication and the evolution of language — was at the core of Deacon's earlier book, "The Symbolic Species". Given the central role of signaling and communication in "Incomplete Nature", it should come as no surprise that some of the same themes resurfaced. My evidence for this supposition is somewhat circumstantial so don't quote me on this.

Eventually the complexity crossed a threshold at which it became worth the metabolic cost to build and maintain dedicated computational resources, because the arms race was playing out too quickly for evolution alone to ensure the survival of the species. Throughout earlier wars involving the co-evolution of better boundaries and better tools for breaching them, a proto-sense of self emerged in the form of coalitions of cells whose purpose it was to distinguish self from non-self and rid the organism of the latter. The immune system in mammals is among the most sophisticated of such coalitions.

I expect that Deacon or Daniel Dennett would be comfortable with the idea that just as cybersecurity has become one of the most critical factors in modern warfare, so too natural selection has seen fit to develop offensive and defensive weapons to protect boundaries no longer defined purely in biological terms. If you believe as I do that consciousness is simply an attentional mechanism — no more complicated, from a purely computational perspective, than the attentional machinery that governs visual saccades — then it is not too far-fetched to say that consciousness is a consequence of the evolution of life on a planet sufficiently rich in the opportunities for biological communication and computation.

P.S. Lest I leave you thinking that Deacon has it all together regarding the other aspects of his thesis, you might want to read Daniel Dennett's review of Deacon's book [50] (PDF). Chapters 2 (Homunculi), 3 (Golems) and 4 (Teleonomy) present what I believe to be a deep misunderstanding of the nature and limitations of computing, both carbon- and silicon-based. Dennett takes Deacon to task, but in such a polite and scholarly manner that some readers will not even register his opposition.

In Chapter 5 (Emergence) (pp. 143-181) Deacon explicitly rejects claims that living or mental phenomena can be explained by dynamical systems approaches, arguing that life- or mind-like properties only emerge from a higher-order reciprocal relationship between self-organizing processes. Finally, in the first 12 minutes of this podcast, Buddhist teacher and writer, Joseph Goldstein, speaks about the relationship between consciousness and self, emphasizing the view that consciousness only makes one think there exists a separate self.

39 The theory articulated in the speaker notes of the slides I sent around can be summarized as follows: unimodal sensory information enters the posterior cortex, is combined in multi-modal association areas and abstracted into thought vectors. Attentional circuits in the frontal cortex repeatedly extract low-dimensional abstracts of these thought vectors and then select a subset to highlight and actively maintain for a period on the order of 100 milliseconds. Fewer than a dozen — 5 plus or minus 2, we have been led to believe — can be so highlighted, and the same vector can be repeatedly highlighted to sustain its activation over longer periods.

In parallel with this process — managing the contents of the global workspace in the posterior cortex — another process involving the basal ganglia, hippocampus and portions of the prefrontal cortex extracts low-dimensional probes from the highlighted activations in the global workspace and then deploys them to select content from episodic memory, which it then makes available to augment the activations highlighted in the global workspace, where "making the contents of episodic memory available" might correspond to composing thought vectors. In explaining this to a friend, I gave the following — collaborative and externalized — example:

You're sitting in the living room listening to Keith Jarrett on the stereo. You and your wife are reading. The passage in the book you're reading makes a reference to how we think about the passage of time. A few hundred milliseconds later as you are still attending to the passage you are reminded of having dinner with your mother in which she said something about how time seems to stretch and contract depending upon what she's doing and what sort of things she has on her mind. You ask your wife if she remembers this conversation. She replies that she doesn't remember this particular conversation but that she has a similar reaction when in the course of the day she reflects upon how time passes when she's preparing dinner in the evening or when she's checking her email or reading the news feeds in the morning while having her first cup of tea.
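To make the attend-probe-recall cycle concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy two-dimensional thought vectors, the salience scores, the mean-pooled probe and the dot-product match against episodic memory all stand in for whatever the actual circuits compute.

```python
# Hypothetical sketch of the workspace / episodic-recall cycle described
# above; vectors, scoring and memory layout are illustrative assumptions.

def top_k_highlight(thought_vectors, salience, k=5):
    """Select the k most salient thought vectors (5 plus or minus 2 per the notes)."""
    ranked = sorted(range(len(thought_vectors)), key=lambda i: -salience[i])
    return [thought_vectors[i] for i in ranked[:k]]

def make_probe(highlighted):
    """Low-dimensional abstract: here, simply the element-wise mean."""
    dim = len(highlighted[0])
    return [sum(v[d] for v in highlighted) / len(highlighted) for d in range(dim)]

def recall(probe, episodic_memory):
    """Return the stored episode whose key best matches the probe."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(episodic_memory, key=lambda kv: dot(kv[0], probe))[1]

# Toy run: three thought vectors, a memory of two episodes.
thoughts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
salience = [0.9, 0.1, 0.8]
highlighted = top_k_highlight(thoughts, salience, k=2)
probe = make_probe(highlighted)
memory = [([1.0, 0.2], "dinner conversation about time"),
          ([0.0, 1.0], "morning email routine")]
print(recall(probe, memory))   # -> dinner conversation about time
```

In the scenario above, the passage being read plays the role of the highlighted thoughts and the dinner conversation is the retrieved episode.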

When I first came up with the above computational model incorporating (conscious) awareness and (unconscious) episodic recall it occurred to me — and I found this rather disturbing — how little control I had over my thoughts. This was disturbing because I didn't yet have a good model of how the programmer’s apprentice agent could intervene in this cycle of attending and remembering in order to direct its thoughts so as to solve programming problems or any other problems for that matter.

But then I thought more carefully about how in my meditation practice I have been able to recognize disturbing thoughts without having them adversely influence my emotional state, and then dismiss them by simply letting them pass away. It struck me that, as I have become more adept at controlling my attention during meditation, I am able to recognize when I've allowed a thought to gain some purchase on my mind. I then invoke what I will call my diligent-avoidance subroutine, using the current experience — generally accompanied by some unpleasant or at least unwanted feelings — to reinforce my ability to recognize such nuisance thoughts and let them pass without allowing them to gain any purchase.

41 Following [41, 22], we employ hierarchical planning technology to implement several key components in the underlying bootstrapping and dialog management system. Each such component consists of a hierarchical task network (HTN) representing a collection of hierarchically organized plan schemas designed to run in a lightweight Python implementation of the HTN planner developed by Dana Nau et al [145]:

Hierarchical task network (HTN) planning is an approach to automated planning in which the dependency among actions can be given in the form of hierarchically structured networks. Planning problems are specified in the hierarchical task network approach by providing a set of tasks, which can be:
1. primitive tasks, that roughly correspond to the actions of STRIPS,

2. compound tasks, that can be seen as composed of a set of simpler tasks, and

3. objective tasks, that roughly correspond to the goals of STRIPS, but are more general.

A solution to an HTN problem is then an executable sequence of primitive tasks that can be obtained from the initial task network by decomposing compound tasks into their sets of simpler tasks, and by inserting ordering constraints.
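The decomposition described above can be sketched in a few lines of Python in the spirit of Nau's lightweight HTN planners; the toy domain, task names and method below are invented purely to show the control flow, not taken from the actual system.

```python
# Minimal HTN decomposition (illustrative only; the task and operator
# names are made up, not from the apprentice's actual plan library).

operators = {}   # primitive tasks: name -> function(state, *args) -> new state
methods = {}     # compound tasks: name -> list of functions returning subtasks

def plan(state, tasks):
    """Return a sequence of primitive tasks that achieves the task network."""
    if not tasks:
        return []
    task, *rest = tasks
    name, args = task[0], task[1:]
    if name in operators:
        new_state = operators[name](state, *args)
        if new_state is None:
            return None                      # operator not applicable
        tail = plan(new_state, rest)
        return None if tail is None else [task] + tail
    for method in methods[name]:
        subtasks = method(state, *args)
        if subtasks is not None:
            result = plan(state, subtasks + rest)
            if result is not None:
                return result
    return None

# Toy domain: a compound "greet" task decomposes into two primitive steps.
operators["wave"] = lambda s, who: dict(s, waved=who)
operators["say_hello"] = lambda s, who: dict(s, greeted=who)
methods["greet"] = [lambda s, who: [("wave", who), ("say_hello", who)]]

print(plan({}, [("greet", "programmer")]))
# -> [('wave', 'programmer'), ('say_hello', 'programmer')]
```

Backtracking over alternative methods is what lets a single compound task carry several plan schemas, as in the base-level plan collections described below.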

40 Bootstrapping the programmer's apprentice: Basic cognitive bootstrapping and linguistic grounding

%%% Thu Sep  6 04:35:05 PDT 2018


The programmer's assistant agent is designed to distinguish between three voices: the voice of the programmer, the voice of the assistant's automated tutor and its own voice. We could have provided an audio track to distinguish these voices, but since there are only these three and the overall system can determine when any one of them is speaking, the system simply adds a few bits to each utterance as a proxy for an audio signature, allowing the assistant to make such distinctions for itself. When required, we use the same signature to indicate which of the three speakers is responsible for changes to the shared input and output associated with the fully instrumented IDE, henceforth abbreviated as FIDE — pronounced "/fee/'-/day/", from the Latin meaning: (i) trust, (ii) credit, (iii) fidelity, (iv) honesty. It will also prove useful to further distinguish the voice of the assistant as being in one of two modes: private, engaging in so-called inner speech that is not voiced aloud, and public, meaning spoken out loud for the explicit purpose of communicating with the programmer. We borrow the basic framework for modeling other agents and simple theory-of-mind from Rabinowitz et al [164].
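A minimal sketch of the per-utterance signature, with entirely hypothetical names and encodings: two bits identify the speaker and a flag marks the assistant's private inner speech.

```python
# Hypothetical encoding of the per-utterance speaker signature described
# above: a couple of bits for the speaker plus a private/public flag.
from dataclasses import dataclass

PROGRAMMER, TUTOR, ASSISTANT = 0b00, 0b01, 0b10

@dataclass
class Utterance:
    speaker: int            # 2-bit speaker signature
    text: str
    private: bool = False   # True for the assistant's inner speech

    def audible_to_programmer(self):
        # Inner speech is never voiced aloud.
        return not (self.speaker == ASSISTANT and self.private)

u = Utterance(ASSISTANT, "maybe the loop bound is off by one", private=True)
print(u.audible_to_programmer())   # -> False
```

The same signature would be attached to edits in the FIDE to attribute changes to one of the three parties.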

The bootstrap statistical language model consists of an n-gram embedding trained on a large general-text corpus augmented with programming and software-engineering related text drawn from online forums and transcripts of pair-programming dialog. For the time being, we will not pursue the option of trying to acquire a large enough dialog corpus to train an encoder-decoder LSTM/GRU dialog manager / conversational model [200]. In the initial prototype, natural language generation (NLG) output for the automated tutor and assistant will be handled using hierarchical planning technology leveraging ideas developed in the CMU RavenClaw dialogue management system [22]41, but we have plans to explore hybrid natural language generation by combining hard-coded Python dialog agents corresponding to hierarchical task networks and differentiable dialogic encoder-decoder thought-cloud generators using a variant of pointer-generator networks as described by See et al [175].

Both the tutor and assistant NLG subsystems will rely on a base-level collection of plans — a hierarchical task network (HTN) — that we employ in several contexts, plus a set of specialized plans — an HTN subnetwork — specific to each subsystem. At any given moment, a meta control system [90], in concert with a reinforcement-learning-trained policy, determines the curricular goal constraining the tutor's choice of lesson; this selection is implemented using a variant of the scheduled auxiliary control paradigm described by Riedmiller et al [166]. Having selected a subset of lessons relevant to the current curricular goal, the meta-controller cedes control to the tutor, which selects a specific lesson and a suitable plan to oversee interaction with the agent over the course of the lesson.

Most lessons will require a combination of spoken dialogue and interactive signaling that may include both the agent and the tutor pointing, highlighting, performing edits and controlling the FIDE by executing code and using developer tools like the debugger to change state, set break points and single step the interpreter, but we're getting ahead of ourselves. The curriculum for mastering the basic referential modes is divided into three levels of mastery in keeping with Terrence Deacon's description [39] and Charles Sanders Peirce's (semiotic) theory of signs. The tutor will start at the most basic level, continually evaluating performance to determine when it is time to graduate to the next level or when it is appropriate to revert to an earlier level to provide additional training in order to master the less demanding modes of reference.
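The graduate-or-revert logic might be sketched as follows; the three levels follow Peirce's iconic, indexical and symbolic modes of reference, but the thresholds, window size and promotion rule are placeholders, not values from the actual curriculum.

```python
# Illustrative tutor loop: promote to the next referential mode when recent
# performance is high, revert when it drops. All constants are placeholders.
from collections import deque

LEVELS = ["iconic", "indexical", "symbolic"]

class Tutor:
    def __init__(self, promote_at=0.9, demote_at=0.5, window=20):
        self.level = 0
        self.scores = deque(maxlen=window)
        self.promote_at, self.demote_at = promote_at, demote_at

    def record(self, success: bool):
        self.scores.append(1.0 if success else 0.0)
        if len(self.scores) >= 5:          # wait for enough evidence
            rate = sum(self.scores) / len(self.scores)
            if rate >= self.promote_at and self.level < len(LEVELS) - 1:
                self.level += 1
                self.scores.clear()        # fresh evaluation at the new level
            elif rate <= self.demote_at and self.level > 0:
                self.level -= 1
                self.scores.clear()
        return LEVELS[self.level]

tutor = Tutor()
for _ in range(20):
    tutor.record(True)        # consistent success graduates the apprentice
print(LEVELS[tutor.level])    # -> symbolic
```

Clearing the score window after each transition forces the tutor to re-evaluate performance at the new level before moving again, in either direction.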

Miscellaneous loose ends: Williams et al [209] introduce a related approach to bootstrapping called Hybrid Code Networks (HCNs) that targets dialog systems for applications such as automated technical support and restaurant reservations — see Figure 47 for an overview. Bordes et al [23] and Das et al [36] demonstrate end-to-end dialog systems based on Memory Networks [208] that exhibit promising performance and learn to perform non-trivial operations. See Figure 39 for a simple hierarchical dialog-management plan from [41].

44 Here are some of the key papers that O'Reilly mentioned in his 2018 presentation in class:

%%% (Huang et al, 2013) - O’Reilly
@article{HuangetalJCN-13,
author = {Huang, Tsung-Ren and Hazy, Thomas E. and Herd, Seth A. and O'Reilly, Randall C.},
title = {Assembling Old Tricks for New Tasks: A Neural Model of Instructional Learning and Control},
journal = {Journal of Cognitive Neuroscience},
volume = {25},
number = {6},
year = {2013},
pages = {843-851},
abstract = {We can learn from the wisdom of others to maximize success. However, it is unclear how humans take advice to flexibly adapt behavior. On the basis of data from neuroanatomy, neurophysiology, and neuroimaging, a biologically plausible model is developed to illustrate the neural mechanisms of learning from instructions. The model consists of two complementary learning pathways. The slow-learning parietal pathway carries out simple or habitual stimulus-response (S-R) mappings, whereas the fast-learning hippocampal pathway implements novel S-R rules. Specifically, the hippocampus can rapidly encode arbitrary S-R associations, and stimulus-cued responses are later recalled into the basal ganglia-gated pFC to bias response selection in the premotor and motor cortices. The interactions between the two model learning pathways explain how instructions can override habits and how automaticity can be achieved through motor consolidation.}
}
%%% (Taatgen and Lee, 2003) - Taatgen
@article{TaatgenandFrankHUMAN-FACTORS-03,
author = {Taatgen, Niels and Lee, Frank},
title = {Production Compilation: A Simple Mechanism to Model Complex Skill Acquisition},
journal = {Human Factors},
volume = {45},
year = {2003},
pages = {61-76},
abstract = {In psychology many theories of skill acquisition have had great success in addressing the fine details of learning relatively simple tasks, but can they scale up to complex tasks that are more typical of human learning in the world? In this paper we describe production compilation, a theory of skill acquisition that combines aspects of the theories forwarded by Anderson (1982) and Newell and Rosenbloom (1981), that we believe can model the fine details of learning in complex and dynamic tasks. We use production compilation to model in detail learning in a simulated air-traffic controller task.}
}
%%% (Santoro et al, 2016) - Lillicrap
@inproceedings{SantoroetalICML-16,
author = {Santoro, Adam and Bartunov, Sergey and Botvinick, Matthew and Wierstra, Daan and Lillicrap, Timothy},
title = {Meta-learning with Memory-augmented Neural Networks},
booktitle = {Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48},
year = {2016},
pages = {1842-1850},
abstract = {Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of "one-shot learning." Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.}
}
%%% (Bengio, 2017) - Bengio
@article{BengioCoRR-17,
author = {Yoshua Bengio},
title = {The Consciousness Prior},
journal = {CoRR},
volume = {arXiv:1709.08568},
year = {2017},
abstract = {A new prior is proposed for representation learning, which can be combined with other priors in order to help disentangling abstract factors from each other. It is inspired by the phenomenon of consciousness seen as the formation of a low-dimensional combination of a few concepts constituting a conscious thought, i.e., consciousness as awareness at a particular time instant. This provides a powerful constraint on the representation in that such low-dimensional thought vectors can correspond to statements about reality which are either true, highly probable, or very useful for taking decisions. The fact that a few elements of the current state can be combined into such a predictive or useful statement is a strong constraint and deviates considerably from the maximum likelihood approaches to modeling data and how states unfold in the future based on an agent’s actions. Instead of making predictions in the sensory (e.g. pixel) space, the consciousness prior allow the agent to make predictions in the abstract space, with only a few dimensions of that space being involved in each of these predictions. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in the form of facts and rules, although the conscious states may be richer than what can be expressed easily in the form of a sentence, a fact or a rule.}
}
%%% (Lamme, 2006) - Lamme
@article{LammeTiCS-06,
author = {Lamme, V. A.},
title = {Towards a true neural stance on consciousness},
journal = {Trends in Cognitive Science},
year = {2006},
volume = {10},
number = {11},
pages = {494-501},
abstract = {Consciousness is traditionally defined in mental or psychological terms. In trying to find its neural basis, introspective or behavioral observations are considered the gold standard, to which neural measures should be fitted. I argue that this poses serious problems for understanding the mind-brain relationship. To solve these problems, neural and behavioral measures should be put on an equal footing. I illustrate this by an example from visual neuroscience, in which both neural and behavioral arguments converge towards a coherent scientific definition of visual consciousness. However, to accept this definition, we need to let go of our intuitive or psychological notions of conscious experience and let the neuroscience arguments have their way. Only by moving our notion of mind towards that of brain can progress be made.}
}


43 What if you wanted to recreate a pattern of activity using a sparse probe? Why might you want to do this aside from the incentive of compact storage? Randall O'Reilly — Slide 19 of his presentation in CS373C — answers this question by arguing that sparsity encourages pattern separation that enables rapid binding of arbitrary informational states:

In Slide 18 of the same presentation he identifies the parts of the hippocampal system that enable extreme pattern separation (dentate gyrus), pattern completion (CA3) and stable sparse invertible representation (CA1), and their respective roles in memory encoding and retrieval in the entorhinal cortex by way of interaction with the parietal (dorsal stream) and inferotemporal (ventral stream) cortex.
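The pattern-separation idea can be illustrated with a toy sparse encoder: expand the input through a random projection, then keep only the most strongly driven units. The dimensions, the Gaussian weights and the choice of k below are arbitrary and are not meant as a model of the dentate gyrus.

```python
# Toy illustration of pattern separation: a random expansion followed by a
# top-k winner-take-all yields sparse codes; similar dense inputs tend to
# map to codes with reduced overlap. All dimensions here are arbitrary.
import random

random.seed(0)
DIM_IN, DIM_OUT, K = 20, 200, 10   # expansion plus extreme sparsity

weights = [[random.gauss(0, 1) for _ in range(DIM_IN)] for _ in range(DIM_OUT)]

def sparse_code(x):
    """Indices of the K most strongly driven output units."""
    drive = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return set(sorted(range(DIM_OUT), key=lambda i: -drive[i])[:K])

a = [1.0] * DIM_IN
b = [1.0] * 15 + [0.0] * 5            # 75% identical to a
code_a, code_b = sparse_code(a), sparse_code(b)
print(len(code_a & code_b))           # overlap among the K active units
```

The sparse codes double as compact probes: storing only the K active indices is enough to address an episode later, which is the rapid-binding property O'Reilly emphasizes.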

O'Reilly then proceeds to explain what he likes to call the "three R's of serial processing": (i) reduce binding errors by serial processing, (ii) reuse the same neural tissue across many different situations, thus improving generalization, and (iii) recycle activity throughout the network to coordinate all areas on one thing at a time, as in conscious, attentional awareness.44

42 Bootstrapping the programmer's apprentice: Simple interactive behavior for signaling and editing:

%%% Tue Sep 11 18:46:56 PDT 2018


In the first stage of bootstrapping, the assistant's automated tutor engages in an analog of the sort of simple signaling and reinforcement that a mother might engage in with her baby in order to encourage the infant to begin taking notice of its environment and participating in the simplest forms of communication. The basic exchange goes something like: the mother draws the baby's attention to something and the baby acknowledges by making some sound or movement. This early step requires that the baby can direct its gaze and attend to changes in its visual field.

In the case of the assistant, the relevant changes would correspond to changes in the FIDE or the shared browser window; pointing would be accomplished by altering the contents of FIDE buffers or modifying HTML. Since the assistant has an innate capability to parse language into sequences of words, the tutor can preface each lesson with a short verbal lesson summary, e.g., "the variable 'foo'", "the underlined variable", "the highlighted assignment statement", "the expression highlighted in blue". The implicit curriculum followed by the tutor would systematically graduate to more complicated language for specifying referents, e.g., "the body of the 'for' loop", "the 'else' clause in the conditional statement", "the scope of the variable 'counter'", "the expression on the right-hand side of the first assignment statement".

The goal of the bootstrap tutor is to eventually graduate to simple substitution and repair activities requiring a combination of basic attention, signaling, requesting feedback and simple edits, e.g., "highlight the scope of the variable shown in red", "change the name of the function to 'Increment_Counter'", "insert a 'for' loop with an iterator over the items in the 'bucket' list", "delete the next two expressions", with the length and complexity of the specification gradually increasing until the apprentice is capable of handling code changes that involve multiple goals and dozens of intermediate steps, e.g., "delete the variable 'Interrupt_Flag' from the parameter list of the function declaration and eliminate all of the expressions that refer to the variable within the scope of the function definition".
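One way to represent such lessons is as simple records pairing an instruction with a machine-checkable success test; the fields and the example below are entirely hypothetical.

```python
# Hypothetical lesson record for the bootstrap curriculum: a spoken
# instruction, its curriculum level, and a predicate that checks whether
# the apprentice's edit to the FIDE buffer satisfies the instruction.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Lesson:
    instruction: str                 # what the tutor says
    level: int                       # position in the curriculum
    success: Callable[[str], bool]   # checks the resulting buffer

lesson = Lesson(
    instruction="change the name of the function to 'Increment_Counter'",
    level=2,
    success=lambda buffer: "def Increment_Counter(" in buffer,
)
print(lesson.success("def Increment_Counter(n):\n    return n + 1"))  # -> True
```

A success predicate of this kind is also what lets the tutor score performance automatically when deciding whether to graduate the apprentice to longer specifications.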

Note the importance of an attentional system that can notice changes in the integrated development environment and the shared browser window, of the ability to use recency to help resolve ambiguities, and of basic signals that involve noticing changes in the IDE and acknowledging that those changes were made, as a means of conveying expectations relevant to the ongoing conversation between the programmer and the apprentice. These are certainly subtleties that will have to be introduced gradually into the curricular repertoire as the apprentice gains experience. We are depending on a variant of the approach of Riedmiller et al [166], which will enable us to use the FIDE to gamify the process, evaluating progress at different levels using a combination of general extrinsic reward and policy-specific intrinsic motivations to guide action selection.

Randall O'Reilly mentioned in his class presentation the idea that natural language might play an important role in human brains as an intra-cortical lingua franca. Given that one of the primary roles language serves is to serialize thought, thereby facilitating serial computation with all of its advantages in terms of logical precision and combinatorial expression, projecting a distributed connectionist representation through some sort of autoencoder bottleneck might gain some advantage in combining aspects of symbolic and connectionist architectures. This also relates to O'Reilly's discussion of the hippocampal system and in particular the processing performed by the dentate gyrus and hippocampal areas CA3 and CA1 in generating a sparse representation that enables rapid binding of arbitrary informational states and facilitates encoding and retrieval of episodic memory in the entorhinal cortex.43

Miscellaneous loose ends: [...] thought cloud annotation and serialization — thought cloud fusion and sparse reconstruction using a combination of serialization and predictive coding — variational information bottleneck autoencoder [...] think about reinforcement learning in the case of multi-step plans designed to re-write a program fragment or fix a bug [...] the notion of setting goals and maintaining the motivation necessary to sustain effort over the long term in the absence of reward in the short term [...] see Huang et al [105] Neural Model of Instructional Learning and Control and Taatgen and Lee [189] Model of Complex Skill Acquisition [...]

45 Bootstrapping the programmer's apprentice: Mixed dialogue interleaving instruction and mirroring:

%%% Wed Sep 12 05:34:16 PDT 2018


Every utterance, whether generated by the programmer or the apprentice's tutor, or generated by the apprentice either intended for the programmer or sotto voce for its internal record, has potential future value, and hence it makes sense to record that utterance along with any context that might help to realize that potential at a later point in time. Endel Tulving coined the term episodic memory to refer to this sort of memory. We'll forgo discussion of other types of memory for the time being and focus on what the apprentice will need to remember in order to take advantage of its past experience.

Here is the simplest, stripped-to-its-most-basic-elements scenario outlined in the class notes: (a) the apprentice performs a sequence of steps that effect a repair on a code fragment, (b) this experience is recorded in a sequence of tuples of the form (sₜ, aₜ, rₜ, sₜ₊₁) and consolidated in episodic memory, (c) at a subsequent time, days or weeks later, the apprentice recognizes a similar situation and realizes an opportunity to exercise what was learned in the earlier episode, and (d) a suitably adapted repair is applied in the present circumstances and incorporated into a more general policy so that it can be applied in a wider range of circumstances.

The succinct notation doesn't reveal any hint of the complexity and subtlety of the question. What were the (prior) circumstances — sₜ? What was thought, said and done to plan, prepare and take action — aₜ? What were the (posterior) consequences — rₜ and sₜ₊₁? We can't simply record the entire neural state vector. We could, however, plausibly record the information temporarily stored in working memory, since this is the only information that could have played any substantive role — for better or worse — in guiding executive function.

We can't store everything and then carefully pick through the pile looking for what might have made a difference, but we can do something almost as useful. We can propagate the reward gradient back through the value- / Q-function and then further back through the activated circuits in working memory that were used to select aₜ, and adjust their weights accordingly. The objective in this case is to optimize the Q-function by predicting the state variables that it needs in order to make an accurate prediction of the value of applying action aₜ in sₜ, as described in Wayne et al [204].
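In its simplest tabular form, the record-and-replay scheme in steps (a) through (d) amounts to storing experience tuples and propagating reward back through a Q-update; the states, actions and learning constants below are toy stand-ins, not the apprentice's actual learning rule.

```python
# Generic sketch: consolidate experience tuples into episodic memory, then
# replay them through a tabular Q-learning update. States and actions are
# toy symbols; the apprentice would use learned representations instead.
from collections import defaultdict

episodic_memory = []                  # list of (s, a, r, s_next) tuples
Q = defaultdict(float)                # Q[(state, action)] -> value estimate
ALPHA, GAMMA = 0.5, 0.9
ACTIONS = ["inspect", "edit", "run_tests"]

def record(s, a, r, s_next):
    episodic_memory.append((s, a, r, s_next))

def replay():
    for s, a, r, s_next in episodic_memory:
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# One repair episode: only the final step is rewarded.
record("bug_seen", "inspect", 0.0, "cause_known")
record("cause_known", "edit", 0.0, "patched")
record("patched", "run_tests", 1.0, "fixed")
for _ in range(10):                   # repeated replay propagates the reward
    replay()
print(Q[("bug_seen", "inspect")] > 0.0)   # -> True
```

Replaying the episode repeatedly is what lets the terminal reward propagate back to the earliest steps of the repair, which is the sense in which the gradient flows back through the Q-function.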

Often the problem can be described as a simple Markov process and the state represented as a vector comprising a finite number of state variables, sₜ = ⟨α₀, α₁, α₂, α₃, α₄, α₅, α₆, α₇⟩, with the implicit assumption that the process is fully observable. More generally, the Markov property still holds, but the state is only partially observable, resulting in a much harder class of decision problem known as a POMDP. In some cases, we can finesse the complexity if we can ensure that we can observe the relevant state variables in any given state, e.g., in one set of states it is enough to know one subset of the state variables, say {α₀, α₁, α₂}, while in another set of states a different subset suffices, say {α₃, α₅, α₇}. If you can learn which state variables are required and arrange to observe them, the problem reduces to the fully observed case.

There's a catch however. The state vector includes state variables that correspond to the observations of external processes that we have little or no direct control over as well as the apprehension of internal processes including the activation of subnetworks. We may need to plan for and carry out the requisite observations to acquire the external process state and perform the requisite computations to produce and then access the resulting internal state information. We also have the ability to perform two fundamentally different types of computation each of which has different strengths and weaknesses that conveniently complement the other.

The mammalian brain is optimized to efficiently perform many computations in parallel; however, for the most part it is not particularly effective at dealing with the inconsistencies that arise among those largely independent computations. Rather than relying on estimating and conditioning action selection on internally maintained state variables, most animals rely on environmental cues — called affordances [74] — to restrict the space of possible options and simplify action selection. However, complex skills like programming require complex serial computations in order to reconcile and make sense of the contradictory suggestions originating from our mostly parallel computational substrate.

Conventional reinforcement learning may work for some types of routine programming like writing simple text-processing scripts, but it is not likely to suffice for programs that involve more complex logical, mathematical and algorithmic thinking. The programmer's apprentice project is intended as a playground in which to explore ideas derived from biological systems that might help us chip away at these more difficult problems. For example, the primate brain compensates for the limitations of its largely parallel processing approach to solving problems by using specialized networks in the frontal cortex, thalamus, striatum, and basal ganglia to serialize the computations necessary to perform complex thinking.

%%% Thu Sep 13 15:59:45 PDT 2018


At the very least, it seems reasonable to suggest that we need cognitive machinery at least as powerful as the programs we aspire to have the apprentice generate [70]. We need the neural equivalent of the [CONTROL UNIT] responsible for maintaining a [PROGRAM COUNTER] and the analog of loading instructions and operands into REGISTERS in the [ARITHMETIC AND LOGIC UNIT] and subsequently writing the resulting computed products into other registers or RANDOM ACCESS MEMORY. These particular features of the von Neumann architecture are not essential — what is required is a linguistic foundation that supports a complete story of computation and that is grounded in the detailed — almost visceral — experience of carrying out computations.

A single Q (value) function encoding a single action-selection policy with fixed finite-discrete or continuous state and action spaces isn't likely to suffice. Supporting compiled subroutines doesn't significantly change the picture. The addition of a meta controller for orchestrating a finite collection of separate, special-purpose policies adds complexity without appreciably adding competence. And simply adding language for describing procedures, composing production rules, and compiling subroutines as a Sapir-Whorf-induced infusion of ontological enhancement is — by itself — only a distraction.
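To make concrete what the "single Q function / single policy" baseline amounts to, here is a minimal tabular Q-learning sketch on a five-state chain with a reward only at the far end. The environment, hyperparameters and chain length are all invented for illustration; this is the kind of flat policy the argument above says will not scale to program synthesis:

```python
import random

# Tabular Q-learning on a 5-state chain: start at state 0,
# reward 1.0 only for reaching state 4. One Q function, one policy.
random.seed(0)
N, ACTIONS = 5, (-1, +1)                  # states 0..4, move left/right
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1         # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != N - 1:
        # epsilon-greedy action selection over the single Q function
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), N - 1)    # chain dynamics with walls
        r = 1.0 if s2 == N - 1 else 0.0
        # one-step Q-learning backup
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)]
print(greedy)  # the learned greedy policy moves right: [1, 1, 1, 1]
```

For a fixed, fully observed toy task this works fine; the text's claim is that no amount of tinkering with this flat structure (subroutines, meta-controllers over a few such policies) gets you to complex algorithmic thinking.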

We need an approach that exploits a deeper understanding of the role of language in the modern age — a method of using a subset of natural language to describe programs in terms of narratives, where executing such a program is tantamount to telling the story. Think about how human cognitive systems encode and serialize remembered stories [...] about programs as stories drawing on life experience by exploiting the serial nature of episodic memory [...] about thought clouds that represent a superposition of eigenstates such that collapsing the wave function yields a coherent narrative that serves as a program trace.

Miscellaneous loose ends: [...] classifying intention: learning to categorize tasks and summarize intentions to act [...] confirming comprehension: conveying practical understanding of specific instructions [...] dealing with complex utterances that mix explanation, exhortation and simple instruction [...] parsing arbitrary natural language input into different modalities and routing the resulting annotations to appropriate networks [...] mention Wayne et al [204] and, in particular, review some of the details in Figure 45 regarding MERLIN [...] think about annotated episodic memory as the basis for generating novel proposals for evolving program development.

46. Bootstrapping the programmer's apprentice: composite behaviors corresponding to simple repairs.

%%% Wed Sep 12 05:34:16 PDT 2018


A software design pattern "is a general, reusable solution to a commonly occurring problem within a given context in software design. It is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations. Design patterns are formalized best practices that the programmer can use to solve common problems when designing an application or system" — SOURCE. They are typically characterized as belonging to one of three categories: creational, structural, or behavioral.
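To make the three-way taxonomy concrete, here is a minimal instance of one creational pattern, the factory method. The class and function names are invented for illustration, not drawn from any particular codebase:

```python
# Two concrete products with a common interface (illustrative names).
class JSONWriter:
    def write(self, data):
        return "{...json...}"

class XMLWriter:
    def write(self, data):
        return "<...xml...>"

def make_writer(fmt):
    """Factory method: callers ask for a writer by name instead of
    instantiating a concrete class, so new formats are added in one place."""
    writers = {"json": JSONWriter, "xml": XMLWriter}
    return writers[fmt]()

writer = make_writer("json")
print(type(writer).__name__)  # -> JSONWriter
```

Structural patterns (e.g., adapter, facade) and behavioral patterns (e.g., observer, strategy) have the same flavor: a named, reusable shape for a recurring design problem rather than finished code.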

I would like to believe that such patterns provide clear prescriptions for how to tackle challenging programming problems, but I know better. Studying such patterns and analyzing examples of their application to practical problems is an excellent exercise for both computer science students learning to program, and practicing software engineers wanting to improve their skills. That said, these design patterns require considerable effort to master and are well beyond what one might hope to accomplish in bootstrapping basic linguistic and programming skills. Indeed, mastery depends on already knowing — at the very least — the rudiments of these skills.

%%% Sat Sep 15  3:46:51 PDT 2018


In response to a message I sent to Matt Botvinick regarding the idea that natural language is software that runs on human brains, Matt wrote:

I am very sympathetic to this perspective. Another relevant reference, which I read with very similar thoughts in mind, is this one: Gallistel, C. R., & King, A. P. (2011). Memory and the computational brain: Why cognitive science will transform neuroscience (Volume 6). John Wiley & Sons.

However, I'm not sure I share your view that mental software is always expressed in language (using language in the narrow sense: English, French, etc.). For example, can't one 'hold in mind' a plan of action that is structured like a program but not expressed in language?

I'm willing to concede that mental software is not always expressed in language. For the programmer's apprentice, I'm thinking of encoding what is essentially static and syntactic knowledge about programs and programming using four different representations, and what is essentially dynamic and semantic knowledge in a family of structured representations that encode program execution traces of one sort or another. The four static / syntactic representations are summarized as follows:

• (i) distributed (connectionist) representations of natural language as points in high-dimensional embedding spaces — thought clouds;

• (ii) natural language transcripts of dialogical utterances / interlocutory acts encoded as lexical token streams — word sequences;

• (iii) programs in the target programming language represented as structured objects corresponding to augmented abstract syntax trees (ASTs), where the augmentations correspond to edges representing procedure calls, iteration and recursion, resulting in directed acyclic graphs;

• (iv) hierarchical plans corresponding to subnetworks of hierarchical task networks (HTNs) or, if you like, the implied representation of hierarchical plans encoded in value iteration networks [190] and goal-based policies [83]. I'm also thinking about encoding HTNs as policies using a variation on the idea of options [187] as described in Riedmiller et al [166].
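Representation (iii) is easy to sketch with nothing but the standard-library `ast` module: parse a program, then augment the syntax tree with (caller, callee) edges so that a recursive function contributes an edge back to itself. The edge-set representation is an illustrative choice, not a fixed design:

```python
import ast

# Illustrative source: a recursive function plus a caller.
SOURCE = """
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def main():
    return fib(10)
"""

def call_edges(source):
    """Return (caller, callee) edges extracted from function bodies,
    the 'augmentation' layered on top of the plain AST."""
    tree = ast.parse(source)
    edges = set()
    for fn in ast.walk(tree):
        if isinstance(fn, ast.FunctionDef):
            for node in ast.walk(fn):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    edges.add((fn.name, node.func.id))
    return edges

print(sorted(call_edges(SOURCE)))  # -> [('fib', 'fib'), ('main', 'fib')]
```

The self-edge ('fib', 'fib') is exactly the kind of structure a plain tree cannot express and the augmented representation captures.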

The first entry (i) is somewhat misleading in that any one of the remaining three (ii-iv) can be represented as a point / thought cloud using an appropriate embedding method. Thought clouds are the Swiss Army knife of distributed codes. They represent a (constrained) superposition of possibilities allowing us to convert large corpora of serialized structures into point clouds that enable massively parallel search, and subsequently allow us to collapse the wave function, as it were, to read off solutions by re-serializing the distributed encoding of constraints that result from conducting such parallel searches.

I propose to develop encoders and decoders to translate between the (serial) representations (ii-iv), where only a subset of conversions are possible or desirable given the expressivity of the underlying representation language. I imagine autoencoders with an information bottleneck that take embeddings of natural language descriptions as input and produce an equivalent HTN representation, combining a mixture of (executable) interlocutory and code-synthesis tasks. The interlocutory tasks generate explanations and produce comments and specifications. The code-synthesis tasks serve to generate, repair, debug and test code represented in the FIDE.
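The information-bottleneck idea can be sketched in the linear case, where the optimal autoencoder with a k-dimensional code is just the rank-k SVD projection. The data here is synthetic; a real encoder from utterance embeddings to HTNs would be a trained sequence model, for which this stands in only schematically:

```python
import numpy as np

# Synthetic 16-d observations with true 3-d latent structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))       # hidden 3-d factors
mixing = rng.normal(size=(3, 16))
X = latent @ mixing                      # 200 x 16, exactly rank 3

# Optimal linear autoencoder with a k-d bottleneck = top-k SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
encode = lambda X: X @ Vt[:k].T          # 16-d -> 3-d code
decode = lambda Z: Z @ Vt[:k]            # 3-d code -> 16-d reconstruction

X_hat = decode(encode(X))
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(round(err, 6))  # -> 0.0 (data is exactly rank 3, so lossless)
```

The bottleneck is what forces the code to keep only the structure shared across inputs, which is the property we want the utterance-to-HTN translation to exploit.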

Separately encoded embeddings will tend to evolve independently, frustrating attempts to combine them into composite representations that allow powerful means of abstraction. The hope is that we can use natural language as a lingua franca — a "bridge" language — to coerce agreement among disparate representations by forcing them to cohere along shared, possibly refactored dimensions in much the same way that trade languages serve as an expeditious means of exchanging information between scientists and engineers working in different disciplines or scholars who do not share a native language or dialect.

%%% Sun Sep 16 10:20:49 PDT 2018

The idea of using natural language as the basis for an intracortical lingua franca must have been suggested before. I seem to recall Terrence Deacon mentioning a similar idea in The Symbolic Species [39], and, in his discussion in class, Randy O'Reilly mentioned the idea in passing. In any case, we now have the tools to test related hypotheses.

The general idea is to better exploit tradeoffs involving parallel processing relying on distributed connectionist models and serial processing relying on combinatorial symbolic models — a theme that Randy explores at some length in his 2018 class presentation. He focuses on prefrontal-hippocampal-basal-ganglia circuits, but we don't have to be a slave to biology in coming up with new approaches to solving practical problems involving automated programming and digital assistants.

Miscellaneous loose ends: Relating to Matt's suggestion, I found a PDF including the first nine chapters of Gallistel & King [70]. From what I can ascertain, the material in these chapters consists almost entirely of very basic computer science, formal language theory, basic automata and information theory, with a few excursions into biology and associative networks. From scanning the preface, it appears that the content most closely related to Matt's reference to the book is in Chapters 10-16. This article by Gallistel [69] (PDF) provides a contemporaneous account of his perspective on learning for those not willing to spring for the book.

Gallistel [68] writes in response to a review by John Donahoe [55]: "As Donahoe points out, Shannon’s theory of communication, from which the modern mathematical definition of information comes, is central to our argument. The one-sentence essence of our argument is that: 'The function of memory is to carry information forward in time in a computationally accessible form.'" The preface provides a four-page summary of the last seven chapters focusing on properties of the neural substrate and forwarding a thesis about the fundamental basis / properties of memory which, as the authors emphatically point out, is not synaptic plasticity. They conclude the preface with the statement:

We do not think we know what the mechanism of an addressable read / write memory is, and we have no faith in our ability to conjecture a correct answer. We do, however, raise a number of considerations that we believe should guide thinking about possible mechanisms. Almost all of these considerations lead us to think that the answer is most likely to be found deep within neurons, at the molecular or sub-molecular level of structure. It is easier and less demanding of physical resources to implement a read / write memory at the level of molecular or sub-molecular structure. Indeed, most of what is needed is already implemented at the sub-molecular level in the structure of DNA and RNA.