During the last few years, a new approach to language processing has started to emerge, which has become known as “Data Oriented Parsing” or “DOP”. This approach embodies the assumption that human language comprehension and production work with representations of concrete past language experiences, rather than with abstract grammatical rules. The models that instantiate this approach therefore maintain corpora of linguistic representations of previously occurring utterances. New utterance-representations are constructed by freely combining partial structures from the corpus. A probability model is used to choose, from the collection of different structures of different sizes, those that make up the most appropriate representation of an utterance.
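The core mechanism described above can be illustrated with a toy sketch. This is a minimal illustration, not the book's implementation: the tiny fragment inventory, the tuple encoding of trees, and the helper names (`fragment_prob`, `substitute`, `derive`) are all assumptions made for the example. The one grounded idea it shows is the DOP1 probability model: a derivation combines corpus fragments by substitution, and its probability is the product of each fragment's relative frequency among corpus fragments with the same root label.

```python
# Toy sketch of Data-Oriented Parsing (illustrative, not the book's code).
# Trees are nested tuples: (root_label, child, child, ...); a 1-tuple like
# ("NP",) is an open substitution site; bare strings are words.
from collections import Counter

# A hypothetical corpus of fragments extracted from past analyses.
corpus = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "it"))),
    ("NP", "she"),
    ("NP", "it"),
    ("S", ("NP",), ("VP", ("V", "saw"), ("NP",))),  # NP slots left open
]

counts = Counter(corpus)
root_totals = Counter(t[0] for t in corpus)

def fragment_prob(frag):
    """Relative frequency of frag among corpus fragments with the same root."""
    return counts[frag] / root_totals[frag[0]]

def substitute(tree, frag):
    """Plug frag into the leftmost open site whose label matches frag's root.

    Returns (new_tree, substituted?).
    """
    if not isinstance(tree, tuple):
        return tree, False          # a word: nothing to substitute
    if len(tree) == 1 and tree[0] == frag[0]:
        return frag, True           # open site with matching label
    children, done = [], False
    for child in tree[1:]:
        if not done:
            child, done = substitute(child, frag)
        children.append(child)
    return (tree[0],) + tuple(children), done

def derive(fragments):
    """Compose fragments left to right; probability = product of fragment probs."""
    tree, p = fragments[0], fragment_prob(fragments[0])
    for frag in fragments[1:]:
        tree, ok = substitute(tree, frag)
        assert ok, "no open substitution site for " + frag[0]
        p *= fragment_prob(frag)
    return tree, p
```

For instance, deriving “she saw it” from the open S-fragment plus the two NP fragments yields the full tree with probability (1/2) x (1/2) x (1/2) = 1/8 under this toy inventory. Note that a real DOP model extracts *all* subtrees of the corpus trees as fragments; here the inventory is listed by hand to keep the sketch short.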

In this book, DOP models for several kinds of linguistic representations are developed, ranging from tree representations and compositional semantic representations to attribute-value representations and dialogue representations. These models are studied from a formal, linguistic and computational perspective and are tested with available language corpora. The main outcome of these tests suggests that the productive units of natural language cannot be defined in terms of a minimal set of rules (or constraints or principles), as is usually attempted in linguistic theory, but need to be defined in terms of a large, redundant set of previously experienced structures with virtually no restriction on their size and complexity. I will argue that this outcome has important consequences for linguistic theory, leading to a new notion of language competence. In particular, it means that the knowledge of a speaker/hearer should be understood not as a grammar, but as a statistical ensemble of language experiences that changes slightly every time a new utterance is processed.

Rens Bod is a researcher and lecturer in computational linguistics at the Institute for Logic, Language and Computation at the University of Amsterdam.

- Preface
- 1 Introduction: what are the productive units of natural language?
- 1 A probabilistic approach to language
- 2 Stochastic grammars and the problem of productive unit size
- 3 The Data-Oriented Parsing framework: productivity from examples
- 4 Evaluation of DOP models
- 5 Overview of this book

- 2 An experience-based model for phrase-structure representations
- 1 Representations
- 2 Fragments
- 3 Composition operations
- 4 Probability calculation

- 3 Formal Stochastic Language Theory
- 1 A formal language theory of stochastic grammars
- 2 DOP1 as a Stochastic Tree-Substitution Grammar
- 3 A comparison between Stochastic Tree-Substitution Grammar and Stochastic Context-Free Grammar
- 4 Other stochastic grammars
- 4.1 Stochastic History-Based Grammar (SHBG)
- 4.2 Stochastic Lexicalized Tree-Adjoining Grammar (SLTAG)
- 4.3 Other stochastic lexicalized grammars

- 5 Open questions

- 4 Parsing and disambiguation
- 1 Parsing
- 2 Disambiguation
- 2.1 Viterbi optimization is not applicable to finding the most probable parse
- 2.2 Monte Carlo disambiguation: estimating the most probable parse by sampling random derivations
- 2.3 Cognitive aspects of Monte Carlo disambiguation

- 5 Testing the model: can we restrict the productive units?
- 1 The test environment
- 2 The base line
- 3 The impact of overlapping fragments
- 4 The impact of fragment size
- 5 The impact of fragment lexicalization
- 6 The impact of fragment frequency
- 7 The impact of non-head words
- 8 Overview of the derived properties and discussion

- 6 Learning new words
- 1 The model DOP2
- 2 Experiments with DOP2
- 3 Evaluation: what goes wrong?
- 4 The problem of unknown-category words

- 7 Learning new structures
- 1 The problem of unknown structures
- 2 Good-Turing: estimating the population frequencies of (un)seen types
- 3 Using Good-Turing to adjust the frequencies of subtrees
- 4 The model DOP3
- 5 Cognitive aspects of DOP3
- 6 Experiments with DOP3

- 8 An experience-based model for compositional semantic representations
- 1 Incorporating semantic interpretation
- 1.1 Assuming surface compositionality
- 1.2 Not assuming surface compositionality: partial annotations
- 1.3 The probability model of semantic DOP

- 2 Extending DOP to discourse and recency

- 9 Speech understanding and dialogue processing
- 1 The OVIS corpus: trees enriched with compositional frame semantics
- 2 Using the OVIS corpus for data-oriented semantic analysis
- 3 Extending DOP to dialogue context: context-dependent subcorpora
- 4 Interfacing DOP with speech
- 5 Experiments

- 10 Experience-based models for non-context-free representations
- 1 A DOP model for Lexical-Functional representations
- 1.1 Representations
- 1.2 Fragments
- 1.3 The composition operation
- 1.4 Probability models

- 2 Illustration and properties of LFG-DOP

- Conclusion: linguistics revisited
- References
- Index
