NSF Grant No. BCS-0642752 (Christopher Potts)

Year 2 activities and findings: Brief report

Research and education activities

The focus of year 2 was gathering, interpreting, and visualizing quantitative information relating to expressive content cross-linguistically. The project collected a total of eight corpora in four languages. Six of those corpora have a uniform format, which means that researchers can do comparative work with them easily, and also pool them into larger, richer collections.

The major corpus release was in January: The UMass Amherst Linguistics Sentiment Corpora are drawn from over 700,000 online product reviews in Chinese, English, German, and Japanese. The project researchers downloaded these from the Web in their raw HTML format, converted them to an XML format to make research quicker and more accurate, and then released them as tables of word and phrase counts. The texts themselves are mostly informal, unpolished prose, so they are rich in the highly emotive language that drives this grant. Led by Potts, the project RAs were involved in all stages of collection, formatting, and distribution. The resulting collection is the first very large corpus to be released on the Semantics Archive, and it is now in use by both theoretical and computational linguists.

There were two smaller, more focused corpus releases as well. The collection Wait a minute! What kind of discourse strategy is this? provides 439 annotated examples of speakers uttering "Wait a minute!" to question or challenge an implicit assumption behind something that someone in the discourse has just said. This discourse strategy is often linked with presuppositions, but these data suggest that it is used more widely, to target all kinds of subtle, implicit content.

Embedded appositives is an annotated collection of 278 sentences containing appositives embedded syntactically in the complement of propositional attitude predicates and verbs of saying, drawn from 177 million words of novels, newspaper articles, and TV transcripts. In late 2007, the journal Linguistics and Philosophy published a review article on Potts's 2005 book The Logic of Conventional Implicatures (Amaral, Patricia, Craige Roberts, and E. Allyn Smith 2007. Review of The Logic of Conventional Implicatures by Chris Potts. Linguistics and Philosophy 30(6): 707–749). The article calls into question one of the central claims of Potts's book, namely, that appositive content is always a speaker commitment. The purpose of this collection is to provide a quantitative perspective on these issues. Potts did the initial collection and annotation, and he has since been working with Jesse A. Harris to make this resource more reliable.

Both the Embedded appositives corpus and the Wait a minute corpus are available on the Net, and Potts designed flexible search tools for exploring them in a Web browser. Both corpus releases were advertised on the Semantics Archive as well.

As in year 1, the project researchers reported often to the research community on the results of the project.

Findings

This was a breakthrough year for the project in terms of research results and intellectual outreach. The primary breakthrough was in the area of data-gathering methods. The project participants developed and released a collection of multi-million-word corpora in Chinese, English, German, and Japanese, and they also found new ways of taking advantage of existing corpus resources. This in turn led to new techniques for studying the distribution of expressives statistically, and for visualizing those distributions in a way that highlights their semantic denotations and their pragmatic effects in context.

Each of the approximately 700,000 reviews in the UMass Amherst Linguistics Sentiment Corpora carries a star rating from one to five. As one might expect, the language in the extreme rating categories (one star and five stars) is significantly more emotional than the language in the middle-of-the-road reviews. The vast majority of expressives are at the extreme ends. If we array the rating categories along the x-axis of a graph and put frequencies on a log scale along the y-axis, then the distribution of expressives is typically U-shaped: the frequency is highest at the extremes and drops off almost to nothing in the middle. This is a quantitative perspective on the heightened emotion that expressives convey.
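
The U-shape described above can be illustrated with a minimal sketch in Python. This is not the project's code, and the counts below are invented purely for illustration rather than taken from the corpora. The function names log_relative_frequencies and is_u_shaped are hypothetical, and the test for a U-shape (both extreme categories above every middle category) is one simple assumption about how the shape might be checked.

```python
import math

# Hypothetical counts, invented for illustration only (NOT corpus data):
# occurrences of one word per star rating, and total tokens per star rating.
word_counts = {1: 420, 2: 95, 3: 40, 4: 110, 5: 610}
total_tokens = {1: 1_000_000, 2: 800_000, 3: 900_000, 4: 1_500_000, 5: 3_000_000}


def log_relative_frequencies(counts, totals):
    """Return log10(count / total) for each rating category, keyed by rating."""
    return {r: math.log10(counts[r] / totals[r]) for r in sorted(counts)}


def is_u_shaped(log_freqs):
    """True if both extreme categories are higher than every middle category."""
    ratings = sorted(log_freqs)
    middle = [log_freqs[r] for r in ratings[1:-1]]
    lowest_extreme = min(log_freqs[ratings[0]], log_freqs[ratings[-1]])
    return lowest_extreme > max(middle)


log_freqs = log_relative_frequencies(word_counts, total_tokens)
for rating, value in log_freqs.items():
    print(f"{rating} stars: log10 relative frequency = {value:.3f}")
print("U-shaped:", is_u_shaped(log_freqs))
```

Dividing by the total token count in each category matters because the categories differ greatly in size; raw counts alone would mostly reflect how many reviews each rating attracts.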

Potts and Schwarz wrote a suite of tools in the statistical programming language R for systematically exploring the distribution of words and phrases across these rating categories. This permits fine-grained comparisons between the distribution of different items, and it provides a way of giving a fairly complete answer to the question of what other kinds of words and phrases have the same distributional properties as expressives. Where there is distributional similarity, there is likely to be pragmatic similarity as well, so this approach provides a powerful way of uncovering pragmatic regularities and correspondences.
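
One way to picture a comparison of this kind is sketched below in Python. The project's tools were written in R and are not reproduced here; this sketch only assumes that each item is summarized by its relative frequency in each rating category and that two items are compared with total variation distance. The names rating_profile and profile_distance, the item labels, and all counts are hypothetical.

```python
# Hypothetical data, invented for illustration only (NOT corpus data).
totals = {1: 100_000, 2: 100_000, 3: 100_000, 4: 100_000, 5: 100_000}
item_a = {1: 50, 2: 10, 3: 5, 4: 10, 5: 50}   # concentrated at the extremes
item_b = {1: 40, 2: 12, 3: 6, 4: 12, 5: 30}   # also concentrated at the extremes
item_c = {1: 5, 2: 20, 3: 30, 4: 20, 5: 5}    # concentrated in the middle


def rating_profile(counts, totals):
    """Relative frequency per rating, rescaled so the values sum to 1."""
    ratings = sorted(totals)
    rel = [counts.get(r, 0) / totals[r] for r in ratings]
    z = sum(rel)
    return [x / z for x in rel]


def profile_distance(p, q):
    """Total variation distance between two profiles (0 means identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


pa, pb, pc = (rating_profile(c, totals) for c in (item_a, item_b, item_c))
print("a vs b:", round(profile_distance(pa, pb), 3))
print("a vs c:", round(profile_distance(pa, pc), 3))
```

Under these assumptions, items with small mutual distance are candidates for sharing distributional, and hence possibly pragmatic, properties. Other distance measures would serve equally well for a sketch like this.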

An immediate result of this statistically informed perspective was a new connection between expressives like 'damn' and exclamative constructions like 'What big eyes you have!'. This in turn led to connections between exclamatives and intensives like 'totally' as in 'She could totally get that job'. Potts and Schwarz wrote a paper reporting on these techniques and results, Exclamatives and heightened emotion: Extracting pragmatic generalizations from large corpora, which is available from the project website and which is under review at the open-access journal Semantics & Pragmatics.

With these new corpora, the project participants are also able to revisit and refine some of the cross-linguistic generalizations explored in year 1. Using abstract statistical profiles, one can automatically identify and extract new expressives from large data sets. The U-shaped distribution of exclamatives is not the only important one. A J-shaped distribution shows a bias for positivity, but with fairly extensive (often ironical) negative uses. A Reverse J is the hallmark of an expression with a negative bias and some positive uses. The Turned U is a sort of anti-expressive, primarily used in the middle categories (e.g., somewhat, but). And so forth. These shapes pick out important classes of expressions in all four of the languages studied in detail so far. Some initial results of this work are described in the paper The pragmatics of expressive content: Evidence from large corpora, co-authored by Davis, Constant, Potts, and Schwarz.
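
The four shapes named above can be made concrete with a rough heuristic, sketched here in Python. This is an illustration under stated assumptions and not the statistical profiling used in the project: the function classify_shape, the balance threshold of 0.75, and the example profiles are all hypothetical. Each profile lists relative frequencies for the rating categories from one star to five stars.

```python
def classify_shape(profile, balance=0.75):
    """Assign a coarse shape label to a frequency profile over rating categories.

    Assumed heuristic: compare the two extreme categories with the middle ones.
    `balance` is an arbitrary threshold for treating the extremes as roughly equal.
    """
    low, high = profile[0], profile[-1]
    middle = profile[1:-1]
    if max(middle) > max(low, high):
        return "Turned U"          # peak in the middle categories
    if min(low, high) > min(middle):
        if min(low, high) / max(low, high) >= balance:
            return "U"             # both extremes high and roughly balanced
        return "J" if high > low else "Reverse J"
    return "other"                 # e.g. a steady rise or fall


# Hypothetical profiles, invented for illustration only (NOT corpus data).
examples = {
    "balanced extremes": [0.40, 0.08, 0.04, 0.08, 0.40],
    "positive bias": [0.20, 0.05, 0.05, 0.15, 0.55],
    "negative bias": [0.55, 0.15, 0.05, 0.05, 0.20],
    "middle-heavy": [0.0625, 0.25, 0.375, 0.25, 0.0625],
}
for name, prof in examples.items():
    print(f"{name}: {classify_shape(prof)}")
```

A realistic classifier would need to account for sampling noise in low-frequency items; this sketch only shows how the shape names map onto the relative heights of the categories.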

Potts has also studied expressives in a large collection of email messages (the Enron corpus). That work suggests that swearing is a kind of in-group linguistic activity. In the collection, swearing in messages is basically confined to people who have exchanged a lot of messages with each other. This makes intuitive sense: swearing is risky behavior, in the sense that some people react very badly to it. Thus, one really wants to know one's addressee before taking this risk. Potts also looked at the distribution of intensives in a very large collection of posts on political weblogs. He found that spikes in usage are correlated with events that the contributors regarded as exciting or dramatic.

The most likely focus of year 3 will be on finding ways to incorporate these results into theoretical work in lexical semantics and pragmatics. Many questions in this area remain open, but the basic path seems clear. If pragmatic inference is a product of the interaction between speaker and hearer expectations in context, then these statistical profiles tell us volumes about the nature and relative reliability of those expectations. The dramatic U-shape of an exclamative is an indication that a speaker who uses one is in a heightened emotional state: loosely speaking, the speaker loves or loathes what is under discussion at that moment. As hearers, we are attuned to this information. And speakers know that hearers are attuned to it. Thus, out of the frequency distributions arises a characterization of what this expression signals reliably.

At present, project participants are in the process of collecting and exploring additional data sets and working with independently created corpora as well. Because these applications of corpus methodologies are a recent development, one not anticipated at the time of the initial project proposal, there are limited resources available for obtaining new data sets (which are often very expensive due to the resources required to create them). This has slowed progress somewhat, but it has also encouraged the group to be creative in taking advantage of freely available Web resources and open-source corpus efforts.