An introduction to the Stuttgart Corpus Workbench, CQP, & XKWIC

By Susanne Riehemann (ed. by Florian Jaeger)

The Stuttgart Corpus Workbench is a collection of powerful tools for searching prepared corpora, sorting the results, creating frequency lists, etc. This page gives an introduction to the Stuttgart Corpus Workbench, summarizes what you have to do to use the Workbench, provides some sample queries, and points out some known problems.
Tip-1: See also the Stuttgart Corpus Workbench Web Page and be sure to check out Jeanette Pettibone's CQP Manual (one of the reasons why the current page is not really necessary ;-).

Tip-2: Be sure to check out this nice introductory case study by Roger Levy (PDF file; ~115KB). In only 1.5 pages you get a pretty good idea of the potential of CQP, and along the way you even get an introduction to some of the relevant syntax.

Tip-3: You may also find the CQP Demos very useful.

Tip-4: The following site offers a comprehensive comparison of XKwic/CQP vs. WordSmith.

Note: This page assumes that you are logged in on turing, since the Workbench does not yet work on the Leland system.

Getting started

Before you can use the Stuttgart Corpus Workbench tools with the North American News corpus for the first time, you'll have to add the following three lines to your .cshrc on turing (although it seems that the IMS Workbench is now also installed on AFS):
setenv LLQUERY_LOCAL_CORP_DIR somedirectory
setenv CQP_LOCAL_CORP_DIR somedirectory

You can choose any somedirectory as long as there is enough space, e.g. make a subdirectory in /tmp ("mkdir /tmp/myusername"). Then add the following path to your PATH variable ('echo $PATH' shows you what is included in your PATH):
You can do this by typing (which will append the above-mentioned path to your PATH variable):
[ Getting started | Two interfaces | Selecting a corpus | Queries | Subcorpora | Saving the results | Sorting | Macros | Problems | top ]

Two interfaces to the Stuttgart Corpus Workbench

There are two ways of accessing the data:
All the commands you can type at the CQP command line should also work in the XKWIC "query input" box (you'll have to click on "start query" instead of hitting return), although for some things (like selecting a corpus) there is also a simpler way of doing it in XKWIC. There are some things that you can do only in XKWIC, such as clicking on a line in the result to see more context in a separate box, sorting the result, or selecting a previous query from the query history.

Selecting an available corpus

To select a corpus, type the name of the corpus followed by a semicolon. Corpora have to be specially prepared and indexed in order to be used with the Stuttgart Corpus Workbench. The corpora available for use with the Corpus Workbench are (please be aware that this information may be outdated; if you have questions, please ask the Corpus TA):
These correspond to the plain text files in
Formulating a query

Before you start querying, you need to decide how much context you want to have available, e.g. if you want one sentence before and one sentence after the sentence that matches your query:
Some example queries to get familiar with the syntax are given below:
returns all occurrences of "pull" and "strings" in the same sentence

Creating subcorpora to speed up searches

If you want to look for a sequence of frequent words like "There|there" "is" "every", you can start by doing an efficient meet query (see "Formulating a query") on parts of it and then run the slower query just on the result:
isevery=Last;
isevery;
"There|there" "is" "every";

There are many other things you can do with "subcorpora" - see the documentation.

Saving the result

In XKWIC there is a menu option for saving the results of your query. In CQP, you use:
If you want to save all the sentences matching your query into a file without any of the context around them, you need to set the context to one sentence and expand the match to one sentence:
MU (meet "pull" "strings") expand to s;
cat Last > "filename";

This only seems to work in CQP and not in XKWIC (obviously, once you "expand to s" you no longer have a KWIC format).

XKWIC: Sorting the result

If you are using XKWIC, you have the option of sorting your result in various ways, e.g. in this case by the first word following "there is every". Because the first word in the match ("there") is position 0, the position you're interested in is 3, so select first sort column=3, last sort column=4. To sort by the word before the match, use first sort column=-1, last sort column=-2. For more details, see the documentation.

CQP: Using macros

If you are not a graphical interface person, or you need to run many similar, complex queries, you might like the ability to run the system from the command line with macro files. For example, you can put the following into a macro file called "give-up-macro":
set context 1 s;
MU (meet (union (union (union (union "give" "gives") "gave") "given") "giving") "up" 1 2);

and then use the command:
This will return all occurrences of the forms of "give" followed by "up" with at most one other word intervening. For more examples, look at the documentation. If you are interested in the frequency of words with a particular characteristic, such as ending in "ive", you can type something like:
See also the Stuttgart Corpus Workbench Web Page

Known problems

There are a lot of duplicate examples. In some cases this is because press releases are repeated in slightly varied form and/or in different papers, but I don't know whether that accounts for all of the duplicates. In any case, I don't know whether these duplicates can be deleted automatically with these tools - I've always used Unix sort & uniq for this purpose. (There is a button for manually deleting all selected sentences in XKWIC, which may be an option for a small search result.)

When I inserted the sentence boundary markers I missed some of the SGML tags, so you get junk appended to some of your sentences even with the context set to "1 s". Also, my sentence boundary detection from punctuation marks & a list of abbreviations isn't perfect, so there are some incomplete sentences and some missed sentence boundaries. Since it took me weeks to install the corpus (mostly because turing had neither enough memory nor a large enough partition), I'm not going to fix this.
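
The sort & uniq approach to duplicates mentioned above could look roughly like the following. This is only a sketch under assumptions: the file names (results.txt, results.dedup.txt, results.dupcounts.txt) are hypothetical, and the three sample sentences are made-up stand-ins for a file saved from CQP with cat Last > "filename";. It removes only exact duplicate lines; examples repeated in slightly varied form will survive.

```shell
#!/bin/sh
# Hypothetical file name: stands for output saved from CQP,
# e.g. with: cat Last > "results.txt";
results="results.txt"

# Made-up stand-in data so the sketch runs on its own.
printf '%s\n' \
  'He had to pull a few strings .' \
  'She knows how to pull strings .' \
  'He had to pull a few strings .' > "$results"

# sort brings identical lines next to each other;
# uniq then keeps a single copy of each run of identical lines.
sort "$results" | uniq > results.dedup.txt

# Optional: list only the lines that occurred more than once,
# each prefixed with its count (uniq -c puts the count in field 1).
sort "$results" | uniq -c | awk '$1 > 1' > results.dupcounts.txt
```

Note that sorting changes the order of the examples; if the original corpus order matters, keep the unsorted file as well.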