Academic Text Service (ATS)

380 Meyer Library
email: ats@lists.stanford.edu
phone: 725-3163



Obtaining/Creating E-Texts

Obtaining E-Texts

The number of sources for electronic texts is growing. While this is not the place to expand on all the collection development considerations for e-texts, selecting them involves much the same criteria as selecting their paper-based equivalents: the provenance of the text, the reliability of the publisher, the quality of the editing, and the fitness of the material for the local program all come into play. Format is also notable: the library and publishing communities are coming to regard Standard Generalized Markup Language (SGML), and the Text Encoding Initiative's (TEI) work in this area, as standards, but out of necessity you may need to support other formats as well, at least for the next few years.

Project Gutenberg texts are generally useful for getting 'digital dirt' under one's fingernails; otherwise they fail most of the tests of provenance, reliability, editing, and format mentioned above. The Oxford Text Archive's catholic approach of collecting all that it is given also results in anomalies, but much of its holdings pass the quality tests mentioned, and an increasing number of OTA offerings come in TEI-compliant SGML formats. Intelex and Chadwyck-Healey are commercial vendors of note offering a wide range of material in SGML, with Intelex specializing in philosophy and Chadwyck-Healey in large, expensive sets of heavily encoded 'great works.'


Creating E-Texts

The most time-honored method of creating electronic texts is direct keyboarding. For many forms of text, particularly those older than about 1800, it is currently the only way to render the text electronically. Texts from the late nineteenth and twentieth centuries can often be created in a series of steps: the page is scanned, and the scanned image is passed to an optical character recognition (OCR) program, which 'reads' the text and renders it as an ASCII document. Scanning for the purpose of OCR is usually straightforward and can be made easier by disbinding the item (if possible) and using document feeders.
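
As a minimal sketch of the scan-and-OCR step, assuming the page has already been scanned to an image file: the example below uses Python with the Pillow and pytesseract libraries (neither is mentioned above, and any OCR engine could stand in); the file names are hypothetical.

    # Minimal sketch: render a scanned page image as plain ASCII text.
    # Assumes the Tesseract engine plus the pytesseract and Pillow
    # libraries are installed; file names are hypothetical.
    from PIL import Image
    import pytesseract

    page_image = Image.open("page_001.tif")              # the scanned page
    raw_text = pytesseract.image_to_string(page_image)   # OCR 'reads' the page

    with open("page_001.txt", "w") as out:
        out.write(raw_text)                               # plain-text rendering

The output still requires the proofreading and error checking described below.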


Caveats on E-Text Creation

The problems with OCR and its accuracy rate are legion (even a 99% character-level accuracy rate means that an average page of a few thousand characters will contain 20 to 30 errors), and they increase as the quality of the copy decreases. For detailed information about OCR, see the University of Maryland's Document Image Understanding and Character Recognition Server.
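
The arithmetic behind that figure is simple; the character count below is an assumption about a typical printed page, not a figure from the original.

    # Rough arithmetic behind the '20 to 30 errors per page' estimate.
    chars_per_page = 2500          # assumed size of a typical printed page
    accuracy = 0.99                # character-level OCR accuracy
    errors = round(chars_per_page * (1 - accuracy))
    print(errors)                  # 25, within the 20-30 range cited above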

Most OCR packages available in this country are not attuned to languages other than English, and this too can reduce the accuracy rate. Using OCR packages in conjunction with word-processing systems that support multi-language dictionaries is critical. Other errors can be caught by running concordancing or indexing programs and then paying careful attention to the 'singletons', word forms that occur only once, since these are likely to be errors. In the end, careful proofreading by native or near-native speakers of the language of the text you are converting is essential.
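
A rough sketch of the 'singleton' check follows; the input file name is hypothetical, and the word-splitting pattern is an assumption that would need adjusting for other languages.

    # Minimal sketch: list word forms that occur only once in an OCR'd text.
    # Such singletons are frequently OCR misreadings worth proofreading.
    import re
    from collections import Counter

    with open("page_001.txt") as f:
        words = re.findall(r"[A-Za-z']+", f.read().lower())

    counts = Counter(words)
    for word in sorted(w for w, n in counts.items() if n == 1):
        print(word)

A concordancing or indexing package will do the same job with more context around each hit; this only shows the principle.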

A production environment for electronic texts supports their creation through keyboarding (where necessary) and through scanning and OCR for the basic rendering. Many aspects of textual mark-up can be left to experienced 'taggers' after the basic rendering, but some features (front and back matter, for example) are most easily captured during the physical process of scanning and proofreading.


Format Considerations

If you are creating and marking up texts, the emerging standard is the TEI flavor of SGML, and it should be considered carefully. To be sure, other mark-up schemes have been used in the past (COCOA, for example), and there are clearly areas where the TEI guidelines may not meet every demand. But the TEI is the best-realized example we have today of a metasystem for encoding textual content; it is standards-based and platform-independent, and it is an open system, capable of expansion, that produces documents capable of interchange.
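
For readers who have not seen TEI-encoded text, a minimal fragment follows; it is illustrative only (the title and lines are invented), and a real document would carry a much fuller teiHeader.

    <!DOCTYPE TEI.2 SYSTEM "tei2.dtd">
    <TEI.2>
      <teiHeader>
        <fileDesc>
          <titleStmt><title>Sample Poem</title></titleStmt>
          <publicationStmt><p>Unpublished sample, for illustration only.</p></publicationStmt>
          <sourceDesc><p>Keyed from a hypothetical printed source.</p></sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <lg type="stanza">
            <l>First line of the stanza,</l>
            <l>second line of the stanza.</l>
          </lg>
        </body>
      </text>
    </TEI.2>

These structural elements (teiHeader, text, lg, l) are what conversion software must recognize when moving documents into other delivery formats.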

That said, a great deal of legacy data is not in SGML, much less in TEI-compliant form, and you are likely to have to convert data to SGML for delivery and to manage systems and software that do not speak SGML directly. Even if your system is fully TEI SGML-compliant, you may need to convert documents for delivery to access systems that do not fully support SGML (the Web, for example).

At present, there are no good 'off the shelf' tools to help you manage the conversion, but there are a number of specialized tools that may be of use (nsgmls, perl, and OCLC's Fred spring to mind in the shareware/freeware area; Avalanche's FastTag and EBT's DynaTag on the commercial side). This area is changing rapidly, however, and more useful general tools may soon be available. In any event, your center will need the expertise of a toolsmith who can work with these programs.
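
As a crude sketch of the kind of conversion involved, assuming a Web browser is the target: the Python fragment below maps a few TEI-style elements onto HTML with simple substitutions. A production conversion should go through a real SGML parser (nsgmls, for example) rather than pattern matching, and the element mapping here is an illustrative assumption.

    # Crude sketch: rewrite a few TEI-style tags as HTML for Web delivery.
    import re

    REPLACEMENTS = [
        (r"<lg[^>]*>", "<blockquote>"), (r"</lg>", "</blockquote>"),
        (r"<l>", ""),                   (r"</l>", "<br>"),
        (r"<title>", "<h2>"),           (r"</title>", "</h2>"),
    ]

    def tei_to_html(sgml_text):
        html = sgml_text
        for pattern, replacement in REPLACEMENTS:
            html = re.sub(pattern, replacement, html)
        return html

    sample = "<lg type='stanza'><l>First line,</l><l>second line.</l></lg>"
    print(tei_to_html(sample))
    # -> <blockquote>First line,<br>second line.<br></blockquote>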


Last Updated: July 5, 1995
