Les Earnest

Visible Legacies for Y3K

Les Earnest (les at cs.stanford.edu)

Senior Research Scientist Emeritus, Stanford University

Abstract

Much of what we know about ancient civilizations is derived from the records and stories they kept. The cost of keeping records has been greatly reduced in the modern world by the development of digital technology but the resulting records are now disappearing about as fast as they are being created. This problem is illustrated by attempts by members of the Stanford Artificial Intelligence Laboratory, a research group, to preserve their digital records. All digital records from the 1960s have been lost due to technical difficulties and, while most records from the 1970s and ‘80s have been preserved and are now accessible on a web site, they will disappear within the next 30 years unless a more reliable way can be developed to preserve such records and keep them accessible. This is not just a local problem but is worldwide and needs a worldwide solution. It appears that a solution can be found by augmenting the services of an existing international organization so as to provide redundant storage, maintenance and translation services that will keep records accessible for thousands of years.

Problem solved

The problem of preserving records for long periods was solved several thousand years ago by civilizations that first developed written records, then carved some of them in rock, as illustrated below. However, that scheme was rather expensive and therefore was accessible only to the rich and powerful. Thus much of what we know (or think we know) about these civilizations comes from inscriptions on temples, tombs and tablets that were created on behalf of egotistical monarchs. Many of their accounts were undoubtedly distorted or outright lies but, except when they attacked other literate societies or vice versa, there were generally no records of alternative views, so their stories are now accepted as history.

The biographies of modern monarchs and politicians are probably no more accurate than those of the ancient ones but, since the modern costs of documentation are much lower, there is now a greater likelihood that alternative views will be recorded. Unfortunately most modern records are much less durable than those carved in hard rock. Computers today mostly use various kinds of magnetic recording, which tends to deteriorate in a decade or two, and the ever-changing technologies for creating and reading recordings will probably continue to complicate the archiving process indefinitely. Technological advances will therefore come to naught in the long run if these records are lost.

Note that even records carved in stone need some protection from the elements in order to survive. In some cases this happened through accidental burial, which made them inaccessible for a time. In all cases it is necessary to translate these records into modern languages in order to make them understandable to the general public.

Egyptian historical text

The value of archived records generally can be enhanced by the addition of interpretations that explain the significance of the record, including perhaps a description of events that preceded or followed and references to related records. For important records there are likely to be multiple interpretations, since academicians like to do that sort of thing.

Paper works

Some people believe that the best way to preserve information now is to print it on paper and store it in a safe place. Indeed, paper has a potential shelf life measured in centuries. Not surprisingly, many of the paper advocates are librarians.

Of course, paper records, like all others, are subject to loss from earthquakes, fire, and floods and, in addition, may get eaten by voracious insects. Thus, in order to assure survival, multiple copies should be stored in different locations, which is something that libraries do well. However there are two strong arguments against using paper records:

(1) They are much more expensive to create and store than digital records and this cost disadvantage is rising rapidly;

(2) Even in a library and more so in an archive, they are much less accessible than, say, a web site.

Digital dropouts

A major problem with current digital records is that most of them have been stored on magnetic media that have a shelf life of around ten years. An additional complication is that, because of evolving software, file formats also change at similar intervals. To make matters worse, evolving hardware technology results in the devices used to record and read digital data typically changing about every five years. As a result, if you record something and try to retrieve it twenty years later you are likely to be out of luck.

In order to get around these problems using current technology, you need to recopy archives frequently and either reformat the information to conform to newer file format standards or develop software translators that can interpret and display information from the old formats. Either approach requires ongoing effort and attention to details.

This problem is illustrated below by an account of our troubled efforts to preserve computer records from the 1966-90 period. Happily, these problems will not be repeated exactly but I expect that similar problems will continue to arise going forward. In order to preserve such records we must find ways to work around both the volatility of digital recordings and changing technologies. One big advantage will be the steep decline in storage costs that apparently will continue for the foreseeable future.

I believe we should find a way to make current data accessible to people living in the year 3000 and later and it appears that an institutional solution to this problem can be created, though not without some effort.

Many people are working on more advanced storage technologies. For example, see the “Billion year Ultra-dense Memory Chip” [Begtrup09]. None of these long term storage devices has been proven yet but I expect that some will work eventually and will further reduce the already low cost of storing digital data for long periods. However, that will not in itself ensure the accessibility of old records for future generations – we will need multiple organizations and ongoing software development to accomplish that.

Why bother?

In summary, there is a lot of interesting history in the SAIL archives that is worthy of preservation, but there is currently no way to do that for a long period.

Personal archiving interests.

Like many people, I have a lot of personal digital materials that I would like to preserve and make accessible to my descendants and whoever else is interested, including:

(1) Family genealogies going back four centuries, with additional materials being gathered by cousins;

(2) Diaries of my mother, who lived a rich life to age 100; she became a popular stage actress as a teenager and was invited to Hollywood but instead went to college, then married and worked full time as a teacher while raising two children and earning a PhD in her “spare time,” then became a college professor and head of her department for many years;

(3) Various family photos, some going back 150 years;

(4) Family movies going back 75 years that have been digitized;

(5) Stories I have written that my children and grandchildren seem to enjoy.

My great grandchildren are beginning to read now and, with a bit of luck, I will begin seeing some great-great-grandchildren in another 15 to 20 years. I don’t know how many of them there will eventually be, but I would like to establish a web site that they can go to when interested and can use when they set up their own web sites, or whatever new communication medium is in vogue then. Unfortunately there currently is no way to do that reliably.

Archiving isn’t easy

Our attempts at preserving the digital records of SAIL have been only partially successful because of technical and operational problems. Happily our successors will be able to avoid some of our problems since technology has advanced a lot since we started and likely will continue to do so. However, I expect that ongoing technology shifts will keep causing compatibility problems similar to those we experienced. Here is a summary of what happened.

In the 1950s and early ‘60s the principal secondary storage medium was punched paper – tape or cards. These media were capable of fairly long term storage, at least a couple of hundred years, provided that they were stored in a stable environment and not accessed repeatedly. However, given that reading them caused physical wear, they had limited durability under frequent use.

At MIT during 1959-63 I created the first spelling checker as part of the first cursive handwriting recognizer [Earnest62, Lindgren65]. I needed a list of acceptable words but there were no online dictionaries then, nor even online collections of documents, so I typed in a published list of the 10,000 most common English words which filled seven reels of paper tape. I brought those paper tapes with me when I came to SAIL in 1965 and, as mentioned above, later talked a couple of graduate students into making spelling checkers for text files.

On 6/6/66 SAIL took delivery of a Digital Equipment Corporation PDP-6 computer, which was the first computer designed from scratch for general purpose timesharing. The only secondary storage available initially was punched paper tape or DECtapes, which used small strips of magnetic tape to record programs or data. DECtapes were functionally similar to the later floppy disks and shared the same unreliability for long term storage, so paper tape was a more reliable medium for the long term. However, DECtapes were more convenient for short term use and were used both while computing and for backups or archiving, all of which were the responsibility of individual users.

10,000 Most Common English Words on paper tape

DECtape Drives

In 1969 a magnetic disk system made by Librascope was installed at SAIL and ultimately turned out to be a disaster.

• Cabinet approximately 8 feet high, 12 feet long, 3 feet deep

• Contained 6 disks, each 5 feet in diameter, mounted on a horizontal shaft

• Capacity: 50 megabytes

• Cost: $330,000

Instead of having magnetic heads mounted on a moving arm, as modern disks do, it used fixed heads, one per track: hundreds of them. It was known to require mechanical stability, so instead of setting it on the false floor in the computer room we excavated and mounted it directly on the steel I-beams of the building. The first thing that happened was that a few magnetic heads crashed, causing damage to the disk surface. That was okay because the disk system had spare tracks that could be wired in. But then a technician left an aluminum tool inside when he restarted it, causing aluminum shreds to be scattered throughout the disk system and causing more head crashes.

We worked around that with more rewiring but then discovered that this system was extremely temperature sensitive. A change in ambient room temperature of one degree Fahrenheit was enough to make it begin “forgetting”, evidently because the coefficient of expansion of the head assembly was slightly different from that of the disk.

Since this system didn’t work, we then sued the manufacturer, which by that time was the Singer sewing machine company, since it had purchased Librascope and its liabilities. As usual, the only people who did well in this undertaking were the lawyers.

We planned to scrap this disk system but it then occurred to me that the disks would make nice coffee tables, so we auctioned them off and I bought one, which now functions in my living room as a coffee table and Lazy Susan. The evident discoloration is due to the fact that the nickel plating, which was the magnetic material, seems to be soluble in alcohol. Nevertheless we can still read some recorded bits by sprinkling iron filings on the table. It also functions as a seismometer, since even small earthquakes set it swinging.

Librascope disk/coffee table

In the early 1970s, IBM disks were known to be the most reliable ones available, but they were not compatible with the DEC-10 computer system that we had acquired by that time. A couple of graduate students solved that problem by designing and building a fake IBM channel that connected to the DEC-10 computer, so we could acquire IBM 2314 disks and later switch to more economical Ampex disks that were functionally equivalent to IBM 3330s.

After finally acquiring reliable disks we began doing tape backups using 7-track magnetic tapes that were IBM-compatible, but we made no serious attempt at archiving files. Finally, using a new tape backup/archiving program called DART, we began preserving nearly all files that lasted at least a day.

SAIL-DART Chronology

The DART archiving program was created by Ralph Gorin in 1972 and later modified by Martin Frost. Archival tapes covered 1972 Nov 5 to 1990 Aug 17 (2,984 tapes).
Conversion by Martin Frost from 7-track to 9-track during Feb. 1988 to Apr. 1990 destroyed a tape drive and produced 229 tapes.
Conversion by Bruce Baumgart, Martin Frost, John Nagle and me from 9-track tapes to SCSI disks during 1997-98 produced tar files totaling 20.76 gigabytes, which expanded into individual files totaling 53 gigabytes, with some redundancy. Copies of the archive were given to a number of people to ensure short term survivability and Bruce Baumgart made CD-ROM copies of personal files for a number of people.
In each tape conversion some files were lost due to read errors, generally caused by degradation over time of the magnetic recordings.
In 2006 Bruce Baumgart created an online archive at http://www.saildart.org containing both public and private files, the latter being accessible only by individual logins.
A dozen videos of SAIL research results in robotics, speech understanding, display-based timesharing and other topics also have been preserved and many are viewable online, with more to come. These were originally made as 16 mm films and have a total running time of about two hours.

How long will it last?

The SAILDART web site is being increasingly accessed both by those who participated in SAIL around forty years ago and by others who want to know more about what was done in that center of innovation. Furthermore an emulator is being developed that will allow the old programs to live again! However the continuation of this web site currently depends on the one person who created and maintains it and a small charitable nonprofit corporation that embraces it. A number of earlier participants in SAIL have passed away and SAILDART is unlikely to survive the deaths of the current participants.

The first data to go should be private files. Individuals will be allowed to delete what they wish and to declare some or all of the remaining files public. Eventually, perhaps by 2030, we expect to declare all remaining files public. Thus this archive will eventually become entirely public and we believe that it will be of interest to information archaeologists. However unless a way can be found to preserve it, the whole thing will vanish, just as is happening to other web sites of the deceased.

Carnegie Mellon University has sensibly decided to preserve tributes to distinguished professors such as Allen Newell at http://diva.library.cmu.edu/Newell/biography.html and Herbert Simon at http://diva.library.cmu.edu/Simon/biography.html. Stanford University and some others are thinking about this idea but haven’t yet done it. I believe they would also be well advised to preserve records of prominent alumni as a form of advertising.

Who can do this?

Candidate agencies for preserving digital records include:

Commercial corporations such as Google
Non-profit corporations such as Archive.org
Government agencies such as the Library of Congress
Universities

Google has an enormous capacity for document storage and already is offering a free service for storing important documents [Google09]. However, like any commercial organization, their long term focus must be on their bottom line and, if they eventually falter in that area, as most high tech organizations do sooner or later, they likely will have to curtail their archiving activities. If things got even worse, they might go out of business or be purchased by another commercial organization that might end up terminating archiving activities.

Archive.org has done an excellent job of recording and preserving web information but is very much dependent on the organizers who put it together and the volunteers and sponsors who have kept it going so far. It might still be operating a thousand years from now but I seriously doubt it and that goes for any single nonprofit organization that I can think of.

A government agency such as the Library of Congress could provide long term archiving services but would inevitably be subjected to congressional curtailment in hard times as well as politicians’ natural desire to control or censor information they don’t like. There might also be problems in dealing with agencies of other governments.

Universities, on the other hand, have a natural interest in collecting and preserving information and tend to be more stable than other entities, including governments. No single university can be trusted for the long haul, but a consortium of independent universities could do it, I believe. The basic idea would be to make a number of copies of each record and distribute them around the world, so that even the eventual loss of a few universities would not end it. Given that a substantial amount of ongoing software development will be required to make the archiving work and that much of that could be handled as student projects, doing it with cooperating universities would be a good fit.

I believe that the following things are needed to make long term archiving work:

Low cost memory (already here and getting cheaper for the foreseeable future);
A worldwide network to provide access (Internet);
A method for distributing files worldwide for backup and getting them copied to new media as needed (LOCKSS.org);
A fee structure that fairly charges for services and distributes income equitably among the participants;
A scheme for ensuring the integrity of the records being preserved;
Methods for translating old data structures into new formats as needed;
Emulators for computers that are dying or extinct;
Methods for translating between source languages and others, including evolving languages.

Quoting from the web site at http://www.lockss.org:

“LOCKSS (Lots of Copies Keep Stuff Safe), based at Stanford University Libraries, is an international community initiative that provides libraries with digital preservation tools and support so that they can easily and inexpensively collect and preserve their own copies of authorized e-content. LOCKSS, in its eleventh year, provides libraries with the open-source software and support to preserve today’s web-published materials for tomorrow’s readers while building their own collections and acquiring a copy of the assets they pay for, instead of simply leasing them. LOCKSS provides 100% post cancellation access.”

LOCKSS currently includes hundreds of participating libraries around the world, which back each other up by typically making seven copies of each file and distributing them among sites. If one copy is lost somehow another copy is made promptly.
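The replace-on-loss idea can be illustrated with a toy sketch. This is not the actual LOCKSS protocol or software, only an assumed simplification: the site names, the function repair_replicas and the random choice of replacement sites are all hypothetical, and the target of seven copies is simply the typical figure mentioned above.

```python
import random

TARGET_COPIES = 7  # the "typically seven copies" figure; an assumption, not a LOCKSS constant


def repair_replicas(holders, all_sites, target=TARGET_COPIES, rng=None):
    """Return the set of sites that should hold a file after repair.

    holders   -- sites that still have a readable copy
    all_sites -- every participating site (hypothetical names)
    """
    rng = rng or random.Random()
    holders = set(holders)
    if not holders:
        raise ValueError("no surviving copy to replicate from")
    # Only sites that do not already hold a copy can receive a new one.
    candidates = sorted(set(all_sites) - holders)
    missing = max(0, target - len(holders))
    # We cannot create more new copies than there are spare sites.
    new_sites = rng.sample(candidates, min(missing, len(candidates)))
    return holders | set(new_sites)


if __name__ == "__main__":
    sites = [f"library-{i}" for i in range(10)]
    surviving = sites[:5]  # two of the seven copies were lost
    print(sorted(repair_replicas(surviving, sites)))
```

A real system would also have to decide which surviving copy to trust as the source, which is where the integrity checks discussed later come in.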

Given that LOCKSS already operates in a university environment, the archiving functions discussed below could be undertaken as an expansion of the scope of their operations or could be organized separately, perhaps contracting with LOCKSS for their backup service.

The fee structure and income distribution scheme will necessarily be a bit complicated but negotiable, I believe. Long term storage should be marketed in standard units of space-time, such as gigabytes in blocks of 50 or 100 years. There might even be a “Forever” fee, though that would be problematic since nothing lasts forever, including Planet Earth. University alumni associations would be natural marketing agencies for archiving services, I believe, but it could also be handled by commercial marketers on a commission basis.

In order to maintain the integrity of stored records, I believe that each should be bundled together with an accurate description of the computing environment in which it operates and have a digital signature attached that is checked from time to time. The fact that there will be multiple stored copies of everything at various sites makes possible another kind of integrity check, which should be done periodically.
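Both kinds of check can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not a proposed design: it uses a plain SHA-256 digest as a stand-in for a true digital signature (a real signature would also need a signing key, which is omitted here), and the function names and site names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 fingerprint of a record (stand-in for a real digital signature)."""
    return hashlib.sha256(data).hexdigest()


def check_against_recorded(data: bytes, recorded_digest: str) -> bool:
    """First check: does a stored copy still match the digest recorded at deposit?"""
    return digest(data) == recorded_digest


def find_damaged_copies(copies: dict) -> list:
    """Second check: compare the copies held at different sites with each other.

    copies maps a (hypothetical) site name to the bytes stored there.
    Returns the sites whose copy disagrees with the majority.
    """
    if not copies:
        raise ValueError("no copies to compare")
    digests = {site: digest(data) for site, data in copies.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority among copies; manual inspection needed")
    return sorted(site for site, d in digests.items() if d != majority_digest)


if __name__ == "__main__":
    original = b"FINGER.LES contents"
    recorded = digest(original)
    stored = {"site-a": original, "site-b": original, "site-c": b"FINGER.LES c0ntents"}
    print(check_against_recorded(stored["site-c"], recorded))  # False
    print(find_damaged_copies(stored))  # ['site-c']
```

The two checks complement each other: the recorded digest catches a copy that has decayed even if every site decayed the same way, while the cross-site comparison still works if the recorded digest itself is lost.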

One complication is that changes in software standards and media will necessitate accommodating data formats that differ from those in the original file. I believe that this should be handled not by fiddling with the stored records but by developing and maintaining format translators, which may even be cascaded over the long term. The cost of doing this must, of course, be figured into the fee structure.
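Cascading translators amounts to finding a chain of conversions from the stored format to whatever format is current, leaving the stored record untouched. The following is a minimal sketch, assuming a simple registry of one-step translators; the format names "v1", "v2" and "v3" and the toy conversions are invented for illustration only.

```python
from collections import deque

# Registry of one-step translators: (source_format, target_format) -> function.
TRANSLATORS = {}


def register(src, dst):
    """Decorator that records a translator from format src to format dst."""
    def decorator(fn):
        TRANSLATORS[(src, dst)] = fn
        return fn
    return decorator


def find_chain(src, dst):
    """Breadth-first search for the shortest list of formats leading from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for (a, b) in TRANSLATORS:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


def translate(record, src, dst):
    """Apply the cascade of translators; the stored record itself is never modified."""
    chain = find_chain(src, dst)
    if chain is None:
        raise LookupError(f"no translator chain from {src} to {dst}")
    for a, b in zip(chain, chain[1:]):
        record = TRANSLATORS[(a, b)](record)
    return record


# Hypothetical formats: v1 = text with CR-LF line ends, v2 = LF line ends, v3 = UTF-8 bytes.
@register("v1", "v2")
def v1_to_v2(text):
    return text.replace("\r\n", "\n")


@register("v2", "v3")
def v2_to_v3(text):
    return text.encode("utf-8")


if __name__ == "__main__":
    print(translate("line one\r\nline two", "v1", "v3"))
```

With this arrangement a new format needs only one new translator from its immediate predecessor, at the price of longer chains (and accumulated conversion risk) for the oldest records.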

Because stored records may include computer code of various kinds and recognizing that nearly all computer systems go extinct sooner or later, it will be necessary to maintain emulators for all machines represented in the archive and these will need to be modified as the computer systems used for presenting the archived information evolve.

Similarly, if readers or listeners speak a different language than the one used in a given record, translation will be needed. Computer software is getting better at doing both speech understanding and text translations between languages but this will be a weak point initially. However these fields are still advancing rapidly and we can expect that they will be doing an adequate job within a decade or so.

In the longer term, the system will have to deal with linguistic drift. As you may have noticed, the language of the founding fathers of the U.S. differs somewhat from our own and the story of Beowulf, written in Old English between the 8th and 11th centuries, is rather hard for modern English readers to understand. Similarly our current language will look quite strange to Y3K speakers of “New English”, if it exists. Again, no attempt should be made to fiddle with the source materials but automatic translators should be maintained.

In summary, it appears that preservation of digital representations of various media can be accomplished and kept accessible for thousands of years by a suitable institutional consortium, the principal elements of which already exist. This project will need a name, of course, and I suggest “LongAgo”. It could be created as an augmentation of LOCKSS.org or as a separate university-affiliated entity. I hope that it can be created before it is too late for me.

References

[ACM09] “A.M. Turing Awards,” at http://awards.acm.org/homepage.cfm?awd=140.

[Begtrup09] G. E. Begtrup, W. Gannett, T. D. Yuzvinsky, V. H. Crespi and A. Zettl, “Nanoscale Reversible Mass Transport for Archival Memory,” Nano Letters, American Chemical Society, April 2009 – see http://newscenter.lbl.gov/feature-stories/2009/06/03/.

[Diffie76] W. Diffie & M. Hellman, “Multi-user Cryptographic Techniques,” AFIPS Conf. Proc., Vol. 45, June 1976.

[Earnest62] L. Earnest, “Machine recognition of cursive writing,” Proc. IFIP Congress 1962, North Holland, Amsterdam, 1963.

[Earnest72] L. Earnest, “Video switch,” SAILON-69, 15 May 1972 – see public file VDS.LES[S,DOC] in http://www.saildart.org.

[Earnest73] L. Earnest (ed.), J. McCarthy, E. Feigenbaum, J. Lederberg, “FINAL REPORT: The first ten years of artificial intelligence research at Stanford,” Stanford Artificial Intelligence Laboratory Memo AIM-228, July 1973 – see http://historical.ncstrl.org/litesite-data/stan/CS-TR-74-409.pdf.

[Earnest74] L. Earnest, “Prancing Pony vending machine,” 1974 – see public file PONYSY.SAI[PNY,DOC] in http://www.saildart.org.

[Earnest75] L. Earnest, “FINGER,” Oct. 13, 1975 – see public file FINGER.LES[UP,DOC] in http://www.saildart.org.

[Feldman72] J. Feldman, J. Low, D.C. Swinehart, R.H. Taylor, “Recent Developments in SAIL, an ALGOL based language for Artificial Intelligence,” Proc. FJCC, 1972.

[Frost72] M. Frost, “Reading the Associated Press News,” Stanford Artificial Intelligence Lab., 22 Sep. 1972 – see APE.ME[UP,DOC] in http://www.saildart.org.

[Frost74] M. Frost, “Reading the wire service news,” SAILON-72.1, Stanford Artificial Intelligence Lab, 26 May 1974 – see NS.ME[UP,DOC] in http://www.saildart.org.

[Google09] “Online document storage process” at http://closereach.com/doconlinestorage.pdf.

[Gorin72] R. Gorin, “Spelling checker/corrector,” June 11, 1972 – see public file SPELL.REG[UP,DOC] in http://www.saildart.org.

[Hanrahan09] P. Hanrahan, “A conversation with David E. Shaw,” CACM, Oct. 2009.

[Helliwell74] R. Helliwell, “Stanford University Design System” 29 April 1974 – see public file SUDS.RPH[UP,DOC] in http://www.saildart.org.

[Harrenstien77] K. Harrenstien, “Name/Finger,” RFC 742, Network Working Group, 30 Dec. 1977.

[Knuth82] D. Knuth, “TeX” – see http://en.wikipedia.org/wiki/TeX

[Lindgren65] N. Lindgren, “Machine Recognition of Human Language, Part III – Cursive Script Recognition,” IEEE Spectrum, May 1965.

[McCullagh07] D. McCullagh, A. Broache, “Blogs turn 10 – Who’s the father?”, CNET Asia, May 21, 2007 – see http://asia.cnet.com/reviews/pcperipherals/0,39051168,61998604,00.htm

[Quam71] L. Quam, "Computer Comparison of Pictures", SAIL AIM-144, Thesis: Ph.D. in Computer Science, May 1971.

[Quam73] L. Quam, W. Diffie, Stanford LISP 1.6 Manual, Stanford A. I. Lab. Operating Note SAILON-28.7, 1973.

[Reddy67] D. Raj Reddy, Computer Recognition of Connected Speech, J. Acoust. Soc. Amer., August, 1967.

[Tesler72] L. Tesler, PUB the document compiler, August 1972 – see http://www.nomodes.com/pub_manual.html.

[Wiki09] “Turing award” in Wikipedia at http://en.wikipedia.org/wiki/Turing_Award