Visible Legacies for Y3K
Les Earnest (les at cs.stanford.edu)
Senior Research Scientist Emeritus,
Abstract
Much of what we know about ancient civilizations is derived from the records and
stories they kept. The cost of keeping records has been greatly reduced in the
modern world by the development of digital technology, but the resulting records
are now disappearing about as fast as they are being created. This problem is illustrated
by attempts by members of the Stanford Artificial Intelligence Laboratory, a
research group, to preserve their digital records. All digital records from the
1960s have been lost due to technical difficulties and, while most records from
the 1970s and ‘80s have been preserved and are now accessible on a web site, they
will disappear within the next 30 years unless a more reliable way can be
developed to preserve such records and keep them accessible. This is not just a
local problem but is worldwide and needs a worldwide solution. It appears that
a solution can be found by augmenting the services of an existing international
organization so as to provide redundant storage, maintenance and translation
services that will keep records accessible for thousands of years.
Problem solved
The problem of preserving records for long periods was solved several thousand years
ago by civilizations that first developed written records, then carved some of
them in rock, as illustrated below. However, that scheme was rather
expensive and therefore was accessible only to the rich and powerful. Thus much
of what we know (or think we know) about these civilizations comes from
inscriptions on temples, tombs and tablets that were created on behalf of
egotistical monarchs. Many of their accounts were undoubtedly distorted or outright
lies but, except when they attacked other literate societies or vice versa,
there were generally no records of alternative views, so their stories are now
accepted as history.
The biographies of modern monarchs and politicians are probably no more accurate
than those of the ancient ones but, since the modern costs of documentation are
much lower, there is now a greater likelihood that alternative views will be
recorded. Unfortunately, most modern records are much less durable than those
carved in hard rock. Computers today mostly use various kinds of magnetic
recording, which tends to deteriorate in a decade or two, and the ever-changing
technologies for creating and reading recordings will probably continue to
complicate the archiving process indefinitely. Technological advances will come
to naught in the long run if these records are lost.
Note that even records carved in stone need some protection from the elements in
order to survive. In some cases this happened through accidental burial, which
made them inaccessible for a time. In all cases it is necessary to translate
these records into modern languages in order to make them understandable to the
general public.
Egyptian historical text
The value of archived records generally can be enhanced by the addition of
interpretations that explain the significance of the record, including perhaps
a description of events that preceded or followed and references to related
records. For important records there are likely to be multiple interpretations,
since academicians like to do that sort of thing.
Paper works
Some people believe that the best way to preserve information now is to print it on
paper and store it in a safe place. Indeed, paper has a potential shelf life
measured in centuries. Not surprisingly, many of the paper advocates are
librarians.
Of course, paper records, like all others, are subject to loss from earthquakes,
fire, and floods and, in addition, may get eaten by voracious insects. Thus, in
order to assure survival, multiple copies should be stored in different
locations, which is something that libraries do well. However, there are two
strong arguments against using paper records:
(1) They are much more expensive to create and store than digital records, and
this cost disadvantage is rising rapidly;
(2) Even in a library, and more so in an archive, they are much less accessible
than, say, a web site.
Digital dropouts
A major problem
with current digital records is that most of them have been stored on magnetic
media that have a shelf life of around ten years. An additional complication is
that, because of evolving software, file formats also change at similar intervals.
To make matters worse, evolving hardware technology results in the devices used
to record and read digital data typically changing about every five years. As a
result, if you record something and try to retrieve it twenty years later you
are likely to be out of luck.
In order to get around these
problems using current technology, you need to recopy archives frequently and
either reformat the information to conform to newer file format standards or
develop software translators that can interpret and display information from
the old formats. Either approach requires ongoing effort and attention to
details.
This problem is illustrated below by an account of our troubled efforts to preserve
computer records from the 1966-90 period. Happily, these
problems will not be repeated exactly but I expect that similar problems will
continue to arise going forward. In order to preserve such records we must find
ways to work around both the volatility of digital recordings and changing
technologies. One big advantage will be the steep decline in storage costs that
apparently will continue for the foreseeable future.
I believe we should find a way to make current data accessible to people living
in the year 3000 and later, and it appears that an institutional solution to
this problem can be created, though not without some effort.
Many people are working on more advanced storage technologies. For example, see the
“Billion year Ultra-dense Memory Chip” [Begtrup09]. None of these long term
storage devices has been proven yet, but I expect that some will work eventually
and will greatly reduce the already low cost of storing digital data
for long periods. However, that will not in itself ensure the accessibility of
old records for future generations; we will need multiple organizations and
ongoing software development to accomplish that.
Why bother?
In summary, there is a lot of
interesting history in the SAIL archives that is worthy of preservation, but
there is currently no way to do that for a long period.
Personal archiving interests
Like many people, I have a lot of personal digital materials that I would like
to preserve and make accessible to my descendants and whoever else is
interested, including:
(1) Family genealogies going back four centuries, with additional materials
being gathered by cousins;
(2) Diaries of my mother, who lived a rich life to age 100; she became a popular
stage actress as a teenager and was invited to Hollywood but instead went to
college, then married and worked full time as a teacher while raising two
children and earning a PhD in her “spare time,” then became a college professor
and head of her department for many years;
(3) Various family photos, some going back 150 years;
(4) Family movies going back 75 years that have been digitized;
(5) Stories I have written that my children and grandchildren seem to enjoy.
My great grandchildren are beginning to read now and, with a bit of luck, I will
begin seeing some great-great-grandchildren in another 15 to 20 years. I don’t
know how many of them there will eventually be, but I would like to establish a
web site that they can go to when interested and can use when they set up their
own web sites, or whatever new communication medium is in vogue then.
Unfortunately there currently is no way to do that reliably.
Archiving isn’t easy
Our attempts at preserving the digital records of SAIL have been only partially
successful because of technical and operational problems. Happily, our
successors will be able to avoid some of our problems since technology has
advanced a lot since we started and likely will continue to do so. However, I
expect that ongoing technology shifts will continue to cause compatibility
problems similar to those we experienced. Here is a summary of what happened.
In the 1950s and early ‘60s the principal secondary storage medium was punched
paper – tape or cards. These media were capable of fairly long term storage, at
least a couple of hundred years, provided that they were stored in a stable
environment and not accessed repeatedly. However, given that reading them
caused physical wear, they had limited durability under frequent use.
At MIT during 1959-63 I created the first spelling checker as part of the first
cursive handwriting recognizer [Earnest62, Lindgren65]. I needed a list of
acceptable words but there were no online dictionaries then, nor even online collections
of documents, so I typed in a published list of the 10,000 most common English
words, which filled seven reels of paper tape. I brought those paper tapes with
me when I came to SAIL in 1965 and, as mentioned above, later talked a couple
of graduate students into making spelling checkers for text files.
On
10,000 Most Common English Words on paper tape
DECtape Drives
In 1969 a magnetic disk system made by Librascope was installed at SAIL and
ultimately turned out to be a disaster.
• Cabinet approximately 8 feet high, 12 feet long, 3 feet deep
• Contained 6 disks, each 5 feet in diameter, mounted on a horizontal shaft
• Capacity: 50 megabytes
• Cost: $330,000
Instead of having magnetic heads mounted on a moving arm, as modern disks do, it used
fixed heads, one per track: hundreds of them. It was known to require mechanical
stability, so instead of setting it on the false floor in the computer room we
excavated and mounted it directly on the steel I-beams of the building. The
first thing that happened was that a few magnetic heads crashed, causing damage
to the disk surface. That was okay because the disk system had spare tracks
that could be wired in. But then a technician left an aluminum tool inside when
he restarted it, causing aluminum shreds to be scattered throughout the disk
system and causing more head crashes.
We worked around that with more rewiring but then discovered that this system was
extremely temperature sensitive. A change in ambient room temperature of one
degree Fahrenheit was enough to make it begin “forgetting”, evidently because
the coefficient of expansion of the head assembly was slightly different from
that of the disk.
Since this system didn’t work, we then sued the manufacturer, which by that time was
the Singer sewing machine company, since they had purchased Librascope
and its liabilities. As usual, the only people who did well in this undertaking
were the lawyers.
We planned to scrap this disk system but it then occurred to me that the disks
would make nice coffee tables, so we auctioned them off and I bought one, which
now functions in my living room as a coffee table and Lazy Susan. The evident
discoloration is due to the fact that the nickel plating, which was the
magnetic material, seems to be soluble in alcohol. Nevertheless we can still
read some recorded bits by sprinkling iron filings on the table. It also
functions as a seismometer, since even small earthquakes set it swinging.
Librascope disk/coffee table
In
the early 1970s,
After
finally acquiring reliable disks we began doing tape backups using 7-track
magnetic tapes that were
SAIL-DART
Chronology
How long will it last?
The SAILDART web site is being increasingly accessed both by those who participated
in SAIL around forty years ago and by others who want to know more about what
was done in that center of innovation. Furthermore, an emulator is being
developed that will allow the old programs to live again! However, the
continuation of this web site currently depends on the one person who created
and maintains it and a small charitable nonprofit corporation that embraces it.
A number of earlier participants in SAIL have passed away and SAILDART is
unlikely to survive the deaths of the current participants.
The first data to go should be private files. Individuals will be allowed to delete
what they wish and to declare some or all of the remaining files public.
Eventually, perhaps by 2030, we expect to declare all remaining files public. Thus
this archive will eventually become entirely public and we believe that it will
be of interest to information archaeologists. However, unless a way can be found
to preserve it, the whole thing will vanish, just as is happening to other web
sites of the deceased.
Who can do this?
Candidate agencies for preserving digital records include:
Google has an enormous capacity for document storage and already is offering a free
service for storing important documents [Google09]. However, like any
commercial organization, their long term focus must be on their bottom line
and, if they eventually falter in that area, as most high tech organizations do
sooner or later, they likely will have to curtail their archiving activities.
If things got even worse, they might go out of business or be purchased by
another commercial organization that might end up terminating archiving
activities.
Archive.org has done an excellent job of recording and preserving web information but is
very much dependent on the organizers who put it together and the volunteers
and sponsors who have kept it going so far. It might still be operating a
thousand years from now, but I seriously doubt it, and that goes for any single
nonprofit organization that I can think of.
A government agency such as the Library of Congress could provide long term archiving
services but would inevitably be subjected to congressional curtailment in hard
times as well as politicians’ natural desire to control or censor information
they don’t like. There might also be problems in dealing with agencies of other
governments.
Universities, on the other hand, have a natural interest in collecting and preserving
information and tend to be more stable than other entities, including
governments. No single university can be trusted for the long haul, but a
consortium of independent universities could do it, I believe. The basic idea
would be to make a number of copies of each record and distribute them around
the world, so that even the eventual loss of a few universities would not end
it. Given that a substantial amount of ongoing software development will be required to
make the archiving work and that much of that could be handled as student
projects, doing it with cooperating universities would be a good fit.
I believe that the following things are needed to make long term archiving work:
Quoting from the web site at http://www.lockss.org:
“LOCKSS (Lots of Copies Keep Stuff Safe), based at Stanford University Libraries, is an
international community initiative that provides libraries with digital
preservation tools and support so that they can easily and inexpensively
collect and preserve their own copies of authorized e-content. LOCKSS, in its
eleventh year, provides libraries with the open-source software and support to
preserve today’s web-published materials for tomorrow’s readers while building
their own collections and acquiring a copy of the assets they pay for, instead
of simply leasing them. LOCKSS provides 100% post cancellation access.”
LOCKSS currently includes hundreds of participating libraries around the world, which
back each other up by typically making seven copies of each file and
distributing them among sites. If one copy is somehow lost, another copy is made
promptly.
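The replicate-and-repair behavior just described can be sketched in a few lines of Python. This is only an illustration and not the LOCKSS protocol itself: the `repair` function, the site names, and the in-memory dictionaries standing in for library storage are all hypothetical. Only the target of seven copies comes from the text above.

```python
import random

TARGET_COPIES = 7  # typical number of copies per file, as stated in the text


def repair(holdings, file_id, target=TARGET_COPIES, rng=random):
    """Restore `file_id` to `target` copies by copying from a surviving site.

    `holdings` maps a site name to a dict of {file_id: content}. It is a
    hypothetical stand-in for real library storage.
    Returns the list of sites that received a fresh copy.
    """
    have = [site for site, files in holdings.items() if file_id in files]
    if not have:
        raise LookupError(f"no surviving copy of {file_id!r}")
    content = holdings[have[0]][file_id]
    lack = [site for site in holdings if file_id not in holdings[site]]
    needed = max(0, target - len(have))
    # Pick replacement sites at random among those without a copy.
    chosen = rng.sample(lack, min(needed, len(lack)))
    for site in chosen:
        holdings[site][file_id] = content
    return chosen


if __name__ == "__main__":
    sites = {f"library-{i}": {} for i in range(10)}
    for name in list(sites)[:7]:
        sites[name]["memo.txt"] = b"SAIL memo"
    del sites["library-3"]["memo.txt"]  # simulate the loss of one copy
    print("new copy placed at:", repair(sites, "memo.txt"))
```

A real system would also have to decide which surviving copy to trust before duplicating it, which is where periodic integrity checks come in.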
Given that LOCKSS already operates in a university environment, the archiving
functions discussed below could be undertaken as an expansion of the scope of
their operations or could be organized separately, perhaps contracting with LOCKSS
for their backup service.
The fee structure and income distribution scheme will necessarily be a bit
complicated but negotiable, I believe. Long term storage should be marketed in
standard units of space-time, such as gigabytes in blocks of 50 or 100 years.
There might even be a “Forever” fee, though that would be problematic since
nothing lasts forever, including Planet Earth. University alumni associations
would be natural marketing agencies for archiving services, I believe, but it
could also be handled by commercial marketers on a commission basis.
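To see how a space-time fee might be computed, here is a rough sketch in Python. It assumes, purely for illustration, that the yearly cost of storing one gigabyte starts at some figure and falls by a fixed fraction every year, so that the total over a block of years is a geometric series. The function names, the starting cost and the rate of decline are all hypothetical and are not taken from any actual price list.

```python
def block_fee(first_year_cost, annual_decline, years):
    """Storage cost of one gigabyte held for `years` years.

    Assumption (hypothetical): the cost in year k is
    first_year_cost * (1 - annual_decline) ** k, for k = 0 .. years - 1,
    so the total is a finite geometric series.
    """
    if not 0.0 < annual_decline < 1.0:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    keep = 1.0 - annual_decline
    return first_year_cost * (1.0 - keep ** years) / annual_decline


def forever_fee(first_year_cost, annual_decline):
    """Limit of block_fee as the number of years grows without bound."""
    if not 0.0 < annual_decline < 1.0:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    return first_year_cost / annual_decline


if __name__ == "__main__":
    # Made-up figures: 0.10 per gigabyte in year one, falling 20% a year.
    for years in (50, 100):
        print(years, "years:", block_fee(0.10, 0.20, years))
    print("forever:", forever_fee(0.10, 0.20))
```

Under this assumption even a "Forever" fee is finite, since the series converges. The sketch covers raw storage only; the costs of maintenance, format translators and emulators would have to be added on top.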
In order to maintain the integrity of stored records, I believe that each should
be bundled together with an accurate description of the computing environment
in which it operates and have a digital signature attached that is checked from
time to time. The fact that there will be multiple stored copies of everything
at various sites makes possible another kind of integrity check, which should
be done periodically.
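Both kinds of integrity check can be sketched briefly in Python. Note the simplification: a plain SHA-256 digest is used here as a stand-in for a true digital signature, which would also involve a signing key. The function names and the example site names are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a record (a stand-in for a real digital signature)."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded: str) -> bool:
    """First check: does the record still match the digest stored with it?"""
    return fingerprint(data) == recorded


def cross_check(copies):
    """Second check: compare the copies held at different sites.

    `copies` maps a site name to the bytes stored there. Returns the digest
    held by the largest number of sites and the sorted list of sites that
    disagree with it. A tie between digests is not resolved in any
    meaningful way in this sketch.
    """
    if not copies:
        raise ValueError("no copies to compare")
    digests = {site: fingerprint(data) for site, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    suspects = sorted(site for site, d in digests.items() if d != majority)
    return majority, suspects


if __name__ == "__main__":
    record = b"FINAL REPORT"
    copies = {"site-a": record, "site-b": record, "site-c": b"FINAL REPORt"}
    majority, suspects = cross_check(copies)
    print("sites needing repair:", suspects)
    print("site-a intact:", is_intact(copies["site-a"], majority))
```

The two checks complement each other: the stored digest catches a copy that has decayed, while the comparison across sites still works if the digest itself has been damaged at one location.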
One complication is that changes in software standards and media will necessitate
accommodating data formats that differ from those in the original file. I
believe that this should be handled not by fiddling with the stored records but
by developing and maintaining format translators, which may even be cascaded
over the long term. The cost of doing this must, of course, be
figured into the fee structure.
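The idea of cascading translators while leaving the stored record untouched can be sketched as follows in Python. The format names ("v1", "v2", "v3"), the toy translators and the registry are all hypothetical; real translators would of course be far more elaborate.

```python
from collections import deque

TRANSLATORS = {}  # (source_format, target_format) -> translator function


def translator(src, dst):
    """Decorator that registers a function translating `src` into `dst`."""
    def register(func):
        TRANSLATORS[(src, dst)] = func
        return func
    return register


@translator("v1", "v2")
def v1_to_v2(record: bytes) -> str:
    # Toy example: the old format is raw ASCII bytes, the newer one is text.
    return record.decode("ascii")


@translator("v2", "v3")
def v2_to_v3(record: str) -> dict:
    # Toy example: the newest format wraps the text in a structure.
    return {"text": record}


def find_chain(src, dst):
    """Breadth-first search for the shortest cascade of translators."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == dst:
            return chain
        for (a, b), func in TRANSLATORS.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, chain + [func]))
    raise LookupError(f"no translator chain from {src!r} to {dst!r}")


def present(record, src, dst):
    """Return `record` converted from `src` to `dst`; the original is not modified."""
    result = record
    for func in find_chain(src, dst):
        result = func(result)
    return result


if __name__ == "__main__":
    stored = b"HELLO FROM 1975"
    print(present(stored, "v1", "v3"))
    print(stored)  # the archived bytes are unchanged
```

With this arrangement, supporting a new format means writing one translator from its immediate predecessor rather than one from every older format, at the price of longer cascades over time.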
Because stored records may include computer code of various kinds, and recognizing that
nearly all computer systems go extinct sooner or later, it will be necessary to
maintain emulators for all machines represented in the archive and these will
need to be modified as the computer systems used for presenting the archived
information evolve.
Similarly, if readers or listeners speak a different language than the
one used in a given record, translation will be needed. Computer software is
getting better at doing both speech understanding and text translation between
languages, but this will be a weak point initially. However, these fields are
still advancing rapidly and we can expect that they will be doing an adequate
job within a decade or so.
In the longer term, the system will have to deal with linguistic drift. As you may
have noticed, the language of the founding fathers of the
In summary, it appears that digital representations of various media can be
preserved and kept accessible for thousands of years by a suitable
institutional consortium, the principal elements of which already exist. This
project will need a name, of course, and I suggest “LongAgo”. It could be
created as an augmentation of LOCKSS.org or as a separate
university-affiliated entity. I hope that it can be created before it is too
late for me.
References
[ACM09] “A.M. Turing Awards,” at http://awards.acm.org/homepage.cfm?awd=140.
[Begtrup09] G. E. Begtrup, W. Gannett, T. D. Yuzvinsky, V. H. Crespi and A. Zettl, “Nanoscale Reversible Mass Transport for Archival Memory,” Nano Letters, American Chemical Society, April 2009; see http://newscenter.lbl.gov/feature-stories/2009/06/03/.
[Diffie76] W. Diffie & M. Hellman, “Multi-user Cryptographic Techniques,” AFIPS Conf. Proc., Vol. 45, June 1976.
[Earnest62] L. Earnest, “Machine recognition of cursive writing,” Proc. IFIP Congress 1962,
[Earnest72] L. Earnest, “Video switch,” SAILON-69,
[Earnest73] L. Earnest (ed.), J. McCarthy, E. Feigenbaum, J. Lederberg, “FINAL REPORT: The first ten years of artificial intelligence research at Stanford,” Stanford Artificial Intelligence Laboratory Memo
[Earnest74] L. Earnest, “Prancing Pony vending machine,” 1974; see public file PONYSY.
[Earnest75] L. Earnest, “FINGER,”
[Feldman72] J. Feldman, J. Low, D.C. Swinehart, R.H. Taylor, “Recent Developments in SAIL, an ALGOL based language for Artificial Intelligence,” Proc. FJCC, 1972.
[Frost72] M. Frost, “Reading the Associated Press News,” Stanford Artificial Intelligence Lab.,
[Frost74] M. Frost, “Reading the wire service news,” SAILON-72.1, Stanford Artificial Intelligence Lab,
[Google09] “Online document storage process” at http://closereach.com/doconlinestorage.pdf.
[Gorin72] R. Gorin, “Spelling checker/corrector”,
[Hanrahan09] P. Hanrahan, “A conversation with David E. Shaw,” CACM, Oct. 2009.
[Helliwell74] R. Helliwell, “Stanford University Design System”
[Harrenstien77] K. Harrenstien, “Name/Finger,” RFC 742, Network Working Group, 30 Dec. 1977.
[Knuth82] D. Knuth, “
[Lindgren65] N. Lindgren, “Machine Recognition of Human Language, Part
[McCullagh07] D. McCullagh, A. Broache, “Blogs turn 10 – Who’s the father?”,
[Quam71] L. Quam, “Computer Comparison of Pictures,” SAIL
[Quam73] L. Quam, W. Diffie, Stanford LISP 1.6 Manual, Stanford A. Operating note SAILON-28.7, 1973.
[Reddy67] D. Raj Reddy, “Computer Recognition of Connected Speech,” J. Acoust. Soc. Amer., August, 1967.
[Tesler72] L. Tesler,
[Wiki09] “Turing award” in Wikipedia at http://en.wikipedia.org/wiki/Turing_Award