Time Warps, String Edits, and Macromolecules
The Theory and Practice of Sequence Comparison
David Sankoff and Joseph Kruskal, with introduction by John Nerbonne
Time Warps, String Edits and Macromolecules is a young classic in computational science, scientific analysis from a computational perspective. The computational perspective is that of sequence processing, in particular the problem of recognizing related sequences. The book is the first, and still best compilation of papers explaining how to measure distance between sequences, and how to compute that measure effectively. This is called string distance, Levenshtein distance, or edit distance. The book contains lucid explanations of the basic techniques; wellannotated examples of applications; mathematical analysis of its computational (algorithmic) complexity; and extensive discussion of the variants needed for weighted measures, timed sequences (songs), applications to continuous data, comparison of multiple sequences and extensions to treestructures. In molecular biology the sequences compared are the macromolecules DNA and RNA. Sequence distance allows the recognition of homologies (correspondences) between related molecules. One may interpret the distance between molecular sequences in terms of the mutations necessary for one molecule to evolve into another. A further application explores methods of predicting the secondary structure (chemical bonding) of RNA sequences. In speech recognition speech input must be compared to stored patterns to find the most likely interpretation (e.g., syllable). Because speech varies in tempo, part of the comparison allows for temporal variation, and is known as “timewarping”. In dialectology Levenshtein distance allows analysis of the learned variation in pronunciation, its cultural component. Levenshtein distance introduces a metric which allows more sophisticated analysis than traditional dialectology's focus on classes of alternative pronunciations. A similar application is the study of bird song, where degrees of distance in song are seen to correspond to the divergence of bird populations. A final application area is software, where Levenshtein distance is employed to located differing parts of different versions of computer files, and to perform error correction.
David Sankoff is a professor in the Department of Mathematics and Statistices at the University of Montreal. Joseph Kruskal works at the Mathematics and Statistics Research Center at Bell Laboratories
John Nerbonne is a professor at Alfainformatica, BCN, University of Groningen.

 1 An Overview of Sequence Comparison
Joseph B. Kruskal
 1 Introduction
 2 How sequences differ
 3 Analysis of differences: trace, alignment, and listing
 4 Levenshtein and other distances
 5 Calculating distance: the basic algorithm
 6 Computational complexity of sequence comparisons
 7 How sequence comparison is used
 8 Why should distances be used?
 9 Why should Levenshtein be used?
 References
 I Macromolecular Sequences
 1 Introduction
 2 Advantages of the Dynamic Programming Method
 3 Biological Background
 References
 2 Recognition Patterns in Genetic Sequences
Bruce W. Erikson and Peter H. Sellers
 1 Evolutionary distance and metric alignment
 2 Two patternrecognition theorems
 3 Recognition of pattern s in sequence 1
 4 Recognition of a pattern in two DNA sequences
 5 A metric set of genetic distances between amino acids
 6 Recognition of a cyclically permuted pattern in two protein sequences
 7 Recognition of a potential translocation signal in a protein sequence
 8 Acknowledgements
 References
 3 Fast Algorithms to Determine RNA Secondary Structure Containing Multiple Loops
David Sankoff, Joseph B. Kruskal, Sylvie Mainville, and Robert J. Cedergren
 1 Introduction
 2 Constraints on secondary structure
 3 Loops
 4 The thermodynamics of a structure
 5 Sequence comparison and selfcomparison
 6 A recursion
 7 The basic algorithm
 8 An n^{3} algorithm
 9 An n^{4} algorithm
 10 A heuristic extension algorithm for longer sequences
 References
 II TimeWrapping, Continuous Functions, and Speech Processing
 1 Introduction
 References
 4 The Symmetric TimeWrapping Problem: From Continuous to Discrete
Joseph B. Kruskal and Mark Liberman
 1 Introduction
 2 Timewrapping in the continuous case
 3 Timewrapping in the discrete case
 4 Distance and length in the continuous case
 5 Distance and length in the discrete case
 6 How compressionexpansion differs from deletioninsertion
 7 Combining deletioninsertion and compressionexpansion in a single method
 8 Interpolation between the sampling points
 9 Average of two trajectories
 10 Average of two sequences
 References
 5 Use of Dynamic Programming in a SyllableBased Continuous speech Recognition System
Melvyn J. Hunt, Matthew Lenning, and Paul Mermelstein
 1 Introduction
 2 Speech recognition as a stringmatching problem
 3 Approaches to the problem
 4 The syllablebased system
 5 The syllable comparator
 6 System performance
 7 Conclusions
 References
 6 Application of Sequence Comparison to the Study of Bird Songs
David W. Bradley and Richard A. Bradley
 1 Introduction
 2 Methods for coding the songs
 3 Determining internote distrances
 4 Algorithms for sequence comparison
 5 Comparative evaluationg of the methods
 6 Application of the method
 7 Discussion
 8 Conclusion
 9 Acknowledgements
 References
 Appendix 1 APL programs
 Appendix 2 Statistical Methods
 III Variations on a Theme: Algorithms for Related Problems
 1 Introduction
 2 Finding a pattern in a sequence
 3 Finding similar portions in two sequences
 4 Comparing several sequences
 5 Comparing one sequence with itself
 6 Comparing two trees
 7 Comparing two directed networks
 8 Comparing two continuous functions permitting compression and expansion
 9 Permitting transpositions
 10 Permitting generalized substituions
 11 Comparing under constraints
 12 Generalizing Levenshtein
 References
 7 On the Complexity of the Extended StringtoString Correction Problem
Robert A. Wagner
 1 Introduction
 2 The problem
 3 Cuts of traces
 4 Cellars and paths
 5 Subgraphs of G
 6 CELLAR algorithm
 7 Proof of the algorithm
 8 Special algorithms for special sets of w_{x} values
 9 An NPcomplete ESSCP
 10 A generalization
 References
 8 An analysis of the General TreeEditing Problem
Andrew S. Noetzel and Stanley M. Selkow
 1 Introduction
 2 Definitions
 3 Insertions and deletions restricted to leaves
 4 An algorithm for tree distance when insertions and deletions are restricted to leaves
 5 The case of general insertions and deletions
 6 An algorithm for treeedit distance in the case of general insertions and deletions
 7 Conclusions
 References
 9 Simultaneous Comparison of Three or More Sequences Related by a Tree
David Sankoff and Robert J. Cedergren
 1 Introduction
 2 Tree alignments
 3 An algorithm to calculate tree distances
 4 A dynamicprogramming algorithm for the inner minimization
 References
 10 An Anthology of Algorithms and Concepts for Sequence Comparison
Joseph B. Kruskal and David Sankoff
 1 Directed networks
 2 Levelbuilding in recognition of continuous speech
 3 Cutting corners to improve calculation time
 4 Deletion and insertion constraints
 5 A biological application of insertion and deletion constraints: 5S RNA from humans and E. coli
 6 Finding similar portions of two sequences
 7 Weights for insertion and deletion of consecutive strings
 8 Generalized substituions: Quartic, cubic, and quadratictime algorithms
 9 Consistency and metric properties: some conditions
 References
 11 Dissimilarity Measures for Clustering Strings
James M. Coggins
 1 Background
 2 String dissimilarity measures
 3 Improved string dissimilarity measures
 4 Examples
 5 Conclusion
 References
 IV Computational Complexity
 1 Introduction
 12 Recent Results on the Complexity of CommonSubsequence Problems
Daniel S. Hirschberg
 1 Introduction
 2 NPcompleteness results
 3 Polynomial algorithms
 4 Lower bounds
 5 Other algorithms for the 2LCS problem
 6 An open problem
 References
 13 FormalLanguage Error Correction
Robert A. Wagner
 1 Introduction
 2 Error processing
 3 Optimal error correction
 4 Some results
 5 The contextsensitive cas and the maximum independent set problem
 6 A graphtheoretical formulation
 References
 14 How to Compute StringEdit Distances Quickly
William J. Masek and Michael S. Paterson
 1 Introduction
 2 A faster algorithm
 3 Longest common subsequence
 4 Sparseness is necessary
 5 When to use the faster algorithm
 6 Conclusion
 Acknowledge
 References
 V Random Sequences
 1 Introduction
 References
 15 An UpperBound Technique for Length
Václav Chvatál and David Sankoff
 1 Introduction
 2 An unexpected invariant for lengthn sequences
 3 A bound on the occurrences of highly similar sequences
 4 Calculating expected values
 References
 16 Probabilistic Behavior of LongestCommonSubsequences Length
Joseph Deken
 1 Probabilistic models for random sequences
 2 Upper bounds
 3 Upper bounds for c'(k, p, n)
 4 More extensive precedence rules
 References
 17 Common Subsequences and Monotone Subsequences
David Sankoff and Sylvie Mainville
 1 Common subsequences
 2 Monotone subsequences
 References
 Author Index
 Subject Index
October 2001
ISBN (Paperback): 1575862174 (9781575862170)
ISBN (electronic): 1575867117 (9781575867113)
Subject: Computer Science; Sequences; Pattern Perception

Distributed by the University of Chicago Press
