Evrim Acar Department of Computer Science Rensselaer Polytechnic Institute COLLECTIVE SAMPLING AND ANALYSIS OF HIGH ORDER TENSORS FOR CHATROOM COMMUNICATIONS This work investigates the accuracy and efficiency tradeoffs between centralized and collective algorithms for (i) sampling and (ii) n-way data analysis techniques in multidimensional stream data, such as Internet chatroom communications. Its contributions are threefold. First, we use the Kolmogorov-Smirnov goodness-of-fit test to demonstrate that statistical differences between real data obtained by collective sampling in time dimension from multiple servers and that of obtained from a single server are insignificant. Second, we show using the real data that collective data analysis of 3-way data arrays (users x keywords x time) is more efficient than centralized algorithms with respect to both space and computational cost. Third, we examine the sensitivity of collective constructions and analysis of high order data arrays to the choice of server selection and sampling window size. We construct 4-way datasets (users x keywords x time x servers) and analyze them to show the impact of server and window size selections on the results. Ioannis Antonellis Computer Engineering and Informatics Department University of Patras TENSOR BASED TEXT REPRESENTATION: A NEW DIMENSION IN IR We investigate the basics of and experiment with a tensor based document representation in Information Retrieval. Most documents have an inherent hierarchical structure that renders desirable the use of flexible multidimensional representations such as those offered by tensor objects. We focus on the performance of special instances of a Tensor Model, in which documents are represented using second-order (matrices) and third-order tensors. We exploit the local-structure encapsulated by the proposed representation to approximate these tensors using high order singular value and nonnegative tensor decompositions and assemble the results to form the final term-document matrix. We analyze the spectral properties of this matrix and observe that topic identification is enhanced by deploying k-plane clustering. Our results provide evidence that tensor based models can be particularly effective for IR, offering an excellent alternative to traditional VSM and LSI especially for text collections of multi-topic documents. Joint work with Efstratios Gallopoulos. Onureena Banerjee Department of Electrical Engineering and Computer Sciences University of California at Berkeley CONVEX OPTIMIZATION TECHNIQUES FOR LARGE-SCALE COVARIANCE SELECTION We consider the problem of fitting a large-scale covariance matrix to multivariate Gaussian data in such a way that the inverse is sparse, thus providing model selection. Beginning with a dense empirical covariance matrix, we solve a maximum likelihood problem with an $l_1$-norm penalty term added to encourage sparsity in the inverse. For models with tens of nodes, the resulting problem can be solved using standard interior-point algorithms for convex optimization, but these methods scale poorly with problem size. We present two new algorithms aimed at solving problems with a thousand nodes. The first, based on Nesterov's first-order algorithm, yields a rigorous complexity estimate for the problem, with a much better dependence on problem size than interior-point methods. Our second algorithm uses block coordinate descent, updating row/columns of the covariance matrix sequentially. Experiments with genomic data show that our method is able to uncover biologically interpretable connections among genes. Christos Boutsidis Computer Engineering and Informatics Department University of Patras, Greece PALSIR: A NEW APPROACH TO NONNEGATIVE TENSOR FACTORIZATION PALSIR (Projected Alternating Least Squares with Initialization and Regularization) is a new approach to Nonnegative Tensor Factorization (NTF). PALSIR is designed to decompose a nonnegative ( D1 x D2 x D3) tensor T into a sum of 'k' nonnegative (D1 x D2 x D3) rank-1 tensors T_i each of which can be written as the outer product of three nonnegative vectors: x_i, y_i and z_i of dimensions (D1 x 1), (D2 x 1) and (D3 x 1) respectively, i=1:k. PALSIR consists of the following phases: i) initialization of 2 out of the 3 vector groups x_i's, y_i's, z_i's; ii) an iterative tri-alternating procedure where, at each stage, two of the three groups of vectors remain fixed and a nonnegative solution is computed with respect to the third group. Each of these stages requires solving {D1,D2 or D3} constrained least squares (LS) problems. These are solved by first solving a linear ill-posed inverse problem in the least squares sense, using Tikhonov regularization and then projecting the solutions back to the feasible solution - instead of nonnegative LS. We applied PALSIR on a 3d cube of images of the space shuttle Columbia taken by an Air Force telescope system (in its last orbit before disintegration upon re-entry in February 2003) in order to identify a parts-based representation of the image collection. PALSIR appears to be competitive compared to other available NTF algorithms both in terms of computational cost and approximation accuracy. Joint work with E. Gallopoulos, P. Zhang and R.J. Plemmons. Dongwei Cao Department of Computer Science University of Minnesota at Twin Cities SUPPORT VECTOR MACHINE TRAINING WITH A FEW REPRESENTATIVES We study algorithms that speed up the training process of support vector machines by using only a relatively small number of representatives. We show how kernel K-means usually can be expected to yield a good set of representatives. The effectiveness is demonstrated with experiments on some real datasets and a theoretical PAC-style generalization bound. Maarten De Vos Department of Electrical Engineering Katholieke Universiteit Leuven IMPOSING INDEPENDENCE CONSTRAINTS TO THE CP-MODEL Data-driven decomposition techniques (like Independent Component Analysis (ICA), Canonical Correlation Analysis (CCA), Canonical Decomposition/PARAFAC (CP)) received in the last decades an increasing amount of attention as an exploratory tool in biomedical signal processing as opposed to model-driven techniques. Recently, "tensor-ICA", a combination of ICA and the CP model was introduced as a new concept for the decomposition of functional Magnetic Resonance data (fMRI) (Beckmann et al, 2005). The trilinear structure was in that study imposed after the computation of the independent components. We propose another algorithm to compute a trilinear decomposition with supposed independence in one mode. In this algorithm, independent component and trilinear structure constraints are imposed at the same time. We also show that this new algorithm outperforms the previously proposed tensor-ICA. Joint research with Lieven De Lathauwer and Sabine Van Huffel. Amit Deshpande Department of Mathematics Massachusetts Institute of Technology FAST RELATIVE LOW-RANK MATRIX APPROXIMATION Low-rank approximation using Singular Value Decomposition (SVD) is computationally expensive for certain applications involving large matrices. Frieze-Kannan-Vempala (FKV) showed that from a small sample of rows of the given matrix we can compute a low-rank approximation, which is (in expectation) only an additive error worse than the "best" low-rank approximation. This can be converted into a randomized algorithm to compute this additive low-rank approximation in "linear" (in the number of non-zero entries) time. But in general, their additive error can be unbounded compared to the error of the "best" low-rank approximation. Using some generalizations of the FKV sampling scheme, we strenghthen their results for low-rank approximation within multiplicative error. Based on this we get a randomized algorithm to find such an approximation in "linear" time. Joint work with Luis Rademacher, Santosh Vempala and Grant Wang. Mariya Ishteva Department of Electrical Engineering Katholieke Universiteit Leuven RANK-(R1,R2,R3) REDUCTION OF TENSORS BASED ON THE RIEMANNIAN TRUST-REGION SCHEME We consider unstructured third-order tensors and look for the best (R1,R2,R3) low-rank approximation. In the matrix case, low-rank approximation can be obtained from the truncated Singular value decomposition (SVD). However, in the tensor case, the truncated Higher-order SVD (HOSVD) gives a suboptimal low-rank approximation of a tensor, which can only be used as a starting value for iterative algorithms. The algorithm we present is based on the Riemannian trust-region method. We express the tensor approximation problem as minimizing a cost function on a proper manifold. Making use of second order information about the cost function, superlinear convergence is achieved. Joint work with Lieven De Lathauwer, Pierre-Antoine Absil, Rodolphe Sepulchre, Sabine Van Huffel SungEun Jo Department of Computer Science Stanford University SOLVING SECULAR EQUATIONS FOR LARGE TOTAL LEAST SQUARES/DATA LEAST SQUARES PROBLEMS BY MEANS OF GAUSS QUADRATURE RULES We approximate secular equations for total least squares (TLS) and data least squares(DLS) problems by means of Lanczos tri-diagonalization processes. Based on Gaussian Quadrature (GQ) rules, the best number of Lanczos steps is determined with a given tolerance by investigating the bounds of the secular equations. The numerical example shows the efficacy of the GQ approach to solving the large TLS/DLS problems. We also discuss some implementation issues such as bisection, stabilization, Q-less QR preprocessing, and indefinite systems for TLS/DLS problems. Joint work with Gene Golub and Zheng Su. Andrew Knyazev Department of Mathematical Sciences University of Colorado at Denver MULTISCALE SPECTRAL GRAPH PARTITIONING AND IMAGE SEGMENTATION Spectral methods for graph partitioning, based on numerical solution of eigenvalue problems with the graph Laplacian, are well known to produce high quality partitioning, but are also considered to be expensive. We discuss modern preconditioned eigensolvers for computing the Fiedler vector of large scale eigenvalue problems. The ultimate goal is to find a method with a linear complexity, i.e. a method with computational costs that scale linearly with the problem size. We advocate the locally optimal block preconditioned conjugate gradient method (LOBPCG), suggested by the presenter, as a promising candidate, if matched with a high quality preconditioner. We provide preliminary numerical results, e.g., we show that a Fiedler vector for a 24 megapixel image can be computed in seconds on IBM's BlueGene/L using BLOPEX in our BLOPEX software with Hypre algebraic multigrid preconditioning. Rajesh Kumar and Swaroop Jagadish Department of Computer Science University of California, Santa Barbara RETRIEVAL OF BIOLOGICAL IMAGES BASED ON REGION SIMILARITY The sub-regions of an image may be more 'interesting' than an entire image. For e.g., an image of a retina has only a few regions that depict detachment of proteins in a certain fashion. To detect the presence of such protein detachments that are similar to a given image pattern, from a large database of retinal images, is a non-trivial task. The optimal algorithms for such pattern-matching are practically infeasible and unscalable. The goal of this project is to develop efficient heuristics for real time detection of such matching regions. Pei Ling Lai Department of Electronics Engineering Southern Taiwan University of Technology STOCHASTIC PROCESS METHODS FOR INFORMATION EXTRACTION We consider several stochastic process methods for performing canonical correlation analysis (CCA). The first uses a Gaussian Process formulation of regression in which we use the current projection of one data set as the target for the other and then repeat in the opposite direction. The second uses a method which relies on probabilistically sphering the data, concatenating the two streams and then performing a probabilistic PCA. The third gets the canonical correlation projections directly without having to calculate the filters first. We also investigate nonlinearity and sparsification of these methods. Finally, we use a Dirichlet process of Gaussian models in which the Gaussian models are determined by Probabilistic CCA in order to model nonlinear relationships with a mixture of linear correlations where the number of mixtures is not pre-determined. Wanjun Mi Institute for Computational and Mathematical Engineering Stanford University SHIFT SENSITIVITY OF EIGENFACE, EIGENPHASE, AND EIGENMAGNITUDE Eigenface method applies Principal Component Analysis on a set of learning images from which eigenface are extracted. It's widely used and studied in statistical image recognition. We demonstrate that this method is sensitive to shift in the learning images. The correlation of images with the eigenvector, also known as eigenfeatures, are shifting as the learning images shift. We also compare it with eigen-phase and eigen-magnitude methods. Niloy J. Mitra Department of Computer Science Stanford University PROBABILISTIC FINGERPRINTS FOR SHAPES We propose a new probabilistic framework for the efficient estimation of similarity between 3D shapes. Our framework is based on local shape signatures and is designed to allow for quick pruning of dissimilar shapes, while guaranteeing not to miss any shape with significant similarities to the query model in shape database retrieval applications. Since directly evaluating 3D similarity for massive collections of signatures on shapes is expensive and impractical, we propose a suitable but compact approximation based on probabilistic fingerprints which are computed from the shape signatures using Rabin?s hashing scheme and a small set of random permutations. We provide a probabilistic analysis that shows that while the preprocessing time depends on the complexity of the model, the fingerprint size and hence the query time depends only on the desired confidence in our estimated similarity. Our method is robust to noise, invariant to rigid transforms, handles articulated deformations, and effectively detects partial matches. In addition, it provides important hints about correspondences across shapes which can then significantly benefit other algorithms that explicitly align the models. We demonstrate extension of our algorithm to streaming data application. We demonstrate the utility of our method on a wide variety of geometry processing applications. A preliminary version of the work has been submitted to Symposium of Geometry Processing (2006) Kourosh Modarresi Institute for Computational and Mathematical Engineering Stanford University OUTLINES OF NON-CLASSICAL TIKHONOV METHOD Today we know that ill-posed problems, for which the solution does not continuously depend on the variations in the data, arise naturally from real physical problems and are not, in most cases, as a result of incorrect modeling, but due to the inherent characteristics of the original physical problems. In solving the corresponding Linear Least squares Problems, resulting from discretization of the original models, the usual solution methods such as QR/SVD or Normal equation Algorithm, would result in solutions with very little relevance to the exact solutions. The remedy, for the solution of these ill-conditioned least squares problem, is application of some regularization methods. The most used regularization methods are TSVD and Tikhonov regularization methods. In using Tikhonov regularization method, the most important steps are the choice of priori and the regularization parameter, which controls the level of regularization for the problem. Classical Tikhonov method is based on global priori and regularization parameter. This approach underestimates the local features of the solution and may result to oversmoothing of the original solution.To address this difficulty, a more global approach is needed. In this work, we consider this approach in the solution of the Tikhonov regularization method. Joint work with Gene H. Golub. Morten Mørup Department of Signal Processing Technical University of Denmark EXTENSIONS OF NON-NEGATIVE MATRIX FACTORIZATION (NMF) TO HIGHER ORDER DATA Higher order matrix (tensor) decompositions are mainly used in psychometrics, chemometrics, image analysis, graph analysis and signal processing. For higher order data the two most commonly used decompositions are the PARAFAC and the TUCKER model. If the data analyzed is non-negative it may be relevant to consider additive non-negative components. We here extend non-negative matrix factorization (NMF) to form algorithms for non-negative TUCKER and PARAFAC decompositions. Furthermore, we extend the PARAFAC model to account for shift and echo effects in the data. To improve uniqueness of the decompositions we use updates that can impose sparseness in any combination of modalities. The algorithms developed are demonstrated on a range of datasets spanning from electroencephalography to sound and chemometry signals. Chaitanya Muralidhara Department of Cellular and Molecular Biology University of Texas at Austin EXPLORING PHYLOGENETIC RELATIONSHIPS IN SEQUENCE ALIGNMENTS THROUGH SINGULAR VALUE DECOMPOSITION The 16S ribosomal RNA is a highly conserved molecule used to derive phylogenetic relationships among organisms, by traditional methods for phylogeny reconstruction. We describe the SVD analysis of an alignment of 16S rRNA sequences from organisms belonging to different phylogenetic domains. The dataset is transformed from a matrix of positions $\times$ organisms to a tensor of positions $\times$ code $\times$ organisms through a binary encoding that takes into account the nucleotide at each position. The tensor is flattened into a matrix of (positions $\times$ code) $\times$ organisms and singular value decomposition is applied, to obtain a representation in the ``eigenpositions'' $\times$ ``eigenorganisms'' space. These eigenpositions and eigenorganisms are unique orthonormal superpositions of the positions and organisms respectively. We show that the significant eigenpositions correlate with the underlying phylogenetic relationships among the organisms examined. The specific positions that contribute to each of these relationships, identified from the eigenorganisms, correlate with known sequence and structure motifs in the data, which are associated with functions like RNA-- or protein--binding. Among others, we identify unpaired adenosines as significant contributors to phylogenetic distinctions. These adenosine nucleotides, unpaired in the secondary (2D) structure, have been shown to be involved in a variety of tertiary (3D) structural motifs, some of which are believed to play a role in RNA folding [Gutell et al., \textit{RNA} 2000]. Joint work with Robin R. Gutell, Gene H. Golub, Orly Alter. Larsson Omberg Department of Physics University of Texas at Austin A TENSOR HIGHER-ORDER SINGULAR VALUE DECOMPOSITION FOR INTEGRATIVE ANALYSIS OF DNA MICROARRAY DATA The structure of DNA microarray data is often of an order higher than that of a matrix, especially when integrating data from different studies. Flattened into a matrix format, much of the information in the data is lost. We describe the use of a higher-order singular value decomposition (HOSVD) in transforming a tensor of genes $\times$ arrays $\times$ studies, which tabulates a series of DNA microarray datasets from different studies, to a ``core tensor'' of ``eigengenes'' $\times$ ``eigenarrays'' $\times$ ``eigenstudies,'' where the eigengenes, eigenarrays and eigenstudies are unique orthonormal superpositions of the genes, arrays and studies, respectively. This HOSVD, also known as N-mode SVD, formulates the tensor as a linear superposition of all possible outer products of an eigengene with an eigenarray with an eigenstudy, i.e., rank-1 ``subtensors,'' the superposition coefficients of which are tabulated in the core tensor. Each coefficient indicates the significance of the corresponding subtensor in terms of the overall information that this subtensor captures in the data. We show that significant rank-1 subtensors can be associated with independent biological processes, which are manifested in the data tensor. Filtering out the insignificant subtensors off the data tensor simulates experimental observation of only those processes associated with the significant subtensors. Sorting the data according to the eigengenes, eigenarrays and eigenstudies appears to classify the genes, arrays and studies, respectively, into groups of similar underlying biology. We illustrate this HOSVD with an integration of genome-scale mRNA expression data from yeast cell cycle time courses, each of which is under a different oxidative stress condition. Novel correlation between the DNA-binding of a transcription factor and the difference in the effects of these oxidative stresses on the progress of the cell cycle is predicted. Joint work with Gene H. Golub, Orly Alter. Sri Priya Ponnapalli Department of Electrical and Computer Engineering University of Texas at Austin A NOVEL HIGHER-ORDER GENERALIZED SINGULAR VALUE DECOMPOSITION FOR COMPARATIVE ANALYSIS OF MULTIPLE GENOME-SCALE DATASETS We define a higher-order generalized singular value decomposition (GSVD) of two or more matrices $D_{i}$ of the same number of columns and, in general, different numbers of rows. Each matrix is factored into a product $U_i \Sigma_i X^{-1}$ of a matrix $U_i$ composed of the normalized column basis vectors, a diagonal matrix $\Sigma_i$, and a nonsingular matrix $X^{-1}$ composed of the normalized row basis vectors. The matrix $X^{-1}$ is identical in all the matrix factorizations. The row basis vectors are the eigenvectors of $C$, the arithmetic mean of all quotients of the correlation matrices $D_{i}^{T}D_{i}$. The $n$th diagonal element of $\Sigma_{i}$, $\Sigma_{i,n}$, indicates the significance of the $n$th row basis vector in the $i$th matrix in terms of the overall information that the $n$th row basis vector captures in the $i$th matrix. The ratio $\Sigma_{i,n} / \Sigma_{j,n}$ indicates the relative significance of the $n$th row basis vector in the $i$th matrix relative to the $j$th matrix. We show that the eigenvalues of $C$ that correspond to row basis vectors of equal significance in all matrices $D_{i}$, such that $\Sigma_{i,n} / \Sigma_{j,n}=1$, are equal to 1; the eigenvalues that correspond to row basis vectors which are approximately insignificant in one or more matrices $D_{i}$ relative to all the other matrices $D_{j}$, such that $\Sigma_{i,n} / \Sigma_{j,n} \approx 0$, are $\gg 1$. We show that the column basis vector $U_{i,n}$ is orthogonal to all other column basis vectors if the corresponding $n$th row basis vector is of equal significance in all matrices, such that the correspodning eigenvalue of $C$ is 1. These properties of the GSVD of two matrices [Golub \& Van Loan, Johns Hopkins Univ.~Press 1996] are preserved in this higher-order GSVD of two or more matrices. Recently we showed that the mathematical row basis vectors uncovered in the GSVD of two genome-scale mRNA expression datasets from two different organisms, human and the yeast {\it Saccharomyces cerevisiae}, during their cell cycle, correspond to the similar and dissimilar among the biological programs that compose each of the two datasets [Alter, Brown \& Botstein, {\it PNAS} 2003]. We now show that the mathematical row basis vectors uncovered in this higher-order GSVD of five genome-scale mRNA expression datasets from five different organisms, human, the yeast {\it Saccharomyces cerevisiae}, the yeast {\it Schizosacchomyces pombe}, bacteria and plant during their cell cycle, correspond to the similar and dissimilar among the biological programs that compose each of the five datasets. Row basis vectors of equal significance in all datasets correspond to the cell cycle program which is common to all organisms; row basis vectors which are approximately insignificant in one or more of the datasets correspond to biological programs, such as synchronization responses, that are exclusively manifested in all the other datasets and might be exclusive to the corresponding organisms. Such comparative analysis of genome-scale mRNA data among two or more model organisms, that is not limited to orthologous or homologous genes across the different organisms, promises to enhance fundamental understanding of the universality as well as the specialization of molecular biological mechanisms. Joint work with Gene H. Golub, Orly Alter. Luis Rademacher Department of Mathematics Massachusetts Institute of Technology MATRIX APPROXIMATION AND PROJECTIVE CLUSTERING VIA ADAPTIVE SAMPLING Frieze, Kannan and Vempala proved that a small sample of rows of a given matrix contains a low rank approximation that minimizes the distance in terms of the Frobenius norm to within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. We generalize this work by showing that the additive error drops exponentially by iterating the sampling in an adaptive manner. This result is one of the ingredients of the linear time algorithm for multiplicative low-rank approximation by Deshpande and Vempala. The existence of a small certificate for multiplicative low-rank approximation leads to a PTAS for the following projective clustering problem: Given a set of points in Euclidean space and integers k and j, find j subspaces of dimension k that minimize the sum over the points of squared distances of each point to the nearest subspace. Joint work with Amit Deshpande, Santosh Vempala and Grant Wang. Inam Ur Rahman Institute for Computational and Mathematical Engineering Stanford University MANIFOLD-VALUED DATA MINING New types of sensors and devices are being built everyday. Not only we are measuing huge amount of data but also new types of data, highly geometric in nature and inherently different than traditional Eucledian-valued data. Typical examples include human motion data in animation and diffusion tensor data in medical imaging. These type of data takes values on special Riemannain manifolds, called Symmetric Spaces. We call it 'Manifold-Valued' data. As the amount M-valued data touches terabyte mark, tools are needed that can efficiently mine this type of data. For example, radiologist in medical imaging might be interested in searching many terabytes of diffusion tensor data to find those matching, in a certain sense, given DT image . In animation one might be interested in extracting those motion clips, from huge motion capture database, that matches given query clip or description given by the animator. Hence all the traditional supervised and unsupervised learning issues, like clustering, classification, indexing, retrieval, searching etc, become important for M-valued data. Learning algorithm for Eucledian-valued data may not be appropriate and directly applicable because of highly geometric and non-linear nature of M-valued data. In this work, we discuss new wavelet like transform, that we have developed for M-valued data and its applicability in facilitating above cited learning and data mining task. David Skillicorn School of Computing Queen's University MATRIX DECOMPOSITIONS AND SECURITY PROBLEMS Problems in counterterrorism, fraud, law enforcement, and organizational attempts to watch for corporate malfeasance all require looking for traces in large datasets. Sophisticated `bad guys' face two countervailing pressures: the needs of the task or mission, and the need to remain concealed. Because they are sophisticated, they do not show up as outliers; because they are unusual, they don't show up as mainstream either. Instead, they are likely to appear at the `edges' of structures in the data. Several properties of matrix decompositions make them superb tools to look in the `edges' or `corners' of datasets. For example, SVD transforms data into a space in which distance from the origin corresponds, in a useful sense, to interestingness; data from innocent people provides a picture of normal correlation, against which unusual correlation stands out; and the symmetry between records and attributes makes it possible to investigate how `edge' records differ from normal ones. The machinery of spectral graph partitioning can also be used to look for unusual records, or values using link prediction. Unresolved issues of normalization and removing stationarity are critical to making these approaches work. Other matrix decompositions also have a role to play. ICA, for example, is powerfully able to discover small tightly-knit subgroups within a dataset. It was able to discover cells within a dataset of al Qaeda links. The importance of textual data suggests a role (as yet unfulfilled) for NNMF. Jimeng Sun Department of Computer Science Carnegie Mellon University BEYOND STREAMS AND GRAPHS: DYNAMIC TENSOR ANALYSIS Time-evolving data models have been widely studied in data mining field such as time-series, data streams and graphs over time. We argue that all these canonical examples can be covered and enriched using a flexible model {\em tensor stream}, that is a sequence of tensors growing over time. Under this model, we propose two streaming algorithms for tensor PCA, a generalization of PCA for a sequence of tensors instead of vectors. We applied them in two real settings, namely, anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets. Grant Wang Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology EIGENCLUSTER: FINDING INNATE CLUSTERINGS ON THE FLY We present a spectral algorithm for clustering massive data sets based on pairwise similarities. The algorithm has guarantees on the quality of the clustering found. The algorithm is especially well-suited for the common case where data objects are encoded as sparse feature vectors and the pairwise similarity between objects is the inner product between their feature vectors; here, the algorithm runs in space linear in the number of nonzeros in the object-feature matrix. The spectral algorithm outputs a hierarchical clustering tree. We show how to use dynamic programming to find the optimal tree-respecting clustering for many natural clustering objective functions, such as k-means, k-median, min-diameter, and correlation clustering. We evaluate the algorithm on a handful of real-world datasets; the results show our method compares favorably with known results. We also give an implementation of a meta-search engine that clusters results from web searches. This is joint work with David Cheng, Ravi Kannan, and Santosh Vempala. Joab Winkler Department of Computer Science University of Sheffield STRUCTURED MATRIX METHODS FOR THE COMPUTATION OF A RANK REDUCED SYLVESTER MATRIX The Sylvester resultant matrix S(p,q) is a structured matrix that can be used to determine if two polynomials p=p(y) and q=q(y) are coprime, and if they are not coprime, it allows their greatest common divisor (GCD) to be computed. In particular, the rank loss of S(p,q) is equal to the degree of the GCD of p(y) and q(y), and the GCD can be obtained by reducing S(p,q) to row echelon form. The computation of the GCD of two polynomials arises in many applications, including computer graphics, control theory and geometric modelling. Experimental errors imply that the data consists of noisy realisations of the exact polynomials p(y) and q(y), and thus even if p(y) and q(y) have a non-constant GCD, their noisy realisations, f(y) and g(y) respectively, are coprime. It is therefore only possible to compute an approximate GCD, that is, a GCD of the polynomials f*(y) and g*(y) that are obtained by small perturbations of f(y) and g(y). Different perturbations of f(y) and g(y) yield different approximate GCDs, all of which are legitimate if the magnitude of these perturbations is smaller than the noise in the coefficients. It follows that f*(y) and g*(y) have a non-constant GCD, and thus the Sylvester resultant matrix S(f*, g*) is a low rank approximation of the Sylvester matrix S(f,g). The method of structured total least norm (STLN) is used to compute the rank reduced Sylvester resultant matrix S(f*, g*), given inexact polynomials f(y) and g(y). Although this problem has been considered previously, there exist several issues that have not been addressed, and that these issues have a considerable effect on the computed approximate GCD. The GCD of f(y) and g(y) is equal (up to a scalar multiplier) to the GCD of f(y) and ag(y), where a is an arbitrary constant, and it is shown that a has a significant effect on the computed results. In particular, although the GCD of f(y) and ag(y) is independent (up to an arbitrary constant) of a, an incorrect value of a leads to unsatisfactory numerical answers. This dependence on the value of a has not been considered previously, and methods for the determination of its optimal value are considered. It is shown that a termination criterion of the optimisation algorithm that is based on a small normalised residual may lead to incorrect results, and that it is also necessary to monitor the singular values of S(f*,g*) in order to achieve good results. Several non-trivial examples are used to illustrate the importance of a, and the effectiveness of a termination criterion that is based on the normalised residual and the singular values of S(f*,g*). The dependence of the computed solution on the value of a has implications for the method that is used for the solution of the least squares equality (LSE) problem that arises from the method of STLN. In particular, this problem is usually solved by the penalty method (method of weights), which requires that the value of the weight be set, but its value is defined heuristically, that is, it is independent of the data (the coefficients of the polynomials). As noted above, the value of the parameter a is crucial to the success or failure of the computed solution, and thus the presence of a parameter that is defined heuristically is not satisfactory. The QR decomposition, which does not suffer from this disadvantage, is therefore used to solve the LSE problem. Joint work with John D. Allan.