9:00 - 9:45 BREAKFAST/REGISTRATION

9:45 - 10:00 Opening: Organizers

10:00 - 11:00 Tutorial: Christos Faloutsos, Carnegie Mellon University
GRAPH MINING: LAWS, GENERATORS AND TOOLS
What do graphs look like? How do they evolve over time? How can we generate realistic-looking graphs? We review some static and temporal "laws", and we describe the "Kronecker" graph generator, which naturally matches all of the known properties of real graphs. Moreover, we present tools for discovering anomalies and patterns in two types of graphs, static and time-evolving. For the former, we present "CenterPiece" subgraphs (CePS), which takes $q$ query nodes (e.g., suspicious people) and finds the node that is best connected to all $q$ of them (e.g., the mastermind of a criminal group). We also show how to compute CenterPiece subgraphs efficiently. For time-evolving graphs, we present tensor-based methods and apply them to real data, like the DBLP author-paper dataset, where they are able to find natural research communities and track their evolution. Finally, we also briefly mention some results on influence and virus propagation in real graphs, as well as on the emerging MapReduce approach and its impact on large-scale graph mining.

11:00 - 11:30 Deepak Agarwal, Yahoo! Research, Silicon Valley
PREDICTIVE DISCRETE LATENT MODELS FOR LARGE INCOMPLETE DYADIC DATA
We propose a novel statistical model to predict large-scale dyadic response variables in the presence of covariate information. Our approach simultaneously incorporates the effect of covariates and estimates the local structure that is induced by interactions among the dyads through a discrete latent factor model. The discovered latent factors provide a predictive model that is both accurate and interpretable. To further combat the sparsity that is often present in large-scale dyadic data, we enhance the discrete latent factors by coupling them with a factorization model on the cluster effects. In fact, by regularizing the factors through $L_{2}$ norm constraints, we are able to allow a larger number of factors that capture dyadic interactions with greater precision than a pure factorization model, especially for sparse data. The methods are illustrated in a generalized linear model framework and include linear regression, logistic regression, and Poisson regression as special cases. We also provide scalable generalized EM-based algorithms for model fitting using both "hard" and "soft" cluster assignments. We demonstrate the generality and efficacy of our approach through large-scale simulation studies and analysis of datasets obtained from certain real-world movie recommendation and internet advertising applications.

11:30 - 12:00 Chandrika Kamath, Lawrence Livermore National Laboratory
SCIENTIFIC DATA MINING: WHY IS IT DIFFICULT?
Scientific data sets, whether from computer simulations, observations, or experiments, are not only massive but also very complex. This makes it challenging to find useful information in the data, a fact scientists find disconcerting as they wonder about the science still undiscovered in the data. In this talk, I will describe my experiences in analyzing data in a variety of scientific domains. Using example problems from astronomy, fluid mixing, remote sensing, and experimental physics, I will describe our solution approaches and discuss some of the challenges we have encountered in mining these datasets.

12:00 - 2:00 LUNCH (on your own)

2:00 - 3:00 Tutorial: Edward Chang, Google Research, Mountain View
CHALLENGES IN MINING LARGE-SCALE SOCIAL NETWORKS
Social networking sites such as Orkut, MySpace, Hi5, and Facebook attract billions of visits a day, surpassing the page views of Web search. These social networking sites provide applications for individuals to establish communities, to upload and share documents/photos/videos, and to interact with other users. Take Orkut as an example. Orkut hosts millions of communities, with hundreds of communities created and tens of thousands of blogs/photos uploaded each hour. To help users find relevant information, it is essential to provide effective collaborative filtering tools to perform recommendations such as friend, community, and ad matching. In this talk, I will first describe both the computational and the storage challenges that this information explosion poses to traditional collaborative filtering algorithms. To deal with huge social graphs that expand continuously, an effective algorithm should be designed to 1) run on thousands of parallel machines to share storage and speed up computation, 2) perform incremental retraining and updates to attain online performance, and 3) fuse information from multiple sources to alleviate information sparseness. In the second part of the talk, I will present algorithms we recently developed, including parallel FP-Growth [1], parallel combinational collaborative filtering [2], parallel LDA, parallel spectral clustering, and parallel Support Vector Machines [3]. [1,2,3] Papers and some code are available at http://infolab.stanford.edu/~echang/

3:00 - 3:30 Sharad Goel, Yahoo! Research, New York
PREDICTIVE INDEXING FOR FAST SEARCH
We tackle the computational problem of query-conditioned search. Given a machine-learned scoring rule and a query distribution, we build a predictive index by pre-computing lists of potential results sorted by the expected score of each result over future queries. The predictive index data structure supports an anytime algorithm for approximate retrieval of the top elements. The general approach is applicable to web page ranking, internet advertising, and approximate nearest neighbor search. It is particularly effective in settings where standard techniques (e.g., inverted indices) are intractable. We experimentally find substantial improvement over existing methods for internet advertising and approximate nearest neighbors. This is joint work with John Langford and Alex Strehl.

3:30 - 4:00 James Demmel, University of California, Berkeley
AVOIDING COMMUNICATION IN LINEAR ALGEBRA ALGORITHMS
We survey recent results on designing numerical algorithms to minimize their largest cost component: communication. This could be the bandwidth and latency costs between processors over a network, or between levels of a memory hierarchy; both costs are increasing exponentially relative to the cost of floating-point arithmetic. We describe novel algorithms in sparse and dense linear algebra, for both direct methods (like QR and LU) and iterative methods, that can minimize communication.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00 Jun Liu, Harvard University
BAYESIAN INFERENCE OF INTERACTIONS AND ASSOCIATIONS
In a regression or classification problem, one often has many potential predictors (independent variables), and these predictors may interact with each other to exert non-additive effects. I will present a Bayesian approach to search for these interactions. We were motivated by the epistasis detection problem in population-based genetic association studies, i.e., detecting interactions among multiple genetic defects (mutations) that may be causal for a specific complex disease. Aided by MCMC sampling techniques, our Bayesian method can efficiently detect interactions among many thousands of genetic markers. A related problem is to find patterns or associations in text documents, such as the "market basket" problem, in which one attempts to mine association rules among the items in a supermarket based on customers' transaction records. For this problem, we formulate our goal as finding subsets of words that tend to co-occur in a sentence. We call a set of co-occurring words (not necessarily ordered) a "theme" or a "module". I will illustrate a simple generative model and the EM algorithm for inferring the themes, and demonstrate applications in a few examples, including an analysis of Chinese medicine prescriptions and an analysis of a Chinese novel.

5:00 - 5:30 Fan Chung, University of California, San Diego
FOUR GRAPH PARTITIONING ALGORITHMS
We will discuss four partitioning algorithms using eigenvectors, random walks, PageRank, and their variations. In particular, we will examine local partitioning algorithms, which find a cut near a specified starting vertex, with a running time that depends on the size of the small side of the cut rather than on the size of the input graph (which can be prohibitively large). Three of the four partitioning algorithms are local algorithms and are particularly appropriate for applications involving modern massive data sets.

5:30 - 6:00 Ronald Coifman, Yale University
DIFFUSION GEOMETRIES AND HARMONIC ANALYSIS ON DATA SETS
We discuss the emergence of self-organization of data, either through eigenvectors of affinity operators or, equivalently, through a multiscale ontology. We illustrate these ideas on images and audio files as well as molecular dynamics.

6:00 - 8:00 WELCOME RECEPTION AT NEW GUINEA GARDEN
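Fan Chung's talk above concerns partitioning via eigenvectors and its local variants. As background, the classical global version of the eigenvector idea, spectral bisection by the sign of the Fiedler vector, can be sketched in a few lines. This is an illustrative sketch only (the function name and the toy graph are ours), not one of the local algorithms of the talk:

```python
import numpy as np

def fiedler_partition(adj):
    """Split a graph in two by the signs of the Fiedler vector, i.e. the
    eigenvector of the second-smallest eigenvalue of the Laplacian L = D - A."""
    deg = adj.sum(axis=1)
    lap = np.diag(deg) - adj
    vals, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    side_a = np.where(fiedler >= 0)[0]
    side_b = np.where(fiedler < 0)[0]
    return side_a, side_b

# Two triangles joined by a single edge: the natural cut separates them.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
left, right = fiedler_partition(A)
print(sorted(map(int, left)), sorted(map(int, right)))
```

Note that this global method touches the entire graph; the point of the local algorithms in the talk is precisely to avoid that cost.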

9:00 - 10:00 Tutorial: Milena Mihail, Georgia Institute of Technology
MODELS AND ALGORITHMS FOR COMPLEX NETWORKS, WITH NETWORK ELEMENTS MAINTAINING CHARACTERISTIC PROFILES
We will discuss new trends in the study of models and algorithms for complex networks. The common theme is to enhance network elements, say nodes, with a small list of attributes describing the profile of the node relative to other nodes of the network. Indeed, this is common practice in real complex networks. Standard models for complex networks have captured sparse graphs with heavy-tailed statistics and/or the small-world phenomenon. We will discuss random dot product graphs, a model that generates networks with a much wider range of properties, including varying degree distributions, varying graph densities, and varying spectral decompositions. The flexibility of this model comes from representing nodes as d-dimensional vectors (one dimension for each relevant attribute) sampled from natural general distributions, with links added with probabilities proportional to a similarity function over vectors in high dimensions. Concerning algorithms for complex networks, we argue that endowing local network elements with (a computationally reasonable amount of) global information can facilitate fundamental algorithmic primitives. For example, we will discuss how spectral representations (in particular, a network node being aware of its value according to a principal eigenvector) can facilitate searching, information propagation, and information retrieval. However, computing such eigenvector components locally in a distributed, asynchronous network is a challenging open problem.

10:00 - 10:30 Reid Andersen, Microsoft Research, Redmond
AN ALGORITHM FOR IMPROVING GRAPH PARTITIONS
The minimum quotient cut problem is a fundamental and well-studied problem in graph partitioning. We present an algorithm called Improve that improves a proposed partition of a graph, taking as input a subset of vertices and returning a new subset of vertices with a smaller quotient cut score. The most powerful previously known method for improving quotient cuts, which is based on parametric flow, returns a partition whose quotient cut score is at least as small as that of any set contained within the proposed set. For our algorithm, we can prove a stronger guarantee: the quotient score of the set returned is nearly as small as that of any set in the graph with which the proposed set has a larger-than-expected intersection. The algorithm finds such a set by solving a sequence of polynomially many s-t minimum cut problems, a sequence that cannot be cast as a single parametric flow problem. We demonstrate empirically that applying Improve to the output of various graph partitioning algorithms greatly improves the quality of the cuts produced without significantly impacting the running time. Joint work with Kevin Lang.

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30 Michael W. Mahoney, Yahoo! Research, Silicon Valley
COMMUNITY STRUCTURE IN LARGE SOCIAL AND INFORMATION NETWORKS
The concept of a community is central to social network analysis, and thus a large body of work has been devoted to identifying community structure. For example, a community may be thought of as a set of web pages on related topics, a set of people who share common interests, or more generally as a set of nodes in a network that are more similar amongst themselves than to the remainder of the network. Motivated by difficulties we experienced in actually finding meaningful communities in large real-world networks, we have performed a large-scale analysis of a wide range of social and information networks. Our main methodology uses local spectral methods and involves computing isoperimetric properties of the networks at various size scales -- a novel application of ideas from scientific computation to internet data analysis. Our empirical results suggest a significantly more refined picture of community structure than has previously been appreciated. Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small size scales, while at larger size scales the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly used network generation models. Moreover, it is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from the small social networks that have served as testbeds for community detection algorithms. Possible mechanisms for reproducing our empirical observations will be discussed, as will the implications of these findings for clustering, classification, and more general data analysis in modern large social and information networks.

11:30 - 12:00 Nikhil Srivastava, Yale University
GRAPH SPARSIFICATION BY EFFECTIVE RESISTANCES
A sparsifier of a graph G is a sparse subgraph H that approximates it in some natural or useful way. Benczur and Karger gave the first sparsifiers in 1996, in which the weight of every cut in H was within a factor of (1 +/- \epsilon) of its weight in G. In this work, we consider a stronger notion of approximation (introduced by Spielman and Teng in 2004) that requires the Laplacian quadratic forms of H and G to be close -- specifically, that x^T L' x = (1 +/- \epsilon) x^T L x for all vectors x in R^n, where L and L' are the Laplacians of G and H, respectively. This subsumes the Benczur-Karger notion, which corresponds to the special case of x in {0,1}^n. It also implies that G and H are spectrally similar and that L' is a good preconditioner for L. We show that every graph contains a sparsifier with O(n log n) edges, and that one can be found in nearly linear time by randomly sampling each edge of the graph with probability proportional to its effective resistance. A key ingredient in our algorithm is a subroutine of independent interest: a nearly linear time algorithm that builds a data structure from which we can query the approximate effective resistance between any two vertices of a graph in O(log n) time. We conjecture the existence of sparsifiers with O(n) edges, noting that these would generalize the notion of expander graphs, which are constant-degree sparsifiers for the complete graph. Joint work with Dan Spielman.

12:00 - 12:30 Amin Saberi, Stanford University
SEQUENTIAL ALGORITHMS FOR GENERATING RANDOM GRAPHS
Random graph generation has been studied extensively as an interesting theoretical problem. It has also become an important tool in a variety of real-world applications, including detecting motifs in biological networks and simulating networking protocols on the Internet topology. One of the central problems in this context is to generate a uniform sample from the set of simple graphs with a given degree sequence. The classic method for approaching this problem is Markov chain Monte Carlo (MCMC). However, current MCMC-based algorithms have large running times, which make them unusable for real-world networks with tens of thousands of nodes. This has constrained practitioners to use simple heuristics that are non-rigorous and have often led to wrong conclusions. I will present a technique that leads to almost linear time algorithms for generating random graphs for a range of degree sequences. I will also show how this method can be extended to generate random graphs that exclude certain subgraphs, e.g., short cycles. Our approach is based on the sequential importance sampling (SIS) technique, which has recently been successful for counting and sampling graphs in practice.
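The sampling probabilities in Srivastava's sparsification result can be computed exactly, though not in nearly linear time, from the Moore-Penrose pseudoinverse of the Laplacian via R(u,v) = L+[u,u] + L+[v,v] - 2 L+[u,v]. A brute-force sketch for small graphs (our own illustration; the talk's contribution is precisely avoiding this dense computation):

```python
import numpy as np

def effective_resistances(adj):
    """Effective resistance between the endpoints of each edge, computed
    from the pseudoinverse of the Laplacian.  A sparsifier would sample
    edge e with probability proportional to w_e * R_e."""
    n = adj.shape[0]
    lap = np.diag(adj.sum(axis=1)) - adj
    lplus = np.linalg.pinv(lap)
    res = {}
    for u in range(n):
        for v in range(u + 1, n):
            if adj[u, v]:
                res[(u, v)] = lplus[u, u] + lplus[v, v] - 2 * lplus[u, v]
    return res

# Path graph 0-1-2: each edge is a bridge, so its effective resistance
# equals its own resistance, 1.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
R = effective_resistances(A)
print(R)
```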
12:30 - 2:30 LUNCH (on your own)

2:30 - 3:00 Pankaj K. Agarwal, Duke University
MODELING AND ANALYZING MASSIVE TERRAIN DATA SETS
With recent advances in terrain-mapping technologies such as laser altimetry (LIDAR) and ground-based laser scanning, millions of georeferenced points can be acquired within short periods of time. However, while acquiring and georeferencing the data has become extremely efficient, transforming the resulting massive amounts of heterogeneous data into useful information for different types of users and applications is lagging behind, in large part because of the scarcity of robust, efficient algorithms for terrain modeling and analysis that can handle massive data sets acquired by different technologies and that can rapidly detect and predict changes in the model as new data is acquired. This talk will review our ongoing work on developing efficient algorithms for terrain modeling and analysis that work with massive data sets. It will focus on an I/O-efficient algorithm for computing contour maps of a terrain. A few open questions will also be discussed.

3:00 - 3:30 Leonidas Guibas, Stanford University
DETECTION OF SYMMETRIES AND REPEATED PATTERNS IN 3D POINT CLOUD DATA
Digital models of physical shapes are becoming ubiquitous in our economy and life. Such models are sometimes designed ab initio using CAD tools, but more and more often they are based on existing real objects whose shape is acquired using various 3D scanning technologies. In most instances, the original scanner data is just a set, but a very large set, of points sampled from the surface of the object. We are interested in tools for understanding the local and global structure of such large-scale scanned geometry for a variety of tasks, including model completion, reverse engineering, shape comparison and retrieval, shape editing, inclusion in virtual worlds and simulations, etc. This talk will present a number of point-based techniques for discovering global structure in 3D data sets, including partial and approximate symmetries, shared parts, repeated patterns, etc. It is also of interest to perform such structure discovery across multiple data sets distributed in a network, without ever actually bringing them all to the same host.

3:30 - 4:00 Yuan Yao, Stanford University
TOPOLOGICAL METHODS FOR EXPLORING PATHWAY ANALYSIS IN COMPLEX BIOMOLECULAR FOLDING
We develop a computational approach to explore the relatively low-populated transition or intermediate states in biomolecular folding pathways, based on a topological data analysis tool, Mapper, with simulation data from large-scale distributed computing. Characterization of these transient intermediate states is crucial for the description of biomolecular folding pathways, but is difficult in both experiments and computer simulations. We are motivated by recent debates over the existence of multiple intermediates even in simple systems like RNA hairpins. With a proper design of conditional density filter functions and clustering on their level sets, we are able to provide structural evidence for the multiple intermediate states. The method is effective in dealing with a high degree of heterogeneity in the distribution, is less sensitive to the distance metric than geometric embedding methods, and captures structural features in multiple pathways. It is reminiscent of classical Morse theory from mathematics, efficiently capturing topological features of high-dimensional data sets.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00 Piotr Indyk, Massachusetts Institute of Technology
SPARSE RECOVERY USING SPARSE RANDOM MATRICES
In recent years, a new *linear* method for compressing high-dimensional data (e.g., images) has been discovered. For any high-dimensional vector x, its *sketch* is equal to Ax, where A is an m x n matrix (possibly chosen at random). Although the sketch length m is typically much smaller than the number of dimensions n, the sketch contains enough information to recover an *approximation* to x. At the same time, the linearity of the sketching method is very convenient for many applications, such as data stream computing and compressed sensing. The major sketching approaches can be classified as either combinatorial (using sparse sketching matrices) or geometric (using dense sketching matrices). They achieve different trade-offs, notably between the compression rate and the running time. Thus, it is desirable to understand the connections between them, with the ultimate goal of obtaining the "best of both worlds" solution. In this talk we show that, in a sense, the combinatorial and geometric approaches are based on different manifestations of the same phenomenon. This connection enables us to obtain several novel algorithms and constructions that inherit the advantages of sparse matrices, such as lower sketching and recovery times. Joint work with Radu Berinde, Anna Gilbert, Howard Karloff, Milan Ruzic, and Martin Strauss.

5:00 - 5:30 Ping Li, Cornell University
COMPRESSED COUNTING AND STABLE RANDOM PROJECTIONS
The method of stable random projections has become a popular tool for dimension reduction, in particular for efficiently computing pairwise distances in massive, high-dimensional data (including dynamic streaming data) matrices, with many applications in data mining and machine learning such as clustering, nearest neighbors, and kernel methods. Closely related to stable random projections, Compressed Counting (CC) was recently developed to efficiently compute Lp frequency moments of a single dynamic data stream. CC exhibits a dramatic improvement over stable random projections when p is about 1. Applications of CC include estimating entropy moments of data streams and statistical parameter estimation in dynamic data using low memory.
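As background for Ping Li's talk, the p = 2 case of stable random projections (Gaussian entries, since the normal distribution is 2-stable) is easy to sketch: pairwise Euclidean distances are approximately preserved after projection. The function name and scaling choice below are our own illustrative conventions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch(X, k):
    """Project the rows of X into k dimensions with a Gaussian (2-stable)
    random matrix; with the 1/sqrt(k) scaling, pairwise Euclidean
    distances are preserved up to small relative error with high
    probability."""
    n, d = X.shape
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    return X @ R

X = rng.normal(size=(20, 1000))
Y = sketch(X, 400)
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(orig, proj)
```

For p other than 2 (the setting of the talk), the Gaussian entries are replaced by samples from a p-stable distribution, and the estimator for the distance changes accordingly.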
5:30 - 6:00 Joel Tropp, California Institute of Technology
EFFICIENT ALGORITHMS FOR MATRIX COLUMN SELECTION
A deep result of Bourgain and Tzafriri states that every matrix with unit-norm columns contains a large collection of columns that is exceptionally well conditioned. This talk describes a randomized, polynomial-time algorithm for producing the desired submatrix.

9:00 - 10:00 Tutorial: Jerome H. Friedman, Stanford University
FAST SPARSE REGRESSION AND CLASSIFICATION
Regularized regression and classification methods fit a linear model to data, based on some loss criterion, subject to a constraint on the coefficient values. As special cases, ridge regression, the lasso, and subset selection all use squared-error loss with different particular constraint choices. For large problems, the general choice of loss/constraint combinations is usually limited by the computation required to obtain the corresponding solution estimates, especially when non-convex constraints are used to induce very sparse solutions. A fast algorithm is presented that produces solutions closely approximating those for any convex loss and a wide variety of convex and non-convex constraints, permitting application to very large problems. The benefits of this generality are illustrated by examples.

10:00 - 10:30 Tong Zhang, Rutgers University
AN ADAPTIVE FORWARD/BACKWARD GREEDY ALGORITHM FOR LEARNING SPARSE REPRESENTATIONS
Consider linear least squares regression where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. This problem is NP-hard. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever necessary. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory.
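Tong Zhang's forward/backward idea can be caricatured in a few lines: forward steps add the feature that most reduces the residual error, and a backward step removes a feature whenever doing so costs much less than the most recent gain. This is a simplified sketch with our own threshold rule, not the adaptive procedure analyzed in the talk:

```python
import numpy as np

def sq_err(X, y, cols):
    """Residual sum of squares of least squares restricted to cols."""
    if not cols:
        return float(y @ y)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ coef
    return float(r @ r)

def foba(X, y, k, nu=0.5):
    """Simplified forward-backward greedy feature selection (illustrative)."""
    S = []
    for _ in range(4 * k):                 # guard against add/remove cycling
        if len(S) == k:
            break
        # Forward: add the column that most reduces the residual error.
        best = min((c for c in range(X.shape[1]) if c not in S),
                   key=lambda c: sq_err(X, y, S + [c]))
        gain = sq_err(X, y, S) - sq_err(X, y, S + [best])
        S.append(best)
        # Backward: drop a column if its removal barely hurts.
        while len(S) > 1:
            worst = min(S, key=lambda c: sq_err(X, y, [j for j in S if j != c]))
            loss = sq_err(X, y, [j for j in S if j != worst]) - sq_err(X, y, S)
            if loss >= nu * gain:
                break
            S.remove(worst)
    return sorted(S)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 2] - 2 * X[:, 7] + 0.01 * rng.normal(size=100)
selected = foba(X, y, 2)
print(selected)
```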
10:30 - 11:00 COFFEE BREAK

11:00 - 11:30 Jitendra Malik, University of California, Berkeley
CLASSIFICATION USING INTERSECTION KERNEL SVMs IS EFFICIENT
Straightforward classification using (non-linear) kernelized SVMs requires evaluating the kernel for a test vector and each of the support vectors. This can be prohibitively expensive, particularly in image recognition applications, where one might be searching for an object at any of a number of possible scales and locations. We show that for a large class of kernels, namely those based on histogram intersection and its variants, one can build classifiers with run-time complexity logarithmic in the number of support vectors (as opposed to linear in the standard case). By precomputing auxiliary tables, we can construct approximations with constant runtime and space requirements, independent of the number of support vectors. These improvements lead to up to 3 orders of magnitude speedup in classification time and up to 2 orders of magnitude memory savings on various classification tasks. We also obtain state-of-the-art results on pedestrian classification datasets. This is joint work with Subhransu Maji and Alexander Berg. A paper as well as source code is available at http://www.cs.berkeley.edu/~smaji/projects/fiksvm/

11:30 - 12:00 Elad Hazan, IBM Almaden Research Center
EFFICIENT ONLINE ROUTING WITH LIMITED FEEDBACK AND OPTIMIZATION IN THE DARK

12:00 - 12:30 T.S. Jayram, IBM Almaden Research Center
CASCADED AGGREGATES ON DATA STREAMS
Let A be a matrix whose rows are given by A_1, A_2, ..., A_m.
For two aggregate operators P and Q, the cascaded aggregate (P \circ Q)(A) is defined to be P(Q(A_1), Q(A_2), ..., Q(A_m)). The problem of computing such aggregates over data streams has received considerable interest recently. In this talk, I will present some recent results on computing cascaded frequency moments and norms over data streams. Joint work with David Woodruff (IBM Almaden).

12:30 - 2:30 LUNCH (on your own)

2:30 - 3:30 Tutorial: Gunnar Carlsson, Stanford University
TOPOLOGY AND DATA
Topological methods have in recent years shown themselves capable of identifying a number of qualitative geometric properties of high-dimensional data sets. In this talk, we will describe philosophical reasons why topological methods should play a role in the study of data, as well as give several examples of how the ideas play out in practice.

3:30 - 4:00 Partha Niyogi, University of Chicago
MANIFOLD REGULARIZATION AND SEMI-SUPERVISED LEARNING
Increasingly, we face machine learning problems in very high-dimensional spaces. We proceed with the intuition that although natural data lives in very high dimensions, it has relatively few degrees of freedom. One way to formalize this intuition is to model the data as lying on or near a low-dimensional manifold embedded in the high-dimensional space. This point of view leads to a new class of algorithms that are "manifold motivated" and a new set of theoretical questions that surround their analysis. A central construction in these algorithms is a graph or simplicial complex that is data-derived, and we will relate the geometry of these to the geometry of the underlying manifold. We will develop the idea of manifold regularization, its applications to semi-supervised learning, and the theoretical guarantees that may be provided in some settings.
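A minimal example in the spirit of Niyogi's talk: harmonic-function label propagation on a data-derived graph, a close relative of manifold regularization in which labeled values are clamped and the unlabeled values minimize the Laplacian quadratic form f^T L f. This is our own illustrative sketch, not the algorithm of the talk:

```python
import numpy as np

def propagate(W, labels):
    """labels: dict {vertex: +/-1}.  Returns a score for every vertex by
    solving L_uu f_u = -L_ul f_l for the unlabeled block (the harmonic
    solution on the graph with weight matrix W)."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    lab = sorted(labels)
    unl = [i for i in range(n) if i not in labels]
    f = np.zeros(n)
    f[lab] = [labels[i] for i in lab]
    f[unl] = np.linalg.solve(L[np.ix_(unl, unl)],
                             -L[np.ix_(unl, lab)] @ f[lab])
    return f

# Two chains of points joined by a weak edge, one labeled point per chain:
# the labels spread along each chain but barely cross the weak link.
W = np.zeros((6, 6))
for u, v, w in [(0, 1, 1), (1, 2, 1), (3, 4, 1), (4, 5, 1), (2, 3, 0.01)]:
    W[u, v] = W[v, u] = w
f = propagate(W, {0: 1.0, 5: -1.0})
print(np.sign(f))
```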
4:00 - 4:30 COFFEE BREAK

4:30 - 5:00 Sanjoy Dasgupta, University of California, San Diego
RANDOM PROJECTION TREES AND LOW DIMENSIONAL MANIFOLDS
The curse of dimensionality has traditionally been the bane of nonparametric statistics (histograms, kernel density estimation, nearest neighbor search, and so on), as reflected in running times and convergence rates that are exponentially bad in the dimension. This problem is all the more pressing as data sets get increasingly high-dimensional. Recently the field has been substantially rejuvenated, in part by the realization that a lot of real-world data that appears high-dimensional in fact has low "intrinsic" dimension, in the sense of lying close to a low-dimensional manifold. In the past few years, there has been huge interest in learning such manifolds from data, and then using the learned structure to transform the data into a lower-dimensional space where standard statistical methods generically work better. I'll exhibit a way to benefit from intrinsic low dimensionality without having to go to the trouble of explicitly learning its fine structure. Specifically, I'll present a simple variant of the ubiquitous k-d tree (a spatial data structure widely used in machine learning and statistics) that is provably adaptive to low-dimensional structure. We call this a "random projection tree" (RP tree). Along the way, I'll discuss different notions of intrinsic dimension -- motivated by manifolds, by local statistics, and by analysis on metric spaces -- and relate them. I'll then prove that RP trees require resources that depend only on these dimensions rather than on the dimension of the space in which the data happens to be situated. This is joint work with Yoav Freund (UC San Diego).
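The RP-tree construction from Dasgupta's talk admits a compact sketch. The simplified version below splits at the median of a random projection; the actual data structure of Dasgupta and Freund additionally uses a randomized split offset and a diameter-based rule, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)

def rp_tree(points, leaf_size=10):
    """Simplified random projection tree: at each node, project the points
    onto a random unit direction and split at the median."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    u = rng.normal(size=points.shape[1])
    u /= np.linalg.norm(u)
    proj = points @ u
    thresh = float(np.median(proj))
    left, right = points[proj <= thresh], points[proj > thresh]
    if len(left) == 0 or len(right) == 0:   # all projections equal
        return {"leaf": points}
    return {"dir": u, "thresh": thresh,
            "left": rp_tree(left, leaf_size),
            "right": rp_tree(right, leaf_size)}

X = rng.normal(size=(200, 50))
tree = rp_tree(X)
```

Unlike a k-d tree, the split directions are not axis-aligned, which is what lets the tree adapt to low intrinsic dimension.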
5:00 - 5:30 Kenneth Clarkson, IBM Almaden Research Center
TIGHTER BOUNDS FOR RANDOM PROJECTIONS OF MANIFOLDS
The Johnson-Lindenstrauss random projection lemma gives a simple way to reduce the dimensionality of a set of points while approximately preserving their pairwise Euclidean distances. The most direct application of the lemma applies to a finite set of points, but recent work has extended the technique to affine subspaces, smooth manifolds, and sets of bounded doubling dimension; in each case, a projection to a sufficiently large dimension k implies that all pairwise distances are approximately preserved with high probability. Here the case of random projection of a smooth manifold (a submanifold of R^m) is considered, and a previous analysis is sharpened, giving an upper bound for k that depends on the surface area, total absolute curvature, and a few other quantities associated with the manifold, and in particular does not depend on the ambient dimension m or the reach of the manifold.

5:30 - 6:00 Yoram Singer, Google Research, Mountain View
EFFICIENT PROJECTION ALGORITHMS FOR LEARNING SPARSE REPRESENTATIONS FROM HIGH DIMENSIONAL DATA
Many machine learning tasks can be cast as constrained optimization problems. The talk focuses on efficient algorithms for learning tasks cast as optimization problems subject to L1 and hyper-box constraints. The end results are typically sparse and accurate models. We start with an overview of existing projection algorithms onto the simplex. We then describe a linear-time projection for dense input spaces. Last, we describe a new efficient projection algorithm for very high-dimensional spaces. We demonstrate the merits of the algorithm in experiments with large-scale image and text classification.
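Among the simplex projections surveyed in Singer's talk, a standard O(n log n) sort-and-threshold method is easy to sketch (shown here as an illustration; the talk's linear-time and high-dimensional variants are more involved):

```python
import numpy as np

def project_simplex(v, z=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = z}: sort v,
    find the largest index rho with u_rho > (cumsum - z)/rho, then
    soft-threshold every coordinate by theta."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - z) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - z) / (rho + 1)
    return np.maximum(v - theta, 0.0)

w = project_simplex(np.array([0.5, 1.2, -0.3]))
print(w)  # [0.15, 0.85, 0.0]
```

Projection onto an L1 ball reduces to this routine applied to the absolute values, with the original signs restored afterwards.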
6:00 - 6:30 Arindam Banerjee, University of Minnesota, Twin Cities
BAYESIAN CO-CLUSTERING FOR DYADIC DATA ANALYSIS
In recent years, co-clustering has emerged as a powerful data mining tool that can analyze dyadic data connecting two entities. However, almost all existing co-clustering techniques are partitional and allow individual rows and columns of a data matrix to belong to only one cluster. Several current applications, such as recommendation systems, market basket analysis, and gene expression analysis, can substantially benefit from a mixed cluster assignment of rows and columns. In this talk, we present Bayesian co-clustering (BCC) models that allow mixed probabilistic assignments of rows and columns to all clusters. BCC maintains separate Dirichlet priors over the mixed assignments and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. The model is designed to naturally handle sparse matrices, as only the non-zero/non-missing entries are assumed to be generated by the model. We propose a fast variational algorithm for inference and parameter estimation in the model. In addition to finding co-cluster structure in observations, the model outputs a low-dimensional co-embedding of rows and columns and predicts missing values in the original matrix. We demonstrate the efficacy of the model through experiments on both simulated and real data.

6:30 - 8:00 RECEPTION AND POSTER SESSION (OLD UNION CLUB HOUSE)

9:00 - 10:00 Tutorial Michael I. Jordan University of California, Berkeley SUFFICIENT DIMENSION REDUCTION Consider a regression or classification problem in which the data consist of pairs (X,Y), where X is a high-dimensional vector. If we wish to find a low-dimensional subspace to represent X, one option is to ignore Y and avail ourselves of unsupervised methods such as PCA and factor analysis. In this talk, I will discuss a supervised alternative which aims to take Y into account in finding a low-dimensional representation for X, while avoiding strong assumptions about the relationship of X to Y. Specifically, the problem of "sufficient dimension reduction" (SDR) is that of finding a subspace S such that the projection of the covariate vector X onto S captures the statistical dependency of the response Y on X (Li, 1991; Cook, 1998). I will present a general overview of the SDR problem, focusing on the formulation of SDR in terms of conditional independence. I will also discuss some of the popular algorithmic approaches to SDR, particularly those based on inverse regression. Finally, I will describe a new methodology for SDR which is based on the characterization of conditional independence in terms of conditional covariance operators on reproducing kernel Hilbert spaces (a characterization of conditional independence that is of independent interest). I will show how this characterization leads to an M-estimator for S and that the estimator is consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds generally invoked for SDR methods based on inverse regression. Joint work with Kenji Fukumizu and Francis Bach.
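As background for the inverse-regression approaches mentioned in Jordan's tutorial above, here is a compact sketch of sliced inverse regression (Li, 1991): standardize X, average it within slices of sorted Y, and eigendecompose the covariance of those slice means. The function and its defaults are illustrative, not code from the tutorial.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Sliced Inverse Regression: estimate d directions of the SDR subspace."""
    n, p = X.shape
    mu = X.mean(axis=0)
    # Whiten X so estimated directions map back to the original coordinates.
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T      # cov^{-1/2}
    Z = (X - mu) @ W
    # Slice on sorted y and accumulate the covariance of slice means.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for s in np.array_split(order, n_slices):
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)
    vals, vecs = np.linalg.eigh(M)
    B = W @ vecs[:, ::-1][:, :d]                      # top-d eigenvectors, unwhitened
    return B / np.linalg.norm(B, axis=0)
```

On data generated as y = f(b'X) + noise, the leading estimated direction should align closely with b (up to sign), which makes the method easy to sanity-check on simulations.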
10:00 - 10:30 Nathan Srebro University of Chicago MORE DATA LESS WORK: SVM TRAINING IN TIME DECREASING WITH LARGER DATA SETS Traditional runtime analysis of training Support Vector Machines, and indeed most learning methods, shows how the training runtime increases as more training examples are available. Considering the true objective of training, which is to obtain a good predictor, I will argue that training time should be studied as a decreasing, or at least non-increasing, function of training set size. I will then present both theoretical and empirical results demonstrating how a simple stochastic subgradient descent approach for training SVMs indeed displays such monotonic decreasing behavior. 10:30 - 11:00 COFFEE BREAK 11:00 - 11:30 Inderjit S. Dhillon University of Texas, Austin RANK MINIMIZATION VIA ONLINE LEARNING Minimum rank problems arise frequently in machine learning applications and are notoriously difficult to solve due to the non-convex nature of the rank objective. In this talk, we will present a novel online learning approach for the problem of rank minimization of matrices over polyhedral sets. In particular, we present two online learning algorithms for rank minimization - our first algorithm is a multiplicative update method based on a generalized experts framework, while our second algorithm is a novel application of the online convex programming framework (Zinkevich, 2003). A salient feature of our online learning approach is that it allows us to give provable approximation guarantees for the rank minimization problem over polyhedral sets. We demonstrate that our methods are effective in practice through experiments on synthetic examples, and on the real-life application of low-rank kernel learning. This is joint work with Constantine Caramanis, Prateek Jain and Raghu Meka. 
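Srebro's talk above centers on stochastic subgradient training of SVMs. A minimal sketch of such a trainer on the L2-regularized hinge loss, with the standard 1/(lambda*t) step size, is given below; it is an illustrative baseline under those assumptions, not the talk's exact algorithm or experimental setup.

```python
import numpy as np

def sgd_svm(X, y, lam=0.01, n_iters=20000, seed=0):
    """Stochastic subgradient descent on
    (lam/2)*||w||^2 + mean_i max(0, 1 - y_i * <w, x_i>).
    Labels y are assumed to be +1/-1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                  # pick one random example
        eta = 1.0 / (lam * t)                # decaying step size
        w *= (1.0 - eta * lam)               # shrink from the regularizer
        if y[i] * (X[i] @ w) < 1.0:          # margin violated: hinge subgradient
            w += eta * y[i] * X[i]
    return w
```

Because each step touches a single example, the per-iteration cost is independent of the training set size, which is the property the talk's runtime argument builds on.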
11:30 - 12:00 Nir Ailon Google Research, New York EFFICIENT DIMENSION REDUCTION The Johnson-Lindenstrauss dimension reduction idea using random projections was discovered in the early 1980s. Since then many "computer science friendly" versions were published, offering constant factor but no big-Oh improvements in the runtime. Two years ago Ailon and Chazelle showed a nontrivial algorithm with the first asymptotic improvement, and suggested the question: What is the exact complexity of J-L computation from d dimensions to k dimensions? An O(d log d) upper bound is implied by A-C for k up to d^{1/3} (in the L2->L2 case). In this talk I will show how to achieve this bound for k up to d^{1/2} by combining techniques from the theories of error correcting codes and probability in Banach spaces. This is based on joint work with Edo Liberty. 12:00 - 12:30 Satyen Kale Microsoft Research, Redmond A COMBINATORIAL, PRIMAL-DUAL APPROACH TO SEMIDEFINITE PROGRAMS Algorithms based on convex optimization, especially linear and semidefinite programming, are ubiquitous in Computer Science. While there are polynomial time algorithms known to solve such problems, quite often the running time of these algorithms is very high. Designing simpler and more efficient algorithms is important for practical impact. In my talk, I will describe applications of a Lagrangian relaxation technique, the Multiplicative Weights Update method, in the design of efficient algorithms for various optimization problems. We generalize the method to the setting of symmetric matrices rather than real numbers. The new algorithm yields the first truly general, combinatorial, primal-dual method for designing efficient algorithms for semidefinite programming.
Using these techniques, we obtain significantly faster algorithms for approximating the Sparsest Cut and Balanced Separator in both directed and undirected weighted graphs to factors of O(log n) and O(sqrt{log n}), and also for the Min UnCut and Min 2CNF Deletion problems. The algorithm also has applications in quantum computing and derandomization. This is joint work with Sanjeev Arora. 12:30 - 2:30 LUNCH (BOX LUNCH PROVIDED) 2:30 - 3:00 Ravi Kannan Microsoft Research, India SPECTRAL ALGORITHMS While spectral algorithms enjoy considerable success in practice, there are not many provable results about them. In general, results are only known for the ``average case'' assuming some probabilistic model of the data. We discuss some models and results. Two sets of models have been analyzed. Gaussian mixture models yield provable results on spectral algorithms. The geometry of Gaussians and ``isoperimetry'' play a crucial role here. A second set of models, inspired by Random Graphs and Random Matrix theory, assumes total mutual independence of all entries of the input matrix. While this allows the use of the classical bounds on eigenvalues of random matrices, it is too restrictive. A Partial Independence model, where the rows are assumed to be independent vector-valued random variables, is more realistic, and results akin to those for totally independent matrices have recently been proved for it. 3:00 - 3:30 Chris Wiggins Columbia University INFERRING AND ENCODING GRAPH PARTITIONS Connections among disparate approaches to graph partitioning may be made by reinterpreting the problem as a special case of either of two more general and well-studied problems in machine learning: inferring latent variables in a generative model or estimating an (information-theoretic) optimal encoding in rate distortion theory. In either approach, setting the problem in this more general context shows how to unite and generalize a number of approaches.
As an encoding problem, we see how heuristic measures such as the normalized cut may be derived from information theoretic quantities. As an inference problem, we see how variational Bayesian methods lead naturally to an efficient and interpretable approach to identifying ``communities'' in graphs, as well as revealing the natural scale (or number of communities) via Bayesian approaches to complexity control. 3:30 - 4:00 Anna Gilbert University of Michigan, Ann Arbor COMBINATORIAL GROUP TESTING IN SIGNAL RECOVERY Traditionally, group testing is a design problem. The goal is to construct an optimally efficient set of tests of items such that the test results contain enough information to determine a small subset of items of interest. It has its roots in the statistics community and was originally designed for the Selective Service to find and to remove men with syphilis from the draft. It appears in many forms, including coin-weighing problems, experimental designs, and public health. We are interested in both the design of tests and the design of an efficient algorithm that works with the tests to determine the group of interest. Many of the same techniques that are useful for designing tests are also used to solve algorithmic problems in analyzing and in recovering statistical quantities from streaming data. I will discuss some of these methods and briefly discuss several recent applications in high throughput drug screening. 4:00 - 4:30 COFFEE BREAK 4:30 - 5:00 Lars Kai Hansen Technical University of Denmark GENERALIZATION IN HIGH-DIMENSIONAL MATRIX FACTORIZATION While the generalization performance of high-dimensional principal component analysis is quite well understood, matrix factorizations like independent component analysis, non-negative matrix factorization, and clustering are less investigated with respect to generalizability.
I will review theoretical results for PCA and heuristics used to improve PCA test performance, and discuss extensions to high-dimensional ICA, NMF, and clustering. 5:00 - 5:30 Elizabeth Purdom University of California, Berkeley DATA ANALYSIS WITH GRAPHS Graphs or networks are common ways of depicting information. In biology, for example, many different biological processes are represented by graphs, such as signalling networks or metabolic pathways. This kind of a priori information is a useful supplement to the standard numerical data coming from an experiment. Incorporating the information from these graphs into an analysis of the numerical data is a non-trivial task that is generating increasing interest. There are many results from graph theory regarding the properties of an adjacency matrix and other closely related matrices. These results suggest jointly analyzing numerical data and a graph by using the spectral decomposition of the adjacency matrix (or certain transformations of it) to find interesting features of the numerical data that also reflect the structure of the graph. The adjacency matrix representation of graphs also underlies similar kernel methods that have been used to jointly analyze graphs and data. We will briefly discuss these problems and how these ideas can be used for prediction of novel graph structure as well as graph-based classification. 5:30 - 6:00 Holly Jin LinkedIn EXPLORING SPARSE NONNEGATIVE MATRIX FACTORIZATION We explore the use of basis pursuit denoising (BPDN) for sparse nonnegative matrix factorization (sparse NMF). A matrix A is approximated by low-rank factors UDV', where U and V are sparse with unit-norm columns, and D is diagonal. We use an active-set BPDN solver with warm starts to compute the rows of U and V in turn. (Friedlander and Hatz have developed a similar formulation for both matrices and tensors.) We present computational results and discuss the benefits of sparse NMF for some real matrix applications.
This is joint work with Michael Saunders. 6:00 - 6:30 Lek-Heng Lim University of California, Berkeley RANKING VIA HODGE DECOMPOSITIONS OF GRAPHS AND SKEW-SYMMETRIC MATRICES Modern ranking data is often incomplete, unbalanced, and arises from a complex network. We will propose a method to analyze such data using combinatorial Hodge theory, which may be regarded either as an additive decomposition of a skew-symmetric matrix into three matrices with special properties or a decomposition of a weighted digraph into three orthogonal components. In this framework, ranking data is represented as a skew-symmetric matrix and Hodge-decomposed into three mutually orthogonal components corresponding to globally consistent, locally consistent, and locally inconsistent parts of the ranking data. Rank aggregation then naturally emerges as projections onto a suitable subspace and an inconsistency measure of the ranking data arises as the triangular trace distribution. This is joint work with Yuan Yao of Stanford University. 6:30 - 8:00 CLOSING RECEPTION
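Lim's talk above describes rank aggregation as a projection of a skew-symmetric pairwise-comparison matrix onto its globally consistent component. That gradient component can be sketched as a graph least-squares problem solved via the Laplacian; the sketch below assumes unit weights on compared pairs and is illustrative only, omitting the curl and harmonic parts of the full Hodge decomposition.

```python
import numpy as np

def hodge_rank(Y, W):
    """Globally consistent (gradient) component of a pairwise-comparison flow.

    Y: skew-symmetric matrix of pairwise score differences (Y[i, j] ~ s[i] - s[j]).
    W: symmetric 0/1 matrix marking which pairs were actually compared.
    Solves min_s sum_{W[i,j]=1} (s[i] - s[j] - Y[i, j])^2 via the graph
    Laplacian; s is determined only up to an additive constant.
    """
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                      # graph Laplacian of the comparison graph
    div = (W * Y).sum(axis=1)                 # divergence of the edge flow
    s = np.linalg.lstsq(L, div, rcond=None)[0]  # min-norm solution handles the null space
    return s - s.mean()                       # fix the additive constant
```

When the data are perfectly consistent (Y[i, j] = s[i] - s[j] on every compared pair), the recovered scores match the true centered scores exactly; inconsistent data are projected onto the nearest consistent ranking in the least-squares sense.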