# Talk Abstracts. Workshop on Algorithms for Modern Massive Data Sets

## Wednesday, June 25, 2008. Theme: Data analysis and data applications

9:00 - 9:45 BREAKFAST/REGISTRATION

9:45 - 10:00 Opening: Organizers

10:00 - 11:00 Tutorial:
Christos Faloutsos
Carnegie Mellon University

GRAPH MINING: LAWS, GENERATORS AND TOOLS

What do graphs look like? How do they evolve over time? How can we generate
realistic-looking graphs? We review some static and temporal 'laws', and
we describe the 'Kronecker' graph generator, which naturally matches all
of the known properties of real graphs. Moreover, we present tools for
discovering anomalies and patterns in two types of graphs, static and
time-evolving. For the former, we present the 'CenterPiece' subgraphs
(CePS), which expects $q$ query nodes (e.g., suspicious people) and finds
the node that is best connected to all $q$ of them (e.g., the mastermind
of a criminal group). We also show how to compute CenterPiece subgraphs
efficiently. For the time evolving graphs, we present tensor-based
methods, and apply them on real data, like the DBLP author-paper dataset,
where they are able to find natural research communities, and track their
evolution.
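
As a toy illustration of the "best connected to all queries" idea (not the talk's actual formulation, which is based on random walks with restarts), one can score candidate nodes by their BFS proximity to every query node; the function names, scoring rule, and example graph below are all hypothetical:

```python
from collections import deque

def bfs_dists(adj, src):
    # Single-source BFS distances over an unweighted graph.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def center_piece(adj, queries):
    # Score every non-query node by how close it is to ALL query nodes;
    # unreachable query nodes contribute zero to the score.
    dists = [bfs_dists(adj, s) for s in queries]
    best, best_score = None, -1.0
    for u in adj:
        if u in queries:
            continue
        score = sum(1.0 / (1 + d[u]) for d in dists if u in d)
        if score > best_score:
            best, best_score = u, score
    return best
```

On a small graph where node 'c' sits between query nodes 'a' and 'b', `center_piece` returns 'c'.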

Finally, we also briefly mention some results on influence and virus
propagation on real graphs, as well as on the emerging map/reduce approach
and its impact on large graph mining.

11:00 - 11:30
Deepak Agarwal
Yahoo! Research, Silicon Valley

PREDICTIVE DISCRETE LATENT MODELS FOR LARGE INCOMPLETE DYADIC DATA

We propose a novel statistical model to predict large-scale dyadic
response variables in the presence of covariate information. Our approach
simultaneously incorporates the effect of covariates and estimates local
structure that is induced by interactions among the dyads through a
discrete latent factor model.  The discovered latent factors provide a
predictive model that is both accurate and interpretable. To further
combat the sparsity that is often present in large scale dyadic data, we
enhance the discrete latent factors by coupling them with a factorization
model on the cluster effects. In fact, by regularizing the factors through
$L_{2}$ norm constraints, we are able to allow a larger number of factors
that capture dyadic interactions with greater precision compared to a pure
factorization model, especially for sparse data.  The methods are
illustrated in a generalized linear model framework and include linear
regression, logistic regression and Poisson regression as special cases.
We also provide scalable generalized EM-based algorithms for model fitting
using both "hard" and "soft" cluster assignments. We demonstrate the
generality and efficacy of our approach through large scale simulation
studies and analysis of datasets obtained from certain real-world movie
recommendation applications.

11:30 - 12:00
Chandrika Kamath
Lawrence Livermore National Laboratory

SCIENTIFIC DATA MINING: WHY IS IT DIFFICULT?

Scientific data sets, whether from computer simulations, observations, or
experiments, are not only massive, but also very complex. This makes it
challenging to find useful information in the data, a fact scientists find
disconcerting as they wonder about the science still undiscovered in the
data. In this talk, I will describe my experiences in analyzing data in a
variety of scientific domains.  Using example problems from astronomy,
fluid mixing, remote sensing, and experimental physics, I will describe
our solution approaches and discuss some of the challenges we have
encountered in mining these datasets.

12:00 - 2:00 LUNCH (on your own)

2:00 - 3:00 Tutorial
Edward Chang

CHALLENGES IN MINING LARGE-SCALE SOCIAL NETWORKS

Social networking sites such as Orkut, MySpace, Hi5, and Facebook attract
billions of visits a day, surpassing the page views of Web Search.  These
social networking sites provide applications for individuals to establish
communities, to upload and share documents/photos/videos, and to interact
with other users.  Take Orkut as an example. Orkut hosts millions of
communities, with hundreds of communities created and tens of thousands of
blogs/photos uploaded each hour.  To help users find relevant
information, it is essential to provide effective collaborative filtering
tools to perform recommendations such as friend, community, and ads
matching.

In this talk, I will first describe both the computational and storage
challenges posed to traditional collaborative filtering algorithms by the
aforementioned information explosion.  To deal with huge social graphs
that expand continuously, an effective algorithm should be designed to 1)
run on thousands of parallel machines for sharing storage and speeding up
computation, 2) perform incremental retraining and updates for attaining
online performance, and 3) fuse information of multiple sources for
alleviating information sparseness.  In the second part of the talk, I
will present algorithms we recently developed including parallel PF-Growth
[1], parallel combinational collaborative filtering [2], parallel LDA,
parallel spectral clustering, and parallel Support Vector Machines [3].

[1,2,3] Papers and some code are available at
http://infolab.stanford.edu/~echang/

3:00 - 3:30
Yahoo! Research, New York

PREDICTIVE INDEXING FOR FAST SEARCH

We tackle the computational problem of query-conditioned search. Given a
machine-learned scoring rule and a query distribution, we build a
predictive index by pre-computing lists of potential results sorted based
on an expected score of the result over future queries. The predictive
index data structure supports an anytime algorithm for approximate
retrieval of the top elements. The general approach is applicable to
general search problems, and is particularly effective in settings where
standard techniques (e.g., inverted indices) are intractable. We
experimentally find that predictive indexing performs well on applications
such as web search and approximate nearest neighbors. This work is joint
with John Langford and Alex Strehl.

3:30 - 4:00
James Demmel
University of California, Berkeley

AVOIDING COMMUNICATION IN LINEAR ALGEBRA ALGORITHMS

We survey recent results on designing numerical algorithms to minimize the
largest cost component: communication.  This could be bandwidth and
latency costs between processors over a network, or between levels of a
memory hierarchy; both costs are increasing exponentially relative to the
cost of floating point arithmetic. We describe novel algorithms in sparse
and dense linear
algebra, for both direct methods (like QR and LU) and iterative methods
that can minimize communication.
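
The flavor of communication avoidance can be illustrated with classic blocked matrix multiplication, where the block size is chosen so that three blocks fit in fast memory; the parameter `b` and the function name below are illustrative, not from the talk:

```python
def blocked_matmul(A, B, b=2):
    # Multiply n x n matrices block by block. With a fast memory of M words,
    # choosing b ~ sqrt(M/3) keeps one b x b block of each of A, B, C
    # resident at a time, reducing slow-memory traffic from O(n^3) words
    # to O(n^3 / sqrt(M)), the communication-optimal bound.
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        s = 0.0
                        for k in range(kk, min(kk + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```

The blocked loop computes exactly the same product as the naive triple loop; only the memory access pattern changes.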

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Jun Liu
Harvard University

BAYESIAN INFERENCE OF INTERACTIONS AND ASSOCIATIONS

In a regression or classification problem, one often has many potential
predictors (independent variables), and these predictors may interact with
each other to exert non-additive effects. I will present a Bayesian
approach to search for these interactions. We were motivated by the
epistasis detection problem in population-based genetic association
studies, i.e., to detect interactions among multiple genetic defects
(mutations) that may be causal to a specific complex disease. Aided with
MCMC sampling techniques, our Bayesian method can efficiently detect
interactions among many thousands of genetic markers.

A related problem is to find patterns or associations in text documents,
such as the "market basket" problem, in which one attempts to mine
association rules among the items in a supermarket based on customers'
transaction records.  For this problem, we formulate our goal as to find
subsets of words that tend to co-occur in a sentence.  We call the set of
co-occurring words (not necessarily orderly) a "theme" or a "module". I
will illustrate a simple generative model and the EM algorithm for
inferring the themes. I will demonstrate its applications in a few
examples, including an analysis of Chinese medicine prescriptions and an
analysis of a Chinese novel.

5:00 - 5:30
Fan Chung
University of California, San Diego

FOUR GRAPH PARTITIONING ALGORITHMS

We will discuss four partitioning algorithms using eigenvectors, random
walks, PageRank and their variations. In particular, we will examine local
partitioning algorithms, which find a cut near a specified starting
vertex, with a running time that depends on the size of the small side of
the cut, rather than on the size of the input graph (which can be
prohibitively large). Three of the four partitioning algorithms are local
algorithms and are particularly appropriate for applications for modern
massive data sets.
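
The flavor of a local partitioning algorithm can be sketched as a push-style approximate personalized PageRank followed by a sweep cut, in the spirit of the PageRank-based approach mentioned above; the thresholds, constants, and helper names here are simplified assumptions, not the exact published procedure:

```python
from collections import deque

def approx_ppr(adj, seed, alpha=0.15, eps=1e-4):
    # Push-style approximate personalized PageRank from a seed vertex:
    # only vertices with residual above eps * degree are ever touched,
    # so the running time depends on the cluster size, not the graph size.
    p, r = {}, {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg = len(adj[u])
        if r.get(u, 0.0) < eps * deg:
            continue
        ru = r[u]
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = 0.0
        share = (1 - alpha) * ru / deg
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
    return p

def sweep_cut(adj, p):
    # Sort touched vertices by degree-normalized PageRank and return the
    # prefix with the smallest conductance cut / min(vol, vol-complement).
    order = sorted(p, key=lambda u: p[u] / len(adj[u]), reverse=True)
    total_vol = sum(len(adj[u]) for u in adj)
    in_set, vol, cut = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for u in order:
        in_set.add(u)
        vol += len(adj[u])
        cut += sum(1 if v not in in_set else -1 for v in adj[u])
        denom = min(vol, total_vol - vol)
        if denom == 0:
            continue
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_set = phi, set(in_set)
    return best_set, best_phi
```

On two triangles joined by a single edge, seeding in one triangle recovers that triangle as the best local cut.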

5:30 - 6:00
Ronald Coifman
Yale University

DIFFUSION GEOMETRIES AND HARMONIC ANALYSIS ON DATA SETS

We discuss the emergence of self organization of data, either through
eigenvectors of affinity operators, or equivalently through a multiscale
ontology. We illustrate these ideas on images and audio files as well as
molecular dynamics.

6:00 - 8:00 WELCOME RECEPTION AT NEW GUINEA GARDEN


## Thursday, June 26, 2008. Theme: Networked data and algorithmic tools

9:00 - 10:00 Tutorial:
Milena Mihail
Georgia Institute of Technology

MODELS AND ALGORITHMS FOR COMPLEX NETWORKS,
WITH NETWORK ELEMENTS MAINTAINING CHARACTERISTIC PROFILES

We will discuss new trends in the study of models and algorithms for
complex networks. The common theme is to enhance network elements, say
nodes, with a small list of attributes describing the profile of the node
relative to other nodes of the network. Indeed, this is common practice in
real complex networks.

Standard models for complex networks have captured sparse graphs with
heavy tailed statistics and/or the small world phenomenon. We will discuss
random dot product graphs, which is a model generating networks with a
much wider range of properties. These properties include varying degree
distributions, varying graph densities, and varying spectral
decompositions. The flexibility of this model comes from representing
nodes as d-dimensional vectors (one dimension for each relevant
attribute), with edges sampled with probability given by a similarity
function over these vectors.
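
A minimal sketch of generating such a graph, assuming the simplest similarity function (the dot product itself, with profile vectors scaled so it is a valid edge probability); the names and the sampling range are illustrative:

```python
import itertools
import random

def random_dot_product_graph(n, d=2, seed=0):
    # Each node gets a d-dimensional profile vector; an edge appears
    # independently with probability equal to the dot product of the
    # endpoints' profiles (clipped to [0, 1]).
    rng = random.Random(seed)
    # Profiles drawn from [0, 1/sqrt(d)]^d, so dot products stay in [0, 1].
    vecs = [[rng.uniform(0, d ** -0.5) for _ in range(d)] for _ in range(n)]
    edges = set()
    for i, j in itertools.combinations(range(n), 2):
        prob = min(1.0, max(0.0, sum(a * b for a, b in zip(vecs[i], vecs[j]))))
        if rng.random() < prob:
            edges.add((i, j))
    return vecs, edges
```

Varying how the profile vectors are sampled changes the degree distribution, density, and spectrum of the resulting graphs.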

Concerning algorithms for complex networks, we argue that endowing local
network elements with (a computationally reasonable amount of) global
information can facilitate fundamental algorithmic primitives. For
example, we will discuss how spectral representations (in particular, a
network node being aware of its value according to a principal
eigenvector) can facilitate searching, information propagation and
information retrieval. However, computing such eigenvector components
locally in a distributed, asynchronous network is a challenging open
problem.

10:00 - 10:30
Reid Andersen
Microsoft Research, Redmond

AN ALGORITHM FOR IMPROVING GRAPH PARTITIONS

The minimum quotient cut problem is a fundamental and well-studied problem
in graph partitioning.  We present an algorithm called Improve that
improves a proposed partition of a graph, taking as input a subset of
vertices and returning a new subset of vertices with a smaller quotient
cut score.  The most powerful previously known method for improving
quotient cuts, which is based on parametric flow, returns a partition
whose quotient cut score is at least as small as any set contained within
the proposed set.  For our algorithm, we can prove a stronger guarantee:
the quotient score of the set returned is nearly as small as any set in
the graph with which the proposed set has a larger-than-expected
intersection.  The algorithm finds such a set by solving a sequence of
polynomially many S-T minimum cut problems, a sequence that cannot be cast
as a single parametric flow problem.  We demonstrate empirically that
applying Improve to the output of various graph partitioning algorithms
greatly improves the quality of cuts produced without significantly
impacting the running time.

Joint work with Kevin Lang.

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Michael W. Mahoney
Yahoo! Research, Silicon Valley

COMMUNITY STRUCTURE IN LARGE SOCIAL AND INFORMATION NETWORKS

The concept of a community is central to social network analysis, and thus
a large body of work has been devoted to identifying community structure.
For example, a community may be thought of as a set of web pages on
related topics, a set of people who share common interests, or more
generally as a set of nodes in a network more similar amongst themselves
than with the remainder of the network.  Motivated by difficulties we
experienced at actually finding meaningful communities in large real-world
networks, we have performed a large scale analysis of a wide range of
social and information networks.  Our main methodology uses local spectral
methods and involves computing isoperimetric properties of the networks at
various size scales -- a novel application of ideas from scientific
computation to internet data analysis. Our empirical results suggest a
significantly more refined picture of community structure than has been
appreciated previously. Our most striking finding is that in nearly every
network dataset we examined, we observe tight but almost trivial
communities at very small size scales, and at larger size scales, the best
possible communities gradually "blend in" with the rest of the network
and thus become less "community-like." This behavior is not explained,
even at a qualitative level, by any of the commonly-used network
generation models. Moreover, this behavior is exactly the opposite of what
one would expect based on experience with and intuition from expander
graphs, from graphs that are well-embeddable in a low-dimensional
structure, and from small social networks that have served as testbeds of
community detection algorithms. Possible mechanisms for reproducing our
empirical observations will be discussed, as will implications of these
findings for clustering, classification, and more general data analysis in
modern large social and information networks.

11:30 - 12:00
Nikhil Srivastava
Yale University

GRAPH SPARSIFICATION BY EFFECTIVE RESISTANCES

A sparsifier of a graph G is a sparse subgraph H that approximates it in
some natural or useful way. Benczur and Karger gave the first sparsifiers
in 1996, in which the weight of every cut in H was within a factor of
(1+/-\epsilon) of its weight in G. In this work, we consider a stronger
notion of approximation (introduced by Spielman and Teng in 2004) that
requires the Laplacian quadratic forms of H and G to be close --
specifically, that x^TL'x = (1+/-\epsilon) x^TLx for all vectors x in R^n,
where L and L' are the Laplacians of G and H respectively. This subsumes
the Benczur-Karger notion, which corresponds to the special case of x in
{0,1}^n. It also implies that G and H are similar spectrally, and that L'
is a good preconditioner for L.

We show that every graph contains a sparsifier with O(n log n) edges, and
that one can be found in nearly-linear time by randomly sampling each edge
of the graph with probability proportional to its effective resistance. A
key ingredient in our algorithm is a subroutine of independent interest: a
nearly-linear time algorithm that builds a data structure from which we
can query the approximate effective resistance between any two vertices in
a graph in O(log n) time.

We conjecture the existence of sparsifiers with O(n) edges, noting that
these would generalize the notion of expander graphs, which are
constant-degree sparsifiers for the complete graph.

Joint work with Dan Spielman.
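
The sampling step can be sketched as follows, assuming the effective resistances are already available (the nearly-linear-time resistance oracle from the talk is not reproduced here). Sampling each of q edges with probability proportional to w_e R_e and reweighting by 1/(q p_e) keeps the expected Laplacian of the sparsifier equal to that of the original graph:

```python
import random

def sample_sparsifier(edges, resistances, q, seed=0):
    # edges: list of (u, v, weight); resistances[e]: effective resistance
    # of edge e. Draw q edges with replacement, with probability
    # proportional to w_e * R_e, and reweight each draw by w_e / (q * p_e)
    # so the sparsifier's Laplacian is unbiased in expectation.
    rng = random.Random(seed)
    scores = [w * r for (_, _, w), r in zip(edges, resistances)]
    total = sum(scores)
    probs = [s / total for s in scores]
    sparse = {}
    for _ in range(q):
        x = rng.random()
        acc = 0.0
        for e, p in zip(edges, probs):
            acc += p
            if x < acc:
                break
        u, v, w = e
        sparse[(u, v)] = sparse.get((u, v), 0.0) + w / (q * p)
    return sparse
```

For a unit-weight triangle every edge has effective resistance 2/3, so sampling is uniform and the total reweighted edge mass equals the original total weight.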

12:00 - 12:30
Amin Saberi
Stanford University

SEQUENTIAL ALGORITHMS FOR GENERATING RANDOM GRAPHS

Random graph generation has been studied extensively as an interesting
theoretical problem. It has also become an important tool in a variety of
real world applications including detecting motifs in biological networks
and simulating networking protocols on the Internet topology. One of the
central problems in this context is to generate a uniform sample from the
set of simple graphs with a given degree sequence.

The classic method for approaching this problem is the Markov chain Monte
Carlo (MCMC) method. However, current MCMC-based algorithms have large
running times, which make them unusable for real-world networks that have
tens of thousands of nodes. This has constrained practitioners to use
simple heuristics that are non-rigorous and have often led to wrong
conclusions.

I will present a technique that leads to almost linear time algorithms for
generating random graphs for a range of degree sequences. I will also show
how this method can be extended for generating random graphs that exclude
certain subgraphs, e.g., short cycles.

Our approach is based on the sequential importance sampling (SIS)
technique, which has recently been successful for counting and sampling
graphs in practice.
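
A highly simplified sequential procedure (without the correction factor and importance weights of the actual algorithm) might look like this: it places edges one at a time with probability proportional to the product of residual degrees, and restarts if it gets stuck. All names and the restart policy are illustrative:

```python
import random

def sequential_graph(degrees, seed=0, max_tries=100):
    # Build a simple graph with the given degree sequence by adding edges
    # sequentially, never repeating an edge; restart on a dead end.
    rng = random.Random(seed)
    for _ in range(max_tries):
        res = list(degrees)   # residual degrees
        edges = set()
        ok = True
        while sum(res) > 0:
            candidates = [(i, j)
                          for i in range(len(res))
                          for j in range(i + 1, len(res))
                          if res[i] > 0 and res[j] > 0 and (i, j) not in edges]
            if not candidates:
                ok = False    # stuck: some residual degree cannot be matched
                break
            weights = [res[i] * res[j] for i, j in candidates]
            i, j = rng.choices(candidates, weights=weights)[0]
            edges.add((i, j))
            res[i] -= 1
            res[j] -= 1
        if ok:
            return edges
    raise RuntimeError("failed to realize degree sequence")
```

For the degree sequence (2, 2, 2, 2) the only simple realizations are 4-cycles, which the procedure finds after at most a few restarts.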

12:30 - 2:30 LUNCH (on your own)

2:30 - 3:00
Pankaj K. Agarwal
Duke University

MODELING AND ANALYZING MASSIVE TERRAIN DATA SETS

With recent advances in terrain-mapping technologies such as laser
altimetry (LIDAR) and ground-based laser scanning, millions of
georeferenced points can be acquired within short periods of time.
However, while acquiring and georeferencing the data has become extremely
efficient, transforming the resulting massive amounts of heterogeneous
data into useful information for different types of users and applications
is lagging behind. This is in large part because of the scarcity of
robust, efficient algorithms for terrain modeling and analysis that can
handle massive data sets acquired by different technologies, and that can
rapidly detect and predict changes in the model as new data is acquired.

This talk will review our on-going work on developing efficient algorithms
for terrain modeling and analysis that work with massive data sets. It
will focus on an I/O-efficient algorithm for computing contour maps of a
terrain. A few open questions will also be discussed.

3:00 - 3:30
Leonidas Guibas
Stanford University

DETECTION OF SYMMETRIES AND REPEATED PATTERNS IN 3D POINT CLOUD DATA

Digital models of physical shapes are becoming ubiquitous in our economy
and life. Such models are sometimes designed ab initio using CAD tools,
but more and more often they are based on existing real objects whose
shape is acquired using various 3D scanning technologies. In most
instances, the original scanner data is just a set, but a very large set,
of points sampled from the surface of the object. We are interested in
tools for understanding the local and global structure of such large-scale
scanned geometry for a variety of tasks, including model completion,
reverse engineering, shape comparison and retrieval, shape editing,
inclusion in virtual worlds and simulations, etc. This talk will present a
number of point-based techniques for discovering global structure in 3D
data sets, including partial and approximate symmetries, shared parts,
repeated patterns, etc. It is also of interest to perform such structure
discovery across multiple data sets distributed in a network, without
ever actually bringing them all to the same host.

3:30 - 4:00
Yuan Yao
Stanford University

TOPOLOGICAL METHODS FOR EXPLORING PATHWAY ANALYSIS IN COMPLEX BIOMOLECULAR FOLDING

We develop a computational approach to explore the relatively low
populated transition or intermediate states in biomolecular folding
pathways, based on a topological data analysis tool, Mapper, with
simulation data from large-scale distributed computing. Characterization
of these transient intermediate states is crucial for the description of
biomolecular folding pathways, which is however difficult in both
experiments and computer simulations. We are motivated by recent debates
over the existence of multiple intermediates even in simple systems like
RNA hairpins. With a proper design of conditional density filter functions
and clustering on their level sets, we are able to provide structural
evidence for the multiple intermediate states.  The method is effective
in dealing with a high degree of heterogeneity in the distribution, is
less sensitive to the distance metric than geometric embedding methods,
and captures structural features in multiple pathways. It is reminiscent
of classical Morse theory, efficiently capturing topological features of
high dimensional data sets.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Piotr Indyk
Massachusetts Institute of Technology

SPARSE RECOVERY USING SPARSE RANDOM MATRICES

In recent years, a new *linear* method for compressing
high-dimensional data (e.g., images) has been discovered. For any
high-dimensional vector x, its *sketch* is equal to Ax, where A is an m x
n matrix (possibly chosen at random). Although typically the sketch length
m is much smaller than the number of dimensions n, the sketch contains
enough information to recover an *approximation* to x. At the same time,
the linearity of the sketching method is very convenient for many
applications, such as data stream computing and compressed sensing.
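
A minimal sketch of a *sparse* sketching matrix in the combinatorial style (one nonzero per column, as in a count sketch); the representation and function names are illustrative assumptions:

```python
import random

def sparse_sketch_matrix(m, n, seed=0):
    # Each column of A has a single nonzero entry: +/-1 in a random row.
    # Represent column i as a (row, sign) pair.
    rng = random.Random(seed)
    return [(rng.randrange(m), rng.choice([-1, 1])) for _ in range(n)]

def apply_sketch(cols, m, x):
    # Compute y = Ax in O(n) time, thanks to the column sparsity.
    y = [0.0] * m
    for (row, sign), xi in zip(cols, x):
        y[row] += sign * xi
    return y
```

Because the sketch is linear, the sketch of a sum equals the sum of the sketches, which is exactly what makes it convenient for data stream updates.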

The major sketching approaches can be classified as either combinatorial
(using sparse sketching matrices) or geometric (using dense sketching
matrices). They achieve different trade-offs, notably between the
compression rate and the running time. Thus, it is desirable to understand
the connections between them, with the ultimate goal of obtaining the
"best of both worlds" solution.

In this talk we show that, in a sense, the combinatorial and geometric
approaches are based on different manifestations of the same phenomenon.
This connection will enable us to obtain several novel algorithms and
constructions, which inherit advantages of sparse matrices, such as lower
sketching and recovery times.

Joint work with: Radu Berinde, Anna Gilbert, Howard Karloff, Milan Ruzic
and Martin Strauss.

5:00 - 5:30
Ping Li
Cornell University

COMPRESSED COUNTING AND STABLE RANDOM PROJECTIONS

The method of stable random projections has become a popular tool for
dimension reduction, in particular for efficiently computing pairwise
distances in massive high-dimensional data matrices (including dynamic
streaming data), with many applications in data mining and machine
learning such as clustering, nearest neighbors, and kernel methods.
Closely related to stable random projections, Compressed Counting (CC)
was recently developed to efficiently compute Lp frequency moments of a
single dynamic data stream. CC exhibits a dramatic improvement over
stable random projections when p is about 1. Applications of CC include
estimating entropy moments of data streams and statistical parameter
estimation in dynamic data using low memory.
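
A small sketch of the stable-projection idea for p = 1: project with i.i.d. standard Cauchy entries (the 1-stable distribution) and estimate the L1 norm as the sample median of absolute projections, since the median of |Cauchy| is 1. This illustrates the basic principle only, not Compressed Counting itself; the estimator and names are illustrative:

```python
import math
import random
import statistics

def cauchy_projections(x, k, seed=0):
    # Project x onto k directions with i.i.d. standard Cauchy entries
    # (generated by the inverse-CDF tan trick); each projection is
    # distributed as ||x||_1 times a standard Cauchy variable.
    rng = random.Random(seed)
    return [sum(math.tan(math.pi * (rng.random() - 0.5)) * xi for xi in x)
            for _ in range(k)]

def estimate_l1(x, k=400, seed=0):
    # The sample median of |projection| estimates ||x||_1, because the
    # median of the absolute value of a standard Cauchy is exactly 1.
    return statistics.median(abs(y) for y in cauchy_projections(x, k, seed))
```

The median (rather than the mean) is essential here: Cauchy variables have no finite mean, but their sample median concentrates.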

5:30 - 6:00
Joel Tropp
California Institute of Technology

EFFICIENT ALGORITHMS FOR MATRIX COLUMN SELECTION

A deep result of Bourgain and Tzafriri states that every matrix with
unit-norm columns contains a large collection of columns that is
exceptionally well conditioned. This talk describes a randomized,
polynomial-time algorithm for producing the desired submatrix.


## Friday, June 27, 2008. Theme: Statistical, geometric, and topological methods

9:00 - 10:00 Tutorial
Jerome H. Friedman
Stanford University

FAST SPARSE REGRESSION AND CLASSIFICATION

Regularized regression and classification methods fit a linear model to
data, based on some loss criterion, subject to a constraint on the
coefficient values. As special cases, ridge-regression, the lasso, and
subset selection all use squared-error loss with different particular
constraint choices. For large problems the general choice of
loss/constraint combinations is usually limited by the computation
required to obtain the corresponding solution estimates, especially when
non convex constraints are used to induce very sparse solutions. A fast
algorithm is presented that produces solutions that closely approximate
those for any convex loss and a wide variety of convex and non convex
constraints, permitting application to very large problems. The benefits
of this generality are illustrated by examples.

10:00 - 10:30
Tong Zhang
Rutgers University

AN ADAPTIVE FORWARD/BACKWARD GREEDY ALGORITHM FOR LEARNING SPARSE REPRESENTATIONS

Consider linear least squares regression where the target function is a
sparse linear combination of a set of basis functions. We are interested
in the problem of identifying those basis functions with non-zero
coefficients and reconstructing the target function from noisy
observations. This problem is NP-hard. Two heuristics that are widely used
in practice are forward and backward greedy algorithms. First, we show
that neither idea is adequate. Second, we propose a novel combination that
is based on the forward greedy algorithm but takes backward steps
adaptively whenever necessary. We prove strong theoretical results showing
that this procedure is effective in learning sparse representations.
Experimental results support our theory.
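
A simplified sketch of a forward greedy procedure with adaptive backward steps for least squares; the thresholds and bookkeeping below are simplified guesses at the pattern described, not the paper's exact rules, and all helper names are illustrative:

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small,
    # nonsingular dense system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_residual(X, y, S):
    # Least squares restricted to the columns in S, via normal equations;
    # returns ({column: coefficient}, squared residual).
    S = sorted(S)
    n = len(y)
    G = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in S] for a in S]
    c = [sum(X[i][a] * y[i] for i in range(n)) for a in S]
    w = solve(G, c) if S else []
    res = sum((y[i] - sum(w[k] * X[i][S[k]] for k in range(len(S)))) ** 2
              for i in range(n))
    return dict(zip(S, w)), res

def foba(X, y, eps=1e-6, nu=0.5, max_features=10):
    # Forward step: add the feature that most reduces squared error.
    # Backward step: drop a feature whenever doing so costs less than
    # nu times the last forward gain (the adaptive part).
    d = len(X[0])
    S, gains = set(), []
    _, res = fit_residual(X, y, S)
    while len(S) < max_features:
        best_j, best_res = None, res
        for j in range(d):
            if j not in S:
                _, r = fit_residual(X, y, S | {j})
                if r < best_res:
                    best_j, best_res = j, r
        if best_j is None or res - best_res < eps:
            break
        S.add(best_j)
        gains.append(res - best_res)
        res = best_res
        while len(S) > 1:                      # adaptive backward steps
            drop_j, drop_res = None, float("inf")
            for j in S:
                _, r = fit_residual(X, y, S - {j})
                if r < drop_res:
                    drop_j, drop_res = j, r
            if drop_res - res < nu * gains[-1]:
                S.remove(drop_j)
                gains.pop()                    # simplified bookkeeping
                res = drop_res
            else:
                break
    return fit_residual(X, y, S)[0]
```

On a target that is exactly a sparse combination of two columns, the sketch recovers the correct support and coefficients.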

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Jitendra Malik
University of California, Berkeley

CLASSIFICATION USING INTERSECTION KERNEL SVMs IS EFFICIENT

Straightforward classification using (non-linear) kernelized SVMs requires
evaluating the kernel for a test vector and each of the support vectors.
This can be prohibitively expensive, particularly in image recognition
applications, where one might be trying to search for an object at any of
a number of possible scales and locations. We show that for a large class
of kernels, namely those based on histogram intersection and its variants,
one can build classifiers with run time complexity logarithmic in the
number of support vectors (as opposed to linear in the standard case). By
precomputing auxiliary tables, we can construct approximations with
constant runtime and space requirements, independent of the number of
SVs.  These improvements lead to up to 3 orders of magnitude speedup in
classification time and up to 2 orders of magnitude memory savings on
various classification tasks. We also obtain state-of-the-art results
on pedestrian classification datasets.

This is joint work with Subhransu Maji and Alexander Berg. A paper as well
as source code is available at
http://www.cs.berkeley.edu/~smaji/projects/fiksvm/
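
The core trick can be sketched as follows: because the intersection kernel decomposes per dimension, one can sort the support values in each dimension once and store prefix sums, after which the per-dimension sum of alpha_l * min(x_{l,i}, s) takes a single binary search. The table layout and function names below are illustrative:

```python
import bisect

def build_tables(support_vecs, alphas):
    # For each dimension i, sort the support values and store prefix sums
    # of alpha_l * x_{l,i} and of alpha_l, so that
    #   h_i(s) = sum_l alpha_l * min(x_{l,i}, s)
    # can be evaluated with one binary search instead of a full scan.
    dim = len(support_vecs[0])
    tables = []
    for i in range(dim):
        pairs = sorted((v[i], a) for v, a in zip(support_vecs, alphas))
        xs, pre_ax, pre_a = [], [], []
        sax = sa = 0.0
        for x, a in pairs:
            xs.append(x)
            sax += a * x
            sa += a
            pre_ax.append(sax)
            pre_a.append(sa)
        tables.append((xs, pre_ax, pre_a))
    return tables

def decision_value(tables, z, bias=0.0):
    # SVM decision value sum_l alpha_l * K_min(x_l, z) + bias in
    # O(dim * log m) time instead of O(dim * m).
    total = bias
    for (xs, pre_ax, pre_a), s in zip(tables, z):
        k = bisect.bisect_right(xs, s)       # support values <= s
        below = pre_ax[k - 1] if k else 0.0  # those contribute x_{l,i}
        above = pre_a[-1] - (pre_a[k - 1] if k else 0.0)
        total += below + s * above           # the rest contribute s
    return total

def naive_decision(support_vecs, alphas, z, bias=0.0):
    # Direct O(dim * m) evaluation, for checking the tables.
    return bias + sum(a * sum(min(xi, zi) for xi, zi in zip(v, z))
                      for v, a in zip(support_vecs, alphas))
```

Replacing the binary search with a fixed quantization of s is what yields the constant-time approximation mentioned in the abstract.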

11:30 - 12:00

EFFICIENT ONLINE ROUTING WITH LIMITED FEEDBACK AND OPTIMIZATION IN THE DARK


12:00 - 12:30
T.S. Jayram

Let A be a matrix whose rows are given by A_1, A_2, ..., A_m. For two
aggregate operators P and Q, the cascaded aggregate (P \circ Q)(A) is
defined to be P(Q(A_1), Q(A_2), ..., Q(A_m)). The problem of computing
such aggregates over data streams has received considerable interest
recently. In this talk, I will present some recent results on computing
cascaded frequency moments and norms over data streams.

Joint work with David Woodruff (IBM Almaden).
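
The definition can be made concrete with a tiny exact (non-streaming) reference implementation; the streaming algorithms in the talk approximate this quantity in small space. The example aggregates are one arbitrary choice:

```python
def cascaded(P, Q, rows):
    # (P \circ Q)(A) = P(Q(A_1), ..., Q(A_m)): apply the inner aggregate Q
    # to each row of A, then the outer aggregate P to the resulting vector.
    return P([Q(row) for row in rows])

# Example: the cascaded moment F_2 \circ F_0 -- count the distinct items
# in each row, then take the sum of squares across rows.
F0 = lambda row: len(set(row))
F2 = lambda vals: sum(v * v for v in vals)
```

For rows with 2, 1, and 4 distinct items respectively, the cascaded aggregate is 4 + 1 + 16 = 21.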

12:30 - 2:30 LUNCH (on your own)

2:30 - 3:30 Tutorial
Gunnar Carlsson
Stanford University

TOPOLOGY AND DATA

Topological methods have in recent years shown themselves to be capable of
identifying a number of qualitative geometric properties of high
dimensional data sets.  In this talk, we will describe philosophical
reasons why topological methods should play a role in the study of data,
as well as give several examples of how the ideas play out in practice.

3:30 - 4:00
Partha Niyogi
University of Chicago

MANIFOLD REGULARIZATION AND SEMI-SUPERVISED LEARNING

Increasingly, we face machine learning problems in very high dimensional
spaces. We proceed with the intuition that although natural data live in
very high dimensions, they have relatively few degrees of freedom. One way
to formalize this intuition is to model the data as lying on or near a low
dimensional manifold embedded in the high dimensional space. This point of
view leads to a new class of algorithms that are "manifold motivated" and
a new set of theoretical questions that surround their analysis. A central
construction in these algorithms is a graph or simplicial complex that is
data-derived and we will relate the geometry of these to the geometry of
the underlying manifold. We will develop the idea of manifold
regularization, its applications to semi-supervised learning, and the
theoretical guarantees that may be provided in some settings.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Sanjoy Dasgupta
University of California, San Diego

RANDOM PROJECTION TREES AND LOW DIMENSIONAL MANIFOLDS

The curse of dimensionality has traditionally been the bane of
nonparametric statistics (histograms, kernel density estimation, nearest
neighbor search, and so on), as reflected in running times and convergence
rates that are exponentially bad in the dimension. This problem is all the
more pressing as data sets get increasingly high dimensional. Recently the
field has been rejuvenated substantially, in part by the realization that
a lot of real-world data which appears high-dimensional in fact has low
"intrinsic" dimension in the sense of lying close to a low-dimensional
manifold. In the past few years, there has been a huge interest in
learning such manifolds from data, and then using the learned structure to
transform the data into a lower-dimensional space where standard
statistical methods generically work better.

I'll exhibit a way to benefit from intrinsic low dimensionality without
having to go to the trouble of explicitly learning its fine structure.
Specifically, I'll present a simple variant of the ubiquitous k-d tree (a
spatial data structure widely used in machine learning and statistics)
that is provably adaptive to low dimensional structure. We call this a
"random projection tree" (RP tree).

Along the way, I'll discuss different notions of intrinsic dimension --
motivated by manifolds, by local statistics, and by analysis on metric
spaces -- and relate them. I'll then prove that RP trees require resources
that depend only on these dimensions rather than the dimension of the
space in which the data happens to be situated.

This is work with Yoav Freund (UC San Diego).
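
The core splitting rule can be sketched in a few lines (the actual RP tree also perturbs the split point and uses a different split for low-diameter cells; this sketch keeps only the median split along a random direction, with illustrative names):

```python
import random

def rp_tree(points, min_size=5, rng=None):
    # Recursively split: project the points onto a random Gaussian
    # direction and cut at the median projection. Unlike a k-d tree,
    # the split direction is not axis-aligned, which is what lets the
    # tree adapt to low-dimensional structure.
    rng = rng or random.Random(0)
    if len(points) <= min_size:
        return {"leaf": points}
    d = len(points[0])
    direction = [rng.gauss(0, 1) for _ in range(d)]
    proj = [sum(a * b for a, b in zip(p, direction)) for p in points]
    cut = sorted(proj)[len(proj) // 2]
    left = [p for p, t in zip(points, proj) if t < cut]
    right = [p for p, t in zip(points, proj) if t >= cut]
    if not left or not right:   # degenerate split (e.g. many ties)
        return {"leaf": points}
    return {"direction": direction, "cut": cut,
            "left": rp_tree(left, min_size, rng),
            "right": rp_tree(right, min_size, rng)}

def count_leaves(node):
    if "leaf" in node:
        return len(node["leaf"])
    return count_leaves(node["left"]) + count_leaves(node["right"])
```

Every point lands in exactly one leaf, so the leaf sizes always sum to the input size.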

5:00 - 5:30
Kenneth Clarkson

TIGHTER BOUNDS FOR RANDOM PROJECTIONS OF MANIFOLDS

The Johnson-Lindenstrauss random projection lemma gives a simple way to
reduce the dimensionality of a set of points while approximately
preserving their pairwise Euclidean distances.  The most direct
application of the lemma applies to a finite set of points, but recent
work has extended the technique to affine subspaces, smooth manifolds, and
sets of bounded doubling dimension; in each case, a projection to a
sufficiently large dimension k implies that all pairwise distances are
approximately preserved with high probability. Here the case of random
projection of a smooth manifold (submanifold of R^m) is considered, and a
previous analysis is sharpened, giving an upper bound for k that depends
on the surface area, total absolute curvature, and a few other quantities
associated with the manifold, and in particular does not depend on the
ambient dimension m or the reach of the manifold.
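
A minimal sketch of the finite-point case of the lemma, using a scaled Gaussian projection matrix; the choice of k and the test points are arbitrary illustrations:

```python
import math
import random

def jl_project(points, k, seed=0):
    # Multiply each point by a k x d Gaussian matrix scaled by 1/sqrt(k).
    # By the Johnson-Lindenstrauss lemma, all pairwise distances among n
    # points are preserved up to (1 +/- eps) with high probability once
    # k = O(log n / eps^2), independent of the ambient dimension d.
    rng = random.Random(seed)
    d = len(points[0])
    A = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]
    return [[sum(a * x for a, x in zip(row, p)) for row in A]
            for p in points]
```

With k = 500 the projected distance between two 20-dimensional points stays close to the original distance.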

5:30 - 6:00
Yoram Singer

EFFICIENT PROJECTION ALGORITHMS FOR LEARNING SPARSE REPRESENTATIONS
FROM HIGH DIMENSIONAL DATA

Many machine learning tasks can be cast as constrained optimization
problems. The talk focuses on efficient algorithms for learning tasks
which are cast as optimization problems subject to L1 and hyper-box
constraints. The end results are typically sparse and accurate models. We
first describe a linear time projection for dense input spaces. Last, we
describe a new efficient projection algorithm for very high dimensional
spaces. We demonstrate the merits of the algorithm in experiments with
large scale image and text classification.
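
As one concrete example of such a projection, the well-known sort-and-threshold algorithm for Euclidean projection onto the L1 ball can be sketched as follows (a standard construction offered for illustration, not necessarily the talk's new high-dimensional algorithm):

```python
def project_l1(v, z=1.0):
    # Euclidean projection of v onto the L1 ball of radius z:
    # soft-threshold the entries by the largest theta for which the
    # shrunk vector has L1 norm exactly z.
    if sum(abs(x) for x in v) <= z:
        return list(v)          # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        css += ui
        if ui > (css - z) / i:  # index i is still in the support
            theta = (css - z) / i
    return [max(abs(x) - theta, 0.0) * (1 if x >= 0 else -1) for x in v]
```

Projecting [3, 1, -2] onto the L1 ball of radius 2 shrinks it to [1.5, 0, -0.5], zeroing the smallest coordinate, which is exactly the sparsity effect the talk exploits.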

6:00 - 6:30
Arindam Banerjee
University of Minnesota, Twin Cities

BAYESIAN CO-CLUSTERING FOR DYADIC DATA ANALYSIS

In recent years, co-clustering has emerged as a powerful data mining tool
that can analyze dyadic data connecting two entities. However, almost all
existing co-clustering techniques are partitional, and allow individual
rows and columns of a data matrix to belong to only one cluster. Several
current applications, such as recommendation systems, market basket
analysis, and gene expression analysis, can substantially benefit from a
mixed cluster assignment of rows and columns. In this talk, we present
Bayesian co-clustering (BCC) models that allow mixed probabilistic
assignments of rows and columns to all clusters. BCC maintains separate
Dirichlet priors over the mixed assignments and assumes each observation
to be generated by an exponential family distribution corresponding to its
row and column clusters. The model is designed to naturally handle sparse
matrices as only the non-zero/non-missing entries are assumed to be
generated by the model. We propose a fast variational algorithm for
inference and parameter estimation in the model. In addition to finding
co-cluster structure in observations, the model outputs a low dimensional
co-embedding of rows and columns, and predicts missing values in the
original matrix. We demonstrate the efficacy of the model through
experiments on both simulated and real data.

6:30 - 8:00 RECEPTION AND POSTER SESSION (OLD UNION CLUB HOUSE)


## Saturday, June 28, 2008. Theme: Machine learning and dimensionality reduction

9:00 - 10:00 Tutorial
Michael I. Jordan
University of California, Berkeley

SUFFICIENT DIMENSION REDUCTION

Consider a regression or classification problem in which the data consist
of pairs (X,Y), where X is a high-dimensional vector. If we wish to find a
low-dimensional subspace to represent X, one option is to ignore Y and
avail ourselves of unsupervised methods such as PCA and factor analysis.  In
this talk, I will discuss a supervised alternative which aims to take Y
into account in finding a low-dimensional representation for X, while
avoiding making strong assumptions about the relationship of X to Y.
Specifically, the problem of "sufficient dimension reduction" (SDR) is
that of finding a subspace S such that the projection of the covariate
vector X onto S captures the statistical dependency of the response Y on X
(Li, 1991; Cook, 1998).  I will present a general overview of the SDR
problem, focusing on the formulation of SDR in terms of conditional
independence.  I will also discuss some of the popular algorithmic
approaches to SDR, particularly those based on inverse regression.
Finally, I will describe a new methodology for SDR which is based on the
characterization of conditional independence in terms of conditional
covariance operators on reproducing kernel Hilbert spaces (a
characterization of conditional independence that is of independent
interest). I show how this characterization leads to an M-estimator for S
and I show that the estimator is consistent under weak conditions; in
particular, we do not have to impose linearity or ellipticity conditions
of the kinds that are generally invoked for SDR methods based on inverse
regression.

Joint work with Kenji Fukumizu and Francis Bach.

10:00 - 10:30
Nathan Srebro
University of Chicago

MORE DATA LESS WORK: SVM TRAINING IN TIME DECREASING WITH LARGER DATA SETS

Traditional runtime analysis of training Support Vector Machines, and
indeed most learning methods, shows how the training runtime increases as
more training examples are available.  Considering the true objective of
training, which is to obtain a good predictor, I will argue that training
time should be studied as a decreasing, or at least non-increasing,
function of training set size.  I will then present both theoretical and
empirical results demonstrating how a simple stochastic subgradient
descent approach for training SVMs indeed displays such monotonic
decreasing behavior.
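The stochastic subgradient approach can be illustrated with a
Pegasos-style sketch (a toy version under stated assumptions: the step
schedule, regularization parameter lam, and the synthetic data below are
illustrative, not the talk's exact algorithm or experiments):

```python
import numpy as np

def sgd_svm(X, y, lam=0.1, epochs=20, seed=0):
    """Stochastic subgradient descent on the L2-regularized hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decreasing step size
            margin = y[i] * (X[i] @ w)
            # subgradient of lam/2 ||w||^2 + max(0, 1 - y w.x) at one example
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            w -= eta * grad
    return w

# toy linearly separable data, symmetric about the origin (no bias needed)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = sgd_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

The relevant point for the talk is that the number of updates needed for a
fixed predictive accuracy does not grow with n, since each update touches
a single random example.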

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Inderjit S. Dhillon
University of Texas, Austin

RANK MINIMIZATION VIA ONLINE LEARNING

Minimum rank problems arise frequently in machine learning applications
and are notoriously difficult to solve due to the non-convex nature of the
rank objective. In this talk, we will present a novel online learning
approach for the problem of rank minimization of matrices over polyhedral
sets. In particular, we present two online learning algorithms for rank
minimization - our first algorithm is a multiplicative update method based
on a generalized experts framework, while our second algorithm is a novel
application of the online convex programming framework (Zinkevich, 2003).
A salient feature of our online learning approach is that it allows us to
give provable approximation guarantees for the rank minimization problem
over polyhedral sets. We demonstrate that our methods are effective in
practice through experiments on synthetic examples, and on the real-life
application of low-rank kernel learning.

This is joint work with Constantine Caramanis, Prateek Jain and Raghu
Meka.

11:30 - 12:00
Nir Ailon

EFFICIENT DIMENSION REDUCTION

The Johnson-Lindenstrauss dimension reduction idea using random
projections was discovered in the early 80's.  Since then many "computer
science friendly" versions were published, offering constant factor but no
big-Oh improvements in the runtime.  Two years ago Ailon and Chazelle
showed a nontrivial algorithm with the first asymptotic improvement, and
suggested the question:  What is the exact complexity of J-L computation
from d dimensions to k dimensions?  An O(d log d) upper bound is implied
by A-C for k up to d^{1/3} (in the L2->L2 case).  In this talk I will show
how to achieve this bound for k up to d^{1/2} by combining techniques from
the theories of error correcting codes and probability in Banach spaces.
This is based on joint work with Edo Liberty.
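The flavor of such fast transforms can be sketched as a sign-flip /
Hadamard / subsampling pipeline (a simplified illustration only; the
actual Ailon-Chazelle and Ailon-Liberty constructions apply further
randomization on top of the Hadamard mixing step):

```python
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform; len(a) must be a power of 2."""
    a = a.astype(float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

rng = np.random.default_rng(0)
d, k = 256, 64

x = rng.standard_normal(d)
D = rng.choice([-1.0, 1.0], size=d)            # random sign flips
Hx = fwht(D * x) / np.sqrt(d)                  # orthonormal mixing, O(d log d)
coords = rng.choice(d, size=k, replace=False)  # uniform coordinate sampling
y = np.sqrt(d / k) * Hx[coords]                # rescale so E||y||^2 = ||x||^2

err = abs(np.linalg.norm(y) - np.linalg.norm(x)) / np.linalg.norm(x)
```

The sign flips spread the energy of x across all coordinates, which is
what makes plain coordinate sampling safe afterwards.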

12:00 - 12:30
Satyen Kale
Microsoft Research, Redmond

A COMBINATORIAL, PRIMAL-DUAL APPROACH TO SEMIDEFINITE PROGRAMS

Algorithms based on convex optimization, especially linear and
semidefinite programming, are ubiquitous in Computer Science. While there
are polynomial time algorithms known to solve such problems, quite often
the running time of these algorithms is very high. Designing simpler and
more efficient algorithms is important for practical impact.

In my talk, I will describe applications of a Lagrangian relaxation
technique, the Multiplicative Weights Update method in the design of
efficient algorithms for various optimization problems. We generalize the
method to the setting of symmetric matrices rather than real numbers. The
new algorithm yields the first truly general, combinatorial, primal-dual
method for designing efficient algorithms for semidefinite programming.
Using these techniques, we obtain significantly faster algorithms for
approximating the Sparsest Cut and Balanced Separator in both directed and
undirected weighted graphs to factors of O(log n) and O(sqrt{log n}), and
also for the Min UnCut and Min 2CNF Deletion problems. The algorithm also
has applications in quantum computing and derandomization.

This is joint work with Sanjeev Arora.
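As context for the matrix generalization, the scalar experts-style
Multiplicative Weights Update can be sketched as follows (a minimal
illustration; the loss matrix and learning rate eta are hypothetical, and
the talk's contribution replaces real-number weights with symmetric
matrices):

```python
import numpy as np

def multiplicative_weights(losses, eta=0.1):
    """Multiplicative Weights Update over n experts.

    losses: T x n array of per-round losses in [0, 1].
    Returns the algorithm's total expected loss when playing the
    normalized weight distribution each round.
    """
    T, n = losses.shape
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        p = w / w.sum()                 # play the normalized weights
        total += p @ losses[t]
        w *= (1.0 - eta) ** losses[t]   # multiplicatively penalize lossy experts
    return total

# two experts over 100 rounds: expert 0 always right, expert 1 always wrong
losses = np.tile([0.0, 1.0], (100, 1))
alg = multiplicative_weights(losses)
best = losses[:, 0].sum()               # best expert's total loss is 0
```

The guarantee is that the algorithm's loss exceeds the best expert's by at
most an additive O(log(n)/eta) term, here roughly ln(2)/0.1.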

12:30 - 2:30 LUNCH (BOX LUNCH PROVIDED)

2:30 - 3:00
Ravi Kannan
Microsoft Research, India

SPECTRAL ALGORITHMS

While spectral algorithms enjoy considerable success in practice, there
are not many provable results about them. In general, results are only
known for the 'average case', assuming some probabilistic model of the
data. We discuss some models and results. Two sets of models have been
analyzed. Gaussian mixture models yield provable results on spectral
algorithms; the geometry of Gaussians and 'isoperimetry' play a crucial
role here. A second set of models, inspired by random graph and random
matrix theory, assumes total mutual independence of all entries of the
input matrix. While this allows the use of classical bounds on the
eigenvalues of random matrices, it is too restrictive. A Partial
Independence model, in which the rows are assumed to be independent
vector-valued random variables, is more realistic, and results akin to
those for totally independent matrices have recently been proved for it.

3:00 - 3:30
Chris Wiggins
Columbia University

INFERRING AND ENCODING GRAPH PARTITIONS

Connections among disparate approaches to graph partitioning may be made
by reinterpreting the problem as a special case of one of either of two
more general and well-studied problems in machine learning: inferring
latent variables in a generative model or estimating an
(information-theoretic) optimal encoding in rate distortion theory. In
either case, placing the problem in this more general context shows how
to unite and generalize a number of approaches. As an encoding problem, we see how
heuristic measures such as the normalized cut may be derived from
information theoretic quantities. As an inference problem, we see how
variational Bayesian methods lead naturally to an efficient and
interpretable approach to identifying "communities" in graphs as well as
revealing the natural scale (or number of communities) via Bayesian
approaches to complexity control.

3:30 - 4:00
Anna Gilbert
University of Michigan, Ann Arbor

COMBINATORIAL GROUP TESTING IN SIGNAL RECOVERY

Traditionally, group testing is a design problem. The goal is to construct
an optimally efficient set of tests of items such that the test results
contain enough information to determine a small subset of items of
interest. It has its roots in the statistics community and was originally
designed for the Selective Service to find and to remove men with syphilis
from the draft. It appears in many forms, including coin-weighing
problems, experimental designs, and public health. We are interested in
both the design of tests and the design of an efficient algorithm that
works with the tests to determine the group of interest. Many of the same
techniques that are useful for designing tests are also used to solve
algorithmic problems in analyzing and in recovering statistical quantities
from streaming data. I will discuss some of these methods and briefly
discuss several recent applications in high throughput drug screening.
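A toy sketch of non-adaptive group testing with the standard COMP-style
decoder may make the setup concrete (the pool design, parameters, and
decoder here are illustrative baselines, not any specific scheme from the
talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 100, 40                      # items and (non-adaptive) tests
defectives = {3, 57}                # hidden items of interest

# random pooling design: each item joins each test independently w.p. 0.1
pools = rng.random((t, n)) < 0.1

# a test is positive iff its pool contains at least one defective item
positive = np.array([any(j in defectives for j in np.flatnonzero(pools[i]))
                     for i in range(t)])

# COMP-style decoding: any item appearing in a negative test is certainly
# clean; whatever survives is declared (possibly) defective
candidates = set(range(n))
for i in range(t):
    if not positive[i]:
        candidates -= set(np.flatnonzero(pools[i]).tolist())
```

By construction the true defectives always survive decoding; the design
question is how few tests suffice to also eliminate nearly all innocent
items.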

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Lars Kai Hansen
Technical University of Denmark

GENERALIZATION IN HIGH-DIMENSIONAL MATRIX FACTORIZATION

While the generalization performance of high-dimensional principal
component analysis is quite well understood, matrix factorizations like
independent component analysis, non-negative matrix factorization, and
clustering are less investigated for generalizability. I will review
theoretical results for PCA and heuristics used to improve PCA test
performance, and discuss extensions to high-dimensional ICA, NMF, and
clustering.

5:00 - 5:30
Elizabeth Purdom
University of California, Berkeley

DATA ANALYSIS WITH GRAPHS

Graphs or networks are common ways of depicting information. In biology,
for example, many different biological processes are represented by
graphs, such as signalling networks or metabolic pathways. This kind of a
priori information is a useful supplement to the standard numerical data
coming from an experiment. Incorporating the information from these graphs
into an analysis of the numerical data is a non-trivial task that is
generating increasing interest.

There are many results from graph theory regarding the properties of an
adjacency matrix and other closely related matrices. These results suggest
jointly analyzing numerical data and a graph by using the spectral
decomposition of the adjacency matrix (or certain transformations of it)
to find interesting features of the numerical data that also reflect the
structure of the graph. The adjacency matrix representation of graphs also
underlies similar kernel methods that have been used to jointly analyze
graphs and data.

We will briefly discuss these problems and how these ideas can be used
for prediction of novel graph structure as well as for graph-based
classification.
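The spectral idea can be sketched in a few lines using the graph Laplacian,
one standard transformation of the adjacency matrix (a minimal toy graph,
not drawn from the talk's biological applications):

```python
import numpy as np

# adjacency matrix of a small graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# unnormalized graph Laplacian L = D - A
L = np.diag(A.sum(axis=1)) - A

# the eigenvector of the second-smallest eigenvalue (the Fiedler vector)
# captures the coarsest structure: its sign pattern splits the two triangles
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]
side = fiedler > 0
```

Numerical data living on the nodes can then be analyzed in the coordinates
given by the leading eigenvectors, so that extracted features respect the
graph structure.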

5:30 - 6:00
Holly Jin

EXPLORING SPARSE NONNEGATIVE MATRIX FACTORIZATION

We explore the use of basis pursuit denoising (BPDN) for sparse
nonnegative matrix factorization (sparse NMF).  A matrix A is approximated
by low-rank factors UDV', where U and V are sparse with unit-norm columns,
and D is diagonal.  We use an active-set BPDN solver with warm starts to
compute the rows of U and V in turn. (Friedlander and Hatz have developed
a similar formulation for both matrices and tensors.)  We present
computational results and discuss the benefits of sparse NMF for some real
matrix applications.

This is joint work with Michael Saunders.

6:00 - 6:30
Lek-Heng Lim
University of California, Berkeley

RANKING VIA HODGE DECOMPOSITIONS OF GRAPHS AND SKEW-SYMMETRIC MATRICES

Modern ranking data is often incomplete, unbalanced, and arises from a
complex network. We will propose a method to analyze such data using
combinatorial Hodge theory, which may be regarded either as an additive
decomposition of a skew-symmetric matrix into three matrices with special
properties or a decomposition of a weighted digraph into three orthogonal
components. In this framework, ranking data is represented as a
skew-symmetric matrix and Hodge-decomposed into three mutually orthogonal
components corresponding to globally consistent, locally consistent, and
locally inconsistent parts of the ranking data. Rank aggregation then
naturally emerges as projections onto a suitable subspace and an
inconsistency measure of the ranking data arises as the triangular trace
distribution.

This is joint work with Yuan Yao of Stanford University.
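A toy sketch of the globally consistent (gradient) component may help fix
ideas: with all pairs observed, the least-squares global rating reduces to
row means of the skew-symmetric matrix (an illustrative special case; the
full decomposition also separates the locally consistent and locally
inconsistent parts, and the incomplete-data setting requires a graph
Laplacian solve):

```python
import numpy as np

# skew-symmetric matrix of pairwise score differences Y[i, j] ~ s_i - s_j,
# built here from a hidden rating plus one deliberately inconsistent entry
true_s = np.array([3.0, 2.0, 1.0, 0.0])
Y = true_s[:, None] - true_s[None, :]     # globally consistent part
Y[0, 1] += 0.5
Y[1, 0] -= 0.5                            # perturb while keeping skew-symmetry

# gradient component: the rating s minimizing sum_{i,j} (Y[i,j] - (s_i - s_j))^2;
# over the complete graph this is just the row mean (up to an additive constant)
s = Y.mean(axis=1)
ranking = np.argsort(-s)

# the residual is the inconsistent part of the ranking data
R = Y - (s[:, None] - s[None, :])
```

Rank aggregation is then the projection Y -> s, and the size of the
residual R measures how far the data is from any single consistent
ranking.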

6:30 - 8:00 CLOSING RECEPTION