## Algorithms, Incentives, and Optimization Techniques for Social Data

### Overview

In recent years, individuals have become mass producers of content, generating articles, blogs, opinions, reviews, and recommendations, in a decentralized manner. This content is then discovered and consumed by other users, and any centralized review is rendered infeasible by the sheer magnitude of the available content. Consequently, there is a need to utilize user feedback, both explicit and implicit, in order to provide optimum rankings and recommendations to users on the Internet. The same broad problem occurs in determining appropriate advertisements for a given search keyword, automatic moderation of discussion boards, and automated deductions of user preference on social networks.

User activity data on the Internet tends to be very large, with hundreds of millions of users performing billions of actions every day. At the same time, this data is also very sparse, since each user only performs a very small share of possible actions on the Internet (eg. searches for a small fraction of possible keywords, reviews or purchases a small fraction of all products, etc.). Our goal is to design algorithms and optimization techniques to optimally utilize such data. We treat the sparse data as a prior belief'' on user preferences. We are also designing economic incentives to obtain useful and rich data, robust to manipulation. Further, we are interested in the field of reputation and recommendation systems, broadly defined.

This project is funded primarily by collaborative NSF grants 0904314 and 0904325. Research related to this project is often discussed at the RAIN seminar (Research on Algorithms for the INternet).

### Project Participants

PIs: Ashish Goel (Stanford) and Sanjeev Khanna (Penn)

Senior Personnel: N. Shivakumar (Stanford)

Graduate Students: THEORY: Anand Bhalgat, Tanmoy Chakraborty, Pranav Dandekar, Zhiyi Huang, Michael Kapralov, Aleksandra Korolova, David Lee, Ian Post, Sudeepa Roy. TOOLS: Noah Burbank, Eli Marschner, Eric Mibuari.

### Research Activities

#### Credit Networks

Credit networks are a mechanism for conducting trusted transactions in a decentralized network which is robust to adversarial infiltration. We studied whether this robustness can be achieved without any loss in network capacity. In particular, we have developed a simulator and a theoretical framework to analyze the long term success rate of transactions in a credit network. Ongoing work includes issues of rationality in credit networks, and applying credit networks to the problem of proxy circumvention.

Liquidity in Credit Networks: A Little Trust Goes a Long Way.
P. Dandekar, A. Goel, R. Govindan, and I. Post.
Preliminary version presented at Proc. Workshop on the Economics of Networks, Systems, and Computation (NetEcon) (NetEcon), 2010. Full version appeared in ACM Conference on Electronic Commerse (EC), 2011.

Strategic Formation of Credit Networks: Preliminary Report.
Pranav Dandekar, Bryce Wiedenbeck, Ashish Goel and Michael Wellman.
Preliminary report appeared in TADA 2011. Send email to Ashish Goel if you want to receive a copy.

#### Fast Algorithms for Birkhoff von-Neumann Decomposition

The Birkhoff von-Neumann (BVN) decomposition theorem says that any doubly stochastic matrix can be expressed as a convex combination of permutation matrices (perfect matchings). A key subroutine in many applications is to efficiently recover one or more perfect matchings in the support of the BVN decomposition. We have designed a randomized algorithm that finds a perfect matchings in the decomposition in $O(n \log^2 n)$ time using $O(m)$ pre-processing time; here $n$ denotes the number of rows/columns in the matrix and $m$ denotes the number of non-zero entries. More generally, our algorithm can find $k$ perfect matchings in the decomposition for any integer $k$ in $O(k*(n \log^2 n))$ time.

Our approach also yields an algorithm for finding a perfect matching in a regular bipartite graph that runs in $O(n \log n)$ time, a well-studied special case of the BVN decomposition. This improves upon the previous best running time of $O(n^{1.5})$.

Perfect matchings in O(n log n) time in regular bipartite graphs.
A. Goel, M. Kapralov, and S. Khanna.
Proc. 42nd ACM Symposium on Theory of Computing (STOC), 2010.

#### Processing Large Scale Social Network Data

Social networks such as Facebook and Twitter process large amounts of data; however this data also has substantial structure. For instance, many aspects of the evolution of these networks can be accurately described using generative models such as preferential attachment. We have shown how this structure can be used to compute PageRank in evolving social networks much more efficiently than believed before, using simple sampling techniques. Ongoing work includes applying similar techniques to the problem of mining term co-occurrences in a data stream, and develop algorithms for efficient personalized search.

Title: Fast Incremental and Personalized PageRank
B. Bahmani, A. Chowdhury, and A. Goel.
Appeared in VLDB 2011.

#### Privacy Issues in Microtargeted Ads

As we use increasingly sophisticated tools (including the ones being developed in this project) to target ads to users, it is also important to keep privacy considerations in mind. One of our graduate students is exploring this connection; her work has generated significant press coverage including this New York Times article and has also received the 2011 Award for Outstanding Research in Privacy Enhancing Technologies.

Privacy Violations Using Microtargeted Ads: A Case Study,
A. Korolova.

Personalized Social Recommendations - Accurate or Private?
A. Machanavajjhala, A. Korolova, and A. D. Sarma
Appeared in VLDB, 2011.

#### Optimization and Querying with Uncertain Data

In the stochastic knapsack problem, we are given a set of items each associated with a probability distribution on sizes and a profit, and a knapsack of unit capacity. The size of an item is revealed as soon as it is inserted into the knapsack, and the goal is to design a policy that maximizes the expected profit of items that are successfully inserted into the knapsack. We have designed a relaxed PTAS (Polynomial Time Approximation Scheme) that computes for any t > 0, a (1+t)-approximate adaptive policy provided the knapsack capacity is also slightly relaxed to 1+t. We have also designed a (8/3 + t)-approximate adaptive policy for any t > 0 without relaxing the knapsack capacity; this improves upon an earlier (3+t)-approximation.

Improved Approximation Results for Stochastic Knapsack Problems
A. Bhalgat, A. Goel, and S. Khanna.
Proc. 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011.

In more recent work, we have further improved the (8/3 + t)-approximation to 2+t.

A $(2+\epsilon)$-Approximation Algorithm for the Stochastic Knapsack Problem

Anand Bhalgat
Under submission

Many applications use probabilistic databases as a way to model uncertain and imprecise data. For instance, a probabilistic relational database associates a probability value with each tuple, indicating the likelihood of its presence. We initiated a study of computational tractability of answering queries on probabilistic databases that use the difference operation.

Queries with Difference on Probabilistic Databases
S. Khanna, S. Roy, and V. Tannen.
37th International Conference on Very Large Data Bases (VLDB), 2011.

#### Widescope

Under development; please see the beta version.
Telescopes and microscopes help you zoom in and focus on a specific spot. The goal of this project is to help users and societies see the bigger picture: focus not just on individual variables and constraints but also on the overall objective function and feasibility. We provide users with data to help apply the widescope philosophy to the problem of reducing federal and state budget deficits, and we are developing social networking primitives and systems to facilitate users to tackle these problems in a collaborative fashion. We are also developing general purpose theoretical constructs and systems to allow an application of the widescope philosophy to other important problems.

A Wall Street Journal Report about out work.

#### Connections with Information Theory

Compression is a fundamental goal of both human language and digital communication, yet natural language is very different from compression schemes employed by modern computers. We partly explain this difference using the fact that information theory generally assumes a common prior probability distribution shared by the encoder and decoder, whereas human communication has to be robust to the fact that a speaker and listener may have different prior beliefs about what a speaker may say. We model this information-theoretically using the following question: what type of compression scheme would be effective when the encoder and decoder have (boundedly) different prior probability distributions. The resulting compression scheme resembles natural language to a far greater extent than existing digital communication protocols.

Compression Without a Common Prior: An Information-theoretic Justification for Ambiguity in Language
B. Juba, A. Kalai, S. Khanna, and M. Sudan
2nd Symposium on Innovations in Computer Science (ICS), 2011.

Any physical channel of communication offers two potential reasons why its capacity (the number of bits it can transmit in a unit of time) might be unbounded: (1) (Uncountably) infinitely many choices of signal strength at any given instant of time, and (2) (Uncountably) infinitely many instances of time at which signals may be sent. However channel noise cancels out the potential unboundedness of the first aspect, leaving typical channels with only a finite capacity per instant of time. We show the surprising result that under a natural model of noise and delay errors, this latter source of infinity is indeed realizable, and the capacity of the associated physical channel is infinite.

Delays and the Capacity of Continuous-time Channels
S. Khanna and M. Sudan.
To appear in 52nd IEEE Symposium on Foundations of Computer (FOCS), 2011.

#### Attention and Matching Markets

We study social welfare in one-sided matching markets where the goal is to efficiently allocate items to agents that each have a complete, private preference list and a unit demand over the items. Our focus is on allocation mechanisms that do not involve any monetary payments. We analyze two natural matching mechanisms, namely, the random serial dictatorship (RSD) mechanism and the the probabilistic serial (PS) mechanism, and give first non-trivial performance guarantees on social welfare achieved by these mechanisms.

Social Welfare in One-Sided Matching Markets without Money
A. Bhalgat, D. Chakrabarty, and S. Khanna.
14th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX) , 2011.

We model the economics of producing content in online social networks such as Facebook and Twitter using an attention market . We treat attention as a mechanism for sharing the profit from consuming information and introduce two main utility models: incremental utility and Shapley utility. We prove a bi-criteria bound on the price of anarchy; in particular we show that the social welfare from central control over level of contribution by users is no larger than the social welfare from selfish agents with twice as large consumption utilities.

Under submission

#### Sorting with Restricted Comparisons

We study the generalized sorting problem where we are given a set of n elements to be sorted but only a subset of all possible pairwise element comparisons is allowed. For instance, a user may only be able to compare content that the user directly knows. The goal is to determine the sorted order using the smallest possible number of allowed comparisons. To our knowledge, no bound better than the trivial O(n^2) bound was known thus far. We design an algorithm that solves the generalized sorting problem using roughly O(n^{1.5}) comparisons.

Algorithms for the Generalized Sorting Problem
Z. Huang, S. Kannan, and S. Khanna.
To appear in 52nd IEEE Symposium on Foundations of Computer (FOCS), 2011.