Privacy against inference attacks: From theory to practice
From theory... We propose a general statistical inference framework to capture the privacy threat under inference attacks, i.e., the threat faced by a user who wishes to release data correlated with his private data to a passive but curious adversary, subject to utility constraints. We show that applying this general framework to the setting where the adversary uses the self-information cost function naturally leads to a non-asymptotic, information-theoretic approach for characterizing the privacy-utility trade-off. We introduce two privacy metrics, namely average and maximum information leakage, and prove that under both metrics the resulting design problem, that of finding the optimal mapping from the user's data to a privacy-preserving output, can be formulated as a convex program. We compare our framework with differential privacy.
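To make the average information leakage metric concrete, the sketch below computes it for a toy discrete example: given a prior p(S, X) linking a private attribute S to the data X, and a candidate privacy mapping p(Y|X) from X to the released output Y, the average leakage is the mutual information I(S; Y). The specific distributions are invented for illustration; only the general framework, not these numbers, comes from the paper.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits for a joint distribution p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])))

# Toy prior p(S, X): a binary private attribute S correlated with the data X.
p_sx = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

# Candidate privacy mapping p(Y|X): row x is the distribution of the
# released output Y given X = x (rows sum to 1).
p_y_given_x = np.array([[0.8, 0.2],
                        [0.3, 0.7]])

# Joint p(S, Y) = sum_x p(S, x) p(Y | x); average leakage is I(S; Y).
p_sy = p_sx @ p_y_given_x
leakage = mutual_information(p_sy)       # average information leakage, in bits
baseline = mutual_information(p_sx)      # leakage if X were released unmodified
```

A randomized mapping strictly reduces the leakage below the baseline I(S; X); the convex program described above would search over all such mappings p(Y|X) for the one minimizing leakage subject to a distortion (utility) constraint.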
To practice... We focus on some challenges encountered by this framework when applied to real-world data. On the one hand, the design of optimal privacy-preserving mechanisms requires knowledge of the prior distribution linking the private data and the data to be released, which is often unavailable in practice. On the other hand, the optimization may become intractable and face scalability issues when the data takes values in a large alphabet or is high-dimensional. Our work makes three major contributions. First, we provide bounds on the impact of a mismatched prior on the privacy-utility trade-off. Second, we show how to reduce the size of the optimization by introducing a quantization step, and how to generate privacy mappings under quantization. Third, we evaluate our methods on several datasets, and demonstrate that good privacy properties can be achieved with limited distortion, so as not to undermine the original purpose of the publicly released data, e.g., recommendations.
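One minimal way to realize the quantization step, under assumptions of our own (the paper does not prescribe this particular scheme), is to partition a large or continuous data alphabet into a small number of quantile cells and design the privacy mapping over the cell indices instead of the raw values. The synthetic Gaussian data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-resolution data X with an effectively large alphabet.
x = rng.normal(size=10_000)

# Quantization step: map each sample of X to one of K cells delimited by
# empirical quantiles, so the privacy mapping p(Y|X) only needs to be
# optimized over K symbols rather than the full alphabet.
K = 8
edges = np.quantile(x, np.linspace(0, 1, K + 1)[1:-1])  # K - 1 interior cuts
cells = np.digitize(x, edges)                           # cell index in {0, ..., K-1}

# Empirical prior over the quantized symbols, used as input when designing
# the privacy mapping on the reduced alphabet.
p_q = np.bincount(cells, minlength=K) / cells.size
```

Quantile cells give a roughly uniform prior over the K symbols, which keeps every row of the reduced-alphabet optimization well conditioned; the mapping designed on the K cells is then applied to raw data by first quantizing it.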