Privacy-Preserving Data Mining Rebecca Wright Computer Science Department Stevens Institute of Technology www.cs.stevens.edu/~rwright PORTIA Site Visit 12 May, 2005

The Data Revolution • The current data revolution is fueled by the perceived, actual, and potential usefulness of the data. • Most electronic and physical activities leave some kind of data trail. These trails can provide useful information to various parties. • However, there are also concerns about appropriate handling and use of sensitive information. • Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility.

Advantages of Privacy Protection • protection of personal information • protection of proprietary or sensitive information • enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information) • compliance with legislative policies

Overview • Introduction • Primitives • Higher-level protocols – Distributed data mining – Publishable data – Coping with massiveness – Beyond privacy-preserving data mining • Implementation and experimentation • Lessons learned, conclusions

Models for Distributed Data Mining, I • Horizontally Partitioned: parties P1, P2, P3 each hold some of the records (rows) of a common schema. [diagram: a table split across parties by rows] • Vertically Partitioned: parties P1, P2 each hold some of the attributes (columns) for the same set of records. [diagram: a table split across parties by columns]
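A toy illustration of the two partitioning models; the records and attribute names here are hypothetical:

```python
# Toy illustration of the two partitioning models for a table of records.
records = [
    {"id": 1, "age": 34, "zip": "07030", "diagnosis": "A"},
    {"id": 2, "age": 51, "zip": "07030", "diagnosis": "B"},
    {"id": 3, "age": 29, "zip": "10027", "diagnosis": "A"},
]

# Horizontal partitioning: each party holds some of the rows (all attributes).
p1_rows, p2_rows = records[:2], records[2:]

# Vertical partitioning: each party holds some of the attributes (all rows).
p1_cols = [{k: r[k] for k in ("id", "age")} for r in records]
p2_cols = [{k: r[k] for k in ("id", "zip", "diagnosis")} for r in records]
```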

Models for Distributed Data Mining, II • Fully Distributed: parties P1, P2, P3, …, Pn-1, Pn; each holds a database. [diagram: n parties, each with its own database] • Client/Server(s): a CLIENT wishes to compute on the SERVER(S)' data. [diagram: a client querying one or more servers]

Cryptography vs. Randomization • Cryptographic approach: suffers inefficiency. • Randomization approach: suffers privacy loss and inaccuracy. [diagram: trade-off between the two approaches]
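The randomization side of this trade-off can be made concrete with randomized response, a standard textbook construction (not any specific protocol cited here): each respondent reports truthfully only with probability p, and the analyst unbiases the aggregate. A smaller p means more privacy but also more noise:

```python
import random

def randomized_response(bit, p=0.75):
    # report the true bit with probability p, else a uniformly random bit
    return bit if random.random() < p else random.randrange(2)

def estimate_fraction(reports, p=0.75):
    # E[report] = p * f + (1 - p) / 2, so invert that relation to recover f
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) / 2) / p
```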

Secure Multiparty Computation • Allows n players P1, P2, …, Pn to privately compute a function f of their inputs. [diagram: n parties jointly computing f] • Overhead is polynomial in the size of the inputs and the complexity of f [Yao86, GMW87, BGW88, CCD88, ...] • In theory, this can solve any private distributed data mining problem. In practice, it is not efficient for large data.
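For a minimal flavor of multiparty computation, here is an n-party secure sum built from additive secret sharing, a standard construction rather than any specific cited protocol:

```python
import random

M = 2**61 - 1  # public modulus, assumed large enough to hold the true sum

def make_shares(value, n):
    # split `value` into n random shares that sum to it modulo M
    shares = [random.randrange(M) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % M)
    return shares

def secure_sum(inputs):
    n = len(inputs)
    # share[i][j] is the share player i would send privately to player j
    share = [make_shares(x, n) for x in inputs]
    # each player j announces only the sum of the shares it received
    announced = [sum(share[i][j] for i in range(n)) % M for j in range(n)]
    return sum(announced) % M

print(secure_sum([10, 20, 30]))  # 60; no single player's input is revealed
```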

Primitives for PPDM • Common tools include secret sharing, homomorphic encryption, secure scalar product, secure set intersection, secure sums, and other statistics. • PORTIA work: – [BGN05]: homomorphic encryption of 2-DNF formulas (arbitrary additions, one multiplication), based on bilinear maps. (P) – [AMP04]: Medians, kth ranked element. (P) – [FNP04]: set intersection and cardinality.
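To show how an additively homomorphic scheme yields a secure scalar product, here is a toy Paillier-style sketch. The key sizes are deliberately tiny and insecure, and the function names and the share-producing variant are illustrative assumptions, not the exact protocol of any cited paper:

```python
import random
from math import gcd

def keygen(p=293, q=433):
    # toy primes for illustration only; real keys need large primes
    n = p * q
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    return n, (n, lam, pow(lam, -1, n))

def encrypt(n, m):
    n2 = n * n
    r = random.randrange(1, n)
    while gcd(r, n) != 1:                         # noise must be coprime to n
        r = random.randrange(1, n)
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(sk, c):
    n, lam, mu = sk
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def scalar_product_shares(n, enc_a, b):
    # Bob, holding vector b and Alice's encrypted vector enc_a, returns a
    # ciphertext of <a, b> - r and keeps r: the parties get additive shares.
    n2 = n * n
    c = 1
    for e, bi in zip(enc_a, b):
        c = (c * pow(e, bi, n2)) % n2             # E(a_i)^b_i accumulates E(sum a_i*b_i)
    r = random.randrange(n)
    return (c * encrypt(n, -r)) % n2, r

pk, sk = keygen()
enc_a = [encrypt(pk, ai) for ai in [3, 1, 4]]
c, r = scalar_product_shares(pk, enc_a, [2, 7, 1])
print((decrypt(sk, c) + r) % pk)                  # 3*2 + 1*7 + 4*1 = 17
```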

Higher-Level Protocols • [LP00]: private protocols for ln x and x ln x • Various protocols to search remote, encrypted, or access-controlled data (e.g., for keywords or items in common): [BBA03(P), Goh03, FNP04, BCOP04, ABG+05(P), EFH#] • [YZW05]: frequency mining protocol. (P)

Data Mining Models • [WY04,YW05]: privacy-preserving construction of Bayesian networks from vertically partitioned data. • [YZW05]: classification from frequency mining in the fully distributed model (naïve Bayes classification, decision trees, and association rule mining; see the sketch after this list). (P) • [JW#]: privacy-preserving k-means clustering for arbitrarily partitioned data. (In the vertically partitioned case, similar to the two-party [VC03].) • [AST05]: privacy-preserving computation of multidimensional aggregates on vertically or horizontally partitioned data using randomization.
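To see why a frequency-mining primitive suffices for naïve Bayes, note that the classifier needs only aggregate counts. A toy, non-private sketch with illustrative names (not the [YZW05] protocol itself):

```python
import math
from collections import defaultdict

def train_counts(records):
    # records: list of (features_dict, label); only frequencies are kept
    class_counts = defaultdict(int)
    joint_counts = defaultdict(int)   # (label, attribute, value) -> count
    for features, label in records:
        class_counts[label] += 1
        for attr, value in features.items():
            joint_counts[(label, attr, value)] += 1
    return class_counts, joint_counts

def classify(features, class_counts, joint_counts):
    total = sum(class_counts.values())
    best_label, best_lp = None, float("-inf")
    for label, n in class_counts.items():
        lp = math.log(n / total)
        for attr, value in features.items():
            # Laplace smoothing avoids log(0) for unseen (attribute, value) pairs
            lp += math.log((joint_counts[(label, attr, value)] + 1) / (n + 2))
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```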

Privacy-Preserving Bayesian Networks [WY04,YW05] Goal: Alice, holding DBA, and Bob, holding DBB, cooperatively learn the Bayesian network structure on the combination of DBA and DBB, ideally without either party learning anything except the Bayesian network itself. [diagram: Alice with DBA, Bob with DBB]

K2 Algorithm for BN Learning • Determining the best BN structure for a given data set is NP-hard, so heuristics are used in practice. • The K2 algorithm [CH92] is a widely used BN structure-learning algorithm, which we use as the starting point for our solution. • It considers nodes in sequence and, for each node, greedily adds the new parent that most increases a score function f, up to a maximum number of parents per node.
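A non-private sketch of that greedy per-node search; the names score, order, and max_parents are illustrative:

```python
def k2_parents(i, order, score, max_parents):
    # greedy parent selection for node i, given K2's fixed node ordering
    parents = set()
    candidates = set(order[:order.index(i)])   # only earlier nodes may be parents
    best = score(i, parents)
    while len(parents) < max_parents and candidates:
        # find the single new parent that most increases the score
        gain, new_parent = max(((score(i, parents | {c}), c) for c in candidates),
                               key=lambda t: t[0])
        if gain <= best:
            break                              # no candidate improves the score
        parents.add(new_parent)
        candidates.remove(new_parent)
        best = gain
    return parents
```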

$f(i, \pi(i)) = \prod \frac{\alpha_0!\,\alpha_1!}{(\alpha_0 + \alpha_1 + 1)!}$, where the product ranges over the instantiations of the parent set $\pi(i)$.

Our Solution: Approximate Score Modified score function: approximates the same relative ordering, and lends itself well to private computation. • Apply natural log to f and use Stirling’s approximation • Drop constant factor and bounded term. Result is:

$g(i, \pi(i)) = \sum \left( \tfrac{1}{2}(\ln \alpha_0 + \ln \alpha_1 - \ln t) + \alpha_0 \ln \alpha_0 + \alpha_1 \ln \alpha_1 - t \ln t \right)$, where $t = \alpha_0 + \alpha_1 + 1$ and the sum ranges over the instantiations of $\pi(i)$.
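A small sketch comparing the exact log-score ln f with the approximation g above; counts are assumed positive, as Stirling's approximation requires:

```python
import math

def exact_log_score(counts):
    # counts: one (a0, a1) pair per instantiation of the parent set
    s = 0.0
    for a0, a1 in counts:
        t = a0 + a1 + 1
        s += math.lgamma(a0 + 1) + math.lgamma(a1 + 1) - math.lgamma(t + 1)
    return s

def approx_score(counts):
    # the Stirling-based approximation g from the slide (constants dropped)
    s = 0.0
    for a0, a1 in counts:
        t = a0 + a1 + 1
        s += 0.5 * (math.log(a0) + math.log(a1) - math.log(t))
        s += a0 * math.log(a0) + a1 * math.log(a1) - t * math.log(t)
    return s

# The two scores should usually rank candidate parent sets the same way, e.g.
# comparing exact_log_score vs approx_score on [(30, 10), (5, 25)] and
# [(18, 22), (20, 20)].
```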

Our Solution: Components Sub-protocols used: • Privacy-preserving scalar product protocol: based on homomorphic encryption • Privacy-preserving computation of α-parameters: uses the scalar product • Privacy-preserving score computation: uses the α-parameters and the [LP00] protocols for ln x and x ln x • Privacy-preserving score comparison: uses [Yao86]

All intermediate values (scores and parameters) are protected using secret sharing. [YW05] improves on [MSK04] for parameter computation.

Overview • Introduction • Primitives • Higher-level protocols – Distributed data mining – Publishable data – Coping with massiveness – Beyond privacy-preserving data mining • Implementation and experimentation • Lessons learned, conclusions

Publishable Data • Goal: Modify data before publishing so that the result provides both good privacy and good utility. – Some situations favor one more than the other. – This may prevent some things from being learned at all.

• [DN04]: Extends privacy definitions of [EGS03,DN03] relating a priori and a posteriori knowledge, and provides solutions in a moderated publishing model. • [CDMSW04]: provide quantifiable definitions of privacy and utility. One’s privacy is guaranteed to the extent that one does not stand out from others.

Publishable Data: k-Anonymity • Modify the database before publishing so that the quasi-identifier of every record is identical to that of at least k – 1 other records [Swe02, MW04] (a minimal checker is sketched below).

• [AFK+05]: optimal k-anonymization is NP-hard even if the data values are ternary. Presents efficient approximation algorithms for k-anonymization. (P) • [ZYW05]: in two formulations, present solutions for a data publisher to learn a k-anonymized version of a fully distributed database without learning anything else. (P)
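A minimal checker for the definition above; the attribute names are illustrative:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    # group records by their quasi-identifier; every group must have size >= k
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

# e.g. is_k_anonymous(published, ("age", "zip"), k=2)
```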

Coping with Massiveness • Data mining on massive data sets is an important field in its own right. • It is also privacy-relevant, because: – Massive data sets are likely to be distributed and multiply owned. – Efficiency improvements are needed in order to have any hope of adding overhead for privacy. • [FKMSZ05]: Stream algorithms for massive graphs. (P) • [DKM04]: Approximate massive-matrix computations. (P)
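For a minimal flavor of the streaming model (not the [FKMSZ05] algorithms themselves): one pass over an edge stream, keeping only a small uniform sample via reservoir sampling:

```python
import random

def reservoir_sample_edges(edge_stream, size):
    # one pass, O(size) memory; each edge is kept with probability size/(i+1)
    sample = []
    for i, edge in enumerate(edge_stream):
        if i < size:
            sample.append(edge)
        else:
            j = random.randrange(i + 1)
            if j < size:
                sample[j] = edge
    return sample
```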

Beyond Privacy-Preserving Data Mining Enforce policies about what kinds of queries or computations on data are allowed. • [JW#]: Extends the private inference control of [WS04] to work with more complex query functions. The client learns the query result if and only if the inference rule is satisfied (and learns nothing else). • [KMN05]: Simulatable auditing to ensure that query denials do not leak information. (P) • [ABG+04]: P4P: Paranoid Platform for Privacy Preferences. A mechanism for ensuring released data is usable only for allowed tasks. (P)
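To make the auditing idea concrete, here is a toy auditor in the spirit of simulatable auditing, a hedged illustration rather than the [KMN05] construction: the decision to deny a sum query depends only on the queries asked so far, never on the data, so denials themselves cannot leak information. For sum queries, an individual's value becomes determined exactly when a unit vector lies in the span of the answered queries' indicator vectors:

```python
from fractions import Fraction

def in_span(rows, target):
    # Gaussian elimination over the rationals: is `target` in span(rows)?
    rows = [list(map(Fraction, r)) for r in rows]
    target = list(map(Fraction, target))
    for col in range(len(target)):
        pivot = next((r for r in rows if r[col] != 0), None)
        if pivot is None:
            continue
        rows.remove(pivot)
        for r in rows + [target]:
            if r[col] != 0:
                f = r[col] / pivot[col]
                for c in range(len(target)):
                    r[c] -= f * pivot[c]
    return all(x == 0 for x in target)

def audit(answered, new_query, num_individuals):
    # deny if answering would let some unit vector enter the span;
    # note the decision uses only the query vectors, never the data
    candidate = answered + [new_query]
    for i in range(num_individuals):
        unit = [1 if j == i else 0 for j in range(num_individuals)]
        if in_span(candidate, unit):
            return "deny"
    return "answer"
```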

Overview • Introduction • Primitives • Higher-level protocols – Distributed data mining – Publishable data – Coping with massiveness – Beyond privacy-preserving data mining • Implementation and experimentation • Lessons learned, conclusions

Implementation and Experimentation • secure scalar product protocol [SWY04] • MySQL private information retrieval (PIR) [BBFS#] • Fairplay: a system implementing Yao's two-party secure function evaluation [MNPS04] • Bayesian network implementation [KRFW#] (D) • secure computation of surveys using Fairplay, for use with the Taulbee survey [FPRS04] (P,D)

Survey Software [FPRS04] (P,D) • User-friendly, open-source, free implementation using Fairplay [MNPS04], suitable for use with CRA’s Taulbee salary survey. Not adopted. • CRA’s reasons: – Need for data cleaning, multiyear comparisons, unanticipated use – “Perhaps most member departments will trust us.”

• Provost Offices’ reasons: – No legal basis for using this privacy-preserving protocol on data that we otherwise don’t disclose – Correctness and security claims are hard and expensive to assess, despite open-source implementation. – All-or-none adoption by Ivy+ peer group. Can’t make decision unilaterally.

Future Directions in Experimentation • Combine these and others into a general-purpose privacy-preserving data mining experimental platform. Useful for: – fast prototyping of new protocols – efficiency and accuracy comparisons of different approaches

• Experiment with real data and real uses. – need to find a user community that has explicitly expressed interest, and that could potentially accomplish something via PPDM that it currently cannot accomplish. – [Scha04]: genetics researchers may form such a community

Other Future Directions • Preprocessing of data for PPDM. • Privacy-preserving data solutions that use both randomization and cryptography in order to gain some of the advantages of both. • Policies for privacy-preserving data mining: languages, reconciliation, and enforcement. • Incentive-compatible privacy-preserving data mining.

Conclusions • Increasing use of computers and networks has led to a proliferation of sensitive data. • Without proper precautions, this data could be misused. • Many technologies exist for supporting proper data handling, but much work remains, and some barriers must be overcome in order for them to be deployed. • Cryptography is a useful component, but not the whole solution. • Technology, policy, and education must work together.