You are what you say: Privacy risks of public mentions
![Page 1: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/1.jpg)
Natural Language Processing Lab, National Taiwan University
You are what you say: Privacy risks of public mentions
Dan Frankowski et al., University of Minnesota
SIGIR 2006
Presenter: Chun-Yuan Teng
![Page 2: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/2.jpg)
Motivation
• “Public data” + “Private data” + “IR Algorithm” = Privacy risk
![Page 3: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/3.jpg)
Example of privacy risk
• Privacy risk: Link datasets with overlapping users
• “blog” + “purchase history” = “someone”
• Ex: 吳若權 (Wu Ruo-Quan, a Taiwanese writer) or 紫微斗數 (Zi Wei Dou Shu, a form of Chinese astrology)
![Page 4: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/4.jpg)
Examples of privacy encroachment
• People are judged by their preferences
• Ratings + mentions of porn in a forum?
![Page 5: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/5.jpg)
Research questions
• Risks of dataset release
– What are the risks to user privacy when releasing a dataset?
• Altering the dataset
– How can dataset owners alter the dataset they release to preserve user privacy?
• Self defense
– How can users protect their own privacy?
![Page 6: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/6.jpg)
Experimental setup
• Ratings
– Large
– 140K users: max 6K ratings, average 90, median 33
– 9K movies: max 49K ratings, average 1,403, median 207
– 12.6M ratings
• Forum mentions
– Small
– 133 forum posters
– 1,685 different movies
– 3,828 movie mentions
![Page 7: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/7.jpg)
RQ1: Risks of dataset release
• How do we evaluate the risks?
• Which algorithms are risky?
![Page 8: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/8.jpg)
K-anonymity & K-identification
• K-anonymity (In Cryptography)
– Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in data cannot be distinguished from k-1 individuals in the data”
• K-identification
– K-identification is a measure of how well an algorithm can narrow each user in a dataset to one of k users in another dataset
![Page 9: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/9.jpg)
K-identification (cont.)
• We know that target user t is in the ratings data, too
• t is k-identified if t is at position k or higher on the likely list.
• In the paper, k = 1, 5, 10, 100. We’ll talk about 1-identification, because it’s the scariest.
• Likely list
– u1, s1
– u2, s2
– u3, s3 (t)
– u4, s4
– …
• Above, t is 3-identified, also 4-identified, 5-identified, etc., but NOT 2-identified
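The k-identification check on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the paper: `is_k_identified` and `likely_list` are hypothetical names, the list is assumed to be sorted from most to least likely, the scores are placeholders, and ties between scores are ignored.

```python
def is_k_identified(likely_list, target, k):
    """Return True if `target` sits at position k or higher (1-based)."""
    top_k_users = [user for user, _score in likely_list[:k]]
    return target in top_k_users


# Likely list from the slide as (user, score) pairs, best match first.
# The scores are made-up placeholders; u3 plays the role of target t.
likely_list = [("u1", 0.9), ("u2", 0.7), ("u3", 0.5), ("u4", 0.2)]

for k in (1, 2, 3, 4):
    print(k, is_k_identified(likely_list, "u3", k))
```

With the target in third place, the check fails for k = 1 and k = 2 and succeeds from k = 3 onward, which matches the last bullet above.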
![Page 10: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/10.jpg)
An observation of data
[Figure: Number of ratings of an item by percentile. x-axis: Item percentile (0% to 100%); y-axis: Number of ratings (0 to 60,000)]
• Rarely rated items may be good indicators
![Page 11: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/11.jpg)
Algorithms to identify users
• Set Intersection algorithm
• TF-IDF algorithm
• Scoring algorithm
• Scoring algorithm with ratings
![Page 12: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/12.jpg)
Set Intersection algorithm
![Page 13: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/13.jpg)
Set Intersection algorithm
• Find users who rated EVERY movie the target user mentioned
– They all have the same likeliness score
• Ignore rating value entirely
• RESULT: 1-identification rate: 7%
• MEANING: 7% of the time there was one user at the top of the likely list, and it was the target user
• Room for improvement
– For a target user with many mentions, there may be no user who rated all of them
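A minimal Python sketch of the set intersection idea follows. The data layout (a dict from user id to the set of movies that user rated) and the function names are assumptions made for illustration; the slides do not show an implementation.

```python
def set_intersection_candidates(ratings, mentions):
    """Return users who rated every mentioned movie (rating values ignored)."""
    mentioned = set(mentions)
    return sorted(user for user, rated in ratings.items()
                  if mentioned <= set(rated))


def is_one_identified(ratings, mentions, target):
    """True only when exactly one candidate remains and it is the target."""
    return set_intersection_candidates(ratings, mentions) == [target]


# Hypothetical ratings data: user -> set of rated movie ids.
ratings = {
    "u1": {"A", "B", "C", "D"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
}

print(set_intersection_candidates(ratings, ["A", "B"]))       # two candidates
print(set_intersection_candidates(ratings, ["A", "B", "C"]))  # one candidate
print(set_intersection_candidates(ratings, ["A", "E"]))       # no candidate
```

The last call illustrates the weakness noted above: a single mention that nobody rated empties the candidate set.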
![Page 14: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/14.jpg)
TF-IDF algorithm
• Score each user by similarity to the target user. Score more highly if
– User has rated more mentions of target
– User has rated mentions of rarely rated movies
• For us: “word” is a movie, “document” (bag of words) is a user
• Score is cosine similarity to the target user
• RESULTS: 1-identification rate of 20% (compared to 7% from Set Intersection)
• Room for improvement
– Over-weights any mention for a ratings user who rated few movies
– High-scoring users have 4 ratings and 1 mention
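The TF-IDF approach can be sketched as below. This is one possible reading, under stated assumptions: term frequency is binary (a movie is either rated/mentioned or not), IDF is `log(n_users / times_rated)`, and all function names and the sample data are hypothetical. The exact weighting used in the paper may differ.

```python
import math


def idf_weights(ratings):
    """Inverse document frequency per movie: rarely rated movies weigh more."""
    n_users = len(ratings)
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    return {movie: math.log(n_users / c) for movie, c in counts.items()}


def tfidf_vector(movies, idf):
    """Binary term frequency times IDF; movies nobody rated are skipped."""
    return {m: idf[m] for m in movies if m in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[m] for m, w in a.items() if m in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likely_list(ratings, mentions):
    """Rank every ratings user by cosine similarity to the target's mentions."""
    idf = idf_weights(ratings)
    target_vec = tfidf_vector(set(mentions), idf)
    scores = [(user, cosine(tfidf_vector(rated, idf), target_vec))
              for user, rated in ratings.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: "P" is rated by everyone, "A" by a single user.
ratings = {
    "u1": {"A", "P"},
    "u2": {"P", "Q"},
    "u3": {"P", "Q", "R"},
    "u4": {"P"},
}
ranked = likely_list(ratings, ["A", "P"])
print(ranked[0][0])
```

In this toy data the movie everyone rated gets an IDF of zero, so only the rarely rated mention "A" separates the users.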
![Page 15: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/15.jpg)
Scoring algorithm
• Emphasizes mentions of rarely-rated movies, de-emphasizes number of ratings a user has
• A user who has rated a mention is 10-20 times more likely to be the target user than one who has not
![Page 16: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/16.jpg)
Examples
• Example
– Target user t mentioned A, B, C, rated 20, 500, 1000 times (from 10,000 users)
– User u1 rated A, user u2 rated B, C
• u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
• u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
• u2 more likely to be target t
• Rating a mention is good, rare even better
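The arithmetic in this example can be reproduced with a short Python sketch. The slides do not state the sub-score formula, so the one used here, `1 - (times_rated - 1) / n_users` for a mention the user rated and a flat 0.05 otherwise, is an assumption inferred from the factors 0.9981 and 0.9001. Under that assumption the factors 0.9981, 0.9501, and 0.9001 correspond to movies rated 20, 500, and 1,000 times among 10,000 users. All names are hypothetical.

```python
NOT_RATED_SUBSCORE = 0.05  # factor the slide uses for an unrated mention


def subscore(user_rated_it, times_rated, n_users):
    """Assumed sub-score: near 1 when the user rated a rarely rated mention."""
    if not user_rated_it:
        return NOT_RATED_SUBSCORE
    return 1 - (times_rated - 1) / n_users


def score(user_movies, mention_counts, n_users):
    """Multiply the sub-scores over every movie the target mentioned."""
    result = 1.0
    for movie, times_rated in mention_counts.items():
        result *= subscore(movie in user_movies, times_rated, n_users)
    return result


# Mentioned movie -> number of users who rated it (assumed counts, see above).
mention_counts = {"A": 20, "B": 500, "C": 1000}
n_users = 10_000

u1 = score({"A"}, mention_counts, n_users)       # rated only A
u2 = score({"B", "C"}, mention_counts, n_users)  # rated B and C
print(round(u1, 4), round(u2, 3))  # prints: 0.0025 0.043
```

Because every unrated mention multiplies the score by 0.05, a user who rated two fairly common mentions outranks a user who rated one very rare mention.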
![Page 17: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/17.jpg)
Scoring algorithm with rating
• Same as the Scoring algorithm above
• Adds a threshold on the rating value, so that ratings are used as a feature
![Page 18: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/18.jpg)
Percent of k-identified
![Page 19: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/19.jpg)
RQ2: Altering the dataset
• Perturbation: change rating values
– Rating values are not needed for identification, so this does not help
• Generalization: group items
– Dataset becomes less useful
• Suppression: hide data
– Used in the following experiments
![Page 20: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/20.jpg)
RQ2: Altering the dataset
• We won’t modify forum data
– Users wouldn’t like it. Focus on ratings data
• Rarely-rated items are identifying
• IDEA: Release a ratings dataset suppressing all “rarely-rated” items
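Suppressing rarely rated items before release might look like the following Python sketch. The cutoff `min_ratings`, the function name, and the sample data are hypothetical; the slides do not say which threshold defines "rarely-rated".

```python
from collections import Counter


def suppress_rare_items(ratings, min_ratings):
    """Drop every rating of an item rated by fewer than `min_ratings` users."""
    counts = Counter(movie for rated in ratings.values() for movie in rated)
    keep = {movie for movie, c in counts.items() if c >= min_ratings}
    return {user: {m: r for m, r in rated.items() if m in keep}
            for user, rated in ratings.items()}


# Hypothetical ratings: user -> {movie: star rating}.
ratings = {
    "u1": {"A": 4, "P": 5},
    "u2": {"P": 3, "Q": 2},
    "u3": {"P": 4, "Q": 5},
}
released = suppress_rare_items(ratings, min_ratings=2)
print(released)
```

Here movie "A", rated by a single user, disappears from the released data while "P" and "Q" survive.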
![Page 21: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/21.jpg)
RQ2: Altering the dataset
![Page 22: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/22.jpg)
RQ3: Self Defense
• The question is how users can protect their own privacy
• Suppression: suppress mentions of rarely-rated movies
– May not be accepted by users
• Misdirection
![Page 23: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/23.jpg)
Suppression
• Not significant if more than 20%
![Page 24: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/24.jpg)
Misdirection
• Mentioning popular items is more effective
• When a popular item is mentioned, more users’ scores increase
![Page 25: You are what you say: Privacy risks of public mentions](https://reader038.fdocuments.net/reader038/viewer/2022103006/568137e3550346895d9f900a/html5/thumbnails/25.jpg)
Conclusion
• A new problem in IR
– Interesting and hard
• Hard to preserve privacy
– You need to suppress a large amount of data