05/04/07
Using Active Learning to Label Large Email Corpora
Ted Markowitz
Pace University CSIS DPS & IBM T. J. Watson Research Ctr.
Quick History
• Email-related research suggested by Dr. Chuck Tappert’s work with MS student, Ian Stuart
• Decided to approach IBM Research’s SpamGuru anti-spam group for joint research
• Started P/T onsite at IBM in 11/05
• Dr. Richard Segal of IBM Research generously agreed to act as adjunct advisor
Research Motivation
• Assumption: Ongoing training and testing of anti-spam tools require large, fresh databases–corpora–of labeled (spam vs. good) messages
• Problem: How do we accurately label large numbers of examples, potentially millions, without manually examining every one?
Building Email Corpora
• Accurate training & testing of anti-spam tools require:
– truly random, i.e., unbiased, samples
– sufficient # of examples to measure low (< 0.1%) error rates
– reasonable distributions of spam vs. good mail
– examples which represent the target operating environment
• However, most existing email testing corpora are:
– Rather small (just a few thousand messages)
– Very narrowly focused in type and content
– Aging rapidly and growing more and more stale over time
Building Email Corpora (cont.)
• Email and spam are constantly evolving
• Building large, current and diverse bodies of examples is time-consuming and expensive
• Result: Just a few–relatively small and aging–email corpora are used over and over again
One Potential Approach
• Machine Learning (ML) methods can help to build corpus labelers which learn how to label
• Research in semi-supervised learning (SSL) has shown it’s possible to accurately learn by bootstrapping, i.e., using relatively few labeled examples and lots of unlabeled examples
Active Learning
• Active Learning (AL) is one form of SSL
• While some ML is passive (e.g., learner is only given labeled examples), AL is proactive
• Active Learner component directs attention to particular areas it wants information about from a teacher who knows all the labels
Active Learning & Email Corpora
1. Start from the pool of Unlabeled Messages
2. Select M “best” messages to label
3. Ask human to label selected messages
4. Update Spam Classifier Model based on returned labels
5. Done? If No, return to step 2
6. If Yes, label messages using Spam Classifier Model
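The loop on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not the SpamGuru implementation: each message is reduced to a single made-up numeric feature, the "model" is a bare threshold, the teacher is a function that knows a hypothetical true cutoff, and "Done?" is modelled as a fixed number of cycles. All names (`teacher`, `fit`, `select`, `run`, `TRUE_CUTOFF`) are assumptions introduced for the example.

```python
import random

TRUE_CUTOFF = 0.6  # hypothetical ground truth: feature above this means spam

def teacher(x):
    """Stand-in for the human labeler: always returns the true label."""
    return x > TRUE_CUTOFF

def fit(labeled):
    """Toy model: a threshold halfway between the closest good and spam examples."""
    good = [x for x, is_spam in labeled.items() if not is_spam]
    spam = [x for x, is_spam in labeled.items() if is_spam]
    low = max(good) if good else 0.0
    high = min(spam) if spam else 1.0
    return (low + high) / 2

def select(threshold, unlabeled, m):
    """Pick the m messages nearest the threshold (one possible notion of 'best')."""
    return sorted(unlabeled, key=lambda x: abs(x - threshold))[:m]

def run(pool, m=5, cycles=6):
    labeled = {}                                 # message -> label given by the teacher
    threshold = fit(labeled)
    for _ in range(cycles):                      # "Done?" modelled as a fixed cycle budget
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        for x in select(threshold, unlabeled, m):
            labeled[x] = teacher(x)              # ask the teacher
        threshold = fit(labeled)                 # update the model
    # Final step: the model labels everything the teacher never saw.
    predicted = {x: labeled.get(x, x > threshold) for x in pool}
    return predicted, len(labeled)

random.seed(0)
pool = [random.random() for _ in range(1000)]
predicted, queries = run(pool)
errors = sum(predicted[x] != teacher(x) for x in pool)
print(f"teacher queries: {queries}, mislabeled: {errors} of {len(pool)}")
```

In this toy setting the teacher is asked about only a small fraction of the pool, and the model labels the rest. A real labeler would swap in an actual spam classifier for `fit` and a real stopping criterion for the cycle budget.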
Active Learning (cont.)
• Basic Challenge: Minimize the total cost of teacher queries required to achieve a target error rate, often simply the fewest queries
• Research Question: How does one selectively choose an optimal set of queries for the teacher during each update cycle?
Selective Sampling
• Uncertainty Sampling† (US) is one selective sampling technique for choosing the most informative examples
• US is based on the premise that the learner learns fastest by asking first about those examples it, itself, is most uncertain about
† “A Sequential Algorithm for Training Text Classifiers”, D. D. Lewis & W. A. Gale, ACM SIGIR ’94
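A minimal sketch of the selection step, assuming the classifier outputs a spam probability for each message: the examples the learner is least sure about are those whose probability is closest to 0.5. The function names and the sample scores are made up for illustration.

```python
import heapq

def uncertainty(p_spam):
    """0.0 when the model is sure either way, 1.0 when p_spam is exactly 0.5."""
    return 1.0 - 2.0 * abs(p_spam - 0.5)

def most_uncertain(scores, k):
    """Return the k message ids whose spam probability is closest to 0.5."""
    return heapq.nlargest(k, scores, key=lambda msg_id: uncertainty(scores[msg_id]))

# Hypothetical classifier outputs: message id -> estimated P(spam)
scores = {"a": 0.98, "b": 0.52, "c": 0.03, "d": 0.40, "e": 0.71}
print(most_uncertain(scores, 2))  # ['b', 'd']
```

Messages "a" and "c" are confidently classified, so asking the teacher about them would teach the model little. "b" and "d" sit near the decision boundary and are queried first.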
Uncertainty Sampling (cont.)
• Minimizing total uncertainty over all examples is computationally expensive: O(n) for n examples
• Can you reduce the # of questions asked in each cycle and still learn accurately?
• Is picking just the most uncertain examples always the best learning strategy?
• Can other knowledge be brought to bear in selecting the best questions?
Research Hypothesis
• Hypothesis: It should be possible to achieve close to full US accuracy while asking fewer, better questions
• Focused on development of Approximate Uncertainty Sampling (AUS) labelers
– Compromise between speed of learning, # of questions asked & computational resources
– Computational complexity is O(m log(n)) vs. O(n) for original Uncertainty Sampling algorithm
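To get a feel for the gap between the two bounds, the growth terms can simply be evaluated. The sketch below assumes n is the number of messages in the corpus and m is a much smaller per-cycle quantity such as the number of messages selected. The slides do not define m, so the values tried here are hypothetical. Big-O notation hides constant factors, so the ratios are indicative only.

```python
import math

def growth_terms(n, m):
    """Return (n, m * log2(n)): the two growth terms, ignoring constant factors."""
    return n, m * math.log2(n)

n = 92_000                      # roughly the size of a 92K-message corpus
for m in (10, 100, 1_000):      # hypothetical values of m
    full, approx = growth_terms(n, m)
    print(f"m={m:>5}: n={full:,}  m*log2(n)={approx:,.0f}  ratio={full / approx:.1f}")
```

The advantage shrinks as m grows and disappears once m approaches n / log(n), so the saving depends on keeping m small relative to the corpus.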
Research Approach
1. Construct competing AL/US-based labelers
2. Compare them by…
– Accuracy (% correct, FP’s & FN’s)
– # of teacher queries required to hit error rates
– Relative sample sizes
– Overall performance & resource usage
3. Select best labelers and refine them
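The accuracy measures in step 2 can be computed from a labeler's output and the gold-standard labels roughly as follows. This sketch assumes the usual anti-spam convention that a false positive (FP) is good mail labeled as spam and a false negative (FN) is spam labeled as good. The function name and sample data are illustrative.

```python
from collections import Counter

def evaluate(predicted, gold):
    """Compare predicted spam flags with gold-standard flags, message by message."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    counts = Counter()
    for pred, truth in zip(predicted, gold):
        if pred and truth:
            counts["tp"] += 1        # spam correctly labeled spam
        elif pred and not truth:
            counts["fp"] += 1        # good mail wrongly labeled spam
        elif not pred and truth:
            counts["fn"] += 1        # spam wrongly labeled good
        else:
            counts["tn"] += 1        # good mail correctly labeled good
    good_total = counts["fp"] + counts["tn"]
    spam_total = counts["fn"] + counts["tp"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / len(gold) if gold else 0.0,
        "fp_rate": counts["fp"] / good_total if good_total else 0.0,
        "fn_rate": counts["fn"] / spam_total if spam_total else 0.0,
    }

# Made-up labels: True means spam
predicted = [True, True, False, False, True, False]
gold      = [True, False, False, False, True, True]
print(evaluate(predicted, gold))
```

Reporting FP and FN rates separately matters for spam filtering because losing a good message is usually far more costly than letting one spam through.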
Research Infrastructure
• Built a Java labeler testbench for comparing labeler variations on IBM SpamGuru codebase
• Developed and tested several Uncertainty Sampling-based labelers
• Used gold-standard, labeled 92K msg TREC 2005 Enron mail corpus to simulate the teacher
• Built a GUI front-end (CSI) to support human teacher interaction with labelers
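Using an already-labeled corpus to simulate the teacher might look like the sketch below: the labeler asks for labels as it would ask a human, the answers come from the gold standard, and the number of distinct questions is recorded for later comparison. The actual testbench was written in Java. This Python class, its name, and the message ids are hypothetical.

```python
class SimulatedTeacher:
    """Answers label queries from a gold-standard mapping and counts them."""

    def __init__(self, gold_labels):
        self._gold = dict(gold_labels)   # message id -> True if spam
        self._asked = set()              # ids the labeler has asked about

    def label(self, msg_id):
        """Return the gold label; repeated questions about one message count once."""
        self._asked.add(msg_id)
        return self._gold[msg_id]

    @property
    def queries(self):
        """Number of distinct messages the teacher was asked to label."""
        return len(self._asked)

gold = {"msg-001": True, "msg-002": False, "msg-003": True}  # made-up ids and labels
teacher = SimulatedTeacher(gold)
answers = [teacher.label(m) for m in ("msg-002", "msg-003", "msg-002")]
print(answers, teacher.queries)
```

Because every labeler variant talks to the same simulated teacher, their query counts can be compared directly without any human in the loop.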
Benefits of AUS
• Nearly as effective as vanilla US, but with lower computational complexity: O(m log(n))
• Reduced computational cost allows AUS to be applied to labeling larger datasets
• AUS makes it possible to update the learned model more frequently
• AUS is applicable to any AL/US-based solution, not just email corpus labeling
Ongoing Work
• Determine why selective sampling of queries using simple unsupervised clustering (AUS3 & AUS4) didn’t produce better results
• Develop enhanced clustering versions to attempt to improve AUS performance