A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting
![Page 1: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/1.jpg)
A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
![Page 2: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/2.jpg)
Data Stream Processing
Challenges
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Example data streams: credit card fraud detection, sensor data classification, social media stream mining
![Page 3: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/3.jpg)
Social Media Stream Processing in Time-Critical Situations
2013 Pakistan Earthquake, September 28 at 07:34 UTC
2010 Haiti Earthquake, January 12 at 21:53 UTC
Social Media Platforms
Availability of Immense Data:
Around 16 thousand tweets per minute were posted during Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
![Page 4: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/4.jpg)
Social Media Data Streams Classification
We address two issues in the supervised classification of social media streams:
1. How to keep the categories used for classification up-to-date?
2. While adding new categories, how to maintain high classification accuracy?
![Page 5: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/5.jpg)
Input and Output
Input: Category A, Category B, Category C, Miscellaneous Z
Output: Category A’, Category B’, Category C’, Z1, Z2, Z’
![Page 6: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/6.jpg)
Problem Definition
Given as input a data set of documents:
Categorized into a taxonomy: containing
Partitioning of documents into taxonomy:
Our task is to produce a new taxonomy:
With the following characteristics:
• There are N new categories:
• Pre-existing categories are slightly modified:
• New categories are different than the old:
• The size of the miscellaneous category is reduced:
![Page 7: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/7.jpg)
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)
![Page 9: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/9.jpg)
Constraints Formation
1. Items in the same category have must-link constraints
2. Items belonging to different categories have cannot-link constraints
Note: Items in Z do not have any constraints
Figure: items in Category A, Category B, Category C, and Category Z, connected by must-link and cannot-link constraints
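The two rules above can be sketched in a few lines of Python. This is only an illustration: the function name `build_constraints`, the `misc_label` parameter, and the representation of constraints as sets of index pairs are assumptions, not part of the slides.

```python
from itertools import combinations

def build_constraints(labels, misc_label="Z"):
    """Return (must_link, cannot_link) as sets of index pairs (i, j) with i < j."""
    must_link, cannot_link = set(), set()
    # Items in the miscellaneous category Z do not contribute any constraints.
    constrained = [i for i, lab in enumerate(labels) if lab != misc_label]
    for i, j in combinations(constrained, 2):
        if labels[i] == labels[j]:
            must_link.add((i, j))      # same category
        else:
            cannot_link.add((i, j))    # different categories
    return must_link, cannot_link

# Hypothetical labels for five items.
labels = ["A", "A", "B", "Z", "C"]
ml, cl = build_constraints(labels)
```

Enumerating all pairs grows quadratically with the number of labeled items, so a real system would probably sample pairs rather than materialize every one.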
![Page 10: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/10.jpg)
Objective Function
Standard distortion error
If an ML constraint is violated, then the cost of the violation is equal to the distance between the two centroids of the clusters that contain the instances.
If a CL constraint is violated, then the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
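As a rough sketch of how these three terms add up for a given assignment, the function below computes a total cost. The exact formula from the slide is not reproduced here; the use of squared Euclidean distance for every term and all names (`constrained_cost`, `nearest_other_centroid`) are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_other_centroid(centroids, c):
    """Index h(c) of the centroid closest to centroid c, excluding c itself."""
    others = [k for k in range(len(centroids)) if k != c]
    return min(others, key=lambda k: sq_dist(centroids[c], centroids[k]))

def constrained_cost(X, centroids, assign, must_link, cannot_link):
    # Standard distortion error: each point to its assigned centroid.
    cost = sum(sq_dist(X[i], centroids[assign[i]]) for i in range(len(X)))
    # Violated must-link: the pair sits in two different clusters.
    for i, j in must_link:
        if assign[i] != assign[j]:
            cost += sq_dist(centroids[assign[i]], centroids[assign[j]])
    # Violated cannot-link: the pair shares cluster c; charge distance c to h(c).
    for i, j in cannot_link:
        if assign[i] == assign[j]:
            c = assign[i]
            h = nearest_other_centroid(centroids, c)
            cost += sq_dist(centroids[c], centroids[h])
    return cost

# Hypothetical toy data: three points, two centroids.
X = [(0, 0), (0, 2), (10, 0)]
centroids = [(0, 1), (10, 0)]
assign = [0, 0, 1]
total = constrained_cost(X, centroids, assign, {(0, 2)}, {(0, 1)})
```

In this toy case the distortion is 2 and each of the two violated constraints adds 101, the squared distance between the two centroids. The sketch assumes at least two centroids, since h(c) is otherwise undefined.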
![Page 11: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/11.jpg)
Assignment and Update Rules
Rule 1: For items without any constraints (standard distortion error)
Rule 2: For items with must-link constraints; the cost of a violation is the distance between their centroids
Rule 3: For items with cannot-link constraints; the cost is the distance between centroid c and its nearest centroid
δ is the Kronecker delta function, i.e., it is 1 if x = y and 0 if x ≠ y
Update rule: The update rule computes a modified average of all points that belong to a cluster.
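One way to read Rules 2 and 3 for a single constrained pair is to try every combination of clusters for the two items and keep the cheapest, charging the violation cost whenever the constraint is broken. The sketch below follows that reading; the exhaustive search, the squared Euclidean distance, and the name `assign_pair` are assumptions, not the slide's formulas.

```python
from itertools import product

def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_pair(xi, xj, centroids, kind):
    """Pick clusters (ci, cj) for a pair under a constraint kind 'ML' or 'CL'."""
    k = len(centroids)

    def penalty(ci, cj):
        if kind == "ML" and ci != cj:
            # Must-link violated: distance between the two centroids.
            return sq_dist(centroids[ci], centroids[cj])
        if kind == "CL" and ci == cj:
            # Cannot-link violated: distance from c to its nearest centroid.
            others = [sq_dist(centroids[ci], centroids[h]) for h in range(k) if h != ci]
            return min(others) if others else 0.0
        return 0.0

    def cost(pair):
        ci, cj = pair
        return sq_dist(xi, centroids[ci]) + sq_dist(xj, centroids[cj]) + penalty(ci, cj)

    best = min(product(range(k), repeat=2), key=cost)
    return best, cost(best)

# Hypothetical pair of items and two centroids.
centroids = [(0, 0), (10, 0)]
ml_choice = assign_pair((0, 0), (4, 0), centroids, "ML")
cl_choice = assign_pair((0, 0), (4, 0), centroids, "CL")
```

With these numbers the must-link pair ends up together in cluster 0, while the cannot-link pair is split across the two clusters because sharing a cluster would add the violation cost.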
![Page 12: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/12.jpg)
COD-Means Algorithm
1. Initialization (e.g., random pick of k centroids)
2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints
3. Points in each cluster are sorted based on their distance to the centroid, and the top l are removed and inserted into L
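The loop structure of these steps can be sketched as follows. This is a deliberately simplified illustration, not COD-Means itself: the constraint-aware assignment is replaced by plain nearest-centroid assignment (Rule 1 only), the centroid update is a plain mean, and "top l" is assumed to mean the l points farthest from the centroid. The function name and parameters are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans_with_outlier_removal(X, k, l, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(X, k)              # Step 1: random initial centroids
    clusters, outliers = [], []
    for _ in range(iters):
        # Step 2 (simplified): assign each item to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for i, x in enumerate(X):
            c = min(range(k), key=lambda c: sq_dist(x, centroids[c]))
            clusters[c].append(i)
        # Step 3: per cluster, move the l farthest points into the outlier set L.
        outliers = []
        for c in range(k):
            clusters[c].sort(key=lambda i: sq_dist(X[i], centroids[c]), reverse=True)
            outliers.extend(clusters[c][:l])
            clusters[c] = clusters[c][l:]
        # Update centroids from the remaining (non-outlier) points only.
        for c in range(k):
            if clusters[c]:
                centroids[c] = mean([X[i] for i in clusters[c]])
    return centroids, clusters, sorted(outliers)

# Hypothetical data: two tight groups plus one distant point.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 50)]
centroids, clusters, L = kmeans_with_outlier_removal(X, k=2, l=1)
```

Because outliers are excluded before the centroids are recomputed, a few mislabeled or distant items pull the cluster centers less than they would in plain k-means.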
![Page 13: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/13.jpg)
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel?
2. What is the nature of outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve?
Eight disaster-related Twitter datasets were used.
![Page 14: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/14.jpg)
Clusters Novelty and Coherence
K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.
![Page 15: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/15.jpg)
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
![Page 16: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/16.jpg)
Impact on Classification Performance
![Page 17: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/17.jpg)
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel categories and labeling errors from live streams
• We used real-world Twitter datasets and performed extensive experimentation
• We showed that COD-Means is able to identify new categories and labeling errors efficiently
![Page 18: A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting](https://reader036.fdocuments.net/reader036/viewer/2022081605/5874c01e1a28ab8f508b5121/html5/thumbnails/18.jpg)
Thank you for your attention!