RM World 2014: Data mining with background knowledge from the web
![Page 1: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/1.jpg)
08/22/14 Paulheim, Ristoski, Mitichkin, Bizer 1
Data Mining with Background Knowledge from the Web
Introducing the RapidMiner Linked Open Data Extension
Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
![Page 2: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/2.jpg)
Motivation: An Example Data Mining Task
• Analyzing book sales
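| ISBN | City | Sold |
|---|---|---|
| 3-2347-3427-1 | Darmstadt | 124 |
| 3-43784-324-2 | Mannheim | 493 |
| 3-145-34587-0 | Roßdorf | 14 |
| ... | | |

| ISBN | City | Population | ... | Genre | Publisher | ... | Sold |
|---|---|---|---|---|---|---|---|
| 3-2347-3427-1 | Darmstadt | 144402 | ... | Crime | Bloody Books | ... | 124 |
| 3-43784-324-2 | Mannheim | 291458 | ... | Crime | Guns Ltd. | ... | 493 |
| 3-145-34587-0 | Roßdorf | 12019 | ... | Travel | Up&Away | ... | 14 |
| ... | | | | | | | |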
→ Crime novels sell better in larger cities
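The enrichment step behind this example can be sketched in a few lines of Python. This is only an illustration of the idea, not how the extension works internally: the lookup dictionaries `city_population` and `book_genre` and the helper `enrich` are hypothetical names, filled with the values shown on the slide.

```python
# Sales records from the slide: (ISBN, city, copies sold).
sales = [
    ("3-2347-3427-1", "Darmstadt", 124),
    ("3-43784-324-2", "Mannheim", 493),
    ("3-145-34587-0", "Roßdorf", 14),
]

# Background knowledge from the slide, keyed by the attribute it joins on.
# In practice these values would come from an external source.
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}
book_genre = {
    "3-2347-3427-1": "Crime",
    "3-43784-324-2": "Crime",
    "3-145-34587-0": "Travel",
}


def enrich(rows):
    """Attach population and genre to every sales row (hypothetical helper)."""
    return [
        {
            "isbn": isbn,
            "city": city,
            "population": city_population[city],
            "genre": book_genre[isbn],
            "sold": sold,
        }
        for isbn, city, sold in rows
    ]


enriched = enrich(sales)

# With the extra columns in place, a pattern such as
# "crime novels sell better in larger cities" becomes checkable.
crime_by_city_size = sorted(
    (row for row in enriched if row["genre"] == "Crime"),
    key=lambda row: row["population"],
)
for row in crime_by_city_size:
    print(row["city"], row["population"], row["sold"])
```

Without the population and genre columns, the original three-column table offers nothing to explain the differences in sales.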
![Page 3: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/3.jpg)
Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
![Page 4: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/4.jpg)
Linked Open Data in a Nutshell
• Started in 2007
• A collection of ~1,000 open datasets
– from various domains, e.g., general knowledge, government data, …
– using semantic web standards (HTTP, RDF, SPARQL, …)
• Machine processable
• Free of charge
• Sophisticated tool stacks
![Page 5: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/5.jpg)
Linked Open Data in a Nutshell
http://lod-cloud.net/
![Page 6: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/6.jpg)
Example: DBpedia
![Page 7: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/7.jpg)
![Page 8: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/8.jpg)
The RapidMiner LOD Extension
• Automatic discovery of links to Linked Open Data
– for local data objects
– e.g., the database entry Boston is linked to http://dbpedia.org/resource/Boston
• Automatic generation of attributes
– e.g., add all numeric values found for Boston (and other cities)
• Plus
– Feature selection algorithms optimized for LOD
– Automatic following of links to other datasets
– Schema matching (coming soon)
• No need to know Semantic Web technologies!
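For readers who do want a look under the hood, the two core steps (linking and attribute generation) might be approximated as follows. This is a sketch under stated assumptions, not the extension's implementation: `label_to_dbpedia_uri` and `numeric_values_query` are hypothetical helpers, the linker simply guesses the URI from the label, and the query is only built as a string, not sent to any endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def label_to_dbpedia_uri(label: str) -> str:
    """Naive linker (assumption): guess the DBpedia URI from a local label.

    DBpedia resource names follow Wikipedia page titles, where spaces are
    written as underscores. Real linking has to handle ambiguity as well.
    """
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def numeric_values_query(uri: str) -> str:
    """Build a SPARQL query for all numeric literals attached to a resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{uri}> ?property ?value . "
        "FILTER(isNumeric(?value)) }"
    )


boston = label_to_dbpedia_uri("Boston")
print(boston)
print(numeric_values_query(boston))
```

Each `?property` returned by such a query would become one new numeric attribute of the local data object.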
![Page 9: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/9.jpg)
Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (=Wikipedia)
![Page 10: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/10.jpg)
• Result: with M5Rules, the prediction error drops to almost half
– i.e., on average, we are wrong by 1.6 instead of 2.9 MPG
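As a quick check of the "almost half" claim, the two error values from the slide can be compared directly. The variable names are illustrative; the slide does not state which error measure was used.

```python
error_without = 2.9  # MPG, original attributes only (value from the slide)
error_with = 1.6     # MPG, with DBpedia types and categories (value from the slide)

remaining = error_with / error_without  # share of the error that is left
reduction = 1 - remaining               # share of the error that was removed

print(f"remaining error: {remaining:.0%}, reduction: {reduction:.0%}")
# prints "remaining error: 55%, reduction: 45%"
```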
![Page 11: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/11.jpg)
Example: the Auto MPG Dataset
• The original attributes are
– cylinders, displacement, horsepower, weight, acceleration, model year, origin
– plus name (unique string) and mpg (target)
• Models built are, e.g.,
– high horsepower/weight → high consumption
• Additional attributes lead to further insights, e.g.
– front-wheel drives have a lower consumption than rear-wheel drives
– hatchbacks have a lower consumption than station wagons
– rally cars generally have a low consumption
![Page 12: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/12.jpg)
Example: Analyzing Statistics
• As shown, e.g., at ESWC 2012, SemStats 2013
• Statistics found on the web often contain only a few attributes
– extreme case: only entity + target
• Examples:
– Quality of living in cities (right)
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
![Page 13: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/13.jpg)
Example: Analyzing Statistics
• Process in RapidMiner:
– load statistic
– link entities (cities, countries, etc.) to LOD cloud
– collect additional attributes
– analyze for correlations with target attribute of statistic
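Outside RapidMiner, the same four steps can be sketched in plain Python. Everything here is an assumption made for illustration: the entities `A` to `D`, the attribute names, and all values are synthetic toy data, and the `background` dictionary stands in for the linking and attribute collection that the extension performs against the LOD cloud.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Step 1: load the statistic (entity -> target value). Synthetic toy values.
statistic = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}

# Steps 2 and 3: link entities and collect additional attributes.
# A hard-coded lookup replaces the LOD access here (assumption).
background = {
    "A": {"attr_up": 10.0, "attr_down": 8.0},
    "B": {"attr_up": 20.0, "attr_down": 6.0},
    "C": {"attr_up": 30.0, "attr_down": 4.0},
    "D": {"attr_up": 40.0, "attr_down": 2.0},
}


def rank_attributes(statistic, background):
    """Step 4: correlate each collected attribute with the target attribute."""
    entities = sorted(statistic)
    target = [statistic[e] for e in entities]
    names = sorted({name for e in entities for name in background[e]})
    scores = {
        name: pearson([background[e][name] for e in entities], target)
        for name in names
    }
    # Strongest correlations first, regardless of sign.
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_attributes(statistic, background)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A strong correlation found this way is a hint worth inspecting, not an explanation of the target.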
![Page 14: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/14.jpg)
![Page 15: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/15.jpg)
Example: Analyzing Statistics
• Corruption Perception Index (CPI) by Transparency International
• Indicators for low corruption:
– high HDI (human development index)
– large number of companies
– large number of NGOs
– small number of cargo airlines?!
• Burnout rates in German DAX companies
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– Local companies are less prone to burnout than international ones
• Exception: Frankfurt
![Page 16: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/16.jpg)
Example: Analyzing Statistics
• Sexual activity (based on Durex survey 2005-2009)
– Higher in French speaking than in English speaking countries
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
http://xkcd.com/552/
![Page 17: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/17.jpg)
Further Usage Examples
• Classification of Twitter messages (SMILE, 2013)
– given a target, e.g., messages related to car traffic
– annotate message, extract abstract features for concepts
– e.g., “I-90” → highway
• Prediction of user location for Twitter (ICWSM, 2013)
– useful, e.g., for market research
– combination with sentiment analysis: public opinion maps
• Identifying disputed topics in the news (LD4KD, 2014)
– on a corpus of different online newspapers
– identified, e.g., conflicting opinions on drug legislation and gay marriage
• Debugging Linked Open Data as such
– e.g., identifying wrong links and axioms
– combination with outlier detection
![Page 18: RM World 2014: Data mining with background knowledge from the web](https://reader034.fdocuments.net/reader034/viewer/2022051323/548119c4b379593a2b8b5bb6/html5/thumbnails/18.jpg)
Conclusions
• Many data mining tasks are better solved with more background knowledge
– better predictive models
– more insights from additional attributes
• A lot of such knowledge exists as Linked Open Data
• The Linked Open Data extension grants easy access to that data
– from within RapidMiner
– without the need to know anything about RDF, SPARQL, etc.
• Try it out!
– find “Linked Open Data” on the marketplace
– Google Group: https://groups.google.com/forum/#!forum/rmlod