Geographic Data Mining
Luboš Popelínský
Knowledge Discovery Laboratory, Faculty of Informatics, MU Brno
[email protected]
www.fi.muni.cz/kd
Contents
• Knowledge discovery and data mining (DM)
• Knowledge discovery in geographic data
• Relational DM
• Crisis management and DM
Knowledge Discovery: Steps
WORM Data
Data Selection
Data pre-processing
Data mining
Evaluation/Exploitation
Data mining: two approaches
“mechanized process of identifying or discovering useful structure in data” (Fayyad 2002)
• Computational: to gain additional knowledge about the data which cannot be seen directly
• Graphical: visualize first to gain better insight into the data
Data mining: tasks
Class identification
groups database objects into similarity subclasses; an unsupervised learning technique, e.g. cluster analysis
Classification
description of data in a more compact way, e.g. by finding typical individuals or by describing the data by means of propositional logic (attribute-value)
Dependency analysis
prediction of the values of some attributes from the known values of others
Deviation detection
discovers deviations from the expectations
Data mining: methods
Unsupervised learning - clustering similar data into subgroups
AutoClass, EM, cluster analysis
Descriptive methods - frequent patterns, association rules
“A and B and C often hold together”
often = in more than k “baskets”
A and B → C (support, confidence)
Apriori, MagnumOpus
Supervised learning - examples classified into classes
decision tree learners
Naïve Bayes
Support Vector Machines, neural nets
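The support and confidence of a rule such as A and B → C can be computed directly from a list of baskets. The following is a minimal sketch; the basket contents and the function names are illustrative assumptions, not data from the talk. Support is taken here as the fraction of baskets containing all items, and confidence as support(A, B, C) / support(A, B).

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    both = support(baskets, set(antecedent) | set(consequent))
    antecedent_support = support(baskets, antecedent)
    return both / antecedent_support if antecedent_support else 0.0


# Hypothetical baskets, made up for illustration only.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]

rule_support = support(baskets, {"A", "B", "C"})
rule_confidence = confidence(baskets, {"A", "B"}, {"C"})
```

An algorithm such as Apriori avoids evaluating every candidate itemset this way by extending only those itemsets that are already frequent.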
Data mining: tools
WEKA http://www.cs.waikato.ac.nz/ml/weka/
R http://www.R-project.org
Orange http://magix.fri.uni-lj.si/orange/
based on C++ components that are accessed either directly through Python scripts (easier and better), or through GUI visual programming
commercial: SPSS Clementine, Statistica
Example 1: Remote sensing
LANDSAT TM
pixel = 30 x 30 meters
6 bands, T1-T6 - visible, IR
C4.5
Learning: 5 trees
Classification: majority voting - 3:X
2:2 mixel (e.g. “water OR conifers”)
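The voting step over five trees can be sketched as follows. This is one possible reading of the slide, not the author's code: a class wins when at least three trees agree, and a 2:2 tie between two classes is reported as a mixel. Returning None when neither case applies is an assumption of this sketch.

```python
from collections import Counter


def vote(predictions):
    """Combine the class labels predicted by five trees for one pixel."""
    counts = Counter(predictions).most_common()
    top_votes = counts[0][1]
    if top_votes >= 3:
        # Clear majority (3:X).
        return counts[0][0]
    tied = sorted(label for label, n in counts if n == top_votes)
    if top_votes == 2 and len(tied) == 2:
        # 2:2 split: the pixel is treated as a mixel of both classes.
        return " OR ".join(tied)
    # Assumption: anything else is left undecided.
    return None


majority = vote(["water", "water", "water", "conifers", "conifers"])
mixel = vote(["water", "water", "conifers", "conifers", "vegetation"])
undecided = vote(["water", "conifers", "vegetation", "leafy-wood", "no-vegetation"])
```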
Example 1: Remote sensing II
TM3 <= 35 :
|   TM4 > 99 : vegetation (12.0)
|   TM4 <= 99 :
|   |   TM5 > 58 : leafy-wood (30.0/1.0)
|   |   TM5 <= 58 :
|   |   |   TM6 <= 12 : leafy-wood (2.0)
|   |   |   TM6 > 12 : conifers (8.0)
TM3 > 35 :
|   TM6 <= 23 : water (17.0)
|   TM6 > 23 : no-vegetation (26.0)
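Read as code, the tree above is a chain of nested threshold tests on the band values. A minimal sketch follows; the function name and argument names are assumptions. In C4.5 output the numbers in parentheses give the training cases reaching a leaf and, after the slash, how many of them are misclassified; they are not needed for classification.

```python
def classify_pixel(tm3, tm4, tm5, tm6):
    """Apply the learned tree to the band values of one pixel."""
    if tm3 <= 35:
        if tm4 > 99:
            return "vegetation"
        if tm5 > 58:
            return "leafy-wood"
        return "leafy-wood" if tm6 <= 12 else "conifers"
    return "water" if tm6 <= 23 else "no-vegetation"


# Hypothetical band values, chosen only to exercise different leaves.
example_a = classify_pixel(tm3=30, tm4=120, tm5=40, tm6=10)
example_b = classify_pixel(tm3=30, tm4=80, tm5=50, tm6=20)
example_c = classify_pixel(tm3=50, tm4=80, tm5=50, tm6=20)
```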
Example 1: Remote sensing III
Example 1: Remote sensing IV
Results obtained with C4.5
accuracy: comparable to the commonly used methods
fails in the same situations: built-up areas, shallow water
readable -> easy to analyze
much faster
Mining in geographic data
Spatial data have to be managed with means that respect (and exploit) their structural nature together with non-spatial data
understanding data, discovering hidden relationships in data (e.g. classification, clustering, subgroup discovery)
GeoMiner (Han, 1998), FuSOQL (Bigolin, 1998), GWiM (Popelinsky, 1998), INGENS (Malerba, 2002), SPADA (Malerba, 2001, 2004), CommonGIS (Andrienko, 2000)
GRR, a tool for mining in GIS
GRASS v.5.0.0
implemented in Perl, Tk, R
PMML visualizer by Dietrich Wettschereck
Linux RedHat
GRR: tools
pre-processing - sampling, feature selection, feature construction
analytic tools – R and WEKA, e.g. decision tree and rules, regression tree, cluster analysis
multi-relational learning (inductive logic programming)
classification - Aleph, INDEED
first-order association rules - RAP
Example 2: Can we recognize friendly robots after short experience?
friendly unfriendly
Example 2: Robots and an attribute-value description
head       smile  neck     body       in hand  friendly
Circle     No     Tie      Rectangle  Sword    no
Rectangle  Yes    Bow      Rectangle  Nothing  yes
Circle     No     Bow      Circle     Sword    yes
Triangle   No     Tie      Rectangle  Ball     no
Circle     Yes    Nothing  Triangle   Flower   no
Triangle   No     Nothing  Triangle   Ball     yes
Triangle   Yes    Tie      Circle     Nothing  no
Circle     Yes    Tie      Circle     Nothing  yes
Example 2: hypothesis and testing
H1: in the form of a decision tree
if neck = bow then yes
        = nothing then if head = triangle then yes else no
        = tie then if body = rectangle then no
                   else if head = circle then yes else no
head      smile  neck     body       in hand  friendly
circle    no     tie      circle     sword    yes
triangle  yes    nothing  rectangle  nothing  yes
Attribute-value language with equality
H2: if head(r) = body(r) then “friendly” else “unfriendly”
head      smile  neck     body       in hand  friendly
circle    no     tie      circle     sword    yes
triangle  yes    nothing  rectangle  nothing  no
H1 and H2 classify the learning data in the same way, but they differ on the test set.
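The claim can be checked mechanically. The sketch below transcribes the eight learning examples and the two test robots from the tables of Example 2 and encodes both hypotheses as functions; the function and variable names are assumptions made for illustration.

```python
# (head, smile, neck, body, in_hand, friendly), from the learning table.
LEARNING = [
    ("circle", "no", "tie", "rectangle", "sword", "no"),
    ("rectangle", "yes", "bow", "rectangle", "nothing", "yes"),
    ("circle", "no", "bow", "circle", "sword", "yes"),
    ("triangle", "no", "tie", "rectangle", "ball", "no"),
    ("circle", "yes", "nothing", "triangle", "flower", "no"),
    ("triangle", "no", "nothing", "triangle", "ball", "yes"),
    ("triangle", "yes", "tie", "circle", "nothing", "no"),
    ("circle", "yes", "tie", "circle", "nothing", "yes"),
]

# The two unlabeled test robots.
TEST = [
    ("circle", "no", "tie", "circle", "sword"),
    ("triangle", "yes", "nothing", "rectangle", "nothing"),
]


def h1(head, smile, neck, body, in_hand):
    """Decision tree over neck, head and body."""
    if neck == "bow":
        return "yes"
    if neck == "nothing":
        return "yes" if head == "triangle" else "no"
    # Remaining case in the data: neck == "tie".
    if body == "rectangle":
        return "no"
    return "yes" if head == "circle" else "no"


def h2(head, smile, neck, body, in_hand):
    """Attribute-value language with equality: head and body have the same shape."""
    return "yes" if head == body else "no"


h1_fits = all(h1(*row[:5]) == row[5] for row in LEARNING)
h2_fits = all(h2(*row[:5]) == row[5] for row in LEARNING)
h1_test = [h1(*robot) for robot in TEST]
h2_test = [h2(*robot) for robot in TEST]
```

Both hypotheses fit all eight learning examples, so the learning data alone cannot decide between them.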
Ockham's razor
William of Ockham
„Entia non sunt multiplicanda praeter necessitatem“
The simpler the hypothesis, the better
Example 3: Need for domain knowledge
139  319  854  468  349  561  756  789  987  256  189  354
 +    -    -    +    +    -    -    +    -    +    +    -
What language to use? E.g. the attributes c1, c2 and c3 for the 1st, 2nd and 3rd digit in a triple?
… ordering on the digits in a triple
H3: if c1 < c2 & c2 < c3 then ‘+’.
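H3 can be verified against the twelve labeled triples. In this minimal sketch the function name is an assumption, and so is returning ‘-’ whenever the condition fails, since H3 as stated only covers the ‘+’ case.

```python
TRIPLES = "139 319 854 468 349 561 756 789 987 256 189 354".split()
LABELS = "+ - - + + - - + - + + -".split()


def h3(triple):
    """Predict '+' when the three digits are strictly increasing."""
    c1, c2, c3 = (int(digit) for digit in triple)
    return "+" if c1 < c2 < c3 else "-"


predictions = [h3(triple) for triple in TRIPLES]
all_correct = predictions == LABELS
```

The rule needs the ordering relation on digits, which is domain knowledge that a plain attribute-value test on c1, c2 and c3 separately does not provide.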
Need for richer language
Attribute-value language is too poor
• if examples are of different length
• if structure (e.g. relations between attributes) is principal
• if domain knowledge cannot be expressed by examples
Relational data mining (inductive logic programming)
• A subset of first-order predicate language
• Domain knowledge = set of functions
• can be easily incorporated
Example 4: SAR
• Structure Activity Relationships (SAR): known
– Chemical structure
– Empirical value on toxicity/mutagenicity/medical effect
• What is the reason for the behaviour?
Result: structural indicator
positive negative
SAR: First-order representation
HCOOH
atom(MoleculeID, AtomID, Parameters…)
sbond(MoleculeID, Atom1, Atom2, Bond)
SAR: first-order frequent pattern
benzene(X,Y),
atom(X,Z,U,40,V), atom(X,W,X1,38,Y1), sbond(X,Z,W,2),
atom(X,Z1,U1,29,V1), atom(X,W1,X2,22,Y2), sbond(X,Z1,W1,7),
atom(X,Z2,U2,3,V2), sbond(X,W1,Z2,1),
atom(X,W2,o,40,X3), sbond(X,W,W2,2),
atom(X,Y3,n,38,Z3), sbond(X,W2,Y3,2), sbond(X,Z,Y3,2),
atom(X,U3,c,29,V3), sbond(X,W1,U3,7)
Carcinogenicity
230 aromatic and heteroaromatic nitro compounds
188 regression-friendly, easy to classify in attribute-value representation
42 regression-unfriendly
regression-unfriendly compounds
relational data mining (Aleph): accuracy 88%
classical (attribute-value) methods: about 20% worse
Example 5: First-order rules for remote sensing
GRR
raster data, north-west part of Leicestershire, England
layers used: image, popln, rail, roads, topo, urban and water
positive examples: bush, i.e. landcov = 7, negative: the rest
domain knowledge: basic arithmetic relations like ≤ or >
learned 26 rules (accuracy 80.2 %). After removing the rules with accuracy lower than 70% on the learning set, the accuracy increased to 92 %.
Mining in geographic data
How to find the right hypothesis
Depends on
• language for representing examples
• hypothesis language
• (need for) domain knowledge
Attribute-value language is too poor
• if examples are of different length
• if structure (e.g. relations between attributes) is principal
• if domain knowledge cannot be expressed by examples
Example 6: Mining in vector data
Visualization
graphical exploration of data
another method for finding hidden knowledge in data
ideation - advanced visualization
“formation of ideas or concepts”
shares its goal with computational data mining
CommonGIS
CommonGIS as a decision support system in tourism: territory of the canton of Wallis, Switzerland
CommonGIS
a tool for presenting spatio-temporal data in the form of thematic maps and for analysing them
works with two kinds of data: attribute tables that contain non-spatial information, and layers with geographically referenced data.
Each attribute table is connected with a layer.
ESPRIT project 28983 - Common Access to Geographically Referenced Data, 2000 – 2002
CommonGIS
Ideal point - ranks all options by comparing them with a non-existent option that ideally satisfies the selected set of criteria.
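One common way to realize such a ranking is sketched below. It is not taken from CommonGIS and may differ from what the tool actually does. The sketch assumes that each criterion is rescaled to [0, 1] with 1 as the best observed value, that the ideal point scores 1 on every criterion, and that options are ordered by Euclidean distance to it. The option names and values are hypothetical.

```python
import math


def rank_by_ideal_point(options, maximize):
    """Order option names from closest to farthest from the ideal point.

    options:  dict mapping a name to a tuple of criterion values
    maximize: one bool per criterion, True if larger values are better
    """
    names = list(options)
    columns = [[options[name][j] for name in names] for j in range(len(maximize))]

    def scaled(value, j):
        lo, hi = min(columns[j]), max(columns[j])
        if hi == lo:
            return 1.0  # criterion does not discriminate between options
        share = (value - lo) / (hi - lo)
        return share if maximize[j] else 1.0 - share

    def distance(name):
        return math.sqrt(sum(
            (1.0 - scaled(options[name][j], j)) ** 2
            for j in range(len(maximize))
        ))

    return sorted(names, key=distance)


# Hypothetical resorts: (kilometres of slopes, price per night).
resorts = {"A": (120, 80), "B": (200, 60), "C": (150, 95)}
ranking = rank_by_ideal_point(resorts, maximize=(True, False))
```

The ideal point itself need not correspond to any real option; it only serves as the reference against which the real options are compared.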
Data mining for crisis management
distributed data mining
preserving data privacy when mining from private data sources
new data, both in terms of structure
stream data, text
Stream data mining
impossible or useless to remember all the data
mobile communication, email messages
spam detection
machine learning is successful
Naïve Bayes
number of false positives (ham recognized as spam) -> 0
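A Naïve Bayes filter suits stream data because it only keeps word counts, so each message can be discarded after it has been used for an update. The sketch below is a generic illustration, not the system referred to in the talk. The class name, the training messages and the threshold parameter are assumptions. Raising the threshold is one common way to push false positives towards zero at the cost of letting more spam through.

```python
import math
from collections import Counter


class OnlineNaiveBayes:
    """Naive Bayes spam filter that learns from one message at a time."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    def update(self, text, label):
        """Add one labeled message to the counts; the text is not stored."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.message_counts[label] += 1
        self.vocabulary.update(words)

    def log_odds_spam(self, text):
        """Log of P(spam | text) / P(ham | text) with add-one smoothing."""
        total = self.message_counts["spam"] + self.message_counts["ham"]
        score = (math.log((self.message_counts["spam"] + 1) / (total + 2))
                 - math.log((self.message_counts["ham"] + 1) / (total + 2)))
        size = len(self.vocabulary)
        spam_total = sum(self.word_counts["spam"].values())
        ham_total = sum(self.word_counts["ham"].values())
        for word in text.lower().split():
            if word not in self.vocabulary:
                continue  # unseen words carry no evidence either way
            p_spam = (self.word_counts["spam"][word] + 1) / (spam_total + size)
            p_ham = (self.word_counts["ham"][word] + 1) / (ham_total + size)
            score += math.log(p_spam) - math.log(p_ham)
        return score

    def is_spam(self, text, threshold=0.0):
        """Flag as spam only if the log-odds exceed the threshold."""
        return self.log_odds_spam(text) > threshold


# Hypothetical training stream.
model = OnlineNaiveBayes()
for text, label in [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("project report attached", "ham"),
]:
    model.update(text, label)
```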
New kind of data
need for new methods
mostly with a temporal and/or spatial coordinate
mobile communication networks
traffic monitoring systems
GPS
and text documents
Flood waters in the Czech capital, Prague, are surging towards new peak levels, with fears that river defences will be overwhelmed and the old city completely inundated. Parts of the city’s historic Mala Strana district are already submerged and many people fear the worst is still to come as water levels from the Vltava River rise further. Thousands of people have already left the capital – Prime Minister Vladimir Spidla had authorised rescue teams to use reasonable force to remove those refusing to leave their homes. Water has engulfed Prague’s Kampa island close to the Old Town, flooding historic palaces and villas. The city is now bracing itself for a second flood tide, threatening to overwhelm the historic Charles Bridge, one of this country’s biggest tourist symbols. Amid concerns that many people could be stranded without water and electricity, Prague’s Mayor Igor Nemec ordered up to 40,000 inhabitants to leave their homes.
Text
natural language learning
text categorization
information extraction
useful for filtering and understanding messages exchanged during a crisis situation
Thanks for your attention
Parity
C1  C2  C3  C4  C5  C6  C7  C8  class
1   0   1   1   0   0   0   0   -
1   0   1   1   0   0   0   1   +
1   1   1   1   0   0   0   1   -
0   1   1   1   0   0   0   1   +
No attribute can be removed
Decision tree will be very complex
Recursion is needed
[Figure: diagram with nodes Yes and No and edges labelled Input = 1 and Input = 0]
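The four rows of the parity table are consistent with ‘+’ meaning an even number of ones, which is an assumption read off the table. Under that reading the concept has a short recursive definition, sketched here with an assumed function name: the empty sequence is even, a leading 0 keeps the answer for the rest, and a leading 1 flips it.

```python
def even_parity(bits):
    """True if the sequence contains an even number of ones."""
    if not bits:
        return True
    rest = even_parity(bits[1:])
    return rest if bits[0] == 0 else not rest


# Rows transcribed from the parity table.
ROWS = [
    ((1, 0, 1, 1, 0, 0, 0, 0), "-"),
    ((1, 0, 1, 1, 0, 0, 0, 1), "+"),
    ((1, 1, 1, 1, 0, 0, 0, 1), "-"),
    ((0, 1, 1, 1, 0, 0, 0, 1), "+"),
]

matches = all(("+" if even_parity(bits) else "-") == label for bits, label in ROWS)
```

The same definition works for inputs of any length, which a fixed attribute-value decision tree over C1 to C8 cannot offer.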