Geographic Data Mining - iba.muni.cz · WORM Data Data Selection Data pre-processing Data mining...

Geographic Data Mining Luboš Popelínský Knowledge Discovery Laboratory Faculty of Informatics, MU Brno [email protected] http://www.fi.muni.cz/kd


Geographic Data Mining

Luboš Popelínský
Knowledge Discovery Laboratory, Faculty of Informatics, MU Brno

[email protected]
http://www.fi.muni.cz/kd


Contents

• Knowledge discovery and data mining (DM)
• Knowledge discovery in geographic data
• Relational DM
• Crisis management and DM


Knowledge Discovery: Steps

WORM Data
Data Selection
Data pre-processing
Data mining
Evaluation/Exploitation


Data mining: two approaches

“mechanized process of identifying or discovering useful structure in data” (Fayyad 2002)

• Computational: to gain additional knowledge about data which cannot be seen directly

• Graphical: visualize first to gain better insight into the data


Data mining: tasks

Class identification: groups database objects into similarity subclasses; an unsupervised learning technique, e.g. cluster analysis

Classification: description of data in a more compact way, e.g. by finding typical individuals or by describing the data by means of propositional logic (attribute-value)

Dependency analysis: prediction of the values of some attributes from the known values of others

Deviation detection: discovers deviations from the expectations


Data mining: methods

Unsupervised learning - clustering similar data into subgroups
AutoClass, EM, cluster analysis

Descriptive methods – frequent patterns, association rules
“A and B and C often hold together”
often = in more than k “baskets”
A and B → C (support, confidence)
Apriori, MagnumOpus

Supervised learning - examples classified into classes
decision tree learners
Naïve Bayes
Support Vector Machines, neural nets
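The support and confidence of a rule such as A and B → C can be computed directly from the baskets. The sketch below is only an illustration: the five baskets are made up, and the function names `support` and `confidence` are not taken from Apriori or MagnumOpus.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent) / support(antecedent)."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, baskets) / antecedent_support


# Hypothetical baskets, invented for this illustration only.
baskets = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
]

rule_support = support({"A", "B", "C"}, baskets)          # A, B and C together
rule_confidence = confidence({"A", "B"}, {"C"}, baskets)  # A and B -> C
```

A frequent-pattern miner would keep only itemsets whose support exceeds a chosen threshold (the "more than k baskets" condition) before generating rules.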


Data mining: tools

WEKA http://www.cs.waikato.ac.nz/ml/weka/

R http://www.R-project.org

Orange http://magix.fri.uni-lj.si/orange/
based on C++ components that are accessed either directly through Python scripts (easier and better), or through GUI visual programming

commercial: SPSS Clementine, Statistica


Example 1: Remote sensing

LANDSAT TM
pixel = 30 x 30 meters
6 bands, T1-T6 - visible, IR

C4.5

Learning 5 trees

Classification
majority voting – 3:X
2:2 mixel (e.g. “water OR conifers”)
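One possible reading of the voting scheme is sketched below: five trees each predict a class for a pixel, a class with the most votes wins, and a tie between top classes (such as 2:2) is reported as a mixel by joining the tied class names with OR. The function name `vote` and the handling of ties other than 2:2 are assumptions, not taken from the slides.

```python
from collections import Counter


def vote(predictions):
    """Combine the class predictions of several trees for one pixel.

    A single class with the most votes is returned as is; if several
    classes share the top count, the pixel is treated as a mixel.
    """
    counts = Counter(predictions)  # keeps first-seen order of classes
    top = max(counts.values())
    winners = [cls for cls, n in counts.items() if n == top]
    if len(winners) == 1:
        return winners[0]
    return " OR ".join(winners)


clear_pixel = vote(["water", "water", "conifers", "water", "conifers"])
mixed_pixel = vote(["water", "conifers", "water", "conifers", "vegetation"])
```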


Example 1: Remote sensing II

TM3 <= 35 :
|   TM4 > 99 : vegetation (12.0)
|   TM4 <= 99 :
|   |   TM5 > 58 : leafy-wood (30.0/1.0)
|   |   TM5 <= 58 :
|   |   |   TM6 <= 12 : leafy-wood (2.0)
|   |   |   TM6 > 12 : conifers (8.0)
TM3 > 35 :
|   TM6 <= 23 : water (17.0)
|   TM6 > 23 : no-vegetation (26.0)
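Read as nested threshold tests, the tree above translates directly into code. The sketch below mirrors the printed thresholds and class names; the function name `classify_pixel` and its argument names are chosen here for illustration.

```python
def classify_pixel(tm3, tm4, tm5, tm6):
    """Apply the C4.5 tree shown above to one pixel's band values."""
    if tm3 <= 35:
        if tm4 > 99:
            return "vegetation"
        # tm4 <= 99
        if tm5 > 58:
            return "leafy-wood"
        # tm5 <= 58
        return "leafy-wood" if tm6 <= 12 else "conifers"
    # tm3 > 35
    return "water" if tm6 <= 23 else "no-vegetation"


example_label = classify_pixel(tm3=30, tm4=80, tm5=40, tm6=20)
```

Each root-to-leaf path is a readable rule, which is what makes the learned model easy to inspect.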


Example 1: Remote sensing III


Example 1: Remote sensing IV

Results obtained with C4.5

accuracy: comparable to the methods commonly used

fails in the same situations – built-up areas, shallow water

readable -> easy to analyze

much faster


Mining in geographic data

Spatial data have to be managed with means that respect (and exploit) their structural nature together with non-spatial data

understanding data, discovering hidden relationships in data (e.g. classification, clustering, subgroup discovery)

GeoMiner (Han, 1998), FuSOQL (Bigolin, 1998), GWiM (Popelinsky, 1998), INGENS (Malerba, 2002), SPADA (Malerba, 2001, 2004), CommonGIS (Andrienko, 2000)


GRR, a tool for mining in GIS

GRASS v.5.0.0

implemented in Perl, Tk, R

PMML visualizer by Dietrich Wettschereck

Linux RedHat


GRR: tools

pre-processing - sampling, feature selection, feature construction

analytic tools – R and WEKA, e.g. decision tree and rules, regression tree, cluster analysis

multi-relational learning (inductive logic programming)
classification – Aleph, INDEED
first-order association rules – RAP


Example 2: Can we recognize friendly robots after short experience?

friendly unfriendly


Example 2: Robots and an attribute-value description

head       smile  neck     body       in hand  friendly
Circle     No     Tie      Rectangle  Sword    no
Rectangle  Yes    Bow      Rectangle  Nothing  yes
Circle     No     Bow      Circle     Sword    yes
Triangle   No     Tie      Rectangle  Ball     no
Circle     Yes    Nothing  Triangle   Flower   no
Triangle   No     Nothing  Triangle   Ball     yes
Triangle   Yes    Tie      Circle     Nothing  no
Circle     Yes    Tie      Circle     Nothing  yes


Example 2: hypothesis and testing

H1: in the form of a decision tree

if neck = bow then yes
        = nothing then if head = triangle then yes else no
        = tie then if body = rectangle then no
                   else if head = circle then yes else no

head smile neck body in hand friendly

circle no tie circle sword yes

triangle yes nothing rectangle nothing yes


Attribute-value language with equality

H2: if head(r) = body(r) then “friendly” else “unfriendly”

head smile neck body in hand friendly

circle no tie circle sword yes

triangle yes nothing rectangle nothing no

H1 and H2 classify the learning data in the same way, but they differ on the test set.
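The claim can be checked mechanically. The sketch below encodes the eight learning examples and the two test robots from the slides, with the decision tree as `h1` and the equality rule as `h2`. The tuple layout and the lower-case attribute values are choices made here, and the row order follows a reassembly of the flattened table.

```python
# (head, smile, neck, body, in_hand, friendly)
TRAIN = [
    ("circle", "no", "tie", "rectangle", "sword", "no"),
    ("rectangle", "yes", "bow", "rectangle", "nothing", "yes"),
    ("circle", "no", "bow", "circle", "sword", "yes"),
    ("triangle", "no", "tie", "rectangle", "ball", "no"),
    ("circle", "yes", "nothing", "triangle", "flower", "no"),
    ("triangle", "no", "nothing", "triangle", "ball", "yes"),
    ("triangle", "yes", "tie", "circle", "nothing", "no"),
    ("circle", "yes", "tie", "circle", "nothing", "yes"),
]

# The two unlabeled test robots: (head, smile, neck, body, in_hand)
TEST = [
    ("circle", "no", "tie", "circle", "sword"),
    ("triangle", "yes", "nothing", "rectangle", "nothing"),
]


def h1(head, smile, neck, body, in_hand):
    """Decision-tree hypothesis."""
    if neck == "bow":
        return "yes"
    if neck == "nothing":
        return "yes" if head == "triangle" else "no"
    # neck == "tie"
    if body == "rectangle":
        return "no"
    return "yes" if head == "circle" else "no"


def h2(head, smile, neck, body, in_hand):
    """Equality hypothesis: friendly iff head and body have the same shape."""
    return "yes" if head == body else "no"


train_agree = all(h1(*row[:5]) == h2(*row[:5]) == row[5] for row in TRAIN)
test_predictions = [(h1(*row), h2(*row)) for row in TEST]
```

Both hypotheses are consistent with all eight learning examples, yet they disagree on the second test robot, so the learning data alone cannot decide between them.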


Ockham's razor

William of Ockham
“Entia non sunt multiplicanda praeter necessitatem” (entities should not be multiplied beyond necessity)

The simpler the hypothesis, the better


Example 3: Need for domain knowledge

139 319 854 468 349 561 756 789 987 256 189 354

+ - - + + - - + - + + -

What language to use? E.g. attributes c1, c2 and c3 for the 1st, 2nd and 3rd digit in a triple?


Example 3: Need for domain knowledge

139 319 854 468 349 561 756 789 987 256 189 354

+ - - + + - - + - + + -

What language to use? E.g. the attributes c1, c2 and c3 for the 1st, 2nd and 3rd digit in a triple?
… ordering on the digits in a triple


Example 3: Need for domain knowledge

139 319 854 468 349 561 756 789 987 256 189 354

+ - - + + - - + - + + -

What language to use? E.g. the attributes c1, c2 and c3 for the 1st, 2nd and 3rd digit in a triple?
… ordering on the digits in a triple
H3: if c1 < c2 & c2 < c3 then ‘+’.
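H3 can be checked against all twelve examples once the ordering relation on digits is available as domain knowledge. In the sketch below the function name `h3` is chosen for illustration, and the `else '-'` branch is an assumption, since the slide states only the positive case.

```python
EXAMPLES = ["139", "319", "854", "468", "349", "561",
            "756", "789", "987", "256", "189", "354"]
LABELS = ["+", "-", "-", "+", "+", "-",
          "-", "+", "-", "+", "+", "-"]


def h3(triple):
    """'+' when the three digits are strictly increasing, '-' otherwise."""
    c1, c2, c3 = (int(digit) for digit in triple)
    return "+" if c1 < c2 and c2 < c3 else "-"


predictions = [h3(triple) for triple in EXAMPLES]
all_correct = predictions == LABELS
```

Without the `<` relation between attributes, an attribute-value learner would have to enumerate digit combinations instead of stating this single rule.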


Need for richer language

Attribute-value language is too poor
• if examples are of different length
• if structure (e.g. relations between attributes) is essential
• if domain knowledge cannot be expressed by examples

Relational data mining (inductive logic programming)
• A subset of first-order predicate language
• Domain knowledge = set of functions
• can be easily incorporated


Example 4: SAR

• Structure Activity Relationships (SAR): known
  – Chemical structure
  – Empirical value on toxicity/mutagenicity/medical effect

• What is the reason for the behaviour?

Result: structural indicator

positive negative


SAR: First-order representation

HCOOH

atom(MoleculeID, AtomID, Parameters…)

sbond(MoleculeID, Atom1, Atom2, Bond)
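To make the two relations concrete, formic acid (HCOOH) can be written as a small set of facts. The sketch below is illustrative only: the slide leaves the atom parameters unspecified, so each atom carries just an element symbol here; the identifiers `a1`..`a5` are made up; and the bond codes 1 (single) and 2 (double) are an assumed encoding.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    molecule: str
    atom: str
    element: str  # stands in for the unspecified "Parameters…"


class SBond(NamedTuple):
    molecule: str
    atom1: str
    atom2: str
    bond: int  # assumed encoding: 1 = single, 2 = double


# H-C(=O)-O-H
ATOMS = [
    Atom("hcooh", "a1", "h"),
    Atom("hcooh", "a2", "c"),
    Atom("hcooh", "a3", "o"),
    Atom("hcooh", "a4", "o"),
    Atom("hcooh", "a5", "h"),
]
BONDS = [
    SBond("hcooh", "a1", "a2", 1),
    SBond("hcooh", "a2", "a3", 2),
    SBond("hcooh", "a2", "a4", 1),
    SBond("hcooh", "a4", "a5", 1),
]


def neighbours(molecule, atom_id):
    """Atoms bonded to `atom_id`, with the bond code, in fact order."""
    found = []
    for b in BONDS:
        if b.molecule != molecule:
            continue
        if b.atom1 == atom_id:
            found.append((b.atom2, b.bond))
        elif b.atom2 == atom_id:
            found.append((b.atom1, b.bond))
    return found


def has_carbonyl(molecule):
    """True if the molecule contains a carbon-oxygen double bond."""
    element = {a.atom: a.element for a in ATOMS if a.molecule == molecule}
    return any(
        b.bond == 2 and {element[b.atom1], element[b.atom2]} == {"c", "o"}
        for b in BONDS
        if b.molecule == molecule
    )
```

Because molecules of any size fit the same two relations, examples of different length need no fixed-width attribute table.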


SAR: first-order frequent pattern

benzene(X,Y),
atom(X,Z,U,40,V), atom(X,W,X1,38,Y1), sbond(X,Z,W,2),
atom(X,Z1,U1,29,V1), atom(X,W1,X2,22,Y2), sbond(X,Z1,W1,7),
atom(X,Z2,U2,3,V2), sbond(X,W1,Z2,1),
atom(X,W2,o,40,X3), sbond(X,W,W2,2),
atom(X,Y3,n,38,Z3), sbond(X,W2,Y3,2), sbond(X,Z,Y3,2),
atom(X,U3,c,29,V3), sbond(X,W1,U3,7)


Carcinogenicity

230 aromatic and heteroaromatic nitro compounds

188 regression-friendly, easy to classify in attribute-value representation

42 regression-unfriendly

regression-unfriendly compounds

relational data mining (Aleph): accuracy 88%
classical (attribute-value) methods: about 20 % worse


Example 5: First-order rules for remote sensing

GRR

raster data, north-west part of Leicestershire, England

layers used: image, popln, rail, roads, topo, urban and water

positive examples: bush, i.e. landcov = 7; negative examples: the rest

domain knowledge : basic arithmetic relations like ≤ or >

learned 26 rules (accuracy 80.2 %). After removing the rules with accuracy lower than 70% on the learning set, the accuracy increased to 92 %.


Mining in geographic data


How to find the right hypothesis

Depends on
• language for representing examples
• hypothesis language
• (need for) domain knowledge

Attribute-value language is too poor
• if examples are of different length
• if structure (e.g. relations between attributes) is essential
• if domain knowledge cannot be expressed by examples

Example 6: Mining in vector data


Example 6: Mining in vector data


Visualization

graphical exploration of data

another method used for finding hidden knowledge in data

ideation – advanced visualization
“formation of ideas or concepts”

the goal is shared with computational data mining


CommonGIS

CommonGIS as a decision support system in tourism: territory of the canton of Wallis, Switzerland


CommonGIS

a tool for presenting spatio-temporal data in the form of thematic maps and for analysing them

works with two kinds of data: attribute tables that contain non-spatial information, and layers with geographically referenced data.

Each attribute table is connected with a layer.

ESPRIT project 28983 - Common Access to Geographically Referenced Data, 2000 – 2002


CommonGIS

Ideal point - ranks all options by comparison with a non-existent option that ideally satisfies the selected set of criteria.
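One common way to realise an ideal-point ranking is to rescale every criterion to [0, 1], place the ideal option at the best value of each criterion, and order the real options by their distance to it. The sketch below follows that reading. It is not CommonGIS's actual implementation, and the hotel data, the function name and the use of Euclidean distance are all assumptions.

```python
import math


def ideal_point_ranking(options, maximize):
    """Rank option names by Euclidean distance to the ideal point.

    options:  {name: tuple of criterion values}
    maximize: one bool per criterion (True = larger is better)
    """
    names = list(options)
    columns = [[options[name][j] for name in names]
               for j in range(len(maximize))]

    def scaled(j, value):
        """Map a raw value to [0, 1], where 1 is the best observed value."""
        low, high = min(columns[j]), max(columns[j])
        if high == low:
            return 1.0
        share = (value - low) / (high - low)
        return share if maximize[j] else 1.0 - share

    # The ideal option scores 1.0 on every criterion after rescaling.
    distance = {
        name: math.sqrt(sum((1.0 - scaled(j, value)) ** 2
                            for j, value in enumerate(options[name])))
        for name in names
    }
    return sorted(names, key=distance.__getitem__)


# Hypothetical options: (price per night, guest rating)
hotels = {"A": (80, 3.5), "B": (120, 4.5), "C": (100, 4.0), "D": (110, 3.0)}
ranking = ideal_point_ranking(hotels, maximize=(False, True))
```

The ideal option (cheapest price together with the highest rating) does not exist among the hotels; it only serves as the reference point for the ranking.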


Data mining for crisis management

distributed data mining

preserving data privacy when mining from private data sources

new data, both in terms of structure

stream data, text


Stream data mining

impossible or useless to remember all the data

mobile communication, email messages

spam detection

machine learning is successful

Naïve Bayes

number of false positives (ham recognized as spam) -> 0
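A Naïve Bayes filter suits stream data because it keeps only word counts, not the messages themselves. The sketch below is a minimal illustration under several assumptions: whitespace tokenisation, Laplace smoothing, a made-up four-message stream, and a high decision threshold (`spam_threshold`) as one simple way to push the number of false positives toward zero. The class and method names are hypothetical.

```python
import math
from collections import Counter


class StreamingNaiveBayes:
    """Multinomial Naive Bayes updated one message at a time."""

    def __init__(self, spam_threshold=0.95):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()
        self.vocabulary = set()
        self.spam_threshold = spam_threshold

    def update(self, text, label):
        """Fold one labelled message into the counts, then forget it."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.message_counts[label] += 1
        self.vocabulary.update(words)

    def _log_score(self, words, label):
        total_messages = sum(self.message_counts.values())
        # Laplace-smoothed class prior
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        total_words = sum(self.word_counts[label].values())
        vocabulary_size = len(self.vocabulary) + 1  # +1 for unseen words
        for word in words:
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocabulary_size))
        return score

    def spam_probability(self, text):
        words = text.lower().split()
        spam = self._log_score(words, "spam")
        ham = self._log_score(words, "ham")
        peak = max(spam, ham)  # subtract the maximum for numerical stability
        spam_weight = math.exp(spam - peak)
        ham_weight = math.exp(ham - peak)
        return spam_weight / (spam_weight + ham_weight)

    def is_spam(self, text):
        """Flag only when very confident, to keep false positives rare."""
        return self.spam_probability(text) >= self.spam_threshold


model = StreamingNaiveBayes()
stream = [  # hypothetical labelled messages
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting moved to monday", "ham"),
    ("flood warning for the river bank", "ham"),
]
for text, label in stream:
    model.update(text, label)
```

Raising the threshold trades missed spam for fewer blocked legitimate messages, which is the direction the slide asks for.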


New kind of data

need for new methods

mostly with a temporal and/or spatial coordinate

mobile communication networks

traffic monitoring systems

GPS

and text documents


Flood waters in the Czech capital, Prague, are surging towards new peak levels, with fears that river defences will be overwhelmed and the old city completely inundated. Parts of the city’s historic Mala Strana district are already submerged and many people fear the worst is still to come as water levels from the Vltava River rise further. Thousands of people have already left the capital – Prime Minister Vladimir Spidla had authorised rescue teams to use reasonable force to remove those refusing to leave their homes. Water has engulfed Prague’s Kampa island close to the Old Town, flooding historic palaces and villas. The city is now bracing itself for a second flood tide, threatening to overwhelm the historic Charles Bridge, one of this country’s biggest tourist symbols. Amid concerns that many people could be stranded without water and electricity, Prague’s Mayor Igor Nemec ordered up to 40,000 inhabitants to leave their homes.


Text

natural language learning

text categorization

information extraction

useful for filtering and understanding messages exchanged during a crisis situation


Thanks for your attention


Parity

C1 C2 C3 C4 C5 C6 C7 C8  class
1  0  1  1  0  0  0  0   -
1  0  1  1  0  0  0  1   +
1  1  1  1  0  0  0  1   -
0  1  1  1  0  0  0  1   +

No attribute can be removed
Decision tree will be very complex
Recursion is needed
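In the four rows shown, '+' coincides with an even number of ones and '-' with an odd number. A recursive definition captures this in two lines, whereas a decision tree would have to test every attribute on every path. In the sketch below the function names are chosen for illustration, and the even-means-'+' mapping is read off the four rows rather than stated on the slide.

```python
def parity(bits):
    """True when the number of ones in `bits` is odd (recursive definition)."""
    if not bits:
        return False  # the empty sequence contains zero ones
    # Each additional 1 flips the parity of the remaining bits.
    return (bits[0] == 1) != parity(bits[1:])


def label(bits):
    return "-" if parity(bits) else "+"


ROWS = [
    ((1, 0, 1, 1, 0, 0, 0, 0), "-"),
    ((1, 0, 1, 1, 0, 0, 0, 1), "+"),
    ((1, 1, 1, 1, 0, 0, 0, 1), "-"),
    ((0, 1, 1, 1, 0, 0, 0, 1), "+"),
]
rows_match = all(label(bits) == expected for bits, expected in ROWS)
```

Flipping any single bit changes the class, which is why no attribute can be dropped.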


[Figure: diagram with states “Yes” and “No” and transitions labelled “Input = 1” and “Input = 0”]