
Data Mining

Instructor information – Data Mining

M. Brescia - Data Mining - Lecture 1

Who is your instructor?

Degree in Computer Science from the University of Salerno (1994)

Consulting contracts with OAC - Osservatorio Astronomico di Capodimonte (1995 – 1999)

Research astronomer at INAF – OAC (1999 – retirement (maybe??))

Adjunct lecturer in Computer Architecture at the Department of Computer Science, University Federico II of Naples (2002 – 2007)

Associate lecturer in Astronomical Technologies at the Department of Physics, Federico II (since 2008)

Design and construction of large telescopes and instruments (optics, electronics, software engineering, quality control), project management, data mining and machine learning for large astrophysical data archives

OAC office tel. 081.5575553 – mobile 338.5354945 - e-mail: [email protected]

http://www.na.astro.it/~brescia.html – http://dame.dsf.unina.it

Course overview – Data Mining

The course covers the following topics:

12 lectures in total, 4 hours each except the last 2, which are 5 hours each = 50 hours

a) fundamentals of e-science and data warehousing;

b) fundamentals of data mining and Artificial Intelligence;

c) fundamentals of supervised machine learning;

d) fundamentals of unsupervised machine learning;

e) fundamentals of ICT for data mining;

f) practical examples of data mining and use cases.



The recent worldwide recognition of the concept of data-centric Science has led to a rapid spread and proliferation of new data mining methodologies. The key concept follows from the fourth paradigm of modern Science, namely "Knowledge Discovery in Databases" or KDD, after theory, experimentation and simulations. One of the main causes has been the evolution of technology and of all basic and applied sciences, which make the efficient exploration of data the main route to new discoveries.

Data mining therefore aims to manage and analyze huge quantities of heterogeneous data, relying on self-adaptive techniques and algorithms that belong to the Machine Learning paradigm.

This course therefore aims to provide the fundamental concepts underlying the theory of data mining, data warehousing and Machine Learning (neural networks, Fuzzy logic, genetic algorithms, Soft Computing), together with practical techniques derived from the state of the art of Information & Communication Technology (web 2.0 technologies, distributed computing and an introduction to programming on parallel architectures).

The course will include examples of data mining model development, using programming languages (C, C++, Java, CUDA C).



Course format:

1) Lectures (slides);

2) Group discussion;

3) (Exercises) and practical examples.

The lectures will be based on slides written almost exclusively in English (the literature is in fact predominantly in English, and it is worth getting used to consulting texts in a language other than Italian!)

At the end of each lecture, we will devote some time to open discussion.

All the course material, including slides, bibliography and useful web links, is available through a web page maintained by the instructor.

http://dame.dsf.unina.it/master.html

Data Mining


Data mining: core of the knowledge discovery process

[Figure: the knowledge discovery pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data, Data everywhere, yet ...

• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
• I can’t find the data I need
– data is scattered over the network
– many versions and formats

Most data will never be seen by humans…

Cascade of data

1 ZB = 1,000,000,000,000 GB = $10^9$ TB
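The conversion on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, i.e. 1 GB = 10^9 bytes; the `UNITS` table and the `convert` helper are illustrative names, not part of the course material.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def convert(value, src, dst):
    """Convert an integer quantity from unit `src` to unit `dst`.

    Uses integer division, so it is exact when `src` is the larger unit.
    """
    return value * UNITS[src] // UNITS[dst]

zb_in_gb = convert(1, "ZB", "GB")
zb_in_tb = convert(1, "ZB", "TB")
print(zb_in_gb)  # 1000000000000
print(zb_in_tb)  # 1000000000
```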

Tsunami of data

Small, big, in a network, isolated … modern devices produce large

amounts of data and EACH DATA which is produced needs to be

reduced, analyzed, interpreted…

Increase in number and size of

devices or in efficiency or in

number of bands … all cause an

increase in pixels (worldwide)

Computing time and costs do not

scale linearly with number of

pixels

Moore’s law does not apply anymore. The slopes have changed.

International Technology Roadmap for Semiconductors

Tsunami of data

For over two decades, before the advent of multi-core architectures, general-purpose CPUs were characterized, at each generation, by an almost linear increase in performance together with a decrease in cost, a trend also known as Moore’s Law (Moore 1965)


So, in order to maintain the cyclic hardware/software trend, software applications have had to change their perspective, moving towards parallel computing

Tsunami of data

The forerunner:

LHC

Data Stream: 330 TB/week

ATLAS detector event

Computationally demanding but still a relatively simple (embarrassingly parallel) KDD task

each CPU gets one event at a time

and needs to perform simple tasks
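A minimal sketch of what "embarrassingly parallel" means here: every event is processed independently, so a pool of workers can simply be handed one event at a time. The events and the `process_event` function are made up for illustration. Note also that a thread pool is used only to keep the example short; in CPython, CPU-bound work would normally go to a process pool or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def process_event(event):
    """Stand-in for per-event reconstruction: here, just sum the hit energies."""
    return sum(event)

# Made-up events: each one is an independent list of measurements.
events = [[1.0, 2.5], [0.5, 0.5, 3.0], [4.0]]

# Each worker receives one event at a time. No event depends on another,
# so the results are identical to those of a sequential loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_event, events))

print(results)  # [3.5, 4.0, 4.0]
```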

Tsunami of data

In astronomy as in many other sciences:

• Huge data sets (ca. PB)
• Thousands of different problems
• Many, many thousands of users

i.e. LHC is a “piece of cake” (simple computational model)

DATA INTENSIVE SCIENCE HAS BECOME A REALITY IN

ALMOST ALL FIELDS and poses worse problems

Tsunami of data

“One of the greatest challenges for

21st-century science is how we

respond to this new era of data

intensive science …

… This is recognized as a new

paradigm beyond experimental and

theoretical research and computer

simulations of natural phenomena -

one that requires new tools,

techniques, and ways of working.”

Jim Gray

Tsunami of data

Real world physics is too

complex. Validation of models

requires accurate simulations,

tools to compare simulations

and data, and better ways to

deal with complex & massive

data sets

Need to increase

computational and

algorithmic capabilities

beyond current and

expected technological

trends

Cosmological simulation.

The total number of

particles is 2,097,152

A new Science concept

Virtualization of Science and

Scholarship

Summary

Overture

• The world transformed
• Climbing the S-Curve
– Science in the exponential world
– Virtual Observatory: a case study
• The modern scientific process
– eScience and the new paradigms
– The evolution of computing
• Scientific communication and collaboration
– The rise of immersive virtual environments: Web 3.0?
• The growing synergies
– Exploring and building in cyberspace

Definitions

Definition: By Virtualization, I mean a migration of the scholarly work, data,

tools, methods, etc., to cyber-environments, today effectively the Web

This process is of course not limited to science and scholarship;

essentially all aspects of the modern society are undergoing the same

transformation

Cyberspace (today the Web, with all information and tools it connects) is

increasingly becoming the principal arena where humans interact with each

other, with the world of information, where they work, learn, and play

ICT Revolution

Information & Communication Technology

revolution is historically unprecedented -

in its impact it is like the industrial

revolution and the invention of printing

combined

Yet, most fields of science and scholarship have not

yet fully adopted the new ways of doing things, and

in most cases do not understand them well…

It is a matter of developing a new methodology of

science and scholarship for the 21st century

eScience

What Is This Beast Called e-Science?

It depends on whom you ask, but some

general properties include:

• Computationally enabled

• Data-intensive

• Geographically distributed resources (i.e., Web-based)

However:

• All science in the 21st century is becoming cyber-science (aka e-Science) –

so this is just a transitional phase

• There is a great emerging synergy of the computationally enabled science,

and the science-driven IT

Facing the Data Tsunami

Astronomy, all sciences, and every other modern field

of human endeavor (commerce, security, etc.) are

facing a dramatic increase in the volume and

complexity of data

• We are entering the second phase of the IT revolution: the rise of the

information/data driven computing

The challenges are universal, and growing:

– Management of large, complex,

distributed data sets

– Effective exploration of such data → new knowledge

Data complexity and volume

Multi-data fusion leads to a more complete, less

biased picture (also: multi-scale, multi-epoch, …)

Numerical simulations are also producing

many TB’s of very complex “data”

Data + Theory = Understanding

Understanding of complex phenomena requires complex data!

Exponential Growth in Data Volumes and

Complexity

An example: Astronomy

Astronomy Has Become Very Data-Rich

• Typical digital sky survey now generates ~ 10 - 100 TB, plus a

comparable amount of derived data products

– PB-scale data sets are on the horizon

• Astronomy today has ~ 1 - 2 PB of archived data, and generates a few

TB/day

– Both data volumes and data rates grow exponentially, with a

doubling time ~ 1.5 years

– Even more important is the growth of data complexity

• For comparison:

Human memory ~ a few hundred MB

Human Genome < 1 GB

1 TB ~ 2 million books

Library of Congress (print only) ~ 30 TB
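The doubling time quoted above translates directly into a growth factor: with doubling time T, the volume after t years is V(t) = V0 · 2^(t/T). The sketch below only restates that formula; the function names are illustrative and the 1.5-year default is the value given on the slide.

```python
import math

def growth_factor(years, doubling_time=1.5):
    """Multiplicative growth after `years` if the volume doubles every `doubling_time` years."""
    return 2.0 ** (years / doubling_time)

def years_to_reach(start, target, doubling_time=1.5):
    """Years needed to grow from `start` to `target` (both in the same unit)."""
    return doubling_time * math.log2(target / start)

print(growth_factor(15))        # 1024.0, i.e. ten doublings in 15 years
print(years_to_reach(1, 1024))  # about 15 years for a ~1000-fold increase
```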

The reaction

The Response of the Scientific Community to the IT Revolution

• The rise of Virtual Scientific Organizations:

– Discipline-based, not institution based

– Inherently distributed, and web-centric

– Always based on deep collaborations between domain scientists and applied

CS/IT scientists and professionals

– Based on an exponentially growing technology and thus rapidly evolving

themselves

– Do not fit into the traditional organizational structures

– Great educational and public outreach potential

• However: Little or no coordination and interchange between different scientific

disciplines

• Sometimes, entire new fields are created, e.g., bioinformatics, computational biology

The Virtual Observatory

The Virtual Observatory Concept

• A complete, dynamical, distributed, open research environment for the new

astronomy with massive and complex data sets

– Provide and federate content (data, metadata)

services, standards, and analysis/compute

services

– Develop and provide data exploration and

discovery tools

– Harness the IT revolution in the service of

astronomy

– A part of the broader e-Science / Cyber-Infrastructure

http://ivoa.net

http://us-vo.org

http://www.euro-vo.org

The world is flat

Professional Empowerment: Scientists and students anywhere with an internet connection should be able to do first-rate science (access to data and tools)

– A broadening of the talent pool in astronomy, leading to a substantial

democratization of the field

• They can also be substantial contributors, not only consumers

– Riding the exponential growth of the IT is far more cost effective than building

expensive hardware facilities, e.g., big telescopes, large accelerators, etc…

– Especially useful for countries without major research facilities

Probably the most important aspect of the IT

revolution in science

VO Education and Public Outreach

• Unprecedented opportunities in

terms of the content, broad

geographical and societal range, at

all levels

• Astronomy as a gateway to learning

about physical science in general,

as well as applied CS and IT

The Web has a truly

transformative potential

for education at all levels

VO (also as Virtual Organization) Functionality Today

What we did so far:

• Lots of progress on interoperability, standards, etc.

• An incipient data grid of astronomy

• Some useful web services

• Community training, EPO

What we did not do (yet):

• Significant data exploration and mining tools. That is where the science will

come from!

Thus, little VO-enabled science so far and a slow community buy-in

Development of powerful knowledge discovery tools should be a key priority

Donald Rumsfeld’s Epistemology

There are known knowns,
There are known unknowns, and
There are unknown unknowns

Or, in other words (Data Mining):

1. Optimized detection algorithms
2. Supervised clustering
3. Unsupervised clustering

The Mixed Blessings of Data Richness

Modern digital sky surveys typically contain ~ 10 – 100 TB, detect $N_{obj} \sim 10^8 - 10^9$ sources, with $D \sim 10^2 - 10^3$ parameters measured for each one, and multi-PB data sets are on the horizon

Potential for discovery:

– $\sim N_{obj}$ or data volume → Big surveys
– $\sim N_{surveys}^2$ (connections) → Data federation

Great! However … DM algorithms scale very badly:

– Clustering $\sim N \log N \to N^2$, $\sim D^2$
– Correlations $\sim N \log N \to N^2$, $\sim D^k$ ($k \ge 1$)
– Likelihood, Bayesian $\sim N^m$ ($m \ge 3$), $\sim D^k$ ($k \ge 1$)

Scalability and dimensionality reduction (without a significant loss of information) are

critical needs!
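To get a feeling for these scaling laws, one can compare how much extra work the same algorithm needs when the number of sources grows from 10^8 to 10^9. This is a rough sketch that ignores constant factors and the dependence on D; the cost functions are idealized and the helper names are illustrative.

```python
import math

def cost_nlogn(n):
    """Idealized cost of an N log N algorithm (constant factors ignored)."""
    return n * math.log2(n)

def cost_power(n, m):
    """Idealized cost of an N^m algorithm (constant factors ignored)."""
    return n ** m

def slowdown(cost, n_from, n_to):
    """How many times more work the same algorithm needs when N grows."""
    return cost(n_to) / cost(n_from)

n_from, n_to = 10**8, 10**9  # source counts quoted in the text
for name, cost in [("N log N", cost_nlogn),
                   ("N^2", lambda n: cost_power(n, 2)),
                   ("N^3", lambda n: cost_power(n, 3))]:
    print(f"{name:8s} x{slowdown(cost, n_from, n_to):.2f}")
```

A tenfold increase in N costs roughly 11 times more work for N log N, but 100 times for N^2 and 1000 times for N^3, which is why scalability is listed as a critical need.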

The Curse of Hyperdimensionality

Not a matter of hardware

or software, but new ideas

• Visualization!
• A fundamental limitation of human perception: $D_{MAX}$ = 3? 5? 10? (We can understand mathematically much higher dimensionalities, but cannot really visualize them; our own Neural Nets are powerful pattern recognition tools)
• Interactive visualization must be a key part of the data mining process
• Dimensionality reduction via machine discovery of patterns/substructures and correlations in the data?

[Figure: the DM toolkit, visualization, and the user]

Visualization

Effective visualization is the bridge between quantitative

information, and human intuition

Man cannot understand without images; the image is a likeness of a corporeal thing, but understanding is of the universal, abstracted from particulars

Aristotle, De Memoria et Reminiscentia

Data analysis

The key role of data analysis is to replace the raw complexity

seen in the data with a reduced set of patterns, regularities,

and correlations, leading to their theoretical understanding

However, the complexity (e.g., dimensionality) of data sets and

interesting, meaningful constructs in them is starting to exceed

the cognitive capacity of the human brain

Data understanding

This is a Very Serious Problem!

Hyperdimensional structures (clusters, correlations, etc.) are likely present in

many complex data sets, whose dimensionality is commonly in the range of

$D \sim 10^2 - 10^4$, and will surely grow

It is not only the matter of data understanding, but also of choosing the

appropriate data mining algorithms, and interpreting their results

• Things are rarely Gaussian in reality

• The clustering topology can be complex

What good are the data if we cannot effectively extract knowledge from them?

“A man has got to know his limitations”
Dirty Harry, an American philosopher

Knowledge Discovery in Databases

The new Science

Information Technology → New Science

• The information volume grows exponentially

Most data will never be seen by humans!

The need for data storage, network, database-related technologies, standards,

etc.

• Information complexity is also increasing greatly

Most data (and data constructs) cannot be comprehended by humans

directly!

The need for data mining, KDD, data understanding technologies,

hyperdimensional visualization, AI/Machine-assisted discovery …

• We need to create a new scientific methodology on the basis of applied CS

and IT

• Important for practical applications beyond science

Evolution of knowledge

The Evolving Paths to Knowledge

• The First Paradigm: Experiment/Measurement

• The Second Paradigm: Analytical Theory

• The Third Paradigm: Numerical Simulations

• The Fourth Paradigm: Data-Driven Science?

From numerical simulations…

Numerical Simulations:

A qualitatively new (and necessary) way

of doing theory, beyond analytical

approach

Formation

of a cluster of

galaxies

Turbulence in the Sun

Simulation output: a data set, the theoretical

statement, not an equation

…to the fourth paradigm

Is this really something qualitatively new, rather than the same old data

analysis, but with more data?

The information content of modern data sets is so high as to

enable discoveries which were not envisioned by the data

originators (data mining)

Data fusion reveals new knowledge which was implicitly

present, but not recognizable in the individual data sets

Complexity threshold for a human comprehension of complex

data constructs? Need new methods to make the data

understanding possible (machine learning)

Data Fusion + Data Mining + Machine Learning = The Fourth

Paradigm

The fourth paradigm

1. Experiment (ca. 3000 years)

2. Theory (a few hundred years)

mathematical description, theoretical models, analytical laws (e.g. Newton, Maxwell, etc.)

3. Simulations (a few tens of years)

Complex phenomena

4. Data-Intensive science

(and it is happening now!!)

http://research.microsoft.com/fourthparadigm/

Machine Learning

The Roles for Machine Learning and Machine Intelligence in

CyberScience:

Data processing:

Object / event / pattern classification

Automated data quality control (fault detection and repair)

Data mining, analysis, and understanding:

Clustering, classification, outlier / anomaly detection

Pattern recognition, hidden correlation search

Assisted dimensionality reduction for hyperdimensional visualisation

Workflow control in Grid-based apps

Data farming and data discovery: semantic web, and beyond

Code design and implementation: from art to science?

+

The way to produce new science

The old and the new

The Book and

the Cathedral …

… and

the Web, and

the Computer

Technologies for information

storage and access are

evolving, and so does scholarly

publishing

Worlds of knowledge

K. Popper, Objective Knowledge:

An Evolutionary Approach, 1972

Cyberspace is now

effectively World 3,

plus the ways of

interacting with it

Science Commons, or Discovery Space

Simulations

& Theory

Communication

& Collaboration

Published

Literature

Data

Archives

Origins of discovery

A Lot of Science Originates in

Discussions and Constructive

Interactions

This creative process can be

enabled and enhanced using

virtual interactive spaces,

including the Web2.0 tools

Computing as a Communication Tool

With the advent of the Web, most computing usage is no longer number crunching, but the search, manipulation, and display of data and information, and increasingly also human interaction (e.g., much of Web 2.0)

Information as communication

Information Technology as a Communication Medium:

Social Networking and Beyond

• Science originates on the interface between human minds, and the human minds and

data (measurements, structured information, output of simulations)

• Thus, any technology which facilitates these interactions is an enabling technology for

science, scholarship, and intellectual progress more generally

• Virtual Worlds (or immersive VR) are one such technology, and will likely revolutionize

the ways in which we interact with each other, and with the world of information we

create

• Thus, we started the Meta-Institute for Computational Astrophysics (MICA), the first professional scientific organization based entirely in VWs (Second Life)

Subjective experience quality much higher

than traditional videoconferencing (and it

can only get better as VR improves)

Effective worldwide telecommuting, at ~

zero cost

Professional conferences easily organized,

at ~ zero cost

http://slurl.com/secondlife/StellaNova

Immersive data visualization

Encode up to a dozen dimensions for a

parameter space representation

Interactive data exploration in a pseudo-

3D environment

Multicolor SDSS data set on

stars, galaxies and quasars

Immersive mathematical visualization

Pseudo-3D representation of highly-dimensional mathematical

objects

Potential research and educational uses: geometry, topology, etc.

A pseudo-3D projection

of a 248-dimensional

mathematical object

Personalization of Cyberspace

We inhabit the Cyberspace as

individuals

– and not just for work, but in very

personal ways, to express ourselves,

and to connect with others (“As we

may feel”?)

“We must all hang together, or assuredly we will all

hang separately”

Ben Franklin

e-Science is unified by a common

methodology and tools

The Truth About Social Networking

social networking as the intersection of narcissism, ADHD (Attention Deficit

Hyperactivity Disorder), and good old fashioned stalking

The Core business of Academia

• To discover, preserve, and disseminate knowledge
• To serve as a source of scientific and technological innovation
• To educate the new generations, in terms of the knowledge, skills, and tools

But when it comes to the adoption of computational tools and methods, innovation,

and teaching them to our students, we are doing very poorly – and yet, the science and

the economy of the 21st century depend critically on these issues

Is the discrepancy of time scales

to blame for this slow uptake?

• IT ~ 2 years
• Education ~ 20 years
• Career ~ 50 years
• Universities ~ 200 years

(Are universities obsolete?)

Some Thoughts about e-Science

Computational science ≠ Computer science

Numerical modeling

Data-driven science

Computational science

• Data-driven science is not about data, it is about knowledge

extraction (the data are incidental to our real mission)

• Information and data are (relatively) cheap, but the expertise is

expensive

o Just like the hardware/software situation

• Computer science as the “new mathematics”

o It plays the role in relation to other sciences which

mathematics did in ~ 17th - 20th century

o Computation as a glue/lubricant of interdisciplinarity

Some Transformative Technologies To Watch

• Cloud (mobile, ubiquitous) computing
– Distributed data and services
– Also mobile / ubiquitous computing
• Semantic Web
– Knowledge encoding and discovery infrastructure for the next generation Web
• Immersive & Augmentative Virtual Reality
– The human interface for the next generation Web, beyond the Web 2.0 social networking
• Machine Intelligence redux
– Intelligent agents as your assistants / proxies
– Human-machine intelligence interaction

A new set of disciplines: X-Informatics

Databases

Data structures

Computational

infrastructures

Computer networks

Numerical analysis

Data mining

Machine learning

Advanced programming

languages

Semantics

visualization

Formation of a new

generation of

scientists

ETC.

Within any X-informatics discipline, information granules are unique to that discipline, e.g.,

gene sequences in bio, the sky object in astro, and the spatial object in geo (such as points

and polygons in the vector model, and pixels in the raster model). Nevertheless the goals are

similar: transparent data re-use across sub-disciplines and within education settings,

information and data integration and fusion, personalization of user interactions with data

collection, semantic search and retrieval, and knowledge discovery. The implementation of

an X-informatics framework enables these semantic e-science research goals

Some Speculations

We create technology, and it changes us, starting with the grasping of sticks

and rocks as primitive tools, and continuing ever since

When the technology touches our minds, that process can have profound

evolutionary impact in the long term; VWs are one such technology

Development of AI seems inevitable, and its uses in assisting us with the

information management and knowledge discovery are already starting

In the long run, immersive VR may facilitate the co-evolution of human and

machine intelligence

Scientific and Technological Progress

Mining of Warehouse Data

Data Mining + Data Warehouse = Mining of Warehouse Data

• For organizational learning to take place, data from multiple sources must be gathered together and organized in a consistent and useful way – hence, Data Warehousing (DW);

• DW allows an organization to remember what it has noticed about its data;

• Data Mining techniques should be interoperable with data organized in a DW.

[Figure: observations, simulations, VO registries, etc. (the enterprise “database” of transactions) are copied, organized and summarized into the Data Warehouse, which feeds Data Mining]

Data Miners:

• “Farmers” – they know

• “Explorers” - unpredictable

DM 4-rule virtuous cycle

• Finding patterns is not enough

• Science business must:

– Respond to patterns by taking action

– Turning:

• Data into Information

• Information into Action

• Action into Value

• Hence, the Virtuous Cycle of DM:

1. Identify the problem

2. Mining data to transform it into actionable information

3. Acting on the information

4. Measuring the results

• Virtuous cycle implementation steps:

– Transforming data into information via:

• Hypothesis testing

• Profiling

• Predictive modeling

– Taking action

• Model deployment

• Scoring

– Measurement

• Assessing a model’s stability & effectiveness before it is used

DM: 11-step Methodology

1. Translate any opportunity (science case) into DM opportunity (problem)

2. Select appropriate data

3. Get to know the data

4. Create a model set

5. Fix problems with the data

6. Transform data to bring information

7. Build models

8. Assess models

9. Deploy models

10. Assess results

11. Begin again (GOTO 1)

The four rules reflect into an 11-step exploded strategy, at the base of DAME (Data Analysis,

Mining and Exploration) concept
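Steps 4, 7 and 8 of the methodology (create a model set, build models, assess models) can be sketched in a few lines. This is only an illustration under stated assumptions: the data are randomly generated, the model is a simple nearest-centroid classifier chosen for brevity, and none of the function names come from DAME.

```python
import random

def split_model_set(rows, test_fraction=0.3, seed=0):
    """Step 4: shuffle the model set and split it into training and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def build_model(train):
    """Step 7: nearest-centroid model, i.e. one mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))

def assess_model(model, test):
    """Step 8: fraction of held-out instances classified correctly."""
    hits = sum(predict(model, f) == y for f, y in test)
    return hits / len(test)

# Toy, made-up model set: two well-separated classes in a 2-feature space.
rng = random.Random(1)
model_set = (
    [([rng.gauss(0, 0.5), rng.gauss(0, 0.5)], "A") for _ in range(50)]
    + [([rng.gauss(5, 0.5), rng.gauss(5, 0.5)], "B") for _ in range(50)]
)
train, test = split_model_set(model_set)
model = build_model(train)
accuracy = assess_model(model, test)
print(f"accuracy on held-out data: {accuracy:.2f}")
```

Assessing the model on data it has not seen during training is what makes step 8 meaningful; measuring accuracy on the training part alone would overestimate it.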

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused

– Web data, e-commerce

– purchases at department stores

– Bank/Credit Card transactions

• Computers have become cheaper and more powerful

• Competitive Pressure is Strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite

– telescopes scanning the skies

– microarrays generating gene expression data

– scientific simulations generating terabytes of data

• Traditional techniques infeasible for raw data

• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation

Terminology

• Components of the input:

– Concepts: kinds of things that can be learned

• Aim: intelligible and operational concept description

– Instances: the individual, independent examples of a concept

• Note: more complicated forms of input are possible

– Features/Attributes: measuring aspects of an instance

• We will focus on nominal and numeric ones

– Patterns: ensemble (group/list) of features

• In a same dataset, a group of patterns are usually in a homogeneous format

(same number, meaning and type of features)
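The terminology above maps naturally onto a small data structure. The sketch below is one possible representation, not a prescribed one: the `Instance` class, the feature values and the labels are invented for illustration, and the homogeneity check only compares the number and type of features (not their meaning).

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Feature = Union[float, str]  # a numeric or a nominal attribute value

@dataclass
class Instance:
    """One independent example: a pattern (list of features) plus an optional concept label."""
    pattern: List[Feature]
    label: Optional[str] = None

def is_homogeneous(instances):
    """True if all patterns share the same number and type of features."""
    def signature(inst):
        return [type(value) for value in inst.pattern]
    first = signature(instances[0])
    return all(signature(inst) == first for inst in instances[1:])

# Hypothetical dataset: (magnitude, colour index, shape) for two sky objects.
dataset = [
    Instance([18.2, 0.45, "extended"], label="galaxy"),
    Instance([15.1, 1.20, "point"], label="star"),
]
print(is_homogeneous(dataset))  # True
```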

What’s a DM concept?

• Data Mining Tasks (Styles of learning):

Classification learning:

predicting a discrete class

Association learning:

detecting associations between features

Clustering:

grouping similar instances into clusters

Sequencing:

what events are likely to lead to later events

Forecasting:

what may happen in the future

Numeric prediction (Regression):

predicting a numeric quantity

• Concept: thing to be learned

• Concept description: output

of learning scheme
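The difference between classification and numeric prediction lies in the type of the target, not necessarily in the algorithm. As a minimal sketch, the same 1-nearest-neighbour rule below returns a discrete class or a numeric quantity depending on what the training instances carry; the data and names are made up for illustration.

```python
def nearest_neighbor(train, query):
    """Return the target of the training instance closest to `query`."""
    def dist2(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))
    features, target = min(train, key=lambda row: dist2(row[0]))
    return target

# Classification: the targets are discrete classes.
labelled = [([1.0, 1.0], "star"), ([5.0, 5.0], "galaxy")]
# Numeric prediction (regression): the targets are numbers.
measured = [([1.0, 1.0], 0.1), ([5.0, 5.0], 0.9)]

print(nearest_neighbor(labelled, [4.5, 5.2]))  # galaxy
print(nearest_neighbor(measured, [0.8, 1.1]))  # 0.1
```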

Effective DM process break-down

Market Analysis and Management

• Where does the data come from?—Credit card transactions, loyalty cards, discount

coupons, customer complaint calls, plus (public) lifestyle studies

• Target marketing

– Find clusters of “model” customers who share the same characteristics: interest, income

level, spending habits, etc.,

– Determine customer purchasing patterns over time

• Cross-market analysis—Find associations/co-relations between product sales, &

predict based on such association

• Customer profiling—What types of customers buy what products (clustering or

classification)

• Customer requirement analysis

– Identify the best products for different customers

– Predict what factors will attract new customers

• Provision of summary information

– Multidimensional summary reports

– Statistical summary information (data central tendency and variation)

Data quality and integrity problems

• Legacy systems no longer documented

• Outside sources with questionable quality procedures

• Production systems with no built in integrity checks and no integration

– Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan

• “And get it done quickly, we do not have time to worry about corporate

standards...”

• Same person, different spellings

– Agarwal, Agrawal, Aggarwal etc...

• Multiple ways to denote company name

– Persistent Systems, PSPL, Persistent Pvt. LTD.

• Use of different names

– mumbai, bombay

• Different account numbers generated by different applications for the same customer

• Required fields left blank

• Invalid product codes collected at point of sale

– manual entry leads to mistakes

– “in case of a problem use 9999999”
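Some of the problems listed above can be flagged automatically before the data reach the warehouse. The sketch below is a simple illustration, not a complete cleaning procedure: it uses the standard-library `difflib` module to spot near-identical spellings, and the field names, the 0.8 threshold and the helper functions are assumptions made for the example.

```python
import difflib

SENTINEL_CODES = {"9999999"}  # the "in case of a problem" code quoted above

def similar_names(names, cutoff=0.8):
    """Pairs of distinct names whose similarity ratio is at least `cutoff`."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if a != b and ratio >= cutoff:
                pairs.append((a, b))
    return pairs

def audit_record(record, required=("customer", "product_code")):
    """List the integrity problems found in one record (a dict)."""
    problems = []
    for field in required:
        if not str(record.get(field, "")).strip():
            problems.append(f"missing {field}")
    if record.get("product_code") in SENTINEL_CODES:
        problems.append("sentinel product code")
    return problems

print(similar_names(["Agarwal", "Agrawal"]))
print(similar_names(["mumbai", "bombay"]))
print(audit_record({"customer": "", "product_code": "9999999"}))
```

String similarity catches spelling variants but not genuinely different names for the same thing (mumbai / bombay are not flagged); those typically need an explicit mapping table.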

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business/research context.

• Data should be integrated across the

enterprise

• Summary data has a real value to the

organization

• Historical data holds the key to understanding

data over time

• What-if capabilities are required

DW is a process of transforming data into information and

making it available to users in a timely enough manner to

make a difference

A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

The evolution of data analysis

Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: Computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
  Business question: "What’s likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
  Characteristics: Prospective, proactive information delivery

Definition of a Massive Data Set

• TeraBytes -- $10^{12}$ bytes: Astrophysical observation (per night)
• PetaBytes -- $10^{15}$ bytes: Geographic Information Systems or Astrophysical Survey Archive
• ExaBytes -- $10^{18}$ bytes: National Medical Records
• ZettaBytes -- $10^{21}$ bytes: Weather images
• YottaBytes -- $10^{24}$ bytes: Intelligence Agency Videos

DM, Operational systems and DW

What makes data mining possible?

• Advances in the following areas are making data mining deployable:

– data warehousing

– Operational systems

– the emergence of easily deployed data mining tools and

– the advent of new data mining techniques (Machine Learning)

OLTP vs OLAP

Aspect | OLTP | OLAP
Application | Operational: ERP, CRM, legacy apps, ... | Management Information System, Decision Support System
Typical users | Staff | Managers, Executives
Horizon | Weeks, Months | Years
Refresh | Immediate | Periodic
Data model | Entity-relationship | Multi-dimensional
Schema | Normalized | Star
Emphasis | Update | Retrieval

OLTP and OLAP are complementary technologies. You can't live without OLTP: it runs your business day by day. Getting strategic information directly from OLTP is therefore usually the first “quick and dirty” approach, but it can become limiting later.

OLTP (On-Line Transaction Processing) is a data modeling approach typically used to facilitate and manage day-to-day business applications. Most applications you see and use are OLTP based.

OLAP (On-Line Analytical Processing) is an approach to answering multi-dimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems but is still widely underused: every day I see too many people doing business intelligence on OLTP data!

With the constant growth of data analysis and intelligence applications, understanding the benefits of OLAP is a must if you want to provide valid and useful analytics to management.

Examples of OLTP data systems

Data                Industry            Usage                         Technology                                                  Volumes
Customer File       All                 Track customer details        Legacy application, flat files, mainframes                  Small-medium
Account Balance     Finance             Control account activities    Legacy applications, hierarchical databases, mainframe      Large
Point-of-Sale data  Retail              Generate bills, manage stock  ERP, client/server, relational databases                    Very large
Call Record         Telecommunications  Billing                       Legacy application, hierarchical database, mainframe        Very large
Production Record   Manufacturing       Control production            ERP (Enterprise Resource Planning), relational databases    Medium

Why Separate Data Warehouse?

• Function of DW for DM (outside data mining)

Missing data: Decision support requires historical data, which operational databases do not typically maintain.

Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.

Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
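The data consolidation and data quality points can be illustrated with a minimal sketch. The two source systems, their field names, the gender codes and the date formats below are all hypothetical, invented only to show what reconciling inconsistent representations looks like in practice.

```python
from datetime import datetime

# Hypothetical extracts from two operational sources that encode the same
# attributes differently (field names, codes and formats are invented).
billing_rows = [
    {"cust_id": "c1", "gender": "M", "signup": "2011-03-05"},
    {"cust_id": "c2", "gender": "F", "signup": "2011-04-17"},
]
crm_rows = [
    {"customer": "C3", "sex": "male", "since": "05/06/2011"},    # day/month/year
    {"customer": "C4", "sex": "female", "since": "21/07/2011"},
]

GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def reconcile(customer_id, gender, date_text, date_format):
    """Map one source record onto the single representation used downstream."""
    return {
        "customer_id": customer_id.upper(),
        "gender": GENDER_CODES[gender.lower()],
        "signup_date": datetime.strptime(date_text, date_format).date().isoformat(),
    }


def consolidate(billing, crm):
    """Merge both sources into one consistently coded list of records."""
    merged = [reconcile(r["cust_id"], r["gender"], r["signup"], "%Y-%m-%d") for r in billing]
    merged += [reconcile(r["customer"], r["sex"], r["since"], "%d/%m/%Y") for r in crm]
    return merged


warehouse_rows = consolidate(billing_rows, crm_rows)
```

After consolidation every record uses the same identifier casing, the same gender codes and ISO dates, regardless of which source it came from.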

• Operational Systems are OLTP systems (DW is OLAP)

– Run mission critical applications

– Need to work with stringent performance requirements for routine

tasks

– Used to run a business!

– Optimized to handle large numbers of simple read/write transactions

– RDBMS have been used for OLTP systems

So, what’s different?

Application-Orientation vs. Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings

Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

OLTP vs Data Warehouse

• OLTP (run a business)

– Application Oriented
– Used to run business
– Detailed data
– Current up to date
– Isolated Data
– Repetitive access
– Office worker User
– Performance Sensitive
– Few Records accessed at a time (tens)
– Read/Update Access
– No data redundancy
– Database Size 100 MB to 100 GB

• Warehouse (optimize a business)

– Subject Oriented
– Used to analyze business
– Summarized and refined
– Snapshot data
– Integrated Data
– Ad-hoc access
– Knowledge User (Manager)
– Performance relaxed
– Large volumes accessed at a time (millions)
– Mostly Read (Batch Update)
– Redundancy present
– Database Size 100 GB to a few TB

OLAP and Data Marts

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.

• Data marts and OLAP servers are departmental solutions supporting a handful of users

• Million dollar massively parallel hardware is needed to deliver fast response times for complex queries

• OLAP servers require massive indices

• Data warehouses must be at least 100 GB to be effective

Components of the Warehouse

• Data Extraction and Loading

• The Warehouse

• Analyze and Query -- OLAP Tools

• Metadata

• Data Mart

• Data Mining

[Figure: components of the warehouse. Labels: Relational Databases, Legacy Data, Purchased Data, Enterprise Resource Planning Systems; Extraction/Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze/Query]

True data warehouses

[Figure: Data Sources, Data Warehouse, Data Marts]

With data mart centric DWs, if you end up creating multiple warehouses, integrating them is a problem.

DW Query Processing - Indexing

Exploiting indexes to reduce scanning of data is of crucial importance

Bitmap Indexes

Join Indexes

Other Issues

Text indexing

Parallelizing and sequencing of index builds and incremental updates

• Bitmap indexing:

– A collection of bitmaps -- one for each distinct value of the column

– Each bitmap has N bits where N is the number of rows in the table

– A bit corresponding to a value v for a row r is set if and only if r has the value v for the indexed attribute

Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H

Region Index
Row ID  N  S  E  W
1       1  0  0  0
2       0  1  0  0
3       0  0  0  1
4       0  0  0  1
5       0  1  0  0
6       0  0  0  1
7       1  0  0  0

Rating Index
Row ID  H  M  L
1       1  0  0
2       0  1  0
3       0  0  1
4       1  0  0
5       0  0  1
6       0  0  1
7       1  0  0

Query: Customers where Region = W And Rating = M
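The bitmap scheme above can be sketched in a few lines. This is only an illustration that stores each bitmap as a Python integer (bit i stands for row i); the function names are made up for the example, and real warehouse engines use compressed bitmap structures instead.

```python
# Base table from the slide: (customer, region, rating).
BASE_TABLE = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "N", "H"),
]


def build_bitmap_index(rows, column, values):
    """Return {value: bitmap}; bit i of a bitmap is set iff row i holds that value."""
    index = {value: 0 for value in values}
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return index


def customers_from_bitmap(bitmap, rows):
    """Translate a result bitmap back into the matching customer identifiers."""
    return [row[0] for i, row in enumerate(rows) if (bitmap >> i) & 1]


region_index = build_bitmap_index(BASE_TABLE, 1, ["N", "S", "E", "W"])
rating_index = build_bitmap_index(BASE_TABLE, 2, ["H", "M", "L"])

# A conjunctive query becomes a bitwise AND of two bitmaps; no table scan needed.
west_and_medium = region_index["W"] & rating_index["M"]
west_and_low = region_index["W"] & rating_index["L"]
```

With the slide's data, the query Region = W And Rating = M selects no customer, while Region = W And Rating = L selects C3 and C6.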


• Join indexing

– Pre-computed joins

– A join index between a fact table and a dimension table correlates a

dimension tuple with the fact tuples that have the same value on the

common dimensional attribute

• e.g., a join index on city dimension of calls fact table

• correlates for each city the calls (in the calls table) from that city
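A join index of this kind can be sketched as a precomputed mapping from each dimension key to the positions of the matching fact rows. The city names, call records and helper names below are hypothetical; the point is only that the join is computed once, ahead of query time.

```python
from collections import defaultdict

# Hypothetical dimension and fact tables (contents invented for illustration).
city_dimension = {1: "Naples", 2: "Rome", 3: "Milan"}   # city_id -> city name
calls_fact = [                                          # (call_id, city_id, minutes)
    (100, 1, 5), (101, 2, 12), (102, 1, 3), (103, 3, 7), (104, 1, 9),
]


def build_join_index(fact_rows, key_position):
    """Precompute, for each dimension key, the positions of the matching fact rows."""
    index = defaultdict(list)
    for position, row in enumerate(fact_rows):
        index[row[key_position]].append(position)
    return dict(index)


join_index = build_join_index(calls_fact, key_position=1)


def calls_from(city_name):
    """Return the calls from one city using the join index, not a scan of the facts."""
    city_ids = [cid for cid, name in city_dimension.items() if name == city_name]
    return [calls_fact[pos] for cid in city_ids for pos in join_index.get(cid, [])]
```

For example, `calls_from("Naples")` follows the index entry for city 1 and returns only the three matching fact rows.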

[Figure: Calls fact table with dimensions Time, Location and Plan; combinations shown: C+T, C+T+L, C+T+L+P]


• Parallel query processing:

– Three forms of parallelism

• Independent

• Pipelined

• Partitioned and “partition and replicate”

– Deterrents to parallelism

• startup

• Communication

– Partitioned Data

• Parallel scans

• Yields I/O parallelism

– Parallel algorithms for relational operators

• Joins, Aggregates, Sort

– Parallel Utilities

• Load, Archive, Update, Parse, Checkpoint, Recovery

– Parallel Query Optimization
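Partitioned parallelism for an aggregate can be sketched as follows: the data is split into partitions, each partition is scanned independently, and the partial results are combined. The sales data and function names are invented for the example. Threads are used here only to show the structure; in standard CPython they do not speed up CPU-bound work, so this is not a performance demonstration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows round-robin into n_parts partitions."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_sum(rows):
    """Scan one partition and return its local aggregate."""
    return sum(amount for _, amount in rows)


def parallel_total(rows, n_parts=4):
    """Partitioned aggregate: scan the partitions concurrently, then combine."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return sum(pool.map(partial_sum, partition(rows, n_parts)))


# Hypothetical fact rows: (store, amount).
sales = [(f"store{i % 5}", i) for i in range(1, 101)]
total = parallel_total(sales)
```

The startup and communication costs listed above correspond here to creating the pool and collecting the partial sums, which is why very small inputs gain nothing from being partitioned.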

OLAP Representation

OLAP Is FASMI

• Fast

• Analysis

• Shared

• Multidimensional

• Information

[Figure: data cube with axes Month (1 to 7), Product (Toothpaste, Juice, Cola, Milk, Cream, Soap) and a third axis labeled W, S, N]

• Online Analytical Processing: term coined by E. F. Codd in a 1994 paper commissioned by Arbor Software*

• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System

• OLAP = Multidimensional Database

• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)

• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

OLAP vs SQL

Limitation of SQL:

“A Freshman in Business needs a Ph.D. in SQL”

Ralph Kimball

• OLAP:

– powerful visualization paradigm

– fast, interactive response times

– good for analyzing time series

– It finds some clusters and outliers

– Many vendors offer OLAP tools

– Embedded SQL Extensions

• Nature of OLAP Analysis:

– Aggregation -- (total sales, percent-to-total)

– Comparison -- Budget vs. Expenses

– Ranking -- Top 10, quartile analysis

– detailed and aggregate data

– Complex criteria specification

– Visualization
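The first three kinds of analysis (aggregation, percent-to-total, ranking) can be illustrated with a small sketch. The sales records and variable names are hypothetical, and plain dictionaries stand in for what an OLAP tool would compute.

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, units).
sales = [
    ("Milk", "N", 40), ("Milk", "S", 60), ("Cola", "N", 30),
    ("Cola", "W", 20), ("Soap", "S", 10), ("Juice", "W", 40),
]


def total_by(records, position):
    """Aggregate units along one dimension (0 = product, 1 = region)."""
    totals = defaultdict(int)
    for record in records:
        totals[record[position]] += record[2]
    return dict(totals)


# Aggregation: total units per product and overall.
by_product = total_by(sales, 0)
grand_total = sum(by_product.values())

# Percent-to-total: share of each product in the grand total.
percent_to_total = {p: 100 * units / grand_total for p, units in by_product.items()}

# Ranking: the two best-selling products.
top_2 = sorted(by_product, key=by_product.get, reverse=True)[:2]
```

Changing the `position` argument from 0 to 1 gives the same analysis by region, which is the kind of dimension switch that OLAP tools make interactive.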

Relational OLAP

Data Warehouse Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in industry standard RDBMS.

Generate SQL execution plans in the engine to obtain OLAP functionality.

Obtain multi-dimensional reports from the DS Client.
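The relational approach can be sketched with Python's built-in sqlite3 module: atomic data sits in an ordinary table, and a thin engine layer generates the SQL for whatever combination of dimensions the client asks for. The `sales` table, its columns and the `rollup_sql` helper are assumptions made for this example, not part of any ROLAP product.

```python
import sqlite3

# Atomic data stored in a standard relational table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, month INTEGER, units INTEGER);
    INSERT INTO sales VALUES
        ('Milk', 'N', 1, 10), ('Milk', 'S', 1, 20), ('Milk', 'N', 2, 30),
        ('Cola', 'N', 1, 5),  ('Cola', 'W', 2, 15);
""")

ALLOWED_DIMENSIONS = {"product", "region", "month"}


def rollup_sql(dimensions):
    """Generate the GROUP BY statement for the requested dimensions."""
    if not dimensions or set(dimensions) - ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimensions: {dimensions}")
    columns = ", ".join(dimensions)
    return f"SELECT {columns}, SUM(units) FROM sales GROUP BY {columns} ORDER BY {columns}"


def rollup(dimensions):
    """Run the generated SQL and return the aggregated rows."""
    return conn.execute(rollup_sql(dimensions)).fetchall()


by_product = rollup(["product"])
by_product_region = rollup(["product", "region"])
```

Every report is computed from the atomic rows at query time, which keeps storage small but makes response time depend on the size of the table.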

Multi-Dimensional OLAP

MDDB Engine MDDB Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in a proprietary MD data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.

Obtain multi-dimensional reports from the DS Client.
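The multi-dimensional approach can be contrasted with a sketch that pre-calculates the totals for every combination of dimensions, so that a query becomes a lookup. A nested Python dictionary stands in for the proprietary MDDB structure here; the facts and function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region", "month")

# Hypothetical atomic facts: (product, region, month, units).
facts = [
    ("Milk", "N", 1, 10), ("Milk", "S", 1, 20), ("Milk", "N", 2, 30),
    ("Cola", "N", 1, 5), ("Cola", "W", 2, 15),
]


def precompute_cube(rows, n_dims):
    """Pre-calculate SUM(units) for every subset of dimensions (2**n_dims group-bys)."""
    cube = {}
    for size in range(n_dims + 1):
        for dims in combinations(range(n_dims), size):
            cells = defaultdict(int)
            for row in rows:
                cells[tuple(row[d] for d in dims)] += row[n_dims]
            cube[dims] = dict(cells)
    return cube


cube = precompute_cube(facts, len(DIMENSIONS))


def lookup(**fixed):
    """Answer an aggregate query by dictionary lookup instead of scanning the facts."""
    dims = tuple(i for i, name in enumerate(DIMENSIONS) if name in fixed)
    key = tuple(fixed[DIMENSIONS[i]] for i in dims)
    return cube[dims].get(key, 0)
```

For instance, `lookup(product="Milk", region="N")` reads a stored cell rather than re-aggregating, at the cost of storing all eight group-bys up front.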

OLAP Problem: too much data!

[Figure: Data Explosion Syndrome. Number of aggregations (vertical axis, 0 to 70000) versus number of dimensions (horizontal axis, 2 to 8), with 4 levels in each dimension. Plotted values: 16, 81, 256, 1024, 4096, 16384, 65536]
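The growth shown in the chart can be reproduced with a one-line formula, under the assumption that an aggregation is obtained by picking one of the 4 levels independently in each dimension, which gives 4 raised to the number of dimensions. This assumption matches the plotted values for 2 and for 4 to 8 dimensions; for 3 dimensions it gives 64, whereas the chart shows 81.

```python
def aggregation_count(n_dimensions, levels_per_dimension=4):
    """Number of level combinations when one level is chosen per dimension."""
    return levels_per_dimension ** n_dimensions


# Counts for 2 to 8 dimensions, the range covered by the chart.
counts = {n: aggregation_count(n) for n in range(2, 9)}
```

Each additional dimension multiplies the number of aggregations by the number of levels, which is why pre-calculating every outcome quickly becomes impractical.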

OLAP Solution: Metadata

With a unified metadata source and definition, the business can go further on the analysis journey. OLAP reporting spreads across the organization, with greater access for all employees. Data mining models become more accurate, as the models can be trained and scored on larger data sets.

The primary rationale for data warehousing is to provide businesses with analytics results from data mining, OLAP and reporting. The ability to obtain front-end analytics is reduced if ensuring data quality is expensive all along the pipeline from data source to analytical reporting.

Data Flow after Company-wide

Metadata Implementation

Data Warehouse pitfalls

• You are going to spend much time extracting, cleaning, and loading data

• Despite best efforts at project management, data warehousing project scope will increase

• You are going to find problems with systems feeding the data warehouse

• You will find the need to store data not being captured by any existing system

• You will need to validate data not being validated by transaction processing systems

• For interoperability among worldwide data centers, you need to move massive data sets on the network:

DISASTER!

Data ↔ Applications?

Moving programs, not data: the true bottleneck

Data Mining + Data Warehouse = Mining of Warehouse Data

• For organizational learning to take place, data from many sources must be gathered together and organized in a consistent and useful way: hence, Data Warehousing (DW);

• DW allows an organization to remember what it has noticed about its data;

• Data Mining apps should be interoperable with data organized and shared among DWs.

Interoperability scenarios

1. DA1 and DA2 (data+apps exchange)
– Full interoperability between DA (Desktop Applications)
– Local user desktop fully involved (requires computing power)

2. DA and WA (Web Applications) (data+apps exchange)
– Full WA to DA interoperability
– Partial DA to WA interoperability (such as remote file storing)
– MDS (massive data sets) must be moved between local and remote apps
– User desktop partially involved (requires minor computing and storage power)

3. WA and WA (data+apps exchange)
– Except for URI exchange, no interoperability and different accounting policies
– MDS must be moved between remote apps (but larger bandwidth)
– No local computing power required

Improving Aspects

[Figure: WA1 and WA2 exchanging plugins]

DAs have to become WAs

Unique accounting policy (Google/Microsoft like)

To overcome MDS flow, apps must be plug&play (e.g. any WAx feature should be pluggable in WAy on demand)

No local computing power required. Smartphones can also run VO apps

Requirements

• Standard accounting system;

• No more MDS moving on the web, but just moving Apps, structured as plugin repositories and

execution environments;

• standard modeling of WA and components to obtain the maximum level of granularity;

• Evolution of SAMP architecture to extend web interoperability (in particular for the migration

of the plugins);

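Plugin granularity flow

[Figure: WAx with plugins Px-1, Px-2, Px-3, …, Px-n and WAy with plugins Py-1, Py-2, …, Py-n; plugin Px-3 also appears on the WAy side. Surviving step label: "3. WAy executes Px-3"]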

This scheme could be iterated and extended between more standardized web apps
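The idea of moving the application instead of the data can be sketched as follows. Only step 3 of the flow (WAy executes Px-3) survives in the slide, so the fetch-on-demand protocol, the `WebApp` class and its methods are assumptions made for illustration; in this toy version a plugin is simply a Python function.

```python
class WebApp:
    """Toy web application holding a local data set and a plugin repository."""

    def __init__(self, name, data):
        self.name = name
        self.data = data      # the data set stays where it is; it is never shipped
        self.plugins = {}     # plugin name -> callable

    def register(self, plugin_name, function):
        """Publish a plugin in this application's repository."""
        self.plugins[plugin_name] = function

    def run(self, plugin_name, provider=None):
        """Run a plugin on the local data, fetching it from provider on demand."""
        if plugin_name not in self.plugins:
            if provider is None or plugin_name not in provider.plugins:
                raise KeyError(f"{plugin_name} is not available")
            # Only the (small) plugin crosses the network, not the data set.
            self.plugins[plugin_name] = provider.plugins[plugin_name]
        return self.plugins[plugin_name](self.data)


wa_x = WebApp("WAx", data=[])
wa_y = WebApp("WAy", data=[3.0, 1.0, 2.0])
wa_x.register("Px-3", lambda values: sum(values) / len(values))

# WAy lacks Px-3, obtains it from WAx, then executes it on its own data.
result = wa_y.run("Px-3", provider=wa_x)
```

After the call, WAy keeps Px-3 in its own repository, so repeated exchanges make the two plugin sets converge.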

The Lernaean Hydra

[Figure: WAx with plugins Px-1, Px-2, Px-3, …, Px-n and WAy with plugins Py-1, Py-2, …, Py-n]

After a certain number of such iterations…

[Figure: a single set of plugins Px-1, Px-2, Px-3, …, Px-n, Py-1, Py-2, …, Py-n]

The synchronization of plugin releases between WSs is performed at request time.

The scenario will become:

– No different WSs, but simply one WS with several sites (possibly with different GUIs and computing environments)
– All WS sites can become mirror sites of all the others
– Minimization of data exchange flow (just a few plugins in case of synchronization between mirrors)

Web 2.0

Web 2.0? It is a system that breaks with the old model of centralized Web sites and moves the

power of the Web/Internet to the desktop. [J. Robb]

the Web becomes a universal, standards-based integration platform. [S. Dietzen]

Conclusions

e-Science is a transitional phenomenon, and will become an overall research environment of the data-rich, computationally enabled science of the 21st century.

Essentially all of humanity’s activities are being virtualized in some way, science and scholarship included.

We see growing synergies and co-evolution between science, technology, society, and individuals, with an increasing fusion of the real and the virtual.

Cyberspace, now embodied through the Web and its participants, is the arena in which these processes unfold.

VR technologies may revolutionize the ways in which humans interact with each other, and with the world of information.

A synthesis of the semantic Web, immersive and augmentative VW, and machine intelligence may shape our world profoundly.

