Data warehousing
Data Warehousing & Data Mining
By Mandar Kulkarni
PRN 10030141129
MBA-IT, SICSR
Contents
• Data warehousing
• Understanding data warehousing
• Data warehouse architecture
• Data Mining
• Data mining techniques
Warehouse?
Real time example?
Data Warehousing
[Diagram: Samsung keeps separate branch databases in Mumbai, Delhi, Chennai, and Bangalore. The Sales Manager needs the sales per item type per branch for the first quarter.]
• The sales manager wants to know the sales per item type per branch for the first quarter.
• Solution
  – Extract the information from each branch database, store it in a single place, and process it with query and analysis tools.
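Solution
[Diagram: data from the Mumbai, Delhi, Chennai, and Bangalore branches is loaded into the Data Warehouse. Query & Analysis tools run against the warehouse and produce a Report for the Sales Manager.]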
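The solution above can be sketched in a few lines of Python. This is only an illustration: the branch records, the field layout (item type, month, amount), and the function names are assumptions made for the example, not data from the slides.

```python
from collections import defaultdict

# Hypothetical per-branch sales records: (item_type, month, amount).
branch_databases = {
    "Mumbai": [("TV", 1, 120.0), ("Phone", 2, 80.0), ("TV", 4, 50.0)],
    "Delhi": [("TV", 3, 70.0), ("Phone", 1, 40.0)],
    "Chennai": [("Phone", 2, 60.0)],
    "Bangalore": [("TV", 2, 90.0), ("Phone", 5, 30.0)],
}


def build_warehouse(databases):
    """Copy every branch's rows into one list, tagging each row with its branch."""
    warehouse = []
    for branch, rows in databases.items():
        for item_type, month, amount in rows:
            warehouse.append(
                {"branch": branch, "item_type": item_type, "month": month, "amount": amount}
            )
    return warehouse


def sales_per_item_per_branch(warehouse, months):
    """Total the amounts per (branch, item_type) for the given months."""
    totals = defaultdict(float)
    for row in warehouse:
        if row["month"] in months:
            totals[(row["branch"], row["item_type"])] += row["amount"]
    return dict(totals)


warehouse = build_warehouse(branch_databases)
q1_report = sales_per_item_per_branch(warehouse, months={1, 2, 3})

for (branch, item_type), total in sorted(q1_report.items()):
    print(f"{branch:10s} {item_type:6s} {total:8.2f}")
```

Once the rows sit in one place, the first-quarter question becomes a single pass over one store instead of four separate branch queries.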
Operational Systems
• Running the business in real time
• Routine tasks
• Decision Support Systems (DSS)
  – Help in taking actions
• Used by people who deal with customers, products
• They are increasingly used by customers
Data Warehouse
• A single, complete, and consistent store of data obtained from a variety of different sources and made available to end users in a way they can understand and use in a business context.
• A process of transforming data into information and making it available to users in a timely enough manner to make a difference
Definition
• An integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making
Data warehouse architecture
[Diagram: Source Data (Production, Internal, External, Archived) feeds Data Staging, which loads Data Storage (Data Warehouse DBMS, Data Marts, MDDB, Metadata). Information Delivery (Report/Query, OLAP, Data Mining) reads from storage. Management & Control spans all stages.]
Components
• Source Data
• Data Staging (data extraction, cleaning, and loading)
  – Talend is an example of an open source ETL tool
• Data Storage
• Information Delivery (EIS)
• Management and control
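The data staging component (extraction, cleaning, and loading) can be illustrated with a minimal Python sketch. The comma-separated input format, the cleaning rules, and the function names `extract`, `clean`, and `load` are assumptions chosen for the example. They do not reflect how Talend or any other ETL tool works internally.

```python
def extract(raw_lines):
    """Extraction: split comma-separated source lines into raw field lists."""
    return [line.split(",") for line in raw_lines]


def clean(records):
    """Cleaning: trim whitespace, normalize case, drop malformed rows."""
    cleaned = []
    for record in records:
        if len(record) != 3:
            continue  # wrong number of fields
        branch, item, amount = (field.strip() for field in record)
        try:
            value = float(amount)
        except ValueError:
            continue  # amount is not numeric
        cleaned.append((branch.title(), item.upper(), value))
    return cleaned


def load(rows, store):
    """Loading: append cleaned rows to the target store and report the count."""
    store.extend(rows)
    return len(rows)


# Hypothetical source lines with typical quality problems.
source_lines = [" mumbai , tv , 120.5", "DELHI,phone,abc", "chennai,phone,60", "broken line"]
staging_store = []
loaded = load(clean(extract(source_lines)), staging_store)
print(loaded, staging_store)
```

Keeping the three steps as separate functions mirrors the staging area's role: bad rows are rejected before they ever reach data storage.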
OLAP
• Online Analytical Processing tools
• DSS tools that use multidimensional data analysis techniques
  – Support for a DSS data store
  – Data extraction and integration filter
  – Specialized presentation interface
• Example: Oracle OLAP 11g
Multidimensional analysis
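As a rough illustration of multidimensional analysis, the sketch below treats sales facts as points in a small cube with three dimensions and shows two common operations, roll-up and slice. The fact rows, the dimension names, and the helper functions are hypothetical. A real OLAP server would precompute and index such aggregates rather than scan a list.

```python
from collections import defaultdict

# Hypothetical fact rows: (branch, item_type, quarter, sales).
facts = [
    ("Mumbai", "TV", "Q1", 100),
    ("Mumbai", "Phone", "Q1", 50),
    ("Delhi", "TV", "Q1", 70),
    ("Delhi", "TV", "Q2", 30),
    ("Mumbai", "TV", "Q2", 20),
]
DIMENSIONS = ("branch", "item_type", "quarter")


def roll_up(rows, keep):
    """Sum sales over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in rows if row[position] == value]


by_branch = roll_up(facts, keep=("branch",))
q1_by_item = roll_up(slice_cube(facts, "quarter", "Q1"), keep=("item_type",))
print(by_branch)
print(q1_by_item)
```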
OLAP architecture
12 Rules of Data Warehouse
1. Data warehouse and operational environments are separated
2. Data is integrated
3. Contains historical data over a long period of time
4. Data is snapshot data captured at a given point in time
5. Data is subject-oriented
6. Mainly read-only with periodic batch updates
7. Development life cycle has a data-driven approach versus the traditional process-driven approach
8. Data contains several levels of detail: current, old, lightly summarized, highly summarized
9. Environment is characterized by read-only transactions to very large data sets
10. System that traces data sources, transformations, and storage
11. Metadata is a critical component
  – Source, transformation, integration, storage, relationships, history, etc.
12. Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users
OLTP vs. Data Warehousing
OLTP
• Application oriented
• Used to run business
• Detailed data
• Current, up-to-date
• Isolated data
• Repetitive access
• Performance sensitive
• Few records accessed
• Read/update access
Data Warehousing
• Subject oriented
• Used to analyze business
• Summarized and refined
• Snapshot data
• Integrated data
• Ad hoc access
• Performance relaxed
• Large volume accessed at a time
• Mostly read
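The two access patterns can be contrasted with Python's built-in `sqlite3` module. The `orders` table and its columns are made up for this sketch, and a single in-memory database stands in for both workloads purely for illustration. In practice the operational system and the warehouse are separate environments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, branch TEXT, item TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Mumbai", "TV", 120.0, "open"),
        (2, "Delhi", "Phone", 40.0, "open"),
        (3, "Mumbai", "Phone", 80.0, "shipped"),
        (4, "Delhi", "TV", 70.0, "shipped"),
    ],
)

# OLTP-style access: find one record by key and update it.
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = ?", (1,))
one_order = conn.execute(
    "SELECT status FROM orders WHERE order_id = ?", (1,)
).fetchone()

# Warehouse-style access: a read-only scan that summarizes many records.
branch_totals = conn.execute(
    "SELECT branch, SUM(amount) FROM orders GROUP BY branch ORDER BY branch"
).fetchall()

print(one_order, branch_totals)
conn.close()
```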
Data Warehouse summary
• Integrated platform for OLAP and DSS
• Helps optimize business operations
• Easy access to multidimensional data
Data Mining
Why Data Mining?
Strategic decision making
Wealth generation
Analyzing trends
Security
Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
• No Query…
• …But an “Interestingness criteria”
Data Mining
Data + Interestingness criteria = Hidden patterns
• Each term comes in several types: type of data, type of interestingness criteria, type of patterns
Type of Data
• Tabular (Ex: Transaction data)
  – Relational
  – Multi-dimensional
• Tree (Ex: XML data)
• Graphs
• Sequence (Ex: DNA, activity logs)
• Text, Multimedia …
Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal data)
• Consistency
• Repeating / periodicity
• "Abnormal" behavior
• Other patterns of interestingness…
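Two of the criteria listed above, frequency and rarity, reduce to counting. The sketch below applies both to a hypothetical activity log. The event names and the thresholds `frequent_min` and `rare_max` are assumptions picked for the example.

```python
from collections import Counter

# Hypothetical activity log (sequence data).
events = ["login", "view", "view", "login", "purchase", "view", "refund", "login", "view"]


def frequent_and_rare(items, frequent_min, rare_max):
    """Return (items seen at least frequent_min times, items seen at most rare_max times)."""
    counts = Counter(items)
    frequent = sorted(item for item, n in counts.items() if n >= frequent_min)
    rare = sorted(item for item, n in counts.items() if n <= rare_max)
    return frequent, rare


frequent, rare = frequent_and_rare(events, frequent_min=3, rare_max=1)
print("frequent:", frequent)
print("rare:", rare)
```

The same data yields different "interesting" patterns depending on which criterion is applied, which is the point of treating the criterion as an input to mining.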
Data Mining vs Statistical Inference
Statistics:
Conceptual model (hypothesis) -> Statistical reasoning -> "Proof" (validation of hypothesis)
Data mining:
Data -> Mining algorithm based on interestingness -> Pattern (model, rule, hypothesis) discovery
Used for..
• Data mining is used for
  – Frequent item-sets
  – Associations
  – Classifications
  – Clustering
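Frequent item-sets, the first use listed above, can be found with a level-wise search in the style of the Apriori algorithm. The following is a minimal teaching sketch, not an optimized implementation. The basket data and the `min_support` threshold (an absolute count) are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate; keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    singles = {frozenset([item]) for t in transactions for item in t}
    current = count(singles)
    frequent = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
result = apriori(baskets, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The prune step relies on the fact that no itemset can be frequent unless all of its subsets are frequent, which keeps the number of candidates counted at each level small.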
Techniques
• Algorithms
  – Apriori algorithm
  – Decision tree
    • SLIQ: Supervised Learning in QUEST (IBM)
• "GROUP BY"
  mysql> select sum(sal), deptno from emp group by deptno;
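The GROUP BY query above can be tried without a MySQL server by running it against an in-memory SQLite database from Python. The `emp` rows and the extra `ename` column are invented for this sketch. Only `sal` and `deptno` come from the slide, and `ORDER BY` is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, sal REAL, deptno INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        ("A", 1000.0, 10),
        ("B", 1500.0, 10),
        ("C", 2000.0, 20),
        ("D", 500.0, 30),
        ("E", 700.0, 30),
    ],
)

# Same aggregation as on the slide: total salary per department.
rows = conn.execute(
    "SELECT sum(sal), deptno FROM emp GROUP BY deptno ORDER BY deptno"
).fetchall()

for total, deptno in rows:
    print(deptno, total)
conn.close()
```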
Data Mining Summary
• Helps in pattern analysis and thus in taking actions, both in real time and for the future.
• Analyzing trends and clusters in business operations.
References
• http://www.datawarehousing.com/
• http://www.dw-institute.com/
• http://www.almaden.ibm.com/cs/quest/index.html
Thank you
Any Questions?