Integration of data mining results into multi-dimensional data models

ENTER 2015 Research Track

Volker Meyer (a)

Wolfram Höpken (a)

Matthias Fuchs (b)

Maria Lexhagen (b)

(a) University of Applied Sciences Ravensburg-Weingarten, Weingarten, Germany

{name.surname}@hs-weingarten.de

(b) Mid-Sweden University, Östersund, Sweden

{name.surname}@miun.se


Content

• Introduction

• State of the art

• Concepts for integrating DM results into MDM

• Conclusion


Motivation

• Business intelligence and data mining in tourism

– Amount of available information has increased dramatically

• e.g. web servers store tourists’ website navigation, databases store transaction and survey data, etc.

– Methods of BI and DM are used to mine information about tourists’ travel motives, service expectations, channel use, conversion rates or booking trends (Pyo et al., 2002; Wong et al., 2006)

• Business-IT gap

– DM tools demand extensive knowledge of the DM process and of the individual techniques (e.g. decision trees, association rules)

– Results can be unintelligible without the right technical knowledge of how to read them

Crucial business relevant information is available, but the user who needs the information is not able to decode it


Objective

• Objective

– Present DM results in a way that is understandable and manageable for business users

• Approach

– Integrate knowledge generated by DM techniques directly into the data warehouse structures from which the underlying data stem

– DM results (e.g. decision trees or association rules) become accessible through well-established analysis techniques such as online analytical processing (OLAP)

Integration concepts and data warehouse structures are presented for major data mining techniques: frequent itemsets, decision trees, and clustering


Content

• Introduction

• State of the art

• Concepts for integrating DM results into MDM

• Conclusion


Integrating DM results into databases

• Extending the database standard SQL (by new data types and database operations)

– Inductive query language by SINDBAD project (Kramer et al., 2006)

– Mining Association Rule Extension (Meo et al., 1998)

– Mining Structured Query Language (MSQL) (Imielinski, 1999)

– Data Mining Query Language (DMQL) (Han et al., 1996)

• Integrating DM results without extending database standard

– ADReM-Group (http://adrem.ua.ac.be/adrem) or Fromont et al. (2007)

• Standard conformance (standard tools and analysis approaches)

• Suitable for integrating DM results into existing data warehouse structures


Multi-dimensional data models (MDM)

• Fundamental concept of MDM

– Separation between

• Performance indicators (facts), e.g. turnover or number of persons

• Context/dimensions, e.g. time, date, customer, or product

– Typically represented as star schema

• MDM became established as the standard model for data warehousing

– Effective support of complex queries and OLAP analyses

– Better understandability for end users

– Crucial in tourism due to complex data structures (Höpken et al., 2013)

Star schema example (figure):

Booking (fact table): BookingNo (DD), Turnover (F), NoPersons (F)

DimProduct: ProdDesription, ProdCategory

DimCustomer: CusName, CusAge, CusGender, CusOrigin

DimTime: DayTime, Minutes, Hours

DimDate: DayInWeek, Weekend, Week, Month, Year, Season
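
The star schema above can be sketched in a few lines of Python using the standard-library `sqlite3` module. This is only an illustration: the key columns (`ProductKey`, `CustomerKey`, `DateKey`), the column types, the reduced attribute set and all sample rows are assumptions that do not appear on the slide.

```python
import sqlite3

# In-memory database; schema loosely follows the slide's booking example.
# Key columns, types and sample data are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct  (ProductKey INTEGER PRIMARY KEY, ProdCategory TEXT);
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CusName TEXT,
                          CusAge INTEGER, CusGender TEXT, CusOrigin TEXT);
CREATE TABLE DimDate     (DateKey INTEGER PRIMARY KEY, Month INTEGER,
                          Year INTEGER, Season TEXT);
CREATE TABLE Booking (
    BookingNo   INTEGER PRIMARY KEY,   -- marked (DD) on the slide
    ProductKey  INTEGER REFERENCES DimProduct(ProductKey),
    CustomerKey INTEGER REFERENCES DimCustomer(CustomerKey),
    DateKey     INTEGER REFERENCES DimDate(DateKey),
    Turnover    REAL,                  -- fact (F)
    NoPersons   INTEGER                -- fact (F)
);
""")
conn.executemany("INSERT INTO DimProduct VALUES (?, ?)",
                 [(1, "Hotel"), (2, "Apartment")])
conn.executemany("INSERT INTO DimCustomer VALUES (?, ?, ?, ?, ?)",
                 [(1, "A", 67, "f", "Sweden"), (2, "B", 34, "m", "Germany")])
conn.executemany("INSERT INTO DimDate VALUES (?, ?, ?, ?)",
                 [(1, 1, 2014, "Winter"), (2, 7, 2014, "Summer")])
conn.executemany("INSERT INTO Booking VALUES (?, ?, ?, ?, ?, ?)",
                 [(100, 1, 1, 1, 500.0, 2),
                  (101, 2, 2, 2, 300.0, 4),
                  (102, 1, 2, 1, 200.0, 1)])

# Typical OLAP-style roll-up: a fact aggregated along a dimension attribute.
turnover_by_season = dict(conn.execute("""
    SELECT d.Season, SUM(b.Turnover)
    FROM Booking b JOIN DimDate d ON d.DateKey = b.DateKey
    GROUP BY d.Season
"""))
print(turnover_by_season)
```

The roll-up at the end shows the separation the slide describes: facts (`Turnover`) are aggregated, dimensions (`Season`) supply the grouping context.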


Integrating DM results into MDM

• Extending the MDM by additional facts and dimensions/attributes

– Complexity strongly depends on the concrete DM model

– Cluster membership can simply be represented as an additional attribute

– Decision trees or association rules need a more complex fact/dimension structure

• Current status

– Simple approaches for market baskets (i.e. frequent itemsets) exist (Kimball & Ross, 2002)

– Comprehensive approach for all DM models still missing


Content

• Introduction

• State of the art

• Concepts for integrating DM results into MDM

• Conclusion


Frequent itemsets

• Frequent itemset = set of attribute values that frequently co-occur

• Approach to store frequent itemsets

– Reuse original data structures

• Store co-occurring attribute values in artificial entries within the original star schema

– Add a frequent itemset table referencing the artificial entries in the original star schema


Frequent itemsets

• Example: "old" and "Swedish" customers
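
As a rough sketch of how the "old" and "Swedish" itemset might be stored, the following Python/`sqlite3` snippet puts the itemset's attribute values into an artificial customer row and an artificial booking row, and lets a frequent itemset table point to that artificial booking. Several details are assumptions made for illustration and are not taken from the slides: the discretised column `CusAgeGroup`, the column names `Support` and `FKBooking`, the use of negative keys for artificial entries, and NULL for attributes that are not part of the itemset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY,
                          CusAgeGroup TEXT, CusGender TEXT, CusOrigin TEXT);
CREATE TABLE Booking (BookingNo INTEGER PRIMARY KEY,
                      CustomerKey INTEGER REFERENCES DimCustomer(CustomerKey),
                      Turnover REAL);
-- Frequent itemset table: references an artificial Booking entry.
CREATE TABLE FrequentItemset (ItemsetKey INTEGER PRIMARY KEY,
                              Support REAL,
                              FKBooking INTEGER REFERENCES Booking(BookingNo));
""")

# Real data (positive keys); sample values are made up.
conn.executemany("INSERT INTO DimCustomer VALUES (?, ?, ?, ?)",
                 [(1, "old", "f", "Sweden"),
                  (2, "old", "m", "Sweden"),
                  (3, "young", "f", "Germany")])
conn.executemany("INSERT INTO Booking VALUES (?, ?, ?)",
                 [(100, 1, 500.0), (101, 2, 300.0), (102, 3, 200.0)])

# Artificial entries (negative keys, an assumed convention): only the
# attributes belonging to the itemset are filled, all others stay NULL.
conn.execute("INSERT INTO DimCustomer VALUES (-1, 'old', NULL, 'Sweden')")
conn.execute("INSERT INTO Booking VALUES (-1, -1, NULL)")
# Two of the three real bookings contain both values.
conn.execute("INSERT INTO FrequentItemset VALUES (1, ?, -1)", (2 / 3,))

# Reading the itemset back goes through the ordinary star-schema joins.
itemset = conn.execute("""
    SELECT fi.ItemsetKey, c.CusAgeGroup, c.CusGender, c.CusOrigin
    FROM FrequentItemset fi
    JOIN Booking b     ON b.BookingNo = fi.FKBooking
    JOIN DimCustomer c ON c.CustomerKey = b.CustomerKey
""").fetchone()
print(itemset)
```

One consequence of this design is that artificial rows live next to real ones. `SUM` ignores their NULL facts, but row counts would include them, so queries over real bookings need a filter such as `BookingNo > 0` under the key convention assumed here.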


Frequent itemsets and OLAP analyses

• Overall revenue per frequent itemset

– Frequent itemsets used as new analysis dimension

• Identifying the most valuable frequent itemsets (which is not possible in typical data mining tools)


Frequent itemsets and OLAP analyses

• Drill-through by frequent itemsets

– Looking at single bookings belonging to (i.e. supporting) a frequent itemset

– Example: Frequent itemset "old customers booking a hotel" with detailed information: booking data, season, customer age, origin, sex and booking price
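
The two analyses above (revenue per frequent itemset and drill-through to the supporting bookings) can be sketched with plain SQL over the same kind of structure. In this sketch a real booking is taken to support an itemset when it agrees with every non-NULL attribute of the itemset's artificial entry. That matching rule, the view name `ItemsetBooking`, the column `CusAgeGroup`, the negative-key convention and the sample data are all assumptions, not details confirmed by the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY,
                          CusAgeGroup TEXT, CusOrigin TEXT);
CREATE TABLE Booking (BookingNo INTEGER PRIMARY KEY,
                      CustomerKey INTEGER, Turnover REAL);
CREATE TABLE FrequentItemset (ItemsetKey INTEGER PRIMARY KEY,
                              FKBooking INTEGER);
-- Pairs every itemset with the real bookings that support it.
CREATE VIEW ItemsetBooking AS
SELECT fi.ItemsetKey, b.BookingNo, b.Turnover, c.CusAgeGroup, c.CusOrigin
FROM FrequentItemset fi
JOIN Booking ab     ON ab.BookingNo = fi.FKBooking      -- artificial booking
JOIN DimCustomer ac ON ac.CustomerKey = ab.CustomerKey  -- artificial customer
JOIN DimCustomer c  ON c.CustomerKey > 0
     AND (ac.CusAgeGroup IS NULL OR c.CusAgeGroup = ac.CusAgeGroup)
     AND (ac.CusOrigin   IS NULL OR c.CusOrigin   = ac.CusOrigin)
JOIN Booking b      ON b.CustomerKey = c.CustomerKey AND b.BookingNo > 0;
""")

conn.executemany("INSERT INTO DimCustomer VALUES (?, ?, ?)", [
    (1, "old", "Sweden"), (2, "old", "Sweden"),
    (3, "young", "Germany"), (4, "young", "Sweden"),
    (-1, "old", "Sweden"),   # artificial entry for itemset {old, Sweden}
    (-2, None, "Sweden"),    # artificial entry for itemset {Sweden}
])
conn.executemany("INSERT INTO Booking VALUES (?, ?, ?)", [
    (100, 1, 500.0), (101, 2, 300.0), (102, 3, 200.0), (103, 4, 100.0),
    (-1, -1, None), (-2, -2, None),
])
conn.executemany("INSERT INTO FrequentItemset VALUES (?, ?)",
                 [(1, -1), (2, -2)])

# Frequent itemsets used as an analysis dimension: overall revenue per itemset.
revenue = dict(conn.execute(
    "SELECT ItemsetKey, SUM(Turnover) FROM ItemsetBooking GROUP BY ItemsetKey"))

# Drill-through: the single bookings supporting itemset 1.
supporting = [row[0] for row in conn.execute(
    "SELECT BookingNo FROM ItemsetBooking WHERE ItemsetKey = 1 ORDER BY BookingNo")]
print(revenue, supporting)
```

Because the matching is expressed as a view, both analyses remain ordinary `GROUP BY` and `WHERE` queries, which is the kind of access an OLAP front end typically generates.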


Clustering

• Clustering = grouping similar records into homogeneous clusters

• Approach for storing clusters within a multi-dimensional structure

– Cluster centroids (i.e. calculated cluster centers) stored as artificial entries in original star schema

– Cluster table stores characteristics of each cluster of a cluster model and points to the cluster centroid as an artificial entry in the star schema

– In the original fact table the cluster membership is stored for each original data entry (attribute FKCluster pointing to the cluster table)
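
The three storage steps above might look as follows in Python with `sqlite3`. The sketch assumes that cluster memberships have already been produced by some clustering tool, and it simply takes the per-cluster mean as the centroid. Only `FKCluster` is named on the slide; `FKCentroid`, `ModelName`, `Size`, the negative keys for artificial entries and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CusAge REAL);
CREATE TABLE Cluster (ClusterKey INTEGER PRIMARY KEY, ModelName TEXT,
                      Size INTEGER, FKCentroid INTEGER);
CREATE TABLE Booking (BookingNo INTEGER PRIMARY KEY, CustomerKey INTEGER,
                      Turnover REAL, FKCluster INTEGER);
""")
conn.executemany("INSERT INTO DimCustomer VALUES (?, ?)",
                 [(1, 30.0), (2, 40.0), (3, 60.0), (4, 70.0)])
# Cluster membership per booking, as delivered by a clustering run (assumed).
conn.executemany("INSERT INTO Booking VALUES (?, ?, ?, ?)",
                 [(100, 1, 100.0, 1), (101, 2, 200.0, 1),
                  (102, 3, 500.0, 2), (103, 4, 700.0, 2)])

# Compute size and centroid of each cluster from the real entries.
stats = conn.execute("""
    SELECT b.FKCluster, COUNT(*), AVG(c.CusAge), AVG(b.Turnover)
    FROM Booking b JOIN DimCustomer c ON c.CustomerKey = b.CustomerKey
    GROUP BY b.FKCluster
""").fetchall()

for cluster_key, size, mean_age, mean_turnover in stats:
    art = -cluster_key  # key of the artificial centroid entries
    conn.execute("INSERT INTO DimCustomer VALUES (?, ?)", (art, mean_age))
    conn.execute("INSERT INTO Booking VALUES (?, ?, ?, NULL)",
                 (art, art, mean_turnover))
    conn.execute("INSERT INTO Cluster VALUES (?, 'customer-segments', ?, ?)",
                 (cluster_key, size, art))

# Cluster characteristics plus centroid, read back through the star schema.
centroids = conn.execute("""
    SELECT cl.ClusterKey, cl.Size, c.CusAge, b.Turnover
    FROM Cluster cl
    JOIN Booking b     ON b.BookingNo = cl.FKCentroid
    JOIN DimCustomer c ON c.CustomerKey = b.CustomerKey
    ORDER BY cl.ClusterKey
""").fetchall()
print(centroids)
```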


Clustering

• Example: Customer clusters


Cluster models and OLAP analyses

• Revenue per customer cluster

– Clusters are used as a new dimension for data analyses

– New characteristics of clusters can be calculated, e.g. sum of booking price, grouped by any other dimension characteristic, e.g. season
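
Once the cluster membership is available as a foreign key in the fact table, the cluster behaves like any other dimension. The following minimal sketch computes the sum of booking turnover per cluster and season. The table layout, the `Label` column and the sample data are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, Season TEXT);
CREATE TABLE Cluster (ClusterKey INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Booking (BookingNo INTEGER PRIMARY KEY, DateKey INTEGER,
                      Turnover REAL, FKCluster INTEGER);
""")
conn.executemany("INSERT INTO DimDate VALUES (?, ?)",
                 [(1, "Winter"), (2, "Summer")])
conn.executemany("INSERT INTO Cluster VALUES (?, ?)",
                 [(1, "cluster_1"), (2, "cluster_2")])
conn.executemany("INSERT INTO Booking VALUES (?, ?, ?, ?)",
                 [(100, 1, 100.0, 1), (101, 2, 200.0, 1),
                  (102, 1, 500.0, 2), (103, 1, 700.0, 2)])

# Cluster as one analysis dimension, season as a second one.
rows = conn.execute("""
    SELECT cl.Label, d.Season, SUM(b.Turnover)
    FROM Booking b
    JOIN Cluster cl ON cl.ClusterKey = b.FKCluster
    JOIN DimDate d  ON d.DateKey = b.DateKey
    GROUP BY cl.Label, d.Season
    ORDER BY cl.Label, d.Season
""").fetchall()

# Arrange the result as a small cross table: {cluster: {season: turnover}}.
cross_table = {}
for label, season, turnover in rows:
    cross_table.setdefault(label, {})[season] = turnover
print(cross_table)
```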


Decision trees

• Decision tree

– Separating data records into predefined classes based on a series of decisions


Decision trees

• Storing a decision tree in a multi-dimensional structure

– Each node is represented by a decision rule

• booking = short-term -> valuable = yes

• booking = long-term & type apartment = yes -> valuable = yes

– Decision rules stored by

• Reusing original star schema to specify attribute values of the rule

• Specific table specifying rule characteristics and referencing the artificial entry in the original structure


Decision trees

• Example decision rule: booking = long-term & type apartment = yes -> valuable = yes
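
A possible way to store such decision rules is sketched below in Python with `sqlite3`: the conditions of a rule are kept as attribute values of an artificial entry in the star schema, and a rule table holds the predicted class and points to that entry. The columns `BookingLead` and `TypeApartment`, the table `DecisionRule` with `Prediction` and `FKBooking`, and the negative keys are hypothetical names chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, TypeApartment TEXT);
CREATE TABLE Booking (BookingNo INTEGER PRIMARY KEY, ProductKey INTEGER,
                      BookingLead TEXT, Turnover REAL);
-- Rule table: predicted class plus reference to the artificial entry.
CREATE TABLE DecisionRule (RuleKey INTEGER PRIMARY KEY, Prediction TEXT,
                           FKBooking INTEGER REFERENCES Booking(BookingNo));
""")

# Rule 1: booking = short-term -> valuable = yes
conn.execute("INSERT INTO Booking VALUES (-1, NULL, 'short-term', NULL)")
conn.execute("INSERT INTO DecisionRule VALUES (1, 'yes', -1)")

# Rule 2: booking = long-term & type apartment = yes -> valuable = yes
conn.execute("INSERT INTO DimProduct VALUES (-2, 'yes')")
conn.execute("INSERT INTO Booking VALUES (-2, -2, 'long-term', NULL)")
conn.execute("INSERT INTO DecisionRule VALUES (2, 'yes', -2)")


def render_rules(connection):
    """Rebuild a readable rule text from the artificial entries."""
    rows = connection.execute("""
        SELECT r.RuleKey, b.BookingLead, p.TypeApartment, r.Prediction
        FROM DecisionRule r
        JOIN Booking b         ON b.BookingNo = r.FKBooking
        LEFT JOIN DimProduct p ON p.ProductKey = b.ProductKey
        ORDER BY r.RuleKey
    """)
    texts = []
    for _key, lead, apartment, prediction in rows:
        conditions = []
        if lead is not None:            # NULL means "no condition"
            conditions.append(f"booking = {lead}")
        if apartment is not None:
            conditions.append(f"type apartment = {apartment}")
        texts.append(" & ".join(conditions) + f" -> valuable = {prediction}")
    return texts


rules = render_rules(conn)
print(rules)
```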


Decision trees and OLAP analyses

– Decision tree nodes are used as a new dimension for data analyses

– New characteristics of decision tree nodes can be calculated, e.g. sum of booking price (based on any fact of the fact table)


Decision trees and OLAP analyses

– Decision trees are used to narrow down the analysis to interesting subgroups (i.e. nodes with a high accuracy)
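
Both uses of decision tree nodes (as an analysis dimension and as a filter for high-accuracy subgroups) can be sketched as follows. For brevity the rule attributes are kept directly in a single `Booking` table. The matching rule (a booking belongs to a node if it agrees with every non-NULL attribute of the node's artificial entry), the accuracy threshold of 0.9, all column names and the sample data are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Booking (BookingNo INTEGER PRIMARY KEY, BookingLead TEXT,
                      TypeApartment TEXT, Valuable TEXT, Turnover REAL);
CREATE TABLE DecisionRule (RuleKey INTEGER PRIMARY KEY, Prediction TEXT,
                           FKBooking INTEGER);
-- Pairs every rule (tree node) with the real bookings it covers.
CREATE VIEW RuleBooking AS
SELECT r.RuleKey, r.Prediction, b.BookingNo, b.Valuable, b.Turnover
FROM DecisionRule r
JOIN Booking a ON a.BookingNo = r.FKBooking          -- artificial entry
JOIN Booking b ON b.BookingNo > 0                    -- real bookings only
     AND (a.BookingLead   IS NULL OR b.BookingLead   = a.BookingLead)
     AND (a.TypeApartment IS NULL OR b.TypeApartment = a.TypeApartment);
""")
conn.executemany("INSERT INTO Booking VALUES (?, ?, ?, ?, ?)", [
    (100, "short-term", "no",  "yes", 400.0),
    (101, "short-term", "yes", "yes", 300.0),
    (102, "long-term",  "yes", "yes", 600.0),
    (103, "long-term",  "yes", "no",  100.0),
    (104, "long-term",  "no",  "no",  150.0),
    (105, "long-term",  "no",  "no",   50.0),
    # Artificial entries holding the rule conditions (NULL = unconstrained).
    (-1, "short-term", None,  None, None),
    (-2, "long-term",  "yes", None, None),
    (-3, "long-term",  "no",  None, None),
])
conn.executemany("INSERT INTO DecisionRule VALUES (?, ?, ?)",
                 [(1, "yes", -1), (2, "yes", -2), (3, "no", -3)])

# Nodes as a dimension: turnover per node, plus accuracy computed from the
# share of covered bookings whose actual class equals the prediction.
per_node = conn.execute("""
    SELECT RuleKey, SUM(Turnover), AVG(Valuable = Prediction)
    FROM RuleBooking GROUP BY RuleKey ORDER BY RuleKey
""").fetchall()

# Narrow the analysis down to nodes with high accuracy (threshold assumed).
accurate_nodes = [key for key, _turnover, accuracy in per_node
                  if accuracy >= 0.9]
print(per_node, accurate_nodes)
```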


Benefit of presented approach

• Advantages of integrating data mining results into the original multi-dimensional data structure

– Ordinary OLAP queries can be used to analyse data mining results (like frequent itemsets, cluster models or decision trees)

– Data mining results complement existing information and enhance the explanatory power of analyses by constituting a new dimension

• E.g. calculate overall turnover of frequent itemsets, decision tree nodes or clusters

• E.g. filter bookings by a specific frequent itemset (only looking at bookings from old and Swedish customers)


Content

• Introduction

• State of the art

• Concepts for integrating DM results into MDM

• Conclusion


Conclusion & Outlook

• BI & data mining in tourism

– Multi-dimensional data warehouse structures are an important concept for tourism (destinations)

– All data mining techniques heavily used in tourism

• Novel approach for integrating data mining models into underlying multi-dimensional data structures

– Frequent itemsets, association rules, clustering, decision trees

– Complement existing information and enrich OLAP analyses

• Future activities

– Automatic transformation of data mining results into multi-dimensional structures to support broader evaluation

– Evaluate user acceptance of new analysis possibilities
