KDDML: A Middleware Language and System for Knowledge Discovery in Databases

A. Romei, S. Ruggieri, F. Turini
Dipartimento di Informatica, Università di Pisa

Thirteenth Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD-2005)
Brixen, Italy – 19-22 June, 2005

SEBD 2005 - Brixen, June 2005

Application Area: KDD

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.


The CRISP-DM process

Main focus on the automatic phases:

• Data pre-processing
• Modeling
• Post-processing
• Model evaluation


In this work

KDDML: an XML-based middleware language and system in support of the KDD process.

• KDDML as a language.
• KDDML as a system.


Requirements

R1: A data/model repository should be available for storing the input, output and intermediate objects of the KDD process.

• Several representations of data can be available.
• Automatic format conversions.
• Automatic meta-data mapping (e.g., ARFF, SQL).

R2: Logical meta-data (meta-model) should be specified in addition to the physical data (model).

R3: Compositionality of mining operations in the design of the language (closure principle).

R4: High extensibility of the system architecture.


KDDML as XML-based System

• XML as data/model representation (R1, R2): a machine-processable language.
• XML as language definition: ensures compositionality of operators (R3).
• Extensibility and modularity (R4).


Data/Model Representation


Data Format

Separating the logical data from the physical instances.

• Data schema via proprietary XML.
• Actual data stored in CSV (Comma Separated Values).

CSV has been chosen as a trade-off between readability (compared with a binary file) and space occupation (compared with XML).


Data Format: Example

<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="6" number_of_instances="16">
    <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric">
      <NUMERIC_DESCRIPTION mean="40.75" variance="237.8" min="18.0" max="70.0"/>
    </ATTRIBUTE>
    <ATTRIBUTE name="education" number_of_missed_values="3" type="nominal">
      <NOMINAL_DESCRIPTION number_of_values="4">
        <VALUE value="HS-grad" cardinality="3"/>
        <VALUE value="masters" cardinality="2"/>
        ….
      </NOMINAL_DESCRIPTION>
    </ATTRIBUTE>
    ….
  </SCHEMA>
</KDDML_TABLE>

Callouts on the slide: Logical Metadata, Physical Data.
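
As an illustration of how a client might read the logical metadata without touching the CSV file, here is a minimal Python sketch. It parses an abridged copy of the census schema (the elided parts are simply left out) with the standard-library ElementTree parser. The function name summarize_schema is hypothetical and not part of KDDML.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the census schema from the slide; elided parts are omitted.
KDDML_TABLE = """
<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="6" number_of_instances="16">
    <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric">
      <NUMERIC_DESCRIPTION mean="40.75" variance="237.8" min="18.0" max="70.0"/>
    </ATTRIBUTE>
    <ATTRIBUTE name="education" number_of_missed_values="3" type="nominal">
      <NOMINAL_DESCRIPTION number_of_values="4">
        <VALUE value="HS-grad" cardinality="3"/>
        <VALUE value="masters" cardinality="2"/>
      </NOMINAL_DESCRIPTION>
    </ATTRIBUTE>
  </SCHEMA>
</KDDML_TABLE>
"""


def summarize_schema(xml_text):
    """Return (data_file, {attribute name: (type, missing-value count)})."""
    root = ET.fromstring(xml_text)
    attributes = {
        a.get("name"): (a.get("type"), int(a.get("number_of_missed_values")))
        for a in root.iter("ATTRIBUTE")
    }
    return root.get("data_file"), attributes


data_file, attributes = summarize_schema(KDDML_TABLE)
print(data_file, attributes)
```

The physical data is referenced only through the data_file attribute, so the schema can be inspected on its own.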


Model Format

PMML (Predictive Model Markup Language)

• An industry standard for representing actual models as XML documents.
• Consists of DTDs for a wide spectrum of models, including association rules (RdA), decision trees, clustering, regression, and neural networks.
• It does not cover the process of extracting models, only the exchange of the extracted knowledge.


Model Format: Example

<PMML version="2.0">
  ….
  <DataDictionary>
    <DataField name="id" optype="continuous"/>
    <DataField name="amount" optype="continuous"/>
  </DataDictionary>
  <TreeModel modelName="censusTree" splitCharacteristic="multiSplit">
    <MiningSchema>
      <MiningField name="id" usageType="supplementary"/>
      <MiningField name="class" usageType="predicted"/>
    </MiningSchema>
    <Node score="" recordCount="48842">
      <True/>
      <ScoreDistribution value="&lt;=50K" recordCount="37155"/>
      ...
    </Node>
  </TreeModel>
</PMML>

Callouts on the slide: Logical Metadata, Physical Model.


Language


Closure Principle (1)

Arguments of an operator must be of an appropriate type and sequence.

We denote the signature of an operator by op: t1 × … × tn → t, and enforce it by defining a DTD for KDDML queries that constrains the sub-elements to be of types t1, … , tn.


Closure Principle (2)

f_TREE_CLASSIFY: tree × table → table

<!ELEMENT TREE_CLASSIFY ((%kdd_query_trees;), (%kdd_query_table;))>
<!ATTLIST TREE_CLASSIFY xml_dest %string; #IMPLIED>

Where:

• kdd_query_trees: all operators returning a classification tree;
• kdd_query_table: all operators returning a table;
• TREE_CLASSIFY belongs to the kdd_query_table entity.
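
The closure principle can be illustrated with a small type checker. The following Python sketch is not the DTD-based validation KDDML uses; it mimics the same idea with a hypothetical signature table. Only the TREE_CLASSIFY signature is stated on the slide; the other rows are assumptions inferred from the query examples in this deck.

```python
import xml.etree.ElementTree as ET

# operator -> (argument types, result type).
# Only the TREE_CLASSIFY row comes from the slide; the rest are assumptions.
SIGNATURES = {
    "TREE_CLASSIFY": (("tree", "table"), "table"),
    "TREE_MINER": (("table", "algs"), "tree"),
    "PP_SAMPLING": (("table", "algs"), "table"),
    "ARFF_LOADER": ((), "table"),
    "TABLE_LOADER": ((), "table"),
    "ALGORITHM": ((), "algs"),
}


def result_type(element):
    """Type-check an operator element bottom-up and return its result type."""
    arg_types, result = SIGNATURES[element.tag]
    if element.tag == "ALGORITHM":
        return result  # PARAM children are parameters, not operator arguments
    actual = tuple(result_type(child) for child in element)
    if actual != arg_types:
        raise TypeError(f"{element.tag}: expected {arg_types}, got {actual}")
    return result


well_typed = ET.fromstring(
    "<TREE_CLASSIFY>"
    "<TREE_MINER><ARFF_LOADER/><ALGORITHM/></TREE_MINER>"
    "<TABLE_LOADER/>"
    "</TREE_CLASSIFY>"
)
ill_typed = ET.fromstring(
    "<TREE_CLASSIFY><TABLE_LOADER/><TABLE_LOADER/></TREE_CLASSIFY>"
)

print(result_type(well_typed))
```

Because every operator returns a typed result, any operator returning a tree can appear as the first argument of TREE_CLASSIFY, which is what makes queries compose; passing ill_typed to result_type raises a TypeError.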


KDDML Types

The set of types of KDDML operators consists of:

• Table, PPtable
• Tree, clusters, rda, sequence, hierarchy
• Algs, condition, expression


KDDML Query structure

The structure of a KDDML query has a precise format.

• XML tag elements correspond to operations on data and models;
• XML attributes correspond to the parameters of those operations;
• XML sub-elements define the arguments passed to the operators (KDDML types).

<OPERATOR_NAME xml_dest="results.xml" att1="v1" ... attM="vM">
  <ARG1_NAME> .... </ARG1_NAME>
  ...
  <ARGn_NAME> .... </ARGn_NAME>
</OPERATOR_NAME>


Example (1)

Construction and application of a decision tree.

• Loading of an ARFF source as training set.
• Simple sampling on the training set.
• Construction of a decision tree on the sampled training set.
  • Target attribute: play.
  • Algorithm: C4.5.
• Loading of a test set from the system repository.
• Application of the decision tree to the test set.


Example (2)

<KDDML_OBJECT>
  <KDD_QUERY name="sample">
    <TREE_CLASSIFY xml_dest="results.xml">
      <TREE_MINER xml_dest="weather.xml" target_attribute="play">
        <PP_SAMPLING>
          <ARFF_LOADER arff_file_name="weather.arff"/>
          <ALGORITHM algorithm_name="simple_sampling">
            <PARAM name="percentage" value="0.66"/>
          </ALGORITHM>
        </PP_SAMPLING>
        <ALGORITHM algorithm_name="C4.5">
          <PARAM name="confidence_for_pruning" value="0.4"/>
          <PARAM name="num_instances_for_leaf" value="6"/>
        </ALGORITHM>
      </TREE_MINER>
      <TABLE_LOADER xml_source="weather_test.xml"/>
    </TREE_CLASSIFY>
  </KDD_QUERY>
</KDDML_OBJECT>

Operator tree of the query:

Tree Classify
  Tree Miner (Alg: c4.5, Pruning confidence: 40%, Num instances: 6)
    Sampling (Alg: simple sampling, Percentage: 66%)
      Arff Loader (Source: weather.arff, read from the ARFF repository)
  Table Loader (Source: weather_test.xml, read from the data repository)


Language Operators

• Data/Model access.
• Preprocessing: data cleaning, sampling, normalization, discretization.
• Model extraction.
• Model application and evaluation.
• Model meta-reasoning & filtering.


Example one: Discretization

....
<PP_NUMERIC_DISCRETIZATION xml_dest="census_discrete.xml"
                           attribute_name="age"
                           label_type="enumeration"
                           enumerated_label_list="young, middle, old">
  <TABLE_LOADER xml_source="census.xml"/>
  <ALGORITHM algorithm_name="natural_binning">
    <PARAM name="cardinality" value="3"/>
    <PARAM name="having_number_of_intervals" value="true"/>
  </ALGORITHM>
</PP_NUMERIC_DISCRETIZATION>
....

Discretization of the numeric attribute "age" into three intervals using the natural binning method.
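
The slides do not define natural binning. Assuming it means splitting the attribute range into intervals of equal width (an assumption, not something stated here), the effect on "age" could be sketched in Python as follows. The helper equal_width_bins is hypothetical, and the range 18 to 70 is taken from the census schema example.

```python
def equal_width_bins(lo, hi, labels):
    """Return a function mapping a numeric value to one of `labels`.

    Assumption: "natural binning" is treated as equal-width binning,
    with one interval per label over the range [lo, hi].
    """
    width = (hi - lo) / len(labels)

    def label_of(value):
        index = int((value - lo) // width)
        # Clamp so that the maximum (and any out-of-range value) gets a label.
        index = min(max(index, 0), len(labels) - 1)
        return labels[index]

    return label_of


age_label = equal_width_bins(18.0, 70.0, ["young", "middle", "old"])
print([age_label(a) for a in (18, 40, 70)])
```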


Example two: RdA filtering

....
<RDA_FILTER>
  <RDA_LOADER xml_source="rules.xml"/>
  <CONDITION>
    <AND_COND>
      <BASE_COND op_type="is_in" term1="@body" term2="bread"/>
      <BASE_COND op_type="is_not_in" term1="@head" term2="milk"/>
      <BASE_COND op_type="equal" term1="@head_cardinality" term2="2"/>
      <BASE_COND op_type="greater" term1="@support" term2="0.3"/>
    </AND_COND>
  </CONDITION>
</RDA_FILTER>
....

Selects the rules that have the item "bread" in the body, do not have the item "milk" in the head, have exactly two items in the head, and have support greater than 30%.
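
To make the semantics of the condition concrete, here is a minimal Python sketch of the same conjunction applied to a handful of rules. The Rule class and the sample rules are invented for illustration and are not taken from KDDML or from any dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory association rule: body => head."""
    body: frozenset
    head: frozenset
    support: float


def keep(rule):
    """Conjunction of the four BASE_COND tests inside AND_COND."""
    return (
        "bread" in rule.body          # is_in @body bread
        and "milk" not in rule.head   # is_not_in @head milk
        and len(rule.head) == 2       # equal @head_cardinality 2
        and rule.support > 0.3        # greater @support 0.3
    )


# Made-up rules, one passing and three failing a different condition each.
rules = [
    Rule(frozenset({"bread"}), frozenset({"butter", "jam"}), 0.35),
    Rule(frozenset({"bread"}), frozenset({"milk", "jam"}), 0.40),
    Rule(frozenset({"beer"}), frozenset({"chips", "nuts"}), 0.50),
    Rule(frozenset({"bread"}), frozenset({"butter", "jam"}), 0.20),
]
selected = [r for r in rules if keep(r)]
print(len(selected))
```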


System Architecture


Design targets

• Extensibility: data sources, algorithms, models.
• Portability.
• Modularity.

Architecture structured in 3 layers.


Architecture Layers

Diagram labels: Data, Models, Repository Layer, Operators Layer, Interpreter Layer, "To upper layers…".

Operators Layer:

• Implementation of language operators.

• <OPERATOR_NAME> is implemented as a Java class satisfying an interface.

• Interface is task-dependent.

Repository Layer:

• Manages read/write access to the data and models repository.

• Manages read/write access to data and models from external sources.

• Provides programmatic functionality to the higher layers.

Interpreter Layer:

• Accepts a validated KDDML query and returns the result as an XML document.

• Recursively traverses the DOM tree representation of the query.

• The interpreter is not affected by data/algorithm/model extensibility.
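
The recursive traversal can be sketched as follows. This is a Python illustration under simplifying assumptions, not the Java implementation: the operator functions and the OPERATORS registry are hypothetical stand-ins that return descriptive strings instead of real tables and models, and TREE_MINER is shown without its ALGORITHM argument.

```python
import xml.etree.ElementTree as ET


# Hypothetical stand-ins for operator classes (the real ones are Java classes).
def table_loader(attrs, args):
    return f"table({attrs['xml_source']})"


def tree_miner(attrs, args):
    return f"tree(target={attrs['target_attribute']}, from={args[0]})"


def tree_classify(attrs, args):
    tree, table = args
    return f"classified({table} using {tree})"


# Registry mapping element names to operator implementations.
OPERATORS = {
    "TABLE_LOADER": table_loader,
    "TREE_MINER": tree_miner,
    "TREE_CLASSIFY": tree_classify,
}


def interpret(element):
    """Evaluate the sub-elements first, then apply the element's operator."""
    args = [interpret(child) for child in element]
    return OPERATORS[element.tag](dict(element.attrib), args)


query = ET.fromstring(
    "<TREE_CLASSIFY>"
    '<TREE_MINER target_attribute="play">'
    '<TABLE_LOADER xml_source="weather.xml"/>'
    "</TREE_MINER>"
    '<TABLE_LOADER xml_source="weather_test.xml"/>'
    "</TREE_CLASSIFY>"
)
result = interpret(query)
print(result)
```

In this sketch, supporting a new operator only requires a new registry entry; interpret itself stays unchanged, which mirrors the claim that the interpreter is not affected by extensions.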


KDDML as Middleware System

Diagram labels: MQL, High Level, GUI, Compiler, Query MQL, Query KDDML, Results; Interpreter Layer, Operators Layer, Repository Layer; Data, Models.


Experiences with KDDML


ClickWorld

Extract DM models from visits to a city-news portal with the intent of characterizing the topics of interest of new visitors.

M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, F. Turini. Preprocessing and mining web log data for web personalization. 8th Italian Conf. on Artificial Intelligence, LNCS vol. 2829, pp. 237-249, September 2003.


KDDML-G

A system for KDD on the GRID.

• Exploits the parallelism offered by the GRID.
• Data immovability: the code is moved to the place where the data resides.

Diagram labels: OP, OP1, OP2, OP3.


Download KDDML

http://kdd.di.unipi.it/kddml/

GNU GPL (General Public Licence)