
Knowledge-Based Interactive Postmining of Association Rules Using Ontologies

    Claudia Marinica and Fabrice Guillet

Abstract—In Data Mining, the usefulness of association rules is strongly limited by the huge amount of delivered rules. To overcome this drawback, several methods were proposed in the literature, such as itemset concise representations, redundancy reduction, and postprocessing. However, being generally based on statistical information, most of these methods do not guarantee that the extracted rules are interesting for the user. Thus, it is crucial to help the decision-maker with an efficient postprocessing step in order to reduce the number of rules. This paper proposes a new interactive approach to prune and filter discovered rules. First, we propose to use ontologies in order to improve the integration of user knowledge in the postprocessing task. Second, we propose the Rule Schema formalism, extending the specification language proposed by Liu et al. for user expectations. Furthermore, an interactive framework is designed to assist the user throughout the analyzing task. Applying our new approach over voluminous sets of rules, we were able, by integrating domain expert knowledge in the postprocessing step, to reduce the number of rules to several dozens or less. Moreover, the quality of the filtered rules was validated by the domain expert at various points in the interactive process.

Index Terms—Clustering, classification, and association rules; interactive data exploration and discovery; knowledge management applications.

    1 INTRODUCTION

ASSOCIATION rule mining, introduced in [1], is considered as one of the most important tasks in Knowledge Discovery in Databases [2]. Among sets of items in transaction databases, it aims at discovering implicative tendencies that can be valuable information for the decision-maker.

An association rule is defined as an implication $X \Rightarrow Y$, described by two interestingness measures, support and confidence, where $X$ and $Y$ are sets of items and $X \cap Y = \emptyset$. Apriori [1] is the first algorithm proposed in the association rule mining field, and many other algorithms were derived from it. Starting from a database, it extracts all association rules satisfying minimum thresholds of support and confidence. It is well known that mining algorithms can discover a prohibitive amount of association rules; for instance, thousands of rules are extracted from a database of several dozen attributes and several hundred transactions. Furthermore, as suggested by Silberschatz and Tuzhilin [3], valuable information is often represented by those rare (low support) and unexpected association rules which are surprising to the user. So, the more we increase the support threshold, the more efficient the algorithms are and the more obvious the discovered rules are, and hence, the less interesting they are for the user. As a result, it is necessary to bring the support threshold low enough in order to extract valuable information.

Unfortunately, the lower the support is, the larger the volume of rules becomes, making it intractable for a decision-maker to analyze the mining result. Experiments show that rules become almost impossible to use when their number exceeds 100. Thus, it is crucial to help the decision-maker with an efficient technique for reducing the number of rules.

To overcome this drawback, several methods were proposed in the literature. On the one hand, different algorithms were introduced to reduce the number of itemsets by generating closed [4], maximal [5], or optimal itemsets [6], and several algorithms reduce the number of rules using nonredundant rules [7], [8] or pruning techniques [9]. On the other hand, postprocessing methods can improve the selection of discovered rules. Different complementary postprocessing methods may be used, like pruning, summarizing, grouping, or visualization [10]. Pruning consists in removing uninteresting or redundant rules. In summarizing, concise sets of rules are generated. Groups of rules are produced in the grouping process; and visualization improves the readability of a large number of rules by using adapted graphical representations.

However, most of the existing postprocessing methods are generally based on statistical information in the database. Since rule interestingness strongly depends on user knowledge and goals, these methods do not guarantee that interesting rules will be extracted. For instance, if the user looks for unexpected rules, all the already known rules should be pruned. Or, if the user wants to focus on specific schemas of rules, only this subset of rules should be selected. Moreover, as suggested in [11], rule postprocessing methods should imperatively be based on strong interactivity with the user.

The representation of user knowledge is an important issue. The more the knowledge is represented in a flexible, expressive, and accurate formalism, the more efficient the rule selection is. In the Semantic Web field,1 ontology is considered as the most appropriate representation to express the complexity of the user knowledge, and several specification languages have been proposed.

The authors are with the KOD Team, LINA CNRS 6241, Polytech'Nantes, Site de la Chantrerie, Rue Christian Pauc, BP 50609, 44306 Nantes cedex 3, France. E-mail: {claudia.marinica, fabrice.guillet}@univ-nantes.fr.

Manuscript received 31 Mar. 2009; revised 23 Sept. 2009; accepted 7 Nov. 2009; published online 4 Feb. 2010. Recommended for acceptance by C. Zhang, P.S. Yu, and D. Bell. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDESI-2009-03-0275. Digital Object Identifier no. 10.1109/TKDE.2010.29.

This paper proposes a new interactive postprocessing approach, ARIPSO (Association Rule Interactive postProcessing using Schemas and Ontologies), to prune and filter discovered rules. First, we propose to use Domain Ontologies in order to strengthen the integration of user knowledge in the postprocessing task. Second, we introduce the Rule Schema formalism by extending the specification language proposed by Liu et al. [12] for user beliefs and expectations toward the use of ontology concepts. Furthermore, an interactive and iterative framework is designed to assist the user throughout the analyzing task. The interactivity of our approach relies on a set of rule mining operators defined over the Rule Schemas in order to describe the actions that the user can perform.

This paper is structured as follows: Section 2 introduces notations and definitions used throughout the paper. Section 3 justifies our motivations for using ontologies. Section 4 describes the research domain and reviews related works. Section 5 presents the proposed framework and its elements. Section 6 is devoted to the results obtained by applying our method over a questionnaire database. Finally, Section 7 presents conclusions and shows directions for future research.

2 NOTATIONS AND DEFINITIONS

The association rule mining task can be stated as follows: let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of literals, called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions over $I$. A nonempty subset of $I$ is called an itemset and is defined as $X = \{i_1, i_2, \ldots, i_k\}$. In short, the itemset $X$ can also be denoted as $X = i_1 i_2 \ldots i_k$. For an itemset, the number of items is called the length of the itemset, and an itemset of length $k$ is referred to as a $k$-itemset. Each transaction $t_i$ contains an itemset $i_1 i_2 \ldots i_k$, with a variable number $k$ of items for each $t_i$.

Definition 1. Let $X \subseteq I$ and $T \subseteq D$. We define the set of all transactions that contain the itemset $X$ as:

$t : 2^I \to 2^D, \quad t(X) = \{t \in D \mid X \subseteq t\}.$

Similarly, we describe the itemsets contained in all the transactions $T$ by:

$i : 2^D \to 2^I, \quad i(T) = \{x \in I \mid \forall t \in T, x \in t\}.$
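As an illustration, the two mappings can be sketched in a few lines of Python (a sketch only; the toy transactions and the helper names t_of and i_of are ours):

    # Toy transaction database D; each transaction is a set of item labels.
    D = [
        {"grape", "milk"},           # t1
        {"grape", "pear", "milk"},   # t2
        {"apple", "beef"},           # t3
    ]

    def t_of(X, D):
        """Mapping t: all transactions of D that contain the itemset X."""
        X = set(X)
        return [t for t in D if X <= t]

    def i_of(T):
        """Mapping i: all items shared by every transaction in T."""
        T = list(T)
        return set.intersection(*T) if T else set()

    # t({grape}) returns t1 and t2; i of that result is {grape, milk}.
    print(t_of({"grape"}, D))
    print(i_of(t_of({"grape"}, D)))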

Definition 2. An association rule is an implication $X \Rightarrow Y$, where $X$ and $Y$ are two itemsets and $X \cap Y = \emptyset$. The former, $X$, is called the antecedent of the rule, and the latter, $Y$, is called the consequent.

A rule $X \Rightarrow Y$ is described using two important statistical factors:

. The support of the rule, defined as $supp(X \Rightarrow Y) = supp(X \cup Y) = |t(X \cup Y)| / |D|$, is the ratio of the number of transactions containing $X \cup Y$. If $supp(X \Rightarrow Y) = s$, then $s\%$ of the transactions contain the itemset $X \cup Y$.

. The confidence of the rule, defined as $conf(X \Rightarrow Y) = supp(X \Rightarrow Y) / supp(X) = supp(X \cup Y) / supp(X) = c$, is the ratio ($c\%$) of the transactions that, containing $X$, also contain $Y$.

Starting from a database and two thresholds, minsupp and minconf, for the minimal support and the minimal confidence, respectively, the problem of finding association rules, as discussed in [1], is to generate all rules whose support and confidence are greater than the given thresholds. This problem can be divided into two main subproblems:

. first, all frequent itemsets are extracted. An itemset $X$ is called a frequent itemset in the transaction database $D$ if $supp(X) \geq minsupp$;

. then, for each frequent itemset $X$, the set of rules $X \setminus Y \Rightarrow Y$, with $Y \subset X$, satisfying $conf(X \setminus Y \Rightarrow Y) \geq minconf$ is generated.
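The two factors and the two-step generation can be made concrete with a brute-force Python sketch (illustrative only; it enumerates every itemset instead of using the Apriori candidate pruning, and the threshold values are arbitrary):

    from itertools import combinations

    D = [{"grape", "milk"}, {"grape", "pear", "milk"},
         {"apple", "beef"}, {"grape", "milk", "beef"}]
    I = set().union(*D)

    def supp(X):
        """Fraction of transactions containing the itemset X."""
        X = set(X)
        return sum(1 for t in D if X <= t) / len(D)

    def conf(X, Y):
        """Confidence of the rule X => Y."""
        return supp(X | Y) / supp(X)

    minsupp, minconf = 0.5, 0.8

    # Step 1: extract the frequent itemsets (brute force; Apriori prunes this search).
    frequent = [set(c) for k in range(1, len(I) + 1)
                for c in combinations(sorted(I), k) if supp(c) >= minsupp]

    # Step 2: for each frequent itemset X, emit the rules (X \ Y) => Y reaching minconf.
    rules = [(X - Y, Y) for X in frequent if len(X) > 1
             for r in range(1, len(X))
             for Y in map(set, combinations(sorted(X), r))
             if conf(X - Y, Y) >= minconf]
    print(rules)   # e.g., ({'grape'}, {'milk'}) and ({'milk'}, {'grape'})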

If $X$ is frequent and no superset of $X$ is frequent, $X$ is denoted as a maximal itemset.

Theorem 1. Let $X \subseteq I$ and $T \subseteq D$. Let $c_{it}(X)$ denote the composition of the two mappings, $c_{it}(X) = i \circ t(X) = i(t(X))$. Also, let $c_{ti}(T) = t \circ i(T) = t(i(T))$. Then, $c_{it}$ and $c_{ti}$ are both Galois closure operators [13] on itemsets and on sets of transactions, respectively.

Definition 3. A closed itemset [14] is defined as an itemset $X$ which has the property of being the same as its closure, i.e., $X = c_{it}(X)$. The minimal closed itemset containing an itemset $Y$ is obtained by applying the closure operator $c_{it}$ to $Y$.
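Reusing the t and i mappings sketched after Definition 1, the closure operator and the closedness test become (again a sketch, with hypothetical helper names):

    def closure(X, D):
        """Galois closure c_it(X) = i(t(X)): items common to all transactions containing X."""
        return i_of(t_of(X, D))

    def is_closed(X, D):
        """An itemset is closed iff it equals its own closure."""
        return set(X) == closure(X, D)

    # In the toy database above, {"grape"} is not closed (its closure adds "milk"),
    # whereas {"grape", "milk"} is closed.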

Definition 4. Let $R_1$ and $R_2$ be two association rules. We say that rule $R_1$ is more general than rule $R_2$, denoted $R_1 \succeq R_2$, if $R_2$ can be generated by adding additional items to either the antecedent or the consequent of $R_1$. In this case, we say that a rule $R_j$ is redundant [15] if there exists some rule $R_i$ such that $R_i \succeq R_j$. In consequence, in a collection of rules, the nonredundant rules are the most general ones, i.e., those rules having minimal antecedents and consequents in terms of the subset relation.

Definition 5. A rule set is optimal [6] with respect to an interestingness metric if it contains all the rules except those with no greater interestingness than one of their more general rules. An optimal rule set is a subset of a nonredundant rule set.

Definition 6. Formally, an ontology is a quintuple $O = \{C, R, I, H, A\}$ [16]. $C = \{C_1, C_2, \ldots, C_n\}$ is a set of concepts, and $R = \{R_1, R_2, \ldots, R_m\}$ is a set of relations defined over concepts. $I$ is a set of instances of concepts, and $H$ is a Directed Acyclic Graph (DAG) defined by the subsumption relation (is-a relation, $\succeq$) between concepts. We say that $C_2$ is-a $C_1$, $C_1 \succeq C_2$, if the concept $C_1$ subsumes the concept $C_2$. $A$ is a set of axioms bringing additional constraints on the ontology.
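A minimal Python rendering of such a quintuple, restricted to what is used later in the paper (concepts, the is-a DAG H, and subsumption; relations R and axioms A are omitted), might look as follows; the concept names echo the supermarket example and are purely illustrative:

    from dataclasses import dataclass, field

    @dataclass
    class Ontology:
        concepts: set                                     # C
        instances: dict = field(default_factory=dict)     # I: concept -> set of instances
        is_a: dict = field(default_factory=dict)          # H: child concept -> set of parent concepts

        def subsumes(self, parent, child):
            """True if parent >= child through the is-a hierarchy (reflexive, transitive)."""
            if parent == child:
                return True
            return any(self.subsumes(parent, p) for p in self.is_a.get(child, ()))

    O = Ontology(
        concepts={"FoodItems", "Fruits", "grape", "pear", "apple"},
        is_a={"Fruits": {"FoodItems"}, "grape": {"Fruits"},
              "pear": {"Fruits"}, "apple": {"Fruits"}},
    )
    assert O.subsumes("FoodItems", "grape")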

3 MOTIVATIONS FOR THE GENERAL IMPRESSION IMPROVEMENT USING ONTOLOGIES

Since the early 2000s, in the Semantic Web context, the number of available ontologies has been increasing, covering a wide range of application domains. This can be a great advantage for an ontology-based user knowledge representation.


    1. http://www.w3.org/2001/sw/.


This paper contributes at several levels to reducing the number of association rules. One of our most important contributions relies on using ontologies as the representation of user background knowledge. Thus, we extend the specification language proposed by Liu et al. [17] (General Impressions (GI), Reasonably Precise Concepts (RPC), and Precise Knowledge (PK)) by the use of ontology concepts.

Example. Let us consider the case of General Impressions.

The user might believe that there exist some associations among milk OR cheese, Fruit items, and beef (assume that the user uses the taxonomy in Fig. 1). He/she could specify his/her beliefs using a General Impression:

$gi(\langle \{milk, cheese\}, Fruit, beef \rangle).$

The following rules are examples of association rules that conform to this specification:

$apple \Rightarrow beef \qquad grape, pear, beef \Rightarrow milk.$

But when working directly with database items, developing specifications becomes a complex task. Moreover, it would be very useful for the user to be able to introduce interesting additional information in the GI language. For example, in the market case, it would be very useful to be able to find out whether the customers buying diet products also buy ecological products. In order to select this type of rules, the user should be able to create an RPC such as:

$rpc(DietProducts \Rightarrow EcologicalProducts),$

where $DietProducts$ and $EcologicalProducts$ represent, respectively, the set of products integrated in diets, and the products which are produced in an ecological way. Defining such concepts is not possible using taxonomies.

Starting from the taxonomy presented in Fig. 1, we developed an ontology based on the earlier considerations. We propose to integrate two data properties of Boolean type in order to define the products that are useful in diets (IsDiet) and those that are ecological (IsEcological). Description logic [18], used in designing ontologies, allows concept definition using restrictions on properties. Therefore, the concept DietProducts is defined as a restriction on the FoodItem hierarchy using the data property isDiet, describing the items useful in a diet. Similarly, we define the EcologicalProducts concept.

In our example, the apple and chicken items are diet products, and the milk, grape, and beef items are ecological products. In Fig. 2, we present the structure of the ontology resulting after applying a reasoner; the ontology construction is detailed in Section 5.3.

4 RELATED WORK

4.1 Concise Representations of Frequent Itemsets

Interestingness measures represent metrics in the process of capturing dependencies and implications between database items, and express the strength of the pattern association.

Since frequent itemset generation is considered an expensive operation, mining frequent closed itemsets (a preliminary idea presented in [4]) was proposed in order to reduce the number of frequent itemsets. For example, an itemset $X$ is denoted as a closed frequent itemset if $\nexists$ itemset $X' \supset X$ such that $t(X) = t(X')$. Thus, the number of frequent closed itemsets generated is reduced in comparison with the number of frequent itemsets.

The CLOSET algorithm was proposed in [19] as a new efficient method for mining closed itemsets. CLOSET uses a novel frequent pattern tree (FP-tree) structure, which is a compressed representation of all the transactions in the database. Moreover, it uses a recursive divide-and-conquer and database projection approach to mine long patterns.

Another solution for reducing the number of frequent itemsets is mining maximal frequent itemsets [5]. The authors proposed the MAFIA algorithm based on depth-first traversal and several pruning methods such as Parent Equivalence Pruning (PEP), FHUT, HUTMFI, or Dynamic Reordering. However, the main drawback of the methods extracting maximal frequent itemsets is the loss of information: because subset frequency is not available, generating rules is not possible.

4.2 Redundancy Reduction of Association Rules

Conversely, generating all association rules that satisfy the confidence threshold is a combinatorial problem.

Zaki and Hsiao used frequent closed itemsets in the CHARM algorithm [20] in order to generate all frequent closed itemsets. They used an itemset-tidset search tree and pursued the aim of generating a small nonredundant rule set [7]. To this end, the authors first found the minimal generators of closed itemsets, and then they generated nonredundant association rules using two closed itemsets.

Pasquier et al. [8] proposed the Close algorithm in order to extract association rules. The Close algorithm is based on a new mining method: pruning of the closed set lattice (closed itemset lattice) in order to extract frequent closed itemsets. Association rules are then generated from frequent itemsets derived from the frequent closed itemsets. Nevertheless, Zaki and Hsiao [20] proved that their CHARM algorithm outperforms the CLOSET, Close, and MAFIA algorithms.


    Fig. 1. Supermarket item taxonomy [12].

Fig. 2. Visualization of the ontology created based on the supermarket item taxonomy.


More recently, Li [6] proposed optimal rule sets, defined with respect to an interestingness metric. An optimal rule set contains all rules except those with no greater interestingness than one of their more general rules.

A set of reduction techniques for redundant rules was proposed and implemented in [21]. The developed techniques are based on the generalization/specification of the antecedent/consequent of the rules, and they are divided into methods for multiantecedent rules and multiconsequent rules.

Hahsler et al. [22] were interested in the idea of generating association rules from arbitrary sets of itemsets. This makes it possible for a user to propose a set of itemsets and to integrate another set generated by a data mining tool. In order to generate rules, a support counter is needed; consequently, the authors proposed an adequate data structure which provides fast access: prefix trees.

Toivonen et al. proposed in [9] a novel technique for redundancy reduction based on rule covers. The notion of rule cover is defined as a subset of a rule set describing the same set of database transactions as the whole rule set. The authors developed an algorithm to efficiently extract a rule cover out of a set of given rules.

The notion of subsumed rules, discussed in [23], describes a set of rules having the same consequent and several additional conditions in the antecedent with respect to a certain rule. Bayardo, Jr., et al. [24] proposed a new pruning measure (Minimum Improvement), described as the difference between the confidences of two rules in a specification/generalization relationship. The specific rule is pruned if the proposed measure is less than a prespecified threshold, so that the rule does not bring more information compared to the general one.

Nevertheless, both closed and maximal itemset mining still break down at low support thresholds. To address these limitations, Omiecinski proposed in [25] three new important interestingness measures: any-confidence, all-confidence, and bond. All these measures are indicators of the degree of relatedness between the items in an association. The most interesting one, all-confidence, introduced as an alternative to support, represents the minimum confidence of all association rules extracted from an itemset. Bond is also similar to support, but with respect to a subset of the data rather than the entire database.

4.3 User-Driven Association Rule Mining

Interestingness measures were proposed in order to discover only those association rules that are interesting according to these measures. They have been divided into objective measures and subjective measures. Objective measures depend only on the data structure. Many survey papers summarize and compare the objective measure definitions and properties [26], [27]. Unfortunately, being restricted to data evaluation, objective measures are not sufficient to reduce the number of extracted rules and to capture the interesting ones. Several approaches integrating user knowledge have been proposed.

In addition, subjective measures were proposed to explicitly integrate the decision-maker's knowledge and to offer a better selection of interesting association rules. Silberschatz and Tuzhilin [3] proposed a classification of subjective measures into unexpectedness (a pattern is interesting if it is surprising to the user) and actionability (a pattern is interesting if it can help the user take some actions).

As early as 1994, in the KEFIR system [28], the key finding and deviation notions were suggested. Grouped in findings, deviations represent the difference between the actual and the expected values. KEFIR defines the interestingness of a key finding in terms of the estimated benefits and potential savings of taking corrective actions that restore the deviation back to its expected value. These corrective actions are specified in advance by the domain expert for various classes of deviations.

Later, Klemettinen et al. [29] proposed templates to describe the form of interesting rules (inclusive templates) and of uninteresting rules (restrictive templates). The idea of using templates for association rule extraction was reused in [30]. Other approaches proposed to use a rule-like formalism to express user expectations [3], [12], [31], and the discovered association rules are pruned/summarized by comparing them to the user expectations.

Imielinski et al. [32] proposed a query language for association rule pruning based on SQL, called M-SQL. It allows imposing constraints on the condition and/or the consequent of the association rules. In the same domain of query-based association rule pruning, but more constraint-driven, Ng et al. [33] proposed an architecture for exploratory mining of rules. The authors suggested a set of solutions for several problems: the lack of user exploration and control, the rigid notion of relationship, and the lack of focus. In order to overcome these problems, Ng et al. proposed a new query language called Constrained Association Query, and they pointed out the importance of user feedback and user flexibility in choosing interestingness metrics.

Another related approach was proposed by An et al. in [34], where the authors introduced domain knowledge in order to prune and summarize discovered rules. The first algorithm uses a data taxonomy, defined by the user, in order to describe the semantic distance between rules and to group the rules. The second algorithm groups the discovered rules that share at least one item in the antecedent and the consequent.

In 2007, a new methodology was proposed in [35] to prune and organize rules with the same consequent. The authors suggested transforming the database into an association rule base in order to extract second-level association rules. Called metarules, the extracted rules $r_1 \Rightarrow r_2$ express relations between two association rules and help in pruning/grouping discovered rules.

4.4 Ontologies in Data Mining

In the knowledge engineering and Semantic Web fields, ontologies have interested researchers since their first proposition in the philosophy branch by Aristotle. Ontologies have evolved over the years from controlled vocabularies to thesauri (glossaries), and later, to taxonomies [36].

In the early 1990s, an ontology was defined by Gruber as a formal, explicit specification of a shared conceptualization [37]. By conceptualization, we understand here an abstract model of some phenomenon described by its important concepts. The formal notion denotes the idea that machines should be able to interpret an ontology. Moreover, explicit refers to the transparent definition of ontology elements. Finally, shared outlines that an ontology brings together some knowledge common to a certain group, and not individual knowledge.

    Several other definitions are proposed in the literature.

    For instance, in [38], an ontology is viewed as a logical theory


accounting for the intended meaning of a formal vocabulary and, later, in 2001, Maedche and Staab proposed a more artificial-intelligence-oriented definition. Thus, ontologies are described as (meta)data schemas, providing a controlled vocabulary of concepts, each with an explicitly defined and machine-processable semantics [16].

Depending on the granularity, four types of ontologies are proposed in the literature: upper (or top-level) ontologies, domain ontologies, task ontologies, and application ontologies [38]. Top-level ontologies deal with general concepts, while the other three types deal with domain-specific concepts.

Ontologies, introduced in data mining for the first time in the early 2000s, can be used in several ways [39]: Domain and Background Knowledge Ontologies, Ontologies for the Data Mining Process, or Metadata Ontologies. Background Knowledge Ontologies organize domain knowledge and play important roles at several levels of the knowledge discovery process. Ontologies for the Data Mining Process codify the mining process description and choose the most appropriate task according to the given problem, while Metadata Ontologies describe the construction process of items.

In this paper, we focus on Domain and Background Knowledge Ontologies. The first idea of using Domain Ontologies was introduced by Srikant and Agrawal with the concept of Generalized Association Rules (GAR) [40]. The authors proposed taxonomies of mined data (an is-a hierarchy) in order to generalize/specify rules.

In [41], it is suggested that an ontology of background knowledge can benefit all the phases of a KDD cycle described in CRISP-DM. The role of ontologies depends on the given mining task and method, and on data characteristics. From business understanding to deployment, the authors delivered a complete example of using ontologies in a cardiovascular risk domain.

Related to Generalized Association Rules, the notion of raising was presented in [42]. Raising is the operation of generalizing rules (making rules more abstract) in order to increase support while keeping confidence high enough. This allows strong rules to be discovered, and also obtains sufficient support for rules that, before raising, would not have had minimum support due to the particular items they referred to. The difference with Generalized Association Rules is that this solution proposes to use a specific level for raising and mining.

Another contribution, very close to [40], [41], uses taxonomies to generalize and prune association rules. The authors developed an algorithm, called GART [43], which, given several taxonomies over attributes, iteratively uses each taxonomy in order to generalize rules, and then prunes redundant rules at each step.

A very recent approach, [44], uses ontologies in a preprocessing step. Several domain-specific and user-defined constraints are introduced and grouped into two types: pruning constraints, meant to filter uninteresting items, and abstraction constraints, permitting the generalization of items toward ontology concepts. The data set is first preprocessed according to the constraints extracted from the ontology, and then the data mining step takes place. The difference with our approach is that, first, they apply constraints in the preprocessing task, whereas we work in the postprocessing task. The advantage of the pruning constraints is that they permit excluding from the start the information that the user is not interested in, thus permitting to apply the Apriori algorithm to this new database. Let us consider that the user is not sure about which items he/she should prune. In this case, he/she should create several pruning tests, and for each test, he/she will have to apply the Apriori algorithm, whose execution time is very high. Second, they use SeRQL in order to express user knowledge, whereas we propose a more expressive and flexible language for user expectation representation, i.e., Rule Schemas.

The item-relatedness filter was proposed by Natarajan and Shekar [45]. Starting from the idea that the discovered rules are generally obvious, they introduced the idea of relatedness between items, measuring their similarity according to item taxonomies. This measure computes the relatedness of all the couples of rule items. We can notice that we can compute the relatedness for the items of the condition and/or the consequent, or between the condition and the consequent of the rule.

While Natarajan and Shekar measure the item-relatedness of an association rule, Garcia et al. developed in [46] and extended in [47] a novel technique called Knowledge Cohesion (KC). The proposed metric is composed of two new ones: Semantic Distance (SD) and Relevance Assessment (RA). SD measures how close two items are semantically, using the ontology, each type of relation being weighted differently. The numerical value RA expresses the interest of the user for certain pairs of items, in order to encourage the selection of rules containing those pairs. In this paper, the ontology is used only for the SD computation, differing from our approach, which uses ontologies for Rule Schema definition. Moreover, the authors propose a metric-based approach for itemset selection, while we propose a pruning/filtering schema-based method for association rules.

5 DESCRIPTION OF THE ARIPSO FRAMEWORK

The proposed approach is composed of two main parts (as shown in Fig. 3). First, the knowledge base allows formalizing user knowledge and goals. Domain knowledge offers a general view over user knowledge in the database domain, and user expectations express the prior user knowledge about the discovered rules. Second, the postprocessing task consists in iteratively applying a set of filters over the extracted rules in order to extract interesting rules: the minimum improvement constraint filter, the item-relatedness filter, and rule schema filters/pruning.

The novelty of this approach resides in supervising the knowledge discovery process using two different conceptual structures for user knowledge representation: one or several ontologies and several rule schemas generalizing general impressions, and in proposing an iterative process.

Fig. 3. Framework description.

5.1 Interactive Postmining Process

The ARIPSO framework proposes to the user an interactive process of rule discovery, presented in Fig. 4. Taking his/her feedback into account, the user is able to revise his/her expectations in light of intermediate results. The framework suggests the following steps to the user:

1. ontology construction: starting from the database, and possibly from existing ontologies, the user develops an ontology over the database items;

2. defining Rule Schemas (as GIs and RPCs): the user expresses his/her local goals and expectations concerning the association rules that he/she wants to find;

3. choosing the right operators to be applied over the created rule schemas, and then applying the operators;

4. visualizing the results: the filtered association rules are proposed to the user;

5. selection/validation: starting from these preliminary results, the user can validate the results or revise his/her information;

6. we propose to the user two filters already existing in the literature and detailed in Section 5.5. These two filters can be applied over the rules whenever the user needs them, with the main goal of reducing the number of rules; and

7. the interactive loop allows the user to revise the information that he/she provided. Thus, he/she can return to step 2 in order to modify the rule schemas, or return to step 3 in order to change the operators. Moreover, in the interactive loop, the user can decide to apply one of the two predefined filters discussed in step 6.

5.2 Improving General Impressions with Ontologies

One existing approach interests us in particular: the specification language proposed by Liu et al. [17]. The authors proposed to represent user expectations in terms of the discovered rules using three levels of specification: General Impressions and Reasonably Precise Concepts, representing the user's vague feelings, and finally, his/her Precise Knowledge.

The authors developed a representation formalism which is very close to the association rule formalism, flexible enough, and comprehensible for the user. For the case of General Impressions, the authors proposed the following syntax:

$gi(\langle S_1, S_2, \ldots, S_m \rangle)\ [support, confidence],$

where $S_i$ is an element of an item taxonomy or an expression over taxonomy elements built with the language's operators, and the support and confidence thresholds are optional.

In the GI formalism, we can remark that the user knows that a set of items is associated, but he/she does not know the direction of the implication, i.e., which items he/she would put in the antecedent and which ones in the consequent. This is the main difference between GIs and RPCs: RPCs are able to describe a complete implication. PKs use the same formalism as RPCs, adding obligatory constraints of support and confidence.

Moreover, the authors proposed to filter four types of rules: conforming rules and rules unexpected with respect to the antecedent and/or the consequent:

. Conforming rules: association rules that conform to the specified beliefs;

. Unexpected antecedent rules: association rules that are unexpected regarding the antecedent of the specified beliefs;

. Unexpected consequent rules: association rules that are unexpected regarding the consequent of the specified beliefs; and

. Both-side unexpected rules: association rules that are unexpected regarding both the antecedent and the consequent of the specified beliefs.

To improve association rule selection, we propose a new rule filtering model, called Rule Schemas (RS). A rule schema describes, in a rule-like formalism, the user expectations in terms of interesting/obvious rules. As a result, Rule Schemas act as a rule grouping, defining rule families.

The Rule Schema formalism is based on the specification language for user knowledge introduced by Liu et al. [12]. The model proposed by Liu et al. is described using elements from an item taxonomy, allowing an is-a organization of database attributes. Using item taxonomies has many advantages: the representation of user expectations is more general, and thus the filtered rules are more interesting for the user.

However, a taxonomy of items might not be enough. The user might want to use concepts that are more expressive and accurate than generalized concepts and that result from relationships other than the is-a relation (e.g., IsEcological, IsCookedWith). This is why we have considered that the use of ontologies would be more appropriate. An ontology includes the features of taxonomies but adds more representation power. In a taxonomy, the means for subject description consist essentially of one relationship: the subsumption relationship used to build the hierarchy. The set of items is open, but the language used to describe them is closed [48], being based on a single relationship (the subsumption). Thus, a taxonomy is simply a hierarchical categorization or classification of items in a domain. On the contrary, an ontology is a specification of several characteristics of a domain, defined using an open vocabulary.

In addition, it is difficult for a domain expert to know exactly the support and confidence thresholds for each proposed rule schema, because of their statistical definition. That is why we consider that using Precise Knowledge in user expectation representation might be useless. Thus, we propose to improve only two of the three representations introduced in [12]: General Impressions and Reasonably Precise Concepts.

Fig. 4. Interactive process description.

Therefore, Rule Schemas bring the expressiveness of ontologies into the postprocessing task of association rules, combining not only item constraints, but also ontology concept constraints.

Definition 7. A Rule Schema expresses the fact that the user expects certain elements to be associated in the extracted association rules. This can be expressed as

$RS(\langle X_1, \ldots, X_n \rangle \Rightarrow \langle Y_1, \ldots, Y_m \rangle),$

where $X_i, Y_j \in C$ of $O = \{C, R, I, H, A\}$ and the implication is optional. In other words, we can note that the proposed formalism combines General Impressions and Reasonably Precise Concepts. If we use the formalism with the implication, an implicative Rule Schema is defined, extending the RPC. On the other hand, if we do not keep the implication, we define nonimplicative Rule Schemas, generalizing the GI.

Example. Let us consider the taxonomy in Fig. 1. Let us develop an ontology based on this taxonomy containing the description of the concepts BioProducts and EcologicalProducts, as shown in Fig. 2. Thus, we can define a nonimplicative Rule Schema:

$RS_1(\langle BioProducts, EcologicalProducts \rangle),$

and an implicative Rule Schema:

$RS_2(BioProducts \Rightarrow EcologicalProducts).$

5.3 Ontology Description

Domain knowledge, defined as the user information concerning the database, is described in our framework using ontologies.

Compared to the taxonomies used in the specification language proposed in [12], ontologies offer a more complex knowledge representation model by extending the single is-a relation present in a taxonomy with the set $R$ of relations. In addition, the axioms bring important improvements, permitting concept definition starting from existing information in the ontology.

In this scenario, it is fundamental to connect the ontology concepts $C$ of $O = \{C, R, I, H, A\}$ to the database, each of them being connected to one or several items of $I$. To this end, we consider three types of concepts: leaf-concepts, generalized concepts (from the subsumption relation $\succeq$ in $H$ of $O$), and restriction concepts, proposed only by ontologies. In order to proceed with the definition of each type of concept, let us recall that the set of items in a database is defined as $I = \{i_1, i_2, \ldots, i_n\}$.

The leaf-concepts ($C_0$) are the concepts that do not subsume any other concept:

$C_0 = \{c_0 \in C \mid \nexists c'_0 \in C, \ c_0 \succ c'_0\}.$

They are connected to the database in the simplest way: each concept from $C_0$ is associated to one item in the database:

$f_0 : C_0 \to I, \quad \forall c_0 \in C_0, \ \exists i \in I, \ i = f_0(c_0).$

Generalized concepts ($C_1$) are described as the concepts that subsume other concepts in the ontology. A generalized concept is connected to the database through its subsumed concepts. This means that, recursively, only the leaf-concepts subsumed by the generalized concept contribute to its database connection:

$f : C_1 \to 2^I, \quad \forall c_1 \in C_1, \ f(c_1) = \bigcup_{c_0 \in C_0} \{\, i = f_0(c_0) \mid c_0 \preceq c_1 \,\}.$

Restriction concepts are described using logical expressions defined over items and are organized in the $C_2$ subset. In a first attempt, we base the description of these concepts on restrictions over properties, available in description logics. Thus, a restriction concept could be connected to a disjunction of items.

Example. In order to explain restriction concepts, let us consider the database presented in Fig. 5, described by three transactions. Moreover, let us consider the ontology presented in the same figure as the ontology constructed over the items of the database and described as follows.

The concepts of the ontology are

$C = \{FoodItems, Fruits, DailyProducts, Meat, DietProducts, EcologicalProducts, \ldots\}.$

And the three types of concepts are:

$LeafConcepts: \{grape, pear, apple, milk, cheese, butter, beef, chicken, pork\};$

$GeneralizedConcepts: \{Fruits, DailyProducts, Meat, FoodItems\};$

$RestrictionConcepts: \{DietProducts, EcologicalProducts\}.$

Two data properties are also integrated in order to define whether a product is useful for a diet, or is ecological. For example, the DietProducts restriction concept is described in the description logics language by

$DietProducts \equiv FoodItems \sqcap \exists isDiet.TRUE,$


    Fig. 5. Ontology description.


defining all food items whose Boolean property isDiet is TRUE. For our example, isDiet is instantiated as follows:

$isDiet : \{(apple, TRUE), (chicken, TRUE)\}.$

Now, we are able to connect the ontology and the database. As already presented, leaf-concepts are connected to items in a very simple way; for example, the concept grape is connected to the item of the same name: $f_0(grape) = grape$.

On the contrary, the generalized concept Fruits is connected through its three subsumed concepts:

$f(Fruits) = \{grape, pear, apple\}.$

Similarly, we can describe the connection for the other concepts.

More interestingly, the restriction concept DietProducts is connected through the concepts satisfying the restrictions in its definition. Thus, DietProducts is connected through the concepts apple and chicken:

$f(DietProducts) = \{apple, chicken\}.$
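A compact Python sketch of the three connection functions follows (a sketch under our own naming; the single-level is-a map and the isDiet property mirror the toy ontology of Fig. 5):

    # f0: each leaf concept is mapped to exactly one database item.
    f0 = {c: c for c in ["grape", "pear", "apple", "milk", "cheese",
                         "butter", "beef", "chicken", "pork"]}

    # One-level is-a links from leaf concepts to generalized concepts (simplified).
    is_a = {"grape": "Fruits", "pear": "Fruits", "apple": "Fruits",
            "beef": "Meat", "chicken": "Meat", "pork": "Meat"}

    # Boolean data property used by the restriction concept DietProducts.
    is_diet = {"apple": True, "chicken": True}

    def f_generalized(concept):
        """Generalized concept -> items of the leaf concepts it subsumes."""
        return {f0[leaf] for leaf, parent in is_a.items() if parent == concept}

    def f_restriction(prop):
        """Restriction concept 'exists prop.TRUE' -> items whose property holds."""
        return {f0[leaf] for leaf in f0 if prop.get(leaf, False)}

    print(f_generalized("Fruits"))    # {'grape', 'pear', 'apple'}
    print(f_restriction(is_diet))     # {'apple', 'chicken'}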

5.4 Operations over Rule Schemas

The rule schema filter is based on operators applied over rule schemas, allowing the user to perform several actions over the discovered rules. We propose two important operators: a pruning operator and filtering operators. The filtering operator is composed of three different operators: conforming, unexpectedness, and exception. We reuse the operators proposed by Liu et al., conforming and unexpectedness, and we bring two new operators into the postprocessing task: pruning and exceptions.

These four operators are presented in this section. To this end, let us consider an implicative rule schema $RS_1(X \Rightarrow Y)$, a nonimplicative rule schema $RS_2(\langle U, V \rangle)$, and an association rule $AR_1: A \Rightarrow B$, where $X$, $Y$, $U$, and $V$ are ontology concepts, and $A$ and $B$ are itemsets.

Definition 8. Let us consider an ontology concept $C$ associated in the database to $f(C) = \{y_1, \ldots, y_n\}$, and an itemset $X = \{x_1, \ldots, x_k\}$. We say that the itemset $X$ is conforming to the concept $C$ if $\exists y_i, \ y_i \in X$.
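Definition 8 translates directly into a membership test; in the sketch below, f maps a concept name to its set of connected items, as in Section 5.3 (the function name is ours):

    def conforming(itemset, concept, f):
        """True if the itemset contains at least one item connected to the concept."""
        return any(item in itemset for item in f[concept])

    f = {"Fruits": {"grape", "apple", "pear"},
         "EcologicalProducts": {"grape", "milk"}}
    print(conforming({"grape", "beef"}, "Fruits", f))        # True
    print(conforming({"apple"}, "EcologicalProducts", f))    # False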

Pruning. The pruning operator allows the user to remove families of rules that he/she considers uninteresting. In most databases, there exist relations between items that we consider obvious or that we already know. Thus, it is not useful to find these relations among the discovered associations. The pruning operator applied over a rule schema, $P(RS)$, eliminates all association rules matching the rule schema. To extract all the rules matching a rule schema, the conforming operator is used.

Conforming. The conforming operator applied over a rule schema, $C(RS)$, confirms an implication or finds the implication between several concepts. As a result, rules matching all the elements of a nonimplicative rule schema are filtered. For an implicative rule schema, the condition and the conclusion of the association rule should match those of the schema.

Example. The rule $AR_1$ is selected by the operator $C(RS_1)$ if the condition and the conclusion of the rule $AR_1$ are conforming to the condition and, respectively, the conclusion of $RS_1$. Translating this description into the ontological definition of concepts, $AR_1$ is conforming to $RS_1$ if the itemset $A$ is conforming to the concept $X$ and the itemset $B$ is conforming to the concept $Y$.

Similarly, the rule $AR_1$ is filtered by $C(RS_2)$ if the condition and/or the conclusion of the rule $AR_1$ are conforming to the schema $RS_2$. In other words, if the itemset $A \cup B$ is conforming to the concept $U$ and the itemset $A \cup B$ is conforming to the concept $V$, then the rule $AR_1$ is conforming to the nonimplicative rule schema $RS_2$.

Unexpectedness. Of higher interest for the user, the unexpectedness operator $U(RS)$ filters a set of rules with a surprise effect for the user. This type of rules interests the user more than the conforming ones since, generally, a decision-maker seeks to discover new knowledge with regard to his/her prior knowledge.

Moreover, several types of unexpected rules can be filtered according to the rule schema: rules unexpected regarding the antecedent, $U_p$; rules unexpected regarding the consequent, $U_c$; and rules unexpected regarding both sides, $U_b$.

For instance, let us consider that the operator $U_p(RS_1)$ extracts the rule $AR_1$, which is unexpected according to the condition of the rule schema $RS_1$. This is possible if the rule consequent $B$ is conforming to the concept $Y$, while the condition itemset $A$ is not conforming to the concept $X$.

In a similar way, we define the two other unexpectedness operators.
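With the conforming test and the f mapping sketched above, and rules represented as (antecedent, consequent) pairs of itemsets, the filtering and pruning operators over an implicative rule schema X => Y can be sketched as:

    def C(rules, X, Y, f):
        """Conforming operator: both sides of the rule match the schema X => Y."""
        return [(A, B) for A, B in rules
                if conforming(A, X, f) and conforming(B, Y, f)]

    def Up(rules, X, Y, f):
        """Unexpected antecedent: consequent matches Y, but antecedent does not match X."""
        return [(A, B) for A, B in rules
                if conforming(B, Y, f) and not conforming(A, X, f)]

    def Uc(rules, X, Y, f):
        """Unexpected consequent: antecedent matches X, but consequent does not match Y."""
        return [(A, B) for A, B in rules
                if conforming(A, X, f) and not conforming(B, Y, f)]

    def P(rules, X, Y, f):
        """Pruning operator: discard the rules selected by the conforming operator."""
        return [(A, B) for A, B in rules
                if not (conforming(A, X, f) and conforming(B, Y, f))]

On the six rules R1-R6 of the example below, with X = Fruits and Y = EcologicalProducts, C selects R1 and R3, Up selects R5 and R6, and Uc selects R2 and R4, matching the selection reported in the text.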

Exceptions. Finally, the exception operator is defined only over implicative rule schemas (e.g., $RS_1$) and extracts the rules conforming to the following new implicative rule schema: $X \wedge Z \Rightarrow \neg Y$, where $Z$ is a set of items.

Example. Let us consider the implicative rule schema $RS(Fruits \Rightarrow EcologicalProducts)$, where

$f(Fruits) = \{grape, apple, pear\}$

and

$f(EcologicalProducts) = \{grape, milk\},$

and $I = \{grape, apple, pear, milk, beef\}$ (see Fig. 1 for the supermarket taxonomy). Also, let us consider that the following set of association rules is extracted by traditional techniques:

$R_1: grape, beef \Rightarrow milk, pear;$
$R_2: apple \Rightarrow beef;$
$R_3: apple, pear, milk \Rightarrow grape;$
$R_4: grape, pear \Rightarrow apple;$
$R_5: beef \Rightarrow grape;$
$R_6: milk, beef \Rightarrow grape.$

Thus, the operator $C(RS)$ filters the rules $R_1$ and $R_3$, the operator $U_p(RS)$ filters the rules $R_5$ and $R_6$, and the operator $U_c(RS)$ filters the rules $R_2$ and $R_4$. The pruning operator $P(RS)$ prunes the rules selected by the conforming operator $C(RS)$. Let us explain the operator $U_c(RS)$: the $U_c$ operator filters the rules whose conclusion itemset is not conforming to the conclusion concept of the RS (EcologicalProducts) and whose condition itemset is conforming to the condition concept of the RS (Fruits). The rule $R_4$ is filtered by $U_c(RS)$ because the itemset $\{apple\}$ does not contain an item corresponding to the EcologicalProducts concept, $apple \notin f(EcologicalProducts)$, and because the itemset $\{grape, pear\}$ contains at least one item corresponding to the Fruits concept, $pear \in f(Fruits)$.

5.5 Filters

In order to reduce the number of rules, three filters are integrated in the framework: the operators applied over rule schemas, the minimum improvement constraint filter [24], and the item-relatedness filter [45].

The Minimum Improvement Constraint Filter [24] (MICF) selects only those rules whose confidence is greater, by at least minimp, than the confidence of any of their simplifications.

Example. Let us consider the following three association rules:

$grape, pear \Rightarrow milk \quad (Confidence = 85\%);$
$grape \Rightarrow milk \quad (Confidence = 90\%);$
$pear \Rightarrow milk \quad (Confidence = 83\%).$

We can note that the last two rules are simplifications of the first one. The theory of Bayardo et al. tells us that the first rule is interesting only if its confidence improves on the confidence of all its simplifications. In our case, the first rule does not improve on the 90 percent confidence of the best of its simplifications (the second rule), so it is not considered an interesting rule and it is not selected.
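A sketch of this filter, under the simplifying assumptions that every rule is given as (antecedent, consequent, confidence) with frozenset sides and that the simplifications of a rule (same consequent, smaller antecedent) are present in the mined set:

    def micf(rules, minimp=0.0):
        """Keep a rule only if its confidence beats every simplification by at least minimp."""
        kept = []
        for A, B, c in rules:
            simpler = [c2 for A2, B2, c2 in rules if A2 < A and B2 == B]
            if all(c >= c2 + minimp for c2 in simpler):
                kept.append((A, B, c))
        return kept

    rules = [(frozenset({"grape", "pear"}), frozenset({"milk"}), 0.85),
             (frozenset({"grape"}), frozenset({"milk"}), 0.90),
             (frozenset({"pear"}), frozenset({"milk"}), 0.83)]
    # The first rule is dropped: 0.85 does not improve on the 0.90 of its best simplification.
    print(micf(rules, minimp=0.01))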

The item-relatedness filter (IRF) was proposed by Shekar and Natarajan [45]. Starting from the idea that the discovered rules are generally obvious, they introduced the idea of relatedness between items, measuring their semantic distance in item taxonomies. This measure computes the relatedness of all the couples of rule items. We can notice that we can compute the relatedness for the items of the condition and/or the consequent, or between the condition and the consequent of the rule.

In our approach, we use the last type of item-relatedness because users are interested in finding associations between itemsets with different functionalities, coming from different domains. This measure is computed as the minimum distance between the condition items and the consequent items, as presented hereafter.

The distance between each pair of items from the condition and, respectively, the consequent is computed as the length of the shortest path connecting the two items in the ontology, denoted $d(a, b)$. Thus, the item-relatedness (IR) of a rule is defined as the minimum over all the distances computed between the items in the condition and those in the consequent:

$RA_1 : A \Rightarrow B, \quad IR(RA_1) = \min d(a_i, b_j), \ \forall a_i \in A \ \text{and} \ b_j \in B.$

Example. Let us consider the ontology in Fig. 2. For the association rule $R_1$, we can compute the item-relatedness as follows:

$R_1 : grape, pear, butter \Rightarrow milk;$
$IR(R_1) = \min(d(grape, milk), d(pear, milk), d(butter, milk)) = \min(4, 4, 2) = 2.$
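A sketch of the IR computation, assuming the ontology is available as a child-to-parent map and that d(a, b) is the length of the shortest path between two items through their lowest common ancestor (the names and the mini-taxonomy are ours):

    parent = {"grape": "Fruits", "pear": "Fruits", "apple": "Fruits",
              "milk": "DailyProducts", "butter": "DailyProducts",
              "Fruits": "FoodItems", "DailyProducts": "FoodItems"}

    def ancestors(x):
        """Path from x up to the root, including x."""
        path = [x]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    def d(a, b):
        """Length of the shortest path between a and b through the hierarchy."""
        pa, pb = ancestors(a), ancestors(b)
        common = next(c for c in pa if c in pb)
        return pa.index(common) + pb.index(common)

    def item_relatedness(antecedent, consequent):
        """IR of a rule: minimum distance between antecedent and consequent items."""
        return min(d(a, b) for a in antecedent for b in consequent)

    # grape, pear, butter => milk : min(4, 4, 2) = 2, as in the worked example.
    print(item_relatedness({"grape", "pear", "butter"}, {"milk"}))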

6 EXPERIMENTAL STUDY

This study is based on a questionnaire database, provided by Nantes Habitat,2 dealing with customer satisfaction concerning accommodation. The database consists of an annual study (since 2003) performed by Nantes Habitat on a sample of 1,500 out of a total of 50,000 customers.

The questionnaire consists of 67 different questions with four possible answers expressing the degree of satisfaction: very satisfied, quite satisfied, rather not satisfied, and dissatisfied, coded as {1, 2, 3, 4}.

Table 1 introduces a sample of questions with the meaning of each. For instance, the item q1_1 describes that a customer is very satisfied with the transport in his/her district (q1: Is your district transport practical?).

In order to target the most interesting rules, we fixed a minimum support of 2 percent, a maximum support of 30 percent, and a minimum confidence of 80 percent for the association rule mining process. Among the available algorithms, we use the Apriori algorithm in order to extract association rules, and 358,072 rules are discovered.
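One way to reproduce a comparable mining configuration is sketched below, assuming the questionnaire has been one-hot encoded into a boolean DataFrame with one column per (question, answer) item such as q2_1; the mlxtend package is used here only as one convenient Apriori implementation, and the tiny DataFrame is a stand-in for the real data:

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # Tiny stand-in for the one-hot encoded questionnaire (one column per item).
    df = pd.DataFrame({"q2_1": [True, True, False, True],
                       "q3_1": [True, True, False, False],
                       "q70_1": [True, True, False, True]})

    frequent = apriori(df, min_support=0.02, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.80)
    rules = rules[rules["support"] <= 0.30]   # emulate the 30 percent maximum-support cap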

For example, the following association rule describes the relationship between the questions q2, q3, q47, and the question q70. Thus, if the customers are very satisfied by the access to the city center (q2), the shopping facilities (q3), and the apartment ventilation (q47), then they can be satisfied by the documents received from the Nantes Habitat Agency (q70), with a confidence of 85.9 percent:

$R_1 : q2\_1 \wedge q3\_1 \wedge q47\_1 \Rightarrow q70\_1, \quad Support = 15.2\%, \ Confidence = 85.9\%.$

6.1 Ontology Structure and Ontology-Database Mapping

In the first step of the interactive process described in Section 5.1, the user develops an ontology over the database items. In our case, starting from the database attributes, the ontology was created by the Nantes Habitat expert. During several sessions, we discussed the database attributes with the expert and asked her to classify them.


TABLE 1
Examples of Questions and Meaning

    2. http://www.nantes-habitat.fr/.


Moreover, we elicited other interesting information by asking her to develop her expectations and knowledge connected to the database attributes. In this section, we present the development of the ontology in our case study.

6.1.1 Conceptual Structure of the Ontology

To describe the ontology, we use the Semantic Web representation language OWL-DL [49]. Based on description logics, the OWL-DL language permits, along with the ontological structure, the creation of restriction concepts using necessary and sufficient conditions over other concepts. Also, we use the Protégé [50] software to edit the ontology and validate it. The Jambalaya [51] environment was used for ontology graph exploration.

During several exchanges with the Nantes Habitat expert, she developed an ontology composed of two main parts, a sample of which is presented in Fig. 6. The ontology has seven depth levels and a total of 130 concepts, among which 113 are primitive concepts and 17 are restriction concepts. Concerning siblings, the concepts have a mean of six child concepts, with a maximum of 13 child concepts. Moreover, two data properties are introduced.

The first part of the ontology is a database item organization with the root defined by the Attribute concept, grouping 113 subsumed concepts. The items are organized according to the question topics in the Nantes Habitat questionnaire. For instance, the District concept regroups 14 questions (from q1 to q14) concerning the facilities and the quality of life in a district.

The second hierarchy, Topics, regroups all 17 restriction concepts created by the expert using necessary and sufficient conditions over primitive concepts.

Moreover, the subsumption relation ($\succeq$) is completed by the relation hasAnswer, associating the Attribute concepts to an integer from {1, 2, 3, 4}, simulating the attribute-value relation in the database.

For instance, let us consider the restriction concept SatisfactionDistrict. In natural language, it expresses the satisfied answers of clients to the questions concerning the district. In other words, an item is an instance of the SatisfactionDistrict concept if it represents a question between q1 and q14, subsumed by the District concept, with a satisfied answer (1 or 2). The SatisfactionDistrict restriction concept is described in the description logics language by

$SatisfactionDistrict \equiv District \sqcap (\exists hasAnswer.1 \ OR \ \exists hasAnswer.2).$

6.1.2 Ontology-Database Mapping

As a part of rule schemas, ontology concepts are mapped to database items. Thus, several connections between the ontology and the database can be designed. Due to implementation requirements, the ontology and the database are mapped through instances.

The ontology-database connection is made manually by the expert. In our case, with 67 attributes and four values, the expert did not meet any problem in realizing the connection, but we agree that for large databases, a manual connection could be very time-consuming. That is why integrating an automatic ontology construction plug-in into our tool is one of our principal perspectives.

Thus, using the simplest ontology-database mapping, the expert directly connected one instance of the ontology to an item (semantically, the nearest one). For example, the expert connected the instance Q11_1 to the item q11_1: $f_0(Q11\_1) = q11\_1$.

Then, the leaf concepts ($C_0$) of the Attribute hierarchy were connected by the expert to a set of items (semantically, the nearest ones). Considering the concept Q11 of the ontology, it is associated to the attribute q11. Furthermore, the concept Q11 has two instances describing the question q11 with two possible answers, 1 and 3. Let us consider that the concept Q11 was connected by the expert to two items as follows: $f(Q11) = \{f_0(Q11\_1), f_0(Q11\_3)\} = \{q11\_1, q11\_3\}$. The connection of generalized concepts follows the same idea.

A second type of connection involves connecting concepts of the Topics hierarchy to the database. Let us consider the restriction concept DissatisfactionCalmDistrict (Fig. 7). In natural language, it is defined by all the concepts subsumed by CalmDistrict (connected to questions q8, q9, q10, and q11) and having a dissatisfied answer.

The DissatisfactionCalmDistrict restriction concept is described by the expert in the description logics language by

$DissatisfactionCalmDistrict \equiv CalmDistrict \sqcap (\exists hasAnswer.3 \ OR \ \exists hasAnswer.4).$

Considering that the user has instantiated the concept Q8 with the answer 3, and the concept Q11 with the answers 1 and 3, the concept DissatisfactionCalmDistrict is connected to the database as follows:


Fig. 6. Ontology structure visualized with the Jambalaya Protégé plug-in.

Fig. 7. Restriction concept construction using necessary and sufficient conditions in Protégé.


$f(DissatisfactionCalmDistrict) = \{f_0(Q8\_3), f_0(Q11\_3)\} = \{q8\_3, q11\_3\}.$

6.2 Results

Example 1. This first example presents the efficiency of our new approach concerning the reduction of the number of rules. To this end, we propose to the expert to test the four filters: on the one hand, the pruning filters (MICF, IRF, and pruning rule schemas) and, on the other hand, the selection filters (rule schema filters); the meanings of the acronyms are given in Table 5. The expert could use each filter separately and in several combinations in order to compare the results and validate them.

Hence, the expert proposed a set of pruning rule schemas (Table 2) and a set of filtering rule schemas (Table 3). She constructed these rule schemas during several meetings devoted to testing the new tool and analyzing the generated results.

    At the beginning, the expert is faced with the whole set of 358,072 extracted association rules. In a first attempt, we focus on pruning filters. If the MICF is applied, all the specialized rules not improving confidence are pruned. In Table 4, we can see that the MICF prunes 92.3 percent of rules, making it a very efficient filter for redundancy pruning. In addition, the IRF prunes 71 percent of rules, namely those implying semantically close items. The third pruning filter, Pruning Rule Schemas, prunes 43 percent of rules.

    We propose to compare the three pruning filters and the combinations of the pruning filters, as presented in Table 4. The first column is the reference for our experiments. The rates of rules remaining after each of the three filters is used separately are presented in columns 2, 3, and 4. We can note that the MICF filter is the most discriminating, pruning 92.3 percent of rules, compared with the other two, which prune 71 percent and 43 percent of rules, respectively.

    We can also note that combining the first two filters, MICF and IRF, is more powerful than combining the first one with the third one. Nevertheless, applying the three filters over the set of association rules yields a rule reduction of 96.3 percent.
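    As an illustration of the MICF idea (a specialized rule is pruned when a more general rule with the same consequent already reaches at least the same confidence), here is a small quadratic Python sketch; the data layout and the micf function are our own, and the sample rules reuse the first two rules discussed in Example 2 below.

    def micf(rules):
        # rules: list of (antecedent frozenset, consequent frozenset, confidence)
        kept = []
        for ant, cons, conf in rules:
            dominated = any(
                cons2 == cons and ant2 < ant and conf2 >= conf
                for ant2, cons2, conf2 in rules
            )
            if not dominated:
                kept.append((ant, cons, conf))
        return kept

    rules = [
        (frozenset({"q17_4", "q26_4", "q97_4"}),          frozenset({"q28_4"}), 0.928),
        (frozenset({"q16_4", "q17_4", "q26_4", "q97_4"}), frozenset({"q28_4"}), 0.925),
    ]
    print(len(micf(rules)))   # 1: the specialization without a confidence gain is pruned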

    However, even after applying the most reducing combination, number 8 (Table 4), the expert would have to analyze 13,382 rules, which is impossible manually. Thus, other filters should be applied. The expert was interested in the dissatisfaction phenomena, expressed by answers 3 and 4 in the questionnaire. She applied all the rule schemas with the corresponding operator (Table 3) on top of each combination of the first three filters presented in Table 4. Table 6 presents the number of rules filtered by each rule schema.

    In Table 6, the first column, Nb, identifies each filter combination as denoted in Table 4. We can note that the rule schema filters are very efficient. Moreover, studying the dissatisfaction of the clients improves the filtering power of the rule schemas.

    Let us consider the second rule schema. Applied over the initial set of 358,072 association rules with the conforming operator, it filters 1,008 rules, representing 0.28 percent of the complete set. But it is obviously very difficult for an expert to analyze a set of rules on the order of thousands. Thus, we can note the importance of the pruning filters, the set of rules extracted in each case having fewer than 500 rules. We can also note that the IRF filter is more powerful than the other pruning filters, and that combining two filters at the same time gives remarkable results:

    . on the fifth line, combining MICF with IRF reduces the number of rules to 77 rules;

    . combining IRF with pruning using rule schemas, the set of rules is reduced to three rules; and

    . we can also note that in the last two rows, the filters give the same results. We can explain this by the fact that we are working on an incomplete set of rules because of the maximum support threshold that we impose in the mining process.

    It is very important to note that the quality of the selected rules was certified by the Nantes Habitat expert.
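    To give a feel for how a rule schema used with the conforming operator can select rules, the sketch below checks that every item of a rule's antecedent and consequent is covered by the concepts listed in the schema; the concept names and their item sets are placeholders invented for illustration, not the schemas of Table 3.

    # Hypothetical concept -> item sets (placeholders, not the real question groups).
    concept_items = {
        "DissatisfiedTopicA": {"q8_3", "q8_4", "q11_3", "q11_4"},
        "DissatisfiedTopicB": {"q97_3", "q97_4"},
    }

    def conforms(rule, schema):
        # rule: (antecedent items, consequent items); schema: (antecedent concepts, consequent concepts)
        ant, cons = rule
        ant_items  = set().union(*(concept_items[c] for c in schema[0]))
        cons_items = set().union(*(concept_items[c] for c in schema[1]))
        return ant <= ant_items and cons <= cons_items

    rule   = ({"q97_4"}, {"q8_4"})
    schema = (["DissatisfiedTopicB"], ["DissatisfiedTopicA"])
    print(conforms(rule, schema))   # True: this rule would be kept by the filter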


    TABLE 2
    Pruning Rule Schemas

    TABLE 3
    Filtering Rule Schemas

    TABLE 4
    Pruning Rate for Each Filter Combination

    TABLE 5
    Notation Meaning


    Example 2. This second example is proposed in order to outline the quality of the filtered rules and to confirm the importance of interactivity in our framework. To this end, we present the sequence of steps (Fig. 8) performed by the expert during the interactive process, steps already described in Section 6.1. We have already presented the first step of the interactive process, ontology construction, in Section 5.1.

    As in the first example, the expert is faced with the whole set of rules. In a first attempt (steps 2 and 3), she proposed to investigate the quality of the rules filtered by two of the rule schemas, RS2 and RS3, with the conforming operator. The first one deals with dissatisfaction concerning the tranquility in the district, and the second one searches for rules associating dissatisfaction with price and dissatisfaction concerning the common areas of the building.

    Applying these two schemas to the whole rule set, an important selection is made:

    . C(RS2) filters 1,008 association rules; and

    . C(RS3) filters 96 association rules.

    The expert is now in the visualization and validation steps (4 and 5), and she analyzes the 96 rules filtered by C(RS3) because of their reduced number compared to the 1,008 filtered by C(RS2). For example, let us consider the following set of association rules:

    q17_4, q26_4, q97_4 → q28_4 (C = 92.8%, S = 2.6%)
    q16_4, q17_4, q26_4, q97_4 → q28_4 (C = 92.5%, S = 2.5%)
    q15_4, q17_4, q97_4 → q28_4 (C = 80.5%, S = 1.9%)
    q15_4, q17_4, q97_4 → q26_4, q28_4 (C = 80.5%, S = 1.9%)

    The expert noted that the second rule is a specialization of the first rule (the item q16_4 is added to the antecedent), and she also noted that its confidence is lower than the confidence of the more general rule. Thus, the second rule does not bring important information to the whole set of rules; hence, it can be pruned. In the same way, the expert noted that the fourth rule is a specialization of the third one, and the confidence is not improved in this case either. The expert decided to modify her initial information (step 5) and to go back to the beginning of the process via the interactivity loop (step 7), choosing to apply the MICF (step 6), which extracts 27,602 rules. The expert decided to keep these results (steps 4 and 5) and to return to the interactivity loop, going back to steps 2 and 3 in order to redefine rule schemas and operators.

    This time the expert proposed to use only the rule schema filter C(RS3), as a consequence of the high volume of rules extracted by the other one. Using C(RS3), 50 rules are filtered, and the presence of rules 1 and 3 and the absence of rules 2 and 4 (from the set presented above) validate the use of the MICF (steps 4 and 5). Moreover, the strong reduction in the number of rules validates the application of C(RS3). In this state, the expert returned to step 2 in order to modify the rule schema, proposing RS4; first, she applied the unexpectedness regarding the antecedent operator U(RS4), and then she returned to step 3 in order to modify the operator, choosing the exception one, E(RS4). These results are briefly presented in Table 6 but, due to space limits, they are not detailed in this section.

    The expert analyzed the 50 rules extracted by C(RS3) and found several trivial implications, noting that the implication between several items did not interest her. For instance, let us consider the following set of rules:

    q17_4, q97_4 → q16_4 (C = 86.7%, S = 3.5%)
    q25_4, q28_4, q97_4 → q26_4 (C = 100%, S = 2.0%)

    These rules imply items from EntryHall and CloseSurrounding; thus, the expert proposed to apply rule schemas RS5 to RS8 with the pruning operator (steps 2 and 3) in order to prune those uninteresting rules. In consequence, 15 rules are extracted, and the absence of the above rules validates the application of the pruning rule schemas (steps 4 and 5).
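    One plausible reading of the pruning use of rule schemas is the mirror image of the conforming filter: any rule matching a pruning schema is discarded. The sketch below follows that reading; the EntryHall and CloseSurrounding item sets are placeholders, not the paper's actual question groups.

    pruning_concepts = {
        "EntryHall":        {"q16_3", "q16_4", "q17_3", "q17_4"},
        "CloseSurrounding": {"q15_3", "q15_4", "q97_3", "q97_4"},
    }

    def covered(items, concepts):
        # True if every item belongs to one of the listed concepts.
        return items <= set().union(*(pruning_concepts[c] for c in concepts))

    def prune(rules, schemas):
        # schemas: list of (antecedent concepts, consequent concepts)
        return [r for r in rules
                if not any(covered(r[0], a) and covered(r[1], c) for a, c in schemas)]

    rules = [({"q17_4", "q97_4"}, {"q16_4"}),   # matches the pruning schema: removed
             ({"q8_4", "q16_4"}, {"q9_4"})]     # q8_4 falls outside it: kept
    schemas = [(["EntryHall", "CloseSurrounding"], ["EntryHall", "CloseSurrounding"])]
    print(len(prune(rules, schemas)))   # 1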

    Let us consider the following two rules:

    q28_4, q97_4 → q17_4 (C = 81.1%, S = 2.9%)
    q8_4, q16_4, q97_4 → q9_4 (C = 88.6%, S = 2.1%)

    Fig. 8. Description of the interactive process during the experiment.

    TABLE 6
    Rates for Rule Schema Filters Applied after the Other Three Filter Combinations

    The expert noted that a great part of the 15 rules are implications between attributes subsumed by the same concept in the ontology. For instance, the attributes q28 and q17 of the first rule, described by the Q28 and Q17 concepts, are subsumed by the concept Stairwell. Similarly, for the second rule, q8 and q9 are subsumed by the CalmDistrict concept. Thus, the expert applied the IRF filter, and only three rules are filtered. One of these rules attracts the interest of the expert:

    q15_4, q16_4, q97_4 → q9_4 (Support = 2.3%, Confidence = 79.1%),

    which can be translated as: if a client is not satisfied with the cleaning of the close surroundings and the entry hall, and if he is not satisfied with the service charges, then with a confidence of 79.1 percent he considers that his district has a bad reputation. This rule is very interesting because the expert thought that the building state did not influence the opinion concerning the district, but it is obvious that this is the case.
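    For completeness, here is a hedged sketch of an IRF-style relatedness check, under our reading that a rule is flagged when an antecedent item and a consequent item are attached to questions subsumed by the same ontology concept; the question-to-concept table is a small placeholder, not the full ontology.

    question_parent = {
        "q17": "Stairwell",    "q28": "Stairwell",
        "q8":  "CalmDistrict", "q9":  "CalmDistrict",
    }

    def question_of(item):
        return item.split("_")[0]            # "q17_4" -> "q17"

    def related(rule):
        # True when some antecedent item and some consequent item share a parent concept.
        ant, cons = rule
        parents_ant  = {question_parent.get(question_of(i)) for i in ant}
        parents_cons = {question_parent.get(question_of(i)) for i in cons}
        return bool((parents_ant & parents_cons) - {None})

    print(related(({"q28_4", "q97_4"}, {"q17_4"})))   # True: q28 and q17 are both under Stairwell
    print(related(({"q15_4", "q16_4"}, {"q9_4"})))    # False with this placeholder table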

    7 CONCLUSION

    This paper discusses the problem of selecting interesting association rules from among huge volumes of discovered rules. The major contributions of our paper are stated below. First, we propose to integrate user knowledge in association rule mining using two different types of formalism: ontologies and rule schemas. On the one hand, domain ontologies improve the integration of user domain knowledge concerning the database field in the postprocessing step. On the other hand, we propose a new formalism, called Rule Schemas, extending the specification language proposed by Liu et al. [17]. The latter is especially used to express the user expectations and goals concerning the discovered rules.

    Second, a set of operators, applicable over the rule schemas, is proposed in order to guide the user throughout the postprocessing step. Thus, several types of actions, such as pruning and filtering, are available to the user. Finally, the interactivity of our ARIPSO framework, relying on the set of rule-mining operators, assists the user throughout the analysis task and permits him/her an easier selection of interesting rules by reiterating the process of filtering rules.

    By applying our new approach over a voluminous questionnaire database, we allowed the integration of domain expert knowledge in the postprocessing step in order to reduce the number of rules to several dozens or less. Moreover, the quality of the filtered rules was validated by the expert throughout the interactive process.

    ACKNOWLEDGMENTS

    The authors would like to thank Nantes Habitat, the Public Housing Unit in Nantes, France, and especially Ms. Christelle Le Bouter, as well as M. Loic Glimois, for supporting this work.

    REFERENCES

    [1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD, pp. 207-216, 1993.
    [2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
    [3] A. Silberschatz and A. Tuzhilin, "What Makes Patterns Interesting in Knowledge Discovery Systems," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 970-974, Dec. 1996.
    [4] M.J. Zaki and M. Ogihara, "Theoretical Foundations of Association Rules," Proc. Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD '98), pp. 1-8, June 1998.
    [5] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, "Mafia: A Maximal Frequent Itemset Algorithm," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 11, pp. 1490-1504, Nov. 2005.
    [6] J. Li, "On Optimal Rule Discovery," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 4, pp. 460-471, Apr. 2006.
    [7] M.J. Zaki, "Generating Non-Redundant Association Rules," Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 34-43, 2000.
    [8] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Efficient Mining of Association Rules Using Closed Itemset Lattices," Information Systems, vol. 24, pp. 25-46, 1999.
    [9] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, and H. Mannila, "Pruning and Grouping of Discovered Association Rules," Proc. ECML-95 Workshop Statistics, Machine Learning, and Knowledge Discovery in Databases, pp. 47-52, 1995.
    [10] B. Baesens, S. Viaene, and J. Vanthienen, "Post-Processing of Association Rules," Proc. Workshop Post-Processing in Machine Learning and Data Mining: Interpretation, Visualization, Integration, and Related Topics with Sixth ACM SIGKDD, pp. 20-23, 2000.
    [11] J. Blanchard, F. Guillet, and H. Briand, "A User-Driven and Quality-Oriented Visualization for Mining Association Rules," Proc. Third IEEE Int'l Conf. Data Mining, pp. 493-496, 2003.
    [12] B. Liu, W. Hsu, K. Wang, and S. Chen, "Visually Aided Exploration of Interesting Association Rules," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), pp. 380-389, 1999.
    [13] G. Birkhoff, Lattice Theory, vol. 25. Am. Math. Soc., 1967.
    [14] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering Frequent Closed Itemsets for Association Rules," Proc. Seventh Int'l Conf. Database Theory (ICDT '99), pp. 398-416, 1999.
    [15] M. Zaki, "Mining Non-Redundant Association Rules," Data Mining and Knowledge Discovery, vol. 9, pp. 223-248, 2004.
    [16] A. Maedche and S. Staab, "Ontology Learning for the Semantic Web," IEEE Intelligent Systems, vol. 16, no. 2, pp. 72-79, Mar. 2001.
    [17] B. Liu, W. Hsu, L.-F. Mun, and H.-Y. Lee, "Finding Interesting Patterns Using User Expectations," IEEE Trans. Knowledge and Data Eng., vol. 11, no. 6, pp. 817-832, Nov. 1999.
    [18] I. Horrocks and P.F. Patel-Schneider, "Reducing OWL Entailment to Description Logic Satisfiability," J. Web Semantics, vol. 2870, pp. 17-29, 2003.
    [19] J. Pei, J. Han, and R. Mao, "Closet: An Efficient Algorithm for Mining Frequent Closed Itemsets," Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
    [20] M.J. Zaki and C.J. Hsiao, "Charm: An Efficient Algorithm for Closed Itemset Mining," Proc. Second SIAM Int'l Conf. Data Mining, pp. 34-43, 2002.
    [21] M.Z. Ashrafi, D. Taniar, and K. Smith, "Redundant Association Rules Reduction Techniques," AI 2005: Advances in Artificial Intelligence, Proc. 18th Australian Joint Conf. Artificial Intelligence, pp. 254-263, 2005.
    [22] M. Hahsler, C. Buchta, and K. Hornik, "Selective Association Rule Generation," Computational Statistics, vol. 23, no. 2, pp. 303-315, Kluwer Academic Publishers, 2008.
    [23] R.J. Bayardo, Jr. and R. Agrawal, "Mining the Most Interesting Rules," Proc. ACM SIGKDD, pp. 145-154, 1999.
    [24] R.J. Bayardo, Jr., R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large, Dense Databases," Proc. 15th Int'l Conf. Data Eng. (ICDE '99), pp. 188-197, 1999.
    [25] E.R. Omiecinski, "Alternative Interest Measures for Mining Associations in Databases," IEEE Trans. Knowledge and Data Eng., vol. 15, no. 1, pp. 57-69, Jan./Feb. 2003.
    [26] F. Guillet and H. Hamilton, Quality Measures in Data Mining. Springer, 2007.
    [27] P.-N. Tan, V. Kumar, and J. Srivastava, "Selecting the Right Objective Measure for Association Analysis," Information Systems, vol. 29, pp. 293-313, 2004.
    [28] G. Piatetsky-Shapiro and C.J. Matheus, "The Interestingness of Deviations," Proc. AAAI '94 Workshop Knowledge Discovery in Databases, pp. 25-36, 1994.
    [29] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo, "Finding Interesting Rules from Large Sets of Discovered Association Rules," Proc. Int'l Conf. Information and Knowledge Management (CIKM), pp. 401-407, 1994.


    [30] E. Baralis and G. Psaila, "Designing Templates for Mining Association Rules," J. Intelligent Information Systems, vol. 9, pp. 7-32, 1997.
    [31] B. Padmanabhan and A. Tuzhilin, "Unexpectedness as a Measure of Interestingness in Knowledge Discovery," Proc. Workshop Information Technology and Systems (WITS), pp. 81-90, 1997.
    [32] T. Imielinski, A. Virmani, and A. Abdulghani, "Datamine: Application Programming Interface and Query Language for Database Mining," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 256-262, http://www.aaai.org/Papers/KDD/1996/KDD96-042.pdf, 1996.
    [33] R.T. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang, "Exploratory Mining and Pruning Optimizations of Constrained Associations Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, vol. 27, pp. 13-24, 1998.
    [34] A. An, S. Khan, and X. Huang, "Objective and Subjective Algorithms for Grouping Association Rules," Proc. Third IEEE Int'l Conf. Data Mining (ICDM '03), pp. 477-480, 2003.
    [35] A. Berrado and G.C. Runger, "Using Metarules to Organize and Group Discovered Association Rules," Data Mining and Knowledge Discovery, vol. 14, no. 3, pp. 409-431, 2007.
    [36] M. Uschold and M. Grüninger, "Ontologies: Principles, Methods, and Applications," Knowledge Eng. Rev., vol. 11, pp. 93-155, 1996.
    [37] T.R. Gruber, "A Translation Approach to Portable Ontology Specifications," Knowledge Acquisition, vol. 5, pp. 199-220, 1993.
    [38] N. Guarino, "Formal Ontology in Information Systems," Proc. First Int'l Conf. Formal Ontology in Information Systems, pp. 3-15, 1998.
    [39] H. Nigro, S.G. Cisaro, and D. Xodo, Data Mining with Ontologies: Implementations, Findings and Frameworks. Idea Group, 2007.
    [40] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21st Int'l Conf. Very Large Databases, pp. 407-419, http://citeseer.ist.psu.edu/srikant95mining.html, 1995.
    [41] V. Svatek and M. Tomeckova, "Roles of Medical Ontology in Association Mining CRISP-DM Cycle," Proc. Workshop Knowledge Discovery and Ontologies in ECML/PKDD, 2004.
    [42] X. Zhou and J. Geller, "Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology," Data Mining with Ontologies: Implementations, Findings and Frameworks, pp. 18-36, Idea Group Reference, 2007.
    [43] M.A. Domingues and S.A. Rezende, "Using Taxonomies to Facilitate the Analysis of the Association Rules," Proc. Second Int'l Workshop Knowledge Discovery and Ontologies, held with ECML/PKDD, pp. 59-66, 2005.
    [44] A. Bellandi, B. Furletti, V. Grossi, and A. Romei, "Ontology-Driven Association Rule Extraction: A Case Study," Proc. Workshop Context and Ontologies: Representation and Reasoning, pp. 1-10, 2007.
    [45] R. Natarajan and B. Shekar, "A Relatedness-Based Data-Driven Approach to Determination of Interestingness of Association Rules," Proc. 2005 ACM Symp. Applied Computing (SAC), pp. 551-552, 2005.
    [46] A.C.B. Garcia and A.S. Vivacqua, "Does Ontology Help Make Sense of a Complex World or Does It Create a Biased Interpretation?" Proc. Sensemaking Workshop in CHI '08 Conf. Human Factors in Computing Systems, 2008.
    [47] A.C.B. Garcia, I. Ferraz, and A.S. Vivacqua, "From Data to Knowledge Mining," Artificial Intelligence for Eng. Design, Analysis and Manufacturing, vol. 23, pp. 427-441, 2009.
    [48] L.M. Garshol, "Metadata? Thesauri? Taxonomies? Topic Maps! Making Sense of It All," J. Information Science, vol. 30, no. 4, pp. 378-391, 2004.
    [49] I. Horrocks and P.F. Patel-Schneider, "A Proposal for an OWL Rules Language," Proc. 13th Int'l Conf. World Wide Web, pp. 723-731, 2004.
    [50] W.E. Grosso, H. Eriksson, R.W. Fergerson, J.H. Gennari, S.W. Tu, and M.A. Musen, "Knowledge Modeling at the Millennium (the Design and Evolution of Protege-2000)," Proc. 12th Workshop Knowledge Acquisition, Modeling and Management (KAW '99), 1999.
    [51] M.-A. Storey, N.F. Noy, M. Musen, C. Best, R. Fergerson, and N. Ernst, "Jambalaya: An Interactive Environment for Exploring Ontologies," Proc. Seventh Int'l Conf. Intelligent User Interfaces (IUI '02), pp. 239-239, 2002.

    Claudia Marinica received the master's degree in KDD from the Polytechnique School of Nantes University in 2006, and the Computer Science degree from Politehnica University of Bucharest, Romania, in 2006. She is currently working toward the PhD degree in computer science in the Knowledge and Decision Team, LINA UMR CNRS 6241, at the Polytechnique School of Nantes University, France. Her main research interests are in Association Rule Mining and the Semantic Web.

    Fabrice Guillet received the PhD degree in computer sciences from the Ecole Nationale Supérieure des Télécommunications de Bretagne in 1995. He has been an associate professor (HdR) in computer science at Polytech'Nantes, and a member of the KnOwledge and Decision team (KOD) in the Nantes-Atlantic Laboratory of Computer Sciences (LINA UMR CNRS 6241) since 1997. He is a founder of the Knowledge Extraction and Management French-speaking association of research (EGC, www.egc.asso.fr). His research interests include knowledge quality and visualization in the frameworks of Data Mining and Knowledge Engineering. He has recently coedited two refereed books of chapters entitled Quality Measures in Data Mining (Springer, 2007) and Statistical Implicative Analysis: Theory and Applications (Springer, 2008).

