
Linguistic summaries on relational databases

Miroslav Hudec

University of Economics in Bratislava,

Department of Applied Informatics

FSTA, 2014

Relational knowledge from a data set

Do most municipalities with high altitude have small pollution?

Validity of the rule: v ∈ [0, 1]

If-then rules: if population density is high, then waste production is high?

Linguistic summary - introduction

Q is a linguistic quantifier, X = {x} is a universe of discourse and P(x) is a predicate depicting summarizer S

Qx(P(x))

Q entities in database are (have) S

The truth value of a summary is called its validity and takes values from the [0, 1] interval

Linguistic summary - elementary

Q entities in database are (have) S

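$$T(Qx(P(x))) = \mu_Q\left(\frac{1}{n}\sum_{i=1}^{n}\mu_P(x_i)\right)$$

where n is the cardinality of the database (number of entities), $\frac{1}{n}\sum_{i=1}^{n}\mu_P(x_i)$ is the proportion of objects in the database that satisfy P(x), and $\mu_Q$ is the membership function of the quantifier.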
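The elementary validity can be sketched in a few lines of Python. This is only an illustration: the function names and the matching degrees are hypothetical, and the quantifier most is assumed to be the one with breakpoints 0.3 and 0.8 attributed to Kacprzyk and Zadrożny (2009) in these slides.

```python
def most(y: float) -> float:
    """Quantifier 'most', assuming breakpoints 0.3 and 0.8."""
    if y <= 0.3:
        return 0.0
    if y >= 0.8:
        return 1.0
    return 2.0 * y - 0.6


def validity_elementary(memberships, quantifier):
    """Apply the quantifier to the mean matching degree of all entities."""
    if not memberships:
        raise ValueError("the database must contain at least one entity")
    proportion = sum(memberships) / len(memberships)
    return quantifier(proportion)


# Hypothetical matching degrees mu_P(x_i) of five entities to summarizer S.
mu_p = [1.0, 0.8, 0.6, 0.4, 0.2]
validity = validity_elementary(mu_p, most)
print(round(validity, 4))
```

With these made-up degrees the proportion is 0.6, so the summary "most entities are S" is only partially valid.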

Linguistic summary - extended

Q R objects in database are (have) S

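$$T(Qx(P(x))) = \mu_Q\left(\frac{\sum_{i=1}^{n} t(\mu_S(x_i), \mu_R(x_i))}{\sum_{i=1}^{n}\mu_R(x_i)}\right)$$

where the fraction is the proportion of R objects in the database that satisfy S, t is a t-norm, and $\mu_Q$ is the membership function of the quantifier.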
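A minimal sketch of the extended validity follows, assuming the minimum t-norm and the ratio of the t-norm sum to the sum of memberships in R. The function name, the handling of an empty subpopulation, the example quantifier and the membership degrees are all assumptions made for illustration.

```python
def validity_extended(mu_r, mu_s, quantifier, t_norm=min):
    """Apply the quantifier to sum t(mu_S, mu_R) / sum mu_R."""
    if len(mu_r) != len(mu_s):
        raise ValueError("mu_r and mu_s must describe the same entities")
    denominator = sum(mu_r)
    if denominator == 0:
        # Assumption: if no entity belongs to R, report validity 0.
        return 0.0
    numerator = sum(t_norm(s, r) for r, s in zip(mu_r, mu_s))
    return quantifier(numerator / denominator)


# Hypothetical 'most' quantifier with breakpoints 0.3 and 0.8.
most = lambda y: min(1.0, max(0.0, 2.0 * y - 0.6))

# Hypothetical memberships of four entities in R and in S.
mu_r = [1.0, 1.0, 0.5, 0.0]
mu_s = [1.0, 0.5, 0.5, 1.0]
validity = validity_extended(mu_r, mu_s, most)
print(round(validity, 4))
```

The last entity fully satisfies S but does not belong to R, so it contributes to neither the numerator nor the denominator.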

Linguistic summary - graph

Q R objects in database are (have) S

Issues

Summarizer

Let Dmin and Dmax be the lowest and the highest domain values of attribute A, i.e. Dom(A) = [Dmin, Dmax], and let L and H be the lowest and the highest values in the current content of the database, respectively. In practice, [L, H] ⊆ [Dmin, Dmax]. This fact should be taken into account in linguistic summaries.

Family of summarizers

[Figure: membership functions μP(xi) of the summarizers small, medium and high over the variable's domain, with breakpoints A, B, C and D]

The uniform domain covering method (Tudorie, 2008)

Quantifier

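For a regular non-decreasing quantifier (e.g. most), the membership function should meet the following property:

$$x \le y \Rightarrow \mu_Q(x) \le \mu_Q(y); \quad \mu_Q(0) = 0; \quad \mu_Q(1) = 1$$

The quantifier most might be given as (Kacprzyk and Zadrożny, 2009):

$$\mu_Q(y) = \begin{cases} 1, & y \ge 0.8 \\ 2y - 0.6, & 0.3 < y < 0.8 \\ 0, & y \le 0.3 \end{cases}$$

[Figure: membership function of the quantifier most]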

Example

Linguistic summary (rule) Validity

Most municipalities having high population density have high production of waste

Most municipalities having medium population density have medium production of waste

Most municipalities having small population density have small production of waste

if population density is small then production of waste is small with cf = 1;

if population density is high then production of waste is high with cf = 0.662.

Family of quantifiers

[Figure: family of quantifiers, including few and about half, with membership function μQ(y) and breakpoints AQ, BQ, CQ and DQ]

Uniform domain covering method on the [0, 1] interval

AQ = 0.25

BQ = 0.375

CQ = 0.625

DQ = 0.75
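The four breakpoints can be turned into a small family of quantifiers. The sketch below is an assumption about the shapes, inferred from the figure labels: few decreases between AQ and BQ, about half is a trapezoid over all four breakpoints, and most increases between CQ and DQ. The function names are illustrative only.

```python
# Breakpoints on the [0, 1] interval, as listed above.
A_Q, B_Q, C_Q, D_Q = 0.25, 0.375, 0.625, 0.75


def few(y: float) -> float:
    """Assumed shape: 1 up to A_Q, linearly decreasing to 0 at B_Q."""
    if y <= A_Q:
        return 1.0
    if y >= B_Q:
        return 0.0
    return (B_Q - y) / (B_Q - A_Q)


def about_half(y: float) -> float:
    """Assumed shape: trapezoid rising on [A_Q, B_Q], falling on [C_Q, D_Q]."""
    if y <= A_Q or y >= D_Q:
        return 0.0
    if y < B_Q:
        return (y - A_Q) / (B_Q - A_Q)
    if y <= C_Q:
        return 1.0
    return (D_Q - y) / (D_Q - C_Q)


def most(y: float) -> float:
    """Assumed shape: 0 up to C_Q, linearly increasing to 1 at D_Q."""
    if y <= C_Q:
        return 0.0
    if y >= D_Q:
        return 1.0
    return (y - C_Q) / (D_Q - C_Q)


for y in (0.1, 0.3125, 0.5, 0.6875, 0.9):
    print(y, few(y), about_half(y), most(y))
```

Under these assumed shapes, neighbouring quantifiers overlap so that their membership degrees sum to 1 at every proportion y.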

Comparison of quantifiers

[Figure: membership functions μQ(y) of the compared quantifiers; axis ticks recovered from the plots: 0.2, 0.3, 0.7, 0.8 and 0.25, 0.375, 0.675, 0.750, 1]

Quantifiers most (Kacprzyk and Zadrożny, 2009) and few

Quantifiers most, about half and few (our approach)

Optimization of summaries

1. The decision maker creates a particular linguistic summary or sentence of interest and evaluates its validity.

2. Automatic generation of relevant linguistic summaries (Liu, 2011).

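$$\text{find } Q, S \text{ and } R \quad \text{subject to} \quad Q \in \overline{Q},\ R \in \overline{R},\ S \in \overline{S},\ T(Q, S, R) \ge \beta$$

where $\overline{Q}$ is a set of relevant quantifiers, $\overline{S}$ is a set of relevant linguistic expressions, $\overline{R}$ is a set defining the subpopulation of interest and β is the threshold value from the (0, 1] interval. Each solution produces a linguistic summary Q* R* are S*.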

Optimization of summaries

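$$P_c = \{(R, S) \mid (R, S) \in \overline{R} \times \overline{S}\}$$

$$\text{find } Q, S \text{ and } R \quad \text{subject to} \quad (R, S) \in P_r \subseteq P_c$$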

{(small, small), (small, medium),(medium, medium), (high, high)}

[Figure: relevant pairs of linguistic terms for Attribute A and Attribute B]
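The search for relevant summaries can be sketched as a brute-force enumeration over quantifiers and relevant (R, S) pairs, keeping those whose validity reaches the threshold β. Everything concrete here is an assumption for illustration: the records are made up and normalized to [0, 1], the term and quantifier shapes reuse the breakpoints 0.25, 0.375, 0.625 and 0.75, the t-norm is the minimum, and the helper names are hypothetical.

```python
from itertools import product


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership; a == b or c == d gives a shoulder."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def family():
    """Three fuzzy sets on [0, 1] built from the assumed breakpoints."""
    return {
        "low": lambda x: trapezoid(x, 0.0, 0.0, 0.25, 0.375),
        "mid": lambda x: trapezoid(x, 0.25, 0.375, 0.625, 0.75),
        "top": lambda x: trapezoid(x, 0.625, 0.75, 1.0, 1.0),
    }


shapes = family()
TERMS = {"small": shapes["low"], "medium": shapes["mid"], "high": shapes["top"]}
QUANTIFIERS = {"few": shapes["low"], "about half": shapes["mid"], "most": shapes["top"]}


def validity(records, quantifier, mu_r, mu_s):
    """Extended summary validity with the minimum t-norm."""
    denominator = sum(mu_r(a) for a, _ in records)
    if denominator == 0:
        return 0.0  # assumption: empty subpopulation gives validity 0
    numerator = sum(min(mu_s(b), mu_r(a)) for a, b in records)
    return quantifier(numerator / denominator)


def relevant_summaries(records, pairs, beta):
    """Return (Q, R, S, validity) tuples with validity >= beta, best first."""
    found = []
    for (q_name, q), (r_name, s_name) in product(QUANTIFIERS.items(), pairs):
        v = validity(records, q, TERMS[r_name], TERMS[s_name])
        if v >= beta:
            found.append((q_name, r_name, s_name, v))
    return sorted(found, key=lambda item: item[3], reverse=True)


# Hypothetical normalized values of (Attribute A, Attribute B).
records = [(0.1, 0.2), (0.2, 0.1), (0.5, 0.5), (0.9, 0.8), (0.8, 0.9), (0.85, 0.3)]
pairs = [("small", "small"), ("small", "medium"),
         ("medium", "medium"), ("high", "high")]
results = relevant_summaries(records, pairs, beta=0.6)
for q_name, r_name, s_name, v in results:
    print(f"{q_name} {r_name} A are {s_name} B: {v:.3f}")
```

Restricting the search to a list of relevant pairs keeps the number of evaluated summaries small compared with trying every combination of terms.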

Fuzzy functional dependencies

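[Figure: fuzzy functional dependency between Attribute A and Attribute B over tuples t1, ..., tn, with linguistic terms medium and high, shown alongside linguistic summaries]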

Fuzzy functional dependencies and linguistic summaries

Queries by summaries

Data on the lower hierarchical level are the basis for summaries, but only data on the higher level are revealed, ranked downward from the best to the worst. Example: select regions where most municipalities have a small altitude above sea level.

$$T_i(Qx(P(x))) = \mu_Q\left(\frac{1}{N_i}\sum_{j=1}^{N_i}\mu_P(x_{ji})\right), \quad i = 1, \dots, R, \quad \sum_{i=1}^{R} N_i = n$$

where n is the number of entities in the whole database, $N_i$ is the number of entities in cluster i (municipalities in region i), R is the number of clusters in the database (regions), and $\mu_P(x_{ji})$ is the matching degree of the j-th entity in the i-th cluster.

Advantages:

1. Sensitive data, or data that are not free of charge, remain hidden.

2. Policy maker… is interested in a general overview, not in the data.

Example

Select regions where most municipalities have a small altitude above sea level

Region             Validity of the summary
Bratislava         1
Trnava             1
Nitra              1
Trenčín            0.7719
Košice             0.6314
Banská Bystrica    0.2116
Žilina             0
Prešov             0
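A query by summary of this kind can be sketched as follows: the summary is evaluated separately for each cluster and only the ranked validities are returned. The region names, altitudes and the parameters of the summarizer small (full membership up to 200 m, none from 400 m) are hypothetical, and the quantifier most is assumed to have breakpoints 0.3 and 0.8; none of these values come from the data behind the table above.

```python
def small_altitude(metres, full=200.0, zero=400.0):
    """Hypothetical summarizer 'small altitude' (parameters are assumptions)."""
    if metres <= full:
        return 1.0
    if metres >= zero:
        return 0.0
    return (zero - metres) / (zero - full)


def most(y):
    """Quantifier 'most', assuming breakpoints 0.3 and 0.8."""
    if y <= 0.3:
        return 0.0
    if y >= 0.8:
        return 1.0
    return 2.0 * y - 0.6


def rank_regions(altitudes_by_region):
    """Validity of 'most municipalities have small altitude' per region.

    Assumes every region lists at least one municipality.
    """
    ranking = []
    for region, altitudes in altitudes_by_region.items():
        proportion = sum(small_altitude(a) for a in altitudes) / len(altitudes)
        ranking.append((region, most(proportion)))
    # Only region-level validities are returned, best first.
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


# Made-up municipal altitudes in metres, grouped by region.
data = {
    "region A": [120, 150, 180, 210],
    "region B": [150, 300, 450, 500],
    "region C": [600, 700, 800],
}
ranking = rank_regions(data)
for region, validity in ranking:
    print(f"{region}: {validity:.4f}")
```

The municipal altitudes never leave the function; the caller sees only the region names and their validities, which matches the first advantage listed above.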

Conclusion

The work demonstrates how we can start with a simple linguistic summary and build more complex summaries by merging knowledge from several fields: mining the parameters of summarizer functions from data, extending this to defining the parameters of quantifiers, optimization of summaries, and fuzzy queries. Although fuzzy set theory has already been established as an adequate framework for dealing with linguistic summaries, there is still room for improvement.

Some topics for further research

• Linguistic summaries on fuzzy databases

• Operations research task for optimising the process of rule generation

• Full applications for practitioners

• Fuzzy functional dependencies and linguistic summaries in data mining

Thank you for your attention
