Introducing New Tool for Official Statistics – Genetic Programming Miroslav Kľúčik INFOSTAT Slovakia



Introducing New Tool for Official Statistics – Genetic Programming

Miroslav Kľúčik, INFOSTAT Slovakia


Background

The research is part of an FP7 project in the area of Social Sciences and Humanities called BLUE-ETS (BLUE – Enterprise and Trade Statistics), which deals with data management for statistics.

April 2010 – March 2013 (36 months)

Now: 11th month, relatively early for results

Work package – WP5 – New methods for data collection and data analysis

Supporting European initiatives such as MEETS and ESS-NET: developing innovative methods, tools and procedures to exploit the potential of administrative data (Intrastat) better and more efficiently.

Result = knowledge gain for NSIs


Reasons for Research

• The main purpose is to analyze data from various official data sources and capture new information that cannot be obtained using classical methods.

• Artificial intelligence tools brought into official statistics research can make use of the vast amount of micro-data collected at NSIs. Official statistics research could thereby gain an advantage that until now has been the privilege of private research.

• As the European Plan of Research in Official Statistics (2007) concludes: "Also the use of such techniques as neural networks/artificial intelligence in data mining and comparisons with classical statistical approaches need to be further researched."

• This is an attempt to produce an artificial intelligence tool for National Statistical Institutes in the EU.


Reasons for Research

Introduce a new intelligent system into the area of data mining, and specifically into the area of official statistics. The system's main component should be the genetic programming (GP) method, and its main aim would be to learn from data and uncover new information.

The research is in its first year, so the main aim of this paper is to introduce the areas where the GP method can be used and to present instances of its application.

Past projects on data mining in the EU:

• Knowledge Extraction for Statistical Offices (KESO), 1996-1998
• Spatial Mining for Data of Public Interest (SPIN), 2000-2002
• Analysis System of Symbolic Official data (ASSO), 2001-2003
• Visual Data Mining System (VITAMIN S), 2001-2003
• Environment Consolidated Statistical Tool (ECOSTAT), 2001-2003


Introducing GP

• Genetic programming (GP):

• Computer Science:
– Artificial Intelligence

• Computational Intelligence (evolutionary computation)

J. Holland (1975) – genetic algorithms
J. R. Koza (1992) – genetic programming

GP draws on knowledge from other branches of science such as genetics, biology and ecology. GP is a computation method based on the evolution of computer programs (individuals). During the process of evolution it transforms a population into a new population using the principles of reproduction and survival of the fittest.


Introducing GP

• Literature:

– First papers about modelling – 1994
– First papers about data mining – 1996

– Banzhaf, W., Nordin, P., Keller, R. E., Francone, F. D. (1998) Genetic Programming – An Introduction
– Hassani, H., Gheitanchi, S., Yeganegi, M. R. (2008) On the Application of Data Mining to Official Data, Ludwig-Maximilians-Universität München
– Saporta, G. (2000) Data Mining and Official Statistics
– Affenzeller, M., Winkler, S., Wagner, S., Beham, A. (2009) Genetic Algorithms and Genetic Programming (Modern Concepts and Practical Applications)
– Eggermont, J. (2005) Data Mining using Genetic Programming (Classification and Symbolic Regression), Leiden University
– Smith, P.W.H. (2002) Genetic Programming as a Data-Mining Tool, City University, UK


GP Computation Cycle

• Representation of individuals – tree

1 individual = (x + y)*1.5
(1 tree = 1 solution = 1 individual)

• Computation cycle

– Producing initial population (random trees)
– Fitness measure (best individuals)
– Genetic operations (crossover, mutation)
– New generation of individuals
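The tree representation and the four steps of the cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the INFOSTAT program: the function set, the terminal set, the population size, the depth limit, the selection scheme (fittest half plus one unchanged best individual) and all names such as `evolve` and `DATA` are hypothetical choices made for the example. Trees are nested tuples, so the slide's individual (x + y)*1.5 is written `("*", ("+", "x", "y"), 1.5)`.

```python
import operator
import random

# Illustrative function set (inner nodes) and terminal set (leaves).
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", "y", 1.5]
MAX_DEPTH = 6  # offspring deeper than this are discarded to limit tree growth


def random_tree(rng, depth=2):
    """Grow a random tree: a terminal, or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x, y):
    """Compute the value of a tree for given x and y."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, x, y), evaluate(right, x, y))
    return {"x": x, "y": y}.get(tree, tree)  # variable lookup, else constant


def tree_depth(tree):
    """Number of operator levels above the deepest leaf."""
    if isinstance(tree, tuple):
        return 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))
    return 0


def fitness(tree, data):
    """Sum of squared errors over (x, y, target) rows; lower is better."""
    return sum((evaluate(tree, x, y) - t) ** 2 for x, y, t in data)


def crossover(rng, a, b):
    """Copy a, replacing one branch below its root with a subtree of b."""
    donor = b
    while isinstance(donor, tuple) and rng.random() < 0.5:
        donor = rng.choice(donor[1:])
    if not isinstance(a, tuple):
        return donor
    op, left, right = a
    return (op, donor, right) if rng.random() < 0.5 else (op, left, donor)


def mutate(rng, tree):
    """Replace one branch below the root with a new random subtree."""
    if not isinstance(tree, tuple):
        return random_tree(rng)
    op, left, right = tree
    new = random_tree(rng)
    return (op, new, right) if rng.random() < 0.5 else (op, left, new)


def evolve(data, pop_size=40, generations=15, seed=0):
    """Run the cycle; return the best tree and the best fitness per generation."""
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]  # initial population
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, data))       # fitness measure
        history.append(fitness(population[0], data))
        parents = population[: pop_size // 2]                 # fittest half survives
        children = [parents[0]]                               # keep the best as is
        while len(children) < pop_size:                       # genetic operations
            child = crossover(rng, rng.choice(parents), rng.choice(parents))
            if rng.random() < 0.2:
                child = mutate(rng, child)
            if tree_depth(child) <= MAX_DEPTH:
                children.append(child)
        population = children                                 # new generation
    return min(population, key=lambda t: fitness(t, data)), history


# Hypothetical training data generated from the slide's example individual.
DATA = [(x, y, (x + y) * 1.5) for x in range(4) for y in range(4)]

if __name__ == "__main__":
    best, history = evolve(DATA)
    print(best, history[-1])
```

Because the best individual is carried over unchanged, the best fitness recorded in `history` can only stay equal or improve from one generation to the next; whether the exact target expression is recovered depends on the random seed and the parameters.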


GP Computation Cycle


GP Software

• The GP method has to be programmed in a conventional programming language. INFOSTAT is writing its own first experimental program (GP Coeus 1.0) in the EViews programming language. The final software will have to be rewritten in C++ and finished as user-friendly software for general use.


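Example – fitness function

    smpl @all
    vector(200) _results1
    !d = 0
    ' loop over the individuals of the current round
    for !xx = 1 to !round
      %observ = @str(!xx)
      %11 = @elem(numberofg, %observ)
      !d = !d + 1
      scalar a = 0
      for !c = 1 to _ind{%11}.@count
        scalar dd = _ind{%11}.@count
        %5 = _ind{%11}.@seriesname(!c)
        ' read the three elements of the individual
        %7 = @elem({%5}, "1")
        %8 = @elem({%5}, "2")
        %9 = @elem({%5}, "3")
        a = a + 1
        a = dd - a
        ' build the candidate series from the three elements
        series _{%5} = {%8} {%7} {%9}
        if a = 0 then
          smpl @all
          ' squared deviation from the reference series crs
          series rmse{%5} = (_{%5} - crs)^2
          ' root mean squared error (exponent 0.5 assumed; the slide shows *0.5 with an unbalanced parenthesis)
          scalar _rmse{%5} = (@sum(rmse{%5})/@obs(rmse{%5}))^0.5
          _results1(!d) = _rmse{%5}
        endif
      next
    next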
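For readers unfamiliar with EViews, the core of the fitness measure above appears to be a root mean squared error between each candidate series and a reference series (`crs` in the slide). A minimal Python sketch of that idea follows; the function names `rmse` and `score_population` are hypothetical and the code is not a translation of the full EViews program.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equally long numeric sequences."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def score_population(candidates, observed):
    """Return one RMSE per candidate series, in input order (lower is better)."""
    return [rmse(series, observed) for series in candidates]


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    print(score_population([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]], reference))
```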


GP Application Areas in Official Statistics

• 1. Symbolic regression
• 2. Classification
• 3. Clustering
• 4. Deviation detection


Symbolic Regression, Classification

• Symbolic regression can be used to estimate missing data as well as to produce flash estimates of economic aggregates before the official data release.

• The cost of searching for traditional regression models in large data sets is too high. Where there are too many possible explanatory variables, a heuristic approach is more appropriate. A near-optimal final solution is found fully automatically and is available to the user immediately.

• The main purpose of dataset classification is to divide the data into classes by means of characteristics known beforehand.

• In the case of the Intrastat database, interesting classifications can be discovered, e.g. a threshold analysis of units (companies) above or below the threshold of the data reporting obligation.
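One way a GP individual could act as a classifier is to treat the sign of its output as the class and use classification accuracy as the fitness measure. The sketch below assumes exactly that; it is not taken from the slides. The field name `trade_value`, the class labels, the candidate rule and its cut-off of 100.0 are all hypothetical and chosen only to make the example run.

```python
def classify(rule, unit):
    """Assign a unit to a class from the sign of the rule's output."""
    return "above" if rule(unit) > 0 else "below"


def accuracy(rule, units, labels):
    """Share of units whose predicted class matches the known class."""
    hits = sum(classify(rule, unit) == label for unit, label in zip(units, labels))
    return hits / len(units)


# Hypothetical units with known classes (not real Intrastat data).
UNITS = [
    {"trade_value": 50.0},
    {"trade_value": 150.0},
    {"trade_value": 300.0},
    {"trade_value": 90.0},
]
LABELS = ["below", "above", "above", "above"]


def candidate_rule(unit):
    """A hypothetical evolved expression: positive output means 'above'."""
    return unit["trade_value"] - 100.0


if __name__ == "__main__":
    print(accuracy(candidate_rule, UNITS, LABELS))
```

In a GP run, many such candidate rules would be generated and recombined, and accuracy (or a related measure) would decide which ones survive.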


Clustering, Deviation Detection

• Statistical units sometimes fill in incorrect data in surveys. In a very large database, these flaws can pass through processing unnoticed. This leaves room for GP outlier detection, where deviations and changes from normative behaviour are identified with the help of multiple fitness functions (with contradictory objectives).

• GP deviation detection can anticipate unexpected changes, so that unnecessary revisions could be avoided.

• Clustering is another method for discovering hidden patterns in large data sets (an unsupervised automatic search for patterns, where neither categories nor classes are known in advance). Clusters are formed on the basis of the similarity of the units within them.

• In practice, clustering with GP can be used, for example, in sample stratification for surveys. Semi-automatic cluster creation should ensure the formation of distinct strata with homogeneous units inside each one. The program could be run at the desired frequency to support regular replacement of the units addressed in the survey. Clustering can also be used to search for links between different databases.
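The deviation detection bullet mentions multiple fitness functions with contradictory objectives. The slides do not say how these are combined; one common approach, assumed here purely for illustration, is Pareto dominance: a candidate is kept if no other candidate is at least as good on every objective and strictly better on one. The function names and the two-objective score vectors below are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is no worse than b everywhere and better somewhere.

    Lower scores are treated as better on every objective.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(scores):
    """Indices of candidates that are not dominated by any other candidate."""
    front = []
    for i, candidate in enumerate(scores):
        dominated = any(
            dominates(other, candidate) for j, other in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front


if __name__ == "__main__":
    # Hypothetical (objective 1, objective 2) scores of four candidates.
    print(pareto_front([(1, 5), (2, 2), (5, 1), (4, 4)]))
```

Candidates on the front represent different trade-offs between the objectives, so none of them can be discarded without giving up something on at least one objective.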


Integration of Heuristic Methods

• Final step:

• The integration of evolutionary techniques, the fuzzy approach and neural networks allows more effective data mining. Fuzzy logic introduces linguistic variables and degrees of exactness into data analysis: data are not treated as precise numbers but interpreted on a "scale of truth". A neural network is trained to correctly map input variables to target variables by imitating the information processing of the biological nervous system.

• There are many possibilities for extracting the properties of each of these methods and integrating them into a hybrid system. This is possible for GP/neural network, GP/fuzzy, fuzzy/neural network and also GP/neural/fuzzy systems.
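The "scale of truth" idea can be made concrete with a membership function, which maps a precise number to a degree between 0 and 1 for a linguistic term. The sketch below uses a triangular membership function, a standard textbook choice that the slides do not specify; the linguistic terms and their breakpoints are hypothetical.

```python
def triangular(x, left, peak, right):
    """Degree of membership of x in a triangular fuzzy set (left < peak < right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic variable "size" with three overlapping terms.
TERMS = {
    "small": (-5.0, 0.0, 5.0),
    "medium": (0.0, 5.0, 10.0),
    "large": (5.0, 10.0, 15.0),
}


def fuzzify(x):
    """Return the degree of membership of x in every linguistic term."""
    return {name: triangular(x, *bounds) for name, bounds in TERMS.items()}


if __name__ == "__main__":
    print(fuzzify(2.5))
```

A value such as 2.5 then belongs partly to "small" and partly to "medium" rather than to exactly one class, which is the kind of input a hybrid GP/fuzzy system could work with.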


Plan for Next 24 Months

• Finishing and testing of software for GP classification, regression, clustering and deviation detection

• Testing on real data from different NSIs, provided by another work package of the BLUE-ETS project

• Finishing user-friendly software for classification, regression, clustering and deviation detection for NSIs


Conclusion


Developing GP-based estimation methods for incomplete sets of collected public data can enable earlier publication and increased precision of official data.

Through its dynamics, GP can incorporate all aspects of the information available in collected data and contribute to the necessary improvement in the effectiveness and timeliness of NSIs in collecting and publishing official data.

The application of artificial intelligence tools has immense potential in the area of data mining. Official statistics research can make use of the vast amount of disaggregated data collected at NSIs.