The Fifth International Conference on Neural Networks and Artificial Intelligence, May 27-30, Minsk, Belarus
The Estimations Based on the Kolmogorov Complexity and Machine Learning from Examples
Vladimir I. Donskoy
Taurian National University, 4, Vernadsky Avenue, Simferopol, 95007, Ukraine, [email protected]
Abstract - In this paper, the interrelation between the Kolmogorov complexity and the VC dimension (VCD) of classes of partial recursive functions used in machine learning from examples is studied. The novel pVCD method of obtaining estimations of the VCD and the Kolmogorov complexity by programming is proposed. It is shown how the Kolmogorov complexity can be used to substantiate the significance of regularities discovered in training samples.
Keywords - Kolmogorov complexity, VCD, Machine Learning, Samples.
I. INTRODUCTION
When examining the problems of machine learning, it is natural to restrict the class of decision functions used to the partial recursive functions. In that case, we need an algorithmic approach to machine learning and must examine the algorithmic complexity of models. The statistical Vapnik-Chervonenkis theory of learning [1], the Kolmogorov approach [2], the MDL principle [3], and various heuristics used in machine learning are all based on concepts of the complexity of the models which are used to find regularities or decision-making rules. From these different points of view, when learning from examples is used, it is expedient to choose as simple a decision rule (model) as possible. The nature of this requirement can be seen as a decree of Nature: a regularity almost always is very simple or has a very simple description; in other words, it has low complexity.
In this paper, the Vapnik-Chervonenkis theory is extensively used. This theory begins with the concepts of the shatter coefficient and the Vapnik-Chervonenkis dimension (VCD) [4]. Let $\bar{x}_l = (x_1, \ldots, x_l)$ be a sample; $x_i \in X$, $i = 1, \ldots, l$; $X$ is a set which is defined by the application. $\Delta^S(\bar{x}_l)$ is the number of various partitions of the sample $\bar{x}_l$ on two classes which can be realized by the rules (functions) of the family $S = \{A\}$. It is evident that $\Delta^S(\bar{x}_l) \le 2^l$. The function $m^S(l) = \max_{\bar{x}_l} \Delta^S(\bar{x}_l)$ is called the growth function of the family $S$, or the $l$-th shatter coefficient of $S$ [4]. The set of all possible samples is denoted $\widetilde{X}^l$. The growth function either is identically equal to $2^l$, or is majorized by the function $(el/h)^h$, where $h + 1$ is the minimum value of $l$ on which $m^S(l) < 2^l$. The following definition is based on this estimation: if there exists a finite $h$ such that $m^S(l) \le (el/h)^h$ for any $l > h$, then it is said that the family $S$ has finite capacity $h$ (or $\mathrm{VCD}(S) = h$). If $m^S(l) \equiv 2^l$, then it is said that the VCD is infinite: $\mathrm{VCD}(S) = \infty$. If $\mathrm{card}\,S = N$, then $m^S(l) \le N$ and $\mathrm{VCD}(S) \le \lfloor \log_2 N \rfloor$. The main result of the statistical Vapnik-Chervonenkis theory is: the finiteness of the $\mathrm{VCD}(S)$ guarantees the ability to learn by the method of empirical induction when the classification rule is chosen from the family $S$. The fundamental inequality (here in one of its standard forms [4])

$P\{\sup_{A \in S} |\nu(A) - P(A)| > \varepsilon\} \le 4\, m^S(2l)\, e^{-\varepsilon^2 l / 8}$

is used to estimate the length $l$ of a sample which is necessary to guarantee that the empirical error $\nu(A)$ (frequency ratio) of the learned classification rule $A$ will be $\varepsilon$-close to the unknown probability $P(A)$ of the error of this rule.

The main purpose of this paper is to analyze the process of machine learning from examples when recursive function families are used. To achieve this purpose, we define the Kolmogorov complexity $K_l(S)$ [5] of the family $S$ of recursive functions and prove the inequality $K_l(S) \le \lceil h \log_2 (el/h) \rceil$, where $h = \mathrm{VCD}(S)$. The novel pVCD method of estimating both $\mathrm{VCD}(S)$ and $K_l(S)$ is proposed. Finally, the majorant $2^{K_l(S) - l}$ is obtained for the probability of the random choice of a recursive rule $A$ which is absolutely correct on all examples of a sample of length $l$, when this rule is found by means of machine learning. The results obtained in this paper are based on the Kolmogorov approach, which supposes the consideration of nonrandomness as regularity.
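To make the role of the fundamental inequality concrete, the following minimal Python sketch evaluates its right-hand side for growing sample lengths; the bound form with constants 4 and 8, the Sauer-type majorant of $m^S(2l)$, and the parameter values $h = 10$, $\varepsilon = 0.1$ are illustrative assumptions, not values taken from the paper.

```python
import math

def vc_bound(h, l, eps):
    """One standard form of the VC deviation bound:
    P{ sup_A |nu(A) - P(A)| > eps } <= 4 * m(2l) * exp(-eps**2 * l / 8),
    with the growth function majorized by m(2l) <= (2*e*l / h) ** h."""
    growth = (2.0 * math.e * l / h) ** h       # Sauer-type majorant of m^S(2l)
    return 4.0 * growth * math.exp(-eps ** 2 * l / 8.0)

# For small l the bound is vacuous (greater than 1); for large enough l it
# drops below any confidence level, which yields the required sample length.
for l in (1_000, 10_000, 100_000):
    print(l, vc_bound(h=10, l=l, eps=0.1))
```

Choosing the smallest $l$ for which the printed value falls below a confidence level $\delta$ gives the sample length that guarantees $\varepsilon$-closeness of the empirical error to the true error probability.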
II. KOLMOGOROV COMPLEXITY OF THE RECURSIVE CLASSIFIERS
Let $S = \{A\}$ be a family of general recursive functions (of algorithms) of the form $A\colon X \to \{0, 1\}$. A training sample, which is denoted as $\bar{x}_l = (x_1, \ldots, x_l)$, contains arbitrary elements from the set $X$. This sample presents an ordered collection which consists of bounded natural numbers. The bounded set of all these samples is denoted as $\widetilde{X}^l$, and it requires $\lceil \log_2 \mathrm{card}\,\widetilde{X}^l \rceil$ bits to present its states. The set $\{0,1\}^*$ of 0-1-strings (words) of arbitrary length, as usual, presents the numbers $0, 1, 2, \ldots$. The length of a string $p$ is denoted $l(p)$. $\Phi = \{\varphi\}$ is the class of partial recursive functions. We define more exactly the training sequence, or the sample, as the pairs $(x_i, f(x_i))$, where $x_i \in X$, $f\colon X \to \{0,1\}$, $i = 1, \ldots, l$; $f$ is some a priori unknown, but existing, classification function. The set of all possible training samples is denoted as $\widetilde{X}^l_f$. This set is a general population from which samples can be extracted. The machine learning problem consists in finding the unknown function $f$ by using the given sample. Practically, the result of machine learning is a function $A \in S$ which is not equal to $f$, but which is, in a certain sense, as close to $f$ as possible. The family $S$ is defined by the choice of the model of machine learning (and by the corresponding family of classification algorithms): for example, by decision trees, neural networks, potential functions, and other heuristics. The most intricate problem is the determination of a family $S$ which is relevant, i.e., adequate to the initial information; this is why empirical learning problems are so complicated.
Definition 1.
1º The complexity of the algorithm $A$ relative to the sample $\bar{x}_l$ by the partial recursive function $\varphi$ is
$K_\varphi(A, \bar{x}_l) = \min\{\, l(p) : \varphi(p, \bar{x}_l) = \alpha(A, \bar{x}_l) \,\}$,
where $\alpha(A, \bar{x}_l) = A(x_1) A(x_2) \cdots A(x_l)$ is a binary word of the length $l$.
2º The complexity of the algorithm $A$ at the set $\widetilde{X}^l$ by the partial recursive function $\varphi$ is
$K_\varphi(A, \widetilde{X}^l) = \max_{\bar{x}_l \in \widetilde{X}^l} K_\varphi(A, \bar{x}_l)$.
3º The complexity of the family $S$ of algorithms at the set $\widetilde{X}^l$ by the partial recursive function $\varphi$ is
$K_\varphi(S, \widetilde{X}^l) = \max_{A \in S} K_\varphi(A, \widetilde{X}^l)$.
4º The complexity of the family $S$ of algorithms at the set $\widetilde{X}^l$ is
$K_l(S) = \min_{\varphi \in \Phi} K_\varphi(S, \widetilde{X}^l)$.
Theorem 1. Let the family $S$ of partial recursive functions have finite $\mathrm{VCD}(S) = h$ and Kolmogorov complexity $K_l(S)$. Then $K_l(S) \le \lceil h \log_2 (el/h) \rceil$ for any $l > h$.
Proof. The complexity $K_\varphi(S, \widetilde{X}^l)$ of the family $S$ is defined by the expressions of Definition 1, where a binary word $\alpha(A, \bar{x}_l)$ fixes the variant of the partition of the sample $\bar{x}_l$ on two subsets. All possible variants of such partitions are defined by the functions of the family $S$. For the function $A$ the binary word is defined by the expression $\alpha(A, \bar{x}_l) = A(x_1) \cdots A(x_l)$; moreover, if the functions $A$ and $B$ from $S$ are equivalent, the binary words $\alpha(A, \bar{x}_l)$ and $\alpha(B, \bar{x}_l)$ are the same. If the partial recursive function $\varphi$ is fixed, the equality $\varphi(p, \bar{x}_l) = \alpha(A, \bar{x}_l)$ must be fulfilled for any $A \in S$ on any sample $\bar{x}_l$ according to Definition 1. Therefore the argument $p$ must admit not less than $m^S(l)$ values, where $m^S(l)$ is the growth function of the family $S$. Remember that $m^S(l)$ is the maximum number of various partitions of the sample $\bar{x}_l$; therefore it defines the maximum possible number of various binary words $\alpha$ of the length $l$ for all samples from $\widetilde{X}^l$. And because $\varphi$ is a function, the inequality $K_\varphi(S, \widetilde{X}^l) \ge \lceil \log_2 m^S(l) \rceil$ takes place. Furthermore, the equality

$K_l(S) = \min_{\varphi} K_\varphi(S, \widetilde{X}^l) = \lceil \log_2 m^S(l) \rceil$ (1)

is true. Really, it is sufficient to point out a function $\varphi^*$ such that $K_{\varphi^*}(S, \widetilde{X}^l) = \lceil \log_2 m^S(l) \rceil$. This function can be defined by the following Table 1, consisting of $2^{\lceil \log_2 m^S(l) \rceil} \cdot \mathrm{card}\,\widetilde{X}^l$ cells.
TABLE I
DETERMINATION OF THE FUNCTION $\varphi^*$

The code (number)      The code (number) of the sample $\bar{x}_l$
of the program $p$        0              1              2            …
0                      $\alpha_{00}$  $\alpha_{01}$  $\alpha_{02}$  …
1                      $\alpha_{10}$  $\alpha_{11}$  $\alpha_{12}$  …
…                         …              …              …           …
The values $\alpha_{ij}$ contained in the table are the binary words of the length $l$, which are treated as binary natural numbers. We mean the natural numbers extended with zero. Just as the values $\alpha_{ij}$, the samples $\bar{x}_l$ and the program codes $p$ are interpreted as natural numbers. So, the function $\varphi^*$ can be defined on the finite set of values of arguments presented in Table 1. On any other admissible values of arguments which are not contained in this table, the function $\varphi^*$ can be defined as zero. We remind: the natural functions of natural arguments which have nonzero values only on a finite subset of their domain of definition are recursive.
Under the conditions $l > h$ and $m^S(l) < 2^l$, the following expressions take place:
$m^S(l) \le (el/h)^h$,
$\log_2 m^S(l) \le h \log_2 (el/h)$.
And finally, taking into account the equality (1), we get
$K_l(S) \le \lceil h \log_2 (el/h) \rceil$.
Corollary 1. The Kolmogorov complexity $K_l(S)$ of the family of algorithms $S$ is equal to the least whole number which is greater than or equal to the logarithm of the $l$-th shatter coefficient of this family: $K_l(S) = \lceil \log_2 m^S(l) \rceil$.
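Corollary 1 can be checked numerically on a toy family. The sketch below (the one-dimensional threshold family, the sample points, and the threshold grid are illustrative assumptions, not objects from the paper) enumerates the distinct dichotomies realizable on a fixed sample and takes the ceiling of the base-2 logarithm of their number:

```python
import math

def shatter_count(points, thresholds):
    """Number of distinct labelings of `points` realizable by the
    threshold rules A_t(x) = [x >= t] (a family with VCD = 1)."""
    return len({tuple(int(x >= t) for x in points) for t in thresholds})

points = [1, 2, 3, 4, 5]                  # a sample of length l = 5
thresholds = [k + 0.5 for k in range(6)]  # 0.5, 1.5, ..., 5.5: all variants
m = shatter_count(points, thresholds)     # growth-function value m^S(5)
K = math.ceil(math.log2(m))               # K_l(S) per Corollary 1
print(m, K)  # -> 6 3
```

The threshold family realizes $l + 1 = 6$ dichotomies on 5 collinear points, so its complexity on this sample is $\lceil \log_2 6 \rceil = 3$ bits.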
Corollary 2. $\mathrm{VCD}(S) \le K_l(S)$ for any $l \ge \mathrm{VCD}(S)$.
III. THE METHOD OF PROGRAMMING
OF ESTIMATIONS OF VCD AND SHATTER COEFFICIENTS
The complexity $K_l(S)$ of the class $S$ of algorithms is defined above as the minimum length of the binary word (program) $p$ which can be used to define the word $\alpha(A, \bar{x}_l) = A(x_1) \cdots A(x_l)$ by means of the corresponding partial recursive function (external algorithm) $\varphi$ in the most unfavorable case over the set of samples $\widetilde{X}^l$ and the algorithms from $S$. It is evident that $K_l(S) \le K_\varphi(S, \widetilde{X}^l)$ for any function $\varphi$. Therefore, for an upper estimation of the $K_l(S)$, any Turing machine $T$ can be used alternatively as the algorithm $\varphi$, if this machine calculates $\alpha(A, \bar{x}_l)$. An appropriate program $\tilde{p}$ in any programming language such that $\varphi(\tilde{p}, \bar{x}_l) = \alpha(A, \bar{x}_l)$ for the input $\bar{x}_l$ can be used as well as a Turing machine. So, if the word $\tilde{p}$ and an appropriate way of calculation of $\alpha$ are defined, then the VCD can be estimated: $K_l(S) \le l(\tilde{p})$ and $\mathrm{VCD}(S) \le K_l(S)$. The novel, so-called pVCD, method of programming of the estimation of VCD is based on the inequality $K_l(S) \le l(\tilde{p})$, where the word $\tilde{p}$ is defined by the expression $\varphi(\tilde{p}, \bar{x}_l) = \alpha(A, \bar{x}_l)$ for any $A \in S$ and $\bar{x}_l \in \widetilde{X}^l$. Taking into account the equality $K_l(S) = \lceil \log_2 m^S(l) \rceil$ (Corollary 1), we have $\lceil \log_2 m^S(l) \rceil \le l(\tilde{p})$ and $\mathrm{VCD}(S) \le l(\tilde{p})$ for any $l$. The shatter coefficient can be estimated by the inequality $m^S(l) \le 2^{l(\tilde{p})}$.
The following very important detail must be underlined. As we noted above, we consider binary strings as natural numbers; therefore the algorithm $\varphi$ transforms the pair $(\tilde{p}, \bar{x}_l)$ of natural numbers into the natural number $\alpha$. When $\alpha$ is found as a number, this number must be decoded into the string of the length $l$. To present the number $\alpha$ as a binary string, we need to have information about the value of $l$, so we need $\lceil \log_2 l \rceil$ binary digits added into the word $\tilde{p}$, which defines any algorithm $A \in S$. We denote $\lambda = l(\tilde{p})$ and $\lambda_0 = \lambda - \lceil \log_2 l \rceil$. To realize the pVCD method, the following steps must be done:
1º Analysis of the family $S$; definition of as restricted a set of parameters and/or properties of this family as possible, in order to form the structure of the word $\tilde{p}$ which completely defines any algorithm $A \in S$; pointing out the algorithm $\varphi$ (the Turing machine, the partial recursive function, the program for any computer) such that $\varphi(\tilde{p}, \bar{x}_l) = \alpha(A, \bar{x}_l)$.
2º Definition of the length $\lambda = l(\tilde{p})$ of the word $\tilde{p}$ for the upper estimation of the $K_l(S)$, or of $\lambda_0 = \lambda - \lceil \log_2 l \rceil$ as the upper estimation of the $\mathrm{VCD}(S)$.
The pVCD method suggests designing a compressed description $\tilde{p}$ for any element of the family $S$ and an algorithm $\varphi$ which processes the input $(\tilde{p}, \bar{x}_l)$. In particular, it is sufficient to show the existence of such an algorithm; but generally, the art of programming and of data organization is needed to present the structure of the word $\tilde{p}$ and the algorithm $\varphi$. If we use a computer with register capacity $c$, and the algorithms from the family $S$ use this register capacity to present any parameter of the algorithm, a more detailed estimation can be obtained.
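The two steps above reduce to bit counting once the field structure of $\tilde{p}$ is fixed. The following Python sketch captures that arithmetic; the example family described by three 8-bit parameters is a hypothetical illustration, not one of the paper's families.

```python
import math

def pvcd_estimates(field_bit_widths, l):
    """pVCD sketch: if every algorithm A in S is completely defined by a
    binary word p~ built of fixed-width fields, then
      K_l(S)  <= l(p~) = sum(widths) + ceil(log2 l)   (output-length digits),
      VCD(S)  <= l(p~) - ceil(log2 l) = sum(widths).
    Returns (bound on K_l(S), bound on VCD(S))."""
    core = sum(field_bit_widths)               # bits describing the algorithm itself
    return core + math.ceil(math.log2(l)), core

# Hypothetical family: algorithms fully described by three 8-bit parameters.
K_bound, vcd_upper = pvcd_estimates([8, 8, 8], l=1000)
print(K_bound, vcd_upper)  # -> 34 24
```

The $\lceil \log_2 l \rceil$ term enters only the complexity bound, matching the remark that the output-length digits must be excluded when the VCD is estimated.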
We illustrate the pVCD method for the family $S_{BDT}$ of Binary Decision Trees (BDT) with not more than $m$ terminal nodes. We suppose Boolean samples, $X = \{0,1\}^n$, and space dimension $n$. Every internal node of any tree from $S_{BDT}$ contains the number $j$ of a Boolean variable from the set $\{x_1, \ldots, x_n\}$ and two pointers: the left and the right. Each pointer defines a transition to the next node according to the value of this variable. Any terminal node contains the number of a class (the result of computation): 0 or 1. A tree with $m = 5$ terminal nodes and 4 internal nodes is shown in Fig. 1. Any tree defines an algorithm $A\colon \{0,1\}^n \to \{0,1\}$. This algorithm can be compressed into the word $\tilde{p}$ in the following way. The word consists of the concatenation of fragments, each containing the number of a Boolean variable and a generalized pointer, as shown in Fig. 2. Finally, these fragments, as well as the whole word, are presented as binary numbers. The meaning of the generalized pointer is explained in Table 2.
Fig. 1. A BDT with 4 internal nodes ($m = 5$ terminal nodes)
Fig. 2. The structure of the fragment: the number of a Boolean variable and the generalized pointer
TABLE II
THE MEANING OF THE GENERALIZED POINTER

Value  Explanation
0      return_class(0)
1      return_class(1)
2      If $x_j = 0$ then return_class(0) else next_fragment
3      If $x_j = 0$ then return_class(1) else next_fragment
4      If $x_j = 1$ then return_class(0) else next_fragment
5      If $x_j = 1$ then return_class(1) else next_fragment
6      If $x_j = 0$ then goto_fragment(2) else next_fragment
…      …
$k$    If $x_j = 0$ then goto_fragment($k - 4$) else next_fragment
Now we can write the word $\tilde{p}$ which contains all the information needed to decode the tree given in Fig. 1. This word consists of four concatenated fragments corresponding to the four internal nodes of the tree. Each fragment consists of two fields, presented in decimal form for easy understanding; but below we suppose binary fixed-width fields in all fragments, formed according to Table 2. Note that the fragments with the indexes 0 and 1 in the word $\tilde{p}$ never need to be pointed to; therefore the generalized pointer always points to indexes of fragments beginning from 2. The algorithm $\varphi$ which decodes the given word can be easily understood:
a) Get the fragment 0.
b) Decode the number $j$ extracted from the first field of the fragment, and the value $v$ of the generalized pointer extracted from the second field.
c) Execute the program code for the value of $v$ according to Table 2. Either the result (the number of a class) will be obtained and the algorithm will be stopped, or the transition to the fragment pointed to by the value $v$, or to the next concatenated fragment, will be completed.
We explain the procedures used: return_class($i$) returns the answer 0 if $i = 0$, or the answer 1 if $i = 1$, and then the algorithm is ended; next_fragment is the transition to the right, to the next fragment; goto_fragment($k$) is the transition to the fragment number $k$.
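The decoding algorithm a)-c) can be sketched as a small interpreter. The fragment semantics below follow Table 2 as reconstructed in this transcript (an assumption, since the original table is garbled), and the two example trees are hypothetical:

```python
def run_bdt(fragments, x):
    """Interpret a compressed BDT word.  `fragments` is a list of (j, v)
    pairs: Boolean-variable index j and generalized pointer v; `x` is a
    tuple of 0/1 values.  Semantics per (reconstructed) Table 2."""
    i = 0
    while True:
        j, v = fragments[i]
        if v == 0:                       # return_class(0)
            return 0
        if v == 1:                       # return_class(1)
            return 1
        if v in (2, 3):                  # if x_j = 0 then return_class(v - 2)
            if x[j] == 0:
                return v - 2
            i += 1                       # else next_fragment
        elif v in (4, 5):                # if x_j = 1 then return_class(v - 4)
            if x[j] == 1:
                return v - 4
            i += 1
        else:                            # v >= 6: if x_j = 0 goto fragment v - 4
            i = (v - 4) if x[j] == 0 else i + 1

# A two-fragment word computing A(x) = x_0:
tree = [(0, 2), (0, 1)]
print(run_bdt(tree, (0,)), run_bdt(tree, (1,)))  # -> 0 1
```

A word with a goto, e.g. `[(0, 6), (0, 1), (0, 0)]`, jumps to fragment 2 when $x_0 = 0$ and falls through to fragment 1 otherwise, computing the same function.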
To encode any tree from $S_{BDT}$, at most $m - 1$ fragments are needed, because $m - 1$ is the number of internal nodes if the number of terminal nodes is $m$. Thus, the generalized pointer has to possess the 6 special values $0, 1, \ldots, 5$ and $m - 3$ values to point to the fragments indexed as $2, \ldots, m - 2$. Therefore the number $6 + (m - 3) = m + 3$ of values for the generalized pointer is needed, and to encode them, $\lceil \log_2 (m + 3) \rceil$ binary digits are needed. Finally, $\lceil \log_2 n \rceil + \lceil \log_2 (m + 3) \rceil$ binary digits are needed to encode one fragment, and the length of the binary word $\tilde{p}$ is obtained:
$l(\tilde{p}) = (m - 1)(\lceil \log_2 n \rceil + \lceil \log_2 (m + 3) \rceil) + \lceil \log_2 l \rceil$.
Note that $\lceil \log_2 l \rceil$ binary digits are added into the word $\tilde{p}$ to define the length $l$ of the binary string which is the output of the algorithm $\varphi$. Since a tree never depends on the sample length $l$, the addition $\lceil \log_2 l \rceil$ must be excluded from $l(\tilde{p})$ when the VCD is estimated by the pVCD method. Taking into account the inequality $\mathrm{VCD}(S) \le l(\tilde{p}) - \lceil \log_2 l \rceil$, we get the following estimation for the family of Binary Decision Trees with at most $m$ terminal nodes when at most $n$ binary variables are used:
$\mathrm{VCD}(S_{BDT}) \le (m - 1)(\lceil \log_2 n \rceil + \lceil \log_2 (m + 3) \rceil)$.
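The estimation above is a one-line computation; the following sketch evaluates it (the sample values $m = 5$, $n = 16$ are illustrative):

```python
import math

def vcd_bdt(m, n):
    """pVCD upper estimate for binary decision trees with at most m terminal
    nodes over n Boolean variables: (m - 1) fragments, each holding a
    variable number (ceil(log2 n) bits) and a generalized pointer over
    m + 3 values (ceil(log2(m + 3)) bits)."""
    return (m - 1) * (math.ceil(math.log2(n)) + math.ceil(math.log2(m + 3)))

print(vcd_bdt(m=5, n=16))  # -> 28, i.e. 4 fragments of 4 + 3 bits each
```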
For the family $S_{LBDT}$ of Binary Decision Trees with at most $m$ terminal nodes, with a linear predicate in any internal node, with at most $n$ variables, and with coefficients and variable values presented in $c$ digits per word, we easily get the estimation
$\mathrm{VCD}(S_{LBDT}) \le (m - 1)((n + 1)c + \lceil \log_2 (m + 3) \rceil)$.
Note that the family $S_{LBDT}$ is a very extensive class of algorithms; therefore the estimation of its VCD takes large values when all $n$ variables are used to define a linear separating rule in any internal node. For neural networks with $r$ nodes in a single hidden layer and with weights presented in $c$ digits per word, the same bit counting gives the estimation
$\mathrm{VCD}(S_{NN}) \le c\,(r(n + 2) + 1)$.
IV. VERIFICATION OF SIGNIFICANCE LEVEL OF REGULARITIES DISCOVERED IN EMPIRICAL DATA IN
THE TERMS OF THE KOLMOGOROV APPROACH
Definition 2. Let $\bar{x}_l$ be a fixed sample given from the general population $\widetilde{X}^l_f$, and let $S$ be the family of algorithms used for training. A solution $A \in S$ of the functional system (2), if it exists, is called a correct tuning on the sample $\bar{x}_l$. A solution $A \in S$ of the functional system (3), if it exists, is called a tuning on the fixed $l - s$ elements of the sample $\bar{x}_l$.

$A(x_i) = f(x_i), \quad i = 1, \ldots, l; \quad A \in S$ (2)

$A(x_i) = f(x_i), \quad i \in I \subset \{1, \ldots, l\}, \; \mathrm{card}\,I = l - s; \quad A \in S$ (3)
Evidently, a tuning on the fixed $l - s$ elements of the sample $\bar{x}_l$ is a correct tuning on some part of the sample. In machine learning problems, as usual, the sample is randomly and independently derived from the general population. Below we use the model with derivation from the general population $\widetilde{X}^l_f$. In a randomly derived pair $(x, f(x))$, the Boolean vector $x$ appears with a certain probability. When the correct tuning is realized in some way and there are no errors on the given sample, the values $A(x)$ on the set $X \setminus \{x_1, \ldots, x_l\}$ can be arbitrary, and the decision rule $A$ which is found can be erroneous, generally speaking, on any $x \in X \setminus \{x_1, \ldots, x_l\}$. In other words, direct solving of the systems (2) or (3) is absolutely not equivalent to learning from examples! To realize an empirical induction based on the sample, it is necessary to generalize the properties of this sample so as to obtain not only zero empirical error on this sample, but also as few errors as possible on all admissible objects of the set $X$. What happens when we choose a family $S$ which contains a correct tuning on the given sample, but does not contain the true (or close to the true) regularity which generates the samples derived from the general population? We consider such an event as a random tuning on the sample.
Theorem 2. Let the probability model of derivation of $\bar{x}_l$ from the general population be such that the appearance of any Boolean vector $x$ in an arbitrarily derived pair $(x, f(x))$ is equally probable. Then the probability $P_{rt}$ of a random tuning on some $l - s$ elements of the sample satisfies the inequality
$P_{rt} \le C_l^s \, 2^{K_l(S) - (l - s)}$,
where $K_l(S)$ is the Kolmogorov complexity of the family $S$, and $s$ is the number of errors assumed on the training sample by the algorithm $A$ realized as a result of training.
Proof. The family $S$ unambiguously generates the finite set $W$ of various ways of classification for any given sample $\bar{x}_l$. The cardinality of the set $W$ is at most $m^S(l)$. A correct tuning on all the elements of a sample can be realized if and only if the way $\bar{\alpha} = (f(x_1), \ldots, f(x_l))$ of classification of the sequence $\bar{x}_l$ onto two classes is contained in the set $W$ (in other words, when a random extraction of a sample is realized, the vector $\bar{\alpha}$ "hits" into the set $W$). Any possible $\bar{\alpha}$ which can be presented in a sample is equally probable according to the condition of the theorem. Therefore the probability of a correct tuning on a fixed part of the sample of a length $l - s$ is at most $m^S(l) / 2^{l - s}$. The $l - s$ elements from $\bar{x}_l$ can be chosen in $C_l^s$ ways. Therefore we have the estimation
$P_{rt} \le C_l^s \, \dfrac{m^S(l)}{2^{l - s}}$.
According to Corollary 1, $K_l(S) = \lceil \log_2 m^S(l) \rceil$, hence $m^S(l) \le 2^{K_l(S)}$. Therefore
$P_{rt} \le C_l^s \, 2^{K_l(S) - (l - s)}$.
Corollary 3. The probability $P_{rt}$ of a random correct tuning on the whole sample $\bar{x}_l$ satisfies the inequality $P_{rt} \le 2^{K_l(S) - l}$.
Corollary 4. If the estimation $\lambda_0$ of the Kolmogorov complexity by the pVCD method is obtained, so that $K_l(S) \le \lambda_0$, then $P_{rt} \le 2^{\lambda_0 - l}$.
If $l \ge K_l(S) + 5$, then $P_{rt} \le 2^{-5} = 0.03125 < 0.04$, and the nonrandomness of the regularity found will be not less than 0.96. This is acceptable in practice. Thus we have the rule "of plus five": to obtain a reliable regularity when machine learning is used, the length of the training sample must exceed the Kolmogorov complexity of the algorithm family used by at least 5.
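The rule "of plus five" can be verified directly from the bound of Theorem 2; in the sketch below, the complexity value $K = 40$ is an arbitrary illustrative assumption:

```python
import math

def random_tuning_bound(K, l, s=0):
    """Theorem 2 bound: P_rt <= C(l, s) * 2 ** (K - (l - s));
    s = 0 gives Corollary 3 (a random correct tuning on the whole sample)."""
    return math.comb(l, s) * 2.0 ** (K - (l - s))

K = 40                                    # an assumed complexity estimate
print(random_tuning_bound(K, K + 5))      # -> 0.03125  (the rule "of plus five")
print(1 - random_tuning_bound(K, K + 5))  # -> 0.96875  (nonrandomness >= 0.96)
```

For any $K$, taking $l = K + 5$ drives the bound to exactly $2^{-5}$, which is why the rule is stated independently of the concrete family.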
CONCLUSION
The novel pVCD method presented in this paper allows one to estimate both the complexity $K_l(S)$ and the $\mathrm{VCD}(S)$ of a family of learning algorithms by using the technique of programming, which gives it advantages as compared to the more complicated combinatorial approach. The possible applications of the presented results are the following: obtaining novel estimations of the VCD; reliability estimation of the algorithms which are found as a result of machine learning; and estimation of the required lengths of training samples.
REFERENCES
[1] V. N. Vapnik. Recovery of Dependencies by Empirical Data. Moscow: Nauka, 1979 (in Russian).
[2] A. N. Kolmogorov. Information Theory and Theory of Algorithms. Moscow: Nauka, 1987 (in Russian).
[3] P. M. B. Vitanyi, M. Li. Minimum Description Length induction, Bayesianism, and Kolmogorov complexity, IEEE Trans. on Inf. Theory, 46(2), pp. 446-464.
[4] L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition. NY: Springer-Verlag, 1997.
[5] V. I. Donskoy. Kolmogorov complexity of the classes of partly recursive functions with a restricted capacity, Tavrian Herald for Computer Science Theory and Mathematics, 1, 2005, pp. 25-34 (in Russian).