Large-Scale Knowledge Processing (1st part, lecture-5)
Shin-ichi Minato
Division of Computer Science and Information Technology,Graduate School of Information Science and Technology,
Hokkaido University
2017.12.15 Large-Scale Knowledge Processing 2
Review of the last lecture
• BDD applications
  – VLSI design automation
    • Verification, logic optimization, test generation
  – Combinatorial problems, optimization
    • Knapsack problem, 8-queens, Traveling salesman, etc.
  – Comparison with problem-specific methods
• Representation of sets of combinations and ZDDs (Zero-suppressed BDDs)
  – Sets of combinations
  – ZDD reduction rule
  – Basic properties of ZDDs
BDD applications and ZDDs
Exercises
• Confirm that the BDDs’ logical OR operation algorithm can be applied to ZDDs as well, and that it corresponds to the Union operation on sets of combinations.
• Similarly, confirm that the BDDs’ logical AND operation algorithm can be applied to ZDDs as well, and that it corresponds to the Intersection operation on sets of combinations.
[Figure: ZDDs for {ab, acd, cd} (nodes N1–N4) and {ac, ad, bc, bd} (nodes N5–N8), and the recursive computation of {ab, acd, cd} ∪ {ac, ad, bc, bd}: the Union expands first on a (N7∪N8), then on b, c, d, down to terminal cases such as 0∪N4 = N4, N2∪1, and 1∪0 = 1.]
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
ZDDs for integer-valued functions
We developed the manipulation system not only for (Boolean) combinatorial itemsets, but also for “itemset histogram” data:
• Applications to database analysis and knowledge discovery.
• A convenient tool for research and development on various combinatorial problems.
Itemset histogram (integer-valued itemsets)
• A set of item combinations (sum-of-products), with a value (coefficient or weight) for each product term. We consider integer values only, so far.
• We do not consider higher-degree terms: 1 + 1 = 2, a + a = 2a, and 2×2 = 4, a×b = ab, but a×a = a. We consider only combinatorial item sets, not general polynomials.
• (Example) 5abc + 3ab + 2bc + c: a set of four product terms with values 5, 3, 2, 1. (We assume zero values for all other item combinations.)
• A basic model in various combinatorial problems.
Algebraic operations in (ordinary) ZDDs
φ, { λ }, and P.top can be executed in constant time. The other operations take almost linear time in the ZDD size.
Representation of itemset histograms using ZDDs
• An original ZDD distinguishes only the existence of each combination; it cannot count numbers.
• Two kinds of extended ZDDs have been proposed.
[Figure: two extended ZDDs for the histogram 5abc + 3ab + 2bc + c: a Multi-Terminal ZDD whose terminal nodes hold the values (5, 3, 2, …), and a ZDD-vector with three root pointers F2, F1, F0 sharing one ZDD.]
ZDD-vector for itemset histograms
We use a binary-encoded method with ZDD-vectors: encode the frequency numbers into m-bit binary code, and represent each bit of the combination set using a ZDD.
F0 = {abc, ab, c}, F1 = {ab, bc}, F2 = {abc}

  itemset | frequency | F2 F1 F0
  abc     | 5 (101)   |  1  0  1
  ab      | 3 (011)   |  0  1  1
  bc      | 2 (010)   |  0  1  0
  c       | 1 (001)   |  0  0  1
[Figure: the shared ZDD holding the three digit functions F2, F1, F0 of this histogram.]
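The bit-slicing itself is easy to sketch with plain Python sets standing in for the digit ZDDs. This is an illustrative model only; in the real system each Fi is a shared ZDD, and the function name `to_zdd_vector` is hypothetical:

```python
def to_zdd_vector(histogram):
    """Split an itemset histogram into bit-level sets F0, F1, ..., F(m-1).

    histogram: dict mapping frozenset-of-items -> positive frequency.
    Returns a list of sets; Fi holds the itemsets whose frequency has bit i set.
    """
    bits = max(histogram.values()).bit_length()
    return [{t for t, f in histogram.items() if (f >> i) & 1}
            for i in range(bits)]

# The slide's example: 5abc + 3ab + 2bc + c
H = {frozenset("abc"): 5, frozenset("ab"): 3,
     frozenset("bc"): 2, frozenset("c"): 1}
F = to_zdd_vector(H)
assert F[0] == {frozenset("abc"), frozenset("ab"), frozenset("c")}  # F0
assert F[1] == {frozenset("ab"), frozenset("bc")}                   # F1
assert F[2] == {frozenset("abc")}                                   # F2
```

Each digit set is exactly one of the slides' Fi functions; the ZDD representation additionally shares equal substructures among the digits.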
Negative numbers in ZDD-vectors
Conventional methods for negative numbers:
• 2’s-complement representation: many non-zero bits appear in the higher digits, e.g. (–1) = “1111111111”, which is unsuitable for ZDD reduction.
• Absolute value with a sign bit: addition becomes complicated.
We use binary coding based on (–2)^n: (1, –2, 4, –8, …).
(ex) (–12) = (–2)^5 + (–2)^4 + (–2)^2 = –32 + 16 + 4.
• This representation is unique, just like ordinary binary coding.
• The higher digits become zero both for positives and negatives, and the ZDD reduction rule effectively eliminates those meaningless higher digits. [Minato95]
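As a small sketch of this base-(–2) coding (a hypothetical helper, not VSOP’s actual code), the digits can be computed by repeated division by –2:

```python
def negabinary_digits(n):
    """Return the base-(-2) digits of integer n, least significant first.

    Each digit is 0 or 1, and n == sum(d * (-2)**i for i, d in enumerate(digits)).
    """
    digits = []
    while n != 0:
        r = n & 1              # remainder modulo 2 (0 or 1)
        digits.append(r)
        n = (n - r) // -2      # divide the rest by the base -2
    return digits

# (-12) = (-2)^5 + (-2)^4 + (-2)^2 = -32 + 16 + 4, as on the slide:
assert negabinary_digits(-12) == [0, 0, 1, 0, 1, 1]
# No sign bit and no infinite run of 1s: higher digits are zero for both signs.
assert negabinary_digits(3) == [1, 1, 1]   # 3 = 1 - 2 + 4
```

Because both positive and negative values terminate in zeros, the zero-suppression rule prunes the unused high digits of the ZDD-vector automatically.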
Combining ZDD-vectors into ZDDs
• Special item symbols are defined to combine a ZDD-vector into a single ZDD.
• 20 special symbols can deal with about 1,000,000 digits.
• An itemset-histogram data structure is then represented by a single 1-word pointer.
Itemset histograms by arithmetic operations
ZDDs grow as a result of applying arithmetic operations.
[Figure: inside the ZDD package, a “minus” operation on the items a and b yields (a – b), a “times” operation on the constant 3 and the item c yields 3c, and a further “times” yields 3ac – 3bc.]
Basic operations of the itemset-histogram algebra
P + Q : addition of the numbers of occurrences.
Multiplication by a monomial
• Multiplication by an item (F×v): attach v to each product term not containing v, by just applying the Offset and Change operations of ZDDs.
  (ex) (a b + 3 b c – c) × a → a b + 3 a b c – a c
• Multiplication by a constant integer (F×n): numerical multiplication of each term’s value.
  (ex) (a b + 3 b – c) × 5 → 5 a b + 15 b – 5 c
  If n is (–2)^k, just shift F by k digits; otherwise, decompose n into its (–2)^k digits and sum the partial products.
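A plain-dict model of multiplication by an item (not the ZDD Offset/Change implementation; `times_item` is a hypothetical name) makes the v·v = v rule explicit:

```python
def times_item(F, v):
    """F * v for an item v: attach v to every product term.  A term already
    containing v is unchanged (v*v = v); colliding coefficients are summed."""
    R = {}
    for term, c in F.items():
        t = term | {v}
        R[t] = R.get(t, 0) + c
    return {t: c for t, c in R.items() if c != 0}

# (ab + 3bc - c) * a  ->  ab + 3abc - ac, as in the slide's example
F = {frozenset("ab"): 1, frozenset("bc"): 3, frozenset("c"): -1}
assert times_item(F, "a") == {frozenset("ab"): 1,
                              frozenset("abc"): 3,
                              frozenset("ac"): -1}
```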
Addition of two polynomials
Addition of two polynomials (F + G): each term’s value is the sum of the values of the same item combination in F and G.
(ex) F = a b + 2 b c – 3 c, G = 3 a c – 2 b c + c → (F + G) = a b + 3 a c – 2 c
Considering the ZDD-vectors of F and G, the common product terms (F∩G) on each digit form the set of carries in (F + G):
• If (F∩G) is empty, (F + G) is just the union of F and G.
• Otherwise the common terms are doubled (carried) and added. Doubling (a 1-bit shift) may cause new common terms, so the addition is repeated until the common terms disappear.
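The carry-repeat idea can be sketched with lists of Python sets standing in for the digit ZDDs. For simplicity this sketch uses ordinary base 2 (an assumption; the real system uses base (–2), where doubling turns into a subtraction, as the next slide explains):

```python
def vector_add(F, G):
    """Add two digit-set vectors: digit i holds the terms whose coefficient
    has bit i set.  Symmetric difference is the carry-free sum; intersections
    are the carries, shifted one digit up and added again until they vanish."""
    n = max(len(F), len(G)) + 1
    F = F + [set()] * (n - len(F))
    G = G + [set()] * (n - len(G))
    while any(G):
        S = [f ^ g for f, g in zip(F, G)]            # sum without carries
        C = [set()] + [f & g for f, g in zip(F, G)]  # common terms -> carries
        F = S + [set()]
        G = C[:len(F)]
    return F

def value(F, term):
    return sum(2**i for i, d in enumerate(F) if term in d)

a = frozenset("a")
# 3a + 1a = 4a: F encodes 3 as binary 11, G encodes 1 as binary 1
assert value(vector_add([{a}, {a}], [{a}]), a) == 4
```

Note how the carry at one digit can create a new common term at the next digit, exactly the repetition the slide describes.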
Algorithm for addition & subtraction
A 1-bit shift is a (–2)× multiplication in our implementation, so (S + 2C) → (S – (–2)C): the addition operation calls subtraction. Similarly, (D – 2B) → (D + (–2)B): the subtraction operation calls addition. Addition and subtraction thus have a dual structure. The number of repetitions is bounded by the maximum run of carries in the addition/subtraction.
Multiplication of two polynomials
(F×G) is the sum over all pairs of product terms taken from F and G, respectively.
(ex) F = a b + 3 c, G = 4 a c – a → (F×G) = 4 a b c – a b + 9 a c
When v is the highest-ordered item in the ZDDs, we can get the sub-functions F = F0 + v F1 and G = G0 + v G1, and the multiplication is decomposed as:
(F×G) = (F0×G0) + v (F0×G1 + F1×G0 + F1×G1)
Each sub-operation is executed recursively; we use a hash-based cache to avoid duplicated operations. The computation time depends on the number of ZDD nodes.
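A brute-force dict model of the product (quadratic in the number of terms, unlike the recursive ZDD decomposition above; `multiply` is a hypothetical name) reproduces the slide’s example:

```python
def multiply(F, G):
    """F * G over itemset histograms: sum over all pairs of terms, with
    itemset union as the term product (since v*v = v for items)."""
    R = {}
    for t1, c1 in F.items():
        for t2, c2 in G.items():
            t = t1 | t2
            R[t] = R.get(t, 0) + c1 * c2
    return {t: c for t, c in R.items() if c != 0}

# F = ab + 3c, G = 4ac - a  ->  F*G = 4abc - ab + 9ac
# (the ac coefficient is 3*4 - 3 = 9, from c*4ac and c*(-a))
F = {frozenset("ab"): 1, frozenset("c"): 3}
G = {frozenset("ac"): 4, frozenset("a"): -1}
assert multiply(F, G) == {frozenset("abc"): 4,
                          frozenset("ab"): -1,
                          frozenset("ac"): 9}
```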
Division by a monomial
• Division by an item: the quotient (F / v) and remainder (F % v) are just the Onset and Offset operations of ZDDs.
  (ex) F = 3 a b c + a b – c + 1 → (F / c) = 3 a b – 1, (F % c) = a b + 1
  Computation time is linear in the ZDD size below v.
• Division by a constant integer: (F / n) and (F % n) apply integer division to each term’s value.
  (ex) F = 7 a b c + 5 b c + 3 → (F / 3) = 2 a b c + b c + 1, (F % 3) = a b c + 2 b c
  Uses the conventional division algorithm for binary numbers; computation time depends on the ZDD size and the bit length.
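Division by an item is just a partition of the terms, as this dict sketch shows (`div_item` is a hypothetical name; in the real system these are the Onset/Offset traversals):

```python
def div_item(F, v):
    """Quotient F / v and remainder F % v for an item v: the quotient
    collects the terms containing v (with v removed), the remainder the rest."""
    q = {t - {v}: c for t, c in F.items() if v in t}
    r = {t: c for t, c in F.items() if v not in t}
    return q, r

# F = 3abc + ab - c + 1  ->  F/c = 3ab - 1,  F%c = ab + 1
F = {frozenset("abc"): 3, frozenset("ab"): 1,
     frozenset("c"): -1, frozenset(): 1}
q, r = div_item(F, "c")
assert q == {frozenset("ab"): 3, frozenset(): -1}
assert r == {frozenset("ab"): 1, frozenset(): 1}
```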
Division by a polynomial
Division of two polynomials (F / G) is not unique, so we must define the quotient in our system. The weak-division method has been used in VLSI CAD: the quotient is the set of terms common to the quotients by the respective terms of G: (F/G1) ∩ (F/G2) ∩…∩ (F/Gm).
(ex) F = a b + a c + a d + b c + b d, G = a + b → (F/G) = (b + c + d) ∩ (a + c + d) = (c + d).
We propose a valued weak-division method: the quotient takes the minimum-absolute-value terms among the quotients by the respective terms of G, as MinAbs{(F/G1), (F/G2), …, (F/Gm)}.
(ex) F = 2 a b + 4 a c + a d – 2 b c + 3 b d, G = a + b → (F/G) = MinAbs{(2 b + 4 c + d), (2 a – 2 c + 3 d)} = (–2 c + d).
A natural extension of Boolean weak division, with an efficient recursive algorithm using a hash-based cache.
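The Boolean weak division above can be sketched directly over sets of terms (this covers only the Boolean variant, not the valued MinAbs extension; `weak_divide` is a hypothetical name):

```python
from functools import reduce

def weak_divide(F, G):
    """Boolean weak division: the quotient is the intersection of the
    quotients of F by each single term of the divisor G.
    F: set of terms (frozensets); G: set of divisor terms (frozensets)."""
    quotients = [{t - g for t in F if g <= t} for g in G]
    return reduce(set.intersection, quotients)

# F = ab + ac + ad + bc + bd, G = a + b  ->  F/G = c + d
F = {frozenset(t) for t in ("ab", "ac", "ad", "bc", "bd")}
G = {frozenset("a"), frozenset("b")}
assert weak_divide(F, G) == {frozenset("c"), frozenset("d")}
```

Here F/a = {b, c, d} and F/b = {a, c, d}; their intersection {c, d} is the slide’s quotient.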
Numerical comparison
Comparison operators: (F [==, !=, >, >=, <, <=] G) yield the set of item combinations, included in F or G, that satisfy the numerical relation.
(ex) F = 3 a b + 2 b c – c, G = 2 a b – 2 b + 3 c → (F > G) = a b + b c + b, (F != 0) = a b + b c + c
Almost the same computation time as subtraction. (F != 0) can be used to normalize all values to 1.
(ex) F – (F != 0) decrements the values of all existing terms.
Other operations
• If-Then-Else operation (F ? G : H): gives the value of G if a term is included in F, otherwise the value of H. Various non-linear functions can be obtained with this operation.
  (ex) ((F > 0) ? F : –F) → Abs(F), ((F > G) ? F : G) → Max(F, G)
• Restrict and Permit operations [Okuno98]:
  – F.Restrict(G) extracts the terms of F whose item combination is a superset of at least one combination in G.
  – F.Permit(G) extracts the terms of F whose item combination is a subset of at least one combination in G.
  Useful for solving constraint satisfaction problems.
Display formats of itemset histograms
• Polynomial expression, integer Karnaugh map, sorting by values, bitwise listing.
• Displaying statistical information: number of product terms, density of solutions, ZDD size.
• Displaying satisfiable solutions: any one term, max/min value of a term, max/min item cost of a term.
VSOP: interpreter for handling itemset histograms
• VSOP: a nickname from “Valued-Sum-Of-Products”. Written in C, C++, and yacc/lex, on Linux PCs.
• C-shell-like user interface: interactive keyboard input, or batch-style script-file input.
• Two types of symbols:
  – Item symbols (starting with a lower-case letter), up to 65,510.
  – Program variables (starting with an upper-case letter), used to store the result of an expression; no limit on their number.
• Numerical operations for integers, with no bit-width limit.
• No loops and branches are allowed; we may use another scripting language for unrolling, and then apply VSOP to the generated script file.
• No limit on the number of lines in a script file (pipelined execution). About 30 bytes per ZDD node (10G nodes in 512GB).
Basic performance
Using a PC (800MHz, 512MB, SuSE Linux 9):
• The ZDD for (100!) can be generated in 0.2 sec: 121 ZDD nodes (a decimal number of about 160 digits).
• Calculating the VSOP for (x1+1)(x2+2)(x3+3)…(xn+n): number of terms 2^n, maximum value n!.

   n | # Terms    | Max value          | # Nodes | Time (s)
  16 | 65,536     | 20,922,789,888,000 |   9,383 |   0.69
  20 | 1,048,576  | 2.43 × 10^18       |  76,705 |  14.40
  24 | 16,777,216 | 6.20 × 10^23       | 530,308 | 276.99
VSOP for Constraint Satisfaction Problems (CSPs)
• ZDD-based CSP solvers have been studied [Okuno98]: the N-queens problem, magic square problem, etc., with many applications to real-life problems.
• VSOP is also useful for solving CSPs: its numerical operations let constraints be described more compactly and simply.
  – Each product term ↔ a solution of the problem.
  – The values of terms ↔ the costs of solutions.
Probabilistic symbolic simulation
The VSOP algebra matches probabilistic calculation: a + a = 2a, a×a = a, a×b = ab, where a and b are probability variables in [0, 1].
• If two events occur independently, logical AND becomes the arithmetic product.
• If two events are caused by the same event, x·x does not become x², but just x.
VSOP can therefore be used for probabilistic analysis.
Probabilistic symbolic logic simulation
Computing full expressions of signal probabilities for given logic circuits:
• NOT gate: Y = 1 – A
• AND gate: Y = A·B
• OR gate: Y = A + B – A·B
• XOR gate: Y = A + B – 2·A·B
Useful not only for logic circuits but also for any other system analysis.
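A numeric illustration of the gate formulas (VSOP manipulates them symbolically; these helper names are hypothetical):

```python
def p_not(a):    return 1.0 - a
def p_and(a, b): return a * b             # independent inputs
def p_or(a, b):  return a + b - a * b
def p_xor(a, b): return a + b - 2 * a * b

# Signal probabilities for uniformly random inputs (p = 0.5):
assert p_and(0.5, 0.5) == 0.25
assert p_or(0.5, 0.5) == 0.75
assert p_xor(0.5, 0.5) == 0.5
# Note: for reconvergent fanout, Y = A AND A must stay A, not A^2;
# the a*a = a rule of the VSOP algebra handles this symbolically.
```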
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
Frequent item set (pattern) mining
A basic and well-known problem in database analysis: a useful method for finding interesting association rules in huge databases, such as transaction data, web contents, bioinformatics data, etc.

  RecordID | Tuple
  1        | a b c
  2        | a b
  3        | a b c
  4        | b c
  5        | a b
  6        | a b c
  7        | c
  8        | a b c
  9        | a b c
  10       | a b
  11       | b c

• Frequency threshold = 8 : { a, b, c }
• Frequency threshold = 7 : { ab, bc, a, b, c }
• Frequency threshold = 5 : { abc, ab, bc, ac, a, b, c }
• Frequency threshold = 10 : { b }
• Frequency threshold = 1 : { abc, ab, bc, ac, a, b, c }
"Pattern-histograms"
Here, a "pattern" means an item set seen in a tuple of transaction data; a tuple of n items includes 2^n patterns. Computing a pattern-histogram is therefore much harder than a tuple-histogram, and in many cases it is difficult to compute completely. Conventional methods extract only the frequent patterns (appearing more than α times).
(ex) Tuple-histogram:

  Tuple | Freq.
  abc   | 5
  ab    | 3
  bc    | 2
  c     | 1

Patterns per tuple: abc → {abc, ab, bc, ac, a, b, c, 1}; ab → {ab, a, b, 1}; bc → {bc, b, c, 1}; c → {c, 1}
Resulting pattern-histogram:

  Pattern | Freq.
  abc     | 5
  ab      | 8
  bc      | 7
  ac      | 5
  a       | 8
  b       | 10
  c       | 8
  1       | 11
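The expansion above can be brute-forced in a few lines, which also shows why it blows up: every tuple fans out into all 2^n sub-patterns. (A sketch with a hypothetical function name; this exponential enumeration is exactly what the ZDD-based methods below avoid.)

```python
from itertools import combinations

def pattern_histogram(tuple_hist):
    """Expand a tuple-histogram into the pattern-histogram: each tuple of
    n items contributes its frequency to all 2^n of its sub-patterns."""
    P = {}
    for t, f in tuple_hist.items():
        items = sorted(t)
        for k in range(len(items) + 1):
            for sub in combinations(items, k):
                s = frozenset(sub)
                P[s] = P.get(s, 0) + f
    return P

H = {frozenset("abc"): 5, frozenset("ab"): 3,
     frozenset("bc"): 2, frozenset("c"): 1}
P = pattern_histogram(H)
assert P[frozenset("ab")] == 8 and P[frozenset("b")] == 10
assert P[frozenset("ac")] == 5 and P[frozenset()] == 11
```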
Frequent pattern enumeration by ZDD-growth
When the pattern-histogram becomes too large, we can set the minimum support α to a larger number; fewer frequent patterns are then found, and the ZDD size becomes smaller. We have developed the “ZDD-growth” algorithm to generate the ZDD of a pattern-histogram directly from a given ZDD of the tuple-histogram.
(ex) From the tuple-histogram (abc: 5, ab: 3, bc: 2, c: 1), ZDD-growth with minimum threshold α = 7 yields the frequent patterns { ab, bc, a, b, c }.
[Figure: the ZDD F representing this tuple-histogram.]
Key point of the ZDD-growth algorithm
• Depth-first search on the ZDD-based tuple-histogram; the depth-first manner matches the ZDD’s recursive structure.
• The frequent pattern set Freq[H] can be computed as:
  Freq[H] = Freq[H.factor1(v)] * v + Freq[H.factor1(v) + H.factor0(v)]
• Freq[H] is computed recursively from the sub-problems for the two cofactors H.factor1(v) and H.factor0(v), in a depth-first manner.
• The cofactors do not include item v, so the recursion depth is bounded by the number of items.
• We use a hash-based cache technique to avoid duplicated operations at the shared nodes.
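The recursion can be sketched with plain dicts as the histograms (no node sharing and no cache, so this is only a model of the ZDD-growth recurrence; `freq_patterns` is a hypothetical name):

```python
def freq_patterns(H, items, alpha):
    """Frequent patterns of a tuple-histogram H (dict: frozenset -> count),
    following Freq[H] = Freq[H1] * v + Freq[H1 + H0]: a pattern containing v
    is frequent iff it is frequent in factor1(v); a pattern without v is
    counted in the sum of both cofactors."""
    if not items:
        return {frozenset()} if sum(H.values()) >= alpha else set()
    v, rest = items[0], items[1:]
    H1 = {t - {v}: c for t, c in H.items() if v in t}   # H.factor1(v)
    H0 = {t: c for t, c in H.items() if v not in t}     # H.factor0(v)
    Hs = dict(H0)
    for t, c in H1.items():
        Hs[t] = Hs.get(t, 0) + c                        # H1 + H0
    with_v = {p | {v} for p in freq_patterns(H1, rest, alpha)}
    return with_v | freq_patterns(Hs, rest, alpha)

H = {frozenset("abc"): 5, frozenset("ab"): 3,
     frozenset("bc"): 2, frozenset("c"): 1}
result = freq_patterns(H, ["a", "b", "c"], 7) - {frozenset()}
assert result == {frozenset(s) for s in ("ab", "bc", "a", "b", "c")}
```

The final result matches the slide’s α = 7 example (the empty pattern is dropped, as in the slides).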
Algebra on tuple-histograms
Primitive operations:
• Factoring into two parts by an item.
• Attaching an item.
• Sum of two histograms.
• Counting the lines in the table.

  H:                  H1 = H.factor1(a):   H0 = H.factor0(a):
  Tuple | Freq.       Tuple | Freq.        Tuple | Freq.
  a b c | 5           b c   | 5            b c   | 2
  a b   | 3           b     | 3            c     | 1
  b c   | 2
  c     | 1

  H * d:              H1 + H0:
  Tuple   | Freq.     Tuple | Freq.
  a b c d | 5         b c   | 7
  a b d   | 3         b     | 3
  b c d   | 2         c     | 1
  c d     | 1

– Each table is compactly represented by a ZDD.
– The ZDDs share nodes with each other in memory.
ZDD-growth algorithm
[Pseudocode figure; annotations: get frequent patterns only; check the cache to avoid duplicated calls; compute the frequent item sets with v and those without v; enter the result into the cache.]
Experimental results: selected from the FIMI2003 benchmark datasets.
Properties of the ZDD-growth method
• For a large-scale database, where building the full pattern-histogram is hard, ZDD-growth still generates the frequent patterns if we set some minimum frequency α.
• It outputs a huge number of frequent patterns represented by one compressed ZDD.
• For some examples it is much (exponentially) faster than previous methods: for “BMS-WebView-1”, we can generate 30 trillion patterns in feasible time and space.
• It is not good for “T10I4D100K”, which is randomly generated data (in principle, random data cannot be compressed).
Closed item sets
A subset of item-set patterns, each of which is the unique representative of a sub-group of patterns having the same occurrence in the database.
• “Common item set” Com(ST): for a given set of tuples ST, Com(ST) is the maximal set of items commonly included in all tuples T ∈ ST.
• “Occurrence” Occ(D, X): for a given database D and item set X, Occ(D, X) is the subset of tuples in D, each of which includes X.
• If an item set X satisfies Com(Occ(D, X)) = X, we call X a “closed item set” in D.
The closed sets are a kind of compressed representation of all frequent item-set patterns.
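The definitions translate almost verbatim into a brute-force check (only a definition checker; the algorithms in this lecture enumerate closed sets far more efficiently):

```python
def occ(D, X):
    """Occ(D, X): the tuples of database D (a list of frozensets) that include X."""
    return [T for T in D if X <= T]

def com(tuples):
    """Com(ST): the maximal item set commonly included in all tuples of ST."""
    return frozenset.intersection(*tuples) if tuples else frozenset()

def is_closed(D, X):
    return com(occ(D, X)) == X

# The 11-record example database (abc x5, ab x3, bc x2, c x1):
D = [frozenset("abc")] * 5 + [frozenset("ab")] * 3 \
    + [frozenset("bc")] * 2 + [frozenset("c")]
assert is_closed(D, frozenset("ab"))      # Com(Occ(D, ab)) = ab
assert not is_closed(D, frozenset("a"))   # same occurrence as "ab"
assert not is_closed(D, frozenset("ac"))  # same occurrence as "abc"
```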
Example of closed item sets

  RecordID | Tuple
  1        | a b c
  2        | a b
  3        | a b c
  4        | b c
  5        | a b
  6        | a b c
  7        | c
  8        | a b c
  9        | a b c
  10       | a b
  11       | b c

All patterns appearing at least once: {abc, ab, ac, a, bc, b, c}
Closed item-set patterns: {abc, ab, bc, b, c}
– “ac” is deleted because its occurrence is the same as that of “abc”.
– “a” is deleted because its occurrence is the same as that of “ab”.
ZDD-growthC algorithm
[Pseudocode figure: a quite simple modification of ZDD-growth.]
Conclusion on the ZDD-growth method
• ZDD-growth-M and ZDD-growth-C are quite simple modifications of ZDD-growth that filter the maximal/closed patterns from all frequent patterns; the additional computation cost is relatively small.
• In general, closed item sets are already reduced forms, and thus the impact of ZDD compression is not remarkable. If our final goal is enumeration only (no need to store), the LCM algorithm [Uno2003] would be much better.
• Our method can also construct an efficient index of the item sets in main memory, to be used for various data analyses via ZDD-based algebraic set operations.
• Combining ZDDs and LCM would be interesting: we started a project to develop the “LCM-ZDD” method.
“LCM over ZDDs” [Minato et al. 2008]
• LCM [Uno2003]: an output-linear-time algorithm for frequent itemset mining.
• ZDD [Minato93]: a compact graph-based representation for large-scale sets of combinations.
Combining the two techniques generates large-scale frequent itemsets in main memory with a very small overhead over the original LCM (sub-linear time and space in the number of solutions when ZDD compression works well).
Naïve implementation of LCM-ZDD
[Figure: pseudocode of the original LCM side by side with a naïve “LCM-ZDD”.]
This naïve method generates the correct ZDD, but is not very effective.
Problem with the naïve implementation
The ZDD grows by repeated union operations over the solutions found in the LCM depth-first search. The consecutive solutions are quite similar to each other in most cases: only a few bottom levels differ. Yet each union operation requires O(n) steps while only a very small (almost constant) part is meaningful, so the naïve algorithm may become n times slower.
Improved implementation of LCM-ZDD
At each recursive step, we construct a ZDD only for the meaningful part; after returning from the subsidiary recursive call, we put the top item onto the current result ZDD. This avoids redundant traversals in the ZDD union operation, so the overhead factor becomes a constant.
Acceleration by hypercube decomposition [Uno2003]
Hypercube H(P): the set of items e satisfying e > tail(P) and Occ(P) = Occ(P ∪ {e}). The original LCM algorithm avoids duplicated backtracking with respect to the items included in H(P), enabling fast counting of solutions.
Our implementation and experiments
• We implemented LCM-ZDD by modifying LCM ver. 5 (open software): only 50 lines of modification in the LCM code, together with our own ZDD package (2,300 lines of C/C++).
• Experiments on the FIMI2003 benchmark datasets:
  – mushroom: ZDD compression is very effective.
  – T10I4D100K: ZDD compression is ineffective.
  – BMS-WebView-1: intermediate behavior.
• Comparison with the original LCM and ZDD-growth:
  – LCM-count: only counts the number of solutions.
  – LCM-dump: prints all solutions into a file.
Effects of LCM-ZDD
• LCM-ZDD is clearly more efficient than ZDD-growth in most cases: ZDD-growth shows comparable performance only for “mushroom” with very low minimum support; in all other cases our method overwhelms ZDD-growth.
• LCM-ZDD is much faster than LCM-dump: the original LCM-dump is known as an output-linear-time algorithm, but our LCM-ZDD required time sub-linear in the number of itemsets in this experiment.
• Our method requires almost the same time as LCM-count, yet LCM-ZDD stores all the solutions.
Post-processing of the generated patterns
A huge number of patterns can be stored in main memory: knowledge indexing based on ZDDs. The result can then be analyzed flexibly using algebraic set operations:
• Sub-pattern matching over the frequent patterns.
• Extracting long/short frequent patterns.
• Comparing two sets of frequent patterns.
• Calculating statistical data (e.g. confidence, support).
• Finding disjoint sub-factors in the frequent patterns.
A useful means of inductive database analysis.
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
ZDD variable ordering for itemset mining
• In general, BDD size greatly depends on the variable order: an exponential effect is observed for some Boolean functions, and exact optimization has been shown NP-complete [Tani93].
• ZDDs of frequent itemsets also depend on the variable ordering. We proposed a heuristic ordering method [Iwasaki2007] based on structural information of the database.
• We show instances of databases whose ZDD sizes are exponentially sensitive to the variable ordering [Minato2007]; such theoretical results are useful for developing good heuristic methods.
Variable ordering of (ordinary) BDDs
In the VLSI CAD area, many techniques have been developed for finding good variable orders for ordinary BDDs. Two empirical rules for a good ordering:
• Local computability: pairs of inputs closely related to each other should be kept in close positions in the ordering.
• Output controllability: inputs having strong controllability over the output should be placed at higher positions in the ordering.
ZDDs also have similar properties.
Consideration of ZDDs for frequent itemsets
For a simple discussion, we first assume minimum frequency = 1.

  RecordID | Tuple
  1        | a b c
  2        | a b
  3        | a b c
  4        | b c

Sub-patterns extracted from “abc”: { abc, ab, ac, a, bc, b, c, 1 }
Sub-patterns extracted from “bc”: { bc, b, c, 1 }
Item “a” controls the difference in patterns between the two tuples “abc” and “bc.” Namely, missing items have the stronger controllability in ZDDs for frequent itemsets. This is the completely opposite effect from ordinary BDDs for logic expressions, where missing variables have less controllability.
An instance dominated by local computability
Based on the above consideration, we made an artificial database, “one-pair missing”: n records over (2n – 2) items, where record i is missing only the pair (ai, bi).
(a1 b1) missing in record 1, (a2 b2) missing in record 2, …, (an bn) missing in record n.
Effect of variable ordering on “one-pair missing”
An exponential effect is observed.
Structures of ZDDs for “one-pair missing”
[Figure: with the item pairs interleaved (a1 b1 a2 b2 … a8 b8) the ZDD size is O(n); with all a items ordered above all b items it grows to O(2^n).]
An instance dominated by output controllability
A database corresponding to the 8-bit data-selector function: n records over n x-items and 2·log2(n) y-items.
• y-items (y0 y1)(y2 y3)(y4 y5): either item of each pair appears, representing the 3-digit binary number k.
• x-items: xk is missing.
Effect of variable ordering on “data-selector”
An exponential effect is also observed between the two reversed variable orderings.
Structures of ZDDs for “data-selector”
[Figure: over the n x-items and the 2 log n y-items, one variable order gives size O(n·3^(log n)) ≈ O(n^2.7), while the reversed order gives O(2^n).]
A case where no good variable ordering exists
Consider N (= n + 2 log2 n) copies of the “data-selector” dataset, with rotated variable orders. Then we cannot avoid a bad variable order in at least one of the blocks, which makes that block’s ZDD exponential. The input (database) size is still polynomial, O(N^3), yet this database requires an exponential-size ZDD under any variable ordering.
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
Decomposition of itemset data
We propose a new method of finding “hidden structures” in huge and complicated itemset data: finding decompositions of a “set of combinations”.
(ex) The set
{ x39 x86 x85 x34, x39 x86 x85, x39 x86 x34, x39 x86, x39 x85 x34, x39 x85, x39 x34, x39, x90 x86 x85 x36 x34, x90 x86 x85 x36, x90 x86 x85 x34, x90 x86 x85, x90 x86 x36 x34, x90 x86 x36, x90 x86 x34, x90 x86, x90 x85 x36 x34, x90 x85 x36, x90 x85 x34, x90 x85, x90 x36 x34, x90 x36, x90 x34, x90, x86 x85 x36 x34, x86 x85 x36, x86 x85 x34, x86 x85, x86 x36 x34, x86 x36, x86 x34, x86, x85 x59, x85 x36 x34, x85 x36, x85 x34, x85, x59, x36 x34, x36, x34, 1 }
decomposes as (x85 + 1) (x59 + (x34+1) (x86+1) (x39 + (x36 + 1) (x90 + 1) ) ).
Simple disjoint decomposition
A basic and useful concept in logic design theory, defined in Boolean function algebra: f(X, Y) = g(h(X), Y).
[Figure: the inputs X = x1 x2 … x|X| feed a block h(X); its single output s, together with the inputs Y = y1 y2 … y|Y|, feeds a block g(s, Y) computing f(X, Y).]
– Only one connection between the two blocks (“simple”).
– No common inputs in the two blocks (“disjoint”).
We will apply this concept to data mining.
Simple disjoint decompositions of Boolean functions
[Figure: f = a b + c d over the inputs a, b, c, d; and f = a b + b c, which decomposes as b (a + c).]
A decomposition does not always exist in a function, and multiple decompositions may exist in one function.
Multiple decompositions in one function
(ex) (a + b)(c + d) + e f has four decompositions, with a nesting structure.
[Figure: decomposition blocks over the inputs a, b, c, d and e, f.]
Associative logic operations may produce an exponential number of sub-decompositions, e.g. a b c = (a b) c = a (b c) = (a c) b.
Decomposition tree
• Represents the structure of the decompositions; all simple disjoint decompositions are included in one graph.
• Special labels are used for the decompositions caused by associative logic operations.
(ex) f = a b c x + x y z = s x + t x, where s = AND(a, b, c) and t = AND(y, z).
[Figure: the decomposition tree, with an AND(a, b, c) node over the leaves a, b, c and an AND(y, z) node over the leaves y, z.]
Goal: finding all simple disjoint decompositions by generating the decomposition tree for a given function.
BDD-based fast decomposition algorithm
• Methods for simple disjoint decomposition have been studied for a long time [Ashenhurst57, Roth62].
• A fast algorithm was proposed [Bertacco97] based on BDD (Binary Decision Diagram) techniques: it efficiently generates a decomposition tree including all simple disjoint decompositions, using fast equivalence checking on BDDs and a hash table to avoid duplicated computation. The time complexity is almost square in the BDD size, which is remarkably fast whenever the BDD size is feasible.
• We modified the original BDD-based algorithm to handle itemset data with Zero-suppressed BDDs.
Simple disjoint decomposition on “sets of combinations”
[Figure: block diagram of f(X, Y) = g(h(X), Y), as before, now over sets of combinations.]
For example, f(a, b, c, d) = {ac, abc, acd, abcd, cd, d} has a simple disjoint decomposition: s = h(a, b) = {a, ab} and g(s, c, d) = {sc, scd, cd, d}.
Here h(X) is an independent factor of f(X, Y).
Properties of simple disjoint decomposition on sets of combinations
• One set of combinations can include multiple decompositions: for example, {abcx, abcy, abcz, xy, xz} includes five decompositions, h(X) = {ab}, {bc}, {ac}, {abc}, and {y, z}.
• Decompositions group with associative operations: “AND” (product of sets) and “OR” (union of sets).
• A decomposition tree can be generated just as in the original decomposition algorithm.
(ex) {abcx, abcy, abcz, xy, xz} = {sx, st, tx}, where s = AND(a, b, c) and t = OR(y, z).
[Figure: the corresponding decomposition tree.]
2017.12.15 Large-Scale Knowledge Processing 74
Observation in decomposition algorithm
If F(X, Y) includes a simple disjoint decomposition P(X), then:
F(X, Y) = P(X)・Q(Y) + R(Y).
F can be expanded by an item variable v as F = v・F1 + F0, where F1 and F0 do not depend on v. We call F1 and F0 the cofactors of F with respect to v.
Since X and Y are disjoint, either v∈X or v∈Y.
When v∈X (write X’ = X − {v}, so P(X) = v・P1(X’) + P0(X’)):
F1(X’, Y) = P1(X’)・Q(Y), F0(X’, Y) = P0(X’)・Q(Y) + R(Y).
When v∈Y (write Y’ = Y − {v}):
F1(X, Y’) = P(X)・Q1(Y’) + R1(Y’), F0(X, Y’) = P(X)・Q0(Y’) + R0(Y’)
→ P(X) is commonly included in the decompositions of both F1 and F0.
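These cofactor identities can be checked on the running example f = {ac, abc, acd, abcd, cd, d}. The sketch below (illustrative names, not the lecture's code) splits a set of combinations by an item v and rebuilds it as F = v・F1 + F0.

```python
def product(f, g):
    # product of two sets of combinations
    return {a | b for a in f for b in g}

def cofactors(F, v):
    """F1 = combinations containing v (with v removed); F0 = the rest."""
    F1 = {c - {v} for c in F if v in c}
    F0 = {c for c in F if v not in c}
    return F1, F0

def compose(F1, F0, v):
    # rebuild F = v*F1 + F0
    return product({frozenset({v})}, F1) | F0

F = {frozenset(s) for s in ("ac", "abc", "acd", "abcd", "cd", "d")}

# Expand by v = 'a', an item of the bound set X = {a, b}.
# Here P(a,b) = {a, ab}, so P1 = {1, b}, P0 = {}, Q = {c, cd}, R = {cd, d}:
# the slide predicts F1 = P1*Q and F0 = P0*Q + R = R.
F1, F0 = cofactors(F, "a")
assert F1 == {frozenset(s) for s in ("c", "cd", "bc", "bcd")}  # P1*Q
assert F0 == {frozenset(s) for s in ("cd", "d")}               # R
assert compose(F1, F0, "a") == F
```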
2017.12.15 Large-Scale Knowledge Processing 75
ZDD-based recursive algorithm
A recursive algorithm similar to the original BDD-based one, but the modification from BDD to ZDD is not trivial because of the different semantics and algebra.
Main theorem: when “f0” and “f1” are the child nodes of “f” in the ZDD, the decomposition tree for “f” can be generated by “merging” the two decomposition trees for “f0” and “f1”.
This gives a fast recursive algorithm. Time complexity is roughly quadratic in the ZDD size; remarkably fast when the ZDD size is feasible.
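To illustrate this recursive, memoized style of ZDD algorithms (a toy sketch, not the lecture's BDD package), here is a minimal ZDD with hash-consed nodes and the Union operation from the earlier exercise; the decomposition algorithm follows the same pattern of one cached recursion per node pair.

```python
# Nodes are hash-consed (var, lo, hi) tuples; terminals are "0" and "1".
ZERO, ONE = "0", "1"
unique = {}   # unique table: one node object per (var, lo, hi)
memo = {}     # operation cache for union

def getnode(var, lo, hi):
    if hi == ZERO:                 # zero-suppression rule
        return lo
    return unique.setdefault((var, lo, hi), (var, lo, hi))

def union(f, g):
    if f == ZERO: return g
    if g == ZERO: return f
    if f == g:    return f
    if (f, g) in memo:
        return memo[(f, g)]
    if f == ONE:                   # merge the empty combination into g's lo branch
        r = getnode(g[0], union(ONE, g[1]), g[2])
    elif g == ONE:
        r = getnode(f[0], union(f[1], ONE), f[2])
    elif f[0] == g[0]:
        r = getnode(f[0], union(f[1], g[1]), union(f[2], g[2]))
    elif f[0] < g[0]:              # smaller variable name = closer to root
        r = getnode(f[0], union(f[1], g), f[2])
    else:
        r = getnode(g[0], union(f, g[1]), g[2])
    memo[(f, g)] = r
    return r

def zdd_of(combinations):
    """Build a ZDD from strings like "acd" (one letter per item)."""
    f = ZERO
    for comb in combinations:
        g = ONE
        for v in sorted(comb, reverse=True):
            g = getnode(v, ZERO, g)
        f = union(f, g)
    return f

def count(f):
    """Number of combinations represented."""
    if f == ZERO: return 0
    if f == ONE:  return 1
    return count(f[1]) + count(f[2])

# The union example from the earlier slides:
A = zdd_of(["ab", "acd", "cd"])
B = zdd_of(["ac", "ad", "bc", "bd"])
U = union(A, B)
assert count(U) == 7
assert U == zdd_of(["ab", "acd", "cd", "ac", "ad", "bc", "bd"])  # canonicity
```

Because results are cached in `memo`, each pair of nodes is processed at most once, which is where the near-quadratic complexity bound comes from.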
2017.12.15 Large-Scale Knowledge Processing 76
Implementation and experiments
The program is implemented in C/C++ on our BDD package (which manipulates up to 10,000,000 ZDD nodes); about 1,600 lines for the decomposition algorithm. Machine: Pentium-4, 800MHz, 512MB memory, SuSE Linux 9.
Experiments on benchmark databases [FIMI03]:
“mushroom” – 119 items, 8,124 records.
“T10I4D100K” (first 1,000 lines) – 795 items, 1,000 records.
First we generate frequent itemset data with various frequency thresholds, and then apply our decomposition algorithm to each itemset data.
2017.12.15 Large-Scale Knowledge Processing 77
Experimental results
2017.12.15 Large-Scale Knowledge Processing 78
Result for “mushroom” with threshold = 5,000
Itemset list, included in at least 5,000 records in the database:
{ x39 x86 x85 x34, x39 x86 x85, x39 x86 x34, x39 x86, x39 x85 x34, x39 x85, x39 x34, x39, x90 x86 x85 x36 x34, x90 x86 x85 x36, x90 x86 x85 x34, x90 x86 x85, x90 x86 x36 x34, x90 x86 x36, x90 x86 x34, x90 x86, x90 x85 x36 x34, x90 x85 x36, x90 x85 x34, x90 x85, x90 x36 x34, x90 x36, x90 x34, x90, x86 x85 x36 x34, x86 x85 x36, x86 x85 x34, x86 x85, x86 x36 x34, x86 x36, x86 x34, x86, x85 x59, x85 x36 x34, x85 x36, x85 x34, x85, x59, x36 x34, x36, x34, 1 }
Factored form: (x85 + 1) (x59 + (x34+1) (x86+1) (x39 + (x36 + 1) (x90 + 1) ) )
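As a check (a Python sketch, not part of the lecture), expanding the factored form should reproduce the listed frequent itemsets: 42 of them, counting the final "1" (the empty itemset).

```python
def product(f, g):
    # product of two sets of combinations
    return {a | b for a in f for b in g}

def single(n):     # the bare variable x
    return {frozenset({n})}

def plus1(n):      # the factor (x + 1)
    return {frozenset({n}), frozenset()}

# (x85 + 1) (x59 + (x34+1) (x86+1) (x39 + (x36+1) (x90+1)))
inner = single("x39") | product(plus1("x36"), plus1("x90"))
mid   = product(plus1("x34"), product(plus1("x86"), inner))
expr  = product(plus1("x85"), single("x59") | mid)

assert len(expr) == 42                      # matches the listed itemsets
assert frozenset({"x85", "x59"}) in expr    # "x85 x59" from the list
assert frozenset() in expr                  # the trailing "1"
```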
2017.12.15 Large-Scale Knowledge Processing 79
Results for “mushroom” with various thresholds
(#Patterns for the various thresholds: 18,094,822; 1,442,504; 123,278; 6,624)
2017.12.15 Large-Scale Knowledge Processing 80
Concluding remarks
Recent topics on ZDD-based data mining & knowledge discovery were presented.
ZDDs automatically compress data consisting of a large number of combinatorial items and store it in main memory. Various set operations can be executed without decompression.
ZDD manipulation is an in-memory algorithm: it aborts when memory overflows.
In the VLSI design area, we could divide a problem into small sub-parts so as to fit in memory; this was used successfully in the 1990s.
In the database area, the original data is usually placed in external storage, so ZDD-based techniques were not popular in the 1990s.
After 2000, PC memory sizes grew rapidly, and in-memory computation has come to be used more widely in data mining and knowledge processing. BDD/ZDD-based techniques will be utilized more.
2017.12.15 Large-Scale Knowledge Processing 81
Summary
ZDD-vector and “Itemset-histogram algebra” ZDD-representation for polynomials with integer coefficients Operation algorithms for itemset-histogram data VSOP: Interpreter for manipulating itemset-histogram data
Frequent itemset mining using ZDDs ZDD-growth: ZDD-based frequent itemset mining method Algorithm for generating maximal/closed itemsets LCM-ZDD: Fast method for generating frequent itemsets.
ZDD variable ordering in database analysis Mining internal structures of combinatorial itemset data
Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
2017.12.15 Large-Scale Knowledge Processing 82
Exercises
Draw a ZDD-vector representing the polynomial a b c + 3 a b + 2 a c + 6 a + b c + 3 b + 2 c + 6, which is equivalent to (a+1)(b+2)(c+3).
Use ordinary 3-bit binary encoding.
Use (−2)-base binary encoding.
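A ZDD-vector stores a polynomial with integer coefficients as one ZDD per binary digit of the coefficients, so the first step of the exercise is splitting the coefficients into digit layers. The helper below (hypothetical names, not part of the lecture material) computes those layers for both ordinary binary and base (−2) encoding; each layer then becomes one ZDD in the vector.

```python
poly = {  # term (frozenset of variables) -> coefficient
    frozenset("abc"): 1, frozenset("ab"): 3, frozenset("ac"): 2,
    frozenset("a"): 6, frozenset("bc"): 1, frozenset("b"): 3,
    frozenset("c"): 2, frozenset(): 6,
}

def digits(n, base):
    """Digits of n >= 0 (least significant first); base -2 also gives 0/1 digits."""
    out = []
    while n != 0:
        r = n % 2
        out.append(r)
        n = (n - r) // base
    return out or [0]

def layers(poly, base):
    """layer[i] = set of terms whose coefficient has digit 1 at position i."""
    out = {}
    for term, c in poly.items():
        for i, d in enumerate(digits(c, base)):
            if d:
                out.setdefault(i, set()).add(term)
    return out

bin_layers = layers(poly, 2)
# 3-bit binary: layer 0 holds the odd coefficients (1 and 3),
# layer 2 holds the coefficient-6 terms (a and the constant).
assert bin_layers[0] == {frozenset("abc"), frozenset("ab"),
                         frozenset("bc"), frozenset("b")}
assert bin_layers[2] == {frozenset("a"), frozenset()}

neg_layers = layers(poly, -2)
# In base -2 (MSB first): 2 = 110, 3 = 111, 6 = 11010.
assert digits(6, -2) == [0, 1, 0, 1, 1]
```

Reconstructing each coefficient as the weighted sum of its layers (2^i for binary, (−2)^i for the negabinary case) recovers the original polynomial, which is exactly the invariant the ZDD-vector relies on.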