Large-Scale Knowledge Processing (1st part, lecture-5)
Shin-ichi Minato
Division of Computer Science and Information Technology,Graduate School of Information Science and Technology,
Hokkaido University
2017.12.15 Large-Scale Knowledge Processing 2
Review of the last lecture
• BDD applications
  – VLSI design automation
    • Verification, logic optimization, test generation
  – Combinatorial problems, optimization
    • Knapsack problem, 8-queens, Traveling salesman, etc.
  – Comparison with problem-specific methods
• Representation of sets of combinations and ZDDs (Zero-suppressed BDDs)
  – Sets of combinations
  – ZDD reduction rule
  – Basic properties of ZDDs
BDD applications and ZDDs
Exercises
• Confirm that the BDDs’ logical OR operation algorithm can be applied to ZDDs as well, and that it corresponds to the Union operation on sets of combinations.
• Similarly, confirm that the BDDs’ logical AND operation algorithm can be applied to ZDDs as well, and that it corresponds to the Intersection operation on sets of combinations.
[Figure: ZDDs for {ab, acd, cd} (nodes N1–N4) and {ac, ad, bc, bd} (nodes N5–N8), and the recursive computation of {ab, acd, cd} ∪ {ac, ad, bc, bd}: the Union expands first on a (N7∪N8), then on b, c, d, down to terminal cases such as 0∪N4 = N4, N2∪1, and 1∪0 = 1.]
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
ZDDs for integer-valued functions
We developed the manipulation system not only for (Boolean) combinatorial itemsets, but also for “itemset histogram” data:
• Applications to database analysis and knowledge discovery.
• A convenient tool for research and development on various combinatorial problems.
Itemset histogram (integer-valued itemsets)
• A set of item combinations (sum-of-products), with a value (coefficient or weight) for each product term. We consider integer values only, so far.
• We do not consider higher-degree terms: 1 + 1 = 2, a + a = 2a, and 2×2 = 4, a×b = ab, but a×a = a. We consider only combinatorial item sets, not general polynomials.
• (Example) 5abc + 3ab + 2bc + c: a set of four product terms with values 5, 3, 2, 1. (We assume zero values for all other item combinations.)
• A basic model in various combinatorial problems.
Algebraic operations in (ordinary) ZDDs
φ, { λ }, and P.top can be executed in constant time. The other operations take almost linear time in the ZDD size.
Representation of itemset histograms using ZDDs
• An original ZDD distinguishes only the existence of each combination; it cannot count numbers.
• Two kinds of extended ZDDs have been proposed.
[Figure: two extended ZDDs for the histogram 5abc + 3ab + 2bc + c: a Multi-Terminal ZDD whose terminal nodes hold the values (5, 3, 2, …), and a ZDD-vector with three root pointers F2, F1, F0 sharing one ZDD.]
ZDD-vector for itemset histograms
We use a binary-encoded method with ZDD-vectors: encode the frequency numbers into m-bit binary code, and represent each bit of the combination set using a ZDD.
F0 = {abc, ab, c}, F1 = {ab, bc}, F2 = {abc}

  itemset | frequency | F2 F1 F0
  abc     | 5 (101)   |  1  0  1
  ab      | 3 (011)   |  0  1  1
  bc      | 2 (010)   |  0  1  0
  c       | 1 (001)   |  0  0  1
[Figure: the shared ZDD holding the three digit functions F2, F1, F0 of this histogram.]
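The bit-slicing itself is easy to sketch with plain Python sets standing in for the digit ZDDs. This is an illustrative model only; in the real system each Fi is a shared ZDD, and the function name `to_zdd_vector` is hypothetical:

```python
def to_zdd_vector(histogram):
    """Split an itemset histogram into bit-level sets F0, F1, ..., F(m-1).

    histogram: dict mapping frozenset-of-items -> positive frequency.
    Returns a list of sets; Fi holds the itemsets whose frequency has bit i set.
    """
    bits = max(histogram.values()).bit_length()
    return [{t for t, f in histogram.items() if (f >> i) & 1}
            for i in range(bits)]

# The slide's example: 5abc + 3ab + 2bc + c
H = {frozenset("abc"): 5, frozenset("ab"): 3,
     frozenset("bc"): 2, frozenset("c"): 1}
F = to_zdd_vector(H)
assert F[0] == {frozenset("abc"), frozenset("ab"), frozenset("c")}  # F0
assert F[1] == {frozenset("ab"), frozenset("bc")}                   # F1
assert F[2] == {frozenset("abc")}                                   # F2
```

Each digit set is exactly one of the slides' Fi functions; the ZDD representation additionally shares equal substructures among the digits.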
Negative numbers in ZDD-vectors
Conventional methods for negative numbers:
• 2’s-complement representation: many non-zero bits appear in the higher digits, e.g. (–1) = “1111111111”, which is unsuitable for ZDD reduction.
• Absolute value with a sign bit: addition becomes complicated.
We use binary coding based on (–2)^n: (1, –2, 4, –8, …).
(ex) (–12) = (–2)^5 + (–2)^4 + (–2)^2 = –32 + 16 + 4.
• This representation is unique, just like ordinary binary coding.
• The higher digits become zero both for positives and negatives, and the ZDD reduction rule effectively eliminates those meaningless higher digits. [Minato95]
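As a small sketch of this base-(–2) coding (a hypothetical helper, not VSOP’s actual code), the digits can be computed by repeated division by –2:

```python
def negabinary_digits(n):
    """Return the base-(-2) digits of integer n, least significant first.

    Each digit is 0 or 1, and n == sum(d * (-2)**i for i, d in enumerate(digits)).
    """
    digits = []
    while n != 0:
        r = n & 1              # remainder modulo 2 (0 or 1)
        digits.append(r)
        n = (n - r) // -2      # divide the rest by the base -2
    return digits

# (-12) = (-2)^5 + (-2)^4 + (-2)^2 = -32 + 16 + 4, as on the slide:
assert negabinary_digits(-12) == [0, 0, 1, 0, 1, 1]
# No sign bit and no infinite run of 1s: higher digits are zero for both signs.
assert negabinary_digits(3) == [1, 1, 1]   # 3 = 1 - 2 + 4
```

Because both positive and negative values terminate in zeros, the zero-suppression rule prunes the unused high digits of the ZDD-vector automatically.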
Combining ZDD-vectors into ZDDs
• Special item symbols are defined to combine a ZDD-vector into a single ZDD.
• 20 special symbols can deal with about 1,000,000 digits.
• An itemset-histogram data structure is then represented by a single 1-word pointer.
Itemset histograms by arithmetic operations
ZDDs grow as a result of applying arithmetic operations.
[Figure: inside the ZDD package, a “minus” operation on the items a and b yields (a – b), a “times” operation on the constant 3 and the item c yields 3c, and a further “times” yields 3ac – 3bc.]
Basic operations of the itemset-histogram algebra
P + Q : addition of the numbers of occurrences.
Multiplication by a monomial
• Multiplication by an item (F×v): attach v to each product term not containing v, by just applying the Offset and Change operations of ZDDs.
  (ex) (a b + 3 b c – c) × a → a b + 3 a b c – a c
• Multiplication by a constant integer (F×n): numerical multiplication of each term’s value.
  (ex) (a b + 3 b – c) × 5 → 5 a b + 15 b – 5 c
  If n is (–2)^k, just shift F by k digits; otherwise, decompose n into its (–2)^k digits and sum the partial products.
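A plain-dict model of multiplication by an item (not the ZDD Offset/Change implementation; `times_item` is a hypothetical name) makes the v·v = v rule explicit:

```python
def times_item(F, v):
    """F * v for an item v: attach v to every product term.  A term already
    containing v is unchanged (v*v = v); colliding coefficients are summed."""
    R = {}
    for term, c in F.items():
        t = term | {v}
        R[t] = R.get(t, 0) + c
    return {t: c for t, c in R.items() if c != 0}

# (ab + 3bc - c) * a  ->  ab + 3abc - ac, as in the slide's example
F = {frozenset("ab"): 1, frozenset("bc"): 3, frozenset("c"): -1}
assert times_item(F, "a") == {frozenset("ab"): 1,
                              frozenset("abc"): 3,
                              frozenset("ac"): -1}
```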
Addition of two polynomials
Addition of two polynomials (F + G): each term’s value is the sum of the values of the same item combination in F and G.
(ex) F = a b + 2 b c – 3 c, G = 3 a c – 2 b c + c → (F + G) = a b + 3 a c – 2 c
Considering the ZDD-vectors of F and G, the common product terms (F∩G) on each digit form the set of carries in (F + G):
• If (F∩G) is empty, (F + G) is just the union of F and G.
• Otherwise the common terms are doubled (carried) and added. Doubling (a 1-bit shift) may cause new common terms, so the addition is repeated until the common terms disappear.
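The carry-repeat idea can be sketched with lists of Python sets standing in for the digit ZDDs. For simplicity this sketch uses ordinary base 2 (an assumption; the real system uses base (–2), where doubling turns into a subtraction, as the next slide explains):

```python
def vector_add(F, G):
    """Add two digit-set vectors: digit i holds the terms whose coefficient
    has bit i set.  Symmetric difference is the carry-free sum; intersections
    are the carries, shifted one digit up and added again until they vanish."""
    n = max(len(F), len(G)) + 1
    F = F + [set()] * (n - len(F))
    G = G + [set()] * (n - len(G))
    while any(G):
        S = [f ^ g for f, g in zip(F, G)]            # sum without carries
        C = [set()] + [f & g for f, g in zip(F, G)]  # common terms -> carries
        F = S + [set()]
        G = C[:len(F)]
    return F

def value(F, term):
    return sum(2**i for i, d in enumerate(F) if term in d)

a = frozenset("a")
# 3a + 1a = 4a: F encodes 3 as binary 11, G encodes 1 as binary 1
assert value(vector_add([{a}, {a}], [{a}]), a) == 4
```

Note how the carry at one digit can create a new common term at the next digit, exactly the repetition the slide describes.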
Algorithm for addition & subtraction
A 1-bit shift is a (–2)× multiplication in our implementation, so (S + 2C) → (S – (–2)C): the addition operation calls subtraction. Similarly, (D – 2B) → (D + (–2)B): the subtraction operation calls addition. Addition and subtraction thus have a dual structure. The number of repetitions is bounded by the maximum run of carries in the addition/subtraction.
Multiplication of two polynomials
(F×G) is the sum over all pairs of product terms taken from F and G, respectively.
(ex) F = a b + 3 c, G = 4 a c – a → (F×G) = 4 a b c – a b + 9 a c
When v is the highest-ordered item in the ZDDs, we can get the sub-functions F = F0 + v F1 and G = G0 + v G1, and the multiplication is decomposed as:
(F×G) = (F0×G0) + v (F0×G1 + F1×G0 + F1×G1)
Each sub-operation is executed recursively; we use a hash-based cache to avoid duplicated operations. The computation time depends on the number of ZDD nodes.
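A brute-force dict model of the product (quadratic in the number of terms, unlike the recursive ZDD decomposition above; `multiply` is a hypothetical name) reproduces the slide’s example:

```python
def multiply(F, G):
    """F * G over itemset histograms: sum over all pairs of terms, with
    itemset union as the term product (since v*v = v for items)."""
    R = {}
    for t1, c1 in F.items():
        for t2, c2 in G.items():
            t = t1 | t2
            R[t] = R.get(t, 0) + c1 * c2
    return {t: c for t, c in R.items() if c != 0}

# F = ab + 3c, G = 4ac - a  ->  F*G = 4abc - ab + 9ac
# (the ac coefficient is 3*4 - 3 = 9, from c*4ac and c*(-a))
F = {frozenset("ab"): 1, frozenset("c"): 3}
G = {frozenset("ac"): 4, frozenset("a"): -1}
assert multiply(F, G) == {frozenset("abc"): 4,
                          frozenset("ab"): -1,
                          frozenset("ac"): 9}
```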
Division by a monomial
• Division by an item: the quotient (F / v) and remainder (F % v) are just the Onset and Offset operations of ZDDs.
  (ex) F = 3 a b c + a b – c + 1 → (F / c) = 3 a b – 1, (F % c) = a b + 1
  Computation time is linear in the ZDD size below v.
• Division by a constant integer: (F / n) and (F % n) apply integer division to each term’s value.
  (ex) F = 7 a b c + 5 b c + 3 → (F / 3) = 2 a b c + b c + 1, (F % 3) = a b c + 2 b c
  Uses the conventional division algorithm for binary numbers; computation time depends on the ZDD size and the bit length.
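Division by an item is just a partition of the terms, as this dict sketch shows (`div_item` is a hypothetical name; in the real system these are the Onset/Offset traversals):

```python
def div_item(F, v):
    """Quotient F / v and remainder F % v for an item v: the quotient
    collects the terms containing v (with v removed), the remainder the rest."""
    q = {t - {v}: c for t, c in F.items() if v in t}
    r = {t: c for t, c in F.items() if v not in t}
    return q, r

# F = 3abc + ab - c + 1  ->  F/c = 3ab - 1,  F%c = ab + 1
F = {frozenset("abc"): 3, frozenset("ab"): 1,
     frozenset("c"): -1, frozenset(): 1}
q, r = div_item(F, "c")
assert q == {frozenset("ab"): 3, frozenset(): -1}
assert r == {frozenset("ab"): 1, frozenset(): 1}
```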
Division by a polynomial
Division of two polynomials (F / G) is not unique, so we must define the quotient in our system. The weak-division method has been used in VLSI CAD: the quotient is the set of terms common to the quotients by the respective terms of G: (F/G1) ∩ (F/G2) ∩…∩ (F/Gm).
(ex) F = a b + a c + a d + b c + b d, G = a + b → (F/G) = (b + c + d) ∩ (a + c + d) = (c + d).
We propose a valued weak-division method: the quotient takes the minimum-absolute-value terms among the quotients by the respective terms of G, as MinAbs{(F/G1), (F/G2), …, (F/Gm)}.
(ex) F = 2 a b + 4 a c + a d – 2 b c + 3 b d, G = a + b → (F/G) = MinAbs{(2 b + 4 c + d), (2 a – 2 c + 3 d)} = (–2 c + d).
A natural extension of Boolean weak division, with an efficient recursive algorithm using a hash-based cache.
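The Boolean weak division above can be sketched directly over sets of terms (this covers only the Boolean variant, not the valued MinAbs extension; `weak_divide` is a hypothetical name):

```python
from functools import reduce

def weak_divide(F, G):
    """Boolean weak division: the quotient is the intersection of the
    quotients of F by each single term of the divisor G.
    F: set of terms (frozensets); G: set of divisor terms (frozensets)."""
    quotients = [{t - g for t in F if g <= t} for g in G]
    return reduce(set.intersection, quotients)

# F = ab + ac + ad + bc + bd, G = a + b  ->  F/G = c + d
F = {frozenset(t) for t in ("ab", "ac", "ad", "bc", "bd")}
G = {frozenset("a"), frozenset("b")}
assert weak_divide(F, G) == {frozenset("c"), frozenset("d")}
```

Here F/a = {b, c, d} and F/b = {a, c, d}; their intersection {c, d} is the slide’s quotient.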
Numerical comparison
Comparison operators: (F [==, !=, >, >=, <, <=] G) yield the set of item combinations, included in F or G, that satisfy the numerical relation.
(ex) F = 3 a b + 2 b c – c, G = 2 a b – 2 b + 3 c → (F > G) = a b + b c + b, (F != 0) = a b + b c + c
Almost the same computation time as subtraction. (F != 0) can be used to normalize all values to 1.
(ex) F – (F != 0) decrements the values of all existing terms.
Other operations
• If-Then-Else operation (F ? G : H): gives the value of G if a term is included in F, otherwise the value of H. Various non-linear functions can be obtained with this operation.
  (ex) ((F > 0) ? F : –F) → Abs(F), ((F > G) ? F : G) → Max(F, G)
• Restrict and Permit operations [Okuno98]:
  – F.Restrict(G) extracts the terms of F whose item combination is a superset of at least one combination in G.
  – F.Permit(G) extracts the terms of F whose item combination is a subset of at least one combination in G.
  Useful for solving constraint satisfaction problems.
Display formats of itemset histograms
• Polynomial expression, integer Karnaugh map, sorting by values, bitwise listing.
• Displaying statistical information: number of product terms, density of solutions, ZDD size.
• Displaying satisfiable solutions: any one term, max/min value of a term, max/min item cost of a term.
VSOP: interpreter for handling itemset histograms
• VSOP: a nickname from “Valued-Sum-Of-Products”. Written in C, C++, and yacc/lex, on Linux PCs.
• C-shell-like user interface: interactive keyboard input, or batch-style script-file input.
• Two types of symbols:
  – Item symbols (starting with a lower-case letter), up to 65,510.
  – Program variables (starting with an upper-case letter), used to store the result of an expression; no limit on their number.
• Numerical operations for integers, with no bit-width limit.
• No loops and branches are allowed; we may use another scripting language for unrolling, and then apply VSOP to the generated script file.
• No limit on the number of lines in a script file (pipelined execution). About 30 bytes per ZDD node (10G nodes in 512GB).
Basic performance
Using a PC (800MHz, 512MB, SuSE Linux 9):
• The ZDD for (100!) can be generated in 0.2 sec: 121 ZDD nodes (a decimal number of about 160 digits).
• Calculating the VSOP for (x1+1)(x2+2)(x3+3)…(xn+n): number of terms 2^n, maximum value n!.

   n | # Terms    | Max value          | # Nodes | Time (s)
  16 | 65,536     | 20,922,789,888,000 |   9,383 |   0.69
  20 | 1,048,576  | 2.43 × 10^18       |  76,705 |  14.40
  24 | 16,777,216 | 6.20 × 10^23       | 530,308 | 276.99
VSOP for Constraint Satisfaction Problems (CSPs)
• ZDD-based CSP solvers have been studied [Okuno98]: the N-queens problem, magic square problem, etc., with many applications to real-life problems.
• VSOP is also useful for solving CSPs: its numerical operations let constraints be described more compactly and simply.
  – Each product term ↔ a solution of the problem.
  – The values of terms ↔ the costs of solutions.
Probabilistic symbolic simulation
The VSOP algebra matches probabilistic calculation: a + a = 2a, a×a = a, a×b = ab, where a and b are probability variables in [0, 1].
• If two events occur independently, logical AND becomes the arithmetic product.
• If two events are caused by the same event, x·x does not become x², but just x.
VSOP can therefore be used for probabilistic analysis.
Probabilistic symbolic logic simulation
Computing full expressions of signal probabilities for given logic circuits:
• NOT gate: Y = 1 – A
• AND gate: Y = A·B
• OR gate: Y = A + B – A·B
• XOR gate: Y = A + B – 2·A·B
Useful not only for logic circuits but also for any other system analysis.
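A numeric illustration of the gate formulas (VSOP manipulates them symbolically; these helper names are hypothetical):

```python
def p_not(a):    return 1.0 - a
def p_and(a, b): return a * b             # independent inputs
def p_or(a, b):  return a + b - a * b
def p_xor(a, b): return a + b - 2 * a * b

# Signal probabilities for uniformly random inputs (p = 0.5):
assert p_and(0.5, 0.5) == 0.25
assert p_or(0.5, 0.5) == 0.75
assert p_xor(0.5, 0.5) == 0.5
# Note: for reconvergent fanout, Y = A AND A must stay A, not A^2;
# the a*a = a rule of the VSOP algebra handles this symbolically.
```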
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
Frequent item set (pattern) mining
A basic and well-known problem in database analysis: a useful method for finding interesting association rules in huge databases, such as transaction data, web contents, bioinformatics data, etc.

  RecordID | Tuple
  1        | a b c
  2        | a b
  3        | a b c
  4        | b c
  5        | a b
  6        | a b c
  7        | c
  8        | a b c
  9        | a b c
  10       | a b
  11       | b c

• Frequency threshold = 8 : { a, b, c }
• Frequency threshold = 7 : { ab, bc, a, b, c }
• Frequency threshold = 5 : { abc, ab, bc, ac, a, b, c }
• Frequency threshold = 10 : { b }
• Frequency threshold = 1 : { abc, ab, bc, ac, a, b, c }
"Pattern-histograms"
Here, a "pattern" means an item set seen in a tuple of transaction data; a tuple of n items includes 2^n patterns. Computing a pattern-histogram is therefore much harder than a tuple-histogram, and in many cases it is difficult to compute completely. Conventional methods extract only the frequent patterns (appearing more than α times).
(ex) Tuple-histogram:

  Tuple | Freq.
  abc   | 5
  ab    | 3
  bc    | 2
  c     | 1

Patterns per tuple: abc → {abc, ab, bc, ac, a, b, c, 1}; ab → {ab, a, b, 1}; bc → {bc, b, c, 1}; c → {c, 1}
Resulting pattern-histogram:

  Pattern | Freq.
  abc     | 5
  ab      | 8
  bc      | 7
  ac      | 5
  a       | 8
  b       | 10
  c       | 8
  1       | 11
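The expansion above can be brute-forced in a few lines, which also shows why it blows up: every tuple fans out into all 2^n sub-patterns. (A sketch with a hypothetical function name; this exponential enumeration is exactly what the ZDD-based methods below avoid.)

```python
from itertools import combinations

def pattern_histogram(tuple_hist):
    """Expand a tuple-histogram into the pattern-histogram: each tuple of
    n items contributes its frequency to all 2^n of its sub-patterns."""
    P = {}
    for t, f in tuple_hist.items():
        items = sorted(t)
        for k in range(len(items) + 1):
            for sub in combinations(items, k):
                s = frozenset(sub)
                P[s] = P.get(s, 0) + f
    return P

H = {frozenset("abc"): 5, frozenset("ab"): 3,
     frozenset("bc"): 2, frozenset("c"): 1}
P = pattern_histogram(H)
assert P[frozenset("ab")] == 8 and P[frozenset("b")] == 10
assert P[frozenset("ac")] == 5 and P[frozenset()] == 11
```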
Frequent pattern enumeration by ZDD-growth
When the pattern-histogram becomes too large, we can set the minimum support α to a larger number; fewer frequent patterns are then found, and the ZDD size becomes smaller. We have developed the “ZDD-growth” algorithm to generate the ZDD of a pattern-histogram directly from a given ZDD of the tuple-histogram.
(ex) From the tuple-histogram (abc: 5, ab: 3, bc: 2, c: 1), ZDD-growth with minimum threshold α = 7 yields the frequent patterns { ab, bc, a, b, c }.
[Figure: the ZDD F representing this tuple-histogram.]
Key point of the ZDD-growth algorithm
• Depth-first search on the ZDD-based tuple-histogram; the depth-first manner matches the ZDD’s recursive structure.
• The frequent pattern set Freq[H] can be computed as:
  Freq[H] = Freq[H.factor1(v)] * v + Freq[H.factor1(v) + H.factor0(v)]
• Freq[H] is computed recursively from the sub-problems for the two cofactors H.factor1(v) and H.factor0(v), in a depth-first manner.
• The cofactors do not include item v, so the recursion depth is bounded by the number of items.
• We use a hash-based cache technique to avoid duplicated operations at the shared nodes.
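The recursion can be sketched with plain dicts as the histograms (no node sharing and no cache, so this is only a model of the ZDD-growth recurrence; `freq_patterns` is a hypothetical name):

```python
def freq_patterns(H, items, alpha):
    """Frequent patterns of a tuple-histogram H (dict: frozenset -> count),
    following Freq[H] = Freq[H1] * v + Freq[H1 + H0]: a pattern containing v
    is frequent iff it is frequent in factor1(v); a pattern without v is
    counted in the sum of both cofactors."""
    if not items:
        return {frozenset()} if sum(H.values()) >= alpha else set()
    v, rest = items[0], items[1:]
    H1 = {t - {v}: c for t, c in H.items() if v in t}   # H.factor1(v)
    H0 = {t: c for t, c in H.items() if v not in t}     # H.factor0(v)
    Hs = dict(H0)
    for t, c in H1.items():
        Hs[t] = Hs.get(t, 0) + c                        # H1 + H0
    with_v = {p | {v} for p in freq_patterns(H1, rest, alpha)}
    return with_v | freq_patterns(Hs, rest, alpha)

H = {frozenset("abc"): 5, frozenset("ab"): 3,
     frozenset("bc"): 2, frozenset("c"): 1}
result = freq_patterns(H, ["a", "b", "c"], 7) - {frozenset()}
assert result == {frozenset(s) for s in ("ab", "bc", "a", "b", "c")}
```

The final result matches the slide’s α = 7 example (the empty pattern is dropped, as in the slides).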
Algebra on tuple-histograms
Primitive operations:
• Factoring into two parts by an item.
• Attaching an item.
• Sum of two histograms.
• Counting the lines in the table.

  H:                  H1 = H.factor1(a):   H0 = H.factor0(a):
  Tuple | Freq.       Tuple | Freq.        Tuple | Freq.
  a b c | 5           b c   | 5            b c   | 2
  a b   | 3           b     | 3            c     | 1
  b c   | 2
  c     | 1

  H * d:              H1 + H0:
  Tuple   | Freq.     Tuple | Freq.
  a b c d | 5         b c   | 7
  a b d   | 3         b     | 3
  b c d   | 2         c     | 1
  c d     | 1

– Each table is compactly represented by a ZDD.
– The ZDDs share nodes with each other in memory.
ZDD-growth algorithm
[Pseudocode figure; annotations: get frequent patterns only; check the cache to avoid duplicated calls; compute the frequent item sets with v and those without v; enter the result into the cache.]
Experimental results: selected from the FIMI2003 benchmark datasets.
Properties of the ZDD-growth method
• For a large-scale database, where building the full pattern-histogram is hard, ZDD-growth still generates the frequent patterns if we set some minimum frequency α.
• It outputs a huge number of frequent patterns represented by one compressed ZDD.
• For some examples it is much (exponentially) faster than previous methods: for “BMS-WebView-1”, we can generate 30 trillion patterns in feasible time and space.
• It is not good for “T10I4D100K”, which is randomly generated data (in principle, random data cannot be compressed).
Closed item sets
A subset of item-set patterns, each of which is the unique representative of a sub-group of patterns having the same occurrence in the database.
• “Common item set” Com(ST): for a given set of tuples ST, Com(ST) is the maximal set of items commonly included in all tuples T ∈ ST.
• “Occurrence” Occ(D, X): for a given database D and item set X, Occ(D, X) is the subset of tuples in D, each of which includes X.
• If an item set X satisfies Com(Occ(D, X)) = X, we call X a “closed item set” in D.
The closed sets are a kind of compressed representation of all frequent item-set patterns.
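The definitions translate almost verbatim into a brute-force check (only a definition checker; the algorithms in this lecture enumerate closed sets far more efficiently):

```python
def occ(D, X):
    """Occ(D, X): the tuples of database D (a list of frozensets) that include X."""
    return [T for T in D if X <= T]

def com(tuples):
    """Com(ST): the maximal item set commonly included in all tuples of ST."""
    return frozenset.intersection(*tuples) if tuples else frozenset()

def is_closed(D, X):
    return com(occ(D, X)) == X

# The 11-record example database (abc x5, ab x3, bc x2, c x1):
D = [frozenset("abc")] * 5 + [frozenset("ab")] * 3 \
    + [frozenset("bc")] * 2 + [frozenset("c")]
assert is_closed(D, frozenset("ab"))      # Com(Occ(D, ab)) = ab
assert not is_closed(D, frozenset("a"))   # same occurrence as "ab"
assert not is_closed(D, frozenset("ac"))  # same occurrence as "abc"
```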
Example of closed item sets

  RecordID | Tuple
  1        | a b c
  2        | a b
  3        | a b c
  4        | b c
  5        | a b
  6        | a b c
  7        | c
  8        | a b c
  9        | a b c
  10       | a b
  11       | b c

All patterns appearing at least once: {abc, ab, ac, a, bc, b, c}
Closed item-set patterns: {abc, ab, bc, b, c}
– “ac” is deleted because its occurrence is the same as that of “abc”.
– “a” is deleted because its occurrence is the same as that of “ab”.
ZDD-growthC algorithm
[Pseudocode figure: a quite simple modification of ZDD-growth.]
Conclusion on the ZDD-growth method
• ZDD-growth-M and ZDD-growth-C are quite simple modifications of ZDD-growth that filter the maximal/closed patterns from all frequent patterns; the additional computation cost is relatively small.
• In general, closed item sets are already reduced forms, and thus the impact of ZDD compression is not remarkable. If our final goal is enumeration only (no need to store), the LCM algorithm [Uno2003] would be much better.
• Our method can also construct an efficient index of the item sets in main memory, to be used for various data analyses via ZDD-based algebraic set operations.
• Combining ZDDs and LCM would be interesting: we started a project to develop the “LCM-ZDD” method.
“LCM over ZDDs” [Minato et al. 2008]
• LCM [Uno2003]: an output-linear-time algorithm for frequent itemset mining.
• ZDD [Minato93]: a compact graph-based representation for large-scale sets of combinations.
Combining the two techniques generates large-scale frequent itemsets in main memory with a very small overhead over the original LCM (sub-linear time and space in the number of solutions when ZDD compression works well).
Naïve implementation of LCM-ZDD
[Figure: pseudocode of the original LCM side by side with a naïve “LCM-ZDD”.]
This naïve method generates the correct ZDD, but is not very effective.
Problem with the naïve implementation
The ZDD grows by repeated union operations over the solutions found in the LCM depth-first search. The consecutive solutions are quite similar to each other in most cases: only a few bottom levels differ. Yet each union operation requires O(n) steps while only a very small (almost constant) part is meaningful, so the naïve algorithm may become n times slower.
Improved implementation of LCM-ZDD
At each recursive step, we construct a ZDD only for the meaningful part; after returning from the subsidiary recursive call, we put the top item onto the current result ZDD. This avoids redundant traversals in the ZDD union operation, so the overhead factor becomes a constant.
Acceleration by hypercube decomposition [Uno2003]
Hypercube H(P): the set of items e satisfying e > tail(P) and Occ(P) = Occ(P ∪ {e}). The original LCM algorithm avoids duplicated backtracking with respect to the items included in H(P), enabling fast counting of solutions.
Our implementation and experiments
• We implemented LCM-ZDD by modifying LCM ver. 5 (open software): only 50 lines of modification in the LCM code, together with our own ZDD package (2,300 lines of C/C++).
• Experiments on the FIMI2003 benchmark datasets:
  – mushroom: ZDD compression is very effective.
  – T10I4D100K: ZDD compression is ineffective.
  – BMS-WebView-1: intermediate behavior.
• Comparison with the original LCM and ZDD-growth:
  – LCM-count: only counts the number of solutions.
  – LCM-dump: prints all solutions into a file.
Effects of LCM-ZDD
• LCM-ZDD is clearly more efficient than ZDD-growth in most cases: ZDD-growth shows comparable performance only for “mushroom” with very low minimum support; in all other cases our method overwhelms ZDD-growth.
• LCM-ZDD is much faster than LCM-dump: the original LCM-dump is known as an output-linear-time algorithm, but our LCM-ZDD required time sub-linear in the number of itemsets in this experiment.
• Our method requires almost the same time as LCM-count, yet LCM-ZDD stores all the solutions.
Post-processing of the generated patterns
A huge number of patterns can be stored in main memory: knowledge indexing based on ZDDs. The result can then be analyzed flexibly using algebraic set operations:
• Sub-pattern matching over the frequent patterns.
• Extracting long/short frequent patterns.
• Comparing two sets of frequent patterns.
• Calculating statistical data (e.g. confidence, support).
• Finding disjoint sub-factors in the frequent patterns.
A useful means of inductive database analysis.
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
ZDD variable ordering for itemset mining
• In general, BDD size greatly depends on the variable order: an exponential effect is observed for some Boolean functions, and exact optimization has been shown NP-complete [Tani93].
• ZDDs of frequent itemsets also depend on the variable ordering. We proposed a heuristic ordering method [Iwasaki2007] based on structural information of the database.
• We show instances of databases whose ZDD sizes are exponentially sensitive to the variable ordering [Minato2007]; such theoretical results are useful for developing good heuristic methods.
Variable ordering of (ordinary) BDDs
In the VLSI CAD area, many techniques have been developed for finding good variable orders for ordinary BDDs. Two empirical rules for a good ordering:
• Local computability: pairs of inputs closely related to each other should be kept in close positions in the ordering.
• Output controllability: inputs having strong controllability over the output should be placed at higher positions in the ordering.
ZDDs also have similar properties.
Consideration of ZDDs for frequent itemsets
For a simple discussion, we first assume minimum frequency = 1.

  RecordID | Tuple
  1        | a b c
  2        | a b
  3        | a b c
  4        | b c

Sub-patterns extracted from “abc”: { abc, ab, ac, a, bc, b, c, 1 }
Sub-patterns extracted from “bc”: { bc, b, c, 1 }
Item “a” controls the difference in patterns between the two tuples “abc” and “bc.” Namely, missing items have the stronger controllability in ZDDs for frequent itemsets. This is the completely opposite effect from ordinary BDDs for logic expressions, where missing variables have less controllability.
An instance dominated by local computability
Based on the above consideration, we made an artificial database, “one-pair missing”: n records over (2n – 2) items, where record i is missing only the pair (ai, bi).
(a1 b1) missing in record 1, (a2 b2) missing in record 2, …, (an bn) missing in record n.
Effect of variable ordering on “one-pair missing”
An exponential effect is observed.
Structures of ZDDs for “one-pair missing”
[Figure: with the item pairs interleaved (a1 b1 a2 b2 … a8 b8) the ZDD size is O(n); with all a items ordered above all b items it grows to O(2^n).]
An instance dominated by output controllability
A database corresponding to the 8-bit data-selector function: n records over n x-items and 2·log2(n) y-items.
• y-items (y0 y1)(y2 y3)(y4 y5): either item of each pair appears, representing the 3-digit binary number k.
• x-items: xk is missing.
Effect of variable ordering on “data-selector”
An exponential effect is also observed between the two reversed variable orderings.
Structures of ZDDs for “data-selector”
[Figure: over the n x-items and the 2 log n y-items, one variable order gives size O(n·3^(log n)) ≈ O(n^2.7), while the reversed order gives O(2^n).]
A case where no good variable ordering exists
Consider N (= n + 2 log2 n) copies of the “data-selector” dataset, with rotated variable orders. Then we cannot avoid a bad variable order in at least one of the blocks, which makes that block’s ZDD exponential. The input (database) size is still polynomial, O(N^3), yet this database requires an exponential-size ZDD under any variable ordering.
Topics of this lecture
• ZDD-vector and “Itemset-histogram algebra”
  – ZDD representation for polynomials with integer coefficients
  – Operation algorithms for itemset-histogram data
  – VSOP: interpreter for manipulating itemset-histogram data
• Frequent itemset mining using ZDDs
  – ZDD-growth: ZDD-based frequent itemset mining method
  – Algorithm for generating maximal/closed itemsets
  – LCM-ZDD: fast method for generating frequent itemsets
• ZDD variable ordering in database analysis
  – Mining internal structures of combinatorial itemset data
• Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
Decomposition of itemset data
We propose a new method of finding “hidden structures” in huge and complicated itemset data: finding decompositions of a “set of combinations”.
(ex) The set
{ x39 x86 x85 x34, x39 x86 x85, x39 x86 x34, x39 x86, x39 x85 x34, x39 x85, x39 x34, x39, x90 x86 x85 x36 x34, x90 x86 x85 x36, x90 x86 x85 x34, x90 x86 x85, x90 x86 x36 x34, x90 x86 x36, x90 x86 x34, x90 x86, x90 x85 x36 x34, x90 x85 x36, x90 x85 x34, x90 x85, x90 x36 x34, x90 x36, x90 x34, x90, x86 x85 x36 x34, x86 x85 x36, x86 x85 x34, x86 x85, x86 x36 x34, x86 x36, x86 x34, x86, x85 x59, x85 x36 x34, x85 x36, x85 x34, x85, x59, x36 x34, x36, x34, 1 }
decomposes as (x85 + 1) (x59 + (x34+1) (x86+1) (x39 + (x36 + 1) (x90 + 1) ) ).
Simple disjoint decomposition
A basic and useful concept in logic design theory, defined in Boolean function algebra: f(X, Y) = g(h(X), Y).
[Figure: the inputs X = x1 x2 … x|X| feed a block h(X); its single output s, together with the inputs Y = y1 y2 … y|Y|, feeds a block g(s, Y) computing f(X, Y).]
– Only one connection between the two blocks (“simple”).
– No common inputs in the two blocks (“disjoint”).
We will apply this concept to data mining.
Simple disjoint decompositions of Boolean functions
[Figure: f = a b + c d over the inputs a, b, c, d; and f = a b + b c, which decomposes as b (a + c).]
A decomposition does not always exist in a function, and multiple decompositions may exist in one function.
Multiple decompositions in one function
(ex) (a + b)(c + d) + e f has four decompositions, with a nesting structure.
[Figure: decomposition blocks over the inputs a, b, c, d and e, f.]
Associative logic operations may produce an exponential number of sub-decompositions, e.g. a b c = (a b) c = a (b c) = (a c) b.
Decomposition tree
• Represents the structure of the decompositions; all simple disjoint decompositions are included in one graph.
• Special labels are used for the decompositions caused by associative logic operations.
(ex) f = a b c x + x y z = s x + t x, where s = AND(a, b, c) and t = AND(y, z).
[Figure: the decomposition tree, with an AND(a, b, c) node over the leaves a, b, c and an AND(y, z) node over the leaves y, z.]
Goal: finding all simple disjoint decompositions by generating the decomposition tree for a given function.
BDD-based fast decomposition algorithm
• Methods for simple disjoint decomposition have been studied for a long time [Ashenhurst57, Roth62].
• A fast algorithm was proposed [Bertacco97] based on BDD (Binary Decision Diagram) techniques: it efficiently generates a decomposition tree including all simple disjoint decompositions, using fast equivalence checking on BDDs and a hash table to avoid duplicated computation. The time complexity is almost square in the BDD size, which is remarkably fast whenever the BDD size is feasible.
• We modified the original BDD-based algorithm to handle itemset data with Zero-suppressed BDDs.
Simple disjoint decomposition on “sets of combinations”
[Figure: block diagram of f(X, Y) = g(h(X), Y), as before, now over sets of combinations.]
For example, f(a, b, c, d) = {ac, abc, acd, abcd, cd, d} has a simple disjoint decomposition: s = h(a, b) = {a, ab} and g(s, c, d) = {sc, scd, cd, d}.
Here h(X) is an independent factor of f(X, Y).
Properties of simple disjoint decomposition on sets of combinations
• One set of combinations can include multiple decompositions: for example, {abcx, abcy, abcz, xy, xz} includes five decompositions, h(X) = {ab}, {bc}, {ac}, {abc}, and {y, z}.
• Decompositions group with associative operations: “AND” (product of sets) and “OR” (union of sets).
• A decomposition tree can be generated just as in the original decomposition algorithm.
(ex) {abcx, abcy, abcz, xy, xz} = {sx, st, tx}, where s = AND(a, b, c) and t = OR(y, z).
[Figure: the corresponding decomposition tree.]
2017.12.15 Large-Scale Knowledge Processing 74
Observation in decomposition algorithm
If F(X, Y) includes a simple disjoint decomposition P(X), then:
F(X, Y) = P(X)・Q(Y) + R(Y).
F can be expanded by an item variable v as F = v・F1 + F0, where F1 and F0 do not depend on v. We call F1 and F0 the cofactors of F with respect to v.
Since X and Y are disjoint, either v∈X or v∈Y.
When v∈X (write X’ = X − {v}, so P(X) = v・P1(X’) + P0(X’)):
F1(X’, Y) = P1(X’)・Q(Y), F0(X’, Y) = P0(X’)・Q(Y) + R(Y).
When v∈Y (write Y’ = Y − {v}):
F1(X, Y’) = P(X)・Q1(Y’) + R1(Y’), F0(X, Y’) = P(X)・Q0(Y’) + R0(Y’)
→ P(X) is commonly included in the decompositions of both F1 and F0.
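These cofactor identities can be checked on the running example f = {ac, abc, acd, abcd, cd, d}. The sketch below (illustrative names, not the lecture's code) splits a set of combinations by an item v and rebuilds it as F = v・F1 + F0.

```python
def product(f, g):
    # product of two sets of combinations
    return {a | b for a in f for b in g}

def cofactors(F, v):
    """F1 = combinations containing v (with v removed); F0 = the rest."""
    F1 = {c - {v} for c in F if v in c}
    F0 = {c for c in F if v not in c}
    return F1, F0

def compose(F1, F0, v):
    # rebuild F = v*F1 + F0
    return product({frozenset({v})}, F1) | F0

F = {frozenset(s) for s in ("ac", "abc", "acd", "abcd", "cd", "d")}

# Expand by v = 'a', an item of the bound set X = {a, b}.
# Here P(a,b) = {a, ab}, so P1 = {1, b}, P0 = {}, Q = {c, cd}, R = {cd, d}:
# the slide predicts F1 = P1*Q and F0 = P0*Q + R = R.
F1, F0 = cofactors(F, "a")
assert F1 == {frozenset(s) for s in ("c", "cd", "bc", "bcd")}  # P1*Q
assert F0 == {frozenset(s) for s in ("cd", "d")}               # R
assert compose(F1, F0, "a") == F
```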
2017.12.15 Large-Scale Knowledge Processing 75
ZDD-based recursive algorithm
A recursive algorithm similar to the original BDD-based one, but the modification from BDD to ZDD is not trivial because of the different semantics and algebra.
Main theorem: when “f0” and “f1” are the child nodes of “f” in the ZDD, the decomposition tree for “f” can be generated by “merging” the two decomposition trees for “f0” and “f1”.
This gives a fast recursive algorithm. Time complexity is roughly quadratic in the ZDD size; remarkably fast when the ZDD size is feasible.
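To illustrate this recursive, memoized style of ZDD algorithms (a toy sketch, not the lecture's BDD package), here is a minimal ZDD with hash-consed nodes and the Union operation from the earlier exercise; the decomposition algorithm follows the same pattern of one cached recursion per node pair.

```python
# Nodes are hash-consed (var, lo, hi) tuples; terminals are "0" and "1".
ZERO, ONE = "0", "1"
unique = {}   # unique table: one node object per (var, lo, hi)
memo = {}     # operation cache for union

def getnode(var, lo, hi):
    if hi == ZERO:                 # zero-suppression rule
        return lo
    return unique.setdefault((var, lo, hi), (var, lo, hi))

def union(f, g):
    if f == ZERO: return g
    if g == ZERO: return f
    if f == g:    return f
    if (f, g) in memo:
        return memo[(f, g)]
    if f == ONE:                   # merge the empty combination into g's lo branch
        r = getnode(g[0], union(ONE, g[1]), g[2])
    elif g == ONE:
        r = getnode(f[0], union(f[1], ONE), f[2])
    elif f[0] == g[0]:
        r = getnode(f[0], union(f[1], g[1]), union(f[2], g[2]))
    elif f[0] < g[0]:              # smaller variable name = closer to root
        r = getnode(f[0], union(f[1], g), f[2])
    else:
        r = getnode(g[0], union(f, g[1]), g[2])
    memo[(f, g)] = r
    return r

def zdd_of(combinations):
    """Build a ZDD from strings like "acd" (one letter per item)."""
    f = ZERO
    for comb in combinations:
        g = ONE
        for v in sorted(comb, reverse=True):
            g = getnode(v, ZERO, g)
        f = union(f, g)
    return f

def count(f):
    """Number of combinations represented."""
    if f == ZERO: return 0
    if f == ONE:  return 1
    return count(f[1]) + count(f[2])

# The union example from the earlier slides:
A = zdd_of(["ab", "acd", "cd"])
B = zdd_of(["ac", "ad", "bc", "bd"])
U = union(A, B)
assert count(U) == 7
assert U == zdd_of(["ab", "acd", "cd", "ac", "ad", "bc", "bd"])  # canonicity
```

Because results are cached in `memo`, each pair of nodes is processed at most once, which is where the near-quadratic complexity bound comes from.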
2017.12.15 Large-Scale Knowledge Processing 76
Implementation and experiments
The program is implemented in C/C++ on our BDD package (which manipulates up to 10,000,000 ZDD nodes); about 1,600 lines for the decomposition algorithm. Machine: Pentium-4, 800MHz, 512MB memory, SuSE Linux 9.
Experiments on benchmark databases [FIMI03]:
“mushroom” – 119 items, 8,124 records.
“T10I4D100K” (first 1,000 lines) – 795 items, 1,000 records.
First we generate frequent itemset data with various frequency thresholds, and then apply our decomposition algorithm to each itemset data.
2017.12.15 Large-Scale Knowledge Processing 77
Experimental results
2017.12.15 Large-Scale Knowledge Processing 78
Result for “mushroom” with threshold = 5,000
Itemset list, included in at least 5,000 records in the database:
{ x39 x86 x85 x34, x39 x86 x85, x39 x86 x34, x39 x86, x39 x85 x34, x39 x85, x39 x34, x39, x90 x86 x85 x36 x34, x90 x86 x85 x36, x90 x86 x85 x34, x90 x86 x85, x90 x86 x36 x34, x90 x86 x36, x90 x86 x34, x90 x86, x90 x85 x36 x34, x90 x85 x36, x90 x85 x34, x90 x85, x90 x36 x34, x90 x36, x90 x34, x90, x86 x85 x36 x34, x86 x85 x36, x86 x85 x34, x86 x85, x86 x36 x34, x86 x36, x86 x34, x86, x85 x59, x85 x36 x34, x85 x36, x85 x34, x85, x59, x36 x34, x36, x34, 1 }
Factored form: (x85 + 1) (x59 + (x34+1) (x86+1) (x39 + (x36 + 1) (x90 + 1) ) )
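As a check (a Python sketch, not part of the lecture), expanding the factored form should reproduce the listed frequent itemsets: 42 of them, counting the final "1" (the empty itemset).

```python
def product(f, g):
    # product of two sets of combinations
    return {a | b for a in f for b in g}

def single(n):     # the bare variable x
    return {frozenset({n})}

def plus1(n):      # the factor (x + 1)
    return {frozenset({n}), frozenset()}

# (x85 + 1) (x59 + (x34+1) (x86+1) (x39 + (x36+1) (x90+1)))
inner = single("x39") | product(plus1("x36"), plus1("x90"))
mid   = product(plus1("x34"), product(plus1("x86"), inner))
expr  = product(plus1("x85"), single("x59") | mid)

assert len(expr) == 42                      # matches the listed itemsets
assert frozenset({"x85", "x59"}) in expr    # "x85 x59" from the list
assert frozenset() in expr                  # the trailing "1"
```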
2017.12.15 Large-Scale Knowledge Processing 79
Results for “mushroom” with various thresholds
(#Patterns for the various thresholds: 18,094,822; 1,442,504; 123,278; 6,624)
2017.12.15 Large-Scale Knowledge Processing 80
Concluding remarks
Recent topics on ZDD-based data mining & knowledge discovery were presented.
ZDDs automatically compress data consisting of a large number of combinatorial items and store it in main memory. Various set operations can be executed without decompression.
ZDD manipulation is an in-memory algorithm: it aborts when memory overflows.
In the VLSI design area, we could divide a problem into small sub-parts so as to fit in memory; this was used successfully in the 1990s.
In the database area, the original data is usually placed in external storage, so ZDD-based techniques were not popular in the 1990s.
After 2000, PC memory sizes grew rapidly, and in-memory computation has come to be used more widely in data mining and knowledge processing. BDD/ZDD-based techniques will be utilized more.
2017.12.15 Large-Scale Knowledge Processing 81
Summary
ZDD-vector and “Itemset-histogram algebra” ZDD-representation for polynomials with integer coefficients Operation algorithms for itemset-histogram data VSOP: Interpreter for manipulating itemset-histogram data
Frequent itemset mining using ZDDs ZDD-growth: ZDD-based frequent itemset mining method Algorithm for generating maximal/closed itemsets LCM-ZDD: Fast method for generating frequent itemsets.
ZDD variable ordering in database analysis Mining internal structures of combinatorial itemset data
Simple disjunctive decomposition of Boolean functions
ZDD applications (1): Database analysis
2017.12.15 Large-Scale Knowledge Processing 82
Exercises
Draw a ZDD-vector representing the polynomial a b c + 3 a b + 2 a c + 6 a + b c + 3 b + 2 c + 6, which is equivalent to (a+1)(b+2)(c+3).
Use ordinary 3-bit binary encoding.
Use (−2)-base binary encoding.
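A ZDD-vector stores a polynomial with integer coefficients as one ZDD per binary digit of the coefficients, so the first step of the exercise is splitting the coefficients into digit layers. The helper below (hypothetical names, not part of the lecture material) computes those layers for both ordinary binary and base (−2) encoding; each layer then becomes one ZDD in the vector.

```python
poly = {  # term (frozenset of variables) -> coefficient
    frozenset("abc"): 1, frozenset("ab"): 3, frozenset("ac"): 2,
    frozenset("a"): 6, frozenset("bc"): 1, frozenset("b"): 3,
    frozenset("c"): 2, frozenset(): 6,
}

def digits(n, base):
    """Digits of n >= 0 (least significant first); base -2 also gives 0/1 digits."""
    out = []
    while n != 0:
        r = n % 2
        out.append(r)
        n = (n - r) // base
    return out or [0]

def layers(poly, base):
    """layer[i] = set of terms whose coefficient has digit 1 at position i."""
    out = {}
    for term, c in poly.items():
        for i, d in enumerate(digits(c, base)):
            if d:
                out.setdefault(i, set()).add(term)
    return out

bin_layers = layers(poly, 2)
# 3-bit binary: layer 0 holds the odd coefficients (1 and 3),
# layer 2 holds the coefficient-6 terms (a and the constant).
assert bin_layers[0] == {frozenset("abc"), frozenset("ab"),
                         frozenset("bc"), frozenset("b")}
assert bin_layers[2] == {frozenset("a"), frozenset()}

neg_layers = layers(poly, -2)
# In base -2 (MSB first): 2 = 110, 3 = 111, 6 = 11010.
assert digits(6, -2) == [0, 1, 0, 1, 1]
```

Reconstructing each coefficient as the weighted sum of its layers (2^i for binary, (−2)^i for the negabinary case) recovers the original polynomial, which is exactly the invariant the ZDD-vector relies on.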