Distributed Decision Tree Learning for Mining Big Data Streams
Mining High-Speed Data Streams
Transcript of Mining High-Speed Data Streams
![Page 1: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/1.jpg)
MINING HIGH-SPEED DATA STREAMS
Presented by:
Yumou Wang
Dongyun Zhang
Hao Zhou
![Page 2: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/2.jpg)
INTRODUCTION
The world’s information is doubling every two years.
Between 2006 and 2011, a span of just five years, the amount of information grew by a factor of 9.
![Page 3: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/3.jpg)
INTRODUCTION
By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers".
However, the IT staff available to manage it will grow by less than 1.5 times.
Current algorithms can only deal with small amounts of data, less than a single day’s data for many applications.
Examples: banks and telecommunication companies.
![Page 4: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/4.jpg)
INTRODUCTION
Problems:
When new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses.
Dealing with these huge amounts of data in a responsible way is therefore very important today.
Mining these continuous data streams brings unique opportunities, but also new challenges.
![Page 5: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/5.jpg)
BACKGROUND
Design criteria for mining high-speed data streams:
It must be able to build a model using at most one scan of the data.
It must use only a fixed amount of main memory.
It must require only a small constant time per record.
![Page 6: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/6.jpg)
BACKGROUND
Usually, a KDD system is used to process these examples as they arrive.
Shortcomings:
The models learned by some systems are highly sensitive to example ordering compared to the batch model.
Others can produce the same model as the batch version, but much more slowly.
![Page 7: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/7.jpg)
CLASSIFICATION METHOD
Input: Examples of the form (x, y), where y is the class label and x is the vector of attributes.
Output: A model y = f(x) that predicts the classes y of future examples x with high accuracy.
![Page 8: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/8.jpg)
DECISION TREE
One of the most effective and widely used classification methods.
A decision tree is a decision support tool that uses a tree-like graph or model.
Decision trees are commonly used in machine learning.
![Page 9: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/9.jpg)
BUILDING A DECISION TREE
1. Start at the root.
2. Test all the attributes and choose the best one according to some heuristic measure.
3. Split the node into branches and leaves.
4. Recursively replace leaves with test nodes.
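Step 2 can be sketched as follows. This is only an illustration, assuming information gain as the heuristic measure and examples stored as (attribute-dict, label) pairs; the function names are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Entropy reduction obtained by splitting the examples on one attribute."""
    labels = [y for _, y in examples]
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def best_attribute(examples, attrs):
    """Pick the attribute with the highest heuristic score (step 2)."""
    return max(attrs, key=lambda a: information_gain(examples, a))

# Toy data: attribute "a" determines the class, "b" carries no information.
toy = [({"a": 0, "b": 0}, "no"), ({"a": 0, "b": 1}, "no"),
       ({"a": 1, "b": 0}, "yes"), ({"a": 1, "b": 1}, "yes")]
print(best_attribute(toy, ["a", "b"]))
```

Note that this batch version needs all examples in memory at once, which is exactly the limitation the following slides address.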
![Page 10: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/10.jpg)
EXAMPLE OF DECISION TREE
![Page 11: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/11.jpg)
EXAMPLE OF DECISION TREE
![Page 12: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/12.jpg)
PROBLEMS
Traditional decision tree learners have some problems.
Some of them assume that all training examples can be stored simultaneously in main memory.
Disadvantage: this limits the number of examples that can be learned from.
Disk-based decision tree learners keep the examples on disk and read them repeatedly.
Disadvantage: this is expensive when learning complex trees.
![Page 13: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/13.jpg)
HOEFFDING TREES
Designed for extremely large datasets.
Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through the node.
Key question: how many examples are sufficient?
![Page 14: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/14.jpg)
HOEFFDING BOUND
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
Definition: The Hoeffding bound is the statistical result used to decide how many examples n each node needs.
Assume:
R is the range of the variable r
n independent observations of r have been made
r’ is their observed mean
Then, with probability 1-δ, the true mean of r is at least r’-ε.
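As a quick numerical illustration, the bound ε = sqrt(R² ln(1/δ) / (2n)) can be evaluated directly. The function name and the sample values of R, δ and n below are assumptions chosen for the example, not figures from the slides.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that, with probability 1 - delta, the true mean is
    at least (observed mean - epsilon) after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Illustrative values: R = 1, delta = 1e-7, growing n.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```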
![Page 15: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/15.jpg)
HOEFFDING BOUND
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
ε is a decreasing function of n: the larger n is, the smaller ε becomes.
ε bounds the difference between the true mean and the observed mean of r.
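Because ε shrinks as n grows, the bound can also be solved for n: n = R² ln(1/δ) / (2ε²). The sketch below does that inversion; the function name and the example values are assumptions for illustration only.

```python
import math

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Halving epsilon roughly quadruples the number of examples required.
print(examples_needed(1.0, 1e-7, 0.10))
print(examples_needed(1.0, 1e-7, 0.05))
```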
![Page 16: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/16.jpg)
HOEFFDING TREE ALGORITHM
![Page 17: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/17.jpg)
HOEFFDING TREE ALGORITHM
Inputs:
S -> a sequence of examples
X -> a set of discrete attributes
G(.) -> a split evaluation function
δ -> one minus the desired probability of choosing the correct attribute at any given node
Outputs:
HT -> a decision tree
![Page 18: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/18.jpg)
HOEFFDING TREE ALGORITHM
Goal: Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinite examples.
After seeing n examples, let Xa be the attribute with the highest observed G’ and Xb the attribute with the second-highest.
Let ΔG’ = G’(Xa) – G’(Xb).
If ΔG’ > ε, then Xa is the correct choice with probability 1-δ.
Thus a node needs to accumulate examples from the stream until ε becomes smaller than ΔG’.
![Page 19: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/19.jpg)
HOEFFDING TREE ALGORITHM
The algorithm constructs the tree using the same procedure as ID3.
It calculates the information gain of the attributes and determines the best one.
At each node it checks the condition ΔG > ε.
If the condition is satisfied, it creates child nodes based on the test at the node.
If not, it streams in more training examples and repeats the calculation until the condition is satisfied.
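The check at a single leaf can be sketched as below. This is a simplified illustration, not the full algorithm: the class name StreamLeaf is hypothetical, G is assumed to be information gain, and its range R is taken as log2(number of classes). Only counts are stored, never the examples themselves.

```python
import math
from collections import defaultdict

class StreamLeaf:
    """One leaf that accumulates counts until a split is statistically safe."""

    def __init__(self, attributes, n_classes, delta):
        self.attributes, self.n_classes, self.delta = attributes, n_classes, delta
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[attribute][value][label] -> number of examples seen
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}

    def update(self, x, y):
        """Absorb one example from the stream (read once, then discarded)."""
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[a][x[a]][y] += 1

    @staticmethod
    def _entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def gain(self, a):
        """Information gain of attribute a, computed from the counts alone."""
        remainder = sum(
            sum(by_label.values()) / self.n * self._entropy(list(by_label.values()))
            for by_label in self.counts[a].values())
        return self._entropy(list(self.class_counts.values())) - remainder

    def split_attribute(self):
        """Return the best attribute if delta-G > epsilon, else None."""
        if self.n == 0 or len(self.attributes) < 2:
            return None
        best, second = sorted(self.attributes, key=self.gain, reverse=True)[:2]
        value_range = math.log2(self.n_classes)  # assumed range of information gain
        eps = math.sqrt(value_range ** 2 * math.log(1 / self.delta) / (2 * self.n))
        return best if self.gain(best) - self.gain(second) > eps else None

# Toy stream: "a" determines the class, "b" is uninformative.
cycle = [({"a": 0, "b": 0}, "no"), ({"a": 0, "b": 1}, "no"),
         ({"a": 1, "b": 0}, "yes"), ({"a": 1, "b": 1}, "yes")]
leaf = StreamLeaf(["a", "b"], n_classes=2, delta=1e-7)
for x, y in cycle:
    leaf.update(x, y)
early = leaf.split_attribute()   # too few examples: epsilon is still larger than delta-G
for x, y in cycle * 2:
    leaf.update(x, y)
later = leaf.split_attribute()   # enough examples: the leaf can split
print(early, later)
```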
![Page 20: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/20.jpg)
HOEFFDING TREE ALGORITHM
Memory cost
d: number of attributes
c: number of classes
v: number of values per attribute
l: number of leaves in the tree
The memory cost for each leaf is O(dvc).
The memory cost for the whole tree is O(ldvc).
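The O(dvc) figure comes from keeping one counter per (attribute, value, class) combination at each leaf. A small arithmetic sketch, with hypothetical function names and made-up sizes used purely for illustration:

```python
def leaf_counters(d, v, c):
    """Counters kept at one leaf: one per (attribute, value, class) triple."""
    return d * v * c

def tree_counters(l, d, v, c):
    """Counters kept by the whole tree: every leaf holds its own table."""
    return l * leaf_counters(d, v, c)

# Hypothetical sizes: 10 attributes, 4 values each, 2 classes, 1000 leaves.
print(leaf_counters(10, 4, 2))
print(tree_counters(1000, 10, 4, 2))
```

The cost depends on the size of the tree, not on the number of examples seen so far.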
![Page 21: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/21.jpg)
ADVANTAGES OF HOEFFDING TREE
1. Can deal with extremely large datasets.
2. Each example is read at most once and processed in small constant time, which makes it possible to mine online data sources.
3. Builds very complex trees at acceptable computational cost.
![Page 22: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/22.jpg)
VFDT—VERY FAST DECISION TREE
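Breaking ties
–Reduces waste
–Useful when the best attributes have nearly equal G, i.e. under the condition ΔG < ε < τ for a user-specified threshold τ
Use of n_min
–The split may not change with a single example
–Significantly reduces the time of re-computation
Memory cleanup
–Measurement of how promising each leaf is
–Clearance of the least promising leaves
–Option of enabling reactivation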
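The first two refinements can be sketched as two small predicates. The parameter names tau (tie threshold) and n_min (number of new examples between re-evaluations) are assumptions made for this sketch, as are the example values.

```python
def should_split(delta_g, epsilon, tau):
    """Split when there is a clear winner, or when the candidates are so
    close (epsilon already below tau) that waiting longer is wasted effort."""
    clear_winner = delta_g > epsilon
    tie = epsilon < tau
    return clear_winner or tie

def should_recompute(examples_at_leaf, n_min):
    """Re-evaluate G only every n_min examples, since one example
    rarely changes the decision."""
    return examples_at_leaf > 0 and examples_at_leaf % n_min == 0

print(should_split(delta_g=0.01, epsilon=0.04, tau=0.05))  # tie: split anyway
print(should_split(delta_g=0.01, epsilon=0.20, tau=0.05))  # keep collecting examples
print(should_recompute(400, n_min=200))
```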
![Page 23: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/23.jpg)
VFDT—VERY FAST DECISION TREE
Filtering out poor attributes
–Dropping them early
–Reduces memory consumption
Initialization
–Can be initialized with another existing tree
–Gives a head start
Rescans
![Page 24: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/24.jpg)
TESTS—CONFIGURATION
14 concepts
Generated by random decision trees with:
–Number of leaves: 2.2k to 61k
–Noise level: 0 to 30%
50k examples for testing
Available memory: 40MB
Legacy processors
![Page 25: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/25.jpg)
TESTS—SYNTHETIC DATA
![Page 26: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/26.jpg)
TESTS—SYNTHETIC DATA
![Page 27: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/27.jpg)
TESTS—SYNTHETIC DATA
![Page 28: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/28.jpg)
TESTS—SYNTHETIC DATA
![Page 29: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/29.jpg)
TESTS—SYNTHETIC DATA
![Page 30: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/30.jpg)
TESTS—SYNTHETIC DATA
Time consumption
20m examples
–VFDT takes 5752s to read, 625s to process
100k examples
–C4.5 takes 36s
–VFDT takes 47s
![Page 31: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/31.jpg)
TESTS—PARAMETERS
W/ & w/o over-pruning
![Page 32: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/32.jpg)
TESTS—PARAMETERS
W/o ties vs. w/ ties
–65 nodes vs. 8k nodes for VFDT
–805 nodes vs. 8k nodes for VFDT-boot
–72.9% vs. 86.9% accuracy for VFDT
–83.3% vs. 88.5% accuracy for VFDT-boot
vs.
–VFDT: +1.1% accuracy, +3.8x time
–VFDT-boot: -0.9% accuracy, +3.7x time
–5% more nodes
![Page 33: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/33.jpg)
TESTS—PARAMETERS
40MB vs. 80MB memory
–7.8k more nodes
–VFDT: +3.0% accuracy
–VFDT-boot: +3.2% accuracy
vs.
–30% fewer nodes
–VFDT: +2.3% accuracy
–VFDT-boot: +1.0% accuracy
![Page 34: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/34.jpg)
TESTS—WEB DATA
For predicting accesses
1.89m examples
61.1% with most common class
276230 examples for testing
![Page 35: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/35.jpg)
TESTS—WEB DATA
Decision stump
–64.2% accuracy
–1277s to learn
C4.5 with 40MB memory
–74.5k examples
–2975s to learn
–73.3% accuracy
VFDT bootstrapped with C4.5
–1.61m examples
–1450s to learn after initialization (983s to read)
![Page 36: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/36.jpg)
TESTS—WEB DATA
![Page 37: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/37.jpg)
MINING TIME-CHANGING DATA STREAMS
![Page 38: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/38.jpg)
WHY IS VFDT NOT ENOUGH?
VFDT assumes the training data is a sample drawn from a stationary distribution.
•Most large databases or data streams violate this assumption
–Concept drift: data is generated by a time-changing concept function, e.g.
•Seasonal effects
•Economic cycles
•Goal:
–Mine continuously changing data streams
–Scale well
![Page 39: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/39.jpg)
WHY IS VFDT NOT ENOUGH?
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples
–Sensitive to window size
•If w is small relative to the concept shift rate, a model reflecting the current concept is assured to be available
•But too small a w may leave insufficient examples to learn the concept
–If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high.
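The sliding-window approach can be sketched as follows. The majority-class learner is only a stand-in for a traditional batch learner, and all names here are assumptions for illustration. The point is that every arriving example triggers a full relearn over w examples.

```python
from collections import Counter, deque

def train_majority(examples):
    """Stand-in batch learner: always predicts the window's most common label."""
    label, _ = Counter(y for _, y in examples).most_common(1)[0]
    return lambda x: label

def sliding_window_models(stream, w, learn=train_majority):
    """Yield a freshly relearned model after every arriving example."""
    window = deque(maxlen=w)          # keeps only the w most recent examples
    for example in stream:
        window.append(example)
        yield learn(list(window))     # full relearn: cost grows with w

# Toy stream whose concept shifts from "a" to "b" halfway through.
stream = [(i, "a") for i in range(5)] + [(i, "b") for i in range(5)]
models = list(sliding_window_models(stream, w=3))
print(models[4](None), models[-1](None))
```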
![Page 40: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/40.jpg)
CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner)
–Extends VFDT
–Maintains VFDT’s speed and accuracy
–Detects and responds to changes in the example-generating process
![Page 41: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/41.jpg)
CVFDT (CONTD.)
With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
–Grow an alternative subtree with the new best attribute at its root when the old attribute seems out of date.
Periodically use a batch of samples to evaluate the quality of the trees.
–Replace the old subtree when the alternate one becomes more accurate.
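The replacement rule can be sketched in a few lines. This is a deliberately simplified illustration: subtrees are represented as plain prediction functions, and the names accuracy and maybe_replace are hypothetical. In CVFDT itself the comparison happens at individual nodes inside the tree.

```python
def accuracy(model, examples):
    """Fraction of (x, y) examples the model classifies correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

def maybe_replace(current, alternate, recent_examples):
    """Keep the old subtree unless the alternate is more accurate
    on a batch of recent examples."""
    if accuracy(alternate, recent_examples) > accuracy(current, recent_examples):
        return alternate
    return current

# Toy case: the concept has drifted, so the old rule is now mostly wrong.
old_subtree = lambda x: "yes" if x > 5 else "no"
new_subtree = lambda x: "yes" if x <= 5 else "no"
recent = [(1, "yes"), (2, "yes"), (7, "no"), (9, "no"), (8, "yes")]
chosen = maybe_replace(old_subtree, new_subtree, recent)
print(chosen is new_subtree)
```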
![Page 42: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/42.jpg)
HOW CVFDT WORKS
![Page 43: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/43.jpg)
EXAMPLE
![Page 44: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/44.jpg)
SAMPLE EXPERIMENT RESULT
![Page 45: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/45.jpg)
CONCLUSION AND FUTURE WORK
CVFDT is able to keep a decision tree up to date with a window of examples by using a small constant amount of time for each new example that arrives.
Empirical studies show that CVFDT is effectively able to keep its model up-to-date with a massive data stream even in the face of large and frequent concept shifts.
Future Work: Currently CVFDT discards subtrees that are out-of-date, but some concepts change periodically and these subtrees may become useful again – identifying these situations and taking advantage of them is another area for further study.
![Page 46: Mining High-Speed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062221/56814553550346895db221f2/html5/thumbnails/46.jpg)
THANK YOU