ARGUS: A Prototype Stream Anomaly Monitoring System
Thesis Proposal
Chun Jin
Thesis Committee
    Jaime Carbonell (Chair)
    Christopher Olston
    Jamie Callan
    Phil Hayes, DYNAMiX Technologies
Chun Jin Carnegie Mellon 2
Thesis Statement
    Stream Anomaly Monitoring Systems (SAMS) are an important sub-class of stream applications. The difficulty arises from the very large data volumes and the large number of queries the system is expected to handle.
    Propose an approach for SAMS that implements incremental evaluation schemes with an adapted Rete algorithm upon a traditional DBMS platform and exploits SAMS characteristics for query evaluation optimization.
    Demonstrate how the approach and the improvements could lead to a simple and fast implementation of an effective and efficient SAMS.

Outline
    Motivation
    My ARGUS Approach
    Current Work Status
        Current System
        Preliminary Results
    Proposed Work and Timeline

Stream Processing
    Stream Processing Applications
        Network Traffic Analysis and Router Configuration
        Internet Services
        Sensor Data Analysis
        Anomaly Detection
    Stream Processing Projects
        STREAM, TelegraphCQ, Aurora
        NiagaraCQ, OpenCQ, WebCQ
        Gigascope, Tribeca
        Tapestry, Alert, Tukwila, etc.

Stream Anomaly Monitoring Systems (SAMS)
    SAMS monitors structured data streams for anomalies or potential hazards.
    Continuous queries may number in thousands or tens of thousands.
    Daily stream volumes may exceed millions of records.
    Satisfaction of a SAMS query is often rare (very-high-selectivity).

SAMS Dataflow
[Figure: dataflow diagram. Labels: Data Streams (FedWire Money Transfers, Patient Records), Stream Anomaly Monitoring System, Storage, Analyst, Queries, Alerts.]

Query Example 4
    Suppose for every big transaction of type code 1000, the analyst wants to check if the money stayed in the bank or left within ten days. An additional sign of possible fraud is that transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within ten days of this transaction using an intermediate bank.

SQL Query for Example 4
    SELECT *  -- the select list is not shown in this transcript; * is a placeholder
    FROM transaction r1, transaction r2, transaction r3
    WHERE r2.type_code = 1000 AND
          r3.type_code = 1000 AND
          r1.type_code = 1000 AND
          r1.amount > 1000000 AND
          r1.rbank_aba = r2.sbank_aba AND
          r1.benef_account = r2.orig_account AND
          r2.amount > 0.5 * r1.amount AND
          r1.tran_date <= r2.tran_date AND
          r2.tran_date <= r1.tran_date + 10 AND
          r2.rbank_aba = r3.sbank_aba AND
          r2.benef_account = r3.orig_account AND
          r2.amount = r3.amount AND
          r2.tran_date <= r3.tran_date AND
          r3.tran_date <= r2.tran_date + 10;

ARGUS as a Prototype SAMS
    Implement the Adapted Rete Algorithm upon a traditional DBMS platform
        Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
    The SAMS assumption of very-high-selectivity queries over very-large-volume data justifies the employment of Rete and necessitates some unique improvements:
        Transitivity Inference (Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97)
        Predicate Set Evaluation and Materialization
        Partial Rete (Materialization skipping)
        Complex Common Computation Identification for Sharing
        Intermingled Sharing and Optimization processing

ARGUS System Architecture
[Figure: architecture diagram. Labels: Analyst, Query, Rete Network Generator, Rete Networks, Query Table, Scheduler, Do_queries, Stream Anomaly Monitoring, Data Streams, Data Tables, Intermediate Tables, Identified Threats.]

ReteGenerator Architecture
[Figure: architecture diagram of the ReteGenerator. Components: ReteGen Manager, Query Rewriter, Topology Checker, Transitivity Inference, History-based Rete Optimizer, Sharing. Stores: System Catalog, Topology Table, Counter Table. Flow labels: SQL Queries, Check Topology, Register Rete Networks, Update Tables, History-based Cost Estimating.]

Selected ARGUS Topics
    Adapted Rete Algorithm
        ReteGenerator translates a query into a Rete network that is wrapped as a stored procedure.
        The procedure implements the Adapted Rete Algorithm, which accounts for the incremental evaluation.
    Transitivity Inference
    Rete Optimization
    Computation Sharing

Adapted Rete Algorithm (Selection)
    n and m are old data sets.
    Δn and Δm are the new, much smaller incremental data sets.
    Selection σ:
        σ(n + Δn) = σ(n) + σ(Δn)
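The selection identity can be illustrated with a minimal in-memory sketch. This is not the ARGUS implementation, which runs as stored procedures inside a DBMS; the class name IncrementalSelection, the helper select, and the sample rows are assumptions made for illustration only.

```python
def select(rows, predicate):
    """Plain selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


class IncrementalSelection:
    """Keeps sigma(n) materialized and evaluates the predicate only on delta_n."""

    def __init__(self, predicate, old_rows=()):
        self.predicate = predicate
        self.materialized = select(old_rows, predicate)  # sigma(n)

    def insert(self, delta_rows):
        """Apply sigma to the new rows only and append the matches."""
        new_matches = select(delta_rows, self.predicate)  # sigma(delta_n)
        self.materialized.extend(new_matches)
        return new_matches


# Hypothetical data: only the amount attribute is modeled.
old = [{"amount": 500}, {"amount": 2_000_000}]
delta = [{"amount": 3_000_000}, {"amount": 10}]


def is_large(row):
    return row["amount"] > 1_000_000


node = IncrementalSelection(is_large, old)
new_matches = node.insert(delta)
```

After the insert, node.materialized holds the same rows as a full re-evaluation of the selection over old + delta, but the predicate was only evaluated on the two new rows.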

Adapted Rete Algorithm (Join)
    Join ⋈:
        (n + Δn) ⋈ (m + Δm) = n ⋈ m + Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm
    n ⋈ m is the old result; the remaining three terms are the new incremental results.
    When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n+m).
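A small sketch of the join decomposition, assuming an equi-join on the first field of each row and a hash join as the join method. Both are illustrative assumptions; the slides do not say which join method the DBMS uses.

```python
def hash_join(left, right):
    """Equi-join two lists of tuples on their first field."""
    index = {}
    for r in right:
        index.setdefault(r[0], []).append(r)
    return [(l, r) for l in left for r in index.get(l[0], [])]


def incremental_join(n, m, delta_n, delta_m):
    """Return only the new join results; n JOIN m is assumed to be stored already."""
    return (
        hash_join(delta_n, m)          # delta_n JOIN m
        + hash_join(n, delta_m)        # n JOIN delta_m
        + hash_join(delta_n, delta_m)  # delta_n JOIN delta_m
    )


# Hypothetical rows: (join key, label).
n = [(1, "n1"), (2, "n2")]
m = [(1, "m1"), (3, "m3")]
delta_n = [(3, "n3")]
delta_m = [(2, "m2"), (3, "m4")]

old_results = hash_join(n, m)
new_results = incremental_join(n, m, delta_n, delta_m)
full_results = hash_join(n + delta_n, m + delta_m)
```

The old results plus the new results contain the same pairs as the full join, while each of the three incremental terms has at least one small input.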

Incremental Evaluation in Rete: Example 4
[Figure: Rete network for Example 4 over the data table, with aliases r1, r2, r3.]
    Selection on r1: type_code = 1000, amount > 1000000
    Selection on r2: type_code = 1000
    Selection on r3: type_code = 1000
    Join of r1 and r2: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date <= r2.tran_date, r2.tran_date <= r1.tran_date + 10
    Join of (r1, r2) and r3: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date <= r3.tran_date, r3.tran_date <= r2.tran_date + 10
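The network for Example 4 can be mimicked in memory to show how materialized intermediate results are reused. The sketch below is an assumption-laden illustration, not the generated stored procedure: it handles insertions only, treats tran_date as an integer day number, shares one memory between the r2 and r3 selections because their predicates are identical, and uses the hypothetical names Example4Rete and txn.

```python
def j12(r1, r2):
    """Join condition between r1 and r2 in Example 4."""
    return (
        r1["rbank_aba"] == r2["sbank_aba"]
        and r1["benef_account"] == r2["orig_account"]
        and r2["amount"] > 0.5 * r1["amount"]
        and r1["tran_date"] <= r2["tran_date"] <= r1["tran_date"] + 10
    )


def j23(r2, r3):
    """Join condition between r2 and r3 in Example 4."""
    return (
        r2["rbank_aba"] == r3["sbank_aba"]
        and r2["benef_account"] == r3["orig_account"]
        and r2["amount"] == r3["amount"]
        and r2["tran_date"] <= r3["tran_date"] <= r2["tran_date"] + 10
    )


class Example4Rete:
    def __init__(self):
        self.r1_mem = []    # selection: type_code = 1000 and amount > 1000000
        self.r23_mem = []   # selection: type_code = 1000 (used for r2 and r3)
        self.pair_mem = []  # materialized (r1, r2) join results

    def insert(self, t):
        """Process one new transaction and return the new (r1, r2, r3) alerts."""
        if t["type_code"] != 1000:
            return []
        self.r23_mem.append(t)
        is_large = t["amount"] > 1_000_000
        if is_large:
            self.r1_mem.append(t)
        # New (r1, r2) pairs: t in the r2 role, then t in the r1 role.
        new_pairs = [(r1, t) for r1 in self.r1_mem if j12(r1, t)]
        if is_large:
            new_pairs += [(t, r2) for r2 in self.r23_mem
                          if r2 is not t and j12(t, r2)]
        # New alerts: new pairs against all r3, plus stored pairs against t.
        alerts = [(r1, r2, r3) for r1, r2 in new_pairs
                  for r3 in self.r23_mem if j23(r2, r3)]
        alerts += [(r1, r2, t) for r1, r2 in self.pair_mem if j23(r2, t)]
        self.pair_mem.extend(new_pairs)
        return alerts


def txn(sbank, orig, rbank, benef, amount, day, type_code=1000):
    """Build a hypothetical transaction record."""
    return {"type_code": type_code, "sbank_aba": sbank, "orig_account": orig,
            "rbank_aba": rbank, "benef_account": benef,
            "amount": amount, "tran_date": day}


net = Example4Rete()
t1 = txn(1, "a", 2, "b", 2_000_000, 1)
t2 = txn(2, "b", 3, "c", 1_500_000, 3)
t3 = txn(3, "c", 4, "d", 1_500_000, 5)
results = [net.insert(t) for t in (t1, t2, t3)]
```

Only the third insertion raises an alert: the pair (t1, t2) was materialized when t2 arrived, so t3 is checked against that stored pair instead of re-joining the whole table.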

Complex Queries
    A continuous query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.
    Each SQL term is mapped to a sub-Rete network.
    These sub-Rete networks are then connected to form the statement-level sub-networks.
    The statement-level sub-networks are further connected, based on the view references, to form the final query-level Rete network.

Transitivity Inference
    Explores the transitivity properties of comparison operators to derive hidden, highly selective selection predicates.
    Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results. Subsequent joins can then be performed very fast on the materialized intermediate results.

Transitivity Inference Example
    Given
        r1.amount > 1000000 and
        r2.amount > r1.amount * 0.5 and
        r3.amount = r2.amount
    r1.amount > 1000000 is very highly selective on r1.
    We can infer highly selective predicates:
        r2.amount > 500000
        r3.amount > 500000
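The inference in this example can be reproduced with a small bound-propagation sketch. It is an assumption-based illustration covering only strict lower bounds, predicates of the form a > k * b with k > 0, and equalities; the function name infer_lower_bounds and its input encoding are hypothetical and say nothing about how ARGUS represents predicates.

```python
def infer_lower_bounds(bounds, scaled_gt, equalities):
    """Propagate strict lower bounds through `a > k * b` (k > 0) and `a = b`.

    bounds:     {attr: c}     meaning attr > c
    scaled_gt:  [(a, k, b)]   meaning a > k * b
    equalities: [(a, b)]      meaning a = b
    Returns a new dict that also contains the inferred bounds.
    """
    inferred = dict(bounds)
    # Bounded number of rounds so that cyclic inputs cannot loop forever.
    for _ in range(len(scaled_gt) + len(equalities) + 1):
        candidates = []
        for a, k, b in scaled_gt:
            if b in inferred:
                candidates.append((a, k * inferred[b]))
        for a, b in equalities:
            if b in inferred:
                candidates.append((a, inferred[b]))
            if a in inferred:
                candidates.append((b, inferred[a]))
        changed = False
        for attr, bound in candidates:
            if attr not in inferred or bound > inferred[attr]:
                inferred[attr] = bound
                changed = True
        if not changed:
            break
    return inferred


result = infer_lower_bounds(
    bounds={"r1.amount": 1_000_000},
    scaled_gt=[("r2.amount", 0.5, "r1.amount")],
    equalities=[("r3.amount", "r2.amount")],
)
```

For the three predicates of the example, result maps r2.amount and r3.amount to a lower bound of 500000, which matches the two inferred selection predicates.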

Rete Optimization
[Figure: diagram of the History-based Rete Optimizer. Labels: SQL Query, Structure Builder, Join Graph, Active List, Join Enumerator, History-based Cost Estimator, DB, Update Tables, Rete network.]

History-based Cost Estimator
    Run sub-plans on historical data to estimate the costs of sub-plans on future data
        Assume the same data distribution in the past and the future
        Apply heuristic functions to avoid estimating extremely high-cost sub-plans
    Justification for the History-based Cost Estimator
        Queries are compiled and optimized once, and executed multiple times
        It is tolerable to spend more time on the one-time optimization
        Accurate cost estimates compensate for that time as queries run more and more times
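The idea of measuring sub-plans on historical data can be sketched as follows. Everything specific here is an assumption: the cost of a sub-plan is taken to be its result size on the historical rows, and the pruning heuristic is a simple cap on the input cross product (max_pairs). The slides do not state which cost measure or heuristic functions ARGUS uses.

```python
def join_size(left, right, predicate):
    """Run a join on historical rows and return its result size."""
    return sum(1 for l in left for r in right if predicate(l, r))


def estimate_costs(subplans, history, max_pairs=1_000_000):
    """Estimate each sub-plan's cost as its result size on historical data.

    subplans: {name: (left_table, right_table, predicate)}
    history:  {table_name: list of historical rows}
    Sub-plans whose input cross product exceeds max_pairs are skipped,
    an assumed stand-in for the heuristic pruning.
    """
    costs = {}
    for name, (left, right, predicate) in subplans.items():
        l_rows, r_rows = history[left], history[right]
        if len(l_rows) * len(r_rows) > max_pairs:
            continue  # too expensive to measure, leave unestimated
        costs[name] = join_size(l_rows, r_rows, predicate)
    return costs


# Hypothetical historical rows.
history = {
    "r1_large": [{"acct": 1, "amount": 2_000_000}],
    "r2": [{"acct": 1, "amount": 1_500_000},
           {"acct": 1, "amount": 10},
           {"acct": 2, "amount": 5}],
    "r3": [{"acct": 1, "amount": 1_500_000},
           {"acct": 2, "amount": 5},
           {"acct": 1, "amount": 10}],
}
subplans = {
    "r1_join_r2": ("r1_large", "r2",
                   lambda a, b: a["acct"] == b["acct"]
                   and b["amount"] > 0.5 * a["amount"]),
    "r2_join_r3": ("r2", "r3", lambda a, b: a["amount"] == b["amount"]),
}

costs = estimate_costs(subplans, history)
cheapest = min(costs, key=costs.get)
```

On this toy history the join that starts from the highly selective r1 side produces the smaller intermediate result, so it would be chosen first.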

Computation Sharing
    Predicate Indexing
    Extended predicate set operations
    Sharing Algorithm

Predicate Indexing
    Predicate Indexing Concepts:
        Equivalent Predicate: p1 ≡ p2 iff ∀D, p1(D) = p2(D)
        Equivalent Predicate Class
        Canonical Predicate Form
    Predicates are converted into their canonical forms and stored as records in tables.
    Searching for a predicate becomes data retrieval from tables.
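A minimal sketch of predicate indexing, using Python's sqlite3 module as a stand-in for the DBMS. The canonical form chosen here (column on the left, numeric constant on the right), the table predicate_index, and its column names are assumptions for illustration; the slides do not define the actual canonical form or schema.

```python
import sqlite3

# Operator to use when the two sides of a comparison are swapped.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "="}


def canonical(lhs, op, rhs):
    """Put the column (a string) on the left and the numeric constant on the right."""
    if isinstance(lhs, str):
        return (lhs, op, rhs)
    return (rhs, FLIP[op], lhs)


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE predicate_index ("
    "column_name TEXT, operator TEXT, constant REAL, node_id TEXT, "
    "PRIMARY KEY (column_name, operator, constant))"
)


def register(pred, node_id):
    """Store a predicate's canonical form with the node that computes it."""
    db.execute(
        "INSERT INTO predicate_index VALUES (?, ?, ?, ?)",
        (*canonical(*pred), node_id),
    )


def lookup(pred):
    """Return the node of an equivalent registered predicate, or None."""
    row = db.execute(
        "SELECT node_id FROM predicate_index "
        "WHERE column_name = ? AND operator = ? AND constant = ?",
        canonical(*pred),
    ).fetchone()
    return row[0] if row else None


register(("r1.amount", ">", 1000000), "node_1")
found = lookup((1000000, "<", "r1.amount"))    # same predicate, written differently
missing = lookup(("r1.amount", ">", 500000))   # different constant, no match
```

Because both spellings of the predicate reduce to the same canonical record, finding a shareable node is a plain keyed lookup in the table.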

Relationship between Predicate Sets and Their Result Tuple Sets
    Predicate Set: a set of conjunctive predicates
    Its Result Tuple Set: the set of database tuples that satisfy all the predicates of the Predicate Set.
    For a fixed database status D, there is a mapping from a predicate set P to its result tuple set SD(P):
        SD: P → SD(P)
    Predicate sets and their result tuple sets are complementary:
        Predicates are filters of data items
        The more predicates, the fewer result tuples

Extending Predicate Set Operations
    Defined on predicate sets
    Definitions are justified by the relationships among the corresponding result tuple sets
    Important to common computation identification

Semantic Subset ⊆≡
    Given two predicate sets P1 and P2, we say that P1 is a semantic subset of P2, denoted P1 ⊆≡ P2, if for any database status D we have SD(P1) ⊇ SD(P2).

Semantic Subset Example
    p1: t1.a > 1, p2: t1.a > 2
    P1 = {p1}, P2 = {p2}
    S(P1) ⊇ S(P2), so P1 ⊆≡ P2.
    Why? P2 ≡≡ {p1, p2}
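The example can be checked mechanically for a restricted predicate language. The sketch below only handles predicates of the form col > c, encoded as (col, c) pairs, and uses a sufficient syntactic test; the function names and the sample rows are assumptions for illustration.

```python
def result_set(predicates, rows):
    """S(P): the rows that satisfy every `col > c` predicate in P."""
    return [row for row in rows if all(row[col] > c for col, c in predicates)]


def is_semantic_subset(p1, p2):
    """Sufficient test for P1 being a semantic subset of P2.

    Every predicate `col > c1` in P1 must be implied by some predicate
    `col > c2` in P2 on the same column with c2 >= c1.
    """
    return all(
        any(col2 == col1 and c2 >= c1 for col2, c2 in p2)
        for col1, c1 in p1
    )


P1 = [("t1.a", 1)]  # p1: t1.a > 1
P2 = [("t1.a", 2)]  # p2: t1.a > 2

# Hypothetical database status D with t1.a ranging over 0..4.
rows = [{"t1.a": v} for v in range(5)]
s1 = result_set(P1, rows)
s2 = result_set(P2, rows)
```

On these rows every tuple in s2 also appears in s1, and adding p1 to P2 does not change the result set, which is the sense in which P2 ≡≡ {p1, p2}.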

Sharing Types
[Figure: four sharing cases over tables T1 and T2, labeled Non-change, Add-only, Reconstruction, and Selection Add-only. Node labels include POT1, POT2, POJ, PNJ, PFJ, POJ-PFJ, PNT1-POT1, PNT2-POT2, and PNJ-POJ.]

Sharing Algorithm Overview
    Non-change sharing.
    Add-only sharing.
    Optimizing the remaining query.
    Reconstruction and selection sharing.
    Constructing the remaining Rete network based on the optimized plan with possible sharing.

Current Work Status
    A preliminary system
        Database
        A preliminary ReteGenerator
            With the Adapted Rete and Transitivity Inference
            Will be expanded to incorporate optimization, computation sharing, incremental aggregation, etc.
    A preliminary evaluation
        Will conduct a full evaluation on the complete system in the future

Preliminary Evaluation: Queries and Data
    7 queries on a synthesized FedWire money transfer database of 320006 records.
    Two data conditions:
        Data1: Old: first 300000 records. New: remaining 20006 records. ALERT
        Data2: Old: first 300000 records. New: next 20000 records. NOT alert

Preliminary Results
[Figure: bar chart, "Rete with Transitivity Inference". Execution time (s), 0 to 50, for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2.]

Transitivity Inference
[Figure: two bar charts, one for Q2 and one for Q4. Execution time (s) under Data1 and Data2. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]

Partial Rete Generation
    Q4, assuming Transitivity Inference is not applicable
[Figure: bar chart. Execution time (s), 0 to 50, under Data1 and Data2. Series: Partial Rete, Rete, SQL.]

System Design and Implementation
    Rete Optimization (in progress) (05–08/2004)
    Computation Sharing (planned) (07–11/2004)
    Incremental Aggregation (planned) (12/2004–02/2005)
    Constraint Exploiting (optional) (04–05/2005)
    Transitivity Inference Enhancements (optional) (06–08/2005)
    Automatic Index Selection (optional) (09–12/2005)

System Evaluation
    Data Collection (12/2004 – 01/2005)
    Query Generation (12/2004 – 01/2005)
    Simulation and Evaluation (02 – 05/2005)
        Single SQL vs. Single Rete, Multiple SQL vs. Multiple Shared Optimized Rete
        Single Non-optimized Rete vs. Single Optimized Rete
        Multiple Non-shared Optimized Rete vs. Multiple Shared Optimized Rete
        Non-incremental Aggregation vs. Incremental Aggregation

Evaluation: Data Collection
    FedWire Money Transfer Transactions
        Synthesized
        0.5M records. Plan to generate 0.5M more.
        23 attributes/record
    Massachusetts Medical Data
        Real
        1.6M records (sanitized)
        70 attributes/record
        In-patient admission and discharge records.
        Expand to 10M.

Evaluation: Queries
    Now, 7 queries on FedWire, 3 queries on Medical.
    Plan to extend to 20-40 queries for each domain.
    Further extend query sets:
        Similar predicates matching different constants
        Join predicate sets have non-empty intersections
        Same where_clauses but different groupby_clauses
        Same where_clauses and groupby_clauses but different aggregation operators

Timeline
    System Design and Implementation (Required): 03/2004 – 02/2005
    System Implementation (Optional): 04/2005 – 12/2005
    Evaluation on Required Parts: 12/2004 – 05/2005
    Thesis Writing and Defense: 06/2005 – 03/2006
        Thesis Writing: 06 – 12/2005
        Thesis Finalizing: 01 – 03/2006
        Defense: 02 or 03/2006

ARGUS Summary
    Implement the incremental evaluation schemes with the Adapted Rete Algorithm upon a traditional DBMS platform
    To deal with very-large-volume data, exploit the very-high-selectivity query property for optimization:
        Transitivity Inference
        Predicate Set Evaluation and Materialization
        Partial Rete (Materialization skipping)
        Complex Common Computation Identification for Sharing
        Intermingled Sharing and Optimization processing