Apache Eagle Strata Hadoop World London 2016
arun-karthick-manoharan -
Transcript of Apache Eagle Strata Hadoop World London 2016
EBAY MARKETPLACE AT A GLANCE
$19.6B GMV in Q1 2016
9.5M new listings added via mobile per week
300M searches each day
63% of transactions ship for free (in US, UK, DE)
79% of items sold as new
~900M live listings
Q1 2016 data
One of the world’s largest and most vibrant marketplaces
Most Powerful Selling Platform
For business sellers: the potential to drive profitable sales and build a brand
For consumer sellers: an easy way to declutter, sell and make money
A partnership not a competition
Best Choice
Providing the greatest selection of inventory for our buyers
From new, everyday items to rare and unique goods
And incredible deals only found on eBay
Most Relevance
A shopping experience that is simple, data-driven and personalized
Enabling buyers to easily find, compare and purchase items they need and want
Highlighting the unique value that eBay brings
OUR STRATEGY
SMART COMMERCE
Identify an interesting set of candidate items,
trends, events, etc.
Personalize the results
Inspiration at scale!
Apache Eagle: Monitor Hadoop in Real Time
Arun Manoharan | Product Manager | @lycos_86
+200 petabytes of consumer data and growing…
Consumers on 6 continents
Millions of transactions
Credit cards
1000’s of product categories
Multiple cookies across dozens of business
Actionable search insights
+9M payments every day
+6K Total Payment Volume per second
A pair of shoes sold every 2 seconds
A cell phone sold every 4 seconds
A ladies’ handbag bought via mobile every 12 seconds
Loyalty
Click behavior and patterns
Device IDs
100’s of millions of email addresses
Bank accounts
POS
Autos
Products
IP addresses
COLLECT, ANALYZE, PREDICT
Big Data @ eBay
*Q3 2015 data
7 Hadoop Clusters*
800M HDFS operations (single cluster)*
120 PB Data*
Hadoop @ eBay
HADOOP SECURITY
Authorization & Access Control
Perimeter Security
Data Classification
Activity Monitoring
Security
Security for Hadoop
Who is accessing the data?
What data are they accessing?
Is someone trying to access data that they don’t have access to?
Are there any anomalous access patterns?
Is there a security threat?
How to monitor and get notified during or prior to an anomalous event occurring?
Motivation for Eagle
Apache Eagle
Apache Eagle: Monitor Hadoop in Real Time
Apache Eagle is an open source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. It can instantly identify access to sensitive data, recognize attacks and malicious activities, and block access in real time.
In conjunction with components such as Ranger, Sentry, Knox, DgSecure and Splunk, Eagle provides a comprehensive solution for securing sensitive data stored in Hadoop.
Eagle Architecture
Apache Eagle Composition
Apache Eagle
Integrations
Alert Engine
HDFS AUDIT
HIVE QUERY
HBASE AUDIT
CASSANDRA AUDIT
MapR AUDIT
HADOOP Performance Metric
Namenode JMX Metrics
Datanode JMX Metrics
System Metrics
M/R Job Performance Metric
History Job Metrics
Running Job Metrics
Spark Job Performance Metric
Spark Job Metrics
Queue Metrics
Data Activity Monitoring
RM JMX Metrics
Policy Store
Metadata API
Scalability
Extensibility
[Domains] [Applications]
Eagle
Data Classification - HDFS
• Browse HDFS file system
• Batch import sensitivity metadata through Eagle API
• Manually mark sensitivity in Eagle UI
Data Classification - Hive
• Browse Hive databases/tables/columns
• Batch import sensitivity metadata through Eagle API
• Manually mark sensitivity in Eagle UI
Define policy in UI and API
curl -u ${EAGLE_SERVICE_USER}:${EAGLE_SERVICE_PASSWD} -X POST -H 'Content-Type:application/json' \
  "http://${EAGLE_SERVICE_HOST}:${EAGLE_SERVICE_PORT}/eagle-service/rest/entities?serviceName=AlertDefinitionService" \
  -d '
  [
    {
      "prefix": "alertdef",
      "tags": {
        "site": "sandbox",
        "application": "hadoopJmxMetricDataSource",
        "policyId": "capacityUsedPolicy",
        "alertExecutorId": "hadoopJmxMetricAlertExecutor",
        "policyType": "siddhiCEPEngine"
      },
      "description": "jmx metric ",
      "policyDef": "{\"expression\":\"from hadoopJmxMetricEventStream[metric == \\\"hadoop.namenode.fsnamesystemstate.capacityused\\\" and convert(value, \\\"long\\\") > 0] select metric, host, value, timestamp, component, site insert into tmp; \",\"type\":\"siddhiCEPEngine\"}",
      "enabled": true,
      "dedupeDef": "{\"alertDedupIntervalMin\":10,\"emailDedupIntervalMin\":10}",
      "notificationDef": "[{\"sender\":\"[email protected]\",\"recipients\":\"[email protected]\",\"subject\":\"missing block found.\",\"flavor\":\"email\",\"id\":\"email_1\",\"tplFileName\":\"\"}]"
    }
  ]
  '
1 Create policy using API
2 Create policy using UI
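The policy payload nests JSON inside JSON: policyDef, dedupeDef and notificationDef are JSON documents serialized as strings, which is why the curl body needs several levels of backslash escaping. A minimal Python sketch of building the same shape with the standard json module is shown here. The helper name build_policy and the example.com addresses are illustrative assumptions, not part of Eagle; the sketch only builds the request body and does not call the service.

```python
import json


def build_policy(site, policy_id, expression, sender, recipients, dedup_minutes=10):
    """Build one alert-definition entity shaped like the curl payload.

    Hypothetical helper for illustration; field names are copied from the
    slide, everything else is an assumption.
    """
    policy_def = {"expression": expression, "type": "siddhiCEPEngine"}
    dedupe_def = {
        "alertDedupIntervalMin": dedup_minutes,
        "emailDedupIntervalMin": dedup_minutes,
    }
    notification_def = [{
        "sender": sender,
        "recipients": recipients,
        "subject": "capacity used alert",  # placeholder subject (assumption)
        "flavor": "email",
        "id": "email_1",
        "tplFileName": "",
    }]
    return {
        "prefix": "alertdef",
        "tags": {
            "site": site,
            "application": "hadoopJmxMetricDataSource",
            "policyId": policy_id,
            "alertExecutorId": "hadoopJmxMetricAlertExecutor",
            "policyType": "siddhiCEPEngine",
        },
        "description": "jmx metric",
        # The three nested definitions are serialized to strings, so the
        # escaping seen in the curl command is produced automatically.
        "policyDef": json.dumps(policy_def),
        "enabled": True,
        "dedupeDef": json.dumps(dedupe_def),
        "notificationDef": json.dumps(notification_def),
    }


expression = (
    'from hadoopJmxMetricEventStream[metric == '
    '"hadoop.namenode.fsnamesystemstate.capacityused" '
    'and convert(value, "long") > 0] '
    'select metric, host, value, timestamp, component, site insert into tmp; '
)
entity = build_policy("sandbox", "capacityUsedPolicy", expression,
                      "eagle@example.com", "ops@example.com")
body = json.dumps([entity])  # request body for the POST shown in the curl command
```

Generating the body this way keeps the Siddhi expression readable in source and leaves the quoting to the serializer.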
Define policy
Policy Capabilities
1 Single event evaluation
• threshold check with various conditions
2 Event window based evaluation
• various window semantics (time/length, sliding/batch window)
• comprehensive aggregation support
3 Correlation for multiple event streams
• SQL-like join
4 Pattern match and sequence
• a happens followed by b
Powered by Siddhi 3.0.5; Eagle provides dynamic capabilities and an intuitive API/UI
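To make three of these capabilities concrete, here is a small plain-Python sketch of what a single-event threshold check, a sliding time window count, and an "a followed by b" pattern evaluate. It is not Siddhi and not Eagle code; the function and class names are hypothetical, and a real policy would be written as a Siddhi query and evaluated by the engine.

```python
from collections import deque


def threshold_policy(event, limit):
    """Single event evaluation: fire when one event's value exceeds a limit."""
    return event["value"] > limit


class SlidingTimeWindowCount:
    """Event window evaluation: count events per key over the last window_ms."""

    def __init__(self, window_ms, max_count):
        self.window_ms = window_ms
        self.max_count = max_count
        self.timestamps = {}  # key -> deque of event timestamps in the window

    def on_event(self, key, timestamp_ms):
        """Record an event and return True if the key exceeds max_count."""
        window = self.timestamps.setdefault(key, deque())
        window.append(timestamp_ms)
        # Drop timestamps that have slid out of the time window.
        while window and window[0] <= timestamp_ms - self.window_ms:
            window.popleft()
        return len(window) > self.max_count


def followed_by(events, first, second):
    """Pattern match: True if a `first` event is later followed by `second`."""
    seen_first = False
    for event in events:
        if seen_first and event == second:
            return True
        if event == first:
            seen_first = True
    return False
```

For example, SlidingTimeWindowCount(1000, 2) fires on the third event a user produces within one second and stops firing once the older events expire.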
Scalability
• Scale with # of events
• Scale with # of policies
Eagle Alert Engine Overview
1 Runs CEP engine on Apache Storm
• Use CEP engine as library (Siddhi CEP)
• Evaluate policy on streamed data
• Rule is hot deployable
2 Inject policy dynamically
• API
• Intuitive UI
3 Scalability
• Computation: # of policies (policy placement)
• Storage: # of events (event partition)
4 Extensibility for policy enforcement
• Post-alert processing with plugin
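The two scalability axes above can be sketched in a few lines: events are spread by hashing a partition key, and policies are spread by assigning them to executors. The following is an illustrative sketch only. The function names, the CRC32 hash and the round-robin placement are assumptions chosen for simplicity, not the scheme Eagle actually uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Event partition: a stable hash so the same key always lands together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def place_policies(policy_ids, num_executors):
    """Policy placement: spread policies round-robin across executors."""
    placement = {executor: [] for executor in range(num_executors)}
    for position, policy_id in enumerate(sorted(policy_ids)):
        placement[position % num_executors].append(policy_id)
    return placement


# Three policies over two executors; each executor evaluates only its share.
placement = place_policies(["p1", "p2", "p3"], 2)
```

Adding executors reduces the number of policies each one evaluates, while adding partitions reduces the number of events each one sees.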
Eagle Alert
Eagle Service
As of 0.3.0, Eagle stores metadata and statistics in HBase and supports Druid as a metric store.
1 Data to be stored
Metadata
• Policy
• Event schema
• Site/Application/UI features
Statistics
• # of events evaluated per second
• Audit for policy change
Raw data
• Druid for metrics
• HBase for M/R job/task etc.
• ES for logs (future)
2 Storage
HBase
• Store metrics
• Store M/R job/task data
• Rowkey design for time-series data
• HBase Coprocessor
Druid
• Consume data from Kafka
3 API/UI
HBase
• filter, groupby, sort, top
Druid
• Druid query API
• Dashboard in Eagle
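The slide mentions a rowkey design for time-series data without giving the layout. One common pattern for time-series rowkeys in HBase is a fixed-width prefix, then a reversed timestamp so the newest rows sort first, then fixed-width tag hashes. The sketch below illustrates that general pattern only; the field order, widths and the row_key helper are assumptions and should not be read as Eagle's actual rowkey format.

```python
import struct
import zlib

MAX_LONG = 2**63 - 1  # largest signed 64-bit value


def row_key(metric, timestamp_ms, tags):
    """Build a byte rowkey: metric hash + reversed timestamp + tag hashes.

    Hypothetical layout for illustration only.
    """
    # 4-byte hash keeps the prefix fixed-width whatever the metric name is.
    prefix = struct.pack(">I", zlib.crc32(metric.encode("utf-8")))
    # Reversing the timestamp makes newer rows sort before older ones.
    reversed_ts = struct.pack(">q", MAX_LONG - timestamp_ms)
    # Tags are hashed in sorted order so the key does not depend on dict order.
    tag_part = b"".join(
        struct.pack(">I", zlib.crc32(f"{name}={value}".encode("utf-8")))
        for name, value in sorted(tags.items())
    )
    return prefix + reversed_ts + tag_part


newer = row_key("hadoop.namenode.capacityused", 2000, {"site": "sandbox"})
older = row_key("hadoop.namenode.capacityused", 1000, {"site": "sandbox"})
```

Because HBase orders rows by byte-wise comparison of the key, a scan that starts at the metric prefix returns the most recent points first under this layout.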
Highlights
1. Ease of use: after installation, the user defines rules.
2. Comprehensive rules on a high volume of data: Eagle solves some problems unique to Hadoop.
3. Hot deploy rule: Eagle does not provide a lot of charts; instead it allows the user to write an ad-hoc rule and hot deploy it.
4. Metadata driven: metadata includes policy, event schema, UI components, etc.
5. Monolithic Storm topology: application pre-processing runs together with the alert engine.
6. Extensibility: Eagle can’t succeed alone; it has to be integrated with other systems, for example for data classification and policy enforcement.
Alert Engine Limitations in Eagle 0.3
1 High cost for integrating
• Coding for onboarding new data source
• Monolithic topology for pre-processing and alert
2 Not multi-tenant
• Alert engine is embedded into application
• Many separate Storm topologies
3 Policy capability restricted by event partition
• Can’t do ad-hoc group-by policy expression, for example changing from group-by user to group-by cmd
4 Correlation is not declarative
• Coding for correlating existing data sources
5 Stateful policy evaluation
• Fail over when bolt is down
If traffic is partitioned by user, a policy only supports user-based group-by expressions
One Storm topology even for one trivial data source
Even if it is a simple data source, you have to write a Storm topology and then deploy it
Can’t declare correlations for multiple metrics
How to replay one week of history data when a node is down
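The group-by restriction is easiest to see with a toy example. If events are routed by user, each partition sees only some of the events for a given cmd, so a "count per cmd" policy evaluated inside each partition can miss a threshold that the full stream crosses. The sketch below uses a deliberately simple partitioner (sum of key bytes) so the routing is easy to follow; all names are hypothetical and nothing here is Eagle code.

```python
from collections import Counter


def toy_partition(key, num_partitions):
    """Deterministic toy partitioner; a stand-in for a real hash function."""
    return sum(key.encode("utf-8")) % num_partitions


def route(events, key_field, num_partitions):
    """Send each event to the partition chosen by its key_field value."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        index = toy_partition(event[key_field], num_partitions)
        partitions[index].append(event)
    return partitions


def partitions_alerting(partitions, group_field, threshold):
    """Evaluate 'count per group >= threshold' separately in each partition."""
    alerts = []
    for index, partition in enumerate(partitions):
        counts = Counter(event[group_field] for event in partition)
        alerts += [(index, group) for group, n in counts.items() if n >= threshold]
    return alerts


events = [
    {"user": "alice", "cmd": "delete"},
    {"user": "bob", "cmd": "delete"},
    {"user": "carol", "cmd": "delete"},
    {"user": "alice", "cmd": "open"},
]

# Policy: alert when any cmd is seen 3 or more times.
by_user = partitions_alerting(route(events, "user", 2), "cmd", 3)
by_cmd = partitions_alerting(route(events, "cmd", 2), "cmd", 3)
```

With this data the three delete events are split across the two user partitions, so by_user is empty, while routing by cmd puts them together and by_cmd contains the alert. Supporting an arbitrary group-by field therefore means re-partitioning the stream by that field.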
Integrations
• Cassandra
• MapR
• Mongo DB
• Job Queue
Extensibility
Sentry/Ranger
• As remediation engine
• As generic data source
DgSecure
• Source of truth for data classification
Splunk
• Syslog format output
• Eagle alert output is the 1st abstraction of analytics and Splunk is the 2nd abstraction
USER PROFILE ALGORITHMS… Eigenvalue Decomposition
• Compute mean and variance
• Compute eigenvectors and determine principal components
• Normal data points lie near the first few principal components
• Abnormal data points lie further from the first few principal components and closer to later components
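The steps above can be sketched in plain Python. This is a simplified illustration, not Eagle's user-profile implementation: it keeps only the first principal component, finds it by power iteration instead of a full eigenvalue decomposition, and scores a point by its distance from that component. The feature choice (reads and writes per hour) and all function names are assumptions.

```python
import math


def fit(rows):
    """Return per-feature means and standard deviations of the training rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return means, stds


def standardize(row, means, stds):
    """Center and scale one row using the training statistics."""
    return [(row[j] - means[j]) / stds[j] for j in range(len(row))]


def first_component(z_rows, iterations=200):
    """Approximate the top eigenvector of the covariance by power iteration."""
    n, d = len(z_rows), len(z_rows[0])
    cov = [[sum(r[i] * r[j] for r in z_rows) / n for j in range(d)]
           for i in range(d)]
    # Asymmetric start vector, so it is unlikely to be orthogonal to the answer.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def anomaly_score(row, means, stds, component):
    """Distance of a standardized point from the first principal component."""
    z = standardize(row, means, stds)
    projection = sum(a * b for a, b in zip(z, component))
    return math.sqrt(max(0.0, sum(a * a for a in z) - projection ** 2))


# Hypothetical history: (reads, writes) per hour for one user.
history = [(10, 20), (20, 41), (30, 59), (40, 80), (50, 101)]
means, stds = fit(history)
component = first_component([standardize(r, means, stds) for r in history])
normal_score = anomaly_score((25, 50), means, stds, component)
abnormal_score = anomaly_score((50, 20), means, stds, component)
```

A point that follows the user's usual read/write ratio gets a score near zero, while a point that breaks the ratio gets a much larger one; a threshold on the score would then decide when to raise an alert.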
USER PROFILE ARCHITECTURE
Eagle Next Releases
Eagle 0.4: Improve user experience
• Remote start Storm topology
• Metadata stored in RDBMS
Eagle 0.5: Alert Engine as Platform
• No monolithic topology
• Declarative data source onboard
• Easy correlation
• Support policies with any field group-by
• Elastic capacity management
Welcome Contributors in Apache Eagle
Website: http://eagle.incubator.apache.org
GitHub: https://github.com/apache/incubator-eagle
Dev Mail List
Twitter: @TheApacheEagle
Q & A
Thank You!!