Apache Eagle Strata Hadoop World London 2016


Transcript of Apache Eagle Strata Hadoop World London 2016

Page 1: Apache Eagle Strata Hadoop World London 2016
Page 2: Apache Eagle Strata Hadoop World London 2016


Arun Manoharan

Product Manager – [email protected]

@lycos_86

Page 3: Apache Eagle Strata Hadoop World London 2016

EBAY MARKETPLACE AT A GLANCE

$19.6B GMV in Q1 2016

9.5M new listings added via mobile per week

300M searches each day

63% of transactions ship for free (in US, UK, DE)

79% of items sold as new

Q1 2016 data

~900M live listings

One of the world’s largest and most vibrant marketplaces

Page 4: Apache Eagle Strata Hadoop World London 2016

Most Powerful Selling Platform

For business sellers: the potential to drive profitable sales and build a brand

For consumer sellers: an easy way to declutter, sell and make money

A partnership not a competition

Best Choice

Providing the greatest selection of inventory for our buyers

From new, everyday items to rare and unique goods

And incredible deals only found on eBay

Most Relevance

A shopping experience that is simple, data-driven and personalized

Enabling buyers to easily find, compare and purchase items they need and want

Highlighting the unique value that eBay brings

OUR STRATEGY

Page 5: Apache Eagle Strata Hadoop World London 2016

SMART COMMERCE

Identify an interesting set of candidate items,

trends, events, etc.

Personalize the results

Inspiration at scale!

Page 6: Apache Eagle Strata Hadoop World London 2016


Apache Eagle: Monitor Hadoop in Real Time

Arun Manoharan | Product Manager

@lycos_86

Page 7: Apache Eagle Strata Hadoop World London 2016

+200 Petabytes of Consumer Data and growing…

Consumers on 6 Continents

Millions of Transactions

1000’s of Product Categories

Multiple cookies across dozens of business

Actionable search insights

+ 9M payments every day

+ 6K Total Payment Volume per second

Loyalty

Click behavior and patterns

Device IDs

100’s of millions of

Email addresses

Bank accounts

POS

Autos

Products

IP Address

Page 8: Apache Eagle Strata Hadoop World London 2016

+200 Petabytes of Consumer Data and growing…

Consumers on 6 Continents

Credit cards

1000’s of Product Categories

Multiple cookies across dozens of business

Actionable search insights

+ 9M payments every day

+ 6K Total Payment Volume per second

Pair of shoes sold every 2 seconds

Loyalty

Cell phone sold every 4 seconds

Click behavior and patterns

Device IDs

100’s of millions of

Email addresses

Bank accounts

POS

A ladies’ handbag is bought via mobile every 12 seconds

Auto

Products

IP Address

COLLECT, ANALYZE, PREDICT

Page 9: Apache Eagle Strata Hadoop World London 2016

Big Data @ eBay

*Q3 2015 data

7 Hadoop Clusters*

800M HDFS operations (single cluster)*

120 PB Data*

Hadoop @ eBay

Page 10: Apache Eagle Strata Hadoop World London 2016

HADOOP SECURITY

Authorization & Access Control

Perimeter Security

Data Classification

Activity Monitoring

Security

Security for Hadoop

Page 11: Apache Eagle Strata Hadoop World London 2016

Who is accessing the data?

What data are they accessing?

Is someone trying to access data that they don’t have access to?

Are there any anomalous access patterns?

Is there a security threat?

How can we monitor and get notified during, or before, an anomalous event?

Motivation for Eagle

Page 12: Apache Eagle Strata Hadoop World London 2016

Apache Eagle

Apache Eagle: Monitor Hadoop in Real Time

Apache Eagle is an open source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. It can instantly identify access to sensitive data, recognize attacks and malicious activities, and block access in real time.

In conjunction with components such as Ranger, Sentry, Knox, DgSecure and Splunk, Eagle provides a comprehensive solution to secure sensitive data stored in Hadoop.

Page 13: Apache Eagle Strata Hadoop World London 2016

Eagle Architecture

Page 14: Apache Eagle Strata Hadoop World London 2016

Apache Eagle Composition

Apache Eagle

Integrations

Alert Engine

HDFS AUDIT

HIVE QUERY

HBASE AUDIT

CASSANDRA AUDIT

MapR AUDIT

HADOOP Performance Metric

Namenode JMX Metrics

Datanode JMX Metrics

System Metrics

M/R Job Performance Metric

History Job Metrics

Running Job Metrics

Spark Job Performance Metric

Spark Job Metrics

Queue Metrics

Data Activity Monitoring

RM JMX Metrics

Policy Store

Metadata API

Scalability

Extensibility

[Domains] [Applications]

Page 15: Apache Eagle Strata Hadoop World London 2016

Eagle

Page 16: Apache Eagle Strata Hadoop World London 2016

Data Classification - HDFS

• Browse HDFS file system
• Batch import sensitivity metadata through Eagle API
• Manually mark sensitivity in Eagle UI

Page 17: Apache Eagle Strata Hadoop World London 2016

Data Classification - Hive

• Browse Hive databases/tables/columns
• Batch import sensitivity metadata through Eagle API
• Manually mark sensitivity in Eagle UI

Page 18: Apache Eagle Strata Hadoop World London 2016

Define policy in UI and API

curl -u ${EAGLE_SERVICE_USER}:${EAGLE_SERVICE_PASSWD} -X POST -H 'Content-Type:application/json' \
  "http://${EAGLE_SERVICE_HOST}:${EAGLE_SERVICE_PORT}/eagle-service/rest/entities?serviceName=AlertDefinitionService" \
  -d '
  [
    {
      "prefix": "alertdef",
      "tags": {
        "site": "sandbox",
        "application": "hadoopJmxMetricDataSource",
        "policyId": "capacityUsedPolicy",
        "alertExecutorId": "hadoopJmxMetricAlertExecutor",
        "policyType": "siddhiCEPEngine"
      },
      "description": "jmx metric ",
      "policyDef": "{\"expression\":\"from hadoopJmxMetricEventStream[metric == \\\"hadoop.namenode.fsnamesystemstate.capacityused\\\" and convert(value, \\\"long\\\") > 0] select metric, host, value, timestamp, component, site insert into tmp; \",\"type\":\"siddhiCEPEngine\"}",
      "enabled": true,
      "dedupeDef": "{\"alertDedupIntervalMin\":10,\"emailDedupIntervalMin\":10}",
      "notificationDef": "[{\"sender\":\"[email protected]\",\"recipients\":\"[email protected]\",\"subject\":\"missing block found.\",\"flavor\":\"email\",\"id\":\"email_1\",\"tplFileName\":\"\"}]"
    }
  ]
  '

1 Create policy using API
2 Create policy using UI
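The payload in the curl call above nests JSON inside JSON: policyDef, dedupeDef and notificationDef are JSON documents serialized into strings, which is why the slide shows so many backslashes. A minimal Python sketch of building the same shape without hand-escaping follows. The function name build_alert_definition and the example.com addresses are placeholders for illustration and are not part of Eagle; the field names and values are copied from the slide.

```python
import json


def build_alert_definition(site, application, policy_id, executor_id,
                           expression, sender, recipients, subject):
    """Return a one-entity payload in the shape shown on the slide."""
    policy_type = "siddhiCEPEngine"
    return [{
        "prefix": "alertdef",
        "tags": {
            "site": site,
            "application": application,
            "policyId": policy_id,
            "alertExecutorId": executor_id,
            "policyType": policy_type,
        },
        "description": "jmx metric ",
        # These three fields are JSON documents stored as strings,
        # so each one gets its own json.dumps call.
        "policyDef": json.dumps({"expression": expression,
                                 "type": policy_type}),
        "enabled": True,
        "dedupeDef": json.dumps({"alertDedupIntervalMin": 10,
                                 "emailDedupIntervalMin": 10}),
        "notificationDef": json.dumps([{
            "sender": sender,
            "recipients": recipients,
            "subject": subject,
            "flavor": "email",
            "id": "email_1",
            "tplFileName": "",
        }]),
    }]


# Policy expression copied from the slide.
expression = ('from hadoopJmxMetricEventStream[metric == '
              '"hadoop.namenode.fsnamesystemstate.capacityused" '
              'and convert(value, "long") > 0] '
              'select metric, host, value, timestamp, component, site '
              'insert into tmp; ')

payload = build_alert_definition(
    "sandbox", "hadoopJmxMetricDataSource", "capacityUsedPolicy",
    "hadoopJmxMetricAlertExecutor", expression,
    "sender@example.com", "ops@example.com",  # placeholder addresses
    "capacity used alert")

# Request body that could be passed to curl with -d.
body = json.dumps(payload, indent=2)
```

The sketch only builds the request body; sending it is left to the curl command shown on the slide.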

Page 19: Apache Eagle Strata Hadoop World London 2016

Define policy

Page 20: Apache Eagle Strata Hadoop World London 2016

Policy Capabilities

1 Single event evaluation
• threshold check with various conditions

2 Event window based evaluation
• various window semantics (time/length sliding/batch window)
• comprehensive aggregation support

3 Correlation for multiple event streams
• SQL-like join

4 Pattern Match and Sequence
• a happens followed by b

Powered by Siddhi 3.0.5, and Eagle provides dynamic capabilities and intuitive API/UI
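To make the "event window based evaluation" capability concrete, here is a small sketch of a sliding time window with a count threshold. It is a plain Python stand-in for the idea, not Siddhi and not Eagle code; the class name SlidingTimeWindowCount is hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingTimeWindowCount:
    """Fire when more than `threshold` events fall inside `window_ms`."""

    def __init__(self, window_ms, threshold):
        self.window_ms = window_ms
        self.threshold = threshold
        self.timestamps = deque()

    def on_event(self, timestamp_ms):
        """Add one event and report whether the policy fires."""
        self.timestamps.append(timestamp_ms)
        # Expire events that have slid out of the window.
        while self.timestamps[0] <= timestamp_ms - self.window_ms:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


# More than 2 events within 10 seconds triggers an alert.
policy = SlidingTimeWindowCount(window_ms=10_000, threshold=2)
results = [policy.on_event(t) for t in (0, 1_000, 2_000, 20_000)]
```

A single-event threshold check is the degenerate case of this with no retained state; a length window would cap the deque by size instead of by age.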

Page 21: Apache Eagle Strata Hadoop World London 2016

Scalability

• Scale with # of events
• Scale with # of policies

Page 22: Apache Eagle Strata Hadoop World London 2016

Eagle Alert Engine Overview

1 Runs CEP engine on Apache Storm
• Use CEP engine as library (Siddhi CEP)
• Evaluate policy on streamed data
• Rule is hot deployable

2 Inject policy dynamically
• API
• Intuitive UI

3 Scalability
• Computation # of policies (policy placement)
• Storage # of events (event partition)

4 Extensibility for policy enforcement
• Post-alert processing with plugin
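The two scalability levers named above, event partition and policy placement, can be sketched as follows. The slides do not describe Eagle's actual strategies, so the hash partitioning and round-robin placement here are assumptions chosen for illustration, and both function names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Stable hash partitioning: the same key always maps to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def place_policies(policy_ids, num_workers):
    """Round-robin placement: each worker evaluates a subset of the policies."""
    placement = {worker: [] for worker in range(num_workers)}
    for index, policy_id in enumerate(sorted(policy_ids)):
        placement[index % num_workers].append(policy_id)
    return placement


placement = place_policies(["p3", "p1", "p2"], num_workers=2)
alice_partition = partition_for("alice", num_partitions=4)
```

Partitioning spreads event volume across workers, while placement spreads policy evaluation cost; the two can be tuned independently.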

Page 23: Apache Eagle Strata Hadoop World London 2016

Eagle Alert

Page 24: Apache Eagle Strata Hadoop World London 2016

Eagle Service

As of 0.3.0, Eagle stores metadata and statistics in HBASE and supports Druid as a metric store.

1 Data to be stored

Metadata
• Policy
• Event schema
• Site/Application/UI Features

Statistics
• # of events evaluated per second
• audit for policy change

Raw data
• Druid for metric
• HBASE for M/R job/task etc.
• ES for log (future)

2 Storage

HBASE
• Store metrics
• Store M/R job/task data
• Rowkey design for time-series data
• HBase Coprocessor

Druid
• Consume data from Kafka

3 API/UI

HBASE
• filter, groupby, sort, top

Druid
• Druid query API
• Dashboard in Eagle
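The slide mentions a rowkey design for time-series data without giving the layout. As an illustration only, the sketch below uses a pattern common in HBase time-series schemas: a metric prefix followed by a reversed timestamp, so the newest rows sort first within a metric. This is an assumption, not Eagle's documented rowkey format, and build_rowkey is a hypothetical helper.

```python
import struct

MAX_LONG = 2**63 - 1


def build_rowkey(metric, timestamp_ms):
    """Metric prefix + separator + reversed big-endian timestamp."""
    prefix = metric.encode("utf-8")
    # Subtracting from MAX_LONG makes newer timestamps produce smaller
    # values, which sort earlier in HBase's lexicographic key order.
    reversed_ts = MAX_LONG - timestamp_ms
    return prefix + b"\x00" + struct.pack(">q", reversed_ts)


older = build_rowkey("capacityused", 1_000)
newer = build_rowkey("capacityused", 2_000)
```

With this layout, a scan that starts at the metric prefix returns the most recent data points first, which suits dashboards and "latest value" lookups.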

Page 25: Apache Eagle Strata Hadoop World London 2016

Highlights

1. Ease of use: after installation, the user defines rules.
2. Comprehensive rules on high volume of data: Eagle solves some unique problems in Hadoop.
3. Hot deploy rule: Eagle does not provide a lot of charts; instead it allows the user to write an ad-hoc rule and hot deploy it.
4. Metadata driven: metadata includes policy, event schema, UI component, etc.
5. Monolithic Storm topology: application pre-processing runs together with the alert engine.
6. Extensibility: Eagle can’t succeed alone; it has to be integrated with other systems, for example for data classification and policy enforcement.

Page 26: Apache Eagle Strata Hadoop World London 2016

Alert Engine Limitations in Eagle 0.3

1 High cost for integrating
• Coding for onboarding new data source
• Monolithic topology for pre-processing and alert
Even if it is a simple data source, you have to write a Storm topology and then deploy it

2 Not multi-tenant
• Alert engine is embedded into application
• Many separate Storm topologies
One Storm topology even for one trivial data source

3 Policy capability restricted by event partition
• Can’t do ad-hoc group-by policy expression
For example, from groupby user to groupby cmd
If traffic is partitioned by user, policy only supports expression of user based group-by

4 Correlation is not declarative
• Coding for correlating existing data sources
Can’t declare correlations for multiple metrics

5 Stateful policy evaluation
• fail over when bolt is down
How to replay one week of history data when a node is down
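Limitation 3 is easiest to see with a toy example. When events are partitioned by user, each partition sees every event for its users, so a per-user count is complete; a per-command count is split across partitions and no single evaluator sees the total. The events and the user-to-partition assignment below are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical audit events and a fixed user -> partition assignment.
EVENTS = [
    {"user": "alice", "cmd": "open"},
    {"user": "bob", "cmd": "open"},
    {"user": "carol", "cmd": "delete"},
    {"user": "alice", "cmd": "delete"},
]
USER_TO_PARTITION = {"alice": 0, "bob": 1, "carol": 1}


def counts_per_partition(events, group_field):
    """Count events per group value as each user-keyed partition sees them."""
    per_partition = defaultdict(Counter)
    for event in events:
        partition = USER_TO_PARTITION[event["user"]]
        per_partition[partition][event[group_field]] += 1
    return {p: dict(c) for p, c in per_partition.items()}


by_user = counts_per_partition(EVENTS, "user")  # complete per-user counts
by_cmd = counts_per_partition(EVENTS, "cmd")    # per-cmd counts are split
```

Here "open" occurs twice overall, but each partition only counts one, so a policy such as "alert when open count exceeds 1" would never fire without repartitioning the stream by cmd.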

Page 27: Apache Eagle Strata Hadoop World London 2016

Integrations

• Cassandra
• MapR
• Mongo DB
• Job Queue

Page 28: Apache Eagle Strata Hadoop World London 2016

Extensibility

Sentry/Ranger
• As remediation engine
• As generic data source

DgSecure
• Source of truth for data classification

Splunk
• Syslog format output
• EAGLE alert output is the 1st abstraction of analytics and Splunk is the 2nd abstraction

Page 29: Apache Eagle Strata Hadoop World London 2016

USER PROFILE ALGORITHMS…

Eigen Value Decomposition

• Compute mean and variance

• Compute Eigen Vectors and determine Principal Components

• Normal data points lie near first few principal components

• Abnormal data points lie further from first few principal components and closer to later components
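The steps above can be sketched in two dimensions with only the standard library. This is a simplified illustration of the principal-component idea, not Eagle's user profile implementation: it centers the data but does not scale by variance, it uses the closed-form orientation of a symmetric 2x2 covariance matrix instead of a general eigen solver, and the sample points are invented.

```python
import math


def fit_principal_axis(points):
    """Return (mean, unit eigenvector of the largest eigenvalue) for 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Orientation of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def residual_distance(point, mean, axis):
    """Distance from `point` to the first principal component line."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    along = dx * axis[0] + dy * axis[1]
    return math.hypot(dx - along * axis[0], dy - along * axis[1])


# Hypothetical per-user activity features observed during normal behavior.
history = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
mean, axis = fit_principal_axis(history)

normal_score = residual_distance((6, 6), mean, axis)
abnormal_score = residual_distance((5, 1), mean, axis)
```

A point that follows the historical pattern has a residual near zero, while a point far from the first principal component gets a large residual and can be flagged once it passes a chosen threshold.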

Page 30: Apache Eagle Strata Hadoop World London 2016

USER PROFILE ARCHITECTURE

Page 31: Apache Eagle Strata Hadoop World London 2016

Eagle Next Releases

Eagle 0.4

• Improve user experience
  Remote start Storm topology
  Metadata stored in RDBMS

Eagle 0.5

• Alert Engine as Platform
  No monolithic topology
  Declarative data source onboard
  Easy correlation
  Support policies with any field group-by
  Elastic capacity management

Page 32: Apache Eagle Strata Hadoop World London 2016

Welcome Contributors in Apache Eagle

Dev Mail List: [email protected]

http://eagle.incubator.apache.org

Github: https://github.com/apache/incubator-eagle

Twitter: @TheApacheEagle

Q & A

Page 33: Apache Eagle Strata Hadoop World London 2016


Thank You!!