Apache Eagle Dublin Hadoop Summit 2016


Transcript of Apache Eagle Dublin Hadoop Summit 2016

Page 1

Page 2

Apache Eagle: Monitor Hadoop in Real Time

Yong Zhang | Senior Architect | [email protected]
Manoharan | Senior Product Manager | @lycos_86

Page 3

Big Data @ eBay

800M Listings*

159M Global Active Buyers*

*Q3 2015 data

7 Hadoop Clusters*

800M HDFS operations (single cluster)*

120 PB Data*

Hadoop @ eBay

Page 4

HADOOP SECURITY

Authorization & Access Control

Perimeter Security

Data Classification

Activity Monitoring

Security MDR

• Perimeter Security
• Authorization & Access Control
• Discovery
• Activity Monitoring

Security for Hadoop

Page 5

Who is accessing the data?

What data are they accessing?

Is someone trying to access data that they don’t have access to?

Are there any anomalous access patterns?

Is there a security threat?

How can we monitor and get notified before, or while, an anomalous event occurs?

Motivation

Page 6

Apache Eagle

Apache Eagle: Monitor Hadoop in Real Time

Apache Eagle is an open-source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. It can instantly identify access to sensitive data, recognize attacks and malicious activities, and block access in real time.

In conjunction with components such as Ranger, Sentry, Knox, DgSecure, and Splunk, Eagle provides a comprehensive solution to secure sensitive data stored in Hadoop.

Page 7

Apache Eagle Composition

Apache Eagle

Integrations | Alert Engine

1 Data Activity Monitoring
• HDFS audit
• Hive query
• HBase audit
• Cassandra audit
• MapR audit

2 Hadoop Performance Metric
• Namenode JMX metrics
• Datanode JMX metrics
• System metrics

3 M/R Job Performance Metric
• History job metrics
• Running job metrics

4 Spark Job Performance Metric
• Spark job metrics
• Queue metrics
• RM JMX metrics

1 Policy Store
2 Metadata API
3 Scalability
4 Extensibility

[Domains] [Applications]

Page 8

More Integrations

• Cassandra
• MapR
• Mongo DB
• Job
• Queue

Page 9

Extensibility

Ranger
• As remediation engine
• As generic data source

DgSecure
• Source of truth for data classification

Splunk
• Syslog format output
• Eagle alert output is the 1st abstraction of analytics, and Splunk is the 2nd abstraction

Page 10

Eagle Architecture

Page 11

Highlights

1. Turn-key integration: after installation, the user only defines rules.
2. Comprehensive rules on high volumes of data: Eagle solves some unique problems in Hadoop.
3. Hot-deployable rules: Eagle does not provide a lot of charts; instead it allows users to write ad-hoc rules and hot deploy them.
4. Metadata driven: here metadata includes policies, event schemas, UI components, etc.
5. Extensibility: Eagle can't succeed alone; it has to be integrated with other systems, for example for data classification and policy enforcement.
6. Monolithic Storm topology: application pre-processing runs together with the alert engine.

Page 12

Example 1: Integration with HDFS AUDIT log

• Ingestion: KafkaLog4jAppender + Kafka, or Logstash + Kafka
• Partition: by user
• Pre-processing: sensitivity join, command re-assembler

[Diagram: Namenode audit log → Kafka Partition_1 … Partition_N → Storm Kafka spout → events hashed by user (User1, User2, …) → Alert Executor_1 … Executor_K]

Page 13

Data Classification - HDFS

• Browse the HDFS file system
• Batch import sensitivity metadata through the Eagle API
• Manually mark sensitivity in the Eagle UI

Page 14

One user command generates multiple HDFS audit events. Eagle does reverse engineering to figure out the original user command.

Example:

COPYFROMLOCAL_PATTERN =
  "every a = eventStream[cmd=='getfileinfo'] " +
  "-> b = eventStream[cmd=='getfileinfo' and user==a.user and src==str:concat(a.src,'._COPYING_')] " +
  "-> c = eventStream[cmd=='create' and user==a.user and src==b.src] " +
  "-> d = eventStream[cmd=='getfileinfo' and user==a.user and src==b.src] " +
  "-> e = eventStream[cmd=='delete' and user==a.user and src==a.src] " +
  "-> f = eventStream[cmd=='rename' and user==a.user and src==b.src and dst==a.src]"

2015-11-20 00:06:47,090 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private dst=null perm=null proto=rpc
2015-11-20 00:06:47,185 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private._COPYING_ dst=null perm=null proto=rpc
2015-11-20 00:06:47,254 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=create src=/tmp/private._COPYING_ dst=null perm=root:hdfs:rw-r--r-- proto=rpc
2015-11-20 00:06:47,289 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=getfileinfo src=/tmp/private._COPYING_ dst=null perm=null proto=rpc
2015-11-20 00:06:47,609 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=delete src=/tmp/private dst=null perm=null proto=rpc
2015-11-20 00:06:47,624 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/10.0.2.15 cmd=rename src=/tmp/private._COPYING_ dst=/tmp/private perm=root:hdfs:rw-r--r-- proto=rpc

User Command Re-assembly
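As an illustration of the re-assembly idea, the sketch below (a simplified re-implementation for clarity, not Eagle's actual code, which expresses this as a Siddhi pattern) parses audit lines like the ones above and checks the six-event copyFromLocal sequence for a single user:

```python
import re

# Hypothetical sketch: parse HDFS audit log lines and detect the
# getfileinfo -> getfileinfo(._COPYING_) -> create -> getfileinfo ->
# delete -> rename sequence that one `copyFromLocal` command produces.

AUDIT_RE = re.compile(
    r"ugi=(?P<user>\S+).*?cmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)\s+dst=(?P<dst>\S+)"
)

def parse(line):
    """Extract user, cmd, src, dst from one audit line (None if no match)."""
    m = AUDIT_RE.search(line)
    return m.groupdict() if m else None

def is_copy_from_local(events):
    """Check the six-event copyFromLocal sequence for a single user."""
    if len(events) != 6:
        return False
    a, b, c, d, e, f = events
    tmp = a["src"] + "._COPYING_"          # temporary file used during copy
    same_user = all(ev["user"] == a["user"] for ev in events)
    return (same_user
            and a["cmd"] == "getfileinfo"
            and b["cmd"] == "getfileinfo" and b["src"] == tmp
            and c["cmd"] == "create" and c["src"] == tmp
            and d["cmd"] == "getfileinfo" and d["src"] == tmp
            and e["cmd"] == "delete" and e["src"] == a["src"]
            and f["cmd"] == "rename" and f["src"] == tmp and f["dst"] == a["src"])
```

The real Siddhi pattern additionally tolerates interleaved events from other commands; this sketch only shows the matching conditions.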

Page 15

• Policy evaluation is stateful (one user's data has to go to one physical bolt)
• Partition by user all the way (hash)
• Users are not balanced at all
• Greedy algorithm: https://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm

Data Skew Problem
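The greedy heuristic linked above can be sketched as follows (illustrative Python, not Eagle code; `user_counts` is an assumed per-user event-count map): assign each user, heaviest traffic first, to the currently lightest bolt.

```python
# Greedy (longest-processing-time) partitioning sketch for balancing
# skewed per-user traffic across a fixed number of alert bolts.

def greedy_partition(user_counts, num_bolts):
    """Return (user -> bolt index, per-bolt load) assignments."""
    loads = [0] * num_bolts
    assignment = {}
    # heaviest users first, each placed on the currently lightest bolt
    for user, count in sorted(user_counts.items(), key=lambda kv: -kv[1]):
        bolt = loads.index(min(loads))
        assignment[user] = bolt
        loads[bolt] += count
    return assignment, loads
```

This gives a 4/3-approximation of the optimal balance, which is usually good enough to keep one hot user from overloading a single bolt's neighbors.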

Page 16

Policy weight is not even:
• Regex policies are CPU intensive
• Window-based policies are memory intensive

Computation Skew Problem

Page 17

Example 2: Integration with Hive

• Ingestion: YARN API
• Partition: by user
• Pre-processing: sensitivity join, Hive SQL parser

Page 18

Data Classification - Hive

• Browse Hive databases/tables/columns
• Batch import sensitivity metadata through the Eagle API
• Manually mark sensitivity in the Eagle UI

Page 19

Eagle Alert Engine Overview

1 Runs a CEP engine on Apache Storm
• Uses the CEP engine as a library (Siddhi CEP)
• Evaluates policies on streamed data
• Rules are hot deployable

2 Inject policies dynamically
• API
• Intuitive UI

3 Scalability
• Computation: # of policies (policy placement)
• Storage: # of events (event partition)

4 Extensibility for policy enforcement
• Post-alert processing with plugins

Page 20

Run CEP Engine on Storm

[Diagram: each Storm bolt hosts multiple CEP workers; event1 fans out to the workers; a Policy Check Thread in each bolt pulls its share of policies (policy1,2,3 on one bolt, policy4,5,6 on another) and the event schema from the Policy Store via the Metadata API]

Page 21

Primitives – event, policy, alert

Raw Event
2015-10-11 01:00:00,014 INFO FSNamesystem.audit: allowed=true [email protected] (auth:KERBEROS) ip=/10.0.0.1 cmd=getfileinfo src=/tmp/private dst=null perm=null

Alert Event
Timestamp, cmd, src, dst, ugi, sensitivityType, securityZone

Policy
viewPrivate: from hdfsAuditLogEventStream[(cmd=='getfileinfo') and (src=='/tmp/private')]

Alert
2015-10-11 01:00:09[UTC] hdfsAuditLog viewPrivate user_tom /10.0.0.1 The policy "viewPrivate" has been detected with the below information: timestamp="1445993770932" allowed="true" cmd="getfileinfo" host="/10.0.0.1" sensitivityType="PRIVATE" securityZone="NA" src="/tmp/private" dst="NA" user="user_tom"
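Conceptually, a single-event policy like viewPrivate boils down to a predicate applied to each event, with matches emitted as alerts. A minimal sketch of that idea (Eagle actually compiles the policy text with Siddhi rather than running handwritten code like this):

```python
# Illustrative sketch of single-event policy evaluation: a policy is a
# predicate over an event dict; matching events become alert dicts.

def view_private(event):
    """Predicate equivalent to the viewPrivate policy on this slide."""
    return event["cmd"] == "getfileinfo" and event["src"] == "/tmp/private"

def evaluate(stream, policy, policy_id):
    """Yield an alert for every event in the stream that matches the policy."""
    for event in stream:
        if policy(event):
            yield {"policyId": policy_id, **event}
```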

Page 22

Event Schema

• Modeling event

Page 23

Policy Capabilities

1 Single-event evaluation
• threshold checks with various conditions

2 Event-window-based evaluation
• various window semantics (time/length, sliding/batch windows)
• comprehensive aggregation support

3 Correlation of multiple event streams
• SQL-like join

4 Pattern match and sequence
• a happens, followed by b

Powered by Siddhi 3.0.5, but Eagle provides dynamic capabilities and an intuitive API/UI
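A time-sliding window, one of the window semantics listed above, can be sketched like this (an illustrative data structure, not Eagle's or Siddhi's implementation): keep only events within the window of the newest event and aggregate over them.

```python
from collections import deque

# Illustrative time-sliding window: retain events whose timestamps fall
# within window_ms of the most recent event, and aggregate over them.

class SlidingTimeWindow:
    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.events = deque()          # (timestamp_ms, value) pairs

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # expire events older than the window relative to the newest one
        while self.events and timestamp - self.events[0][0] > self.window_ms:
            self.events.popleft()

    def count(self):
        return len(self.events)

    def total(self):
        return sum(v for _, v in self.events)
```

A threshold policy over such a window ("alert if more than N sensitive reads in 1 minute") then reduces to checking `count()` after each `add`.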

Page 24

Some policy examples

1 Namenode master/slave lag
from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.journaltransaction.lastappliedorwrittentxid"] -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host != a.host and (max(convert(a.value, "long")) + 100) <= max(convert(value, "long"))] within 5 min select a.host as hostA, a.value as transactIdA, b.host as hostB, b.value as transactIdB insert into tmp;

2 Namenode last checkpoint time
from hadoopJmxMetricEventStream[metric == "hadoop.namenode.dfs.lastcheckpointtime" and (convert(value, "long") + 18000000) < timestamp] select metric, host, value, timestamp, component, site insert into tmp;

3 Namenode HA state change
from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.hastate.active.count"] -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and (convert(a.value, "long") != convert(value, "long"))] within 10 min select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site insert into tmp;

Page 25

Define policy in UI and API

curl -u ${EAGLE_SERVICE_USER}:${EAGLE_SERVICE_PASSWD} -X POST -H 'Content-Type:application/json' \
  "http://${EAGLE_SERVICE_HOST}:${EAGLE_SERVICE_PORT}/eagle-service/rest/entities?serviceName=AlertDefinitionService" \
  -d '
[
  {
    "prefix": "alertdef",
    "tags": {
      "site": "sandbox",
      "application": "hadoopJmxMetricDataSource",
      "policyId": "capacityUsedPolicy",
      "alertExecutorId": "hadoopJmxMetricAlertExecutor",
      "policyType": "siddhiCEPEngine"
    },
    "description": "jmx metric",
    "policyDef": "{\"expression\":\"from hadoopJmxMetricEventStream[metric == \\\"hadoop.namenode.fsnamesystemstate.capacityused\\\" and convert(value, \\\"long\\\") > 0] select metric, host, value, timestamp, component, site insert into tmp; \",\"type\":\"siddhiCEPEngine\"}",
    "enabled": true,
    "dedupeDef": "{\"alertDedupIntervalMin\":10,\"emailDedupIntervalMin\":10}",
    "notificationDef": "[{\"sender\":\"[email protected]\",\"recipients\":\"[email protected]\",\"subject\":\"missing block found.\",\"flavor\":\"email\",\"id\":\"email_1\",\"tplFileName\":\"\"}]"
  }
]
'

1 Create policy using API
2 Create policy using UI

Page 26

Scalability

• Scale with # of events
• Scale with # of policies

Page 27

Eagle Service
As of 0.3.0, Eagle stores metadata and statistics in HBase, and supports Druid as a metric store.

1 Data to be stored

Metadata
• Policy
• Event schema
• Site/Application/UI features

Statistics
• # of events evaluated per second
• audit for policy changes

Raw data
• Druid for metrics
• HBase for M/R job/task data etc.
• ES for logs (future)

2 Storage

HBase
• Store metrics
• Store M/R job/task data
• Rowkey design for time-series data
• HBase coprocessor

Druid
• Consume data from Kafka

3 API/UI

HBase
• filter, groupby, sort, top

Druid
• Druid query API
• Dashboard in Eagle
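One common rowkey layout for time-series data in HBase, shown here purely to illustrate the design concern mentioned above (the 4-byte hash prefix and field widths are assumptions for the example, not Eagle's actual schema): a fixed-width metric-hash prefix spreads writes across regions, and an inverted timestamp makes the newest points for a metric sort first in a scan.

```python
import hashlib
import struct

# Illustrative time-series rowkey: 4-byte metric hash prefix followed by
# an 8-byte big-endian inverted timestamp (newest rows sort first).

LONG_MAX = 2**63 - 1

def rowkey(metric, timestamp_ms):
    prefix = hashlib.md5(metric.encode()).digest()[:4]   # spreads writes
    inverted = LONG_MAX - timestamp_ms                   # newest-first ordering
    return prefix + struct.pack(">q", inverted)
```

With this layout, a prefix scan on the metric hash returns that metric's points in reverse-chronological order without a server-side sort.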

Page 28

Alert Engine Limitations in Eagle 0.3

1 High cost of integrating
• Coding required for onboarding a new data source
• Monolithic topology for pre-processing and alerting
• One Storm topology even for a trivial data source: you have to write the topology and then deploy it

2 Not multi-tenant
• Alert engine is embedded into the application
• Many separate Storm topologies

3 Policy capability restricted by event partition
• Can't do ad-hoc group-by policy expressions (for example, from group-by user to group-by cmd)
• If traffic is partitioned by user, policies only support user-based group-by expressions

4 Correlation is not declarative
• Coding required for correlating existing data sources
• Can't declare correlations over multiple metrics

5 Stateful policy evaluation
• Failover when a bolt is down: how to replay one week of history data when a node is down

Page 29

Eagle Next Releases

Eagle 0.4
• Improve user experience
• Remotely start Storm topologies
• Metadata stored in RDBMS

Eagle 0.5
• Alert Engine as a platform
• No monolithic topology
• Declarative data source onboarding
• Easy correlation
• Support policies with any-field group-by
• Elastic capacity management

Page 30

USER PROFILE ALGORITHMS… Eigenvalue Decomposition

• Compute mean and variance
• Compute eigenvectors and determine principal components
• Normal data points lie near the first few principal components
• Abnormal data points lie further from the first few principal components and closer to later components
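The steps above can be sketched on toy 2-D feature vectors (pure Python for illustration; the real user-profile pipeline works on many features and trains offline): compute the covariance of the points, take its eigenvectors, and score each point by how far it lies along the minor (later) component.

```python
import math

# Toy eigen-decomposition anomaly sketch for 2-D points: points with a
# large component along the minor axis of the covariance matrix are
# flagged as abnormal.

def principal_axes(points):
    """Return (center, major unit vector, minor unit vector)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n        # var(x)
    c = sum((y - my) ** 2 for _, y in points) / n        # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / n  # cov(x, y)
    # closed-form eigenvalues of the symmetric 2x2 covariance [[a,b],[b,c]]
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = (a + c) / 2 + root                            # major eigenvalue
    if abs(b) < 1e-12:
        major = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        norm = math.hypot(b, lam1 - a)
        major = (b / norm, (lam1 - a) / norm)
    minor = (-major[1], major[0])                        # orthogonal unit vector
    return (mx, my), major, minor

def anomaly_score(point, center, minor):
    """Distance of the point from the major axis (projection on minor axis)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return abs(dx * minor[0] + dy * minor[1])
```

For points that follow the normal access pattern, the score stays near zero; a point far off the principal axis gets a large score and can be alerted on against a threshold.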

Page 31

USER PROFILE ARCHITECTURE

Page 32

Dev Mail List: [email protected]
http://eagle.incubator.apache.org
Github: https://github.com/apache/incubator-eagle
Twitter: @TheApacheEagle

Contributors welcome in Apache Eagle

Q & A