![Page 1: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/1.jpg)
Hadoop Tutorials
Daniel Lanza
Zbigniew Baranowski
![Page 2: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/2.jpg)
4 sessions
• Hadoop Foundations (today)
• Data Ingestion (20-July)
• Spark (3-Aug)
• Data analytic tools and techniques (31-Aug)
![Page 3: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/3.jpg)
Hands-on setup
• 12-node virtualized cluster
– 8 GB of RAM, 4 cores per node
– 20 GB of SSD storage per node
• Access (haperf10[1-12].cern.ch)
– Everybody who subscribed should have access
– Try: ssh haperf10* 'hdfs dfs -ls'
• List of commands and queries to be used
$> ssh haperf10*
$> kinit
$> git clone https://:@gitlab.cern.ch:8443/db/hadoop-tutorials-2016.git
or
$> sh /tmp/clone-hadoop-tutorials-repo
• Alternative environment: http://www.cloudera.com/downloads/quickstart_vms/5-7.html
![Page 4: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/4.jpg)
Recap of the 1st session
• A framework for large scale data processing
• Data locality (shared nothing) – scales out
[Figure: shared-nothing cluster. Node 1, Node 2, … Node X each have their own memory, CPU and disks and are connected by an interconnect network.]
![Page 5: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/5.jpg)
Recap of the 1st session
• Hadoop is a distributed platform for large-scale data processing
• For aggregations and reporting SQL is very handy – no need to be a Java expert
• In order to achieve good performance and optimal resource utilization
– Partition your data – spread it across multiple directories
– Use compact data formats (Avro or Parquet)
![Page 6: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/6.jpg)
Data Ingestion to Hadoop
![Page 7: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/7.jpg)
Goals for today
• What are the challenges in storing data on Hadoop
• How to decrease data latency – ingestion in near-real-time
• How to ensure scalability and no data losses
• Learn about commonly used ingestion tools
![Page 8: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/8.jpg)
What are the challenges?
• Variety of data sources
– Databases
– Web
– REST
– Logs
– Whatever…
• Not all of them necessarily produce files…
• HDFS is a file system, not a database
– You need to store files
• Extraction-Transformation-Loading tools needed
[Figure labels: streaming data; files in batch]
![Page 9: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/9.jpg)
Data ingestion types
• Batch ingestion
– Data are already produced and available to store on Hadoop (archive logs, files produced by external systems, RDBMS)
– Typically big chunks of data
• Real time ingestion
– Data are continuously produced
– Streaming sources
![Page 10: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/10.jpg)
Batch ingestion
• hdfs dfs -put or HDFS API
– sends a file from the local system to HDFS
– the file is sent sequentially
• Custom programs using the HDFS API
• Kite SDK
– sends (text) files and encodes them in Avro or Parquet, or stores them in HBase
– multithreaded
• Apache Sqoop – loading data from external relational databases
![Page 11: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/11.jpg)
About Kite
• High-level data API for Hadoop
• Two steps to store your data
– Create dataset – configure how to store the data
• Data schema, partitioning strategy
• File format: JSON, Parquet, Avro, HBase
• Dataset metadata: on HDFS or in Hive (as a table)
– Import the data
• From local file system, or HDFS
![Page 12: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/12.jpg)
KiteSDK – Hands-on
• Loading CSV data to HDFS in Parquet format
0) get the CSV data (hadoop-tutorials-2016/2_data_ingestion/0_batch_ingestion/kite)
$ hdfs dfs -get /tmp/ratings.csv .
1) infer schema from the data (script: ./1_get_schema)
$ kite-dataset csv-schema ratings.csv --record-name ratings -o ratings.avsc
$ cat ratings.avsc
2) create data partitioning policy (script: ./2_create_part_file)
$ echo '[{"type": "year", "source": "timestamp"}]' > partition.policy.json
3) create a datastore on HDFS (script: ./3_create_datastore)
$ kite-dataset create dataset:hdfs:/user/zbaranow/datasets/ratings --schema ratings.avsc \
    --format parquet --partition-by partition.policy.json
4) load the data (script: ./4_load_data)
$ kite-dataset csv-import ratings.csv --delimiter ',' \
    dataset:hdfs:/user/zbaranow/datasets/ratings
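The partition policy file is plain JSON, so quoting mistakes in a shell `echo` are easy to make. A minimal sketch of generating the same one-entry policy from Python follows; the helper name `write_partition_policy` is hypothetical, and only the `type`/`source` fields shown on the slide are assumed.

```python
import json
import tempfile
from pathlib import Path


def write_partition_policy(path, source_field="timestamp", partition_type="year"):
    """Write a one-entry partition policy like the one echoed on the slide."""
    policy = [{"type": partition_type, "source": source_field}]
    # json.dumps guarantees valid JSON with straight double quotes.
    Path(path).write_text(json.dumps(policy, indent=2) + "\n", encoding="utf-8")
    return policy


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / "partition.policy.json"
    write_partition_policy(target)
    print(target.read_text(encoding="utf-8"))
```

The resulting file can be passed to `--partition-by` in the same way as the hand-written one.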
![Page 13: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/13.jpg)
About Apache Sqoop
• Tool to transfer data between structured databases and Hadoop
• Sqoop tasks are implemented as map reduce jobs – can scale on the cluster
[Figure: Sqoop Import moves data over JDBC from a database (MySQL, Oracle, PostgreSQL, DB2) into Hadoop (files, Hive, HBase); Sqoop Export moves data in the opposite direction.]
![Page 14: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/14.jpg)
Tips for Sqoop
• For big data exports (> 1 GB) use dedicated connectors for the target database
– They contain logic that understands how to read data efficiently from the target database system
• Do not use too many mappers/sessions
– An excessive number of concurrent sessions can kill a database
– max 10 mappers/sessions
• Remember: Sqoop does not automatically retransfer updated data
![Page 15: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/15.jpg)
How to run a Sqoop job
Example (target Oracle):
# Import from the database into HDFS:
#   --direct       use the dedicated (direct) connector
#   --username     database user
#   --table        table name to be imported
#   -P             prompt for the password
#   --num-mappers  number of parallel sessions
#   --target-dir   target HDFS directory
sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@itrac5110:10121/PIMT_RAC51.cern.ch \
  --username meetup \
  --table meetup_data \
  -P \
  --num-mappers 2 \
  --target-dir meetup_data_sqooop
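When Sqoop is launched from a script, it can help to assemble the argument list in one place and enforce the "max 10 mappers/sessions" tip there. The sketch below only builds the command line and does not run it; the function name `sqoop_import_args` and the constant `MAX_MAPPERS` are hypothetical, while the flags are the ones used in the example above.

```python
import shlex

MAX_MAPPERS = 10  # upper bound taken from the "Tips for Sqoop" slide


def sqoop_import_args(connect, username, table, target_dir, num_mappers=2, direct=True):
    """Build the argv for a 'sqoop import' call without executing it."""
    if not 1 <= num_mappers <= MAX_MAPPERS:
        raise ValueError(f"num_mappers must be between 1 and {MAX_MAPPERS}")
    args = ["sqoop", "import"]
    if direct:
        args.append("--direct")
    args += [
        "--connect", connect,
        "--username", username,
        "--table", table,
        "-P",  # ask for the password interactively
        "--num-mappers", str(num_mappers),
        "--target-dir", target_dir,
    ]
    return args


if __name__ == "__main__":
    argv = sqoop_import_args(
        "jdbc:oracle:thin:@itrac5110:10121/PIMT_RAC51.cern.ch",
        "meetup", "meetup_data", "meetup_data_sqooop",
    )
    print(shlex.join(argv))
```

Passing the list to `subprocess.run` avoids the line-continuation problems of a hand-typed multi-line command.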
![Page 16: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/16.jpg)
Sqoop hands-on
• Run a Sqoop job to export meetup data from an Oracle database
– in a text format
– (custom) incremental import to Parquet
Note: built-in incremental import jobs do not support direct connectors
cd hadoop-tutorials-2016/2_data_ingestion/0_batch_ingestion/sqoop
./1_run_sqoop_import
./2_create_sqoop_job
./3_run_job
![Page 17: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/17.jpg)
‘Real’ time ingestion to HDFS
• More challenging than batch ingestion
– ETL on the fly
• There is always a latency when storing to HDFS
– data streams have to be materialized in files
– creating a file per stream event will kill HDFS
– events have to be written in groups
![Page 18: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/18.jpg)
What are the challenges ?
• Hadoop works efficiently with big files
– (larger than a block size = 128 MB)
– placing big data in small files will degrade processing performance (typically one worker per file)
– and reduce HDFS capacity
[Figure: log-log plot of cluster capacity (TB) versus average file size (MB)]
File meta size = ~125 B
Directory meta size = ~155 B
Block meta size = ~184 B
NameNode fs image memory available = 20 GB
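The relation between average file size and usable capacity can be approximated from the numbers above. This is a rough sketch under stated assumptions, not the exact model behind the plot: it ignores directory metadata and replication, takes 1 GB as 1024^3 bytes, and charges each file one file entry plus one block entry per 128 MB block. The function name `max_capacity_tb` is hypothetical.

```python
import math

FILE_META_BYTES = 125        # per-file metadata, from the slide
BLOCK_META_BYTES = 184       # per-block metadata, from the slide
BLOCK_SIZE_MB = 128          # HDFS block size, from the slide
NAMENODE_MEMORY_BYTES = 20 * 1024**3  # 20 GB available for the fs image


def max_capacity_tb(avg_file_size_mb):
    """Estimate how many TB fit before NameNode metadata memory runs out."""
    blocks_per_file = max(1, math.ceil(avg_file_size_mb / BLOCK_SIZE_MB))
    bytes_per_file = FILE_META_BYTES + blocks_per_file * BLOCK_META_BYTES
    max_files = NAMENODE_MEMORY_BYTES // bytes_per_file
    return max_files * avg_file_size_mb / 1024**2  # MB -> TB


if __name__ == "__main__":
    for size_mb in (0.1, 1, 10, 100, 1000, 10000):
        print(f"{size_mb:>8} MB per file -> {max_capacity_tb(size_mb):,.1f} TB")
```

Under these assumptions the addressable capacity grows roughly linearly with the average file size up to the block size, which is why many small files waste NameNode memory.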
![Page 19: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/19.jpg)
Solutions for low latency (1)
• 1) Writing small files to HDFS
– Decreases data latency
– Creates many files
• 2) Compacting them periodically
[Figure: a stream source sends events to a data sink, which writes many small files (File 1 … File 13) to HDFS; the small files are later merged into one big file.]
![Page 20: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/20.jpg)
Compacting examples
• For text files (just merging with MR)
hadoop jar hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input meetup-data/2016/07/13 \
  -output meetup-data/2016/07/13/merged \
  -mapper cat \
  -reducer cat
• For parquet/avro (using Spark)
val df = sqlContext.parquetFile("meetup-data/2016/07/13/*.parquet")
val dfp = df.repartition(1)
dfp.saveAsParquetFile("meetup-data/2016/07/13/merge")
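The idea of compaction can be tried out on a local directory before touching HDFS. The sketch below is only a local-filesystem analogy of the text-file case: it concatenates the small files in name order into one file and removes the originals. The function name `compact_text_files` and the output name `merged.txt` are hypothetical, and unlike the streaming job, which passes lines through the shuffle and may reorder them, this version keeps the original order.

```python
import tempfile
from pathlib import Path


def compact_text_files(directory, merged_name="merged.txt"):
    """Concatenate all small files in a directory into a single file."""
    directory = Path(directory)
    merged = directory / merged_name
    # Collect the inputs first so the output file is never read as an input.
    parts = sorted(p for p in directory.iterdir() if p.is_file() and p.name != merged_name)
    with merged.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    # Remove the small files only after the merged file is fully written.
    for part in parts:
        part.unlink()
    return merged


if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    for i in range(3):
        (work / f"part-{i}.txt").write_bytes(f"event {i}\n".encode())
    print(compact_text_files(work).read_bytes().decode())
```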
![Page 21: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/21.jpg)
Solutions for low latency (2)
• 1) Stage data into staging buffers
– Makes the data immediately accessible
• 2) Flush buffers periodically to HDFS files
– Requires two access paths to the data (buffers + HDFS)
[Figure: a stream source sends events to a data sink, which holds them in a staging area and flushes them to a big file on HDFS.]
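A minimal in-memory sketch of the staging approach, assuming events are byte strings and that a flush is triggered by an event count or a byte threshold. The class `StagingBuffer` and its attribute names are hypothetical; each entry of `flushed_files` merely stands in for one big HDFS file.

```python
class StagingBuffer:
    """Keep recent events in memory and flush them in groups."""

    def __init__(self, max_events=100, max_bytes=1_000_000):
        self.max_events = max_events
        self.max_bytes = max_bytes
        self.pending = []          # events not yet flushed
        self.pending_bytes = 0
        self.flushed_files = []    # each item: list of events in one "file"

    def append(self, event):
        self.pending.append(event)
        self.pending_bytes += len(event)
        if len(self.pending) >= self.max_events or self.pending_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        # Write all pending events as one group, then start a new buffer.
        if self.pending:
            self.flushed_files.append(self.pending)
            self.pending = []
            self.pending_bytes = 0

    def read_all(self):
        """Use both access paths: flushed files first, then the live buffer."""
        events = [e for group in self.flushed_files for e in group]
        return events + list(self.pending)
```

A reader that looks only at `flushed_files` would miss the newest events, which is the "two access paths" cost named on the slide.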
![Page 22: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/22.jpg)
How to sink data streams to HDFS
• There are specialized tools
– Apache Flume
– LinkedIn Gobblin
– Apache NiFi
– and more
![Page 23: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/23.jpg)
Apache Flume
• Data transfer engine driven by events
– Flume events
• Headers
• Body (byte array)
• Data can be
– Collected
– Processed (interceptors)
– Aggregated
• Main features
– Distributed
• Agents can be placed on different machines
– Reliable
• Transactions
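To make the event model concrete, here is a small Python model of an event (headers plus a byte-array body) and of an interceptor that adds a timestamp header. It is an illustration only, not Flume's Java API; the names `Event`, `add_timestamp` and `run_interceptors` are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    body: bytes                                   # payload as a byte array
    headers: dict = field(default_factory=dict)   # string key/value metadata


def add_timestamp(event, now_ms=None):
    """Interceptor sketch: stamp the event with milliseconds since the epoch."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    event.headers["timestamp"] = str(now_ms)
    return event


def run_interceptors(event, interceptors):
    """Apply interceptors in order, as a source would before the channel."""
    for interceptor in interceptors:
        event = interceptor(event)
    return event


if __name__ == "__main__":
    print(run_interceptors(Event(b"hello"), [add_timestamp]))
```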
![Page 24: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/24.jpg)
Flume agent
• An agent has at least
– A source
• Files/Logs/Directories, Kafka, Twitter
• STDOUT from a program
• …
• Custom (e.g. JDBCSource)
• Note: could have interceptors
– A channel
• Memory
• File
• Kafka
• JDBC
• …
• Custom
– A sink
• HDFS, HBase
• Kafka, ElasticSearch
• …
• Custom
[Figure: flume-agent with Source → Channel → Sink]
![Page 25: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/25.jpg)
Flume data flow
• Multiple agents can be deployed
– On the same machine or in a distributed manner
[Figure: several flume-agents (Source → Channel → Sink) feed one downstream flume-agent for consolidation/aggregation.]
![Page 26: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/26.jpg)
Flume data flow
• Agents can have more than one data flow
– Replication
– Multiplexing
[Figure: one agent whose source feeds three channels, each drained by its own sink.]
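The difference between replication and multiplexing can be sketched with channels modelled as plain lists. This is an illustration of the two routing policies, not Flume code; the function names and the dictionary-based event are hypothetical.

```python
def replicate(event, channels):
    """Replication: every channel receives a copy of the event."""
    for channel in channels.values():
        channel.append(event)


def multiplex(event, channels, header, mapping, default):
    """Multiplexing: pick one channel based on the value of a header."""
    channel_name = mapping.get(event["headers"].get(header), default)
    channels[channel_name].append(event)


if __name__ == "__main__":
    channels = {"hdfs_channel": [], "local_channel": []}
    replicate({"headers": {}, "body": b"to everyone"}, channels)
    multiplex({"headers": {"type": "archive"}, "body": b"to one"},
              channels, "type", {"archive": "hdfs_channel"}, "local_channel")
    print({name: len(events) for name, events in channels.items()})
```

Replication duplicates every event in each channel, whereas multiplexing routes each event to a single channel.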
![Page 27: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/27.jpg)
Flume hands-on
• Flume chat
[Figure: chat-client agents on haperf10*.cern.ch (NetCat source → memory channel → Avro sink) send events to the gateway-agent on haperf101.cern.ch (Avro source → two memory channels → HDFS sink and local file sink).]
![Page 28: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/28.jpg)
Flume hands-on
• Flume chat gateway
# Name the components on this agent
gateway-agent.sources = avro_source
gateway-agent.channels = hdfs_channel local_channel
gateway-agent.sinks = hdfs_sink local_sink
# Configure source
gateway-agent.sources.avro_source.type = avro
gateway-agent.sources.avro_source.selector.type = replicating
gateway-agent.sources.avro_source.channels = hdfs_channel local_channel
gateway-agent.sources.avro_source.bind = 0.0.0.0
gateway-agent.sources.avro_source.port = 12123
gateway-agent.sources.avro_source.interceptors = addtimestamp
gateway-agent.sources.avro_source.interceptors.addtimestamp.type = AddTimestampInterceptor$Builder
# Use a channel which buffers events in memory
gateway-agent.channels.hdfs_channel.type = memory
gateway-agent.channels.hdfs_channel.capacity = 10000
gateway-agent.channels.hdfs_channel.transactionCapacity = 100
gateway-agent.channels.local_channel.type = memory
gateway-agent.channels.local_channel.capacity = 10000
gateway-agent.channels.local_channel.transactionCapacity = 100
# Describe the HDFS sink
gateway-agent.sinks.hdfs_sink.type = hdfs
gateway-agent.sinks.hdfs_sink.channel = hdfs_channel
gateway-agent.sinks.hdfs_sink.hdfs.fileType = DataStream
gateway-agent.sinks.hdfs_sink.hdfs.path = hdfs://haperf100.cern.ch:8020/tmp/flume-chat-data/
gateway-agent.sinks.hdfs_sink.hdfs.rollCount = 100
gateway-agent.sinks.hdfs_sink.hdfs.rollSize = 1000000
# Describe the local files sink
gateway-agent.sinks.local_sink.type = file_roll
gateway-agent.sinks.local_sink.channel = local_channel
gateway-agent.sinks.local_sink.sink.directory = /tmp/flume-chat-data/
gateway-agent.sinks.local_sink.batchSize = 10
![Page 29: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/29.jpg)
Flume hands-on
• Flume chat gateway (AddTimestampInterceptor)
![Page 30: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/30.jpg)
Flume hands-on: chat gateway
• Clone the repository and go to the gateway directory
git clone https://:@gitlab.cern.ch:8443/db/hadoop-tutorials-2016.git
cd hadoop-tutorials-2016/2_data_ingestion/1_flume_chat_gateway/
• Compile and run it
./compile
./run-agent
• After running the client we can see the data
cat /tmp/flume-chat-data/*
hdfs dfs -cat "/tmp/flume-chat-data/*"
![Page 31: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/31.jpg)
Flume hands-on
• Flume chat client
# Name the components on this agent
chat-client.sources = netcat_source
chat-client.channels = memory_channel
chat-client.sinks = avro_sink
# Configure source
chat-client.sources.netcat_source.type = netcat
chat-client.sources.netcat_source.channels = memory_channel
chat-client.sources.netcat_source.bind = 0.0.0.0
chat-client.sources.netcat_source.port = 1234
chat-client.sources.netcat_source.interceptors = adduser
chat-client.sources.netcat_source.interceptors.adduser.type = AddUserInterceptor$Builder
# Use a channel which buffers events in memory
chat-client.channels.memory_channel.type = memory
chat-client.channels.memory_channel.capacity = 1000
chat-client.channels.memory_channel.transactionCapacity = 100
# Describe the sink
chat-client.sinks.avro_sink.type = avro
chat-client.sinks.avro_sink.channel = memory_channel
chat-client.sinks.avro_sink.hostname = haperf101.cern.ch
chat-client.sinks.avro_sink.port = 12123
![Page 32: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/32.jpg)
Flume hands-on
• Flume chat client (AddUserInterceptor)
![Page 33: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/33.jpg)
Flume hands-on: chat client
• Go to the client directory (on any machine)
ssh haperf10*.cern.ch
cd hadoop-tutorials-2016/2_data_ingestion/1_flume_chat_client/
• Compile and run it
./compile
./run-agent
• Initialize chat (different terminal)
cd hadoop-tutorials-2016/2_data_ingestion/1_flume_chat_client/
./init_chat # Ctrl + ] and quit to exit
![Page 34: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/34.jpg)
Is Apache Flume enough?
• Yes, for simple use cases
• Performance is limited by a single machine
– Consolidating multiple sources requires a lot of resources
– Multiplexing of sinks duplicates data in channels
• If HDFS is down for maintenance
– Flume channels can fill up quickly
• It does not provide high availability
– The Flume agent machine is a single point of failure
– If it breaks, we will lose data
• Solution?
– Stage data in a reliable distributed event broker
– Apache Kafka
![Page 35: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/35.jpg)
Apache Kafka
• Message broker
– Topics
• Main features
– Distributed
• Instances can be deployed on different machines
– Scalable
• A topic can have many partitions
– Reliable
• Partitions are replicated
• Messages can be acknowledged
![Page 36: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/36.jpg)
Apache Kafka
• Topics
– Partitions
• Replicated
• One replica is the leader
• A message is written to the partition determined by its message key
• Data retention can be limited by size or time
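Key-based partitioning can be illustrated with a hash of the key taken modulo the number of partitions. This is a sketch of the idea only: Kafka clients use their own hash function, and CRC32 is used here purely as a deterministic stand-in. The function name `choose_partition` is hypothetical.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key (bytes) to a partition index in [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("a topic needs at least one partition")
    # The same key always hashes to the same partition, which keeps the
    # messages for that key in order within one partition.
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    for user in (b"alice", b"bob", b"carol"):
        print(user.decode(), "->", choose_partition(user, 4))
```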
![Page 37: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/37.jpg)
Apache Kafka
• Consumer groups
– Offset is kept in Zookeeper
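The role of the stored offset can be sketched as follows: each consumer group remembers, per partition, the position of the next message to read, so groups progress independently over the same log. In this illustration a dictionary stands in for ZooKeeper and a list for one partition; `OffsetStore` and `poll` are hypothetical names, not the Kafka client API.

```python
class OffsetStore:
    """Stand-in for the offset storage, keyed by (group, partition)."""

    def __init__(self):
        self._offsets = {}

    def committed(self, group, partition):
        return self._offsets.get((group, partition), 0)

    def commit(self, group, partition, offset):
        self._offsets[(group, partition)] = offset


def poll(log, store, group, partition, max_messages):
    """Read the next batch for a group and advance its committed offset."""
    start = store.committed(group, partition)
    batch = log[start:start + max_messages]
    store.commit(group, partition, start + len(batch))
    return batch


if __name__ == "__main__":
    log = [f"m{i}" for i in range(5)]
    store = OffsetStore()
    print(poll(log, store, "hdfs-writers", 0, 2))
    print(poll(log, store, "hdfs-writers", 0, 2))
    print(poll(log, store, "monitoring", 0, 3))
```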
![Page 38: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/38.jpg)
Apache Kafka – how to use it
• Flume out-of-the-box can use Kafka as
– Source, Channel, Sink
• Other ingestion or processing tools support Kafka
– Spark, Gobblin, Storm…
• Custom implementation of producer and consumer
– Java API, Scala, C++, Python
![Page 39: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/39.jpg)
How Kafka can improve data ingestion
• As a reliable big staging area
[Figure: several stream sources send events to the staging area; from there data is flushed periodically to big files on HDFS for batch processing, flushed immediately to indexed data for fast data access, and consumed directly for real-time stream processing.]
![Page 40: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/40.jpg)
Flume hands-on
• Flume chat to Kafka
[Figure: chat-client agents on haperf10*.cern.ch (NetCat source → memory channel → Kafka sink) publish to the Kafka topic flume-chat.]
![Page 41: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/41.jpg)
Flume hands-on
• Flume chat client to Kafka
# Name the components on this agent
chat-client.sources = netcat_source
chat-client.channels = memory_channel
chat-client.sinks = kafka_sink
# Configure source
chat-client.sources.netcat_source.type = netcat
chat-client.sources.netcat_source.channels = memory_channel
chat-client.sources.netcat_source.bind = 0.0.0.0
chat-client.sources.netcat_source.port = 1234
chat-client.sources.netcat_source.interceptors = adduser addtimestamp
chat-client.sources.netcat_source.interceptors.adduser.type = AddUserInterceptor$Builder
chat-client.sources.netcat_source.interceptors.addtimestamp.type = AddTimestampInterceptor$Builder
# Use a channel which buffers events in memory
chat-client.channels.memory_channel.type = memory
chat-client.channels.memory_channel.capacity = 1000
chat-client.channels.memory_channel.transactionCapacity = 100
# Describe the sink
chat-client.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
chat-client.sinks.kafka_sink.channel = memory_channel
chat-client.sinks.kafka_sink.brokerList = haperf111.cern.ch:9092,haperf107.cern.ch:9092
chat-client.sinks.kafka_sink.topic = flume-chat
chat-client.sinks.kafka_sink.batchSize = 1
![Page 42: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/42.jpg)
Flume hands-on: chat client to Kafka
• Go to the client directory (on any machine)
ssh haperf10*.cern.ch
cd hadoop-tutorials-2016/2_data_ingestion/2_flume_chat_client_to_kafka/
• Compile and run it
./compile
./run-agent
• Initialize chat (different terminal)
cd hadoop-tutorials-2016/2_data_ingestion/2_flume_chat_client_to_kafka/
./consume-kafka-topic &
./init_chat # Ctrl + ] and quit to exit
• Kill background process
./stop_kafka_consumer
![Page 43: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/43.jpg)
Streaming data
• Source
– Meetups are: neighbours getting together to learn something, do something, share something…
• Streaming API
– curl -s http://stream.meetup.com/2/rsvps
![Page 44: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/44.jpg)
Flume hands-on
• Kafka as persistent buffer
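[Figure: one htutorial-agent on haperf10*.cern.ch (StrAPI source → memory channel → Kafka sink) writes to Kafka; a second htutorial-agent (Kafka source → memory channel → HDFS sink) reads from Kafka and writes to HDFS.]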
![Page 45: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/45.jpg)
Flume hands-on
• Streaming data from Meetup to Kafka
[Figure: htutorial-agents on haperf10*.cern.ch (StrAPI source → memory channel → Kafka sink) publish to the Kafka topic meetup-data-<username>.]
![Page 46: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/46.jpg)
Flume hands-on
• Streaming data from Meetup to Kafka
# Name the components on this agent
htutorial-agent.sources = meetup_source
htutorial-agent.channels = memory_channel
htutorial-agent.sinks = kafka_sink
# Configure source
htutorial-agent.sources.meetup_source.type = StreamingAPISource
htutorial-agent.sources.meetup_source.channels = memory_channel
htutorial-agent.sources.meetup_source.url = http://stream.meetup.com/2/rsvps
htutorial-agent.sources.meetup_source.batch.size = 5
htutorial-agent.sources.meetup_source.interceptors = addtimestamp
htutorial-agent.sources.meetup_source.interceptors.addtimestamp.type = timestamp
# Use a channel which buffers events in memory
htutorial-agent.channels.memory_channel.type = memory
htutorial-agent.channels.memory_channel.capacity = 1000
htutorial-agent.channels.memory_channel.transactionCapacity = 100
# Describe the sink
htutorial-agent.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
htutorial-agent.sinks.kafka_sink.channel = memory_channel
htutorial-agent.sinks.kafka_sink.brokerList = haperf111.cern.ch:9092,haperf107.cern.ch:9092
htutorial-agent.sinks.kafka_sink.topic = meetup-data-<username>
htutorial-agent.sinks.kafka_sink.batchSize =
![Page 47: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/47.jpg)
Flume hands-on
• Streaming data from Meetup to Kafka
– StreamingAPISource
![Page 48: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/48.jpg)
Flume hands-on: Meetup to Kafka
• Go to the client directory (on any machine)
ssh haperf10*.cern.ch
cd hadoop-tutorials-2016/2_data_ingestion/3_meetup_to_kafka/
• Compile and run it
./compile
./run-agent
• Consume the Kafka topic (different terminal)
cd hadoop-tutorials-2016/2_data_ingestion/3_meetup_to_kafka/
./consume-kafka-topic
![Page 49: Daniel Lanza Zbigniew Baranowski€¦ · •For aggregations and reporting SQL is very handy –no need to be a Java expert ... •For parquet/avro (using Spark) hadoop jar hadoop-streaming.jar](https://reader033.fdocuments.net/reader033/viewer/2022042222/5ec95a76e92aff32af2fdde6/html5/thumbnails/49.jpg)
Flume hands-on
• From Kafka to partitioned data into HDFS
[Diagram: htutorial-agent on haperf10*.cern.ch connects a Kafka source (topic = meetup-data-<username>) to an HDFS sink through a Memory channel]
Flume hands-on
• From Kafka to partitioned data into HDFS
# Name the components on this agent
htutorial-agent.sources = kafka_source
htutorial-agent.channels = memory_channel
htutorial-agent.sinks = hdfs_sink
# Configure source
htutorial-agent.sources.kafka_source.type = org.apache.flume.source.kafka.KafkaSource
htutorial-agent.sources.kafka_source.channels = memory_channel
htutorial-agent.sources.kafka_source.zookeeperConnect = haperf104:2181,haperf105:2181
htutorial-agent.sources.kafka_source.topic = meetup-data-<username>
# Use a channel which buffers events in memory
htutorial-agent.channels.memory_channel.type = memory
htutorial-agent.channels.memory_channel.capacity = 10000
htutorial-agent.channels.memory_channel.transactionCapacity = 1000
# Describe the sink
htutorial-agent.sinks.hdfs_sink.type = hdfs
htutorial-agent.sinks.hdfs_sink.channel = memory_channel
htutorial-agent.sinks.hdfs_sink.hdfs.fileType = DataStream
htutorial-agent.sinks.hdfs_sink.hdfs.path = hdfs://haperf100.cern.ch:8020/user/<username>/meetup-data/year=%Y/month=%m/day=%d/
htutorial-agent.sinks.hdfs_sink.hdfs.rollCount = 100
htutorial-agent.sinks.hdfs_sink.hdfs.rollSize = 1000000
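The %Y, %m and %d escape sequences in hdfs.path are filled in from each event's timestamp header, which is how events land in per-day directories. The Python sketch below only illustrates that substitution; it is not Flume's implementation. The function name resolve_partition_path is hypothetical, and UTC is assumed for reproducibility, whereas the sink uses the agent's local time zone unless configured otherwise.

```python
from datetime import datetime, timezone


def resolve_partition_path(template: str, timestamp_ms: int) -> str:
    """Fill %Y, %m and %d in template from an epoch-millisecond timestamp.

    Hypothetical helper for illustration; UTC is an assumption made here.
    """
    when = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return (
        template.replace("%Y", when.strftime("%Y"))
        .replace("%m", when.strftime("%m"))
        .replace("%d", when.strftime("%d"))
    )


# Same layout as the hdfs.path above, without the host part.
template = "/user/<username>/meetup-data/year=%Y/month=%m/day=%d/"

# 1469016000000 ms is 2016-07-20 12:00:00 UTC.
print(resolve_partition_path(template, 1469016000000))
# prints /user/<username>/meetup-data/year=2016/month=07/day=20/
```

Directories named key=value, as in this layout, can later be treated as partitions by SQL engines such as Hive or Impala.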
Flume hands-on: Kafka to HDFS
• Go to client directory (on any machine)
ssh haperf10*.cern.ch
cd hadoop-tutorials-2016/2_data_ingestion/3_kafka_to_part_hdfs
• Run it
./run-agent
• Check data is landing (different terminal)
cd hadoop-tutorials-2016/2_data_ingestion/3_kafka_to_part_hdfs/
hdfs dfs -cat meetup-data/year=2016/month=07/day=20/*
Flume hands-on
• Can we simplify the previous architecture?
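[Diagram: two htutorial-agent instances on haperf10*.cern.ch; the first connects a StrAPI source to a Kafka sink through a Memory channel, the second connects a Kafka source to an HDFS sink through a Memory channel]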
Flume hands-on
• Kafka as channel
[Diagram: a single htutorial-agent on haperf10*.cern.ch connects a StrAPI source to an HDFS sink through a Kafka channel (topic = flume-channel-<username>)]
Flume hands-on
• Kafka as channel
# Name the components on this agent
htutorial-agent.sources = meetup_source
htutorial-agent.channels = kafka_channel
htutorial-agent.sinks = hdfs_sink
# Configure source
htutorial-agent.sources.meetup_source.type = StreamingAPISource
htutorial-agent.sources.meetup_source.channels = kafka_channel
htutorial-agent.sources.meetup_source.url = http://stream.meetup.com/2/rsvps
htutorial-agent.sources.meetup_source.batch.size = 5
# Use a channel which buffers events in Kafka
htutorial-agent.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
htutorial-agent.channels.kafka_channel.topic = flume-channel-<username>
htutorial-agent.channels.kafka_channel.brokerList = haperf111.cern.ch:9092,haperf107.cern.ch:9092
htutorial-agent.channels.kafka_channel.zookeeperConnect = haperf104:2181,haperf105:2181
# Describe the sink
htutorial-agent.sinks.hdfs_sink.type = hdfs
htutorial-agent.sinks.hdfs_sink.channel = kafka_channel
htutorial-agent.sinks.hdfs_sink.hdfs.fileType = DataStream
htutorial-agent.sinks.hdfs_sink.hdfs.path = hdfs://haperf100.cern.ch/user/<username>/meetup-data
htutorial-agent.sinks.hdfs_sink.hdfs.rollCount = 100
htutorial-agent.sinks.hdfs_sink.hdfs.rollSize = 1000000
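With hdfs.rollCount = 100 and hdfs.rollSize = 1000000, the sink starts a new file once the open file has reached either 100 events or 1000000 bytes. The Python sketch below simulates only these two triggers to show how a stream of events is split into files. The function name events_per_file and the sample event sizes are hypothetical, and the sink's time-based trigger (hdfs.rollInterval) is deliberately not modeled.

```python
def events_per_file(event_sizes, roll_count=100, roll_size=1_000_000):
    """Return how many events end up in each file.

    Simplified model: before writing an event, roll if the open file has
    already reached roll_count events or roll_size bytes. A threshold of 0
    disables that trigger.
    """
    files = []
    events_in_file = 0
    bytes_in_file = 0
    for size in event_sizes:
        count_hit = roll_count > 0 and events_in_file >= roll_count
        size_hit = roll_size > 0 and bytes_in_file >= roll_size
        if count_hit or size_hit:
            files.append(events_in_file)
            events_in_file = 0
            bytes_in_file = 0
        events_in_file += 1
        bytes_in_file += size
    if events_in_file:
        files.append(events_in_file)  # last, still-open file
    return files


# 250 small events (1500 bytes each, an assumed size): the count limit wins.
print(events_per_file([1500] * 250))      # prints [100, 100, 50]

# 3 large events (600000 bytes each): the size limit wins.
print(events_per_file([600_000] * 3))     # prints [2, 1]
```

Small thresholds like these produce many small files, which is convenient for a hands-on session but usually undesirable on a production HDFS cluster.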
Flume hands-on: Kafka as channel
• Go to client directory (on any machine)
ssh haperf10*.cern.ch
cd hadoop-tutorials-2016/2_data_ingestion/4_meetup_kafka_as_channel/
• Compile and run it
./compile
./run-agent
• See data coming (different terminal)
cd hadoop-tutorials-2016/2_data_ingestion/4_meetup_kafka_as_channel/
./consume-kafka-topic
hdfs dfs -cat "meetup-data/*"
Flume hands-on
• Streaming data from Meetup to HBase
[Diagram: htutorial-agent on haperf10*.cern.ch connects a StrAPI source to an HBase sink through a Kafka channel (table = meetup_flume_<username>)]
Flume hands-on
• Streaming data from Meetup to HBase
# Name the components on this agent
htutorial-agent.sources = meetup_source
htutorial-agent.channels = kafka_channel
htutorial-agent.sinks = hbase_sink
# Configure source
htutorial-agent.sources.meetup_source.type = StreamingAPISource
htutorial-agent.sources.meetup_source.channels = kafka_channel
htutorial-agent.sources.meetup_source.url = http://stream.meetup.com/2/rsvps
htutorial-agent.sources.meetup_source.batch.size = 5
# Use a channel which buffers events in Kafka
htutorial-agent.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
htutorial-agent.channels.kafka_channel.topic = flume-channel-<username>
htutorial-agent.channels.kafka_channel.brokerList = haperf111.cern.ch:9092,haperf107.cern.ch:9092
htutorial-agent.channels.kafka_channel.zookeeperConnect = haperf104:2181,haperf105:2181
# Describe the sink
htutorial-agent.sinks.hbase_sink.type = hbase
htutorial-agent.sinks.hbase_sink.channel = kafka_channel
htutorial-agent.sinks.hbase_sink.table = meetup_flume_<username>
htutorial-agent.sinks.hbase_sink.columnFamily = event
htutorial-agent.sinks.hbase_sink.batchSize = 10
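In these agent configurations a source is bound with the plural key channels, a sink with the singular key channel, and every channel referenced that way has to be listed under the agent's channels property. The Python sketch below checks that wiring for a properties text. It is an illustration only: the parser is simplified (no line continuations or escapes) and the function names parse_properties and undeclared_channels are hypothetical.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props, agent):
    """Return channels used by sources or sinks but missing from <agent>.channels."""
    declared = set(props.get(f"{agent}.channels", "").split())
    used = set()
    for key, value in props.items():
        parts = key.split(".")
        if len(parts) != 4 or parts[0] != agent:
            continue
        # <agent>.sources.<name>.channels holds a space-separated list.
        if parts[1] == "sources" and parts[3] == "channels":
            used.update(value.split())
        # <agent>.sinks.<name>.channel holds exactly one channel.
        if parts[1] == "sinks" and parts[3] == "channel":
            used.add(value)
    return sorted(used - declared)


config = """
# Wiring lines taken from the Meetup to HBase configuration
htutorial-agent.sources = meetup_source
htutorial-agent.channels = kafka_channel
htutorial-agent.sinks = hbase_sink
htutorial-agent.sources.meetup_source.channels = kafka_channel
htutorial-agent.sinks.hbase_sink.channel = kafka_channel
"""

print(undeclared_channels(parse_properties(config), "htutorial-agent"))  # prints []
```

A non-empty result would point at a typo such as a sink still referring to memory_channel after the channel was switched to Kafka.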
Flume hands-on: Meetup to HBase
• Go to client directory (on any machine)
ssh haperf10*.cern.ch
cd hadoop-tutorials-2016/2_data_ingestion/5_meetup_to_hbase/
• Compile, create table and run it
./compile
./create-hbase-table
./run-agent
• See data coming (different terminal)
hbase shell
> scan 'meetup_flume_<username>'
Summary
• There are many tools that can help you ingest data into Hadoop
• Batch ingestion is easy
– Sqoop, Kite, HDFS API
• Real-time ingestion is more complex
– 2 phases needed
• Flume + Kafka for reliable, scalable data ingestion
– can help to integrate data from multiple sources in near real time
– not only for Hadoop
Summary – when to use what
[Diagram: ingestion paths into an HDFS cluster (NameNode and DataNodes); labels: Apache Sqoop, Streaming data, Remote copy, Pre-processing, Files in batch]
+