Big Data Ingestion with Kafka -> HDFS using Apache Apex

Chinmay Kolhatkar

Agenda

● Data Ingestion
● Use case: Kafka => HDFS
● Brief about Kafka
● Steps for development
● Let's code!!!

Data Ingestion

● Reading data in

● Storing in accessible location

● This is the beginning of the data pipeline (the write path)

● From here, the data is processed further (the read path)

Use case: Kafka => HDFS

● Reading from Kafka Messaging Queue

● Writing to HDFS

[Diagram: Kafka => HDFS]

Use case: Examples

● Log aggregation
  ○ Collect logs from various sources
  ○ Stream them as a single topic
  ○ Put all the logs in a centralized place, i.e. HDFS

● Real-time sensor data processing
  ○ Read sensor data from various sources
  ○ Process the stream
  ○ Dump the results to HDFS

Brief about Kafka

● Distributed Messaging System

● Fast Reads and Writes

● Can handle large number of clients

● Scalable, fault-tolerant, partitionable

● Persistent messages

Brief about Kafka (contd.)

● Terminology
  ○ Topic
  ○ Producer
  ○ Consumer
  ○ Broker
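To make the four terms concrete, here is a minimal in-memory sketch. It is a toy model and not the Kafka client API: `ToyBroker`, `ToyConsumer`, and `ToyKafkaDemo` are hypothetical names made up for illustration. The broker keeps one append-only log per topic, a producer is anything that calls `publish`, and each consumer tracks its own offset, which is why several consumers can read the same persisted messages independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a broker: keeps one append-only message log per topic. */
class ToyBroker {
    private final Map<String, List<String>> logs = new HashMap<>();

    /** Producer side: append a message to the topic and return its offset. */
    synchronized long publish(String topic, String message) {
        List<String> log = logs.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(message);
        return log.size() - 1;
    }

    /** Consumer side: read up to max messages starting at the given offset. */
    synchronized List<String> fetch(String topic, long offset, int max) {
        List<String> log = logs.getOrDefault(topic, new ArrayList<>());
        int from = (int) Math.max(0, Math.min(offset, log.size()));
        int to = Math.min(from + Math.max(0, max), log.size());
        return new ArrayList<>(log.subList(from, to));
    }
}

/** Toy consumer: remembers its own position (offset) in a single topic. */
class ToyConsumer {
    private final ToyBroker broker;
    private final String topic;
    private long offset = 0;

    ToyConsumer(ToyBroker broker, String topic) {
        this.broker = broker;
        this.topic = topic;
    }

    /** Fetch the next batch and advance the offset past what was read. */
    List<String> poll(int max) {
        List<String> batch = broker.fetch(topic, offset, max);
        offset += batch.size();
        return batch;
    }
}

class ToyKafkaDemo {
    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        // Two sources publishing to one topic, as in the log aggregation use case.
        broker.publish("logs", "web-1: GET /index");
        broker.publish("logs", "db-1: slow query");

        ToyConsumer first = new ToyConsumer(broker, "logs");
        ToyConsumer second = new ToyConsumer(broker, "logs");
        System.out.println(first.poll(10));  // reads everything published so far
        System.out.println(second.poll(1));  // starts from offset 0 on its own
    }
}
```

Messages are not removed when they are read, so a second consumer (or a restarted one that resets its offset) sees the same data. A real broker additionally splits each topic into partitions and replicates them, which this sketch leaves out.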

Steps for developing an application

1. Create a Maven project using the Apex Maven archetype
2. Add the required Maven dependencies
3. Add operators to the DAG
4. Add stream(s) to the DAG
5. Set properties in properties.xml
6. Compile and run
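Steps 3 and 4 are the heart of the application. In Apex they happen inside `StreamingApplication.populateDAG(DAG, Configuration)` through `dag.addOperator(...)` and `dag.addStream(...)`. The sketch below mirrors that wiring with plain Java so it runs without any Apex or Malhar dependency. Everything in it (`ToyDag`, `ToyOperator`, `ToySource`, `ToySink`, `ToyIngestionApp`, and the operator names `kafkaIn` and `fileOut`) is a hypothetical stand-in, not the Apex API; the source plays the role of a Kafka input operator and the sink the role of an HDFS file output operator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy operator: receives a tuple and may emit tuples to connected operators. */
abstract class ToyOperator {
    private final List<ToyOperator> downstream = new ArrayList<>();

    void connect(ToyOperator sink) {
        downstream.add(sink);
    }

    /** Pass a tuple to every operator connected downstream. */
    protected void emit(String tuple) {
        for (ToyOperator sink : downstream) {
            sink.process(tuple);
        }
    }

    abstract void process(String tuple);
}

/** Stand-in for a Kafka input operator: forwards each message it is given. */
class ToySource extends ToyOperator {
    @Override
    void process(String tuple) {
        emit(tuple);
    }
}

/** Stand-in for a file output operator: collects lines instead of writing to HDFS. */
class ToySink extends ToyOperator {
    final List<String> lines = new ArrayList<>();

    @Override
    void process(String tuple) {
        lines.add(tuple);
    }
}

/** Toy DAG: named operators plus named streams that connect them. */
class ToyDag {
    private final Map<String, ToyOperator> operators = new LinkedHashMap<>();
    private final List<String> streams = new ArrayList<>();

    /** Step 3: register an operator under a unique name. */
    <T extends ToyOperator> T addOperator(String name, T operator) {
        if (operators.putIfAbsent(name, operator) != null) {
            throw new IllegalArgumentException("duplicate operator: " + name);
        }
        return operator;
    }

    /** Step 4: connect the source operator's output to the sink's input. */
    void addStream(String name, ToyOperator source, ToyOperator sink) {
        source.connect(sink);
        streams.add(name);
    }
}

class ToyIngestionApp {
    /** Builds the two-operator pipeline and pushes the given messages through it. */
    static List<String> run(List<String> topicMessages) {
        ToyDag dag = new ToyDag();
        ToySource kafkaIn = dag.addOperator("kafkaIn", new ToySource());
        ToySink fileOut = dag.addOperator("fileOut", new ToySink());
        dag.addStream("data", kafkaIn, fileOut);

        for (String message : topicMessages) {
            kafkaIn.process(message);
        }
        return fileOut.lines;
    }
}
```

In a real application the operators would come from the Malhar library, and the names given to `addOperator` matter because the entries in properties.xml (step 5) refer to operators by those names.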

Summary

● Ease of development using Apex

● Reusable Malhar components

● Fault-tolerant, Scalable

● Reduced Time to Production

Resources

Apache Apex Meetup

• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  o @ApacheApex; Follow - https://twitter.com/apacheapex
  o @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
  o https://www.datatorrent.com/product/startup-accelerator/
