Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014


Page 1: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Hadoop Demystified 101 ETL + Automation Smackdown

Learn Big Data: Learn manually, then ask “Which approach makes me the most valuable as a developer?”

Slides, code, YouTube, and resources at end

Page 2: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Q&A at end

Page 3: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Bio - Pete Carapetyan

• Java dev last 15 years, dev 20 years

• Grew up automating in a different industry

• Almost involuntary obsession with systems & automation

• Since 2000 as dataFundamentals, now a 2 man shop

Contact info at last slide

Page 4: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Special Skills - Special Snowflakes

• Let me show you these Hadoop basics.

• Then, we code for special snowflakes. (data sets)

• Thus we are more valuable, and can up our bill rates!

• This is Approach #1: Special Snowflake (manual)

Page 5: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

My 2013 Manual Hadoop Benchmark

• 15 ETL jobs [Partial scope]

• Brilliant, ninja level team

• 1 year of competitive NIH* copy paste spaghetti coding - AKA special snowflake approach

• Is this the best I can do?

*NIH: Not Invented Here

Page 6: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Hadoop sidebar: which Serialization Protocol?

• [text - native via SequenceFile]

• Binary protocols include

• Thrift (Facebook, Evernote)

• Protocol Buffers (Google)

• Avro (Hadoop author, Cloudera)

• What about character based?

• XML

• JSON

• etc

Less than complimentary view of Avro: http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro
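To make the character-based versus binary question concrete, here is a minimal sketch that encodes one record both ways. It is not the wire format of Thrift, Protocol Buffers, or Avro; Python's standard `struct` module stands in for a binary protocol, and the record, its field names, and the `LAYOUT` string are all invented for illustration.

```python
import json
import struct

# Hypothetical record, invented for this illustration.
record = {"id": 1234567, "temperature": 98.6, "active": True}

# Character-based: field names and values travel as text in every record.
as_json = json.dumps(record).encode("utf-8")

# Binary: a fixed layout agreed on ahead of time plays the role of the schema.
# "<" little-endian without padding, "i" 4-byte int, "d" 8-byte float, "?" 1-byte bool.
LAYOUT = "<id?"
as_binary = struct.pack(
    LAYOUT, record["id"], record["temperature"], record["active"]
)


def decode_binary(payload):
    """Rebuild the record; field names come from the schema, not the payload."""
    rec_id, temperature, active = struct.unpack(LAYOUT, payload)
    return {"id": rec_id, "temperature": temperature, "active": active}


print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as packed binary")
print(decode_binary(as_binary))
```

The trade-off shown is the usual one: the binary form is smaller but unreadable without the schema, while the text form describes itself in every record.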

Page 7: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Transform to Avro

• Not detailed in this talk

• Demo’d here as a binary

• Code listed at end of talk

Page 8: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

[Demo Basics of Hadoop ETL Job]

Page 9: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Whoops - [lots of moving parts!]

• What if I make a misteak?

• Dig through log files

• Obtuse messages

• Scripts for logs are critical

• Budget lots of time

• Error UI

Page 10: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

I don’t get it. What makes Hadoop so cool?

• Expands to thousands of machines

• Placement of my data across those machines (uses HDFS)

• Moves program to data, not data to program

• Tooling/ecosystem

• Much of which is now usable outside Hadoop

• Examples:

• Hive

• Pig

• Zookeeper

Page 11: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Map Reduce 101?

• Makes more sense as “MapShuffleReduce”

• API for handing the program to the data.

• Primary feature is the two-pass heuristic for dealing with data on clusters

• You can avoid understanding MapReduce if Hive is all you use :(

• Yes, MapReduce runs on systems other than Hadoop!
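The “MapShuffleReduce” idea can be sketched in a few lines of plain Python, with no Hadoop involved. This is a single-process illustration of the three steps, not the Hadoop MapReduce API; the function names and the sample lines are assumptions made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (key, value) pairs from one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group every value under its key, between the two passes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "Big data"])
print(counts)
```

On a real cluster the map and reduce functions run on the machines that hold the data, and the shuffle is the step the framework does for you.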

Page 12: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Special Snowflake Approach: Human drama!

What limitations of this manual, special-skills, special-snowflake approach do we observe?

Page 13: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

How To Un-Pack Either Approach?

What if we remove the human drama?

Page 20: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Now, what happens if we automate?

Automated Approach

Page 21: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Carrie

Our own internal project for automating big data.

Name inspired by the horror film…

Page 23: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

How to drive focus? The Phoenix Project

• Results, not drama

• Focus only on bottleneck

• Brent as bottleneck

Page 24: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Brent: The bad guy?

• Brent is a team’s best asset! Brent is a ninja. Brent is not the bad guy.

• Brent is a bottleneck only when treating every situation like a special snowflake.

• Brent enjoys attention???

• Brent is not the drama queen; others bring the drama to him.

• Often victim of his own success.

Brent?

Page 25: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Automation Basics

1. Brent spends time on clean design, PIE*, not NIH*

• Uses [Camel] - Integration Server

2. Brent automates the rule, codes the exception

• Apply metadata to templates

• Infrastructure as code: servers (DevOps)

* NIH: “Not Invented Here”, as opposed to PIE: “Proudly Invented Elsewhere”
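“Automate the rule, code the exception” by applying metadata to templates might look like the sketch below. It is not code from Carrie (which is written in Ruby); the template text, the `render_ddl` helper, and the job metadata are all hypothetical, and Python's standard `string.Template` stands in for whatever template engine is actually used.

```python
from string import Template

# Hypothetical template for one ETL job's Hive table; not taken from Carrie.
DDL_TEMPLATE = Template(
    "CREATE EXTERNAL TABLE $table ($columns)\n"
    "PARTITIONED BY ($partition STRING)\n"
    "LOCATION '$location';"
)


def render_ddl(meta):
    """Apply one job's metadata to the shared template."""
    columns = ", ".join(
        f"{name} {hive_type}" for name, hive_type in meta["columns"]
    )
    return DDL_TEMPLATE.substitute(
        table=meta["table"],
        columns=columns,
        partition=meta["partition"],
        location=meta["location"],
    )


# Invented metadata for two jobs: the rule is the template, the data varies.
jobs = [
    {
        "table": "orders",
        "columns": [("order_id", "BIGINT"), ("customer", "STRING")],
        "partition": "load_date",
        "location": "/data/orders",
    },
    {
        "table": "clicks",
        "columns": [("url", "STRING"), ("hits", "INT")],
        "partition": "load_date",
        "location": "/data/clicks",
    },
]

for job in jobs:
    print(render_ddl(job))
```

Adding a sixteenth job then means adding one more metadata entry, not another round of copy and paste.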

Page 26: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Demo Integration Server

• Raw Linux OS (CentOS)

• Java

• Maven

• Ruby

• networking

• Maven repo - binaries

• [created with Vagrant]

youtube link

Page 27: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Test your Chef and Vagrant knowledge:

1. What is Vagrant?

2. Name 4 other tools like Chef.

3. Dev, Test, Prod all identical ?????

4. In Chef, a box is a ‘run list’ of ?????

5. Idempotent in Chef is defined as ?????

6. Extra credit: VirtualBox is to VM as Docker is to ?????

Page 28: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Chef and Vagrant Basic Answers

1. Vagrant is a command line front end for creating VMs

2. Chef is 1 of 5: Chef, Puppet, Ansible, Salt, CFEngine

3. Dev, Test, Prod all identical ‘code’

4. A box is a ‘run list’ of features or recipes

5. Idempotent: the same code creates or updates

6. VirtualBox is to VM as Docker is to container
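Answer 5 is easier to see in code. The sketch below is not a Chef recipe (those are written in Ruby); it is a plain Python illustration of the same idea, where one function either creates the desired state or leaves it alone. The `ensure_line` helper, the file name, and the property are hypothetical.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it was already
    in the desired state. Running it twice gives the same end result.
    """
    existing = []
    if os.path.exists(path):
        with open(path) as handle:
            existing = handle.read().splitlines()
    if line in existing:
        return False
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return True


# Hypothetical config file in a scratch directory.
workdir = tempfile.mkdtemp()
config = os.path.join(workdir, "etl.properties")

first_run = ensure_line(config, "input.dir=/data/incoming")
second_run = ensure_line(config, "input.dir=/data/incoming")
print(first_run, second_run)
```

The first call creates the file, the second call changes nothing, which is why the same code can be pointed at dev, test, and prod without caring what state each one is in.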

Page 29: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Demo Metadata Collection

• Simple properties

• Collected using a cheesy UI

• UI and code generation both written in Ruby

youtube link

Page 30: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Demo Generated Code

• Camel ETL binary

• OSGi, versioned, modular jar

• Only 3 primary outputs!

• simple

• clean

• well designed (?)

• JUnit/integration tested

• Supporting scripting

• messy

youtube link

Page 31: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Demo Server Deploy

• One line deploy/run command

• Compiles on server with Maven

• Also runnable as jar

youtube link

Page 32: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Does it work?

• Make custom file

• Drop into ETL folder

• Inspect

youtube link

Page 33: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Demo - Review

• Schema created

• DDL run

• Avro binary (JSON) transform

• Data Migration

• FTP to server

• Into HDFS partition

• Alter Table: Date Partition

youtube link
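Two of the steps reviewed above, creating the schema and altering the table for a date partition, can be sketched as follows. This is not the demo's code: the helper names, the pipe delimiter, the all-strings typing, and the sample table are assumptions. The output follows the Avro record schema layout and Hive's `ALTER TABLE ... ADD PARTITION` syntax.

```python
import json


def avro_schema_from_header(record_name, header_line, delimiter="|"):
    """Build an Avro record schema from a delimited file's header row.

    Every column is typed as a string here, a simplifying assumption.
    """
    fields = [
        {"name": column.strip(), "type": "string"}
        for column in header_line.split(delimiter)
    ]
    return {"type": "record", "name": record_name, "fields": fields}


def add_partition_statement(table, partition_column, load_date, location):
    """Build the Hive statement that registers one date partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_column}='{load_date}') "
        f"LOCATION '{location}';"
    )


# Hypothetical inputs for one incoming file.
schema = avro_schema_from_header("Order", "order_id|customer|amount")
statement = add_partition_statement(
    "orders", "load_date", "2014-06-24", "/data/orders/load_date=2014-06-24"
)

print(json.dumps(schema, indent=2))
print(statement)
```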

Page 34: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Takeaways

• Brent codes the exception manually, the rule by template.

• Brent has time to focus on design & exceptions.

• Brent may lose some personal attention and status.

• Resulting code is

• clean

• consistent, easy to maintain

• But is there a Home Run?

• defined as anything not possible via special snowflake approach

Page 35: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Home Run 1: Instant, identical, dev/test/prod

Page 36: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Home Run 2: Big Data, Beyond Hadoop!

1. Pick your provider

• Hadoop

• Cassandra

• Couchbase

• any of hundreds…

2. Adapt your templates, VMs, etc.

3. Even stick with Avro…

Page 37: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Home Run 3: Effort as Idempotent

• Idempotent effort? No penalty for discontinuous development.

• Walkup - The 10 minute test

• Walkaway - Requirements

• Features

• Testing, technical debt, already in place for code

• VMs and recipes for dev, test, prod

• OSGi etc modularity for binaries

• Does what we see here pass this test?

Page 38: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

What to leave with

• De-mystify: how to Avro/Hadoop a delimited file

• Review motives for automating this process

• Code automation basics

• Infrastructure automation basics

• Code for above

Page 39: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Further Hadoop Tutorial Resources

• Hortonworks

• best free stuff? Except networking vas

• Cloudera

• Lots but appear to prefer to get paid

• Apache Hadoop

• haven’t tried but it is Apache

Page 40: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Further Camel Resources

• Gerald Cantor of this group, Mark of this group (AMD)

• Camel in Action (book)

• Camel mail list

• Red Hat support (Fuse)

Page 41: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

6-week deep dives: Candidates

• Apache Camel

• Serialization choices, formats, code, tools

• Big Data: NoSQL and newSQL variants and choices

• Test Driven Development

• Jenkins, CI, etc

• Hive, Impala, HAWQ, and other Hadoop SQL engines

• Pig, and MapReduce for Hadoop

• Hadoop clustering

• OSGi, Felix

• Maven, Gradle etc

• bash

• Chef or Puppet, Salt, Ansible, CFEngine

• DevOps, The Phoenix Project

Page 42: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Wish To See More?

• In office demos

• Your sample data

Page 43: Hadoop Demystified + Automation Smackdown!  Austin JUG June 24 2014

Code, Content, Contacts

• This Slide Deck: http://www.slideshare.net/datafundamentals/hadoop-big-data-35762308

• or just remember slideshare.net/datafundamentals; it may be the only one there

• Youtube - 11 minute slide-less version of code demo - https://www.youtube.com/playlist?list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe

• Dev Code

• Carrie (Ruby UI and generator) https://github.com/datafundamentals/df_ui_carrie

• Avro from delimited https://bitbucket.org/datafundamentals/avro_from_delimited

• Camel-Avro https://bitbucket.org/datafundamentals/camel-avro-etl

• Ops Code - cookbook recipes

• https://github.com/datafundamentals

• Contact

[email protected], [email protected]

Be careful! It’s a competitive world out there!