Brust hadoopecosystem (slide transcript; uploaded by andrew-brust)
Page 1: Hadoop and its Ecosystem Components in Action

Andrew J. Brust, CEO and Founder
Blue Badge Insights
Level: Intermediate
Page 2: Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
  – http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
  – http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust
Page 3: My New Blog (bit.ly/bigondata)
Page 4: Read All About It!
Page 5: Agenda

• Understand:
  – MapReduce, Hadoop, and Hadoop stack elements
• Review:
  – Microsoft Hadoop on Azure, Amazon Elastic MapReduce (EMR), Cloudera CDH4
• Demo:
  – Hadoop, HBase, Hive, Pig
• Discuss:
  – Sqoop, Flume
• Demo:
  – Mahout
Page 6: MapReduce, in a Diagram

[Diagram: many input splits feed parallel mappers; mapper output is partitioned by key (K1, K4…; K2, K5…; K3, K6…) to three reducers, each of which writes its own output.]
Page 7: A MapReduce Example

• Count by suite, on each floor
• Send per-suite, per-platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floors
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies
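The floor-by-floor tally above is exactly the map → shuffle → reduce pattern. Here is a minimal single-process Python sketch of it; the sample data and function names are illustrative, not Hadoop API:

```python
from collections import defaultdict

# Each "suite" reports the platforms of its attendees (made-up sample data).
suites = [
    ["iOS", "Android", "iOS"],
    ["Windows", "iOS"],
    ["Android", "Android", "Windows"],
]

def mapper(records):
    # Emit (platform, 1) for every attendee -- the per-suite count step.
    for platform in records:
        yield platform, 1

def shuffle(pairs):
    # Group values by key -- the "send totals to the lobby, sorted by platform" step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Tally each platform -- the per-floor summation step.
    return key, sum(values)

mapped = [pair for suite in suites for pair in mapper(suite)]
grouped = shuffle(mapped)
result = dict(reducer(k, v) for k, v in grouped.items())
print(result)  # {'iOS': 3, 'Android': 3, 'Windows': 2}
```

In real Hadoop the mappers and reducers run on different nodes and the shuffle happens over the network, but the data flow is the same.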
Page 8: What’s a Distributed File System?

• One where data gets distributed over commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data is lost
  – Except the name node = single point of failure (SPOF)!
• BUT: HDFS is immutable
  – Files can only be written once
  – So updates require drop + re-write (slow)
Page 9: Hadoop = MapReduce + HDFS

• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to the cluster.
  – Mappers execute in parallel
  – Hardware is commodity
  – “Scaling out”
• Use of HDFS means data may well be local to mapper processing
• So: not just parallel, but minimal data movement, which avoids network bottlenecks
Page 10: The Hadoop Stack

• Hadoop
  – MapReduce, HDFS
• HBase – NoSQL database
  – Lesser extent: Cassandra, Hypertable
• Hive, Pig
  – SQL-like “data warehouse” system
  – Data transformation language
• Sqoop
  – Import/export between HDFS, HBase, Hive and relational data warehouses
• Flume
  – Log file integration
• Mahout
  – Data mining
Page 11: Ways to Work

• Microsoft Hadoop on Azure
  – Visit www.hadooponazure.com
  – Request invite
  – Free, for now
• Amazon Web Services Elastic MapReduce
  – Create AWS account
  – Select Elastic MapReduce in Dashboard
  – Cheap for experimenting, but not free
• Cloudera CDH VM image
  – Download as .tar.gz file
  – “Un-tar” (can use WinRAR, 7-Zip)
  – Run via VMware Player or VirtualBox
  – Everything’s free
Page 12: Microsoft Hadoop on Azure

• Much simpler than the others
• Browser-based portal
  – Provisioning cluster, managing ports, MapReduce jobs
  – Gathering external data:
    Configure Azure Storage, Amazon S3
    Import from DataMarket to Hive
• Interactive JavaScript console
  – HDFS, Pig, light data visualization
• Interactive Hive console
  – Hive commands and metadata discovery
• From the Portal page you can RDP directly to the Hadoop head node
  – Double-click the desktop shortcut for CLI access
  – Certain environment variables may need to be set
Page 13: Microsoft Hadoop on Azure
Page 14: Amazon Elastic MapReduce

• Lots of steps!
• At a high level:
  – Set up AWS account and S3 “buckets”
  – Generate Key Pair and PEM file
  – Install Ruby and the EMR Command Line Interface
  – Provision the cluster using the CLI
    (a batch file can work very well here)
  – Set up and run SSH/PuTTY
  – Work interactively at the command line
Page 15: Amazon Elastic MapReduce
Page 16: Cloudera CDH4 Virtual Machine

• Get it for free, in VMware and VirtualBox versions
  – VMware Player and VirtualBox are free too
• Run it, and configure it to have its own IP on your network
• Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to:
  – http://192.168.1.59:8888
• Can also use a browser in the VM and hit:
  – http://localhost:8888
• Work in “Hue”…
Page 17: Hue

• Browser-based UI, with front ends for:
  – HDFS (with upload & download)
  – MapReduce job creation and monitoring
  – Hive (“Beeswax”)
• And in-browser command-line shells for:
  – HBase
  – Pig (“Grunt”)
Page 18: Cloudera CDH4
Page 19: Hadoop Commands

• HDFS – hadoop fs -filecommand
  – Create and remove directories: mkdir, rm, rmr
  – Upload and download files to/from HDFS: put, get
  – View directory contents: ls, lsr
  – Copy, move, view files: cp, mv, cat
• MapReduce
  – Run a Java jar-file-based job: hadoop jar jarname params
Page 20: Hadoop (directly)
Page 21: HBase

• Concepts:
  – Tables, column families
  – Columns, rows
  – Keys, values
• Commands:
  – Definition: create, alter, drop, truncate
  – Manipulation: get, put, delete, deleteall, scan
  – Discovery: list, exists, describe, count
  – Enablement: disable, enable
  – Utilities: version, status, shutdown, exit
  – Reference: http://wiki.apache.org/hadoop/Hbase/Shell
• Moreover:
  – Interesting HBase work can be done in MapReduce, Pig
Page 22: HBase Examples

• create 't1', 'f1', 'f2', 'f3'
• describe 't1'
• alter 't1', {NAME => 'f1', VERSIONS => 5}
• put 't1', 'r1', 'f1:c1', 'value'
• get 't1', 'r1'
• count 't1'
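To make the shell commands above concrete, here is a toy in-memory model of HBase's table / column-family / versioned-cell layout. This is purely illustrative Python, not the HBase client API; the class and table names are invented:

```python
from collections import defaultdict

class ToyHTable:
    """Rows -> column ('family:qualifier') -> list of cell versions, newest first."""
    def __init__(self, *families, versions=3):
        self.families = set(families)
        self.versions = versions
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        # Column names are 'family:qualifier'; the family must exist.
        family = column.split(":")[0]
        assert family in self.families, "unknown column family"
        cell = self.rows[row].setdefault(column, [])
        cell.insert(0, value)        # newest version first
        del cell[self.versions:]     # keep at most VERSIONS versions

    def get(self, row):
        # Return the newest version of every column in the row.
        return {col: cells[0] for col, cells in self.rows[row].items()}

    def count(self):
        return len(self.rows)

# Mirrors the shell session: create 't1' with families, put a cell, get a row, count.
t1 = ToyHTable("f1", "f2", "f3", versions=5)
t1.put("r1", "f1:c1", "value")
print(t1.get("r1"))   # {'f1:c1': 'value'}
print(t1.count())     # 1
```

The real store differs in every practical respect (distribution, persistence, timestamps as version keys), but the data model shown is the one the shell commands manipulate.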
Page 23: HBase
Page 24: Hive

• Used by most BI products which connect to Hadoop
• Provides a SQL-like abstraction over Hadoop
  – Officially HiveQL, or HQL
• Works on its own tables, but also on HBase
• A query generates a MapReduce job, whose output becomes the result set
• Microsoft has a Hive ODBC driver
  – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
Page 25: Hive, Continued

• Load data from flat HDFS files
  – LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
• SQL queries
  – CREATE, ALTER, DROP
  – INSERT OVERWRITE (creates whole tables)
  – SELECT, JOIN, WHERE, GROUP BY
  – SORT BY, but ordering data is tricky!
  – MAP/REDUCE/TRANSFORM…USING allows for custom map and reduce steps utilizing Java or streaming code
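Because HiveQL stays close to standard SQL, the flavor of a typical Hive query can be shown with an ordinary SQL engine. The snippet below uses Python's sqlite3 purely as a stand-in; the table and column names are made up, and a real Hive run would compile the same GROUP BY into a MapReduce job:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [("home", 10), ("docs", 4), ("home", 6), ("blog", 3)],
)

# The kind of aggregate query HiveQL expresses almost identically.
rows = conn.execute(
    "SELECT page, SUM(visits) FROM hits GROUP BY page ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('home', 16), ('docs', 4), ('blog', 3)]
```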
Page 26: Hive
Page 27: Pig

• Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
  – Does a combo of query and ETL
• “10 lines of Pig Latin ≈ 200 lines of Java.”
• Works with structured or unstructured data
• Operations
  – As with Hive, a MapReduce job is generated
  – Unlike Hive, output is only a flat file to HDFS or text at the command-line console
  – With MS Hadoop, can easily convert to a JavaScript array, then manipulate
• Use the command line (“Grunt”) or build scripts
Page 28: Example

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
Page 29: Pig Latin Examples

• Imperative, file-system commands
  – LOAD, STORE (schema specified on LOAD)
• Declarative, query commands (SQL-like); xxx = a file or data set
  – FOREACH xxx GENERATE (SELECT…FROM xxx)
  – JOIN (WHERE/INNER JOIN)
  – FILTER xxx BY (WHERE)
  – ORDER xxx BY (ORDER BY)
  – GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*)…GROUP BY)
  – DISTINCT (SELECT DISTINCT)
• Syntax is assignment-statement based:
  – MyCusts = FILTER Custs BY SalesPerson eq 15;
• Access HBase
  – CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
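The filter → group → count dataflow of the Pig script on the previous slide maps naturally onto plain collection operations. A rough Python equivalent, with invented sample tuples standing in for 'myfile':

```python
from itertools import groupby

# A = LOAD 'myfile' AS (x, y, z);
A = [(1, "a", 10), (2, "b", 20), (-1, "c", 30), (1, "d", 40)]

# B = FILTER A BY x > 0;
B = [t for t in A if t[0] > 0]

# C = GROUP B BY x;  (groupby needs its input sorted by the key)
C = {x: list(ts) for x, ts in groupby(sorted(B), key=lambda t: t[0])}

# D = FOREACH C GENERATE group, COUNT(B);
D = [(x, len(ts)) for x, ts in sorted(C.items())]
print(D)  # [(1, 2), (2, 1)]
```

Each assignment corresponds to one Pig relation; on a cluster, Pig turns the same pipeline into MapReduce stages instead of in-memory lists.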
Page 30: Pig
Page 31: Sqoop

sqoop import
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;
             database=<dbname>;
             user=<username>@<servername>;
             password=<password>"
  --table <from_table>
  --target-dir <to_hdfs_folder>
  --split-by <from_table_column>
Page 32: Sqoop

sqoop export
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;
             database=<dbname>;
             user=<username>@<servername>;
             password=<password>"
  --table <to_table>
  --export-dir <from_hdfs_folder>
  --input-fields-terminated-by "<delimiter>"
Page 33: Flume NG

• Sources
  – Avro (data serialization system – can read JSON-encoded data files, and can work over RPC)
  – Exec (reads from stdout of a long-running process)
• Sinks
  – HDFS, HBase, Avro
• Channels
  – Memory, JDBC, file
Page 34: Flume NG (next generation)

• Set up conf/flume.conf:

  # Define a memory channel called ch1 on agent1
  agent1.channels.ch1.type = memory

  # Define an Avro source called avro-source1 on agent1 and tell it
  # to bind to 0.0.0.0:41414. Connect it to channel ch1.
  agent1.sources.avro-source1.channels = ch1
  agent1.sources.avro-source1.type = avro
  agent1.sources.avro-source1.bind = 0.0.0.0
  agent1.sources.avro-source1.port = 41414

  # Define a logger sink that simply logs all events it receives
  # and connect it to the other end of the same channel.
  agent1.sinks.log-sink1.channel = ch1
  agent1.sinks.log-sink1.type = logger

  # Finally, now that we've defined all of our components, tell
  # agent1 which ones we want to activate.
  agent1.channels = ch1
  agent1.sources = avro-source1
  agent1.sinks = log-sink1

• From the command line:

  flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Page 35: Mahout Algorithms

• Recommendation
  – Your info + community info
  – Give users/items/ratings; get user-user/item-item
  – itemsimilarity
• Classification/categorization
  – Drop into buckets
  – Naïve Bayes, Complementary Naïve Bayes, Decision Forests
• Clustering
  – Like classification, but with categories unknown
  – k-Means, Fuzzy k-Means, Canopy, Dirichlet, Mean-Shift
Page 36: Workflow, Syntax

• Workflow
  – Run the job
  – Dump the output
  – Visualize, predict
• Syntax:
  mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 …
• Example:
  mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
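For a sense of what the SIMILARITY_LOGLIKELIHOOD measure in the example computes: Mahout scores item co-occurrence with the log-likelihood ratio (Dunning's G² statistic). A standalone Python sketch of that formula, with invented counts; this is an illustration of the math, not Mahout's actual code:

```python
import math

def entropy(*counts):
    """Unnormalized entropy term: sum of k * ln(k/N) over nonzero counts."""
    total = sum(counts)
    return sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = users with both items, k12/k21 = one item only, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (mat - row - col)

print(llr(10, 0, 0, 10))   # items that always co-occur: large score
print(llr(10, 10, 10, 10)) # statistically independent items: ~0
```

High scores flag item pairs whose co-occurrence is unlikely to be chance, which is what makes the measure robust for sparse ratings data.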
Page 37: Mahout
Page 38: The Truth About Mahout

• Mahout is really just an algorithm engine
• Its output is almost unusable by non-statisticians/non-data-scientists
• You need a staff or a product to visualize it, or make it into a usable prediction model
• Investigate Predixion Software
  – CTO, Jamie MacLennan, used to lead the SQL Server Data Mining team
  – Excel add-in can use Mahout remotely, visualize its output, and run predictive analyses
  – Also integrates with SQL Server, Greenplum, MapReduce
  – http://www.predixionsoftware.com
Page 39: Resources

• Big On Data blog
  – http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
  – http://hadoop.apache.org/
• Hive & Pig home pages
  – http://hive.apache.org/
  – http://pig.apache.org/
• Hadoop on Azure home page
  – https://www.hadooponazure.com/
• Cloudera CDH4 download
  – https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
• SQL Server 2012 Big Data
  – http://bit.ly/sql2012bigdata
Page 40: Thank You

• [email protected]
• @andrewbrust on Twitter
• Get Blue Badge’s free briefings
  – Text “bluebadge” to 22828