MapReduce Component Design
8/7/2019 MapReduce Component Design
http://slidepdf.com/reader/full/mapreduce-component-design 1/5
Impact Determination Service: cloud computing part design
HadoopInitiator, IDSFilter & HiveReport
High-level design

The HadoopInitiator component will be responsible for communicating with the Hadoop cluster, so that we can control the MapReduce job, monitor the job status, and extend the cluster when computing resources are running low.
The IDSFilter (MapReduce component) is responsible for processing the transaction data from retailers and producing a list of impacted customers and product information.
The HiveReport component is a new component used to demonstrate the BI feature in the cloud. We use Hive (a subproject of Hadoop) directly, since it is easy to reuse the current environment of the Hadoop cluster.
As the diagram shows, there are four components related to Hadoop:

IDSFilter - focuses on data filtering and sorting.
HadoopInitiator - focuses on:
- Initiating a MapReduce job
- Monitoring the MapReduce job status and providing input for the related queue listeners
Scale - a subcomponent of HadoopInitiator that monitors the cluster's scalability status and sends notifications to the cloud provisioning system to scale the Hadoop cluster up/down.
HiveReport - a simple component for demonstrating the BI-in-cloud feature.
Detailed design of each component
IDSFilter - MapReduce component

Classes for analysis and filtering/sorting:
- IDSMapper - implements the Mapper interface of MapReduce
- IDSReducer - implements the Reducer interface of MapReduce
- IDSPartitioner - implements the Partitioner interface of MapReduce; just a placeholder in case the output file format needs multiple Reducers
- IDSFilter - the job control class that invokes the Mapper and Reducer; the entry point of our custom Java program, which will be uploaded to AWS S3
- FilterDriver - just the entry point of the Java program jar
The input file will be uploaded to Amazon S3 storage.

Todo:
Decide the input file path and naming convention, e.g. S3://input/$Product or S3://input/$Product.$date
Input file format:

TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo
TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo
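The input format above can be parsed with plain string splitting before any MapReduce logic runs. A minimal sketch, assuming hypothetical class and field names (they are not from this design; the real IDSMapper would do this inside its map method):

```java
// Illustrative parser for one transaction line of the input format above.
// All names here are assumptions for the sketch, not the final design.
import java.util.ArrayList;
import java.util.List;

class TransactionRecord {
    final String transactionId, transactionDate, customerName, emailId;
    // each entry is {Product, Manufacturer, BatchNo}
    final List<String[]> products = new ArrayList<>();

    private TransactionRecord(String[] fields) {
        transactionId = fields[0];
        transactionDate = fields[1];
        customerName = fields[2];
        emailId = fields[3];
    }

    /** Parses: TxID|Date|Name|Email|P,M,B$P,M,B$... */
    static TransactionRecord parse(String line) {
        String[] fields = line.split("\\|");
        TransactionRecord rec = new TransactionRecord(fields);
        for (String p : fields[4].split("\\$")) {
            rec.products.add(p.split(","));
        }
        return rec;
    }
}
```

The nested split on "$" is exactly the complication that suggestion 2 below removes by allowing only one product per line.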
Suggestions:
1. Add a count number after each BatchNo, so that we can really measure the impact; this makes the data more realistic.
2. Include only one product per line, so that we don't need complicated logic in the MapReduce module. Putting more than one record in a line makes the MapReduce component more complicated, and we might need to add more steps to this module, which means more than one job flow would be needed.
E.g.:

TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
Output file format:

Product|Manufacturer|BatchNo|TotalCount|EmailID$EmailID$EmailID$EmailID
Product|Manufacturer|BatchNo|TotalCount|EmailID$EmailID$EmailID$EmailID

Todo:
Decide the output file path and naming convention, e.g. S3://output/$manufacturer.$product.$batchno
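The reduce side essentially sums counts and concatenates EmailIDs per Product|Manufacturer|BatchNo key. A plain-Java sketch of that aggregation, without the Hadoop Reducer API (class and method names are assumptions for illustration):

```java
// Illustrative reduce-side aggregation for the output format above:
// per key, sum the counts and join impacted customers' EmailIDs with '$'.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class ImpactAggregator {
    // key = "Product|Manufacturer|BatchNo"
    private final Map<String, long[]> counts = new LinkedHashMap<>();
    private final Map<String, StringJoiner> emails = new LinkedHashMap<>();

    /** Accumulate one (key, count, email) contribution. */
    void add(String key, long count, String emailId) {
        counts.computeIfAbsent(key, k -> new long[1])[0] += count;
        emails.computeIfAbsent(key, k -> new StringJoiner("$")).add(emailId);
    }

    /** Emits: Product|Manufacturer|BatchNo|TotalCount|Email$Email$... */
    String line(String key) {
        return key + "|" + counts.get(key)[0] + "|" + emails.get(key);
    }
}
```

In the real IDSReducer this would happen per key inside reduce(), with IDSPartitioner deciding which Reducer sees which keys.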
HadoopInitiator - Remote MapReduce Job Control

Description: class to communicate with the Hadoop cluster, focusing on:
- add a job flow
- get job state - this function can be abstracted to generate the message queue of job flow state
- terminate a job flow
- add a file to the Hadoop distributed file system
  - copy the file onto the Hadoop master
  - call the put subcommand to add it
- get a file from the Hadoop distributed file system
  - Option 1:
    - save it locally
    - copy it to the server that needs this file
  - Option 2:
    - get the file directly via its URL: http://datanode:50075/$user_dir/$file_name
- update the job flow with parameters (id, instances)
- monitor the scalability of the Hadoop cluster, dynamically scaling up/down
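The two HDFS file operations above can be sketched as command lines to run over SSH on the Hadoop master. Host names and paths below are placeholders, and nothing is executed here; this only shows the shape of the commands:

```java
// Sketch of the HDFS file operations listed above, built as command
// strings for execution over SSH. Hosts/paths are hypothetical.
import java.util.List;

class HdfsCommands {
    /** scp the local file to the Hadoop master, then `hadoop fs -put` it. */
    static List<String> putFile(String master, String localFile, String hdfsDir) {
        return List.of(
            "scp " + localFile + " " + master + ":/tmp/",
            "ssh " + master + " hadoop fs -put /tmp/" + localFile + " " + hdfsDir
        );
    }

    /** Option 2 for reading: fetch the file directly from a datanode over HTTP. */
    static String getFileUrl(String datanode, String userDir, String fileName) {
        return "http://" + datanode + ":50075/" + userDir + "/" + fileName;
    }
}
```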
Here we use an SSH connection to connect to the primary Hadoop namenode (the first master; we will call it the Hadoop master). If the job tracker is not configured on the same server, the job controller had better talk to the job tracker directly, but let's keep it simple here.

The HadoopInitiator can reach the Hadoop master via SSH; we need to make sure of this when the Hadoop cluster is built. This information can be stored in a configuration file (we will hard-code it in the ScaleListener for now).
There are two ways to do this:
- Use SOAP - just use the MapReduce WSDL provided by Amazon
- Use HttpClient to send requests and receive XML responses; this is easier and faster
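A minimal sketch of the HttpClient option, building a GET request for a job-state endpoint. The URL path here is an assumption; the real endpoint depends on whether the target is Apache Hadoop or Amazon's MapReduce service:

```java
// Build (but do not send) a job-state request; the endpoint is hypothetical.
import java.net.URI;
import java.net.http.HttpRequest;

class JobStateRequest {
    static HttpRequest forJobFlow(String host, String jobFlowId) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + "/jobflows/" + jobFlowId + "/state"))
                .GET()
                .build();
    }
}
```

Sending it with java.net.http.HttpClient and parsing the XML body would complete the "get job state" function above.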
Notes:
We need a generic design here to make sure the component can work with both Apache MapReduce and Amazon MapReduce.
ScaleListener - Hadoop auto-scaling component

- IScaleListener - the interface of this component; we can use different implementations based on the environment our cloud is built on.
  - Just a placeholder. The implementation depends on how the scalability information is collected.
- ScaleListener - dynamically checks the scaling of the MR component.
  - Option 1: get CPU/memory utilization data from the HP Cloud controller
  - Option 2: get the utilization from the Hadoop Vaidya script
  - Option 3: use a tool like CSSH/Puppet to get utilization info directly from all server instances in the Hadoop cluster
- AWS scale toolkit wrapper - wraps Amazon's AutoScaling toolkit (deprecated; we won't use Amazon MapReduce, but we leave it here for future reuse)
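The IScaleListener contract above might look like the following sketch, with a simple threshold-based implementation. The 80%/20% CPU thresholds are illustrative values, not from this design; the real values would come from the configuration file mentioned earlier:

```java
// Sketch of the auto-scaling contract; thresholds are assumptions.
class Scaling {
    enum Decision { SCALE_UP, SCALE_DOWN, HOLD }

    interface IScaleListener {
        /** Decide from the cluster's average CPU utilization (0.0 to 1.0). */
        Decision check(double avgCpuUtilization);
    }

    /** Simplest possible implementation; Options 1-3 above differ only in
     *  where avgCpuUtilization comes from, not in this decision logic. */
    static class ThresholdScaleListener implements IScaleListener {
        @Override
        public Decision check(double avgCpuUtilization) {
            if (avgCpuUtilization > 0.80) return Decision.SCALE_UP;
            if (avgCpuUtilization < 0.20) return Decision.SCALE_DOWN;
            return Decision.HOLD;
        }
    }
}
```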
HiveReport - BI in Cloud sample

Hive is a subproject of Hadoop. By using Hive, you can build a data warehouse directly on top of the Hadoop distributed file system. It provides the tools to query data from HDFS directly, without transformation or integration.

Since Hive enables HiveQL (similar to normal SQL) and JDBC-like operations, we can just use normal reporting tools (e.g. JasperReport) to analyze the database and generate summary or analysis reports over all the distributed data.
Steps:
1. Design what kind of report we want to generate
2. Create Hive tables to store the required data
3. Use JasperReport to generate a report, using a JDBC connection to get data from the Hive data warehouse
4. Create a HiveServlet to generate the report
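Step 3 might be sketched as below. The table and column names are assumptions, and only the query builder is exercised; actually running the query needs a live HiveServer plus the Hive JDBC driver, so that part is shown but not invoked:

```java
// Hedged sketch of querying Hive over JDBC for the report; the schema
// (table "impacted_customers" with product/manufacturer/batchno/emailid
// columns) is hypothetical.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class HiveReportQuery {
    /** HiveQL to summarize impacted customers per product batch. */
    static String impactSummary(String table) {
        return "SELECT product, manufacturer, batchno, COUNT(DISTINCT emailid) AS impacted"
             + " FROM " + table + " GROUP BY product, manufacturer, batchno";
    }

    /** The ResultSet would feed JasperReport; not executed in this sketch. */
    static ResultSet run(Connection hiveConn, String table) throws Exception {
        Statement st = hiveConn.createStatement();
        return st.executeQuery(impactSummary(table));
    }
}
```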
Notes:
- The performance of Hive is a bit poor; a similar feature implemented with normal MR is almost 3 times faster than Hive, but the Hive team is improving the performance. I think it's well worth trying.
- We could even put all the data in HDFS, instead of using a standalone database to store it. Then we would have a pure distributed database.