MapReduce Component Design
8/7/2019 MapReduce Component Design
http://slidepdf.com/reader/full/mapreduce-component-design 1/5
Impact Determination Service: cloud computing part design
HadoopInitiator, IDSFilter & HiveReport
High-level design

The HadoopInitiator component will be responsible for communicating with the Hadoop cluster, so that we can control the MapReduce job, monitor the job status, and extend the cluster when computing resources are running low.
The IDSFilter (MapReduce component) is responsible for processing the transaction data from retailers and producing a list of impacted customers and product information.
The HiveReport component is a new component used to demonstrate the BI feature in the cloud. We use Hive (a subproject of Hadoop) directly, since it is easy to reuse the current environment of the Hadoop cluster.
As the diagram shows, there are four components related to Hadoop:

IDSFilter - focuses on data filtering and sorting.
HadoopInitiator - focuses on:
- Initiating a MapReduce job
- Monitoring the MapReduce job status and providing input for the related queue listeners
Scale - a subcomponent of HadoopInitiator that monitors the cluster's scalability status and sends notifications to the cloud provisioning system to scale the Hadoop cluster up/down.
HiveReport - a simple component for demonstrating the BI-in-cloud feature.
Detailed design of each component
IDSFilter - MapReduce component

Classes for analysis and filtering/sorting:
- IDSMapper - implements the Mapper interface of MapReduce
- IDSReducer - implements the Reducer interface of MapReduce
- IDSPartitioner - implements the Partitioner interface of MapReduce; just a placeholder in case the output file format needs multiple Reducers
- IDSFilter - the job control class that invokes the Mapper and Reducer; the entry point of our custom Java program, which will be uploaded to AWS S3
- FilterDriver - just the entry point of the Java program jar
The input file will be uploaded to Amazon S3 storage.

Todo:
Decide the input file path and naming convention, e.g. S3://input/$Product or S3://input/$Product.$date
Input file format:

TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo
TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo
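The input format above can be parsed with plain string splitting before any MapReduce logic runs. A minimal sketch, assuming hypothetical class and field names (they are not from this design; the real IDSMapper would do this inside its map method):

```java
// Illustrative parser for one transaction line of the input format above.
// All names here are assumptions for the sketch, not the final design.
import java.util.ArrayList;
import java.util.List;

class TransactionRecord {
    final String transactionId, transactionDate, customerName, emailId;
    // each entry is {Product, Manufacturer, BatchNo}
    final List<String[]> products = new ArrayList<>();

    private TransactionRecord(String[] fields) {
        transactionId = fields[0];
        transactionDate = fields[1];
        customerName = fields[2];
        emailId = fields[3];
    }

    /** Parses: TxID|Date|Name|Email|P,M,B$P,M,B$... */
    static TransactionRecord parse(String line) {
        String[] fields = line.split("\\|");
        TransactionRecord rec = new TransactionRecord(fields);
        for (String p : fields[4].split("\\$")) {
            rec.products.add(p.split(","));
        }
        return rec;
    }
}
```

The nested split on "$" is exactly the complication that suggestion 2 below removes by allowing only one product per line.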
Suggestions:
1. Add a count number after each BatchNo, so that we can really measure the impact; this makes the data more realistic.
2. Include only one product per line, so that we don't need complicated logic in the MapReduce module. Putting more than one record in a line makes the MapReduce component more complicated, and we might need to add more steps to this module, which means more than one job flow would be needed.
E.g.:

TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
Output file format:

Product|Manufacturer|BatchNo|TotalCount|EmailID$EmailID$EmailID$EmailID
Product|Manufacturer|BatchNo|TotalCount|EmailID$EmailID$EmailID$EmailID

Todo:
Decide the output file path and naming convention, e.g. S3://output/$manufacturer.$product.$batchno
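The reduce side essentially sums counts and concatenates EmailIDs per Product|Manufacturer|BatchNo key. A plain-Java sketch of that aggregation, without the Hadoop Reducer API (class and method names are assumptions for illustration):

```java
// Illustrative reduce-side aggregation for the output format above:
// per key, sum the counts and join impacted customers' EmailIDs with '$'.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class ImpactAggregator {
    // key = "Product|Manufacturer|BatchNo"
    private final Map<String, long[]> counts = new LinkedHashMap<>();
    private final Map<String, StringJoiner> emails = new LinkedHashMap<>();

    /** Accumulate one (key, count, email) contribution. */
    void add(String key, long count, String emailId) {
        counts.computeIfAbsent(key, k -> new long[1])[0] += count;
        emails.computeIfAbsent(key, k -> new StringJoiner("$")).add(emailId);
    }

    /** Emits: Product|Manufacturer|BatchNo|TotalCount|Email$Email$... */
    String line(String key) {
        return key + "|" + counts.get(key)[0] + "|" + emails.get(key);
    }
}
```

In the real IDSReducer this would happen per key inside reduce(), with IDSPartitioner deciding which Reducer sees which keys.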
HadoopInitiator - Remote MapReduce Job Control

Description: class to communicate with the Hadoop cluster, focusing on:
- add a job flow
- get job state - this function can be abstracted to generate the message queue of job flow state
- terminate a job flow
- add a file to the Hadoop distributed file system
  - copy the file onto the Hadoop master
  - call the put subcommand to add it
- get a file from the Hadoop distributed file system
  - Option 1:
    - save it locally
    - copy it to the server that needs this file
  - Option 2:
    - get the file directly via its URL: http://datanode:50075/$user_dir/$file_name
- update the job flow with parameters (id, instances)
- monitor the scalability of the Hadoop cluster, dynamically scaling up/down
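The two HDFS file operations above can be sketched as command lines to run over SSH on the Hadoop master. Host names and paths below are placeholders, and nothing is executed here; this only shows the shape of the commands:

```java
// Sketch of the HDFS file operations listed above, built as command
// strings for execution over SSH. Hosts/paths are hypothetical.
import java.util.List;

class HdfsCommands {
    /** scp the local file to the Hadoop master, then `hadoop fs -put` it. */
    static List<String> putFile(String master, String localFile, String hdfsDir) {
        return List.of(
            "scp " + localFile + " " + master + ":/tmp/",
            "ssh " + master + " hadoop fs -put /tmp/" + localFile + " " + hdfsDir
        );
    }

    /** Option 2 for reading: fetch the file directly from a datanode over HTTP. */
    static String getFileUrl(String datanode, String userDir, String fileName) {
        return "http://" + datanode + ":50075/" + userDir + "/" + fileName;
    }
}
```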
Here we use an SSH connection to connect to the primary Hadoop namenode (the first master; we will call it the Hadoop master). If the job tracker is not configured on the same server, the job controller had better talk to the job tracker directly, but let's keep it simple here.

The HadoopInitiator can reach the Hadoop master via SSH; we need to make sure of this when the Hadoop cluster is built. This information can be stored in a configuration file (we will hard-code it in the ScaleListener for now).
There are two ways to do this:
- Use SOAP - just use the MapReduce WSDL provided by Amazon
- Use HttpClient to send requests and receive XML responses; this is easier and faster
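A minimal sketch of the HttpClient option, building a GET request for a job-state endpoint. The URL path here is an assumption; the real endpoint depends on whether the target is Apache Hadoop or Amazon's MapReduce service:

```java
// Build (but do not send) a job-state request; the endpoint is hypothetical.
import java.net.URI;
import java.net.http.HttpRequest;

class JobStateRequest {
    static HttpRequest forJobFlow(String host, String jobFlowId) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + "/jobflows/" + jobFlowId + "/state"))
                .GET()
                .build();
    }
}
```

Sending it with java.net.http.HttpClient and parsing the XML body would complete the "get job state" function above.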
Notes:
We need a generic design here to make sure the component can work with both Apache MapReduce and Amazon MapReduce.
ScaleListener - Hadoop auto-scaling component

- IScaleListener - the interface of this component; we can use different implementations based on the environment our cloud is built on.
  - Just a placeholder. The implementation depends on how the scalability information is collected.
- ScaleListener - dynamically checks the scaling of the MR component.
  - Option 1: get CPU/memory utilization data from the HP Cloud controller
  - Option 2: get the utilization from the Hadoop Vaidya script
  - Option 3: use a tool like CSSH/Puppet to get utilization info directly from all server instances in the Hadoop cluster
- AWS scale toolkit wrapper - wraps Amazon's AutoScaling toolkit (deprecated; we won't use Amazon MapReduce, but we leave it here for future reuse)
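The IScaleListener contract above might look like the following sketch, with a simple threshold-based implementation. The 80%/20% CPU thresholds are illustrative values, not from this design; the real values would come from the configuration file mentioned earlier:

```java
// Sketch of the auto-scaling contract; thresholds are assumptions.
class Scaling {
    enum Decision { SCALE_UP, SCALE_DOWN, HOLD }

    interface IScaleListener {
        /** Decide from the cluster's average CPU utilization (0.0 to 1.0). */
        Decision check(double avgCpuUtilization);
    }

    /** Simplest possible implementation; Options 1-3 above differ only in
     *  where avgCpuUtilization comes from, not in this decision logic. */
    static class ThresholdScaleListener implements IScaleListener {
        @Override
        public Decision check(double avgCpuUtilization) {
            if (avgCpuUtilization > 0.80) return Decision.SCALE_UP;
            if (avgCpuUtilization < 0.20) return Decision.SCALE_DOWN;
            return Decision.HOLD;
        }
    }
}
```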
HiveReport - BI in Cloud sample

Hive is a subproject of Hadoop. By using Hive, you can build a data warehouse directly on top of the Hadoop distributed file system. It provides the tools to query data from HDFS directly, without transformation or integration.

Since Hive enables HiveQL (similar to normal SQL) and JDBC-like operations, we can just use normal reporting tools (e.g. JasperReport) to analyze the database and generate summary or analysis reports over all the distributed data.
Steps:
1. Design what kind of report we want to generate
2. Create Hive tables to store the required data
3. Use JasperReport to generate a report, using a JDBC connection to get data from the Hive data warehouse
4. Create a HiveServlet to generate the report
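Step 3 might be sketched as below. The table and column names are assumptions, and only the query builder is exercised; actually running the query needs a live HiveServer plus the Hive JDBC driver, so that part is shown but not invoked:

```java
// Hedged sketch of querying Hive over JDBC for the report; the schema
// (table "impacted_customers" with product/manufacturer/batchno/emailid
// columns) is hypothetical.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class HiveReportQuery {
    /** HiveQL to summarize impacted customers per product batch. */
    static String impactSummary(String table) {
        return "SELECT product, manufacturer, batchno, COUNT(DISTINCT emailid) AS impacted"
             + " FROM " + table + " GROUP BY product, manufacturer, batchno";
    }

    /** The ResultSet would feed JasperReport; not executed in this sketch. */
    static ResultSet run(Connection hiveConn, String table) throws Exception {
        Statement st = hiveConn.createStatement();
        return st.executeQuery(impactSummary(table));
    }
}
```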
Notes:
- The performance of Hive is a bit poor; a similar feature implemented with normal MR is almost 3 times faster than Hive, but the Hive team is improving the performance. I think it's well worth trying.
- We could even put all the data in HDFS, instead of using a standalone database to store it. Then we would have a pure distributed database.