Real Time Analytics using Cloudera Impala in Manufacturing use case

description

CSCI E-185 Big Data Analytics -- Final project, Fall 2013

Transcript of Real Time Analytics using Cloudera Impala in Manufacturing use case

Page 1: Real Time Analytics using Cloudera Impala in Manufacturing use case

Final Project

Real Time Analytics using Cloudera Impala in Manufacturing use case

Rapheephan Thongkham-uan (Nancy)

CSCI E-185 Big Data Analytics

@Rapheephan Thongkham-Uan

Friday, May 10, 13

Page 2: Real Time Analytics using Cloudera Impala in Manufacturing use case

To make Big Data make money

In manufacturing, ...

• We want to improve supply chain management by tracking defective parts, finding bottlenecks, etc.

• We analyze large amounts of data with traditional tools, which takes too much time.

• People in the factory are familiar with SQL queries.

• The faster we analyze big data, the more we gain:

- faster detection of defects and bottlenecks

- near real-time problem solving and decision-making

- less time and money spent on defects

That’s why we need Cloudera Impala

Page 4: Real Time Analytics using Cloudera Impala in Manufacturing use case

After finishing the Cloudera Manager installation

Page 5: Real Time Analytics using Cloudera Impala in Manufacturing use case

We will use the Hue Web UI to query Impala

From the Services menu bar, click HUE1 and choose Hue Web UI.

Page 6: Real Time Analytics using Cloudera Impala in Manufacturing use case

Create table in Hive

Create the Hive table as user impala, then load the data from the local file system into the table:

$ sudo -E -u impala hive -e "CREATE TABLE khsample (id INT, sdate STRING, seq INT, product STRING, ope STRING, resource_grp STRING, resource STRING, inflow FLOAT, proclot FLOAT, wip FLOAT, ope_rate FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"

$ sudo -E -u impala hive -e "LOAD DATA LOCAL INPATH 'KH_RESULT.csv' INTO TABLE khsample;"
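A plain delimited text table splits each line on the delimiter and does not interpret quote characters. The Python sketch below imitates that splitting for the khsample columns. It is only an illustration, not Hive's actual parser. The sample line is hypothetical, since the contents of KH_RESULT.csv are not shown in the slides, and it assumes the dates are stored with double quotes, which the date literals in the bottleneck query later in the deck suggest.

```python
# Column names from CREATE TABLE khsample, with Python stand-ins for the Hive types.
KHSAMPLE_COLUMNS = [
    ("id", int), ("sdate", str), ("seq", int), ("product", str),
    ("ope", str), ("resource_grp", str), ("resource", str),
    ("inflow", float), ("proclot", float), ("wip", float), ("ope_rate", float),
]


def parse_delimited_row(line, delimiter=","):
    """Split a line on every delimiter, with no special treatment of quotes."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(KHSAMPLE_COLUMNS):
        raise ValueError(f"expected {len(KHSAMPLE_COLUMNS)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(KHSAMPLE_COLUMNS, fields)}


# Hypothetical sample line (assumption): the real file contents are not in the slides.
sample_line = '118,"2012/12/22",1,P01,OP10,GRP_A,M001,10.0,2.0,35.5,0.8'
row = parse_delimited_row(sample_line)
print(row["sdate"])  # the double quotes are still part of the stored value
```

Under this assumption, a filter on sdate has to include the double quotes in the string literal to match the stored value.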

Page 7: Real Time Analytics using Cloudera Impala in Manufacturing use case

Sample table in Hue Web UI

We can view the table we just created from the Hive shell in the Hue Web UI.

* The input data includes Japanese characters, which are not displayed correctly.

Page 8: Real Time Analytics using Cloudera Impala in Manufacturing use case

Create table in Hive

Before querying Impala in the Hue Web UI, we have to refresh Impala's metadata first. In impala-shell, enter the following command:

$ impala-shell

[impala-server:21000] > refresh;

Page 9: Real Time Analytics using Cloudera Impala in Manufacturing use case

Query in Impala

In the Hue Web UI, click the Impala icon; the query editor page will be shown.

Enter the query and execute it.

Page 10: Real Time Analytics using Cloudera Impala in Manufacturing use case

Bottlenecks query

- To find the groups of machines that are bottlenecks, we can compare WIP by day. A group of machines whose WIP value keeps rising from one day to the next can be predicted to be a bottleneck.

- The simulation dates were from 12/13 to 12/22. I will sum the WIP values for the sampling dates (12/14, 12/16, 12/18, 12/20, 12/22).

- We have to use 5 sub-queries in the FROM clause.
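The bottleneck rule can be sketched in a few lines of Python before turning to SQL. This is a minimal illustration, not the author's implementation: the function name and the WIP figures are hypothetical, and the comparison is "greater than or equal", matching the `>=` conditions in the query on the next slide.

```python
def is_bottleneck(wip_by_date):
    """Return True when summed WIP never decreases between sampling dates."""
    dates = sorted(wip_by_date)  # "YYYY/MM/DD" strings sort chronologically
    values = [wip_by_date[d] for d in dates]
    return all(earlier <= later for earlier, later in zip(values, values[1:]))


# Hypothetical summed WIP per machine group on the five sampling dates.
sample = {
    "GRP_A": {"2012/12/14": 10.0, "2012/12/16": 12.0, "2012/12/18": 12.0,
              "2012/12/20": 15.0, "2012/12/22": 20.0},
    "GRP_B": {"2012/12/14": 10.0, "2012/12/16": 8.0, "2012/12/18": 9.0,
              "2012/12/20": 11.0, "2012/12/22": 12.0},
}

bottlenecks = [group for group, wip in sample.items() if is_bottleneck(wip)]
print(bottlenecks)
```

GRP_B drops from 10.0 to 8.0 between 12/14 and 12/16, so only GRP_A is flagged.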

Page 11: Real Time Analytics using Cloudera Impala in Manufacturing use case

Bottlenecks query (2)

SELECT A.resource_grp,
       A.awip AS wip22, -- 12/22 wip
       B.bwip AS wip20, -- 12/20 wip
       C.cwip AS wip18, -- 12/18 wip
       D.dwip AS wip16, -- 12/16 wip
       E.ewip AS wip14  -- 12/14 wip
FROM (SELECT resource_grp, sum(wip) AS awip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/22"'
      GROUP BY resource_grp) A JOIN
     (SELECT resource_grp, sum(wip) AS bwip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/20"'
      GROUP BY resource_grp) B JOIN
     (SELECT resource_grp, sum(wip) AS cwip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/18"'
      GROUP BY resource_grp) C JOIN
     (SELECT resource_grp, sum(wip) AS dwip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/16"'
      GROUP BY resource_grp) D JOIN
     (SELECT resource_grp, sum(wip) AS ewip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/14"'
      GROUP BY resource_grp) E
WHERE A.resource_grp = B.resource_grp
  AND A.resource_grp = C.resource_grp
  AND A.resource_grp = D.resource_grp
  AND A.resource_grp = E.resource_grp
  AND A.awip >= B.bwip AND B.bwip >= C.cwip
  AND C.cwip >= D.dwip AND D.dwip >= E.ewip
ORDER BY A.awip DESC
LIMIT 20;
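The same comparison can also be written as one pass over the table using conditional aggregation. The sketch below is an alternative formulation, not the query from the slides. It runs on an in-memory SQLite database as a stand-in for Impala so that it can be tried without a cluster. The reduced four-column table, the sample rows, and the unquoted date strings are all assumptions made for illustration.

```python
import sqlite3

# One scan of khsample: each CASE picks out the WIP of a single sampling date.
# A group with no rows on some date gets NULL there and is dropped by the filter.
QUERY = """
SELECT resource_grp, wip22, wip20, wip18, wip16, wip14
FROM (SELECT resource_grp,
             SUM(CASE WHEN sdate = '2012/12/22' THEN wip END) AS wip22,
             SUM(CASE WHEN sdate = '2012/12/20' THEN wip END) AS wip20,
             SUM(CASE WHEN sdate = '2012/12/18' THEN wip END) AS wip18,
             SUM(CASE WHEN sdate = '2012/12/16' THEN wip END) AS wip16,
             SUM(CASE WHEN sdate = '2012/12/14' THEN wip END) AS wip14
      FROM khsample
      WHERE id = 118
      GROUP BY resource_grp) t
WHERE wip22 >= wip20 AND wip20 >= wip18
  AND wip18 >= wip16 AND wip16 >= wip14
ORDER BY wip22 DESC
LIMIT 20
"""

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the khsample table (only the columns used here).
conn.execute("CREATE TABLE khsample (id INTEGER, sdate TEXT, resource_grp TEXT, wip REAL)")
rows = [
    # GRP_A: WIP never decreases (two machines report on 12/22).
    (118, "2012/12/14", "GRP_A", 5.0), (118, "2012/12/16", "GRP_A", 6.0),
    (118, "2012/12/18", "GRP_A", 6.0), (118, "2012/12/20", "GRP_A", 8.0),
    (118, "2012/12/22", "GRP_A", 4.0), (118, "2012/12/22", "GRP_A", 5.0),
    # GRP_B: WIP drops between 12/14 and 12/16.
    (118, "2012/12/14", "GRP_B", 9.0), (118, "2012/12/16", "GRP_B", 7.0),
    (118, "2012/12/18", "GRP_B", 8.0), (118, "2012/12/20", "GRP_B", 8.0),
    (118, "2012/12/22", "GRP_B", 10.0),
    # GRP_C: no data on 12/14, so it cannot be compared on all five dates.
    (118, "2012/12/16", "GRP_C", 1.0), (118, "2012/12/18", "GRP_C", 2.0),
    (118, "2012/12/20", "GRP_C", 3.0), (118, "2012/12/22", "GRP_C", 4.0),
]
conn.executemany("INSERT INTO khsample VALUES (?, ?, ?, ?)", rows)

result = conn.execute(QUERY).fetchall()
print(result)
conn.close()
```

Like the five-way join, this keeps only groups that have data on all five dates and whose summed WIP never decreases. Whether it is faster on Impala than the join was not measured in this project.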

Page 12: Real Time Analytics using Cloudera Impala in Manufacturing use case

Comparing the result of Impala with Oracle SQL

Page 13: Real Time Analytics using Cloudera Impala in Manufacturing use case

Results

• Joining 5 sub-queries in Oracle SQL took 50 s.

• Joining 5 sub-queries in Impala took 6.67 s.

• Impala ran the query about 7.5x faster and returned the same results.

• In real use, we could configure Impala to work with HBase and also change the Hive Metastore to OracleDB.
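The speedup follows directly from the two timings reported above. A quick check (the variable names are illustrative):

```python
oracle_seconds = 50.0   # five-sub-query join in Oracle SQL, from the results above
impala_seconds = 6.67   # the same join in Impala, from the results above

speedup = oracle_seconds / impala_seconds
print(f"Impala was about {speedup:.1f}x faster")
```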
