HCatalog & Templeton

Youngwoo Kim ([email protected], kt.com) and Daegeun Kim ([email protected]), Data Analysis Platform, KTCloudware (NexR). Wednesday, July 18, 2012

description

An introduction to HCatalog and Templeton, with examples

Transcript of HCatalog & Templeton

Page 1: HCatalog & Templeton

HCatalog & Templeton

Youngwoo Kim ([email protected], kt.com)

Daegeun Kim ([email protected]), Data Analysis Platform, KTCloudware (NexR)

Page 2: HCatalog & Templeton

HCatalog


Page 3: HCatalog & Templeton

Hadoop Ecosystem (many data processing tools)

[Diagram: MapReduce, Hive, and Pig on top of the filesystem. Labels shown: InputFormat / OutputFormat / ..., LoadFunc, StoreFunc, SerDe, Metastore, RDBMS.]

Page 4: HCatalog & Templeton

Problems

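• There is no metastore outside of Hive

• When several tools are used on one cluster, integrating them is not easy.

• Communication overhead is incurred every time

• Where is the data? How is it stored? What is it?

• M/R and Pig users have a lot of information to remember

• Changes to the schema, data path, or format have a large impact on M/R and Pig jobs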

Page 5: HCatalog & Templeton

HCatalog

• Apache Incubator project

• Built on the Hive metastore

• Gives M/R and Pig users programming interfaces for reading and writing data

• Provides every DDL command that does not require a MapReduce job (CLI commands)

• Excludes import/export, CREATE TABLE AS SELECT, and similar commands

• Provides data exploration

• SHOW TABLES and DESCRIBE are available

• http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html

• Developed by Hortonworks, Yahoo, Twitter, and others

Page 6: HCatalog & Templeton

Table abstraction

• Metadata

• Data location, schema, compression, partitions, format, and so on

• HCatalog is used to abstract the data

• Metadata is managed in one place, which makes its role all the more important

• Supported column types: primitives, map, list, struct

Page 7: HCatalog & Templeton

HCatalog

[Diagram: MapReduce, Hive, and Pig on top of the filesystem, with HCatalog in between. Labels shown: HCatLoader, HCatStorer, HCatInputFormat, HCatOutputFormat, InputFormat, OutputFormat, SerDe, Metastore, RDBMS.]

Page 8: HCatalog & Templeton

Data types: Pig

HCatalog (= Hive) type -> Pig type

primitives (int, long, float, double, string) -> int, long, float, double, chararray

map (contains key and value pairs) -> map

list (contains a list of elements of the same data type) -> bag

struct (contains elements of different data types) -> tuple
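The correspondence on this slide can be written down as a small lookup table. The sketch below is only an illustration: `HCatToPigTypes` and `toPig` are hypothetical names and are not part of HCatalog or Pig, and the table contains nothing beyond the rows listed on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: looks up the Pig type that the slide lists for an HCatalog type. */
class HCatToPigTypes {
    // Only the rows shown on the slide; insertion order is kept for printing.
    private static final Map<String, String> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("int", "int");
        TYPES.put("long", "long");
        TYPES.put("float", "float");
        TYPES.put("double", "double");
        TYPES.put("string", "chararray");
        TYPES.put("map", "map");
        TYPES.put("list", "bag");
        TYPES.put("struct", "tuple");
    }

    /** Returns the Pig type name, or throws if the slide lists no mapping. */
    static String toPig(String hcatType) {
        String pigType = TYPES.get(hcatType.toLowerCase(Locale.ROOT));
        if (pigType == null) {
            throw new IllegalArgumentException("No Pig mapping listed for: " + hcatType);
        }
        return pigType;
    }

    public static void main(String[] args) {
        for (String hcatType : TYPES.keySet()) {
            System.out.println(hcatType + " -> " + toPig(hcatType));
        }
    }
}
```

The three names that change are the ones to remember when moving a schema between the tools: string becomes chararray, list becomes bag, and struct becomes tuple.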


Page 9: HCatalog & Templeton

Examples


Page 10: HCatalog & Templeton

DDL

$HCAT_HOME/bin/hcat -e "drop table if exists rawevents; create external table rawevents (url string, user string) partitioned by (ds string)"

$HIVE_HOME/bin/hive -e "LOAD DATA LOCAL INPATH '...' OVERWRITE INTO TABLE rawevents PARTITION (ds='20120530')"

Page 11: HCatalog & Templeton

Pig

raw = LOAD '/data/rawevents/20120530' AS (url, user);

botless = FILTER raw BY myudfs.NotABot(user);

grpd = GROUP botless by (url, user);

cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);

STORE cntd INTO '/data/counted/20120530';

Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 (page 8)

Page 12: HCatalog & Templeton

Pig + HCatalog

Pig:
raw = LOAD '/data/rawevents/20120530' AS (url, user);

Pig + HCatalog:
raw = LOAD 'rawevents' USING org.apache.hcatalog.pig.HCatLoader();

Pig + HCatalog (partition filter, replacing the path in LOAD '/data/rawevents/20120530'):
raw_0530 = FILTER raw BY ds == '20120530';

Pig:
STORE cntd INTO '/data/counted/20120530';

Pig + HCatalog:
STORE cntd INTO 'counted' USING org.apache.hcatalog.pig.HCatStorer();

Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 (page 8)

Page 13: HCatalog & Templeton

MapReduce

• Use the HCatInputFormat and HCatOutputFormat classes

• The value class is HCatRecord by default

• The key is not used

• Set OutputValueClass to HCatRecord

• As always, a Reducer is optional, not required

• Partitions can be controlled

• Easy to control through the schema

Page 14: HCatalog & Templeton

MapReduce - Job

Job job = new Job(getConf());
job.setJarByClass(HCatMRTest.class);
job.setJobName("HCatMRTest");

// The key is not used; the value is always an HCatRecord.
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(HCatRecord.class);

job.setMapperClass(HCatMRTest.Map.class);
job.setInputFormatClass(HCatInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);

// Map-only job: no Reducer.
job.setNumReduceTasks(0);

Page 15: HCatalog & Templeton

MapReduce - DB, TBL, Partition

// Partition to write to.
java.util.Map<String, String> partition = ...
partition.put("ds", "20120530");

// Read one partition of rawevents (filter string), write into the ds=20120530 partition of counted.
InputJobInfo in = InputJobInfo.create("DB", "rawevents", "ds='20120530'");
OutputJobInfo out = OutputJobInfo.create("DB", "counted", partition);

HCatInputFormat.setInput(job, in);
HCatOutputFormat.setOutput(job, out);

// Write with the output table's own schema.
HCatSchema s = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, s);

Page 16: HCatalog & Templeton

MapReduce - HCatRecord

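• The class that represents a single record

• Supported types: boolean, byte, short, integer, long, float, double, string, list, struct, map

• tinyint : HCatRecord.getByte

• smallint : HCatRecord.getShort

• Fields can be accessed by index or by column name

• Access by column name requires the HCatSchema

• Reserve space in the record so that the partition columns fit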
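The last three points can be illustrated with a toy model. The sketch below does not use HCatalog classes: `ToySchema` and `ToyRecord` are hypothetical stand-ins, and the assumption that a record is a fixed-size list of positions with a schema mapping column names to those positions is ours, made only to show why name-based access needs a schema and why the record must be sized to include partition columns.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a schema: an ordered list of column names. */
class ToySchema {
    private final List<String> columns;

    ToySchema(String... columns) {
        this.columns = Arrays.asList(columns);
    }

    /** Resolves a column name to its position in the record. */
    int positionOf(String name) {
        int index = columns.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return index;
    }

    int size() {
        return columns.size();
    }
}

/** Hypothetical stand-in for a record: a fixed number of positional fields. */
class ToyRecord {
    private final Object[] fields;

    ToyRecord(int size) {
        this.fields = new Object[size];
    }

    // Access by index needs no schema.
    Object get(int index) { return fields[index]; }
    void set(int index, Object value) { fields[index] = value; }

    // Access by name needs a schema to translate the name into an index.
    Object get(String name, ToySchema schema) { return fields[schema.positionOf(name)]; }
    void set(String name, ToySchema schema, Object value) { fields[schema.positionOf(name)] = value; }

    public static void main(String[] args) {
        // Two data columns plus the partition column "ds": the record needs three slots.
        ToySchema schema = new ToySchema("url", "user", "ds");
        ToyRecord record = new ToyRecord(schema.size());
        record.set("url", schema, "http://example.com/");
        record.set("ds", schema, "20120530");
        System.out.println(record.get(0) + " " + record.get(2));
    }
}
```

With this layout, a record sized for only the two data columns would have no slot for ds, which is the situation the last bullet warns about.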


Page 17: HCatalog & Templeton

MapReduce - HCatRecord

How to obtain the table schema

HCatSchema in = HCatInputFormat.getTableSchema(context);
HCatSchema out = HCatOutputFormat.getTableSchema(context);

// HCatRecord is abstract; DefaultHCatRecord is its concrete implementation.
HCatRecord record = new DefaultHCatRecord(3);
record.set("url", out, value.get("url", in));

context.write(null, record);

The schema information is stored (encoded) in job.xml under these properties:

• mapreduce.lib.hcat.job.info

• mapreduce.lib.hcatoutput.info

Page 18: HCatalog & Templeton

Conclusions

• Even if you use only Pig and MR, metadata management becomes easier

• The benefit is greatest when several tools are used together

• Contributions are arriving quickly, so more can be expected going forward

Page 19: HCatalog & Templeton

Templeton


Page 20: HCatalog & Templeton


Page 21: HCatalog & Templeton

The Templeton project is named after a character in the award-winning children's novel Charlotte's Web, by E. B. White. The novel's protagonist is a pig named Wilbur. Templeton is a rat who helps Wilbur by running errands and making deliveries.

Page 22: HCatalog & Templeton

Templeton

• HCatalog integration

• Thrift

• Java API (HCATALOG-419)

• REST API

• Web services interface for HCatalog access and Pig, Hive, and MR job execution

• http://github.com/hortonworks/templeton

• HCATALOG-182

• a.k.a. 'webhcat'

Page 23: HCatalog & Templeton

Getting started

• Install

◦ Requirements

■ Hadoop 0.20.205 or Hadoop 1.x

■ ZooKeeper

■ HCatalog

■ Hadoop Distributed Cache (to use the Hive, Pig, or hadoop/streaming resources)

• Configuration

◦ templeton-site.xml

• Security

◦ Default security (without additional authentication)

◦ Authentication via Kerberos

Page 24: HCatalog & Templeton

Templeton Resources

:version Returns a list of supported response types.

status Returns the Templeton server status.

version Returns a list of supported versions and the current version.

Page 25: HCatalog & Templeton

Templeton Resources (2)

ddl Performs an HCatalog DDL command.

ddl/database List HCatalog databases.

ddl/database/:db (GET) Describe an HCatalog database.

ddl/database/:db (PUT) Create an HCatalog database.

ddl/database/:db (DELETE) Delete (drop) an HCatalog database.

ddl/database/:db/table List the tables in an HCatalog database.

ddl/database/:db/table/:table (GET) Describe an HCatalog table.

ddl/database/:db/table/:table (POST) Rename an HCatalog table.

ddl/database/:db/table/:table/partition List all partitions in an HCatalog table.

ddl/database/:db/table/:table/partition/:partition (GET) Describe a single partition in an HCatalog table.

ddl/database/:db/table/:table/partition/:partition (PUT)
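The ddl resources follow a regular pattern, so their URLs can be assembled mechanically. The sketch below only builds URL strings and makes no HTTP calls. `TempletonUrls` and its methods are hypothetical names rather than part of Templeton, and the base URL and the `user.name` query parameter are taken from the curl examples in this deck. The path segment is assumed to be spelled `partition`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that assembles Templeton ddl resource URLs as strings. */
class TempletonUrls {
    private final String base; // e.g. "http://tb080:50111/templeton/v1"
    private final String user; // sent as the user.name query parameter

    TempletonUrls(String base, String user) {
        this.base = base;
        this.user = user;
    }

    /** Percent-encodes one value; URLEncoder emits '+' for spaces, so that is rewritten to %20. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Joins the path segments under the base URL and appends user.name. */
    String resource(String... segments) {
        StringBuilder url = new StringBuilder(base);
        for (String segment : segments) {
            url.append('/').append(encode(segment));
        }
        return url.append("?user.name=").append(encode(user)).toString();
    }

    String tables(String db) {
        return resource("ddl", "database", db, "table");
    }

    String table(String db, String table) {
        return resource("ddl", "database", db, "table", table);
    }

    String partitions(String db, String table) {
        return resource("ddl", "database", db, "table", table, "partition");
    }

    public static void main(String[] args) {
        TempletonUrls urls = new TempletonUrls("http://tb080:50111/templeton/v1", "nexr");
        System.out.println(urls.table("default", "emp"));
        System.out.println(urls.partitions("default", "rawevents"));
    }
}
```

The same URL serves several operations; which one runs is selected by the HTTP method (GET, PUT, POST, DELETE), as the table above shows.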


Page 26: HCatalog & Templeton

Templeton Resources (3)

mapreduce/streaming Creates and queues Hadoop streaming MapReduce jobs.

mapreduce/jar Creates and queues standard Hadoop MapReduce jobs.

pig Creates and queues Pig jobs.

hive Runs Hive queries and commands.

queue Returns a list of all jobids registered for the user.

queue/:jobid (GET) Returns the status of a job given its ID.

queue/:jobid (DELETE) Kill a job given its ID.


Page 27: HCatalog & Templeton

Examples

$ curl -s 'http://tb080:50111/templeton/v1/status'

{"status":"ok","version":"v1"}

$ curl -s -d user.name=nexr -d 'exec=show tables;' 'http://tb080:50111/templeton/v1/ddl'

{ "stdout": "emp\nname\nname_a29\n", "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ......//[jar:file:\/home\/nexr\/nexr_platforms\/hadoop\/hadoop-1.0.3\/lib\/slf4j-log4j12-1.4.3.jar!\/org\/slf4j\/impl\/StaticLoggerBinder.class]\nSLF4J: See http:\/\/www.slf4j.org\/codes.html#multiple_bindings for an explanation.\nOK\nTime taken: 0.491 seconds\n", "exitcode": 0}


Page 28: HCatalog & Templeton

Examples

$ curl -s 'http://tb080:50111/templeton/v1/ddl/database/default/table/emp?user.name=nexr'

{ "statement": "use default; desc emp; ", "error": "...", "exec": { "stdout": "{\"columns\":[{\"name\":\"empno\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"deptno\",\"type\":\"int\"}]}\t \t \n", "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... explanation.\nOK\nTime taken: 0.324 seconds\nOK\nTime taken: 0.398 seconds\n", "exitcode": 0 }}


Page 29: HCatalog & Templeton

Examples

$ curl -s -X PUT -HContent-type:application/json -d '{
    "comment": "Test table",
    "columns": [
      { "name": "id", "type": "bigint" },
      { "name": "price", "type": "float", "comment": "The unit price" } ],
    "partitionedBy": [ { "name": "country", "type": "string" } ],
    "format": { "storedAs": "rcfile" } }' \
  'http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=nexr'

hive> show tables;
OK
emp
test_table
Time taken: 0.477 seconds
hive> describe extended test_table;
OK
id bigint
price float The unit price
country string

Detailed Table Information Table(tableName:test_table, dbName:default, owner:nexr, createTime:1342578059, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:bigint, comment:null), FieldSchema(name:price, type:float, comment:The unit price), FieldSchema(name:country, type:string,

Page 30: HCatalog & Templeton

Future of Templeton

• webhcat

• Java API based on REST API

• Integrate or replace existing web interfaces, e.g., WebHDFS