HCatalog & Templeton
Youngwoo Kim ([email protected], kt.com)
Daegeun Kim ([email protected])
Data Analytics Platform, KTCloudware (NexR)
Wednesday, July 18, 12
HCatalog
Hadoop Ecosystem (many data processing tools)
[Diagram: MapReduce, Hive, and Pig on top of the Filesystem. Labels: Metastore (RDBMS), InputFormat / OutputFormat / ..., SerDe, LoadFunc, StoreFunc.]
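Problems
• Nothing except Hive has a metastore
• Integration is hard when several tools share one cluster
• Every hand-off costs a round of communication
• Where is the data? How is it stored? What is in it?
• M/R and Pig users have a lot of information to remember
• A change to the schema, data path, or format has a wide impact on M/R and Pig jobs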
HCatalog
• Apache Incubator project
• Built on the Hive metastore
• Gives M/R and Pig users a programming interface for reading and writing data
• Provides every DDL command that does not need a MapReduce job (CLI commands)
• Excludes import/export, CREATE TABLE AS SELECT, and similar commands
• Provides data exploration
• SHOW TABLES and DESCRIBE are available
• http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html
• Developed by Hortonworks, Yahoo, Twitter, and others
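Table abstraction
• Metadata
• Data location, schema, compression, partitions, format, and so on
• HCatalog uses this metadata to abstract the data
• Metadata is managed in one place, which makes that component correspondingly critical
• Supported column types: primitives, map, list, struct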
HCatalog
[Diagram: MapReduce, Hive, and Pig on top of the Filesystem, with HCatalog in between. Labels: HCatInputFormat / HCatOutputFormat, HCatLoader / HCatStorer, InputFormat / OutputFormat, SerDe, Metastore (RDBMS).]
Data types : Pig
HCatalog (= Hive) -> Pig
primitives (int, long, float, double, string) -> int, long, float, double, chararray
map (contains key and value pairs) -> map
list (contains a list of elements of the same data type) -> bag
struct (contains elements of different data types) -> tuple
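The type mapping above can be expressed as a small lookup. The following is only a sketch in Python: the names `HCAT_TO_PIG`, `pig_type`, and `pig_columns` are hypothetical helpers for illustration and are not part of HCatalog or Pig, and the table covers only the types listed on the slide.

```python
# Hypothetical illustration of the HCatalog -> Pig type mapping listed above.
# None of these names belong to HCatalog or Pig.
HCAT_TO_PIG = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "map": "map",
    "list": "bag",
    "struct": "tuple",
}


def pig_type(hcat_type):
    """Return the Pig type name for an HCatalog type name from the slide."""
    # Drop any type parameters, e.g. "list<string>" -> "list".
    base = hcat_type.strip().lower().split("<", 1)[0]
    if base not in HCAT_TO_PIG:
        raise ValueError("no Pig mapping listed for {!r}".format(hcat_type))
    return HCAT_TO_PIG[base]


def pig_columns(columns):
    """Map (name, hcat_type) pairs to (name, pig_type) pairs."""
    return [(name, pig_type(hcat_type)) for name, hcat_type in columns]


print(pig_columns([("url", "string"), ("user", "string"), ("tags", "list<string>")]))
```

Types outside the slide's list raise an error rather than being guessed.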
Examples
DDL
$HCAT_HOME/bin/hcat -e "drop table if exists rawevents; create external table rawevents (url string, user string) partitioned by (ds string)"
$HIVE_HOME/bin/hive -e "LOAD DATA LOCAL INPATH '...' OVERWRITE INTO TABLE rawevents PARTITION (ds='20120530')"
Pig
raw = LOAD '/data/rawevents/20120530' AS (url, user);
botless = FILTER raw BY myudfs.NotABot(user);
grpd = GROUP botless by (url, user);
cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);
STORE cntd INTO '/data/counted/20120530';
Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 (page 8)
Pig + HCatalog
Pig:
raw = LOAD '/data/rawevents/20120530' AS (url, user);
Pig + HCatalog:
raw = LOAD 'rawevents' using org.apache.hcatalog.pig.HCatLoader();
Pig:
STORE cntd INTO '/data/counted/20120530';
Pig + HCatalog:
STORE cntd INTO 'counted' using org.apache.hcatalog.pig.HCatStorer();
Pig:
LOAD '/data/rawevents/20120530'
Pig + HCatalog (Partition Filter):
raw_0530 = FILTER raw BY ds == '20120530';
Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 (page 8)
MapReduce
• Use the HCatInputFormat and HCatOutputFormat classes
• The value class is HCatRecord by default
• The key is not used
• Set OutputValueClass to HCatRecord
• As always, the Reducer is optional, not required
• Partitions can be controlled
• The schema makes this control easy
MapReduce - Job
Job job = new Job(getConf());
job.setJarByClass(HCatMRTest.class);
job.setJobName("HCatMRTest");
// The key is unused; the value is always an HCatRecord.
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(HCatRecord.class);
job.setMapperClass(HCatMRTest.Map.class);
job.setInputFormatClass(HCatInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);
// Map-only job: no reducer.
job.setNumReduceTasks(0);
MapReduce - DB, TBL, Partition
java.util.Map<String, String> partition = ...
partition.put("ds", "20120530");
// Read only the ds='20120530' partition; write into the same partition of the output table.
InputJobInfo in = InputJobInfo.create("DB", "rawevents", "ds='20120530'");
OutputJobInfo out = OutputJobInfo.create("DB", "counted", partition);
HCatInputFormat.setInput(job, in);
HCatOutputFormat.setOutput(job, out);
HCatSchema s = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, s);
MapReduce - HCatRecord
• The class used for each record
• boolean, byte, short, integer, long, float, double, string, list, struct, map
• tinyint : HCatRecord.getByte
• smallint : HCatRecord.getShort
• Fields can be accessed by index or by column name
• Access by column name requires the HCatSchema
• Reserve slots so the partition columns fit in the record
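The two access styles in the bullets above (by index, or by column name with a schema) can be pictured with a small model. This is a sketch in Python, not the HCatalog API: the `Schema` and `Record` classes and their methods are hypothetical stand-ins for the roles HCatSchema and HCatRecord play, and the column names are taken from the rawevents example.

```python
# Hypothetical model of index-based vs. name-based field access.
# These classes only imitate the roles of HCatSchema and HCatRecord.
class Schema:
    """Ordered column names with a name -> position lookup."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        self._position = {name: i for i, name in enumerate(self.column_names)}

    def position(self, name):
        return self._position[name]


class Record:
    """Fixed-size positional record."""

    def __init__(self, size):
        self._values = [None] * size

    def get(self, index):
        return self._values[index]

    def set(self, index, value):
        self._values[index] = value

    def get_by_name(self, name, schema):
        # A name only means something once the schema gives its position.
        return self._values[schema.position(name)]

    def set_by_name(self, name, schema, value):
        self._values[schema.position(name)] = value


# Two data columns plus one slot reserved for the partition column "ds".
schema = Schema(["url", "user", "ds"])
record = Record(len(schema.column_names))
record.set(0, "http://example.com/")
record.set_by_name("user", schema, "alice")
record.set_by_name("ds", schema, "20120530")
print(record.get_by_name("url", schema), record.get(1), record.get(2))
```

The record itself stores only positions, which is why name-based access needs the schema passed in on every call.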
MapReduce - HCatRecord
How to obtain the table schema
HCatSchema in = HCatInputFormat.getTableSchema(context);
HCatSchema out = HCatOutputFormat.getTableSchema(context);
// HCatRecord is abstract; DefaultHCatRecord is the concrete class.
HCatRecord record = new DefaultHCatRecord(3);
record.set("url", out, value.get("url", in));
context.write(null, record);
The schema information is recorded (encoded) in job.xml under:
* mapreduce.lib.hcat.job.info
* mapreduce.lib.hcatoutput.info
Conclusions
• Metadata management gets easier even if you use only Pig and MR
• The benefit shows most when several tools are used together
• Contributions are landing quickly, so there is more to look forward to
Templeton
The Templeton project is named after a character in the award-winning children's novel Charlotte's Web, by E. B. White. The novel's protagonist is a pig named Wilbur. Templeton is a rat who helps Wilbur by running errands and making deliveries.
Templeton
• HCatalog integration
• Thrift
• Java API (HCATALOG-419)
• REST API
• Web services interface for HCatalog access and for Pig, Hive, and MR job execution
• http://github.com/hortonworks/templeton
• HCATALOG-182
• a.k.a. 'webhcat'
Getting started
• Install
◦ Requirements
■ Hadoop 0.20.205 or Hadoop 1.x
■ Zookeeper
■ HCatalog
■ Hadoop Distributed Cache
■ To use the Hive, Pig, or hadoop/streaming resources
• Configuration
◦ templeton-site.xml
• Security
◦ Default security (without additional authentication)
◦ Authentication via Kerberos
Templeton Resources
:version Returns a list of supported response types.
status Returns the Templeton server status.
version Returns a list of supported versions and the current version.
Templeton Resources (2)
ddl Performs an HCatalog DDL command.
ddl/database List HCatalog databases.
ddl/database/:db (GET) Describe an HCatalog database.
ddl/database/:db (PUT) Create an HCatalog database.
ddl/database/:db (DELETE) Delete (drop) an HCatalog database.
ddl/database/:db/table List the tables in an HCatalog database.
ddl/database/:db/table/:table (GET) Describe an HCatalog table.
ddl/database/:db/table/:table (POST) Rename an HCatalog table.
ddl/database/:db/table/:table/partition List all partitions in an HCatalog table.
ddl/database/:db/table/:table/partition/:partition (GET) Describe a single partition in an HCatalog table.
ddl/database/:db/table/:table/partition/:partition (PUT)
...
Templeton Resources (3)
mapreduce/streaming Creates and queues Hadoop streaming MapReduce jobs.
mapreduce/jar Creates and queues standard Hadoop MapReduce jobs.
pig Creates and queues Pig jobs.
hive Runs Hive queries and commands.
queue Returns a list of all jobids registered for the user.
queue/:jobid (GET) Returns the status of a job given its ID.
queue/:jobid (DELETE) Kill a job given its ID.
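The resource paths listed above are appended to a server address and API version to form a request URL. A minimal sketch in Python, assuming the `/templeton/v1/` prefix, port 50111, and `user.name` query parameter that this deck's curl examples use. The function names `table_resource` and `templeton_url` are hypothetical, and the host `tb080` is simply the one from those examples.

```python
from urllib.parse import quote, urlencode


def table_resource(db, table):
    """Build the ddl/database/:db/table/:table resource path."""
    return "ddl/database/{}/table/{}".format(quote(db, safe=""), quote(table, safe=""))


def templeton_url(host, resource, user=None, port=50111, version="v1"):
    """Join host, port, API version, and resource path into a request URL."""
    url = "http://{}:{}/templeton/{}/{}".format(host, port, version, resource)
    if user is not None:
        # The curl examples in this deck pass the caller as user.name.
        url += "?" + urlencode({"user.name": user})
    return url


print(templeton_url("tb080", "status"))
print(templeton_url("tb080", table_resource("default", "emp"), user="nexr"))
```

The sketch only builds URLs. The HTTP method (GET, PUT, POST, DELETE) still selects the operation, as the resource tables show.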
Examples
$ curl -s 'http://tb080:50111/templeton/v1/status'
{"status":"ok","version":"v1"}
$ curl -s -d user.name=nexr -d 'exec=show tables;' 'http://tb080:50111/templeton/v1/ddl'
{ "stdout": "emp\nname\nname_a29\n", "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ......//[jar:file:\/home\/nexr\/nexr_platforms\/hadoop\/hadoop-1.0.3\/lib\/slf4j-log4j12-1.4.3.jar!\/org\/slf4j\/impl\/StaticLoggerBinder.class]\nSLF4J: See http:\/\/www.slf4j.org\/codes.html#multiple_bindings for an explanation.\nOK\nTime taken: 0.491 seconds\n", "exitcode": 0}
Examples
$ curl -s 'http://tb080:50111/templeton/v1/ddl/database/default/table/emp?user.name=nexr'
{ "statement": "use default; desc emp; ", "error": "...", "exec": { "stdout": "{\"columns\":[{\"name\":\"empno\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"deptno\",\"type\":\"int\"}]}\t \t \n", "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... explanation.\nOK\nTime taken: 0.324 seconds\nOK\nTime taken: 0.398 seconds\n", "exitcode": 0 }}
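In the describe-table response above, the column list arrives as a JSON string nested inside `exec.stdout`, so a client has to parse twice. The following Python sketch assumes only the response shape shown in that example. The function name `table_columns` is hypothetical, and the sample is the slide's response with `stderr` emptied and the `error` field left out.

```python
import json


def table_columns(response_text):
    """Extract (name, type) pairs from a describe-table response."""
    response = json.loads(response_text)
    exec_result = response["exec"]
    if exec_result["exitcode"] != 0:
        raise RuntimeError(exec_result.get("stderr", "describe failed"))
    # stdout is itself a JSON document, padded with tabs and a newline.
    described = json.loads(exec_result["stdout"].strip())
    return [(col["name"], col["type"]) for col in described["columns"]]


# Trimmed copy of the response from the example above.
sample = json.dumps({
    "statement": "use default; desc emp; ",
    "exec": {
        "stdout": '{"columns":[{"name":"empno","type":"int"},'
                  '{"name":"name","type":"string"},'
                  '{"name":"deptno","type":"int"}]}\t \t \n',
        "stderr": "",
        "exitcode": 0,
    },
})
print(table_columns(sample))
```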
Examples
$ curl -s -X PUT -H 'Content-type: application/json' -d '{ "comment": "Test table", "columns": [ { "name": "id", "type": "bigint" }, { "name": "price", "type": "float", "comment": "The unit price" } ], "partitionedBy": [ { "name": "country", "type": "string" } ], "format": { "storedAs": "rcfile" } }' \
  'http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=nexr'
hive> show tables;
OK
emp
test_table
Time taken: 0.477 seconds
hive> describe extended test_table;
OK
id bigint
price float The unit price
country string
Detailed Table Information Table(tableName:test_table, dbName:default, owner:nexr, createTime:1342578059, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:bigint, comment:null), FieldSchema(name:price, type:float, comment:The unit price), FieldSchema(name:country, type:string,
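The JSON body of the create-table request above can be assembled programmatically rather than typed inline. A sketch in Python, using only the fields that appear in that request (`comment`, `columns`, `partitionedBy`, `format.storedAs`). The helper names `field` and `table_description` are hypothetical, and the sketch makes no claim about other fields the service may accept.

```python
import json


def field(name, type_, comment=None):
    """One column or partition-key entry, as in the request body above."""
    entry = {"name": name, "type": type_}
    if comment is not None:
        entry["comment"] = comment
    return entry


def table_description(columns, partitioned_by=None, stored_as=None, comment=None):
    """Serialize a table description using only the fields shown in the example."""
    body = {}
    if comment is not None:
        body["comment"] = comment
    body["columns"] = list(columns)
    if partitioned_by:
        body["partitionedBy"] = list(partitioned_by)
    if stored_as is not None:
        body["format"] = {"storedAs": stored_as}
    return json.dumps(body)


payload = table_description(
    columns=[field("id", "bigint"), field("price", "float", "The unit price")],
    partitioned_by=[field("country", "string")],
    stored_as="rcfile",
    comment="Test table",
)
print(payload)
```

The resulting string would be sent as the PUT body with the `Content-type: application/json` header, as in the curl call.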
Future of Templeton
• webhcat
• Java API based on REST API
• Integrate or replace existing web interfaces, e.g., WebHDFS
References
• Apache HCatalog (Incubating), http://incubator.apache.org/hcatalog/
• HCatalog, http://www.slideshare.net/ydn/jan-2012-hug-hcatalog
• Future of HCatalog, http://www.slideshare.net/hortonworks/future-of-hcatalog-hadoop-summit-2012
• Introduction to HCatalog, http://geekdani.wordpress.com/2012/07/11/introduction-to-hcatalog/
• Installing HCatalog and linking Hive & Pig schemas through HCatalog, http://mixellaneous.tistory.com/1123