Web Services in Hadoop
Nicholas Sze and Alan F. Gates @szetszwo, @alanfgates
REST-ful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat clients in a gateway
• Insulation from interface changes release to release
[Diagram: HCatalog and its web interfaces connect MapReduce, Pig, and Hive to HDFS, HBase, and external stores]
Not Covered in this Talk
• HttpFS (fka Hoop) – same API as WebHDFS, but proxied
• Stargate – REST API for HBase
HDFS Clients
• DFSClient: the native client
  – High performance (uses Hadoop RPC)
  – Java binding only
• libhdfs: a C/C++ client interface
  – Uses JNI => large overhead
  – Still bound to Java (requires a Hadoop installation)
HFTP
• Designed for cross-version copying (DistCp)
  – High performance (using HTTP)
  – Read-only
  – The HTTP API is proprietary
  – Clients must use HftpFileSystem (hftp://)
• WebHDFS is a rewrite of HFTP
Design Goals
• Support a public HTTP API
• Support Read and Write
• High Performance
• Cross-version
• Security
WebHDFS features
• HTTP REST API
  – Defines a public API
  – Permits non-Java client implementations
  – Supports common tools like curl/wget
• Wire Compatibility
  – The REST API will be maintained for wire compatibility
  – WebHDFS clients can talk to different Hadoop versions
WebHDFS features (2)
• A Complete HDFS Interface
  – Supports all user operations: reading files, writing to files, mkdir, chmod, chown, mv, rm, …
• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads/writes are redirected to the corresponding datanodes (a write sketch follows this list)
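A rough sketch of the redirected write path (the hostname, ports, and file path match the illustrative examples later in this deck; exact query parameters vary by release). Creating a file takes two steps: the namenode answers the first PUT with a redirect, and the file data is then sent directly to a datanode:

$curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=CREATE"
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=CREATE&...

$curl -i -X PUT -T w.txt "http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=CREATE&..."
HTTP/1.1 201 Created

Metadata operations such as mkdir go to the namenode only, e.g. op=MKDIRS, which returns {"boolean": true} on success.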
WebHDFS features (3)
• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens (a curl sketch follows this list)
  – Supports proxy users
• An HDFS Built-in Component
  – WebHDFS is a first-class, built-in component of HDFS
  – Runs inside Namenodes and Datanodes
• Apache Open Source
  – Available in Apache Hadoop 1.0 and above
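A minimal sketch of the secure path, assuming a Kerberos-enabled cluster and a curl build with SPNEGO support (hostname and path are illustrative; the token value is elided):

$kinit
$curl -i --negotiate -u : "http://namenode:50070/webhdfs/v1/?op=GETDELEGATIONTOKEN"
{"Token": {"urlString": "..."}}
$curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&delegation=<token-url-string>"

Later calls can pass the delegation token instead of re-authenticating via SPNEGO.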
WebHDFS URI & URL
• FileSystem scheme: webhdfs://
• FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
– Path prefix: /webhdfs/v1
– Query: ?op=..
URI/URL Examples
• Suppose we have the following file:
  hdfs://namenode:8020/user/szetszwo/w.txt
• WebHDFS FileSystem URI:
  webhdfs://namenode:50070/user/szetszwo/w.txt
• WebHDFS HTTP URL:
  http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..
• WebHDFS HTTP URL to open the file:
  http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
Example: curl
• Use curl to open a file
$curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)
Example: curl (2)
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)
Hello, WebHDFS user!
Example: wget
• Use wget to open the same file
$wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt
Resolving ...
Connecting to ... connected.
HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]
Example: wget (2)
--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'
100%[=================>] 21 --.-K/s in 0s
2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]
Example: Firefox
[Screenshot: the OPEN URL from the previous slides loaded in Firefox]
HCatalog REST API
• REST endpoints: databases, tables, partitions, columns, table properties (a sample GET follows this list)
• PUT to create/update, GET to list or describe, DELETE to drop
• Uses JSON to describe metadata objects
• Versioned, because we assume we will have to update it:
  http://hadoop.acme.com/templeton/v1/…
• Runs in a Jetty server
• Supports security
  – Authentication done via Kerberos using SPNEGO
• Included in HDP; runs on the Thrift metastore server machine
• Not yet checked in, but you can find the code on Apache's JIRA: HCATALOG-182
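For instance, listing the databases the server knows about is a plain GET against the versioned ddl endpoint (host as in the example URL above; the response shape is approximate):

GET http://hadoop.acme.com/templeton/v1/ddl/database?user.name=gates
{ "databases": ["default", …] }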
HCatalog REST API
Get a list of all tables in the default database:
GET http://…/v1/ddl/database/default/table
{ "tables": ["counted","processed",], "database": "default" }
Indicate user with URL parameter: http://…/v1/ddl/database/default/table?user.name=gates
Actions authorized as indicated user
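The same request from the shell with curl (host as in the example URL above):

$curl -s "http://hadoop.acme.com/templeton/v1/ddl/database/default/table?user.name=gates"
{ "tables": ["counted","processed"], "database": "default" }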
HCatalog REST API
Create new table “rawevents”
PUT {"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string"}], "partitionedBy": [{ "name": "ds", "type": "string" }]} http://…/v1/ddl/database/default/table/rawevents
{ "table": "rawevents", "database": "default” }
HCatalog REST API
Describe table “rawevents”
GET http://…/v1/ddl/database/default/table/rawevents
{ "columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}], "database": "default", "table": "rawevents" }
Job Management
• Includes APIs to submit and monitor jobs
• Any files needed for the job are first uploaded to HDFS via WebHDFS
  – Pig and Hive scripts
  – Jars, Python scripts, or Ruby scripts for UDFs
  – Pig macros
• Results from the job are stored to HDFS and can be retrieved via WebHDFS
• User is responsible for cleaning up output in HDFS
• Job state information is stored in ZooKeeper or HDFS
Job Submission
• Can submit MapReduce, Pig, and Hive jobs
• POST parameters include:
  – script to run, or HDFS file containing the script/jar to run
  – username to execute the job as
  – optionally, an HDFS directory to write results to (defaults to the user's home directory)
  – optionally, a URL to invoke GET on when the job is done
POST http://hadoop.acme.com/templeton/v1/pig
{"id": "job_201111111311_0012",…}
Find all Your Jobs
• GET on queue returns all jobs belonging to the submitting user
• Pig, Hive, and MapReduce jobs will be returned
GET http://…/templeton/v1/queue?user.name=gates
{"job_201111111311_0008", "job_201111111311_0012"}
Get Status of a Job
• Doing a GET on jobid gets you information about a particular job
• Can be used to poll to see if the job is finished
• Used after the job is finished to get job information
• Doing a DELETE on jobid kills the job (see the curl sketch below)
GET http://…/templeton/v1/queue/job_201111111311_0012
{…, "percentComplete": "100% complete", "exitValue": 0,… "completed": "done" }
Future
• Job management
  – Job management APIs don't belong in HCatalog
  – Only there by historical accident
  – Need to move them out to the MapReduce framework
• Authentication needs more options than Kerberos
• Integration with Oozie
• Need a directory service
  – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission