Web Services in Hadoop
Nicholas Sze and Alan F. Gates @szetszwo, @alanfgates
REST-ful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat clients in a gateway
• Insulation from interface changes release to release
[Diagram: HCatalog and its web interfaces connect MapReduce, Pig, and Hive to HDFS, HBase, and external stores]
Not Covered in this Talk
• HttpFS (fka Hoop) – same API as WebHDFS, but proxied
• Stargate – REST API for HBase
HDFS Clients
• DFSClient: the native client
  – High performance (uses Hadoop RPC)
  – Java binding only
• libhdfs: a C/C++ client interface
  – Uses JNI => large overhead
  – Still bound to Java (requires a Hadoop installation)
HFTP
• Designed for cross-version copying (DistCp)
  – High performance (using HTTP)
  – Read-only
  – The HTTP API is proprietary
  – Clients must use HftpFileSystem (hftp://)
• WebHDFS is a rewrite of HFTP
Design Goals
• Support a public HTTP API
• Support Read and Write
• High Performance
• Cross-version
• Security
WebHDFS features
• HTTP REST API
  – Defines a public API
  – Permits non-Java client implementations
  – Supports common tools like curl/wget
• Wire Compatibility
  – The REST API will be maintained for wire compatibility
  – WebHDFS clients can talk to different Hadoop versions
WebHDFS features (2)
• A Complete HDFS Interface
  – Supports all user operations: reading files, writing to files, mkdir, chmod, chown, mv, rm, …
• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads/writes are redirected to the corresponding datanodes (a write sketch follows this list)
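A rough sketch of the redirected write path (the hostname, ports, and file path match the illustrative examples later in this deck; exact query parameters vary by release). Creating a file takes two steps: the namenode answers the first PUT with a redirect, and the file data is then sent directly to a datanode:

$curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=CREATE"
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=CREATE&...

$curl -i -X PUT -T w.txt "http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=CREATE&..."
HTTP/1.1 201 Created

Metadata operations such as mkdir go to the namenode only, e.g. op=MKDIRS, which returns {"boolean": true} on success.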
WebHDFS features (3)
• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens (a curl sketch follows this list)
  – Supports proxy users
• An HDFS Built-in Component
  – WebHDFS is a first-class, built-in component of HDFS
  – Runs inside Namenodes and Datanodes
• Apache Open Source
  – Available in Apache Hadoop 1.0 and above
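A minimal sketch of the secure path, assuming a Kerberos-enabled cluster and a curl build with SPNEGO support (hostname and path are illustrative; the token value is elided):

$kinit
$curl -i --negotiate -u : "http://namenode:50070/webhdfs/v1/?op=GETDELEGATIONTOKEN"
{"Token": {"urlString": "..."}}
$curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&delegation=<token-url-string>"

Later calls can pass the delegation token instead of re-authenticating via SPNEGO.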
WebHDFS URI & URL
• FileSystem scheme: webhdfs://
• FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
– Path prefix: /webhdfs/v1
– Query: ?op=..
URI/URL Examples
• Suppose we have the following file:
  hdfs://namenode:8020/user/szetszwo/w.txt
• WebHDFS FileSystem URI:
  webhdfs://namenode:50070/user/szetszwo/w.txt
• WebHDFS HTTP URL:
  http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..
• WebHDFS HTTP URL to open the file:
  http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
Example: curl
• Use curl to open a file
$curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)
Example: curl (2)
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)
Hello, WebHDFS user!
Example: wget
• Use wget to open the same file
$wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt
Resolving ...
Connecting to ... connected.
HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]
Example: wget (2)
--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'
100%[=================>] 21 --.-K/s in 0s
2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]
Example: Firefox
[Screenshot: the OPEN URL from the previous slides loaded in Firefox]
HCatalog REST API
• REST endpoints: databases, tables, partitions, columns, table properties (a sample GET follows this list)
• PUT to create/update, GET to list or describe, DELETE to drop
• Uses JSON to describe metadata objects
• Versioned, because we assume we will have to update it:
  http://hadoop.acme.com/templeton/v1/…
• Runs in a Jetty server
• Supports security
  – Authentication done via Kerberos using SPNEGO
• Included in HDP; runs on the Thrift metastore server machine
• Not yet checked in, but you can find the code on Apache's JIRA: HCATALOG-182
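For instance, listing the databases the server knows about is a plain GET against the versioned ddl endpoint (host as in the example URL above; the response shape is approximate):

GET http://hadoop.acme.com/templeton/v1/ddl/database?user.name=gates
{ "databases": ["default", …] }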
HCatalog REST API
Get a list of all tables in the default database:
GET http://…/v1/ddl/database/default/table
{ "tables": ["counted","processed",], "database": "default" }
Indicate user with URL parameter: http://…/v1/ddl/database/default/table?user.name=gates
Actions authorized as indicated user
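The same request from the shell with curl (host as in the example URL above):

$curl -s "http://hadoop.acme.com/templeton/v1/ddl/database/default/table?user.name=gates"
{ "tables": ["counted","processed"], "database": "default" }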
HCatalog REST API
Create new table “rawevents”
PUT {"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string"}], "partitionedBy": [{ "name": "ds", "type": "string" }]} http://…/v1/ddl/database/default/table/rawevents
{ "table": "rawevents", "database": "default” }
HCatalog REST API
Describe table “rawevents”
GET http://…/v1/ddl/database/default/table/rawevents
{ "columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}], "database": "default", "table": "rawevents" }
Job Management
• Includes APIs to submit and monitor jobs
• Any files needed for the job are first uploaded to HDFS via WebHDFS
  – Pig and Hive scripts
  – Jars, Python scripts, or Ruby scripts for UDFs
  – Pig macros
• Results from the job are stored to HDFS and can be retrieved via WebHDFS
• User is responsible for cleaning up output in HDFS
• Job state information is stored in ZooKeeper or HDFS
Job Submission
• Can submit MapReduce, Pig, and Hive jobs
• POST parameters include:
  – script to run, or HDFS file containing the script/jar to run
  – username to execute the job as
  – optionally, an HDFS directory to write results to (defaults to the user's home directory)
  – optionally, a URL to invoke GET on when the job is done
POST http://hadoop.acme.com/templeton/v1/pig
{"id": "job_201111111311_0012",…}
Find all Your Jobs
• GET on queue returns all jobs belonging to the submitting user
• Pig, Hive, and MapReduce jobs will be returned
GET http://…/templeton/v1/queue?user.name=gates
{"job_201111111311_0008", "job_201111111311_0012"}
Get Status of a Job
• Doing a GET on jobid gets you information about a particular job
• Can be used to poll to see if the job is finished
• Used after the job is finished to get job information
• Doing a DELETE on jobid kills the job (see the curl sketch below)
GET http://…/templeton/v1/queue/job_201111111311_0012
{…, "percentComplete": "100% complete", "exitValue": 0,… "completed": "done" }
Future
• Job management
  – Job management APIs don't belong in HCatalog
  – Only there by historical accident
  – Need to move them out to the MapReduce framework
• Authentication needs more options than Kerberos
• Integration with Oozie
• Need a directory service
  – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission