1 Sqoop 2 Introduction Mengwei Ding, Software Engineer Intern at Cloudera.


1

Sqoop 2 Introduction
Mengwei Ding, Software Engineer Intern at Cloudera

2

What is Sqoop

• Apache Top-Level Project

• SQl and hadOOP

• Transfer a large bulk of data

• From relational data warehouses: Teradata, MySQL, PostgreSQL, Oracle, Netezza

• To Hadoop ecosystem: HDFS, Hive, HBase, Avro

• Vice versa

• Sqoop 1 (1.4.3) and Sqoop 2 (1.99.2)

3

Sqoop 1

4

Sqoop 1 Challenges

• Command-line tool, configured with command-line arguments (60+!)

• Connector-driven:
  o Responsible for metadata lookups and data transfer
  o JDBC vocabulary-enforced (--connect)
  o Implicit connector selection

• Non-uniform, duplicated functionality

• Client accesses Hadoop configurations and databases directly

• Security Concerns:
  o Client needs to know credentials to databases

• Type mapping is not clearly defined
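
A minimal sketch of the problem these bullets describe: a Sqoop 1 import is assembled on the client as one long argument list. The flags shown (--connect, --username, --password, --table, --target-dir) are standard Sqoop 1 import arguments; the helper name, host, database, user and paths are hypothetical and only for illustration.

```python
def sqoop1_import_argv(jdbc_url, username, password, table, target_dir):
    """Build the argument list for a Sqoop 1 import launched from the client."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,    # JDBC vocabulary; the URL also drives connector selection
        "--username", username,   # the client has to hold the database credentials
        "--password", password,
        "--table", table,
        "--target-dir", target_dir,
    ]


# Hypothetical values, for illustration only.
argv = sqoop1_import_argv(
    "jdbc:mysql://db.example.com/sales",
    "etl_user",
    "secret",
    "orders",
    "/data/orders",
)
print(" ".join(argv))
```

Even this small import needs five options, and the password sits in plain view on the client, which is the security concern listed above.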

5

Sqoop 2 - Design Goals

• Same goal: transfer data around

• Ease of Use
  o Sqoop as a Service
  o Domain Specific Interactions without too many args

• Ease of Extension
  o No low-level Hadoop knowledge needed
  o Uniform functionality of connectors, no functional overlap between connectors

• Security and Separation of Concerns
  o Role based access and use

6

Sqoop 2 - Design Goals

7

Sqoop 2 - Connection vs Job Metadata

• There are two distinct sets of options
  o Connection (distinct per database)
  o Job (distinct per table)

8

Sqoop 2 - Connection vs Job Metadata

• Two further distinct sets of arguments
  o Connector specific
  o Shared across all connectors
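
The two splits on these slides (connection vs job, connector-specific vs shared) can be pictured as a small data model. This is only a sketch under assumptions: the class names, field names and option keys are hypothetical and are not taken from the Sqoop 2 code base.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Options that are distinct per database."""
    name: str
    connector: str
    connector_options: dict = field(default_factory=dict)  # connector specific
    shared_options: dict = field(default_factory=dict)     # shared across all connectors


@dataclass
class Job:
    """Options that are distinct per table; a job reuses an existing connection."""
    name: str
    connection: Connection
    connector_options: dict = field(default_factory=dict)  # connector specific
    shared_options: dict = field(default_factory=dict)     # shared across all connectors


# Hypothetical values, for illustration only.
warehouse = Connection(
    name="warehouse",
    connector="generic-jdbc",
    connector_options={"url": "jdbc:mysql://db.example.com/sales"},
    shared_options={"max_connections": 10},
)
orders = Job("import-orders", warehouse, {"table": "orders"}, {"output_dir": "/data/orders"})
customers = Job("import-customers", warehouse, {"table": "customers"}, {"output_dir": "/data/customers"})

print(orders.connection.name, customers.connection.name)
```

Both jobs point at the same connection object, so the database details are entered once and each job carries only its per-table settings.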

9

Sqoop 2 - Security

• Support for secure access to external systems via role-based access to connection objects
  o Administrators create/edit/delete connections
  o Operators use connections

• Connections encompass credentials
  o Connection created once, then reused later
  o Created by admin, used by operator, to safeguard credential access from the end user
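
The admin/operator split above can be sketched as a tiny in-memory store. This is an illustration under assumptions, not Sqoop 2's implementation: the class, method names and role strings are hypothetical.

```python
class ConnectionStore:
    """Server-side store: admins manage connections, operators only use them."""

    def __init__(self):
        self._connections = {}

    def create(self, role, name, url, username, password):
        if role != "admin":
            raise PermissionError("only administrators may create connections")
        self._connections[name] = {"url": url, "username": username, "password": password}

    def describe(self, name):
        """What a caller may see about a connection: never the credentials."""
        return {"name": name, "url": self._connections[name]["url"]}

    def run_job(self, role, name, table):
        if role not in ("admin", "operator"):
            raise PermissionError("role may not use connections")
        conn = self._connections[name]  # credentials are read on the server only
        assert conn["password"] is not None
        return f"importing {table} through connection '{name}'"


# Hypothetical values, for illustration only.
store = ConnectionStore()
store.create("admin", "warehouse", "jdbc:mysql://db.example.com/sales", "etl_user", "secret")
print(store.describe("warehouse"))
print(store.run_job("operator", "warehouse", "orders"))
```

The operator refers to the connection by name and never handles the password, which is the safeguard the slide describes.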

10

Sqoop 2 - Resource Management

• Connections allow specification of resource policy
  o Administrator can limit the total number of physical connections open at one time
  o Connections can be disabled
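
A resource policy of this kind might look like the following sketch. It assumes a simple counter per connection object; the class and attribute names are hypothetical, and how Sqoop 2 actually enforces the limit is not stated on the slide.

```python
class ManagedConnection:
    """Connection object carrying a resource policy set by an administrator."""

    def __init__(self, name, max_open):
        self.name = name
        self.max_open = max_open  # cap on physical connections open at one time
        self.open_count = 0
        self.enabled = True

    def acquire(self):
        """Open one physical connection if the policy allows it."""
        if not self.enabled:
            raise RuntimeError(f"connection '{self.name}' is disabled")
        if self.open_count >= self.max_open:
            raise RuntimeError(f"connection '{self.name}' reached its limit of {self.max_open}")
        self.open_count += 1

    def release(self):
        """Close one physical connection."""
        if self.open_count > 0:
            self.open_count -= 1


# Hypothetical values, for illustration only.
warehouse = ManagedConnection("warehouse", max_open=2)
warehouse.acquire()
warehouse.acquire()
print(warehouse.open_count)
```

A third `acquire()` would raise until one of the open connections is released, and setting `enabled = False` blocks all further use.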

11

Sqoop 2 - Current Status

• Primary focus of Sqoop community

• Second cut: 1.99.2
  o Bits and docs: http://sqoop.apache.org

12

Demo Time

13

Thank You!