Apache Sqoop


Page 1: Apache sqoop

APACHE SQOOP by Ashish Tiwari

Page 2: Apache sqoop

What is Apache Sqoop?

● Sqoop is a Hadoop component built on top of HDFS.

● It is an open source tool that allows users to extract data from a structured data store into Hadoop for further processing.

● Structured Data Store ⇔ RDBMS
  ○ MySQL
  ○ Oracle
  ○ SQL Server
  ○ PostgreSQL
  ○ etc.

Page 3: Apache sqoop

3 key points of Apache Sqoop

1. Sqoop, whether used for import or export, works only with HDFS.

2. To interact with any RDBMS using Sqoop, the target RDBMS must be Java compatible (i.e. provide a JDBC driver).

3. The JDBC jar location is important: $SQOOP_HOME/lib
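For example, the MySQL JDBC connector jar can be copied into Sqoop's lib directory (the exact jar name depends on the connector version you have downloaded; the one below is only illustrative):

$ cp mysql-connector-java-5.1.49-bin.jar $SQOOP_HOME/lib/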

Page 4: Apache sqoop

Apache SQOOP Commands -> Basic Commands (import & export)

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.

Syntax: $ sqoop tool-name [tool-arguments]

Example: $ sqoop version OR $ sqoop-version

You can display help for a specific tool by entering: sqoop help (tool-name); for example, sqoop help import. You can also add the --help argument to any command: sqoop import --help.

Page 5: Apache sqoop

SQOOP common general arguments:

Syntax: $ sqoop <<tool-name>> (generic-args)
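As a sketch (not from the slides): Hadoop generic arguments such as -D must come before the tool-specific arguments, while common arguments like --connect, --username and -P follow them. For example, setting a job name while importing from the classicmodels database used later in these slides:

$ sqoop import -D mapreduce.job.name=sqoop_employees_import \
  --connect "jdbc:mysql://localhost/classicmodels" \
  --username root -P \
  --table employees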

Page 6: Apache sqoop

How to list all databases in RDBMS:

Syntax: $ sqoop list-databases --connect "<<jdbc-url>>"

Example: $ sqoop list-databases --connect "jdbc:mysql://localhost" \
  --username "root" --password "<<MySQL password>>"

How to list all tables present in Database:

Syntax: $ sqoop list-tables --connect "<<jdbc-url>>/<<DB name>>"

Example: $ sqoop list-tables --connect "jdbc:mysql://localhost/sample" \
  --username "root" -P

SQOOP list-databases & list-tables:

Note: Both --password and -P supply the database password. The difference is that --password is insecure (other users can read the password, for example from the process listing or shell history), while -P reads the password from the console and is therefore secure.
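Another option, not shown on this slide but supported by Sqoop, is --password-file, which reads the password from a file (typically stored with restricted permissions); the path below is only a placeholder:

$ sqoop list-databases --connect "jdbc:mysql://localhost" \
  --username root \
  --password-file /user/<<username>>/mysql.password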

Page 7: Apache sqoop

SQOOP eval:
• The eval tool allows users to quickly run simple SQL queries against a database; results are printed to the console.
• This allows users to preview their import queries to ensure they import the data they expect.

Syntax: $ sqoop eval (generic-args) (eval-args)

// eval-args <=> --query "<<SQL Query>>" or -e "<<SQL Query>>" (execute a SQL statement)

How to retrieve all columns from a table in the RDBMS using Apache Sqoop:

$ sqoop eval \
  --connect "jdbc:mysql://localhost/<<DB Name>>" \
  --username "<<Username>>" -P \
  --query "select * from <<table-name>> where <<column-name>> = '<<value>>'"
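As a concrete sketch (assuming the classicmodels sample database used later in these slides), a quick row count with eval looks like:

$ sqoop eval \
  --connect "jdbc:mysql://localhost/classicmodels" \
  --username root -P \
  --query "SELECT COUNT(*) FROM employees"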

Page 8: Apache sqoop

SQOOP IMPORT:
• The import tool imports an individual table from an RDBMS to HDFS.
• Each row from a table is represented as a separate record in HDFS.
• Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

Syntax: $ sqoop import (generic-args) (import-args)
        $ sqoop-import (generic-args) (import-args)

Page 9: Apache sqoop

Example: $ sqoop import --connect "jdbc:mysql://localhost:3306/<<DB-Name>>" \
  --username <<username>> -P \
  --table <<table-name>> -m <<no. of mappers>> \
  --target-dir <<hdfs-dir-path>> \
  --fields-terminated-by <<delimiter>>

--table <table-name> Table to read

--target-dir <dir> HDFS destination dir

-m,--num-mappers <n> Use n map tasks to import in parallel

--fields-terminated-by <char> Sets the field separator character

Page 10: Apache sqoop

2 Ways of Importing Data into HDFS using Apache SQOOP

1. Selecting the Data to Import

2. Free-form Query Imports

• Sqoop typically imports data in a table-centric fashion.

• Use the --table argument to select the table to import.

• The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause.

• Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results

Page 11: Apache sqoop

Selecting the Data to Import Example:

$ sqoop import \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --table employees -m 2 \
  --columns "employeeNumber,lastName,firstName,email,jobTitle" \
  --where "jobTitle='Sales Rep'" \
  --fields-terminated-by "|" \
  --as-textfile \
  --target-dir /sqoop/nye/classicmodels/empl-select

Free-form Query Imports Example:

$ sqoop import \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --query "select * from employees where jobTitle='Sales Rep' AND \$CONDITIONS" -m 3 \
  --split-by employees.employeeNumber \
  --target-dir /sqoop/nye/classicmodels/empl-freeform2 \
  --as-avrodatafile

Page 12: Apache sqoop

NOTE:

--split-by:

It is used to distribute the values from the table across the mappers uniformly, i.e. if you have 100 unique records (primary key) and there are 4 mappers, --split-by <primary key column> helps distribute your data set evenly among the mappers.

$CONDITIONS:

It is used by the Sqoop process; Sqoop replaces it internally with a unique condition expression to fetch each mapper's slice of the data set.

If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. e.g., one mapper may execute "select bla from foo WHERE (id >=0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)" and so on.

Page 13: Apache sqoop

Apache SQOOP Import File Format
• You can import data as delimited text or in a binary format (SequenceFiles or Avro data files).

• Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument.

• Sequence Files are a binary format that store individual records in custom record-specific data types.

• Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages.

--as-avrodatafile Imports data to Avro Data Files

--as-sequencefile Imports data to SequenceFiles

--as-textfile Imports data as plain text (default)
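For instance (a sketch reusing the classicmodels example from the earlier slides; the target directory is hypothetical), importing the same table as SequenceFiles only changes the format flag:

$ sqoop import \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --table employees -m 2 \
  --as-sequencefile \
  --target-dir /sqoop/nye/classicmodels/empl-seq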

Page 14: Apache sqoop

Sqoop tool name => { sqoop-import-all-tables }:
• The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.

• For the import-all-tables tool to be useful, the following conditions must be met:

a. Each table must have a single-column primary key.
b. You must intend to import all columns of each table.
c. You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.

Syntax: $ sqoop import-all-tables (generic-args) (import-args)

Note: The --table, --split-by, --columns, and --where arguments are invalid for sqoop-import-all-tables.

Page 15: Apache sqoop

Apache Sqoop import-all-tables example:

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --fields-terminated-by "|" \
  --as-textfile \
  --warehouse-dir /sqoop/nye/classicmodels/warehouse

Note: Suppose there are 100 tables in the database and we do not want to import tabl10, tabl50, and tabl80. How can we import the rest without having to import the tables one by one?

We can accomplish this using the import-all-tables tool of Apache Sqoop by specifying the --exclude-tables option, as follows:

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --exclude-tables employees,department

Page 16: Apache sqoop

Apache SQOOP Jobs:
• A Sqoop job creates and saves the import and export commands.

• The job tool allows you to create and work with saved jobs.

• Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.

Syntax: $ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Sqoop Job Management Options:

Argument / Description

--create <job-id>   Define a new saved job with the specified job-id (name). A second Sqoop command line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>   Delete a saved job.
--exec <job-id>     Given a job defined with --create, run the saved job.
--show <job-id>     Show the parameters for a saved job.
--list              List all saved jobs.

Page 17: Apache sqoop

Apache Sqoop Job create:
• Creating saved jobs is done with the --create action.
• This operation requires a -- followed by a tool name and its arguments.
• The tool and its arguments will form the basis of the saved job.

Example:
$ sqoop job --create listsDB -- list-databases \
  --connect "jdbc:mysql://localhost" \
  --username root -P

$ sqoop job --create listsTbl -- list-tables \
  --connect "jdbc:mysql://localhost/classicmodels" \
  --username root -P

Page 18: Apache sqoop

Verify Job (--list)
The '--list' argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.

Syntax: $ sqoop job --list

Inspect Job (--show)
The '--show' argument is used to inspect or verify a particular job and its details. The following command is used to inspect a job called listsDB.

Syntax: $ sqoop job --show <<job-name>>

Example: $ sqoop job --show listsDB

Page 19: Apache sqoop

Execute Job (--exec)
The '--exec' option is used to execute a saved job. The following command is used to execute the saved job called listsDB.

Syntax: $ sqoop job --exec <<Job-Name>>

Example: $ sqoop job --exec listsDB

Delete Job (--delete)
The '--delete' option is used to delete a saved job. The following command is used to delete the saved job called listsDB.

Syntax: $ sqoop job --delete <<Job-Name>>

Example: $ sqoop job --delete listsDB

Page 20: Apache sqoop

Apache SQOOP Incremental Imports:
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Incremental import arguments:

Argument / Description

--check-column (col) Specifies the column to be examined when determining which rows to import.

--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.

--last-value (value) Specifies the maximum value of the check column from the previous import.

Note: Sqoop supports two types of incremental imports: append and lastmodified.

Page 21: Apache sqoop

Some important points on append:
• Use the --incremental argument to specify the type of incremental import to perform.

• Specify append mode when importing a table where new rows are continually being added with increasing row id values.

• You specify the column containing the row's id with --check-column.

• Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

Example:
$ sqoop import \
  --connect "jdbc:mysql://localhost/classicmodels" \
  --username root -P \
  --table employees \
  --target-dir /sqoop/nye/classicmodels/empl-select \
  --incremental append \
  --check-column employeeNumber \
  --last-value <<last-value>>

Problems:

1. If the --check-column values are not in sequence (strictly increasing), the import will throw an error.
2. Also, you have to remember the --last-value from the import that was last performed and pass it in each time.

Page 22: Apache sqoop

Some important points on lastmodified:
• Incremental import mode can be used to retrieve only rows newer than some previously-imported set of rows.

• lastmodified works on time-stamped data.

• Use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp.

• Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

• Sqoop checks for changes in data between the last value timestamp (Lower bound value) and Current timestamp (Upper bound value) and imports the modified or newly added rows.
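A minimal sketch of a lastmodified import (the last_update check column and the timestamp value are hypothetical; --merge-key tells Sqoop which key to use when reconciling updated rows with previously imported ones):

$ sqoop import \
  --connect "jdbc:mysql://localhost/classicmodels" \
  --username root -P \
  --table employees \
  --target-dir /sqoop/nye/classicmodels/empl-lastmod \
  --incremental lastmodified \
  --check-column last_update \
  --last-value "2019-01-01 00:00:00" \
  --merge-key employeeNumber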

Page 23: Apache sqoop
Page 24: Apache sqoop

APACHE SQOOP INCREMENTAL IMPORTS APPROACH USING JOBS

Example:

1. Create:

$ sqoop job --create weeklyImport_Employees \
  -- import \
  --connect "jdbc:mysql://localhost/<<DB-Name>>" \
  --table <<table-name>> \
  --incremental append \
  --check-column <<col-name>> \
  --last-value 0 \
  --target-dir <<hdfs-location>>

2. Execute

$ sqoop job --exec weeklyImport_Employees

• If an incremental import is run from a saved job, this value will be retained in the saved job.

• Subsequent runs of sqoop job --exec <<Job-Name>> will continue to import only newer rows than those previously imported.

Page 25: Apache sqoop

Apache SQOOP Export:
• The export tool exports a set of files from HDFS back to an RDBMS.

• The target table must already exist in the database.

• The input files are read and parsed into a set of records according to the user-specified delimiters.

• The default operation is to transform these into a set of INSERT statements that inject the records into the database.

• In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database.

Syntax: $ sqoop export (generic-args) (export-args)
        $ sqoop-export (generic-args) (export-args)

Note:

• The --table and --export-dir arguments are required.
• These specify the table to populate in the database, and the directory in HDFS that contains the source data.

Page 26: Apache sqoop

Argument / Description

--export-dir <dir>        HDFS source path for the export
-m,--num-mappers <n>      Use n map tasks to export in parallel
--table <table-name>      Table to populate
--update-key <col-name>   Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode>      Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode include updateonly (default) and allowinsert.

Example:
$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root -P \
  --table employee \
  --export-dir <<hdfs-location>>
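To illustrate the update arguments above (a sketch only; the employeeNumber key and export directory reuse the classicmodels examples from earlier slides), an upsert-style export would look like:

$ sqoop export \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --table employees \
  --export-dir /sqoop/nye/classicmodels/empl-select \
  --update-key employeeNumber \
  --update-mode allowinsert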

Page 27: Apache sqoop

Exports may fail for a number of reasons:

1. Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes).

2. Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value).

3. Attempting to parse an incomplete or malformed record from the HDFS source data.

4. Attempting to parse records using incorrect delimiters.

5. Capacity issues (such as insufficient RAM or disk space).
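Note that a failed export can leave partial data in the target table. One common safeguard (sketched here; the staging table name is hypothetical) is to export through a staging table, which Sqoop fills first and then moves into the target table in a single transaction:

$ sqoop export \
  --connect jdbc:mysql://localhost/classicmodels \
  --username root -P \
  --table employees \
  --staging-table employees_staging \
  --clear-staging-table \
  --export-dir /sqoop/nye/classicmodels/empl-select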

Page 28: Apache sqoop

Next: Sqoop Jobs => Sqoop with Hive => Sqoop with HBase => Sqoop Interview questions