Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Uploaded by: DataWorks Summit/Hadoop Summit
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia (Enhancing Sqoop for Synchronization)
Shashank Tandon, Expedia
Kopal Niranjan, Expedia
| Expedia Inc. Proprietary & Confidential2
Agenda
• Problem statement
• Why Sqoop
• Expedia enhancements for Sqoop
• New Tool: Hive Merge
• Data Synchronization
• Demo
Data Synchronization
Problem Statement
• Import the huge amount of data available in the RDBMS into Hive tables.
• Support multiple partitions on Hive while importing.
• Regular updates happen on the RDBMS.
- Merge the new/updated data into the Hive tables.
- Merge the data in parallel.
Community Solution - Sqoop
• Sqoop is an open source tool designed to efficiently transfer bulk data between Hadoop and structured data stores such as relational databases.
• Supports various relational databases such as Teradata, SQL Server, Oracle, MySQL, and DB2.
Enhanced Sqoop Features
• Sqoop features enhanced for community business needs.
- Hive Merge: merges the incremental data migrated to HDFS into your existing Hive tables.
- Supports merge based on composite keys.
- Merges older partitions and adds new partitions.
Enhanced Sqoop Features
- Hive Dynamic Partition
- Hive Dynamic Partition with Partition Format
- Hive External Table
- Compression such as Snappy
HCatalog for Hive
- HCatalog is a Java wrapper on top of the Hive metastore.
- Sqoop supports all the latest Hive features through HCatalog.
External tables with HCatalog
Sqoop Import to Hive Managed Table
• Sqoop connects to the MySQL database test.
• It imports the table MYTABLE into the Hive managed table test_part1.
• The Hive managed table is located in /apps/hive/warehouse.
New Enhancement: Import to Hive External Table
• The import command creates a Hive table in the user-managed directory /user/root/test_part2.
Dynamic Partitioning with HCatalog
Sqoop Import to Hive Static Partition
• Only one static partition can be passed as a Sqoop argument.
Sqoop Import to Hive Static Partition
• Check the Hive partition.
Sqoop Import to Hive Static Partition on Date column
• Only one static partition can be passed as a Sqoop argument, and its date value must be specified manually.
Questions
How do I import data if there are more than 200 partitions? Should I manually run these jobs again and again?
How do I import data if the date format is month, day, or year? Is there any way to pass the format?
New Enhancement: Import to Hive Dynamic Partition
• A new argument, --hcatalog-dynamic-partition-keys, is passed to Sqoop.
• It works alongside the existing static partition key.
• If both are passed, the static partition key takes precedence.
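The idea behind dynamic partitioning can be sketched in a few lines of Python: the partition of each row is derived from the row's own column values, so one import can populate any number of partitions. This is only an illustration, not the Sqoop implementation. The function name partition_rows, the column names, and the reading of "precedence" as "the static value overrides a dynamic value for the same column" are all assumptions.

```python
from collections import defaultdict


def partition_rows(rows, dynamic_keys, static_spec=None):
    """Group rows by partition spec (hypothetical helper, not a Sqoop API).

    dynamic_keys: columns whose per-row values choose the partition.
    static_spec:  optional {column: value} fixed for the whole import.
    """
    static_spec = static_spec or {}
    partitions = defaultdict(list)
    for row in rows:
        # Dynamic partition values are read from the row itself.
        spec = {key: row[key] for key in dynamic_keys}
        # Assumption: a static value wins when both name the same column.
        spec.update(static_spec)
        partitions[tuple(sorted(spec.items()))].append(row)
    return dict(partitions)


rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
by_country = partition_rows(rows, ["country"])
for spec, members in by_country.items():
    print(spec, [r["id"] for r in members])
```

With a static partition, every row lands in the single partition named on the command line; with dynamic keys, the number of partitions follows the data.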
New Enhancement: Import to Hive Dynamic Partition with Date Format
• A new argument, --hcatalog-dynamic-partition-key-format, is passed together with --hcatalog-dynamic-partition-keys.
• Check the Hive Partitions after the Sqoop Import.
• The partitions created will be in the user-specified format.
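As a rough illustration of what a partition format does, the sketch below turns one date column value into partition values of different granularity. It uses Python's strftime patterns purely for demonstration; the pattern syntax that the Sqoop argument accepts is not shown in the slides, and the names partition_value and booking_date are hypothetical.

```python
from datetime import date


def partition_value(value, fmt):
    """Render a date as a partition value using a user-supplied format."""
    return value.strftime(fmt)


booking_date = date(2016, 6, 28)
yearly = partition_value(booking_date, "%Y")         # "2016"
monthly = partition_value(booking_date, "%Y-%m")     # "2016-06"
daily = partition_value(booking_date, "%Y-%m-%d")    # "2016-06-28"
print(yearly, monthly, daily)
```

A coarser format means many source rows share one partition, which is how a format argument lets the user choose between yearly, monthly, or daily partitions without running one job per value.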
Password encrypted in Sqoop Metastore
• The password is now saved in the Sqoop metastore in encrypted form.
• The logic is the same as for file encryption, where a generic passkey and algorithm are passed on the command line.
Issues with Sqoop Merge Tool
• Designed to merge two directories on HDFS. It needs modification to support merging of Hive tables.
• The output directory must be specified while performing the merge.
• Supports merge based on a single column only.
• To merge many partitions, each partition requires a separate, sequential Sqoop job.
Merge Incremental data using Sqoop and Hive External Table
• Import records from the base table to an HDFS directory.
• Import updates using incremental imports to another HDFS directory.
• Create a Hive external table for each of the two directories.
• Create a view that combines the record sets from both the Base (base_table) and Change (incremental_table) tables.
• The view now contains the most up-to-date set of records.
• Generate a table from the view created in the previous step.
• Replace the base table with the entries from the generated table.
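The reconciliation that such a view performs can be sketched in Python as "keep the newest version of each record". The slides do not say how the view decides which version is newest; the sketch assumes each row carries a last-modified column, and the names reconcile, id, and modified_date are hypothetical.

```python
def reconcile(base_rows, incremental_rows, key="id", modified="modified_date"):
    """Return one row per key, preferring the most recently modified version."""
    latest = {}
    for row in list(base_rows) + list(incremental_rows):
        current = latest.get(row[key])
        # Incremental rows are scanned last, so they also win on ties.
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return list(latest.values())


base_table = [
    {"id": 1, "city": "Seattle", "modified_date": "2016-01-01"},
    {"id": 2, "city": "Delhi", "modified_date": "2016-01-01"},
]
incremental_table = [
    {"id": 2, "city": "New Delhi", "modified_date": "2016-02-01"},
    {"id": 3, "city": "London", "modified_date": "2016-02-01"},
]
reconciled = reconcile(base_table, incremental_table)
print(sorted((r["id"], r["city"]) for r in reconciled))
```

Note that this approach rewrites the whole table on every refresh, which is one reason a partition-aware merge is attractive for large tables.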
New Tool: Hive Merge
• Import the original base table into Hive.
New Tool: Hive Merge
• Import the incremental data into Hive.
New Tool: Hive Merge
• Finally, merge the data using the hive-merge tool.
Acquiring locks during Hive Merge
• To allow only a single Hive merge to run on the same table at a time, the tool acquires a lock at the start and releases it once it finishes.
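The acquire-at-start, release-at-finish pattern can be sketched with a per-table lock file. This is a local-filesystem stand-in chosen only to keep the example runnable; the slides do not describe which locking mechanism the tool actually uses, and table_lock and the lock-file naming are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def table_lock(lock_dir, table):
    """Hold an exclusive lock for one table for the duration of a merge."""
    path = os.path.join(lock_dir, table + ".lock")
    try:
        # O_EXCL makes creation fail if another merge already holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError("a merge is already running on " + table) from None
    try:
        yield path
    finally:
        # Release the lock even if the merge itself raised an error.
        os.close(fd)
        os.remove(path)


lock_dir = tempfile.mkdtemp()
second_merge_blocked = False
with table_lock(lock_dir, "bookings"):
    try:
        with table_lock(lock_dir, "bookings"):
            pass
    except RuntimeError:
        second_merge_blocked = True
print("second merge blocked:", second_merge_blocked)
```

Merges on different tables use different lock files in this sketch, so they do not block each other.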
Performance metrics: Hive Merge tool
Other Key Enhancements
• Save encrypted password in Sqoop Metastore
• Teradata varchar/char support
• Teradata current timestamp support
• Sqoop Job runs for Incremental Import
• Snappy compression support in HCatalog
Apache Sqoop Jiras
These are a few of the Jiras for which we have provided patches:
• SQOOP-2332: Dynamic Partition in Sqoop HCatalog - if Hive table does not exist & add support for Partition Date Format
• SQOOP-2335: Support for Hive External Table in Sqoop - HCatalog
• SQOOP-2585: Merging hive tables using sqoop
• SQOOP-2596: Precision of varchar/char column cannot be retrieved from teradata database during sqoop import
• SQOOP-2801: Secure RDBMS password in Sqoop Metastore in a encrypted form.
• SQOOP-2331: Snappy Compression Support in Sqoop-Hcatalog
Demo
Questions
Hive Merge Internal Architecture
Step 1: Identify partitions to update. Skip this step for non-partitioned tables.
Hive Merge Internal Architecture
Step 2: Merge the new partitions with the old partitions (only for partitioned tables).
Hive Merge Internal Architecture
Step 3: Delete older versions.
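The three steps can be walked through on in-memory data as follows. This is a simplified sketch under stated assumptions, not the tool's code: a table is modelled as a dict from partition value to a list of rows, the names hive_merge, partition_key, and merge_keys are hypothetical, and records that move between partitions are not handled.

```python
def hive_merge(table, incremental, partition_key, merge_keys):
    """Merge incremental rows into a partitioned table, partition by partition."""
    # Step 1: identify the partitions touched by the incremental data.
    touched = {row[partition_key] for row in incremental}

    merged = dict(table)  # untouched partitions are carried over unchanged
    for part in touched:
        # Step 2: merge new rows with the old partition on the composite key;
        # a new row replaces an old row that has the same key.
        by_key = {tuple(r[k] for k in merge_keys): r for r in table.get(part, [])}
        for row in incremental:
            if row[partition_key] == part:
                by_key[tuple(row[k] for k in merge_keys)] = row
        # Step 3: the merged version supersedes the older version of the partition.
        merged[part] = list(by_key.values())
    return merged


table = {
    "2016-01": [{"hotel_id": 1, "room": "A", "month": "2016-01", "price": 100}],
    "2015-12": [{"hotel_id": 9, "room": "Z", "month": "2015-12", "price": 80}],
}
incremental = [
    {"hotel_id": 1, "room": "A", "month": "2016-01", "price": 120},
    {"hotel_id": 2, "room": "B", "month": "2016-02", "price": 90},
]
result = hive_merge(table, incremental, "month", ["hotel_id", "room"])
for month in sorted(result):
    print(month, [(r["hotel_id"], r["price"]) for r in result[month]])
```

Because each touched partition is merged independently of the others, the per-partition work in step 2 is the part that lends itself to running in parallel.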