EPAM. Hadoop MR streaming in Hive


Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. This presentation shows how to use the Streaming feature in Hive to reduce code complexity, with an example from a real project.


Yauheni Yushyn, EPAM Systems – September 2014

Hadoop MR Streaming in Hive
Use case with Hive and Python from real life


Agenda

• Intro

• Pros and Cons

• Hive reference

• Use case from Real Life

• Possible solutions

• Hive Streaming: Architecture

• Hive Streaming: Realization

• Hive Streaming: Source code

• Hive Streaming: Debug

• Hive Streaming: Pitfalls

• Hive Streaming: Benchmarks


SECTION: CONCEPTS
Hadoop MR Streaming in Hive

Intro

Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process

Unix-like interface:
• Streaming API opens an I/O pipe to an external process
• Process reads data from the standard input and writes the results out through the standard output

By default, INPUT for the user script:
• columns are transformed to STRING
• delimited by TAB
• NULL values are converted to the literal string \N (to differentiate NULL values from empty strings)

OUTPUT of the user script:
• treated as TAB-separated STRING columns
• \N will be re-interpreted as a NULL
• the resulting STRING column will be cast to the data type specified in the table declaration

These defaults can be overridden with ROW FORMAT
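For illustration, a minimal TRANSFORM query relying on these defaults might look like the sketch below (table, column, and output names are hypothetical): Hive serializes the selected columns to the script's stdin as TAB-separated strings and casts the TAB-separated strings coming back on stdout to the declared types.

ADD FILE ./script_name.py;

SELECT TRANSFORM (col1, col2, col3)        -- sent to the script as TAB-separated STRINGs
  USING 'python script_name.py'
  AS (new_col1 STRING, new_col2 INT)       -- stdout parsed back and cast to the declared types
FROM src_table;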

Pros:

• Simplicity for the developer, dealing only with stdin/stdout

• Schema-less model, treat values as needed

• Non-Java interface

Cons:

• Overhead for serialization/deserialization between processes

• Disallowed when "SQL standard based authorization" is configured (Hive 0.13.0 and later releases)

Pros and Cons

• MAP()

• REDUCE()

• TRANSFORM()

Hive provides several clauses to use streaming: MAP(), REDUCE(), and TRANSFORM().

Note:

MAP() does not actually force streaming during the map phase, nor does REDUCE() force streaming to happen in the reduce phase. For this reason, the functionally equivalent yet more generic TRANSFORM() clause is recommended, to avoid misleading the reader of the query.
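As a hedged sketch (table and script names are hypothetical), whether the external script runs on the map or the reduce side is determined by the shape of the query, e.g. a CLUSTER BY between two TRANSFORM calls, not by spelling the clause MAP() or REDUCE():

FROM (
  FROM src_table
  SELECT TRANSFORM (key, value) USING 'python my_mapper.py' AS (key, value)
  CLUSTER BY key
) mapped
INSERT OVERWRITE TABLE target_table
SELECT TRANSFORM (mapped.key, mapped.value) USING 'python my_reducer.py' AS (key, cnt);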

Hive reference


SECTION: USE CASE
Hadoop MR Streaming in Hive

Requirements:

There are 14 flags in the source table in Hive which control the output values for 4 new fields in the target table

Solutions:

• Hive "case … when" clause

• User Defined Function (UDF)

• Custom MR Job

• Hive Streaming

Use case from Real Life

Use case from Real Life: Requirements


• There are more than 1,500 lines of code to map the flags to the new fields (the statement is repeated for every new output field); see the sketch below

• Complexity for debugging

• Fast execution
• SQL-like syntax
• All logic in one place (hql script)
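To give a sense of why the line count explodes: each of the 4 target fields needs its own CASE expression over the 14 flags, with one WHEN branch per row of the mapping. A hedged sketch, using the real flag names but a hypothetical output column name ("reason_group") and table name; flag values are compared as strings here, matching how the script treats them:

SELECT
  -- ... pass-through data columns ...
  CASE
    WHEN exp_listed_on_route_flag = '0' AND exp_lst_on_itin_flag = '0' THEN 'Inventory'
    WHEN exp_listed_on_route_flag = '1' AND exp_listed_on_carr_flag = '0'
         AND exp_lst_on_itin_flag = '0' AND carr_is_seller_flag = '0'
         AND split_ticket_flag = '0' THEN 'Inventory'
    -- ... one WHEN per mapping row, and the whole CASE repeated for each of the 4 new fields ...
  END AS reason_group
FROM source_table;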

Hive "case … when" clause

• You are single consumer of UDF (for this particular case, custom logic for single DataMart)

• Java-code

• Fast execution
• Pass only needed flags into UDF (in contrast with Hive Streaming)
• In the end: SQL-like syntax, all logic in one place
• Java code

UDF

• Slower execution (time for SerDe)
• Deals with all fields, not only flags (in contrast with UDF)

• Reduced code complexity thanks to a scripting language

• Small size of code
• Fast development
• Wide choice of programming languages

Hive Streaming


SECTION: REALIZATION
Hadoop MR Streaming in Hive

Hive Streaming: Architecture

Python snippets:

• Create matrix (e.g., list of tuples) with flags and related values of fields

• Loop through INPUT

• Split INPUT by TAB

• Split data fields and flags

• Compare with matrix and get max possible matching

• Spill out data with new fields as TAB separated text

Hive Streaming: Realization

#!/usr/bin/env python
"""Mapper for Hive Streaming, using Python iterators and generators.
Spill out new fields in accordance with input flags."""

import sys
import logging

def read_input(file):
    """Read data from STDIN using python generator"""
    #yield "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEpam.COM"
    #yield "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t\N1\t\N1\t\N\t\N\t\N\t\N\t\N\t1\t\N\t\N\t\N\t1\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t2014-01-01\tEpam.COM"
    for line in file:
        yield line.strip()

def compare_flags(source, target):
    """Compare flags from source and target lists. Src/trg should have the same size"""
    size = len(source)
    out = list()

    # Go through elements, add 0 to OUT list if src/trg elements are equal
    for i in xrange(size):
        if target[i] != '-':
            if target[i] == source[i]:
                out.append(0)
            else:
                #logging.debug("Position: %i. Values of src/trg not equals, skip: %s,%s" % (i, source[i], target[i]))
                return None
                #out.append(1)
        else:
            out.append('-')

    return out

def main(separator='\t'):
    column_list = ["ORIGIN","DESTINATION","OND","CARRIER","LOS","BKG_WINDOW","LOCAL_CURRENCY","LOWEST_PRICE","PAGE","POSITION","XP_RANK","XP_PRICE","XP_COMPETED","XP_PRICE_DIFF","BML","NUMBER_SELLERS","XP_IS_HERO","ECPC_LOSS","PRICE_LOSS","OTA_1","OTA_1_PRICE","OTA_2","OTA_2_PRICE","OTA_3","OTA_3_PRICE","OTA_4","OTA_4_PRICE","OTA_5","OTA_5_PRICE","OTA_6","OTA_6_PRICE","OTA_7","OTA_7_PRICE","OTA_8","OTA_8_PRICE","OTA_9","OTA_9_PRICE","OTA_10","OTA_10_PRICE","OTA_11","OTA_11_PRICE","OTA_12","OTA_12_PRICE","OTA_13","OTA_13_PRICE","OTA_14","OTA_14_PRICE","OTA_15","OTA_15_PRICE","PARTNER_NAME","RCXR","DCXR","SPLIT_TICKET","DEPARTURE_DURATION","RETURN_DURATION","DEPARTURE_STOPS","RETURN_STOPS"]
    flag_list = ["exp_listed_on_route_flag","exp_listed_on_carr_flag","exp_lst_on_itin_flag","carr_is_seller_flag","more_than_1_seller_flag","split_ticket_flag","exp_in_hero_flag","ota_in_hero_flag","meta_in_hero_flag","carr_in_hero_flag","cheapest_prc_is_unique_flag","exp_prc_match_carr_flag","exp_prc_match_cheapest_flag","cheapest_ota_meta_prc_match_carr_flag"]
    partition_list = ["SHOP_DATE", "PARTNER_POS"]

logging.debug("Star specifying vocabulary matrix") target = [ (["Inventory","Epam not showing route","Epam Lost","Unknown"],["0","-","0","-","-","-","-","-","-","-","-","-","-","-"]) ,(["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","0","-","-","-","-","-","-","-","-"]) ,(["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier for Epam"],["1","0","0","1","1","0","-","-","-","-","-","-","-","-"]) ,(["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier on Meta"],["1","0","0","1","0","0","-","-","-","-","-","-","-","-"]) ,(["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","-","-","-","-","-","-","-","-","-"]) ,(["Inventory","Epam not showing itinerary","Epam Lost","Unknown"],["1","1","0","-","1","0","-","-","-","-","-","-","-","-"]) ,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","0","0","0","0","1","-","-","-","-","-","-","-","-"]) ,(["Inventory","Unique Inventory","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]) ,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","1","0","0","-","-","-","-"]) ,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","0","1","0","-","-","-","-"]) ,(["Inventory","Unique Inventory","Unknown","Epam Won"],["1","1","1","0","0","0","1","0","0","0","-","-","-","-"]) ,(["Inventory","Unique Inventory","Epam Lost","Suspected Carrier Restricted Content"],["1","1","0","1","0","0","0","0","0","1","-","-","-","-"]) ,(["Inventory","Unique Inventory","Epam Lost","Unknown"],["1","1","0","0","0","0","-","-","-","-","-","-","-","-"]) ,(["Price","Carrier more expensive","Undercutting carrier","Epam Won"],["1","1","1","1","-","0","1","0","0","0","-","0","-","-"]) ,(["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","0","1","0","-","-","0","0"]) ,(["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","1","0","0","-","-","0","0"]) ,(["Price","Carrier cheapest","Epam Lost","Unknown"],["1","1","1","1","1","0","0","-","-","1","-","0","0","0"]) ,(["Price","Carrier cheapest","Epam Lost","Carrier controlled pricing"],["1","1","1","1","-","0","0","0","0","1","1","0","-","-"]) ,(["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","0","1","0","-","-","0","-"]) ,(["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","1","0","0","-","-","0","-"]) ,(["Price","Fees or charges","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]) ,(["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","0","1","0","-","-","0","-"]) ,(["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","1","0","0","-","-","0","-"]) ,(["Price","Fees or charges","Unknown","Epam Won"],["1","1","1","0","-","0","1","0","0","0","1","-","-","-"]) ,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","1","0","0","0","-","0","1"]) ,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","1","0","0","-","0","1"]) ,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","0","1","0","0","-","1"]) ,(["Rank","Rank","ECPC","Epam Won"],["1","1","1","-","1","-","1","0","0","0","0","-","1","-"]) ,(["Rank","Rank","Epam 
Lost","ECPC"],["1","1","1","-","1","-","0","1","0","0","0","-","1","-"]) ,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","1","0","0","-","1","-"]) ,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","0","1","0","-","1","-"]) ]

    # Input comes from STDIN
    data = read_input(sys.stdin)

    header_list = column_list + flag_list + partition_list
    logging.debug("Header for input data: %s" % header_list)

logging.debug("Start reading from STDIN") # Loop through STDIN for words in data: #for words in sys.stdin: logging.debug("-----------") current_flags = list()

#words = words.strip() words = words.split('\t')

logging.debug("Input values from external process (STDIN): %s" % words) logging.debug("Input length: %s" % len(words))

if (len(header_list) != len(words)): logging.error("Length of IN data (%i) not equal Header length (%i)! Exit" % (len(words), len(header_list))) sys.exit(1)

data_set = dict(zip(header_list, words))

logging.debug("Parsing of STDIN: %s" % data_set)

        # Get flags
        for flag in flag_list:
            current_flags.append(data_set[flag])

        logging.debug("Find flags: %s" % current_flags)

        # Get list with result of comparison src/trg
        compared_list = list()
        logging.debug("Comparing flags with vocabulary...")
        for k, v in target:
            #logging.debug("key, value: %s,%s" % (k, v))
            temp_out = compare_flags(current_flags, v)
            if not temp_out:
                continue

            logging.debug("Match is found: %s" % temp_out)
            compared_list.append((k, temp_out))
            temp_out = list()

        logging.debug("Comparing flags with vocabulary finished. List of matches: %s" % (compared_list))

        # Find max occurrence of src in trg (find max-occurrence of zeros)
        max_zeros = 0
        out_fields = list()
        max_flag_from_trg = list()
        for k, v in compared_list:
            #logging.debug("key, value: %s,%s" % (k, v))
            count_zero = v.count(0)
            if count_zero > max_zeros:
                max_zeros = count_zero  # track the best (max-zeros) match found so far
                out_fields = k
                max_flag_from_trg = v

        if (not out_fields) or (not max_flag_from_trg):
            logging.warning("Can't find values in vocabulary. Set values for DEFAULT")
            logging.warning("Fields: %s" % out_fields)
            logging.warning("Flags: %s" % max_flag_from_trg)
            out_fields = ["DEFAULT" for x in xrange(len(target[0][0]))]
        else:
            logging.debug("Output fields found")
            logging.debug("Fields: %s" % out_fields)
            logging.debug("Flags: %s" % max_flag_from_trg)

        # Output fields with flags in STDOUT
        field_data = [data_set[x] for x in column_list]
        partition_date = [data_set[x] for x in partition_list]
        out_row = separator.join(field_data) + separator + separator.join(out_fields) + separator + separator.join(partition_date)
        logging.debug("Output string: %s" % out_row)
        print out_row
        #print "%s%s%s%s%s" % (separator.join(field_data), separator, separator.join(out_fields), separator, separator.join(partition_date))


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG,
                        stream=sys.stderr,
                        #format='%(filename)s[LINE:%(lineno)d]# %(levelname)-8s [%(asctime)s] %(message)s'
                        format='[%(asctime)s][%(filename)s][%(levelname)s] %(message)s'
                        )
    main()

Hive Streaming: Source code

echo -e "val11\tval12\t…val1N\nval21\tval22\t…val2N" | ./script_name.py

Example:

Put 2 lines (TSV) into stdin:
echo -e "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM\nIAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM" | ./script_name.py

Get 2 lines with new fields (without flags) on stdout:
IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS 520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier Epam Lost Unknown 2014-01-01 EPAM.COM

IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS 520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier Epam Lost Unknown 2014-01-01 EPAM.COM

Hive Streaming: Debug

• Add the script to the Distributed Cache before running a query with Hive Streaming (see the sketch below)
• Use the last columns of the SELECT statement for Dynamic Partitioning
• Use a separator more robust than the default (TAB) to prevent inconsistent data
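A hedged sketch of how these points fit together in the query (the target table name, the path, and the output column types are assumptions; the data columns follow the script above):

ADD FILE /path/to/script_name.py;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION (shop_date, partner_pos)
SELECT TRANSFORM (*)
  USING 'python script_name.py'
  AS (origin STRING,
      destination STRING,
      -- ... the remaining data columns and the 4 derived fields ...
      shop_date STRING,
      partner_pos STRING)   -- dynamic partition columns come last
FROM source_table;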

Note: always use iterator/generator functions (idiomatic Python) instead of reading stdin explicitly! It saves system resources and makes the script run much faster (more than 10 times faster in this case).

Example:

Hive Streaming: Pitfalls

def read_input(file):
    for line in file:
        # strip trailing whitespace/newline and yield the line lazily
        yield line.strip()

data = read_input(sys.stdin)
for words in data:
    …

for words in sys.stdin:
    …


SECTION: BENCHMARKS
Hadoop MR Streaming in Hive

Hive Streaming: Benchmarks
Hive "case … when" clause

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Non-Partitioned

Time spent: 2m39s

Hive Streaming: Benchmarks

Hive Streaming

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Non-Partitioned

Time spent: 4m53s

Note: no compression is applied to the output, so the "Number of bytes written" counter is dramatically larger
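If output size is a concern, compression could presumably be enabled with the usual Hive settings before the INSERT, for example:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;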

Hive Streaming: Benchmarks

Hive "case … when" clause

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Partitioned by 2 columns

Time spent: 2m44s

Hive Streaming: Benchmarks

Hive Streaming

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Partitioned by 2 columns

Time spent: 5m12s

Join us at https://www.linkedin.com/groups/Belarus-Hadoop-User-Group-BHUG-8104884

yauheni_yushyn@epam.com

skype: ushin.evgenij

Thanks!