Processing Java UDFs in a C++ Environment

Viktor Rosenfeld, Rene Mueller, Pınar Tözün, Fatma Özcan IBM Research – Almaden SoCC '17, Santa Clara – September 27, 2017 Processing Java UDFs in a C++ Environment



  • Programming languages split

    • JVM-based data analytics ecosystem: liberal use of UDFs
    • Software components that do not use the JVM: “native code”

  • Wildfire

    • HTAP prototype from IBM Research [CIDR2017]
    • C++ columnar engine
    • Spark as user-facing front end
    • data analytics through SparkSQL queries
    • SparkSQL queries can contain Scala UDFs

    Goal: Execute Scala UDFs inside SparkSQL queries on the Wildfire C++ engine.

  • Java Native Interface

    [Diagram: native code and the JVM, connected through the Java Native Interface (JNI)]

  • JNI overheads

    • Simple Java function called in a loop
    • 1 billion iterations
    • JNI overhead: 2 orders of magnitude
    • Splitting the loop (strided execution) hides the overhead

    int udf(int i) { return i; }

    [Figure: seconds (log scale) vs. stride size (log scale, 10^0 to 10^9). Series: loop in Java; loop in C with the function called via JNI; split loop with 1 JNI call per stride.]

    • JNI calls have significant overhead
    • Execute tuple-based UDFs in a strided fashion
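The loop splitting described on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: in the real system the outer loop runs in C++ and every call crosses JNI, whereas here a counter (`crossings`) merely simulates those crossings. The class and method names are hypothetical.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical demo: the counter stands in for JNI calls made by native code. */
final class StridedCallDemo {
    /** Number of simulated boundary crossings. */
    static long crossings = 0;

    /** One crossing per tuple: the pattern the slide measures as slow. */
    static void perTuple(IntUnaryOperator udf, int[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            crossings++;                      // one simulated JNI call per row
            out[i] = udf.applyAsInt(in[i]);
        }
    }

    /** One crossing per stride: the callee loops over a whole stride of rows. */
    static void strided(IntUnaryOperator udf, int[] in, int[] out, int strideSize) {
        for (int start = 0; start < in.length; start += strideSize) {
            int end = Math.min(start + strideSize, in.length);
            crossings++;                      // one simulated JNI call per stride
            applyStride(udf, in, out, start, end);
        }
    }

    private static void applyStride(IntUnaryOperator udf, int[] in, int[] out,
                                    int start, int end) {
        for (int i = start; i < end; i++) {
            out[i] = udf.applyAsInt(in[i]);
        }
    }

    public static void main(String[] args) {
        int[] in = new int[10_000];
        for (int i = 0; i < in.length; i++) in[i] = i;
        int[] out = new int[in.length];
        IntUnaryOperator udf = x -> x;        // the identity UDF from the slide

        crossings = 0;
        perTuple(udf, in, out);
        System.out.println("per-tuple crossings: " + crossings);

        crossings = 0;
        strided(udf, in, out, 4096);
        System.out.println("strided crossings:   " + crossings);
    }
}
```

With 10,000 rows and a stride size of 4096, the strided variant crosses the simulated boundary 3 times instead of 10,000 times, which is why the per-call overhead is amortized.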

  • SparkSQL UDFs

    Usage in Spark

    var offset = 10
    sqlContext.udf.register("add_offset", (i: Int) => i + offset)
    sqlContext.sql("SELECT add_offset(i) FROM table").show()

    Java class representation

    public final class SparkProgram$$anonfun$run$1
        extends scala.runtime.AbstractFunction1$mcII$sp
        implements scala.Serializable {
      public SparkProgram$$anonfun$run$1(scala.runtime.IntRef);
      public final int apply(int);
      ...
    }

    SparkSQL UDFs are represented both as a Java class and as an instance.
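To illustrate why both the class and the instance matter, here is a minimal sketch in plain Java. `AddOffset` is a hypothetical hand-written stand-in for the compiler-generated closure class (it captures a plain `int` rather than a `scala.runtime.IntRef`), and `UdfShippingDemo` is likewise an illustrative name. The serialized bytes carry only the captured state; the receiving side also needs the class bytecode to rebuild the instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for the generated closure class. */
final class AddOffset implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int offset;              // captured variable, part of the instance state

    AddOffset(int offset) { this.offset = offset; }

    int apply(int i) { return i + offset; }
}

final class UdfShippingDemo {
    /** Serializes the instance state; the class bytecode is not part of these bytes. */
    static byte[] serialize(Serializable udf) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(udf);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the instance; this only works if the class can be loaded here. */
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    /** Convenience wrapper that turns checked exceptions into unchecked ones. */
    static AddOffset roundTrip(AddOffset udf) {
        try {
            return (AddOffset) deserialize(serialize(udf));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AddOffset copy = roundTrip(new AddOffset(10));
        System.out.println(copy.apply(5));   // prints 15
    }
}
```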

  • Embedded JVM

    • Instantiate UDF in embedded JVM
    • JIT-compile strided execution wrapper
    • Wrap engine buffers as Java direct ByteBuffers

    [Diagram: Spark holds the UDF bytecode and instance. The Wildfire engine node runs an embedded JVM that holds a bytecode copy, an instance copy, and the strided execution wrapper. The engine's input and output buffers are exposed to the wrapper as Java direct ByteBuffers.]

  • Embedded JVM performance

    • Comparable performance to execution in Spark
    • Strided execution is key to fast performance

    Wall-clock time relative to Spark-only (lower is better):

    Workload                            Spark-only   Strided execution   Unstrided execution
    Range query (bandwidth-bound)       1            0.65                8.91
    Vincenty formulae (compute-bound)   1            0.89                1.07
    Word length (object creation)       1            0.83                3.77

  • Can we do better?


  • JIT compilation to machine code

    • Compile bytecode to LLVM IR
    • Generate wrapper LLVM IR
    • Link both to object code
    • Deserialize instance
    • Dynamically load and execute object code

    [Diagram: Spark ships the UDF bytecode and instance to Wildfire. The BugVM compiler translates the bytecode to LLVM IR; LLVM MCJIT combines it with the wrapper LLVM IR into object code. The instance lives in the BugVM JVM and is referenced through an instance pointer.]

  • JIT compilation performance

    Wall-clock time relative to Spark-only (Spark-only = 1):

    Workload                            Embedded JVM   Hand-written   JIT-compiled
    Range query (bandwidth-bound)       0.65           0.51           0.75
    Vincenty formulae (compute-bound)   0.89           0.51           0.51
    Word length (object creation)       0.83           0.39           1.39

    Compilation to machine code is beneficial if the UDF is computationally heavy and does not create objects.

  • Word length UDF optimizations

    UDF wrapper

    for i ← 1 to size of input do
      javaString ← CreateJavaString(input_i)
      output_i ← WordLengthUdf(javaString)
      CheckForJavaException()
      ReleaseJavaObject(javaString)
    end

    Optimizations

    1. Reuse javaString object
    2. Eliminate data copies when creating javaString
    3. Remove memory fence
    4. Relax exception handling

    These optimizations violate Java language guarantees (e.g., immutability of Strings)!

    We can apply them because we know that the UDF does not leak or retain the String object.
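The first two optimizations (reusing the string object and avoiding data copies) can be approximated in plain Java. This is an analogy only: plain Java cannot re-point a `java.lang.String`, so the sketch uses a hypothetical mutable `CharSequence` view (`ReusableCharView`) instead of the modified String implementation the slides refer to. It relies on the same assumption the slide states: the UDF must not leak or retain its argument.

```java
/** Hypothetical mutable view over a shared character buffer. */
final class ReusableCharView implements CharSequence {
    private char[] data = new char[0];
    private int offset;
    private int length;

    /** Re-points the view at another slice without allocating or copying. */
    void wrap(char[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException();
        return data[offset + index];
    }

    @Override public CharSequence subSequence(int start, int end) {
        return new String(data, offset + start, end - start);
    }

    @Override public String toString() { return new String(data, offset, length); }
}

final class WordLengthDemo {
    /** The UDF only inspects its argument and does not keep a reference to it. */
    static int wordLength(CharSequence word) { return word.length(); }

    /** Words are stored back to back in chars; lengths[i] is the length of word i. */
    static int[] wordLengths(char[] chars, int[] lengths) {
        ReusableCharView view = new ReusableCharView();   // single allocation for all rows
        int[] out = new int[lengths.length];
        int offset = 0;
        for (int i = 0; i < lengths.length; i++) {
            view.wrap(chars, offset, lengths[i]);         // no copy, no new object
            out[i] = wordLength(view);
            offset += lengths[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] result = wordLengths("heyworld".toCharArray(), new int[] {3, 5});
        System.out.println(result[0] + " " + result[1]);  // prints "3 5"
    }
}
```

If the UDF stored the view in a field or a collection, later rows would silently change its contents, which is exactly the guarantee violation the slide warns about.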

  • Word length UDF performance

    Wall-clock time relative to Spark-only (Spark-only = 1):

    Variant                      Relative time
    JIT-compiled                 1.39
    Reuse String object          1.12
    Wrap engine buffer           0.55
    Relax exception handling     0.5
    Remove memory fence          0.42
    Hand-written                 0.39

    These optimizations violate Java language guarantees!

  • Summary

    • Transparent strided execution of tuple-based SparkSQL UDFs inside a C++ engine
    • Comparable performance to execution in Spark
    • Java direct ByteBuffers simplify passing data between C++ code and the embedded JVM
    • Compilation to machine code is beneficial if the UDF is computationally heavy and does not create objects

  • Backup


  • Comparison with SQL

    Range query with UDF predicate

    -- def filter(a: Int, low: Int, high: Int)
    --   = low < a && a < high
    SELECT sum(a) FROM R WHERE filter(a, min, max)

    Range query with SQL predicate

    SELECT sum(a) FROM R WHERE min < a AND a < max

    [Figure: wall-clock time relative to Spark-only. UDF predicate: 0.65; SQL predicate: 0.51.]

  • Embedded JVM performance

    • Range query: 2.5B integers (10 GB), stride size 4096
    • Vincenty formulae: 64M points (1 GB), stride size 512
    • Word length: 80M words (1 GB), stride size 1024; word length follows a Poisson distribution
    • Intel Xeon X7560, 2.27 GHz, 512 GB RAM
    • Oracle JDK 112
    • Data stored as uncompressed Parquet files with RLE/PLAIN encoding

    [Figure: wall-clock time relative to Spark-only, listed as Spark-only / strided execution / unstrided execution. Range query (bandwidth-bound): 1 / 0.65 / 8.91. Vincenty formulae (compute-bound): 1 / 0.89 / 1.07. Word length (object creation): 1 / 0.83 / 3.77.]

  • State of the art

    • Impala: supports C++ UDFs (fast) and unmodified Hive UDFs (tuple-based, slow, discouraged)

    • Vertica: User-defined Transform Functions that process multiple rows via an iterator

    • Tupleware: UDFs in LLVM-based languages


  • Wildfire system architecture

    [Diagram: Spark applications, Spark executors, Wildfire engines, and a shared file system. Workloads shown: high-volume transactions, analytics that require the most recent data, and analytics that tolerate slightly stale data.]

  • Java direct ByteBuffers

    • Wraps memory that is not managed by the JVM
    • Access in Java through typed getter and setter methods
    • JVM will make “best effort” to avoid unnecessary data copies
    • Primitive types: equivalent to typed array
    • String objects: data must be copied into temporary array
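A minimal sketch of the typed accessors on a direct buffer, using only the standard `java.nio` API. One assumption to note: `ByteBuffer.allocateDirect` stands in for engine-owned memory here; a native engine would instead wrap its own memory through the JNI function `NewDirectByteBuffer`. The class name `DirectBufferDemo` is illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DirectBufferDemo {
    /** Writes ints through the typed setter and reads them back through the typed getter. */
    static int sum(int[] values) {
        // Off-heap memory in the platform's byte order, as native code would lay it out.
        ByteBuffer buffer = ByteBuffer.allocateDirect(values.length * Integer.BYTES)
                                      .order(ByteOrder.nativeOrder());
        for (int v : values) {
            buffer.putInt(v);                // relative typed setter, advances the position
        }
        buffer.flip();                       // switch from writing to reading

        int total = 0;
        while (buffer.hasRemaining()) {
            total += buffer.getInt();        // relative typed getter
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4}));   // prints 10
    }
}
```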


  • Strided execution wrapper

    public class StridedExecutionWrapper {
      public static void wrapUdf(
          UdfClass udfInstance,
          int numRows,
          ByteBuffer output,
          ByteBuffer auxiliaryOutput,
          ByteBuffer[] inputs,
          ByteBuffer[] auxiliaryInputs) {
        // putX/getX stand for the typed accessors that match the UDF signature
        for (int i = 0; i < numRows; ++i) {
          output.putX(udfInstance.apply(
              inputs[0].getX(), … , inputs[N].getX()));
        }
      }
    }
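As a concrete instance of the wrapper template, the following sketch specializes it for a UDF that maps one `int` to one `int`. It is an assumption-laden illustration, not the generated wrapper itself: the UDF is modelled as a `java.util.function.IntUnaryOperator`, the auxiliary buffers are omitted, and `runStride` is a hypothetical driver that plays the role of the engine by filling and draining direct buffers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

/** Hypothetical int-to-int specialization of the wrapper; names are illustrative. */
final class IntStridedWrapper {
    /** Reads numRows ints from input, applies the UDF, and appends each result to output. */
    static void wrapUdf(IntUnaryOperator udf, int numRows, ByteBuffer output, ByteBuffer input) {
        for (int i = 0; i < numRows; ++i) {
            output.putInt(udf.applyAsInt(input.getInt()));
        }
    }

    /** Demo driver: copies rows into a direct buffer, runs one stride, reads the results back. */
    static int[] runStride(IntUnaryOperator udf, int[] rows) {
        // allocateDirect stands in for buffers that the engine would own.
        ByteBuffer input = newDirect(rows.length);
        ByteBuffer output = newDirect(rows.length);
        for (int row : rows) {
            input.putInt(row);
        }
        input.flip();                                // make the written rows readable

        wrapUdf(udf, rows.length, output, input);    // a single call covers the whole stride

        output.flip();
        int[] results = new int[rows.length];
        for (int i = 0; i < results.length; i++) {
            results[i] = output.getInt();
        }
        return results;
    }

    private static ByteBuffer newDirect(int numInts) {
        return ByteBuffer.allocateDirect(numInts * Integer.BYTES).order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        int offset = 10;
        int[] results = runStride(i -> i + offset, new int[] {1, 2, 3});
        System.out.println(Arrays.toString(results));   // prints [11, 12, 13]
    }
}
```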

  • Supported types

    Supported
    • Primitive types (no extra copies)
    • Strings (copy into temporary array)

    Not supported in prototype
    • User-defined types require object creation (straightforward if type can be decomposed into primitives)
    • Complex nested types cannot be used with scalar UDFs

  • Security considerations

    • UDFs must not crash the database
    • Typically executed in a separate process
    • JVM offers similar separation
    • Error conditions are signalled as exceptions
    • UDFs can further be restricted through Java SecurityManagers

  • Thread scaling

    [Figure: speedup (log scale, 1 to 16) vs. number of threads (log scale, 1 to 64). Series: SQL predicate, UDF predicate.]

  • Modified Java String implementation

    [Diagrams: how a string (value and length) in the engine's auxiliary buffer reaches a String object in the BugVM JVM]

    • Default implementation: String object with length and char array; length is set, value is copied; 2 object allocations
    • Reuse string buffer: ReusedString object with length and char array; length is set, value is copied; 1 object allocation
    • Wrap engine buffer: CustomString object with length and pointer; length is set, pointer wraps the engine buffer; 1 object allocation