Post on 21-Jun-2022
Optimizing the Catalyst Optimizer for Complex Plans
Jianneng Li, Software Engineer, Workday
Asif Shahid, Software Engineer, Workday
Safe Harbor Statement
This presentation may contain forward-looking statements for which there are risks, uncertainties, and assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions could differ materially from results implied by the forward-looking statements. Forward-looking statements include any statements regarding strategies or plans for future operations; any statements concerning new features, enhancements or upgrades to our existing applications or plans for future applications; and any statements of belief. Further information on risks that could affect Workday’s results is included in our filings with the Securities and Exchange Commission which are available on the Workday investor relations webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap, blog, our website, press release or public statement that are not currently available are subject to change at Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services, features, and functions that are currently available.
Agenda
▪ Workday Prism Analytics
▪ Complex Plans
▪ Handling Complex Plans
  ▪ Common Subexpression Elimination (CSE)
  ▪ Large Case Expressions
  ▪ Constraint Propagation
▪ Closing Thoughts
Workday Prism Analytics
Example Spark physical plan of our pipeline shown in Spark UI
• Customers use our self-service product to build data transformation pipelines, which are compiled to DataFrames and executed by Spark
• Finance and HR use cases
• Use cases often involve complex data pipelines
Spark in Workday Prism Analytics
For more details, see the SAIS 2019 session, Lessons Learned using Apache Spark for Self-Service Data Prep in SaaS World
Complex Plans
What are Complex Plans
• Thousands of operators
• Many (self) joins, unions, and large expressions
• Takes Catalyst hours to compile and optimize
• Difficult to understand or inspect visually
Example: Data Validation
id name group_id
1 a x
2 b y
3 c y
4 d z
Part 1: filter rows based on criteria
SELECT *
FROM dataset
WHERE id > 1 AND name != "b"
Part 2: if a row is filtered, other rows in the same group_id are also filtered
SELECT *
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
LEFT ANTI JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
Part 3: compute invalid rows too
SELECT *
FROM dataset
WHERE NOT (id > 1 AND name != "b")
UNION ALL
SELECT l.id, l.name, l.group_id
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
INNER JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
Part 4: show unique error message for each filter criterion
SELECT *
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
LEFT ANTI JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
Plus more self unions, one per filter criterion
About Complex Plans
• Complexity increases gradually over time
• We could ask customers to optimize, but it is much better if performance is good without manual optimizations
Handling Complex Plans
Common Subexpression Elimination (CSE)
• Identify shared subplans, and cache them
  • E.g. self joins, self unions, reused scans
• Performed while creating DataFrames
  • Heuristic
  • Algorithmic
Original plan:

Union(
  Parse(“Dataset A”),
  Join(
    Parse(“Dataset A”),
    Parse(“Dataset B”)),
  Join(
    Parse(“Dataset A”),
    Parse(“Dataset B”))
)

After caching Parse(“Dataset A”):

Union(
  Cache(ID=1,
    Parse(“Dataset A”)),
  Join(
    Cache(ID=1, ∅),
    Parse(“Dataset B”)),
  Join(
    Cache(ID=1, ∅),
    Parse(“Dataset B”))
)

After also caching the shared Join:

Union(
  Cache(ID=1,
    Parse(“Dataset A”)),
  Cache(ID=2,
    Join(
      Cache(ID=1, ∅),
      Parse(“Dataset B”))),
  Cache(ID=2, ∅)
)
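The transformation above can be sketched in a few lines: count each distinct subtree once (without descending into repeats), then rewrite top-down, wrapping the first occurrence of a shared subtree in a cache node and replacing later occurrences with empty references. This is a toy illustration, not Workday's or Spark's actual implementation; the `Plan` class and `cse` function are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple
from collections import Counter

@dataclass(frozen=True)
class Plan:
    op: str                          # e.g. "Parse", "Join", "Union", "Cache"
    children: Tuple["Plan", ...] = ()
    arg: str = ""                    # e.g. dataset name, or cache ID

def cse(plan):
    # Count occurrences of each distinct subtree; skip the contents of
    # repeats, since they are covered by the first occurrence.
    counts = Counter()
    def count(node):
        counts[node] += 1
        if counts[node] == 1:
            for c in node.children:
                count(c)
    count(plan)

    ids = {}
    def rewrite(node):
        if counts[node] > 1:
            if node in ids:
                # Later occurrence: emit an empty reference to the cache
                return Plan("Cache", (), f"ID={ids[node]}, ∅")
            ids[node] = len(ids) + 1
            # First occurrence: cache the subtree (children rewritten too,
            # so nested shared subplans are also cached)
            body = Plan(node.op, tuple(rewrite(c) for c in node.children), node.arg)
            return Plan("Cache", (body,), f"ID={ids[node]}")
        return Plan(node.op, tuple(rewrite(c) for c in node.children), node.arg)
    return rewrite(plan)

a = Plan("Parse", arg="Dataset A")
b = Plan("Parse", arg="Dataset B")
plan = Plan("Union", (a, Plan("Join", (a, b)), Plan("Join", (a, b))))
result = cse(plan)
```

Because `Plan` is a frozen dataclass, structurally identical subtrees compare and hash equal, which is what lets the counter spot the shared `Parse` and `Join` nodes.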
CSE Benchmark
                                        Without CSE    With CSE
Number of operators in optimized plan   10K            150
Time to compile and optimize plan       10 minutes     30 seconds

4 Data Validations in one data pipeline
Spark 2.4, local mode, 4GB memory
Logging Complex Plans (10s of MBs in Size)
• Stream plans to the log without generating them upfront
• Send only truncated plans to the log aggregation service
Large Case Expressions
CASE
WHEN f(a, b) = 1 then 1
WHEN f(a, b) = 2 then 2
...
WHEN f(a, b) = 1000 then 1000
ELSE -1
END
Large Case Expression
1000s of branches
Problems with Large Case Expressions
• Function f is re-evaluated for each branch
• Inlined into nested Projects (CollapseProject rule)
• OOM during code generation (SPARK-29561)
Before CollapseProject:

Project [c, c AS c1]
  Project [(CASE WHEN f(a, b) = 1 THEN 10 WHEN f(a, b) = 2 THEN 20 ... END) AS c]
    Relation [a, b]

After CollapseProject (the CASE expression is duplicated):

Project [(CASE WHEN f(a, b) = 1 THEN 10 WHEN f(a, b) = 2 THEN 20 ... END) AS c,
         (CASE WHEN f(a, b) = 1 THEN 10 WHEN f(a, b) = 2 THEN 20 ... END) AS c1]
  Relation [a, b]
Handling Large Case Expressions in Catalyst
• Identify large expressions and do not collapse them
• Identify and extract f as an alias
  • Only if f is used more than once
• Disable whole-stage codegen if there are too many branches
Before:

Project [c, c AS c1]
  Project [(CASE WHEN f(a, b) = 1 THEN 10 WHEN f(a, b) = 2 THEN 20 ... END) AS c]
    Relation [a, b]

After (f extracted as the alias k):

Project [c, c AS c1]
  Project [(CASE WHEN k = 1 THEN 10 WHEN k = 2 THEN 20 ... END) AS c]
    Project [f(a, b) AS k]
      Relation [a, b]
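A toy illustration (plain Python, not Spark code) of why extracting f as an alias helps: in the original CASE, f(a, b) is re-evaluated for every WHEN branch, while the rewritten plan computes k = f(a, b) once per row. The function names and the call counter are illustrative assumptions.

```python
calls = {"f": 0}

def f(a, b):
    # Stand-in for an expensive function; counts how often it runs
    calls["f"] += 1
    return a + b

def case_inline(a, b, n_branches):
    # CASE WHEN f(a, b) = 1 THEN 1 WHEN f(a, b) = 2 THEN 2 ... ELSE -1 END
    for i in range(1, n_branches + 1):
        if f(a, b) == i:
            return i
    return -1

def case_extracted(a, b, n_branches):
    # Project f(a, b) AS k, then CASE WHEN k = 1 THEN 1 ...
    k = f(a, b)
    for i in range(1, n_branches + 1):
        if k == i:
            return i
    return -1

calls["f"] = 0
case_inline(500, 500, 1000)      # matches only the last branch
inline_calls = calls["f"]        # f ran once per branch

calls["f"] = 0
case_extracted(500, 500, 1000)
extracted_calls = calls["f"]     # f ran exactly once
```

With 1000 branches and a worst-case match, the inlined form evaluates f a thousand times per row; the extracted form once.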
Large Case Expression Benchmark
SELECT CASE
WHEN cf1 + cf2 = -1 then 1
WHEN cf1 + cf2 = -2 then 2
...
END as cf3
FROM (SELECT cf1, cf1 AS cf2
FROM (SELECT CASE
WHEN f(a) = 1 AND g(b) = 1 THEN 1
WHEN f(a) = 2 AND g(b) = 1 THEN 2
...
END as cf1
FROM dataset))
Spark 3.1, local mode, 4GB memory
Constraint Propagation
What are Constraints
• Filters on column values
• Can be used to
  • Generate new filters (e.g. IsNotNull)
  • Prune redundant filters
  • Push down new filters to the "other" side of a join
Example: Generate New Filter
Before:

Filter [a > 10]
  Relation [a, b]

Constraints: a is not null, a > 10

After:

Filter [a > 10, IsNotNull(a)]
  Relation [a, b]
Example: Prune Redundant Filter
Before:

Filter [a1 > 10]
  Project [a, b, a AS a1]
    Filter [a > 10, IsNotNull(a)]
      Relation [a, b]

Constraints: a is not null, a > 10

After (redundant filter pruned):

Project [a, b, a AS a1]
  Filter [a > 10, IsNotNull(a)]
    Relation [a, b]
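The first two uses of constraints can be sketched in a few lines of toy Python (not Catalyst's actual classes): a filter's null-intolerant predicates imply IsNotNull on the attributes they reference, and a later filter that already appears in the constraint set is redundant. The triple-based predicate representation is an assumption for this sketch.

```python
def constraints_from_filter(predicates):
    """predicates: list of (attr, op, value) triples, e.g. ('a', '>', 10)."""
    out = set()
    for attr, op, value in predicates:
        out.add((attr, op, value))
        # A comparison like a > 10 can only hold when a is non-null,
        # so it implies an extra IsNotNull(a) constraint
        out.add((attr, "IsNotNull", None))
    return out

def is_redundant(predicate, constraints):
    # A filter already implied by the constraints can be pruned
    return predicate in constraints

c = constraints_from_filter([("a", ">", 10)])
```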
Example: New Filter on “Other” Side of Join
Before:

Join [a1 = x, b = y]
  Project [a, a AS a1, b]
    Filter [a > 5, a != null]
      Relation [a, b]
  Relation [x, y]

Constraints (left side): a is not null, a > 5

After (new filter pushed to the right side):

Join [a1 = x, b = y]
  Project [a, a AS a1, b]
    Filter [a > 5, a != null]
      Relation [a, b]
  Filter [x > 5, x != null]
    Relation [x, y]
Current Constraint Propagation Algorithm
• Traverses the tree from bottom to top
• On a Filter node, creates additional IsNotNull constraints
• On a Project node with aliases, creates all possible combinations of constraints
Project [a, b, a AS a1]
  Filter [a > 10]
    Relation [a, b]

Constraints at Filter: a is not null, a > 10
Constraints at Project: a is not null, a > 10, a1 is not null, a1 > 10, EqualNullSafe(a, a1)
Current Constraint Propagation Algorithm
• To prune a filter
  • Check if the filter already exists in the constraints
• To add a new filter to the right-hand side of a join
  • Check if any constraint exists on a join key
  • Consider only those constraints that depend on a single join key

Current Algorithm Takes High Memory

• Given a filter expression F(a, b), if
  • the count of attribute a and its aliases is m
  • the count of attribute b and its aliases is n
• then the total number of intermediate constraints created for one such filter expression is approximately
  m * n (alias-combination constraints) + m + n (IsNotNull constraints) + C(m, 2) + C(n, 2) (EqualNullSafe constraints)
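The formula above is easy to check numerically; the helper below (an illustrative name, not part of Spark) shows how quickly the intermediate constraint count grows with the number of aliases.

```python
from math import comb

def intermediate_constraints(m, n):
    """Approximate intermediate constraints for one filter F(a, b), where
    attribute a plus its aliases number m and attribute b plus its aliases
    number n: m*n alias combinations, m+n IsNotNull constraints, and
    C(m,2)+C(n,2) EqualNullSafe constraints."""
    return m * n + m + n + comb(m, 2) + comb(n, 2)
```

At m = n = 10 this is already 210 constraints for a single filter expression, and at m = n = 100 it is 20,100; the growth is quadratic in the number of aliases.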
Recall: Fix for Large Case Expressions
• We created new aliases!
• New aliases cause OOM in Catalyst, due to
  • Large number of aliases
  • Large number of operators in the plan
Optimized Constraint Propagation (SPARK-33152)
• Traverses the tree from bottom to top
• On a Filter node, creates additional IsNotNull constraints
• On a Project node, creates alias lists, where
  • each list maintains the original attribute and its aliases, and each constraint is stored in terms of the original attribute
Project [a, b, a AS a1]
  Filter [a > 10]
    Relation [a, b]

Constraints at Filter: a is not null, a > 10
Constraints at Project: a is not null, a > 10; Aliases: [a, a1]
Optimized Constraint Propagation (SPARK-33152)
• To prune a filter
  • Rewrite the expression in terms of the original attribute
    • a1 > 10 becomes a > 10
  • Check if the canonical version already exists in the constraints
Before:

Filter [a1 > 10]
  Project [a, b, a AS a1]
    Filter [a > 10]
      Relation [a, b]

Constraints: a is not null, a > 10; Aliases: [a, a1]

After (redundant filter pruned):

Project [a, b, a AS a1]
  Filter [a > 10]
    Relation [a, b]
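The pruning step above can be sketched with toy code (not the actual SPARK-33152 implementation): constraints are stored only in terms of original attributes plus alias lists, so pruning canonicalizes an expression before the membership check. The string-based expressions and helper names are assumptions for this sketch.

```python
def canonicalize(expr_attrs, expr, aliases):
    """Rewrite attribute names in `expr` (a string) to their originals.
    aliases: dict mapping alias -> original attribute."""
    for attr in expr_attrs:
        if attr in aliases:
            expr = expr.replace(attr, aliases[attr])
    return expr

# Constraints are kept in terms of original attributes only, so the set
# stays small no matter how many aliases the plan introduces
constraints = {"IsNotNull(a)", "a > 10"}
aliases = {"a1": "a"}            # from Project [a, b, a AS a1]

def can_prune(expr_attrs, expr):
    # a1 > 10 canonicalizes to a > 10, which is already a known constraint
    return canonicalize(expr_attrs, expr, aliases) in constraints
```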
Optimized Constraint Propagation (SPARK-33152)
• To add a new filter to the right-hand side of a join
  • Rewrite the expression in terms of the original attributes
  • Check if any constraint exists on a join key
Before:

Join [a1 = x, b1 = y]
  Project [a, a AS a1, b, b AS b1]
    Filter [a + b > 5, a != null, b != null]
      Relation [a, b]
  Relation [x, y]

Constraints (left side): a is not null, b is not null, a + b > 5; Aliases: [a, a1], [b, b1]

After (compound filter pushed to the right side):

Join [a1 = x, b1 = y]
  Project [a, a AS a1, b, b AS b1]
    Filter [a + b > 5, a != null, b != null]
      Relation [a, b]
  Filter [x + y > 5, x != null, y != null]
    Relation [x, y]
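The compound-filter pushdown above can be sketched as a per-attribute rewrite (toy code, not Spark's implementation): each original attribute in the constraint is mapped through its alias to the right-side attribute it joins against; if every attribute has a right-side image, the rewritten constraint becomes a new filter on the right side. All names here are illustrative.

```python
def pushdown(constraint_attrs, constraint, aliases, join_keys):
    """aliases: alias -> original attribute; join_keys: left attr -> right attr.
    Returns the constraint rewritten for the right side, or None if some
    attribute has no right-side counterpart."""
    # Invert the alias lists: original attribute -> alias used in the join
    alias_of = {orig: al for al, orig in aliases.items()}
    expr = constraint
    for attr in constraint_attrs:
        al = alias_of.get(attr, attr)
        if al not in join_keys:
            return None          # can't rewrite this attribute; give up
        expr = expr.replace(attr, join_keys[al])
    return expr

# a + b > 5 with aliases [a, a1], [b, b1] and join keys a1 = x, b1 = y
# becomes x + y > 5 on the right side
r = pushdown(["a", "b"], "a + b > 5", {"a1": "a", "b1": "b"},
             {"a1": "x", "b1": "y"})
```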
Constraint Propagation Algorithms Comparison
                                   Current Algorithm                  Improved Algorithm
Number of constraints              Combinatorial, dependent on        Independent of the number
                                   the number of aliases              of aliases
Memory usage                       High                               Low
Filter pushdown for join           Single-reference filters           Single-reference and compound filters
Creation of IsNotNull constraints  Can miss IsNotNull constraints     Detects more IsNotNull constraints
Constraint Propagation Benchmark

SELECT cf
FROM (SELECT cf
FROM (SELECT CASE
WHEN abs(c01) < 1 THEN 1
WHEN abs(c01) < 2 THEN 2
WHEN abs(c02) < 1 THEN 3
WHEN abs(c02) < 2 THEN 4
...
END AS cf
FROM (SELECT sum(a + a) AS c01,
sum(a + b) AS c02
...
FROM dataset
GROUP BY a))
WHERE cf > 0)
INNER JOIN letters ON a = cf
Spark 3.1, local mode, 4GB memory
Effect on Customer Pipeline
• Financial use case for a large insurance company
• Uses nested case statements to validate and categorize data
Closing Thoughts
Tuning Tips
• Take advantage of CSE
• Reduce the number of operators
• Limit the number of aliases
• Follow SPARK-33152 to receive updates on the improved constraint propagation algorithm
Future Work
• Improve logic for Catalyst rules
  • PushDownPredicates
  • CollapseProject
• Implement a rules engine in Spark
  • Algorithms for converting to a lookup table
  • Rete algorithm
Thank You