Rapid Development of Data Generators Using Meta Generators in PDGF

Rapid Development of Data Generators Using Meta Generators in PDGF Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24, New York City MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG

description

This presentation was given at the Sixth International Workshop on Testing Database Systems (DBTest 2013), co-located with ACM SIGMOD 2013, June 24, New York, USA. The full paper and additional information are available at: http://msrg.org/papers/dbtest13-rabl

Abstract: Generating data sets for the performance testing of database systems on a particular hardware configuration and application domain is a very time-consuming and tedious process. It is time-consuming because of the large amount of data that needs to be generated, and tedious because new data generators might need to be developed or existing ones adjusted. The difficulty in generating this data is amplified by constant advances in hardware and software that allow the testing of ever larger and more complicated systems. In this paper, we present an approach for rapidly developing customized data generators. Our approach, which is based on the Parallel Data Generator Framework (PDGF), deploys a new concept of so-called meta generators. Meta generators extend the concept of column-based generators in PDGF. Deploying meta generators in PDGF significantly reduces the development effort of customized data generators, facilitates their debugging, and eases their maintenance.

Transcript of Rapid Development of Data Generators Using Meta Generators in PDGF

Page 1: Rapid Development of Data Generators Using Meta Generators in PDGF

Rapid Development of Data Generators Using Meta Generators in PDGF

Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen

DBTest 2013, June 24, New York City

MIDDLEWARE SYSTEMS RESEARCH GROUP

MSRG.ORG

Page 2: Rapid Development of Data Generators Using Meta Generators in PDGF

DBMS Benchmarking is Increasingly Complex

• Data volumes are skyrocketing
  ◦ Enterprise data warehouses double every three years
  ◦ Many enterprise data warehouses are petabytes in size
• Systems are becoming increasingly complex
  ◦ Large number of processor cores
    ▪ Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
    ▪ Multi-node systems (sky is the limit)
  ◦ Large memory
    ▪ Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems

• How to challenge these systems?

Page 3: Rapid Development of Data Generators Using Meta Generators in PDGF

Benchmarks are increasingly complex

• More tables, columns

• More relationships, dependencies, data types, …

• How to build these benchmarks?

• Parallel Data Generation Framework to the rescue!

[Figure: bar chart of the number of tables and columns per TPC benchmark; vertical axis from 0 to 500]

Benchmark   #Tables   #Columns
TPC-A       4         10
TPC-C       9         92
TPC-E       33        188
TPC-DS      24        430

Page 4: Rapid Development of Data Generators Using Meta Generators in PDGF

Parallel Data Generation Framework

• Generic data generation framework
• Relational model
  ◦ Schema specified in configuration file
  ◦ Post-processing stage for alternative representations
• Repeatable computation
  ◦ Based on XORSHIFT random number generators
  ◦ Hierarchical seeding strategy
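The idea of repeatable computation can be illustrated with a minimal Python sketch. It assumes Marsaglia's 64-bit xorshift variant with the shift triple 13, 7, 17; the slides do not say which xorshift variant PDGF uses. The functions `derive_seed` and `field_seed`, and the hierarchy project, table, column, row, are hypothetical names chosen for illustration, not PDGF's API.

```python
MASK64 = (1 << 64) - 1


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent_seed: int, child_id: int) -> int:
    """Hypothetical derivation: mix a child's id into its parent's seed."""
    mixed = (parent_seed ^ ((child_id + 1) * 0x9E3779B97F4A7C15)) & MASK64
    # xorshift maps 0 to 0, so a zero state must be avoided.
    return xorshift64(mixed or 1)


def field_seed(project_seed: int, table_id: int, column_id: int, row: int) -> int:
    """Walk the assumed hierarchy: project -> table -> column -> row."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    return derive_seed(column_seed, row)


# The seed of any field depends only on its coordinates, so the same
# value can be recomputed at any time and on any node.
same = field_seed(12345, 0, 1, 42) == field_seed(12345, 0, 1, 42)
print(same)  # True
```

Because every field value is a pure function of the project seed and the field's position, no generator state has to be shared between threads or nodes.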

Page 5: Rapid Development of Data Generators Using Meta Generators in PDGF

Repeatable Data Generation

• Data generation based on random numbers

• More specifically parallel random number generation

• Generation of numbers within range (e.g., age)

• What if we want NULL values?

• Repeat that logic in every generator?
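The points above on parallel, repeatable generation within a range can be sketched as follows. This is an illustration under stated assumptions: Python's `random.Random` stands in for PDGF's xorshift generators, the per-row seed formula is hypothetical, and `age_for_row`, `generate_block`, and `generate_parallel` are made-up names.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def age_for_row(column_seed: int, row: int, lo: int = 0, hi: int = 120) -> int:
    """Value of one field, computed from (seed, row) alone."""
    # Hypothetical per-row seed; any deterministic mixing would do here.
    rng = random.Random(column_seed * 1_000_003 + row)
    return rng.randint(lo, hi)  # inclusive range, e.g. an age


def generate_block(column_seed: int, start: int, stop: int) -> list:
    """Generate the rows [start, stop) of one column."""
    return [age_for_row(column_seed, row) for row in range(start, stop)]


def generate_parallel(column_seed: int, rows: int, workers: int = 4) -> list:
    """Split the rows into blocks and generate each block in its own task."""
    if rows == 0:
        return []
    block = -(-rows // workers)  # ceiling division
    ranges = [(s, min(s + block, rows)) for s in range(0, rows, block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = list(pool.map(lambda r: generate_block(column_seed, *r), ranges))
    return [value for blk in blocks for value in blk]


# Blockwise generation yields exactly the same column as a single pass.
print(generate_parallel(99, 1000) == generate_block(99, 0, 1000))  # True
```

Since no row depends on the previous one, the partitioning of rows across workers does not change the generated data.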

Page 6: Rapid Development of Data Generators Using Meta Generators in PDGF

PDGF Architecture

• Controller: initialization
• Meta Scheduler: inter-node scheduling
• Scheduler: inter-thread scheduling
• Worker: blockwise data generation
• Update Black Box: coordination of data updates
• Seeding System: random sequence adaptation
• Generators: value generation
• Output system: data formatting
• To generate data for a schema, the user defines:
  ◦ Schema XML file: defines the relational schema
  ◦ Generation XML file: defines the output format (CSV, XML, merging tables)

Page 7: Rapid Development of Data Generators Using Meta Generators in PDGF

Configuring PDGF

• Schema configuration
  ◦ Data model
• Relational model
  ◦ Tables, fields
• Properties
  ◦ Table size, characters, …
• Generators
  ◦ Base generators
  ◦ Meta generators
• Update definition
  ◦ Insert, update, delete
  ◦ Generated as change data capture

<table name="SUPPLIER">
  <size>${S}</size>
  <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
    <gen_IdGenerator />
  </field>
  <field name="S_NAME" size="25" type="VARCHAR">
    <gen_PrePostfixGenerator>
      <gen_PaddingGenerator>
        <gen_OtherFieldValueGenerator>
          <reference field="S_SUPPKEY" />
        </gen_OtherFieldValueGenerator>
        <character>0</character>
        <padToLeft>true</padToLeft>
        <size>9</size>
      </gen_PaddingGenerator>
      <prefix>Supplier </prefix>
    </gen_PrePostfixGenerator>
  </field>
  [..]
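To make the nesting in the SUPPLIER configuration concrete, the following Python sketch mimics what the three nested generators for S_NAME appear to compute: take the value of S_SUPPKEY, pad it on the left with zeros to 9 characters, and prepend the prefix. The function names are hypothetical stand-ins, and the assumption that ids start at 1 is not stated on the slide.

```python
def id_generator(row: int) -> int:
    """Stand-in for gen_IdGenerator; assumes ids start at 1."""
    return row + 1


def padding(value, character: str = "0", size: int = 9, pad_to_left: bool = True) -> str:
    """Stand-in for gen_PaddingGenerator: pad the value to a fixed width."""
    text = str(value)
    return text.rjust(size, character) if pad_to_left else text.ljust(size, character)


def pre_postfix(value: str, prefix: str = "", postfix: str = "") -> str:
    """Stand-in for gen_PrePostfixGenerator."""
    return f"{prefix}{value}{postfix}"


def supplier_row(row: int):
    """Build (S_SUPPKEY, S_NAME) the way the nested configuration reads."""
    key = id_generator(row)
    # The inner generator references the S_SUPPKEY value of the same row.
    name = pre_postfix(padding(key, "0", 9, True), prefix="Supplier ")
    return key, name


print(supplier_row(0))  # (1, 'Supplier 000000001')
```

Each layer only transforms the output of the layer it wraps, which is why the same building blocks can be recombined for other fields without new code.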

Page 8: Rapid Development of Data Generators Using Meta Generators in PDGF

Base Generators in PDGF

• DictList generator
  ◦ Random line from file
• Long generator
  ◦ Random long in interval
• Others
  ◦ StaticValue
  ◦ Double
  ◦ Date
  ◦ String
  ◦ Text
  ◦ …

<table name="users">
  <size>10000</size>
  <fields>
    <field name="name">
      <type>java.sql.types.VARCHAR</type>
      <size>100</size>
      <gen_DictList>
        <file>dicts/names.dict</file>
      </gen_DictList>
    </field>
    <field name="age">
      <type>java.sql.types.NUMERIC</type>
      <gen_LongGenerator>
        <min>0</min>
        <max>120</max>
      </gen_LongGenerator>
    </field>
  </fields>
</table>
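The two base generators in this configuration can be approximated in a few lines of Python. This is a sketch, not PDGF code: the classes borrow the slide's names for readability, the dictionary is passed as a list instead of being read from `dicts/names.dict`, the maximum is assumed to be inclusive, and a single sequential `random.Random` replaces PDGF's per-field seeding.

```python
import random


class DictList:
    """Returns a random entry from a dictionary (here: an in-memory list)."""

    def __init__(self, lines):
        self.lines = list(lines)

    def generate(self, rng: random.Random) -> str:
        return rng.choice(self.lines)


class LongGenerator:
    """Returns a random integer in [minimum, maximum] (inclusive by assumption)."""

    def __init__(self, minimum: int, maximum: int):
        self.minimum = minimum
        self.maximum = maximum

    def generate(self, rng: random.Random) -> int:
        return rng.randint(self.minimum, self.maximum)


def users_table(size: int, seed: int, names):
    """Generate (name, age) rows for the 'users' table."""
    rng = random.Random(seed)
    name_gen = DictList(names)
    age_gen = LongGenerator(0, 120)
    return [(name_gen.generate(rng), age_gen.generate(rng)) for _ in range(size)]


rows = users_table(3, seed=1, names=["Ada", "Grace", "Edsger"])
print(len(rows))  # 3
```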

Page 9: Rapid Development of Data Generators Using Meta Generators in PDGF

Null Generator

• Add NULL logic to every generator?
  ◦ Could easily be implemented in higher class
  ◦ Adds to the configuration file
  ◦ Reduces performance (every time)
• Higher order generator NullGenerator
  ◦ Only used if added to the schema
  ◦ Can be added to any generator

<field name="age">
  <type>java.sql.types.NUMERIC</type>
  <gen_NullGenerator>
    <probability>0.05</probability>
    <gen_LongGenerator>
      <min>0</min>
      <max>120</max>
    </gen_LongGenerator>
  </gen_NullGenerator>
</field>
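The higher-order idea behind the NullGenerator can be sketched as a wrapper that decides whether to call its child at all. The Python classes below are illustrative only and reuse the slide's names for readability; how PDGF draws the random number for the NULL decision is not described in the slides.

```python
import random


class LongGenerator:
    """Child generator: random integer in [minimum, maximum]."""

    def __init__(self, minimum: int, maximum: int):
        self.minimum = minimum
        self.maximum = maximum

    def generate(self, rng: random.Random) -> int:
        return rng.randint(self.minimum, self.maximum)


class NullGenerator:
    """Wraps any generator and returns None with the given probability."""

    def __init__(self, probability: float, child):
        self.probability = probability
        self.child = child

    def generate(self, rng: random.Random):
        if rng.random() < self.probability:
            return None  # the child is not invoked for NULL values
        return self.child.generate(rng)


# Mirrors the configuration above: age is NULL in about 5% of the rows.
age = NullGenerator(0.05, LongGenerator(0, 120))
rng = random.Random(2013)
values = [age.generate(rng) for _ in range(10_000)]
null_share = sum(v is None for v in values) / len(values)
```

Fields that never contain NULL simply omit the wrapper and pay no extra cost, which is the point made on the slide.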

Page 10: Rapid Development of Data Generators Using Meta Generators in PDGF

Meta Generators

• Control flow and post-processing generators
  ◦ Null generator controls flow
• Post-processing
  ◦ FormattedNumberGenerator
  ◦ PaddingGenerator
  ◦ UpperLowerCaseGenerator
  ◦ PrePostfixGenerator
  ◦ FormulaGenerator
• Flow control
  ◦ ProbabilityGenerator
  ◦ SequentialGenerator
  ◦ IfGenerator
  ◦ SwitchGenerator
  ◦ ReferenceGenerator

Page 11: Rapid Development of Data Generators Using Meta Generators in PDGF

Post-Processing Example

• Phone number for users
  ◦ Tens of representations
  ◦ PhoneNumberGenerator was too inflexible
• Formatted long number
  ◦ Long numbers between 10010001 and 9999999999
  ◦ Number formatting: (%d%d%d) %d%d%d-%d%d%d%d

<field name="phonenumber">
  <type>java.sql.types.VARCHAR</type>
  <size>30</size>
  <generator name="FormattedNumberGenerator">
    <generator name="LongGenerator">
      <min>10010001</min>
      <max>9999999999</max>
    </generator>
    <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
  </generator>
</field>
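One plausible reading of the format string is that each `%d` consumes one decimal digit of the generated number. The Python sketch below implements that reading; the function name `format_digits` is hypothetical, and the left-padding with zeros for short numbers is an assumption, since the slides do not specify how PDGF handles numbers with fewer digits than placeholders.

```python
def format_digits(number: int, pattern: str) -> str:
    """Fill every %d in the pattern with one digit of the number."""
    slots = pattern.count("%d")
    digits = str(number).rjust(slots, "0")  # assumption: pad short numbers with zeros
    if len(digits) > slots:
        raise ValueError("number has more digits than the pattern has %d slots")
    result = pattern
    for digit in digits:
        # Replace the leftmost remaining placeholder with the next digit.
        result = result.replace("%d", digit, 1)
    return result


print(format_digits(2125550123, "(%d%d%d) %d%d%d-%d%d%d%d"))  # (212) 555-0123
```

Separating the number generation from its representation is what lets one LongGenerator serve many phone number formats.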

Page 12: Rapid Development of Data Generators Using Meta Generators in PDGF

Flow Control Example

• More elaborate name field
  ◦ Name male or female, 50% chance
  ◦ All upper case
  ◦ Padded to 100 characters
• Sequential generator
  ◦ Probability generator
    ▪ DictList generator
  ◦ UpperLowerCase generator
  ◦ Padding generator

<field name="name">
  <type>java.sql.types.VARCHAR</type>
  <size>100</size>
  <generator name="SequentialGenerator">
    <generator name="ProbabilityGenerator">
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/female.dict</file>
        </generator>
      </probability>
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/male.dict</file>
        </generator>
      </probability>
    </generator>
    <generator name="UpperLowerCaseGenerator">
      <mode>uppercase</mode>
    </generator>
    <generator name="PaddingGenerator">
      <character> </character>
      <padToLeft>true</padToLeft>
    </generator>
  </generator>
</field>
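The composition in this example can be mimicked with small Python closures. The sketch rests on one reading of the slide: the sequential generator runs its children in order, and each later child transforms the value produced so far. It also assumes that padding uses the field size of 100, and it replaces the dictionary files with short in-memory lists. All function names are hypothetical.

```python
import random


def dict_list(lines):
    """Pick a random entry; ignores any incoming value."""
    return lambda rng, value=None: rng.choice(lines)


def probability(options):
    """options: list of (probability, generator); picks one branch per value."""
    def generate(rng, value=None):
        roll = rng.random()
        cumulative = 0.0
        for share, generator in options:
            cumulative += share
            if roll < cumulative:
                return generator(rng, value)
        return options[-1][1](rng, value)  # guard against rounding errors
    return generate


def upper_case():
    return lambda rng, value: value.upper()


def padding(character, size, pad_to_left=True):
    return lambda rng, value: (
        value.rjust(size, character) if pad_to_left else value.ljust(size, character)
    )


def sequential(*steps):
    """Run the steps in order, feeding each result into the next step."""
    def generate(rng, value=None):
        for step in steps:
            value = step(rng, value)
        return value
    return generate


name_generator = sequential(
    probability([(0.5, dict_list(["Alice", "Maria"])),
                 (0.5, dict_list(["Bob", "Juan"]))]),
    upper_case(),
    padding(" ", 100),
)

name = name_generator(random.Random(7))
print(len(name))  # 100
```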

Page 13: Rapid Development of Data Generators Using Meta Generators in PDGF

Core Performance

• Test environment: single-core laptop, no I/O
• Base time for framework ~ 55 ns (Base Time)
  ◦ Seeding, method invocation, setting a value
• Computation time for generator 50+ ns (Gen Time)
• Cache update if referenced ~ 50 ns (Cache Update)
• Cache lookup if intra-row reference ~ 50 ns (Cache Lookup)
• Sub-generator invocation ~ 50 ns

[Figure: stacked bar chart of generation time per value (axis from 0 to 250 ns) for Static Value (no Cache), Null Generator (100% NULL), and Null Generator (0% NULL); legend: Base Time, Generator, Base Time Sub, Sub Generator]

Page 14: Rapid Development of Data Generators Using Meta Generators in PDGF

Performance Basic Generators

• Basic generators without formatting: 120 ns to 510 ns

[Figure: bar chart of generation time per value (axis from 0 to 600 ns) for DictList, LongGenerator, DoubleGenerator, DateGenerator, and RandomString]

Page 15: Rapid Development of Data Generators Using Meta Generators in PDGF

Performance Formatted Values

• Basic generators with formatting: usually > 1000 ns

[Figure: bar chart of generation time per value (axis from 0 to 2000 ns) for DictList, SimpleFormat Number Generator, DateGenerator (formatted), and DoubleGenerator (4 places)]

Page 16: Rapid Development of Data Generators Using Meta Generators in PDGF

Performance Meta Generators

• Meta generator overhead:
  ◦ Base overhead ~ 50 ns
  ◦ Generator overhead starts from 50 ns
  ◦ Sub-generator invocation ~ 50 ns
• Often negligible due to lazy formatting
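As a back-of-the-envelope illustration of these figures, the following Python sketch adds up the quoted overheads for a chain of nested generators. The purely additive model and the function names are assumptions made for this example; they give a rough lower bound, not a measured result.

```python
BASE_NS = 50            # framework base overhead per value (figure from the slide)
GENERATOR_NS = 50       # minimum cost per generator ("starts from 50 ns")
SUB_INVOCATION_NS = 50  # cost of invoking one sub-generator


def estimate_ns(chain_length: int) -> int:
    """Rough lower bound for a chain of `chain_length` nested generators.

    Assumes the individual costs simply add up, which is a simplification.
    """
    if chain_length < 1:
        raise ValueError("need at least one generator")
    sub_invocations = chain_length - 1
    return BASE_NS + chain_length * GENERATOR_NS + sub_invocations * SUB_INVOCATION_NS


def values_per_second(ns_per_value: float) -> float:
    """Convert a per-value time in nanoseconds into a single-core rate."""
    return 1e9 / ns_per_value


# One meta generator wrapping one base generator:
print(estimate_ns(2))            # 200
print(values_per_second(200.0))  # 5000000.0
```

Under this model, each additional wrapper adds roughly 100 ns per value, which is small next to the 1000 ns and more reported for formatted values.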

[Figure: bar chart of generation time per value (axis from 0 to 1600 ns) for Null Generator (100% Null), Null Generator (0% Null), PrePostFix, Sequential (exec 2), Sequential (concat 2), and Sequential (2 formatted + long)]

Page 17: Rapid Development of Data Generators Using Meta Generators in PDGF

Use Cases

• TPC-H / SSB
  ◦ 8 tables, 61 columns (first non-trivial example)
  ◦ Without meta-FVGs: 26 custom FVGs
  ◦ 2h editing: 10 custom FVGs
  ◦ 1 day reimplementation: 0 custom FVGs, i.e. no coding
  ◦ SSB variations
    ▪ Skews on dimension attributes, fact measures, references
• TPC-DI (in process)
  ◦ 20 tables, 200 columns
  ◦ 19 custom FVGs (mainly for performance in corner cases)
  ◦ 56x NullGenerator
  ◦ 32x ProbabilityGenerator
  ◦ 3000 lines of config (XML import for multiple files)

Page 18: Rapid Development of Data Generators Using Meta Generators in PDGF

Conclusion & Future Work

• Meta generators
  ◦ Improve usability and expressiveness
  ◦ Speed up schema definition
  ◦ Remove necessity for coding
  ◦ Enlarged configuration files
• Used in TPC benchmark(s)
• Performance overhead is small, often negligible
• Future work
  ◦ GUI and SQL export
  ◦ SQL import and data extraction

Page 19: Rapid Development of Data Generators Using Meta Generators in PDGF

Thanks

• Questions?

• Contact: [email protected]

• Download and try PDGF: http://www.paralleldatageneration.org

• Some big data info in our BigBench presentation Tuesday, 4pm, Industry 3