CloverETL Cluster - Big Data Parallel Processing Explained
Transcript of CloverETL Cluster - Big Data Parallel Processing Explained
HANDLING BIG DATA
The CloverETL Cluster Architecture Explained
Wednesday, August 14, 13
The Reality: you have a really big pile to deal with.
One traditional digger might not be enough.
Really Big Data
You could get a really big, expensive digger...
…or several smaller ones and get the job done faster & cheaper.
But what if the one big one suffers a mechanical failure?
With small diggers, failure of one does not affect the rest.
Which one do you choose?
[Illustration: one big digger vs. several small ones]
CloverETL Cluster resiliency features
Optimizing for robustness...
Fault resiliency – HW & SW: automatic fail-over
[Diagram: Node 1 and Node 2 before and after an automatic fail-over]
Automatic load balancing
[Diagram: a new task arrives; Node 1 and Node 2 before and after load balancing]
CloverETL Cluster - BIG DATA features
Optimizing for speed...
Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM.
And it was expensive.
Then the CloverETL team developed the concept of a data transformation cluster.
The CloverETL Cluster was born.
It creates a powerful data transformation beast from a set of low-cost commodity hardware machines.
Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster.
Each cluster node executing the transformation is automatically fed with a different portion of the input data.
[Diagram: input data split into Part 1, Part 2, Part 3 – one part per node]
[Diagram: before – one big server handles everything; now – Part 1, Part 2, Part 3 are handled by separate nodes]
Working in parallel, they finish the job faster, with fewer resources needed individually.
That sounds nice and simple. But how is it really done?
CloverETL allows certain transformation components to be assigned to multiple cluster nodes.
[Diagram: components allocated to Node 1, Node 2, Node 3 of the CloverETL Cluster; two components run 1x, one runs 3x]
Such components then run in multiple instances.
We call this Allocation.
Special components allow incoming data to be split and sent in parallel flows to multiple nodes where the processing flow continues.
[Diagram: serial data on Node 1 is partitioned into three parallel flows – the 1st, 2nd, and 3rd instance running on Node 1, Node 2, Node 3]
Other components gather data from parallel flows back into a single, serial one.
[Diagram: partitioned data flowing through the three instances on Node 1, Node 2, Node 3 is gathered back into serial data on Node 1]
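The split and gather components can be sketched in plain Python – a hypothetical stand-in for illustration, not CloverETL's actual implementation:

```python
from itertools import chain, cycle

def partition(serial, n):
    """Split a serial stream into n parallel flows, round-robin --
    a stand-in for a cluster partitioner component."""
    flows = [[] for _ in range(n)]
    for flow, rec in zip(cycle(flows), serial):
        flow.append(rec)
    return flows

def gather(flows):
    """Merge parallel flows back into one serial stream --
    a stand-in for a cluster gatherer component."""
    return list(chain.from_iterable(flows))

flows = partition(range(7), 3)
print(flows)          # [[0, 3, 6], [1, 4], [2, 5]]
print(gather(flows))  # [0, 3, 6, 1, 4, 2, 5]
```

Note that gathering does not restore the original order – that only matters for sorted data, which is covered later.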
The original transformation is automatically “rewritten” into several smaller ones, which are executed by cluster nodes in parallel.
Which nodes will be used is determined by Allocation.
[Diagram: serial data → 1st, 2nd, 3rd instance on Node 1, Node 2, Node 3 → serial data, with partitioned data in between]
Let’s take a look at an example.
In this example, we’ll read data about company addresses. There are 10,499,849 records in total.
We also calculate statistics of the number of companies residing in each US state.
We get a total of 51 records – one record per US state.
[Transformation: serial processing]
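As a reference point, the serial aggregation is easy to sketch; the record layout and field names here are hypothetical, not the actual CloverETL metadata:

```python
from collections import Counter

def count_companies_by_state(records):
    """Serial pass over all records, counting companies per state."""
    return Counter(r["state"] for r in records)

# Tiny stand-in for the 10,499,849-record input.
records = [
    {"company": "Acme", "state": "NY"},
    {"company": "Bolt", "state": "CA"},
    {"company": "Core", "state": "NY"},
]
counts = count_companies_by_state(records)
print(dict(counts))  # {'NY': 2, 'CA': 1}
```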
Here, we’re processing the same input data, but in parallel now.
We get a total of 51 records again.
[Transformation: parallel processing – a Split component fans the data out so that 3 parallel streams each work on a portion of the input data and produce partial results, and a Gather component brings them back together]
Go parallel in 1 minute.
[Screenshot: with a couple of drag & drop steps, the serial transformation becomes parallel]
What’s the Trick?
Split the input data into parallel streams.
Do the heavy lifting on smaller data portions in parallel.
Bring the individual pieces of results together at the end.
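The three steps can be sketched in Python – a loose thread-based analogy with hypothetical record fields, not CloverETL's actual cluster mechanics:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_by_state(stream):
    """The unchanged serial aggregation -- each worker just sees less data."""
    return Counter(r["state"] for r in stream)

def run_parallel(records, n_streams=3):
    # 1. Split the input data into parallel streams (round-robin).
    streams = [records[i::n_streams] for i in range(n_streams)]
    # 2. Do the heavy lifting on the smaller portions in parallel.
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        partials = list(pool.map(count_by_state, streams))
    # 3. Bring the individual partial results together at the end.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

records = [{"state": s} for s in ["NY", "CA", "NY", "TX", "CA", "NY"]]
print(dict(sorted(run_parallel(records).items())))  # {'CA': 2, 'NY': 3, 'TX': 1}
```

The key point mirrors the slides: the heavy-lifting function itself is unchanged; only the split and gather steps are added around it.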
Let’s continue.
More on allocation and partitioned sandboxes
A Sandbox
We assume you are familiar with the CloverETL Server’s concept of a SANDBOX.
SANDBOX is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes either locally or remotely.
Let’s look at a special type of sandbox – the partitioned sandbox.
In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder. Logically, the sandbox still presents the “originals” – the combined data.
[Diagram: partitioned sandbox “SboxP” – Part 1, Part 2, Part 3 of the file stored on Node 1, Node 2, Node 3]
Partitioned Sandboxes
A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes.
[Screenshot: the sandbox’s logical structure – a unified view of folders & files – alongside its physical structure, listing the locations/nodes of the files’ portions]
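The logical-vs-physical split can be illustrated with a toy mapping; all paths and node names below are invented for illustration, not CloverETL's actual layout:

```python
# A partitioned sandbox maps one logical path to per-node physical
# portions of the same file (all names here are hypothetical).
SBOXP = {
    "data-in/companies.dat": {
        "node1": "/sandboxes/SboxP/data-in/companies.dat",
        "node2": "/sandboxes/SboxP/data-in/companies.dat",
        "node3": "/sandboxes/SboxP/data-in/companies.dat",
    }
}

def logical_view():
    """Unified view: just the folder/file structure."""
    return sorted(SBOXP)

def physical_view(path):
    """Listed locations/nodes of one file's portions."""
    return sorted(SBOXP[path].items())

print(logical_view())
print(physical_view("data-in/companies.dat"))
```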
Partitioned Sandbox
A partitioned sandbox defines how data is partitioned across nodes of the CloverETL Cluster.
Allocation
Allocation defines how a transformation’s run is distributed across nodes of the CloverETL Cluster.
The allocation can be set to derive from the sandbox layout. Data processing then happens where the data resides: we tell the cluster to run our transformation components on the nodes that also contain the portions of data we want to process.
Allocation Determined By a Partitioned Sandbox:
4 partitions ⇒ 4 parallel transformations.
There’s no gathering at the end – partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.
Allocation Determined By an Explicit Number:
8 parallel transformations.
Partitioning at the beginning and gathering at the end are necessary, because we need to cross the serial⇿parallel boundary twice.
A Data Skew
Data is not uniformly distributed across partitions – this is called a data skew. It indicates that the chosen partitioning key is not the best one for maximum performance.
However, the chosen key allows us to perform a single-pass aggregation (no semi-results) – thus it’s a good tradeoff.
The busiest worker has to process 2.5 million rows, whereas the least busy one only 0.67 million – that is, approximately 3.7× less.
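A quick sanity check of the skew ratio; only the 2.5M and 0.67M figures come from the example above, the two middle partition sizes are invented to fill out four partitions:

```python
# Rows per partition; 2_500_000 and 670_000 are from the example,
# the two middle values are illustrative.
partition_sizes = [2_500_000, 1_800_000, 1_300_000, 670_000]

# Skew = busiest worker / least busy worker.
skew = max(partition_sizes) / min(partition_sizes)
print(f"skew: {skew:.2f}x")  # skew: 3.73x
```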
Parallel Pitfalls
When processing data in parallel, a few things should be considered.
Aggregating, Sorting, Joining…
Working in parallel means producing “parallel”/semi-results.
First, we produce 4 aggregated semi-results: record streams 1–4 each yield semi-results 1–4. These partial results then have to be processed further: semi-results 1–4 are aggregated once more to get the final result.
The good news: when increasing or changing the number of parallel streams, we don’t have to change the transformation.
Parallel Pitfalls
Full transformation – parallel aggregation & post-processing of semi-results: count() in step 1, sum() in step 2. Why?
Example: parallel counting of occurrences of companies per state using count().
In step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams – for example, we might get data for NY as 4 partial results in 4 different streams.
In step 2, we merge all the partial results from the 4 parallel streams into a single sequence and then aggregate again to get the final numbers. At this step the aggregation function is sum() – we sum the partial counts.
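The two-step aggregation can be sketched with Python's Counter; the field names are hypothetical:

```python
from collections import Counter

# Round-robin partitioning: NY records land in several parallel streams.
streams = [
    [{"state": "NY"}, {"state": "CA"}],
    [{"state": "NY"}, {"state": "TX"}],
    [{"state": "NY"}, {"state": "CA"}],
]

# Step 1: count() within each stream -> one partial result per stream.
partials = [Counter(r["state"] for r in s) for s in streams]
# NY shows up in three partial results, each with a count of 1.

# Step 2: merge the partials and sum() the partial counts per state.
final = Counter()
for p in partials:
    final.update(p)

print(dict(sorted(final.items())))  # {'CA': 2, 'NY': 3, 'TX': 1}
```

Applying count() again in step 2 would wrongly report how many streams saw each state; summing the partial counts gives the true totals.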
Parallel Pitfalls
Parallel sorting: (1) sort here, (2) merge here. Why?
Sorting in parallel ➔ records are sorted within the individual parallel streams, but not across all streams.
Bringing the parallel sorted streams together into a serial stream ➔ records have to be merged on the same key as used in the parallel sorting ➔ to produce an overall sorted serial result.
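Python's heapq.merge performs exactly this kind of key-based merge of already-sorted streams (the "state" field is a hypothetical sort key):

```python
import heapq
from operator import itemgetter

# Three streams, each already sorted on its own by the same key ("state"):
streams = [
    [{"state": "AK"}, {"state": "NY"}],
    [{"state": "CA"}, {"state": "TX"}],
    [{"state": "DE"}, {"state": "OR"}],
]

# Simply concatenating the streams would NOT give a globally sorted
# result; merging on the same key as the per-stream sorts does.
merged = list(heapq.merge(*streams, key=itemgetter("state")))
print([r["state"] for r in merged])  # ['AK', 'CA', 'DE', 'NY', 'OR', 'TX']
```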
Parallel Pitfalls
Parallel joining. Why?
Joining in parallel ➔ master & slave records must be partitioned by the same key/field, and that same key must be used for joining the records.
Otherwise, there is a danger that master & slave records with the same key will not join, because they end up in different parallel streams – a joiner joins only within one stream, never across streams.
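A minimal sketch of why the same partitioning function must be applied to both sides; crc32 stands in for whatever hash the partitioner really uses, and the fields are hypothetical:

```python
import zlib

def partition_by_key(records, n, key):
    """Hash-partition on the join key; applying the SAME function to
    master and slave guarantees matching records share a stream."""
    streams = [[] for _ in range(n)]
    for rec in records:
        streams[zlib.crc32(rec[key].encode()) % n].append(rec)
    return streams

master = [{"state": "NY", "name": "Acme"}, {"state": "CA", "name": "Bolt"}]
slave = [{"state": "NY", "pop": 19}, {"state": "CA", "pop": 39}]

joined = []
for m_stream, s_stream in zip(partition_by_key(master, 3, "state"),
                              partition_by_key(slave, 3, "state")):
    lookup = {s["state"]: s for s in s_stream}  # join only within a stream
    joined += [{**m, **lookup[m["state"]]}
               for m in m_stream if m["state"] in lookup]

print(len(joined))  # 2 -- every master record found its slave
```

Swapping one side to round-robin partitioning would break the guarantee, as the next example shows.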
Parallel Pitfalls
Example: parallel joining – 3 parallel streams – partitioning by state
Slave streams: 1 ⥤ [AL AK AZ AR CA CO CT DC DE FL], 2 ⥤ [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND], 3 ⥤ [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]
Master streams: 1 ⥤ [AK AZ DE], 2 ⥤ [IL MD NY], 3 ⥤ [OR PA VA]
Result (all master records joined): 1 ⥤ [AK AZ DE], 2 ⥤ [IL MD NY], 3 ⥤ [OR PA VA]
Parallel Pitfalls
Example: parallel joining – 3 parallel streams – partitioning round-robin
Slave streams: 1 ⥤ [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY], 2 ⥤ [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV], 3 ⥤ [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]
Master streams: 1 ⥤ [AK IL OR], 2 ⥤ [AZ MD VA], 3 ⥤ [DE NY PA]
Result (only some master records joined): 1 ⥤ [], 2 ⥤ [], 3 ⥤ [DE NY]
Bringing it all together…
Going parallel is easy! Try it out for yourself.
☞ BIG DATA problems are handled through the Cluster’s scalability
☞ Existing transformations can be easily converted to parallel
☞ There’s no magic – users have full control over what’s happening
☞ CloverETL Cluster has built-in fault resiliency and load balancing
If you have any questions, check out:
www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com