Saturday, December 5, 2009

Keyless VS Keyed Partitioning Algorithms

Keyless VS Keyed Partitioning Algorithms:
Keyless: Rows are distributed independently of data values
Round Robin
Entire
Same

Keyed: Rows are distributed based on values in the specified key.

Hash: Partition based on key
Example - Key is state, All "CA" rows go into the same partion;
"MA" rows go in the same partition. Two rows of the same stage never go into different partitions.

Modulus: Partition based on modulus of key divided by the no. of partitions. key is a numeric type.
Example - key is OrderNumber. Rows with the same order numner will all go into the same partition.

DB2: Matches DB2 EEE partitioning.
-------------------------------------------------------------------------
Partitioning/Collecting Algorithms:
Partitioning algorithms include:
Round Robin
Hash: Determin partition based on key value
( Require key specification )
Entire : Send all rowsdown all partitions
Same : Preserve the same partitioning
Auto : Let DataStage choose the algorithm

Collecting Algorithms:
Round robin
Sort Merge
Read in by key
Presumes data in sorted by the key in each partition
Builds a single sorted stream based on the ky

Ordered
Read all records from first partition, then second ... so on.
----------------------------------------------------------------
Partitioners and collectors have no stage nor icons of their own.
They live live on input links of stages running in parallel/sequentially, respectively.

Link markings indicate their presence.

S---------------->S (no Marking)
S----(fan out)-->P (partitioner)
P----(fan in)--->S (collector)
P----(box)------>P (no reshuffling: partitioner using "SAME" method)
P----(bow tie)-->P (reshuffling: partitioner using another method )
==================================================================
Round Robin and Random Partitioning
-----------------------------------

# Keyless partitioning methods
# Rows are evenly distributed across partitions
- Good for initial import of data if no other partitioning is needed.
- Useful for redistibuting data
# Fairly low overhead

# Round Robin assigns rows to partitions like dealing cards
- Row/Partition assignment will be the same for a given
$APT_CONFIG_FILE

# Random has slightly higher overhead, but assigns rows in a non-deterministic
fashion between job runs.

===================================================================
ENTIRE Partitioning:
-------------------------------------
# Each partition gets a complete copy of the data
- Useful for distributing lookup and reference data
Many have performance impact in MPP/clustered environments.
- On SMP platforms, Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data.
On MPP platforms, each server uses shared memory for a single local copy for a single local copy.
# ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning
- On SMP platforms, it is a good practice to set this explicitly on the NormalLookup reference link(s)

====================================================================
HASH Partitioning: ( Keyed Partition )
----------------------------------------------

# Keyed partitioning method
# Rows are distributed according to the values in key columns
- Guarantees that rows with same key values go into the same parttion.
- Needed to prevent matching rows from "hiding" in other partitions
# E.g: Join, Merge, Remove Duplicates,...
- Partition distribution is relatively equal if the data across the source key
column(s) is evenly distributed.

For certain stages (Remove Duplicates, Join . Merge) to work correctly in parallel, the user must override the default(Auto) to Hash or a variant: Range, Modules.

Here the numbers are no longer row IDs, but values of key column.
# Hash guarantees that all the rows with key value 3 end up in the same partition.
# Hash does not guarantee "continuity": here, 3s arebunched with 0s, not with
neighboring value 2.
This is an expensive version of Hash, "Range", that guarantees continuity
# Hash does not guarantee load balance.
Make sure key column(s) takes on enough values to distribute data across
available partitions
("gender" would be a poor choice of key..)
=================================================================
Modulus Partitioning:
----------------------------------
# Keyed partitioning method
# Rows are distributed according to the values in one integer key column
- Uses modulus
partition = MOD (key_value/ #partitions)
# Faster than HASH
# Guarantees that rows with identical key values go in the same partition.
# Partition size is relatively equal if the data within the key column is evenly distributed.

=========================================================================
Auto Partitioning;
--------------------------------------------------
# DataStage inserts partition components as necessary to ensure correct results
- Before any stage with "Auto" partitioning
- Generally chooses ROUND-ROBIN or SAME
- Inserts HASH on stages that require matched key values
(e.g: Join, Merge, Remove Duplicates)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links.
NOT always appropriate for MPP/clusters
# Since Datastage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed
- DataStage ha no visibility into Transformer logic
- Hash is required before Sort and Aggregator stages
- DataStage sometimes inserts un-needed partitioning
# Check the log.

Auto generally chooses Round Robin when going from sequential to parallel. It generally chooses Same when going from parallel to parallel.
Since DataStage has limited awareness of your data and business rules, explicityly specify HASH partitioning when needed, that is, when processing requires groups of related records.

No comments: