Tuesday, December 8, 2009

Combined Data using DataStage stages

Combining Data:
===============

Ways to combine data:

# Horizontally:
- Multiple input links
- One output link made of columns from different input links.
- Joins
- Lookup
- Merge

# Vertically:
- One input link, one output link combining groups of related records into a single record.
- Aggregator
- Remove Duplicates

# Funneling: Multiple input streams funneled into a single output stream.
- Funnel stage.
---------------------------------------------------------------------------

Lookup, Merge, Join Stages:
==========================

# These stages combine two or more input links
- Data is combined by designated "key" column(s)

# These stages differ mainly in;
- Memory usage
- Treatement of rows with unmatched key values
- Input requirements (sorted, de-duplicated)


Lookup Features:
-----------------

# One Stream input link (source link)
# Multiple reference links
# One output link
# Lookup failure options
- Continue, Drop, Fail, Reject
# Reject link
- Only available with Reject lookup failure option.
# Can return multiple matching rows
# Hash file in built in memory from the lookup files
- Indexed by key
- Should be small enough to fit into physical memory.

Lookup Types:
-----------------
# Equality match
- Match exactly values in the lookup key coulmn of the reference link to selected values in the source row.
- Return row or rows (if multiple matches are to be returned) that match
# Caseless match:
- Like an equality match except that it's caseless
E.g., "abc" matches 'ABC'
# Range on the reference link
- Two columns on the reference link difine the range
- A match occurs when a selected value in the source row is within the range.
# Range on the source link
- Two columns on the source link define the range
- A match occurs when a selected value in the reference link is within the range.

For an equality or caseless match lookup, one or more columns in the reference link are selected as keys. Column from the source link are matched to the key columns using drag and drop. To specify an equality match, select the equal sign (=) from the key Type box of the reference link pannel. To specify a caseless match, select the Caseless from the key Type box of the reference link panel.

Output columns are specified in the top, right panel. Columns from the source and reference link are dragged to the front of these columns to specify the values to be mapped to the output columns.

Lookup Failure Actions:
-----------------------
If the lookup fails to find a matching key column, one of several actions can be taken;

# Fail (Defailt)
- Stage reports an error and the job fails immediately
# Drop
- Input row is dropped
# Continue
- Input row is transferred to the output. Reference link columns are filled with null or default values.
# Reject
- Input row sent to a reject link
- This requires that a reject link has been created for the stage.
=======================================================
Join Stage:
-----------------
# Four types of joins
- Inner
- Left outer
- Right outer
- Full outer

# Input links must be sorted
- Left link and a right link
- Supports additional "intermediate" links

# Light-weight
- Little memory required, because of the sort requirement

# Join key column or columns
- Names for each input link must match.
---------------------------------------------
Inner Join:
# Transfers rows from both data sets whose key columns have matching values.

Left Outer Join:

# Transfers all values from the left link and transfers values from the right link only where key columns match.

Right Outer Join:

# Transfers all values from the right link and transfers values from the left link only where key columns match.
=============================================================
Merge Stage:
# Similar to Join Stage
# Input links must be sorted
- Master link and one or more secondery links
- Master must be duplicate-free
# Light-Weight
- Little memory required, because of the sort requirement.
# Unmatched master rows can be kept or dropped
# Unmatched secondary links can be captured in a reject link.
# Unmatched updates are captured by adding additional reject links (one for each update link)
==================================================================
Funnel Stage:
# Combines data from multiple input links to a single output link.
# All sources must have identical metadata
# Three modes

- Continuous
# Records are combined in no particular order
- First available record.

- Sort Funnel
# Combines the input records in the order defined by a key.
# Produces sorted output if the input links are all sorted by the same key.

- Sequence: # Output all records from the first input link, then all from the second input link and so on.

No comments: