Friday, January 15, 2010

How to become a Master in Modify Stage

Modify Stage Purpose
With a Modify stage you can perform five classes of function, but one major limitation is that Modify stage functions are limited to a single argument - straight away arithmetic and concatenation are ruled out, for example.
The five classes of function are:

•null handling
•date/time manipulation
•date/time interval calculation (limited)
•string trimming and substring
•data type conversion
Th Robinson had another great comment pointing out it is better than the Transformer for null handling:

I used to routinely get bitten by the transformer stage rejecting rows in which some fields were participating in derivations or stage variables and the dang things had NULLS in them.
2. Solving the NULLS problem with IF IsNull() for every single field being evaluated in some way can get overly complex and very messy.
Instead I put a Modify stage before the Transformer, call the stage MhandleNull and handle_null() for all fields being evaluated in the transformer. This simplifies the already cumbersome IF THEN ELSE syntax of the transformer and/or the stage variables.

Modify versus Transformer

I've put together a table that compares the Modify Stage to the Transformer Stage to show the limitations - the Modify Stage is a highly specialist stage while Transformer is an all-rounder:

There are a lot of NOs in the Modify column! It is a specialized stage, it is very fast at converting metadata and unlike the Transformer stage it can use job parameters for just about anything. The last No is very important.

There are many wrong paths to Modify Stage enlightenment and very few beacons to light the way, take a torch

So I am learning how to use the Modify Stage through a test harness - I've built some independent DataStage jobs that can be used to test out Modify Stage syntax without having to recompile or change the job. The hardest part of debugging Modify Stage code is that you don't know which specification had the error and it is time consuming to keep changing the code, compiling and retesting. The test harness lets you isolate and test specifications one at a time.

Here are the three jobs.

Using Job Parameters

A Job Parameter is to DataStage what a Swiss Army Knife is to the camper

The examples shown below use job parameters to test different Modify Stage scenarios, there are three ways to set the value of job parameters:

- In the job properties before a compile: this is where you set the default values that come up when the job is run.

- In the Director using Set Job Parameters: this is a convenient place to change the default values created at design time as DataStage will remember and use these defaults between job runs.

- In the Director when a job is run: the set job properties window pops up when you run a job letting you change values, but these values are reset to default the next time you run the job so it's not a good place to put complex Modify test code while you are debugging.

Courtesy: Vincent McBurney
-----------------------------------------------------------------------------------
Learning the Modify Stage

The Modify stage is a metadata panel beater. It does just one thing: converts columns from one data type to another. It is quite restrictive, you cannot nest functions or use constants as values. Almost all the functions in the Modify stage are also available in the all rounder Transformer stage.
The Transformer is an automatic, the Modify Stage is a manual

Transformers are easy to use, which is one of the reasons why DataStage has been successful as it is the most commonly used stage. When adding derivation / mapping / transformation code into the text entry windows there is a right mouse click menu for finding a list of transformation functions:

It all started a couple of AscentialWorld conferences ago when Ascential DataStage experts started telling people to avoid the parallel transformer if they wanted fast performance in DataStage jobs. If a parallel job is a racing car the transformer stage was a roof rack and luggage.

Historically parallel jobs did not even have a transformer stage.

Back when the parallel engine was a product called Orchestrate there was a Transform function that worked much the same way as other parallel commands. It required a lot of manual coding. When Ascential turned Orchestrate into parallel jobs they took the transformer stage from Server Edition and sat it over the top of the parallel transform function. This is when the trouble began. The transformer stage and the Transform function didn't quite fit together, so each parallel transformer needs a C++ compilation, a kind of automated custom stage build, even in the latest versions. The message is that made it slower then other parallel stages.

The Allrounder Stage
Most of the functions of a transformer can be performed by other stages.
Transformer constraints = filter stage
Transformer metadata conversion = Modify stage
Transformer copy of fields and multiple outputs = Copy stage
Transformer creation of new fields = Column Generator stage
Transformer counters = Surrogate Key Stage


As you can see each of the parallel stages are highly specialised while the transformer is an allrounder. The specialised stages are faster because they do not carry as much overhead.

Ease of Use versus Performance
Let's start with the Modify stage. The only people out there who enjoy using this stage are also experts at crytic crosswords or were part of the team that cracked the enigma code in world war two or are members of opus dei and are looking for the DataStage equivelent of self flagulation. With no friendly right mouse click menus, no red text highlighting syntax errors, not warnings or helpful messages on compile there is simply no easy way to ensure you have correct Modify stage code without compiling and waiting for a (hopefully) helpful error message. Trial and error is the order of the day.

The Filter stage is not much better.

The Column Generator stage is just bizarre and involves navigating your way into the column properties like a trip through the pipe system on Prison Break.

The Transformer is hands down the best stage to write any type of transformation code in the entire DataStage product given the right mouse click menu, the red text marking for syntax errors and the grid layout.

Robustness
The transformer stage is one of the few stages that has a robust reject link for rows that fail transformation. The Modify and Filter stages lack this link so if a piece of data defies logic the job will abort rather then drop a row down a reject link.

3 comments:

chintu said...

Hi Surendra,
Nice information.
I would like to have one example on this stage

Anonymous said...

Hi surendra can u give some real time examples

Anonymous said...

Hi surendra can u give some real time examples