Friday, December 4, 2009

Performance Tuning DataStage Jobs

Performance Tuning - Basics
Parallelism :
Parallelism in DataStage jobs should be optimized rather than maximized. The degree of parallelism of a DataStage job is determined by the number of nodes defined in the configuration file, for example four-node or eight-node. A configuration file with a larger number of nodes generates a larger number of processes, which adds processing overhead compared with a configuration file with fewer nodes. Therefore, while choosing the configuration file, one must weigh the benefits of increased parallelism against the losses in processing efficiency (increased overhead and slower start-up time). Ideally, if the amount of data to be processed is small, use a configuration file with fewer nodes; if the data volume is large, use a configuration file with more nodes.
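
For illustration, a minimal two-node configuration file might look like the sketch below. The host name and the disk and scratch-disk paths are placeholders and would be replaced with values appropriate to your environment.

    {
      node "node1"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/node1" {pools ""}
        resource scratchdisk "/scratch/ds/node1" {pools ""}
      }
      node "node2"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/node2" {pools ""}
        resource scratchdisk "/scratch/ds/node2" {pools ""}
      }
    }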

Partitioning :
Proper partitioning of data is another aspect of DataStage job design that significantly improves overall job performance. Partitioning should be set so that the data flow is balanced, i.e. the partitions receive nearly equal volumes of data and data skew is minimized.
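
As a conceptual illustration (plain Python, not DataStage code), the sketch below hash-partitions a sample of candidate key values and reports how unevenly the rows spread across partitions; a skew ratio well above 1.0 suggests the chosen key will overload some partitions.

    import zlib
    from collections import Counter

    def partition_sizes(keys, num_partitions):
        """Hash-partition key values and return the row count per partition."""
        counts = Counter(zlib.crc32(str(k).encode()) % num_partitions for k in keys)
        return [counts.get(p, 0) for p in range(num_partitions)]

    sample_keys = ["IN", "IN", "IN", "IN", "US", "UK", "IN", "US"]   # sample of the candidate key
    sizes = partition_sizes(sample_keys, num_partitions=4)
    skew_ratio = max(sizes) / (sum(sizes) / len(sizes))              # 1.0 means perfectly balanced
    print(sizes, "skew ratio:", skew_ratio)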

Memory :
In DataStage jobs that process high volumes of data, the virtual memory settings for the job should be optimised. Jobs often abort when a single lookup has multiple reference links; this happens because of insufficient temporary memory. In such jobs, $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
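
As a sketch only, such settings are usually supplied as job-level or project-level environment variable parameters; the values below are purely illustrative assumptions and should be tuned to the data volume and the memory available on the nodes.

    APT_BUFFER_MAXIMUM_MEMORY = 50331648   # per-link buffer memory in bytes (example: 48 MB)
    APT_MONITOR_SIZE          = 100000     # report progress after this many rows (example value)
    APT_MONITOR_TIME          = 10         # report progress after this many seconds (example value)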

Performance Analysis of Various Stages in DataStage

Sequential File Stage :
The Sequential File stage is a file stage and the most common I/O stage used in a DataStage job. It is used to read data from, or write data to, one or more flat files. It can have only one input link or one output link, and it can also have one reject link. While handling huge volumes of data, this stage can itself become one of the major bottlenecks, because reading from and writing to it is slow. Sequential files should be used in the following situations:
- When reading a flat file (fixed width or delimited) in the UNIX environment that has been FTPed from some external system.
- When some UNIX operation has to be performed on the file.
Do not use sequential files for intermediate storage between jobs; this causes a performance overhead, as the data has to be converted before it is written to and read from a UNIX file. To read from this stage faster, the number of readers per node can be increased (the default value is one), as sketched below for fixed-width records.
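
As a conceptual illustration (plain Python, not DataStage code), multiple readers on a node can each take a contiguous byte range of a fixed-width file, which is roughly why the option pays off when record boundaries are predictable. The file path and record length below are assumptions for the example.

    import os

    def reader_byte_ranges(path, num_readers, record_len):
        """Split a fixed-width file into contiguous byte ranges of whole records, one per reader."""
        total_records = os.path.getsize(path) // record_len
        per_reader = total_records // num_readers
        ranges = []
        for i in range(num_readers):
            start = i * per_reader * record_len
            # the last reader picks up any leftover records at the end of the file
            end = os.path.getsize(path) if i == num_readers - 1 else (i + 1) * per_reader * record_len
            ranges.append((start, end))
        return ranges

    # Example (hypothetical path): 4 readers over a file of 100-byte fixed-width records
    # print(reader_byte_ranges("/data/in/customers.dat", num_readers=4, record_len=100))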

Data Set Stage :
The Data Set stage is a file stage which allows reading data from, or writing data to, a dataset. This stage can have a single input link or a single output link; it does not support a reject link. It can be configured to operate in sequential or parallel mode. DataStage parallel extender jobs use datasets to store the data being operated on in a persistent form. Datasets are operating system files which, by convention, have the suffix .ds. Datasets are much faster than sequential files: the data is spread across multiple nodes and is referenced by a control file. Datasets are not UNIX files, and no UNIX operation can be performed on them. Using datasets gives good performance in a set of linked jobs, and they help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.

Lookup Stage :
The Lookup stage is an active stage. It is used to perform a lookup against any parallel job stage that can output data. The Lookup stage can have reference links, a single input link, a single output link and a single reject link. Key points:
- A lookup is fast when the reference data volume is small.
- It can have multiple reference links (a sparse lookup can have only one reference link).
- The optional reject link carries source records that have no corresponding entry in the lookup reference table.
- The stage and the type of lookup should be chosen depending on the functionality and the volume of data.
- The sparse lookup type should be chosen only if the primary input data volume is small.
- If the reference data volume is large, avoid the Lookup stage, because all the reference data is pulled into local memory (see the sketch below).
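
As a conceptual illustration (plain Python, not DataStage code), a normal lookup holds the entire reference link in memory and routes unmatched primary rows to the reject link, which is why it suits small reference data. The column and key names are invented for the example.

    # Small reference data, held entirely in memory (as a normal lookup would do)
    reference_rows = [(1, "Gold"), (2, "Silver"), (3, "Bronze")]
    lookup_table = dict(reference_rows)

    primary_rows = [(101, 1), (102, 4), (103, 2)]                   # (order_id, tier_id) from the input link
    matched, rejected = [], []
    for order_id, tier_id in primary_rows:
        if tier_id in lookup_table:
            matched.append((order_id, lookup_table[tier_id]))       # goes to the output link
        else:
            rejected.append((order_id, tier_id))                    # goes to the reject link
    print(matched, rejected)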

Join Stage :
The Join stage performs a join operation on two or more input datasets and produces one output dataset. It can have multiple input links and one output link. Three types of join operation are supported: inner join, left/right outer join and full outer join. The Join stage should be used when the data volume is high; it is a good alternative to the Lookup stage for huge volumes of data, because it uses a paging method for data matching instead of holding the reference data in memory.
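
As a conceptual illustration (plain Python, not DataStage code), the sketch below shows a sort-merge style inner join that walks two key-sorted inputs in step instead of holding either side fully in memory, which is roughly why Join scales better than Lookup for large reference data. It assumes the right-hand input has unique keys, and the data values are invented.

    def merge_join(left, right):
        """Inner join of two lists of (key, value) tuples, both sorted by key; right keys are unique."""
        result, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            lkey, rkey = left[i][0], right[j][0]
            if lkey == rkey:
                result.append((lkey, left[i][1], right[j][1]))
                i += 1                      # advance the left side; the same right row may match again
            elif lkey < rkey:
                i += 1
            else:
                j += 1
        return result

    orders = [(1, "order-A"), (2, "order-B"), (2, "order-C"), (5, "order-D")]   # sorted on the key
    tiers  = [(1, "Gold"), (2, "Silver"), (3, "Bronze")]                        # sorted on the key
    print(merge_join(orders, tiers))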

Merge Stage :
The Merge stage is an active stage. It can have multiple input links, a single output link, and it supports as many reject links as there are update input links. The Merge stage takes sorted input: it combines a sorted master dataset with one or more sorted update datasets. The columns from the records in the master and update datasets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record. A master record and an update record are merged only if both have the same values for the merge key column(s) that you specify; the merge keys are one or more columns that exist in both the master and the update records. For a Merge stage to work properly, the master and update datasets should contain unique records on the merge keys. The Merge stage is generally used to combine datasets or files.
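
As a conceptual illustration (plain Python, not DataStage code), the sketch below merges a master dataset with one update dataset on a shared key: matched master rows pick up the extra update columns, and update rows with no matching master row go to that link's reject output. The column names are invented for the example.

    master = [
        {"cust_id": 1, "name": "Asha"},
        {"cust_id": 2, "name": "Ravi"},
    ]
    updates = [
        {"cust_id": 2, "segment": "Retail"},
        {"cust_id": 9, "segment": "Corporate"},
    ]

    updates_by_key = {u["cust_id"]: u for u in updates}   # unique keys, as the Merge stage expects
    output = []
    for row in master:
        extra = updates_by_key.pop(row["cust_id"], {})    # update columns for a matching key, if any
        output.append({**row, **extra})                   # master columns plus update columns
    rejects = list(updates_by_key.values())               # update records with no master match
    print(output, rejects)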

Sort Stage :
The Sort stage is an active stage, used to sort the input dataset in either ascending or descending order. It offers a variety of options: retaining the first or last record when removing duplicates, stable sorting, specifying the algorithm used for sorting to improve performance, and so on. Even though data can be sorted on a link, the Sort stage should be used when the data to be sorted is huge. When data is sorted on a link (the sort/unique option), once the data size goes beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead; therefore, if the volume of data is large, an explicit Sort stage should be used instead of a sort on the link. The Sort stage also gives an option to increase the buffer memory used for sorting, which means lower I/O and better performance.
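
As a conceptual illustration (plain Python, not DataStage code), the sketch below removes duplicates from key-sorted input while retaining the first record for each key, one of the options described above. The sample rows are invented.

    from itertools import groupby

    rows = [("A", 10), ("A", 12), ("B", 7), ("C", 3), ("C", 1)]                 # already sorted on the key
    first_per_key = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]
    print(first_per_key)                                                        # [('A', 10), ('B', 7), ('C', 3)]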

Transformer Stage :
The Transformer stage is an active stage which can have a single input link and multiple output links. It is a very robust stage with a lot of in-built functionality. The Transformer stage always generates C++ code, which is then compiled into a parallel component, so the overhead of using a Transformer stage is high. Therefore, in any job it is imperative that the use of Transformers is kept to a minimum and that other stages are used instead where possible:
- A Copy stage can be used for mapping an input link to multiple output links without any transformations.
- A Filter stage can be used for filtering out data based on certain criteria.
- A Switch stage can be used to map a single input link to multiple output links based on the value of a selector field.
It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.

Funnel Stage :
The Funnel stage is used to combine multiple input streams into a single output stream. However, the presence of a Funnel stage reduces the performance of a job; in our observations it increased the time taken by a job by around 30%. When a Funnel stage has to be used in a large job, it is better to isolate it in a job of its own: write the outputs to datasets and funnel them in a new job. The Funnel stage should be run in "continuous" mode, without hindrance.
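
As a conceptual illustration (plain Python, not DataStage code), a funnel simply combines several input streams into one output stream without changing the records themselves; the column layouts of all inputs must be identical. The sample streams are invented.

    from itertools import chain

    stream_a = [("A", 1), ("A", 2)]
    stream_b = [("B", 1)]
    stream_c = [("C", 1), ("C", 2), ("C", 3)]

    combined = list(chain(stream_a, stream_b, stream_c))   # one output stream, records unchanged
    print(combined)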

Overall Job Design :
While designing DataStage jobs, care should be taken that a single job is not overloaded with stages. Each extra stage in a job means fewer resources available for every other stage, which directly affects job performance. If possible, big jobs with a large number of stages should be logically split into smaller units. Similarly, if a particular stage has been identified as taking a lot of time in a job, such as a Transformer stage with complex functionality and a lot of stage variables and transformations, the design could place that stage in a separate job altogether, giving it more resources.

While designing jobs, also take care that unnecessary column propagation is not done. Columns which are not needed in the job flow should not be propagated from one stage to another, or from one job to the next; as far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.

Sorting needs attention as well: try to minimise the number of sorts in a job, and design the job so that operations around the same sort keys (and, if possible, the same hash keys) are combined. A frequently neglected option in the Sort stage is "don't sort if previously sorted"; setting it to "true" improves Sort stage performance a great deal. In the Transformer stage, "Preserve Sort Order" can be used to maintain the sort order of the data and reduce sorting in the job.

In a Transformer, use a minimum of stage variables: the more stage variables there are, the lower the performance. An overloaded Transformer can choke the data flow and lead to bad performance, or even failure of the job at some point. To minimise the load on Transformers:
- Avoid unnecessary function calls. For example, a varchar field holding a date value can be cast to a Date type simply by formatting the input value; there is no need to call the StringToDate function, and implicit data type conversions can often be relied on instead.
- Reduce the number of stage variables used. In our previous project we observed that removing 5 stage variables and 6 function calls reduced the runtime of a job from 2 hours to approximately 1 hour 10 minutes with 100 million input records.
- Balance the load on Transformers by sharing the transformations across the existing Transformers; this ensures a smooth flow of data.
- If you require type casting, renaming of columns or the addition of new columns, use the Copy or Modify stages to achieve this.
- Whenever you have to use lookups against large tables, look at options such as unloading the lookup tables to datasets, or using a user-defined join SQL to reduce the lookup volume with the help of temporary tables.
- The Copy stage should be used instead of a Transformer for simple operations, including: acting as a job design placeholder between stages, renaming columns, dropping columns, and implicit (default) type conversions.

An "upsert" works well if the data is sorted on the primary key column of the table being loaded; alternatively, determine whether each record already exists so that inserts and updates can be handled separately (a sketch of this split follows below). It is also sometimes possible to re-arrange the order of business logic within a job flow to leverage the same sort order, partitioning and groupings. Finally, don't read from a Sequential File stage using SAME partitioning: unless more than one source file is specified, this reads the entire file into a single partition, making the entire downstream flow run sequentially.
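
As a conceptual illustration (plain Python, not DataStage code), the sketch below splits incoming rows into separate insert and update streams by checking which keys already exist in the target, the alternative to a blind upsert mentioned above. The key values are invented for the example.

    existing_keys = {101, 102, 103}                         # keys already present in the target table
    incoming = [(101, "changed row"), (250, "new row"), (103, "changed row")]

    inserts = [row for row in incoming if row[0] not in existing_keys]
    updates = [row for row in incoming if row[0] in existing_keys]
    print("inserts:", inserts)
    print("updates:", updates)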
