Tuesday, December 29, 2009

DataStage PX 7.5

DataStage has the following features to aid the design and processing
required to build a data warehouse:
• Uses graphical design tools. With simple point-and-click techniques
you can draw a scheme to represent your processing
requirements.
• Extracts data from any number or type of database.
• Handles all the meta data definitions required to define your data
warehouse. You can view and modify the table definitions at any
point during the design of your application.
• Aggregates data. You can modify SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and
functions you can use to convert your data. You can easily extend
the functionality by defining your own transforms to use.
• Loads the data warehouse.
DataStage has four client components which are installed on any PC
running Windows 2000 or Windows NT 4.0 with Service Pack 4 or later:
• DataStage Designer. A design interface used to create DataStage
applications (known as jobs). Each job specifies the data sources,
the transforms required, and the destination of the data. Jobs are
compiled to create executables that are scheduled by the Director
and run by the Server (mainframe jobs are transferred and run on
the mainframe).
• DataStage Director. A user interface used to validate, schedule,
run, and monitor DataStage server jobs.
• DataStage Manager. A user interface used to view and edit the
contents of the Repository.
• DataStage Administrator. A user interface used to perform administration
tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria.

There are three server components:
• Repository. A central store that contains all the information
required to build a data mart or data warehouse.
• DataStage Server. Runs executable jobs that extract, transform,
and load data into a data warehouse.
• DataStage Package Installer. A user interface used to install packaged DataStage jobs and plug-ins.

DataStage Compile Command Line - dscc
Here is the command line to compile jobs.
Note that this is NOT a server command but a client command.

(On Windows, you will find it under C:\Program Files\Ascential\DataStageXX, where XX is your DataStage version)

# Usage

dscc [@argfile] [/?] /h <hostname> [/o] [/u <username>] [/p <password>] <project>
[/j <jobname>] [/r <routinename>] [/bo <buildopname>] [/f] [/ouc] [/rd <reportname>] [/rt <reporttype>]
[/jt <jobtype>] [/mful <profile>] [/mfcgb <basedirectory>]

@argfile Read arguments from a file.
/? Show usage.
/h Name of Host to attach to
/o Omit logon details
/u User Name
/p Password
project Project To Attach to
/j Job to compile, * = All Jobs, cat\* = category
/r Routine to compile, * = All Routines, cat\* = category
/bo Build Op to compile, * = All Build Ops, cat\* = category
/f Use the Force Compile
/ouc Only Compile Uncompiled Objects
/rd Report Directory and Name
/rt Type of report to produce : X = xml, T = txt (default)
/jt Job Type To Compile
-1 (Default) = All Types
0 = Server
1 = Mainframe
2 = Sequence
3 = Parallel

/mful Mainframe Job Upload Profile
/mfcgb Mainframe Code Gen base directory location
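
For long option lists, the arguments can also be placed in a file and passed via @argfile. A minimal sketch (the exact file format is an assumption and may vary by version; here a plain text file with whitespace-separated arguments) - an args.txt containing:

/h 100.100.5.5 /u user /p password myProject /j * /f /ouc

could then be used as:

dscc @args.txt > compile.log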

#Example - Compile 1 job

DSCC /H 100.100.5.5 /U user /P password myProject /j jobName /f /ouc > compile.log

#Example - Compile a category

DSCC /H 100.100.5.5 /U user /P password myProject /j Project\Category1\* /f /ouc > compile.log
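
#Example - Compile all parallel jobs with an XML report

Combining the flags above (host and names are placeholders), this should compile every uncompiled parallel job in the project and write an XML report:

DSCC /H 100.100.5.5 /U user /P password myProject /j * /jt 3 /ouc /rd report.xml /rt X > compile.log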
==================================================================================
DataStage Import Command Line - dscmdimport

Here is the command line to import jobs.
Note that this is NOT a server command but a client command.

(On Windows, you will find it under C:\Program Files\Ascential\DataStageXX, where XX is your DataStage version)

# Usage

dscmdimport [/H=<hostname>] [/U=<username>] [/P=<password>] [/O] [/NUA] <project> | /ALL | /ASK <filepaths> [/V]

/H Host name.
/U User name.
/P Password.
/O Omit flag.
/NUA Suppress usage analysis.
project Specifies the project name.
/ALL Import the specified dsx files into all projects on the server.
/ASK Asks for project name.
filepaths Specifies the import file names.
/V Verbose. Defaults to false.

Note that both the project name and the file path must be specified, and that all items will be overwritten if they already exist.

#Example

DSCMDIMPORT /H=100.100.5.5 /U=user /P=password myProject DSExport.dsx /V > import.log
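
#Example - Import into all projects

Using the documented /ALL and /NUA flags, the same export file should be importable into every project on the server, skipping usage analysis (host and credentials are placeholders):

DSCMDIMPORT /H=100.100.5.5 /U=user /P=password /NUA /ALL DSExport.dsx /V > import.log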
=====================================================================================
DataStage Export Command Line - dscmdexport

Here is the command line to export jobs.
Note that this is NOT a server command but a client command.

(On Windows, you will find it under C:\Program Files\Ascential\DataStageXX, where XX is your DataStage version)

# Usage

dscmdexport [/H=<hostname>] [/U=<username>] [/P=<password>] [/O] <project> <filepath> [/V]
/H Host name.
/U User name.
/P Password.
/O Omit flag.
project Specifies the project name.
filepath Specifies the export file name.
/V Verbose. Defaults to false.

Note that the project name must be specified.

#Example

DSCMDEXPORT /H=100.100.5.5 /U=user /P=password myProject DSExport.dsx /V > export.log
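
#Example - Migrate a project between two servers

Chaining the two commands, an export from one server can be imported into another (hosts and names are placeholders):

DSCMDEXPORT /H=100.100.5.5 /U=user /P=password myProject DSExport.dsx > export.log
DSCMDIMPORT /H=100.100.6.6 /U=user /P=password myProject DSExport.dsx /V > import.log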
=====================================================================================
DataStage Design - Making Nice-Looking Jobs
A good DataStage job is always a good-looking one.

But what makes a good-looking job?
# The job reads from left to right:
Input on the left
Output on the right
Lookup sources on the top
Rejects on the bottom

# Leave space between stages: not too short, but not too long

# Links must always be either horizontal or vertical!

=====================================================================================
DataStage Partitioning Method - Modulus

# Variant of the "hash" method
# Faster
# Offers good "load balancing"
# You can only choose 1 column as the key
# The key column type must be integer
# The partitioning algorithm is : partition = key value mod (number of partitions)

Example :

With 4 nodes, key value 5 goes to partition 1, since 5 mod 4 = 1.
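
More generally, with 4 nodes, consecutive key values cycle through the partitions:

key 4 -> 4 mod 4 = 0 -> partition 0
key 5 -> 5 mod 4 = 1 -> partition 1
key 6 -> 6 mod 4 = 2 -> partition 2
key 7 -> 7 mod 4 = 3 -> partition 3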

Notes :

#1 Modulus is a variant of the "hash" method ; use it whenever possible (a single integer key)
-------------------------------------------------------------------------------------
DataStage Partitioning - Choose your partitioning method

DataStage PX offers 7 different partitioning methods :
Same, Round Robin, Entire, Hash, Modulus, Random and Range.

Notes :

#1 "Same" is the default partitioning method used between 2 "parallel" stages
#2 "Round robin" is the default partitioning method used between a "sequential" stage and a "parallel" stage

DataStage Partitioning Method - Same

# Records keep the same distribution
# No re-partitioning

Example :

With 2 nodes, records (1,3,5) on node 1 and (2,4,6) on node 2 stay exactly where they are ; no data moves between nodes.
Notes :

#1 "Same" is the default partitioning method used between 2 "parallel" stages
#2 Perfect method to keep data sorted in each node after a sorting operation

DataStage Partitioning Method - Round Robin
# Records are sent to the nodes in turn, one after another

Example :

With 3 nodes, records 1,2,3,4,5,6 are distributed as : node 1 gets (1,4), node 2 gets (2,5), node 3 gets (3,6).

Notes :

#1 "Round robin" is the default partitioning method used between a "sequential" stage and a "parallel" stage
#2 Guarantees "load balancing" - all nodes receive (almost) the same number of records

DataStage Partitioning Method - Entire
# Records are duplicated on all nodes

Example :

With 3 nodes, records 1,2,3 are copied to every node : each of the 3 nodes holds (1,2,3).

Notes :

#1 Useful for lookup tables and parameters files
#2 Greatly increases the data size (each node receives a full copy)

You should use this partitioning method with the following stages :
Lookup (on the reference link) and stages reading small lookup tables or parameter files.

DataStage Partitioning Method - Hash

# Records are distributed according to the values of one or more keys (selected by the user)
# Records with the same key are sent to the same node

Example :

With 2 nodes and key values 1,2,3,3,4 : node 1 gets the records with keys (1,3,3), node 2 gets the records with keys (2,4).

Notes :

#1 Records with the same key are sent to the same partition (all the "3" keys end up on the same node)
#2 Continuity is not guaranteed (keys "2" and "3" are not on the same node)
#3 "Load balancing" is not guaranteed - the nodes won't all have the same number of records
#4 This method is necessary for many stages

You should use this partitioning method with the following stages :
Join, Merge, Remove Duplicates and Aggregator (partitioning on their key columns).

=================================================================================
DataStage Sorting and Partitioning Optimization - NOSORTINSERTION and NOPARTINSERTION

The default configuration of DataStage enables sorting and partitioning insertion whenever a stage needs it.

For example, with a Remove Duplicates stage, DataStage automatically uses the defined keys to sort and partition the data.

These defaults are controlled by two environment variables.
Check them in DataStage Administrator :
Projects -> Properties -> General -> Environment ... -> Parallel Menu

APT_NO_PART_INSERTION : Turn off automatic insertion of partitioners

#False : Auto partitioning is On
#True : No auto partitioning


APT_NO_SORT_INSERTION : Turn off automatic insertion of sorting

#False : Auto sort is On
#True : No auto sort

I strongly recommend setting these 2 variables to True.
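
As a sketch, in the Administrator's environment variable list (Parallel menu) you would set :

APT_NO_PART_INSERTION = True
APT_NO_SORT_INSERTION = True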

Why ?
The auto sorting & partitioning options are great if you always choose the auto partitioning option in all your stages, meaning that you also always define the data keys properly.

But a good developer will always choose the best partitioning option from the list (Hash, Entire, Same ...).
If you keep the auto partitioning & sorting options, DataStage will partition & sort twice :
first, the data is sorted and partitioned automatically, then it is sorted and partitioned again according to the options set in the stage.
This results in a great loss of time in your processes.

The auto sorting & partitioning options can also be disabled with a runtime option for parallel jobs (via DataStage Administrator).


=====================================================================================
Datastage Functions

3 functions are used to manipulate decimal values :
DecimalToDecimal, DecimalToString, StringToDecimal.

DecimalToDecimal(%decimal% [,"%rtype%"])
Return : Decimal Value

DecimalToString(%decimal% [,"suppress_zero" | "fix_zero"])
Return : Varchar

StringToDecimal(%string% [,"%rtype%"])
Return : Decimal Value

Note that the scale that you specify in the stage defines rounding operations.



Examples DecimalToDecimal :
DecimalToDecimal(0000000001234.567890) = 00000000000001234.57 (length,scale = 19,2)
DecimalToDecimal(0000000001234.567890,"floor") = 00000000000001234.56 (length,scale = 19,2)
DecimalToDecimal(0000000001234.567890,"ceil") = 00000000000001234.57 (length,scale = 19,2)
DecimalToDecimal(0000000001234.567890,"round_inf") = 00000000000001234.57 (length,scale = 19,2)
DecimalToDecimal(0000000001234.567890,"trunc_zero") = 00000000000001234.56 (length,scale = 19,2)


Examples DecimalToString :
DecimalToString(0000000001234.567890) = " 0000000001234.567890"
DecimalToString(0000000001234.567890,"suppress_zero") = " 1234.56789"
DecimalToString(0000000001234.567890,"fix_zero") = " 0000000001234.567890"

As you can see, there is a leading character in the string representation of decimal values.
This character is a space (for positive values) or a minus sign (for negative values).


Examples StringToDecimal :
StringToDecimal("0001234.56789") = 0000000001234.567890 (length,scale = 19,6)
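
As an illustration (the link and column names here are hypothetical), typical Transformer derivations using these functions could look like :

StringToDecimal(lnk_in.AMOUNT_STR, "round_inf")
DecimalToString(lnk_in.AMOUNT_DEC, "suppress_zero")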
==================================================================================
DataStage Oracle Stage - Extract your Data faster
During my assignments, I have often met BI managers or architects who are attached to the famous acronym E.T.L.: "Extraction is Extraction, Transformation is Transformation, Loading is Loading ..."
Meaning : you extract as much data as possible, and the transformation stage does the "cleaning" job ...

Well, when it comes to optimizing processing jobs, this statement is wrong.

Let's take a simple example to illustrate this.

You have an extraction job with an Oracle Stage, a copy stage and a Dataset Stage.


If you follow the recommendations of your architect, in the Oracle Stage, you select all columns of the table.


Then, in the Copy stage, you map only the data you need.


Advantages:
#1: This is a very fast method : you just have to load a schema in the Oracle stage
#2: This method works whatever you need during the transformation phase ; it is very simple to map additional columns in the Copy stage

Drawbacks:
# This is not an optimized method, because you extract data you may not need.
Even if you drop columns in the Copy stage, all the columns are still extracted by the Oracle stage !
It can be a real waste of time in a processing system.

Solution :

Select only the data you need in the Oracle stage.
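
For example (the table and column names are just placeholders), you would use a user-defined query in the Oracle stage instead of the generated one :

select KEY, COL_1, COL_2
from TABLE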

Advantages:
#1: Fewer Oracle resources used (less data selected)
#2: Fewer DataStage resources used (memory & processors)
#3: Less network bandwidth used (less data sent)
#4: Possible use of column indexes :

select * from TABLE --> TABLE ACCESS FULL

select key, col_1, col_2 from TABLE --> TABLE ACCESS BY INDEX on KEY (when the selected columns are covered by an index)
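
To verify which access path Oracle actually chooses, you can display the execution plan with standard Oracle SQL (outside DataStage) :

EXPLAIN PLAN FOR select key, col_1, col_2 from TABLE;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);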

Drawbacks:
#1: Adding additional columns can be laborious
#2: You don't follow the recommendations of your architect !
