Thursday, December 31, 2009

DataStage PX Training.



Administrator Module
Manager Module
Designer Module
Director Module
Parallelism Concepts

Administrator Module
Upon module completion, you will be able to:
Create and remove projects
Set project-level properties
Set environment variable default values and add new variables and values, if necessary


Logging into a DataStage server using the Administrator requires the host name of the server—the fully qualified name if necessary—or the server’s IP address, and an operating system username and password. For UNIX servers, users logging in as root or as a root-equivalent account, or as dsadm will have full administrative rights. For Windows servers, users logging in who are members of the Local Administrators (standalone server) or Domain Administrators (domain controller or servers in an Active Directory Forest) groups will have full administrative rights.

The Projects page lists the DataStage projects and shows the pathname of the selected project in the Project pathname field. It has the following buttons:

• Add… adds new DataStage projects. This button is enabled only if you have administrator status.
• Delete deletes projects. This button is enabled only if you have administrator status.
• Properties lets you view or set the properties of the selected project.
• NLS… lets you change project maps and locales (if the NLS option was installed during the server installation).
• Command issues DataStage Engine commands directly from the selected project.
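For example, clicking Command and entering an engine command lets you query a project's repository directly. A minimal sketch (LIST DS_JOBS is a commonly used engine command that lists the jobs stored in the selected project; verify any other table names against your own installation):

    LIST DS_JOBS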



Provided that you have the proper permissions, you can add as many projects to the DataStage server as necessary. In normal projects, any DataStage developer can create, delete, or modify any object within the project once it has been created. During the creation process, however, you can specify that a new project is protected. This is a special category of project in which, normally, nothing can be added, deleted, or changed. Users can view objects in the project and perform tasks that affect the way a job runs rather than the job's design; specifically, they can:
• Run jobs
• Set job properties
• Set job parameter default values.

A newly created protected project is populated by importing developed jobs and components. Only a Production Manager user can perform the import; no other type of user can import into a protected project.

Tip: The default directory path in which to create projects is located under the root directory of the DataStage server installation. For example, if the server was installed to /appl/Ascential/DataStage the projects would be installed to /appl/Ascential/DataStage/Projects/{project name}. If a separate UNIX file system or Windows partition is available and has enough free space (refer to the DataStage Install and Upgrade Guide for project sizing values), change the path to reflect the proper location on the server before creating the new project. As you will see later on in the course, too many projects in the same file system or partition can lead to problems with free space—which can ultimately lead to the engine malfunctioning.
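Before creating a project on a separate file system, it can help to confirm the available space from the server's command line first. A minimal UNIX sketch, assuming the example path above (the path is illustrative):

    df -k /appl/Ascential/DataStage/Projects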


Provided you have the proper permissions, you can delete any project from the DataStage server. Before you delete any project, you should be sure that you have a proper backup of the project. This can be done from the DataStage Manager and is discussed in the Manager Module.


From the General tab, you can set some basic properties for the project:
Enable job administration in Director—enabling this feature gives users the ability to Cleanup Resources and Clear Status File from within the Job menu of DataStage Director. These features let DataStage users release the resources of a job that has aborted or hung, returning the job to a state in which it can be rerun once the cause of the problem has been fixed. Users should proceed with caution when cleaning up resources, since the utility allows users to view and end job processes and to view and release the locks associated with those processes. Also, remember that users can only end processes or release locks that they own; some processes and locks will require administrative authority to clear.
Enable Runtime Column Propagation for Parallel Jobs—if you enable this feature, stages in parallel jobs can handle undefined columns that they encounter when the job is run, and propagate these columns through to the rest of the stages in the job. This check box enables the feature for the project; you can still explicitly select or deselect the option on each stage. When enabled here, all stages in parallel jobs will have RCP turned on as their default setting. This setting has no effect on jobs created on the server canvas.
Auto-purge of job log—this setting automatically purges job log entries based on the purge action you choose. For example, if you specify auto-purge up to the previous 3 job runs, entries for the previous 3 job runs are kept as new job runs complete. Keep in mind that the auto-purge setting applies to newly created jobs, not existing jobs. If you want to enable auto-purge for a previously created job, this can be done from the DataStage Director.


You can set project-wide defaults for general environment variables or ones specific to parallel jobs from this page. You can also specify new variables. All of these are then available to be used in jobs. In each of the categories except User Defined, only the default value can be modified. In the User Defined category, users can create new environment variables and assign default values.



You can trace activity on the server to help diagnose project problems. The default is for server tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files, and users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client.

You can also view or delete current trace files from within this window by selecting a trace file and clicking View or Delete.



This tab applies to Windows NT/2000 servers only. DataStage uses the Windows NT Schedule service to schedule jobs. This means that by default the job runs under the user name of the Schedule service, which defaults to NT system authority. You may find that the NT system authority does not have enough rights to run the job. To overcome this, you can specify a user account and password to run DataStage jobs in a project under that user’s authority. Click Test to test that the user name and password can be used successfully. This involves scheduling and running a command on the server, so the test may take some time to complete.



Hashed file stage caching: When a Hashed File stage writes records to a hashed file, there is an option for the write to be cached rather than written to the hashed file immediately. Similarly, when a Hashed File stage is reading a hashed file there is an option to pre-load the file to memory, which makes subsequent access much faster and is typically used when the file is providing a reference link to a Transformer stage.

Row buffering: The use of row buffering can greatly enhance performance in server jobs. Select the Enable row buffer check box to enable this feature for the whole project. There are two types of mutually exclusive row buffering:
• In process. You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.
• Inter process. Use this if you are running server jobs on a multi-CPU server (SMP). This enables the job to run using a separate process for each active stage, which will run simultaneously on a separate processor.

When you have enabled row buffering, you can specify the following:
• Buffer size. Specifies the size of the buffer used by in-process or inter-process row buffering. Defaults to 128 KB.
• Timeout. Only applies when inter-process row buffering is used. Specifies the time one process will wait to communicate with another via the buffer before timing out. Defaults to 10 seconds.

Job Parameters

* Defined in job Properties window
* Makes the job more flexible
* Parameters can be:
- Used in directory and file names
- Used to specify property values
- Used in constraints and derivations
* Parameter values are determined at run time
* When used for directory and file names and property values, they are surrounded with pound signs (#)
- E.g., #NumRows#
* Job parameters can reference DataStage environment variables
- Prefaced by $, e.g., $APT_CONFIG_FILE
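As a sketch of the last point, a job parameter named after an environment variable can be supplied at run time like any other parameter (the project name, job name, and configuration-file path below are placeholders; dsjob itself is covered in the next section):

    dsjob -run -param numrows=10 -param '$APT_CONFIG_FILE=/appl/configs/4node.apt' dx0863 GenDataJob

The single quotes keep the UNIX shell from expanding $APT_CONFIG_FILE before dsjob sees it.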
-----------------------------------------------------------------
Running Jobs from Command Line

* dsjob -run -param numrows=10 dx0863 GenDataJob
- Runs a job
- Use -run to run the job
- Use -param to specify parameters
- In this example, dx0863 is the name of the project
- In this example, GenDataJob is the name of the job
* dsjob -logsum dx0863 GenDataJob
- Displays a job's messages in the log
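A slightly fuller sketch, assuming the engine is on a remote host (the host name and credentials are placeholders; -server, -user, -password, -wait, -ljobs, and -jobinfo are standard dsjob options):

    dsjob -server etlhost -user dsadm -password secret -ljobs dx0863
    dsjob -server etlhost -user dsadm -password secret -run -wait -param numrows=100 dx0863 GenDataJob
    dsjob -server etlhost -user dsadm -password secret -jobinfo dx0863 GenDataJob

-ljobs lists the jobs in the project, -wait makes the command block until the run finishes, and -jobinfo reports the job's current status.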
------------------------------------------------------------------
Runtime Column Propagation (RCP)

* When RCP is turned on:
- Columns of data can flow through a stage without being explicitly defined in the stage
- Target columns in a stage need not have any columns explicitly mapped to them
* No column mapping enforcement at design time
- Input columns are mapped to unmapped columns by name
* How implicit columns get into a job
- Read a file using a schema in a Sequential File stage (see the sketch below)
- Read a database table using "Select *"
- Explicitly defined as an output column in a stage earlier in the flow
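Where a Sequential File stage reads a file using a schema (the first case above), the schema is a separate text file written in the engine's record-schema syntax; columns it defines need not appear in the stage's design-time table definition. A minimal sketch with illustrative field names:

    record (
      CustID: int32;
      CustName: string[max=30];
      Balance: decimal[10,2];
    )

With RCP enabled, these columns flow through to downstream stages even though no stage in the job defines them explicitly.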

Benefits of RCP

* Job flexibility
- Job can process input with different layouts
* Ability to create reusable components in shared containers
- Component logic can apply to a single named column
- All other columns flow through untouched

Enabling Runtime Column Propagation (RCP)

* Project level
- DataStage Administrator Parallel tab
* Job level
- Job properties General tab
* Stage level
- Output link Columns tab
* Settings at a lower level override settings at a higher level
- E.g., disable at the project level, but enable for a given job
- E.g., enable at the job level, but disable it for a given stage
