Thursday, December 31, 2009

Datastage EE configuration file

The Datastage EE configuration file is a master control file (a text file stored on the server) for Enterprise Edition jobs. It describes the resources and architecture of the parallel system, and supports hardware architectures such as SMP (a single machine with multiple CPUs, shared memory and disk), grid, cluster and MPP (multiple nodes, each with its own CPUs and dedicated memory).

The configuration file defines all processing and storage resources and can be edited with any text editor or within Datastage Manager.
The main benefit of the configuration file is that it separates hardware and software configuration from job design, so hardware and software resources can be changed without changing the job. Datastage EE jobs can point to different configuration files through job parameters, which means that a job can run on different hardware architectures without being recompiled.

The Datastage EE configuration file is specified at runtime by a $APT_CONFIG_FILE variable.
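
As a rough illustration of pointing the same jobs at different configuration files, the sketch below exports APT_CONFIG_FILE for a chosen environment. The select_config function, the CONFIG_DIR variable, the /opt/etl/config directory and the dev.apt / prod.apt file names are all hypothetical; only the APT_CONFIG_FILE variable comes from the text above.

```shell
# Sketch: choose a configuration file per environment and export it.
# select_config, CONFIG_DIR and the <env>.apt file names are hypothetical.
select_config() {
    etl_env="${1:-dev}"                          # environment name, defaults to "dev"
    config_dir="${CONFIG_DIR:-/opt/etl/config}"  # assumed location of the config files
    APT_CONFIG_FILE="$config_dir/$etl_env.apt"
    export APT_CONFIG_FILE
}

select_config prod
echo "APT_CONFIG_FILE=$APT_CONFIG_FILE"
```

Any job started from this shell afterwards would pick up the exported file, assuming the job does not override the variable through its own parameters.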

Configuration file structure
The Datastage EE configuration file defines the number of nodes, assigns resources to each node and provides advanced resource optimization and configuration settings.

The configuration file structure and key instructions:
node - a logical processing unit. Each node in a configuration file is identified by a virtual name and describes the number and speed of CPUs, memory availability, page and swap space, network connectivity details, etc.
fastname - defines the node's hostname or IP address.
pool - defines resource allocation. Pools can overlap across nodes or can be independent.
resource - names the disk directories accessible to each node. The resource keyword is followed by the type of resource that the directory is restricted to, for instance resource disk, resource scratchdisk, resource sort, resource bigdata.

Sample configuration files

Configuration file for a simple SMP
This basic configuration file describes a single two-CPU server. It defines 2 nodes (dev1 and dev2) on one server, etltools-dev (an IP address may be given instead of a hostname), with 3 disk resources: d1 and d2 for data and temp as scratch space. Note that dev2 uses only d1 and temp.

The configuration file is shown below:

{
    node "dev1"
    {
        fastname "etltools-dev"
        pool ""
        resource disk "/data/etltools-tutorial/d1" { }
        resource disk "/data/etltools-tutorial/d2" { }
        resource scratchdisk "/data/etltools-tutorial/temp" { }
    }

    node "dev2"
    {
        fastname "etltools-dev"
        pool ""
        resource disk "/data/etltools-tutorial/d1" { }
        resource scratchdisk "/data/etltools-tutorial/temp" { }
    }
}
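
Since the file defines the number of nodes, a quick sanity check is to count the node entries before running a job. The count_nodes helper below is a hypothetical sketch, not part of Datastage; it assumes each node entry starts on its own line in the form node "name", as in the sample above.

```shell
# Sketch: count the node entries in a configuration file.
# count_nodes is a hypothetical helper; it assumes every node entry
# starts on its own line as: node "name"
count_nodes() {
    grep -c '^[[:space:]]*node[[:space:]]*"' "$1"
}

# Usage (the path is whatever APT_CONFIG_FILE points to):
#   count_nodes "$APT_CONFIG_FILE"
```

Run against the SMP sample above, the helper should report 2.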


Configuration file for a cluster / MPP / grid
A sample configuration file for a cluster or grid of 4 machines is shown below.
The configuration defines 4 nodes (prod1 to prod4, on hosts etltools-prod[1-4]), node pools (n[1-4], s[1-4], tutorial1, tutorial2 and sort), resource pools (bigdata and sort) and a temporary scratch space.

{
    node "prod1"
    {
        fastname "etltools-prod1"
        pool "" "n1" "s1" "tutorial2" "sort"
        resource disk "/data/prod1/d1" {}
        resource disk "/data/prod1/d2" {"bigdata"}
        resource scratchdisk "/etltools-tutorial/temp" {"sort"}
    }

    node "prod2"
    {
        fastname "etltools-prod2"
        pool "" "n2" "s2" "tutorial1"
        resource disk "/data/prod2/d1" {}
        resource disk "/data/prod2/d2" {"bigdata"}
        resource scratchdisk "/etltools-tutorial/temp" {}
    }

    node "prod3"
    {
        fastname "etltools-prod3"
        pool "" "n3" "s3" "tutorial1"
        resource disk "/data/prod3/d1" {}
        resource scratchdisk "/etltools-tutorial/temp" {}
    }

    node "prod4"
    {
        fastname "etltools-prod4"
        pool "n4" "s4" "tutorial1"
        resource disk "/data/prod4/d1" {}
        resource scratchdisk "/etltools-tutorial/temp" {}
    }
}
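
On a multi-machine setup it can help to see at a glance which logical node maps to which host. The list_nodes helper below is a hypothetical sketch that prints each node name next to its fastname. It assumes the one-keyword-per-line layout used in the samples above and quoted names without spaces.

```shell
# Sketch: print "<node name> <fastname>" for every node in a configuration file.
# list_nodes is a hypothetical helper; it assumes node and fastname each
# appear on their own line and that the quoted names contain no spaces.
list_nodes() {
    awk '
        $1 == "node"     { gsub(/"/, "", $2); name = $2 }
        $1 == "fastname" { gsub(/"/, "", $2); print name, $2 }
    ' "$1"
}

# Usage:
#   list_nodes "$APT_CONFIG_FILE"
```

For the cluster sample above this would list prod1 to prod4, each followed by its etltools-prod host.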


Validate configuration file
The easiest way to validate the configuration file is to export the APT_CONFIG_FILE variable so that it points to the newly created configuration file, and then run the following command: orchadmin check
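
The two validation steps can be wrapped in a small function. This is a minimal sketch: validate_config and its return codes are hypothetical, and it assumes orchadmin is on the PATH, which normally means the Datastage environment has already been set up in the shell.

```shell
# Sketch: export APT_CONFIG_FILE and run "orchadmin check" against it.
# validate_config and its return codes (1, 2) are hypothetical conventions.
validate_config() {
    config_file="$1"
    if [ ! -r "$config_file" ]; then
        echo "Cannot read configuration file: $config_file" >&2
        return 1
    fi
    if ! command -v orchadmin >/dev/null 2>&1; then
        echo "orchadmin not found on PATH" >&2
        return 2
    fi
    APT_CONFIG_FILE="$config_file"
    export APT_CONFIG_FILE
    orchadmin check
}

# Usage (the path is a placeholder):
#   validate_config /path/to/new_config.apt
```

Checking that the file is readable first gives a clearer message than letting the check fail on a mistyped path.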
