HCL Workload Automation, Version 9.4

Defining and managing mission-critical jobs

About this task

Job schedulers can use the HCL Workload Automation command line or the Dynamic Workload Console to flag jobs as mission-critical and specify their deadlines. A critical job and all its predecessors make up what is called a critical network. At planning time, HCL Workload Automation calculates the start time of the critical job and of each of its predecessors, working backward from the deadline and estimated duration of the critical job. While the plan runs, this information is kept up to date dynamically, based on how the plan is progressing. If a predecessor, or the critical job itself, starts running late, HCL Workload Automation automatically prioritizes its submission and promotes it so that it gets more system resources and can meet its deadline.
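The backward calculation described above can be sketched as follows. This is a minimal illustration only, not the product's planning algorithm: the function name, the job names, and the durations are hypothetical, and the sketch assumes that each predecessor must finish before its successor's latest start time.

```python
from datetime import datetime, timedelta

def critical_start_times(deadline, durations, predecessors, critical_job):
    """Return the latest start time of the critical job and of each predecessor.

    deadline     -- datetime by which critical_job must complete
    durations    -- dict mapping job name to estimated duration (timedelta)
    predecessors -- dict mapping job name to a list of direct predecessors
    """
    # The critical job must start early enough to finish by its deadline.
    latest_start = {critical_job: deadline - durations[critical_job]}
    pending = [critical_job]
    while pending:
        job = pending.pop()
        for pred in predecessors.get(job, []):
            # Assumption: a predecessor must end before its successor starts.
            candidate = latest_start[job] - durations[pred]
            if pred not in latest_start or candidate < latest_start[pred]:
                latest_start[pred] = candidate
                pending.append(pred)  # propagate the tighter bound upstream
    return latest_start

# Hypothetical critical network: REPORT is the critical job, due at 5 a.m.
durations = {
    "EXTRACT": timedelta(minutes=45),
    "LOAD": timedelta(minutes=30),
    "VALIDATE": timedelta(minutes=20),
    "REPORT": timedelta(minutes=15),
}
predecessors = {
    "REPORT": ["LOAD", "VALIDATE"],
    "LOAD": ["EXTRACT"],
    "VALIDATE": ["EXTRACT"],
}
starts = critical_start_times(
    datetime(2024, 1, 1, 5, 0), durations, predecessors, "REPORT"
)
```

When a job precedes the critical job along more than one path, the sketch keeps the earliest of the candidate start times, because the job must satisfy the tightest of its successors.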

Within a critical network, HCL Workload Automation dynamically identifies the path of predecessors that is potentially most at risk; this is called the critical path. HCL Workload Automation also calculates the level of risk that each critical job has of missing its deadline. A high risk indicates that the estimated end of the critical job is after its deadline. A potential risk indicates that some predecessors of the critical job have a warning condition; for example, they are late or in error.
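The two risk levels can be expressed as a simple classification. The following sketch is an assumption-based illustration of the rule stated above, not the product's implementation: the function name and the predecessor state labels ("late", "error") are hypothetical.

```python
from datetime import datetime

# Hypothetical labels for predecessor warning conditions.
WARNING_STATES = {"late", "error"}

def risk_level(estimated_end, deadline, predecessor_states):
    """Classify a critical job as 'high risk', 'potential risk', or 'normal'."""
    # High risk: the critical job is estimated to end after its deadline.
    if estimated_end > deadline:
        return "high risk"
    # Potential risk: at least one predecessor has a warning condition.
    if any(state in WARNING_STATES for state in predecessor_states):
        return "potential risk"
    return "normal"

deadline = datetime(2024, 1, 1, 5, 0)
level = risk_level(datetime(2024, 1, 1, 4, 45), deadline, ["ok", "error"])
```

In this sketch the high risk check takes precedence, so a job that is estimated to end after its deadline is reported as high risk even if its predecessors also have warning conditions.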

The Dynamic Workload Console provides specialized views for tracking the progress of critical jobs and their predecessors. Job schedulers and operators can access the views from the Dashboard or by creating Monitor Critical Jobs tasks.

The initial view lists all critical jobs for the engine and shows the status of each: normal, potential risk, or high risk. From this view, an operator can navigate to see:

The hot list of jobs that put the critical deadline at risk.
The critical path.
Details of all critical predecessors.
Details of completed critical predecessors.
Job logs of jobs that have already run.

Using the views, operators can monitor the progress of the critical network, find out about current and potential problems, release dependencies, and rerun jobs.

For example:

To flag a critical job and follow it up, the job scheduler opens the Workload Designer on the Dynamic Workload Console, marks the specific job as critical, and sets the deadline for 5 a.m.
When JnextPlan is run, the critical start dates for this job, and all the jobs that are identified as its predecessors, are calculated.
To track a specific critical job, the operator proceeds as follows:
1. The operator checks the dashboards and sees that there are critical jobs scheduled on one of the engines.
2. He clicks the link to get a list of the critical jobs.
  The specific job shows a Potential Risk status.
3. He selects the job and clicks Hot List to see the predecessor job or jobs that are putting the critical job at risk.
  One of the predecessor jobs is listed as being in error.
4. He selects the job and clicks Job log.
  The log shows that the job failed because of incorrect credentials for a related database.
5. After discovering that the database password was changed that day, he changes the job definition in the Symphony file and reruns the job.
6. When he returns to the dashboard, he sees that no jobs are at potential risk any longer. The critical jobs list that he opened from the potential risk link also no longer shows the critical job after the job is rerun.
7. The job is now running after being automatically promoted, getting higher priority for submission and system resources.
8. No further problems need fixing, and the critical job finally completes at 4:45 a.m.