HCL Workload Automation, Version 9.4

Job scheduling events

After performing the configuration steps described in Configuring the Tivoli Enterprise Console adapter, you can use the events that the Tivoli Enterprise Console logfile adapter gathers from the HCL Workload Automation log file to perform event management and correlation with the Tivoli Enterprise Console in your scheduling environment.

This section describes the events that are generated from the information stored in the log file specified in the BmEvents.conf configuration file, which is stored on the system where you installed the Tivoli Enterprise Console logfile adapter.

An important aspect to be considered when configuring the integration with the Tivoli Enterprise Console using event adapters is whether to monitor only the master domain manager or every HCL Workload Automation agent.

If you integrate only the master domain manager, all the events coming from the entire scheduling environment are reported because the log file on a master domain manager logs the information from the entire scheduling network. On the Tivoli Enterprise Console event server and TEC event console all events will therefore look as if they come from the master domain manager, regardless of which HCL Workload Automation agent they originate from. The workstation name, job name, and job stream name are still reported to Tivoli Enterprise Console, but as a part of the message inside the event.

If, instead, you install a Tivoli Enterprise Console logfile adapter on every HCL Workload Automation agent, events are duplicated because they come both from the master domain manager and from each agent. To handle this problem, create and use a Tivoli Enterprise Console rule that detects these duplicated events, based on job_name, job_cpu, schedule_name, and schedule_cpu, and keeps just the event coming from the log file on the HCL Workload Automation agent. The same consideration also applies if you decide to integrate the backup master domain manager, if defined, because the log file on a backup master domain manager logs the information from the entire scheduling network. For information on creating new rules for the Tivoli Enterprise Console, refer to the IBM Tivoli Enterprise Console Rule Builder's Guide. For information on how to define a backup master domain manager, refer to HCL Workload Automation: Planning and Installation Guide.
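
The duplicate handling described above can be modelled outside the Tivoli Enterprise Console to make the logic concrete. The following Python sketch is only an illustration, not a TEC rule: the Event type, the source_host field, and the drop_master_duplicates function are hypothetical names, and including the event class in the comparison key is an assumption added so that different events for the same job are not treated as duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source_host: str    # hypothetical: workstation whose log file produced the event
    event_class: str
    job_name: str
    job_cpu: str
    schedule_name: str
    schedule_cpu: str


def correlation_key(ev):
    # The four slots named in the text, plus the event class (an assumption).
    return (ev.event_class, ev.job_name, ev.job_cpu,
            ev.schedule_name, ev.schedule_cpu)


def drop_master_duplicates(events, master_hosts):
    """Keep one event per key, preferring the copy read from an agent's log file."""
    kept = {}
    for ev in events:
        key = correlation_key(ev)
        current = kept.get(key)
        if current is None:
            kept[key] = ev
        elif current.source_host in master_hosts and ev.source_host not in master_hosts:
            # Replace the master's copy with the agent's copy.
            kept[key] = ev
    return list(kept.values())
```

Passing the names of the master domain manager and of any backup master domain manager in master_hosts covers both cases mentioned in the text.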

Figure 1 describes how an event is generated. It shows the Tivoli Enterprise Console logfile adapter installed on the master domain manager. This is to ensure that all the information about the job scheduling execution across the entire scheduling environment is available inside the log file on that workstation. You can decide, however, to install the Tivoli Enterprise Console logfile adapter on another workstation in your scheduling environment, depending on your environment and business needs.
Figure 1. Event generation flow
This graphic illustrates the flow of events
The logic that is used to generate job scheduling events is the following:
  • The information logged during the job scheduling process has an event number for each type of logged activity or problem.
  • Each item of information marked with an event number that appears in the EVENT field of the BmEvents.conf file is written into the log file specified in the FILE field of the BmEvents.conf file.
  • The Tivoli Enterprise Console logfile adapter reads this information inside the log file, formats it using the structure stored in the FMT file (maestro.fmt for UNIX, maestro_nt.fmt for Windows) and forwards it to the TEC event server, using the TEC gateway defined on the managed node of the Tivoli® environment.
  • On the TEC event server, the structure of the formatted information is checked using the information stored in the BAROC files and, if correct, is accepted. Otherwise, a parsing failure is reported.
  • Once the event is accepted by the TEC event server, a check on possible predefined correlation rules or automatic responses for that event number is made using the information stored in the RLS files.
  • If defined, the correlation rules and/or automatic responses are triggered and the event is sent to the TEC event console to be displayed on the defined Event Console.
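
The first stages of the flow above can be summarized in a short Python model. This is a sketch under stated assumptions, not the product's implementation: the record layout, the event numbers used by a caller, and the run_pipeline function are hypothetical, the enabled_numbers set stands in for the EVENT field of BmEvents.conf, class_by_number stands in for the FMT file, and known_classes stands in for the classes defined in the BAROC files.

```python
def run_pipeline(records, enabled_numbers, class_by_number, known_classes):
    """Model the path from a logged record to acceptance on the event server."""
    accepted, parse_failures = [], []
    for record in records:
        number = record["event_number"]
        if number not in enabled_numbers:
            # Not listed in the configuration: never written to the log file.
            continue
        # The adapter formats the record into an event of some class.
        event = {
            "class": class_by_number.get(number, "UNKNOWN"),
            "detail": record["detail"],
        }
        # The server accepts the event only if it recognizes the class.
        if event["class"] in known_classes:
            accepted.append(event)
        else:
            parse_failures.append(event)
    return accepted, parse_failures
```

Only accepted events would then be checked against correlation rules and shown on the event console.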

For some error conditions, an event informing that the alarm condition has ended is also stored in the log file and passed to the TEC event server through the Tivoli Enterprise Console logfile adapter. This kind of event is called a clearing event. It closes any related problem events on the TEC event console.
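
As an illustration of a clearing event, the following Python sketch models the behavior that Table 1 lists for TWS_Link_Established, which closes related TWS_Link_Dropped and TWS_Link_Failed events. It is not a TEC rule. The dictionary layout and the apply_clearing_event function are hypothetical, and matching related events on hostname and to_cpu is an assumption.

```python
# Which problem classes each clearing class closes (taken from Table 1).
CLEARS = {
    "TWS_Link_Established": {"TWS_Link_Dropped", "TWS_Link_Failed"},
}


def apply_clearing_event(open_events, clearing_event):
    """Close every open problem event that the clearing event relates to."""
    targets = CLEARS.get(clearing_event["class"], set())
    closed = 0
    for ev in open_events:
        related = (
            ev["class"] in targets
            and ev["status"] != "CLOSED"
            # Assumption: "related" means the same link endpoints.
            and ev["hostname"] == clearing_event["hostname"]
            and ev["to_cpu"] == clearing_event["to_cpu"]
        )
        if related:
            ev["status"] = "CLOSED"
            closed += 1
    return closed
```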

The following table describes the events and rules provided by HCL Workload Automation.

The text of the message that is assigned by the FMT file to the event is shown first in each table entry. This is the text that is sent by the Tivoli Enterprise Console logfile adapter to the TEC event server and then to the TEC event console. The placeholder %s in the messages indicates a variable. The names of the variables follow each message in parentheses.
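
To show how the placeholders and the variable names relate, the following Python sketch fills a message template with values, one value per %s, in the order in which the variables are listed. It only mimics the substitution for illustration: the render function is hypothetical and is not part of the logfile adapter.

```python
def render(template, values):
    """Substitute one value per %s placeholder, in order."""
    expected = template.count("%s")
    if expected != len(values):
        raise ValueError(
            f"template has {expected} placeholders but {len(values)} values were given"
        )
    return template % tuple(values)


# Variables (schedule_name, job_name, cpu_name); the values are made up.
message = render("Job %s.%s has started on CPU %s", ("SCHED1", "JOB1", "CPU1"))
```

The count check makes a mismatch between the number of placeholders and the number of listed variables visible instead of failing silently.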

Table 1. HCL Workload Automation events
Event Characteristic Description
"TWS process %s has been reset on host %s" (program_name, host_name) Event Class: TWS_Process_Reset
Event Severity: HARMLESS
Event Description: HCL Workload Automation daemon process reset.
"TWS process %s is gone on host %s" (program_name, host_name) Event Class: TWS_Process_Gone
Event Severity: CRITICAL
Event Description: HCL Workload Automation process gone.
"TWS process %s has abended on host %s" (program_name, host_name) Event Class: TWS_Process_Abend
Event Severity: CRITICAL
Event Description: HCL Workload Automation process abends.
"Job %s.%s failed, no recovery specified" (schedule_name, job_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Job failed, no recovery specified.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Job %s.%s failed, recovery job will be run then schedule %s will be stopped" (schedule_name, job_name, schedule_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Job failed, recovery job runs, and schedule stops.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Job %s.%s failed, this job will be rerun" (schedule_name, job_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Job failed, the job is rerun.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Job %s.%s failed, this job will be rerun after the recovery job" (schedule_name, job_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Job failed, recovery job is run, and the job is run again.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Job %s.%s failed, continuing with schedule %s" (schedule_name, job_name, schedule_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Job failed, the schedule proceeds.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Job %s.%s failed, running recovery job then continuing with schedule %s" (schedule_name, job_name, schedule_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Job failed, recovery job runs, schedule proceeds.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Failure while rerunning failed job %s.%s" (schedule_name, job_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Rerun of abended job abends.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Failure while recovering job %s.%s" (schedule_name, job_name) Event Class: TWS_Job_Abend
Event Severity: CRITICAL
Automated Action (UNIX only): Send job stdlist to the TWS_user.
Event Description: Recovery job abends.
Correlation Activity: If this job has abended more than once within a 24 hour time window, send a TWS_Job_Repeated_Failure event.
"Multiple failures of Job %s#%s in 24 hour period" (schedule_name, job_name) Event Class: TWS_Job_Repeated_Failure
Event Severity: CRITICAL
Event Description: Same job fails more than once in 24 hours.
"Job %s.%s did not start" (schedule_name, job_name) Event Class: TWS_Job_Failed
Event Severity: CRITICAL
Event Description: Job failed to start.
"Job %s.%s has started on CPU %s" (schedule_name, job_name, cpu_name) Event Class: TWS_Job_Launched
Event Severity: HARMLESS
Event Description: Job started.
Correlation Activity: Clearing Event - Close open job prompt events related to this job.
"Job %s.%s has successfully completed on CPU %s" (schedule_name, job_name, cpu_name) Event Class: TWS_Job_Done
Event Severity: HARMLESS
Event Description: Job completed successfully.
Correlation Activity: Clearing Event - Close open job started events for this job and auto-acknowledge this event.
"Job %s.%s suspended on CPU %s" (schedule_name, job_name, cpu_name) Event Class: TWS_Job_Suspended
Event Severity: WARNING
Event Description: Job suspended, the until time expired (default option suppress).
"Job %s.%s is late on CPU %s" (schedule_name, job_name, job_cpu) Event Class: TWS_Job_Late
Event Severity: WARNING
Event Description: Job late, the deadline time expired before the job completed.
"Job %s.%s:until (continue) expired on CPU %s" (schedule_name, job_name, job_cpu) Event Class: TWS_Job_Until_Cont
Event Severity: WARNING
Event Description: Job until time expired (option continue).
"Job %s.%s:until (cancel) expired on CPU %s" (schedule_name, job_name, job_cpu) Event Class: TWS_Job_Until_Canc
Event Severity: WARNING
Event Description: Job until time expired (option cancel).
(TWS Prompt Message) Event Class: TWS_Job_Recovery_Prompt
Event Severity: WARNING
Event Description: Job recovery prompt issued.
"Schedule %s suspended" (schedule_name) Event Class: TWS_Schedule_Susp
Event Severity: WARNING
Event Description: Schedule suspended, the until time expired (default option suppress).
"Schedule %s is late" (schedule_name) Event Class: TWS_Schedule_Late
Event Severity: WARNING
Event Description: Schedule late, the deadline time expired before the schedule completion.
"Schedule %s until (continue) expired" (schedule_name) Event Class: TWS_Schedule_Until_Cont
Event Severity: WARNING
Event Description: Schedule until time expired (option continue).
"Schedule %s until (cancel) expired" (schedule_name) Event Class: TWS_Schedule_Until_Canc
Event Severity: WARNING
Event Description: Schedule until time expired (option cancel).
"Schedule %s has failed" (schedule_name) Event Class: TWS_Schedule_Abend
Event Severity: CRITICAL
Event Description: Schedule abends.
Correlation Activity: If event is not acknowledged within 15 minutes, send mail to TWS_user (UNIX only).
"Schedule %s is stuck" (schedule_name) Event Class: TWS_Schedule_Stuck
Event Severity: CRITICAL
Event Description: Schedule stuck.
Correlation Activity: If event is not acknowledged within 15 minutes, send mail to TWS_user (UNIX only).
"Schedule %s has started" (schedule_name) Event Class: TWS_Schedule_Started
Event Severity: HARMLESS
Event Description: Schedule started.
Correlation Activity: Clearing Event - Close all pending schedule or schedule abend events related to this schedule.
"Schedule %s has completed" (schedule_name) Event Class: TWS_Schedule_Done
Event Severity: HARMLESS
Event Description: Schedule completed successfully.
Correlation Activity: Clearing Event - Close all related schedule started events and auto-acknowledge this event.
(Global Prompt Message) Event Class: TWS_Global_Prompt
Event Severity: WARNING
Event Description: Global prompt issued.
(Schedule Prompt's Message) Event Class: TWS_Schedule_Prompt
Event Severity: WARNING
Event Description: Schedule prompt issued.
(Job Recovery Prompt's Message) Event Class: TWS_Job_Prompt
Event Severity: WARNING
Event Description: Job recovery prompt issued.
"Comm link from %s to %s unlinked for unknown reason" (hostname, to_cpu) Event Class: TWS_Link_Dropped
Event Severity: WARNING
Event Description: HCL Workload Automation link to CPU dropped for unknown reason.
"Comm link from %s to %s unlinked via unlink command" (hostname, to_cpu) Event Class: TWS_Link_Dropped
Event Severity: HARMLESS
Event Description: HCL Workload Automation link to CPU dropped by unlink command.
"Comm link from %s to %s dropped due to error" (hostname, to_cpu) Event Class: TWS_Link_Dropped
Event Severity: CRITICAL
Event Description: HCL Workload Automation link to CPU dropped due to error.
"Comm link from %s to %s established" (hostname, to_cpu) Event Class: TWS_Link_Established
Event Severity: HARMLESS
Event Description: HCL Workload Automation link to CPU established.
Correlation Activity: Close related TWS_Link_Dropped or TWS_Link_Failed events and auto-acknowledge this event.
"Comm link from %s to %s down for unknown reason" (hostname, to_cpu) Event Class: TWS_Link_Failed
Event Severity: CRITICAL
Event Description: HCL Workload Automation link to CPU failed for unknown reason.
"Comm link from %s to %s down due to unlink" (hostname, to_cpu) Event Class: TWS_Link_Failed
Event Severity: HARMLESS
Event Description: HCL Workload Automation link to CPU failed due to unlink.
"Comm link from %s to %s down due to error" (hostname, to_cpu) Event Class: TWS_Link_Failed
Event Severity: CRITICAL
Event Description: HCL Workload Automation link to CPU failed due to error.
"Active manager %s for domain %s" (cpu_name, domain_name) Event Class: TWS_Domain_Manager_Switch
Event Severity: HARMLESS
Event Description: HCL Workload Automation domain manager switch has occurred.
"Long duration for Job %s.%s on CPU %s." (schedule_name, job_name, job_cpu) Event Class: TWS_Job_Launched
Event Severity: WARNING
Event Description: If the job is still in exec status after a time equal to its estimated duration, a new message is generated.
"Job %s.%s on CPU %s, could miss its deadline." (schedule_name, job_name, job_cpu) Event Class: TWS_Job_Ready, TWS_Job_Hold
Event Severity: WARNING
Event Description: If the job has a deadline and the sum of the job's estimated start time and estimated duration is greater than the deadline time, a new message is generated.
"Start delay of Job %s.%s on CPU %s." (schedule_name, job_name, job_cpu) Event Class: TWS_Job_Ready
Event Severity: WARNING
Event Description: If the job is still in ready status after n minutes, a new message is generated. The default value for n is 10.
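
Several TWS_Job_Abend entries in Table 1 share the same correlation activity: if the job has abended more than once within a 24 hour time window, a TWS_Job_Repeated_Failure event is sent. The following Python sketch models that check for a single job. It is an illustration only, not the rule shipped in the RLS files. The is_repeated_failure function is hypothetical, and treating an abend exactly 24 hours earlier as still inside the window is an assumption.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def is_repeated_failure(previous_abends, new_abend_time):
    """Return True if an earlier abend of the same job lies within the window."""
    return any(
        timedelta(0) <= new_abend_time - earlier <= WINDOW
        for earlier in previous_abends
    )


# Made-up timestamps: a second abend three hours after the first one.
first = datetime(2024, 1, 1, 8, 0)
repeated = is_repeated_failure([first], first + timedelta(hours=3))
```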

You can change the default criteria that control the correlation of events and the automatic responses by editing the maestro_plus.rls file (in UNIX environments) or the maestront_plus.rls file (in Windows environments). These RLS files are created during the installation of HCL Workload Automation and compiled with the BAROC file containing the event classes for the HCL Workload Automation events on the TEC event server when the Setup Event Server for TWS task is run. Before modifying either of these two files, make a backup copy of the original file and test the modified copy in a test environment.

For example, for the last event described in the table, you can change the value of n, the time the job has to be in ready state before a new message is triggered, by modifying the rule job_ready_open set for the TWS_Job_Ready event class. In the rule, the value is expressed in seconds, so the default of 10 minutes corresponds to 600.
rule: job_ready_open : (
    description: 'Start a timer rule for ready',

    event: _event of_class 'TWS_Job_Ready'
        where [
            status: outside ['CLOSED'],
            schedule_name: _schedule_name,
            job_cpu: _job_cpu,
            job_name: _job_name
        ],

    reception_action: (
        set_timer(_event, 600, 'ready event')
    )
).
If you change the value from 600 to 1200 in the set_timer predicate of the reception_action action, and then recompile and reload the rule base, the time that the job must be in ready state before a new message is triggered changes from 600 to 1200 seconds.

Refer to Tivoli Enterprise Console® User's Guide and Tivoli Enterprise Console Rule Builder's Guide for details about rules commands.