Apache Spark jobs
Apache Spark jobs define, schedule, monitor, and control the execution of Apache Spark processes.
Prerequisites
Apache Spark job definition
The job properties and their valid values are described in the context-sensitive help of the Dynamic Workload Console, which you open by clicking the question mark (?) icon in the top-right corner of the properties pane. For more information about creating jobs using the various supported product interfaces, see Defining a job.
Attribute | Description and value | Required |
---|---|---|
Connection attributes | ||
URL | The Apache Spark server URL. It must have the following format: http://<SPARK_SERVER>:8080/json (dashboard address). | If not specified in the job definition, it must be supplied in the plug-in properties file. |
REST URL | The Apache Spark server URL used to run REST API calls. It must have the following format: http://<SPARK_SERVER>:6066, where 6066 is the default port for REST API calls. | If not specified in the job definition, it must be supplied in the plug-in properties file. |
Resource Name | The full path to the .jar, .py, or .R file that contains the application code. | ✓ |
Resource Type | The type of resource specified in the Resource Name field. | |
Main Class | The entry point for your application. For example, org.apache.spark.examples.SparkPi. | ✓ |
Arguments | The arguments passed to the main method of your main class, if any. If more than one argument is present, use commas to separate the different arguments. | |
Application Name | The name of the application. | ✓ |
JAR | The full path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance, an hdfs path or a file path that is present on all nodes. | ✓ |
Deploy Mode | The deploy mode of the Apache Spark driver program. | |
Spark Master | The master URL for the cluster. For example, spark://23.195.26.187:7077. | ✓ |
Driver Cores | Number of cores to use for the driver process, only in cluster mode. | |
Driver Memory | Amount of memory in gigabytes to use for the driver process. | |
Executor Cores | The number of cores to use on each executor. It is ignored when Apache Spark runs in standalone mode: in this case, it takes the value of Driver Cores because the executor is launched within the driver JVM process. | |
Executor Memory | Amount of memory in gigabytes to use per executor process. It is ignored when Apache Spark runs in standalone mode: in this case, it takes the value of Driver Memory because the executor is launched within the driver JVM process. | ✓ |
Variable List | The list of variables with related values that you want to specify. Click the plus (+) sign to add one or more variables to the variable list. Click the minus (-) sign to remove one or more variables from the variable list. You can search for a variable in the list by specifying the variable name in the filter box. | |
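The URL formats and the comma-separated Arguments value described in the table can be sketched as follows. This is a minimal illustration, not part of the product: the function names `spark_urls` and `join_arguments` and the host name `spark01.example.com` are hypothetical, and the ports are only the ones quoted in the table.

```python
def spark_urls(server: str, dashboard_port: int = 8080, rest_port: int = 6066) -> dict:
    """Build the two connection values in the formats quoted in the table.

    The defaults (8080 for the dashboard, 6066 for REST calls) are the ports
    shown in the attribute descriptions; adjust them for your installation.
    """
    if not server:
        raise ValueError("server name must not be empty")
    return {
        "url": f"http://{server}:{dashboard_port}/json",
        "sparkurl": f"http://{server}:{rest_port}",
    }


def join_arguments(args) -> str:
    """Join application arguments with commas, as the Arguments field expects.

    Assumption: an argument that itself contains a comma cannot be represented,
    so it is rejected here instead of being silently split later.
    """
    values = [str(a) for a in args]
    for value in values:
        if "," in value:
            raise ValueError(f"argument contains a comma: {value!r}")
    return ",".join(values)


if __name__ == "__main__":
    print(spark_urls("spark01.example.com"))
    print(join_arguments(["100", "/tmp/out"]))
```

Rejecting embedded commas is a design choice of this sketch; the table does not say how the plug-in treats them.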
Scheduling and stopping a job in HCL Workload Automation
You schedule HCL Workload Automation Apache Spark jobs by defining them in job streams. Add the job to a job stream with all the necessary scheduling arguments and submit the job stream.
You can submit jobs by using the Dynamic Workload Console, Application Lab or the conman command line. See Scheduling and submitting jobs and job streams for information about how to schedule and submit jobs and job streams using the various interfaces.
After submission, when the job is running and is reported in EXEC status in HCL Workload Automation, you can stop it, if necessary, by using the kill command. This action also stops the program execution on the Apache Spark server.
Monitoring a job
If the HCL Workload Automation agent stops when you submit the Apache Spark job, or while the job is running, the job restarts automatically as soon as the agent restarts.
For information about how to monitor jobs using the different product interfaces available, see Monitoring HCL Workload Automation jobs.
ApacheSparkJobExecutor.properties
The properties file is generated automatically either when you perform a "Test Connection" from the job definition panels of the Dynamic Workload Console, or when you submit the job to run for the first time. After the file has been created, you can customize it. This is especially useful when you need to schedule several jobs of the same type: you can specify the values in the properties file and avoid having to provide information such as credentials for each job. You can override the values in the properties file by defining different values at job definition time.
url= http://<SPARK_SERVER>:8080/json
sparkurl= http://<SPARK_SERVER>:6066
drivercores=1
drivermemory=1
executorcores=1
executormemory=1
timeout=36000
The url and sparkurl properties must be specified either in this file or when creating the Apache Spark job definition in the Dynamic Workload Console. For more information, see the Dynamic Workload Console online help. The timeout property represents the time, in seconds, that HCL Workload Automation waits for a reply from the Apache Spark server. When the timeout expires with no reply, the job terminates with abend status. The timeout property can be specified only in the properties file.
For a description of each property, see the corresponding job attribute description in Table 1.
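The precedence rules described above (job definition values override the properties file, url and sparkurl must come from one of the two, and timeout comes only from the file) can be sketched as follows. This is an illustrative model under those stated rules, not the plug-in's actual implementation; `parse_properties`, `effective_settings`, and the host `spark01` are hypothetical names.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_settings(file_props: dict, job_values: dict) -> dict:
    """Merge the two sources; non-empty job definition values win."""
    merged = dict(file_props)
    for key, value in job_values.items():
        if key == "timeout":
            continue  # timeout can be specified only in the properties file
        if value not in (None, ""):
            merged[key] = value
    missing = [k for k in ("url", "sparkurl") if not merged.get(k)]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    return merged


SAMPLE = """\
url= http://spark01:8080/json
sparkurl= http://spark01:6066
drivercores=1
drivermemory=1
executorcores=1
executormemory=1
timeout=36000
"""

if __name__ == "__main__":
    settings = effective_settings(parse_properties(SAMPLE), {"drivermemory": "4"})
    print(settings)
```

In this example only drivermemory is overridden by the job definition; every other value, including timeout, is taken from the file.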
Job properties
While the job is running, you can track its status and analyze its properties. In particular, if the job contains variables, you can verify in the Extra Information section the value passed to each variable from the remote system. Some job streams use the variable passing feature: for example, the value of a variable specified in job 1, contained in job stream A, is required by job 2 to run in the same job stream.
To display the job properties from the command line, run:
conman sj <job_name>;props
where <job_name> is the Apache Spark job name. The properties are listed in the Extra Information section of the command output.
For information about passing job properties, see Passing job properties from one job to another in the same job stream instance.
<?xml version="1.0" encoding="UTF-8"?>
<jsdl:jobDefinition xmlns:jsdl="http://www.ibm.com/xmlns/prod/scheduling/1.0/jsdl"
xmlns:jsdlapachespark="http://www.ibm.com/xmlns/prod/scheduling/1.0/jsdlapachespark" name="APACHESPARK">
<jsdl:application name="apachespark">
<jsdlapachespark:apachespark>
<jsdlapachespark:ApacheSparkParameters>
<jsdlapachespark:Connection>
<jsdlapachespark:connectionInfo>
<jsdlapachespark:url>{url}</jsdlapachespark:url>
<jsdlapachespark:sparkurl>{sparkurl}</jsdlapachespark:sparkurl>
</jsdlapachespark:connectionInfo>
</jsdlapachespark:Connection>
<jsdlapachespark:Action>
<jsdlapachespark:ResourceProperties>
<jsdlapachespark:resourcename>{resourcename}</jsdlapachespark:resourcename>
<jsdlapachespark:resourcetype>{resourcetype}</jsdlapachespark:resourcetype>
<jsdlapachespark:mainclass>{mainclass}</jsdlapachespark:mainclass>
<jsdlapachespark:arguments>{arguments}</jsdlapachespark:arguments>
</jsdlapachespark:ResourceProperties>
<jsdlapachespark:SparkProperties>
<jsdlapachespark:appname>{appname}</jsdlapachespark:appname>
<jsdlapachespark:jars>{jars}</jsdlapachespark:jars>
<jsdlapachespark:deploymode>{deploymode}</jsdlapachespark:deploymode>
<jsdlapachespark:sparkmaster>{sparkmaster}</jsdlapachespark:sparkmaster>
<jsdlapachespark:drivercores>{drivercores}</jsdlapachespark:drivercores>
<jsdlapachespark:drivermemory>{drivermemory}</jsdlapachespark:drivermemory>
<jsdlapachespark:executorcores>{executorcores}</jsdlapachespark:executorcores>
<jsdlapachespark:executormemory>{executormemory}</jsdlapachespark:executormemory>
</jsdlapachespark:SparkProperties>
<jsdlapachespark:EnvVariables>
<jsdlapachespark:variablelistValues pairsList="jsdlapachespark:variablelistValue">
</jsdlapachespark:variablelistValues>
</jsdlapachespark:EnvVariables>
</jsdlapachespark:Action>
</jsdlapachespark:ApacheSparkParameters>
</jsdlapachespark:apachespark>
</jsdl:application>
</jsdl:jobDefinition>
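Assuming the {name} tokens in the job definition above are placeholders to be replaced with real values, the substitution could be sketched as follows. The helper `fill_template` and the host `spark01` are hypothetical; the fragment reuses only the connectionInfo element and namespace from the definition above to keep the example short.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template: str, values: dict) -> str:
    """Replace each {name} token with its XML-escaped value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {{{name}}}")
        # Escape &, < and > so the result stays well-formed XML.
        return escape(str(values[name]))
    return PLACEHOLDER.sub(substitute, template)


# Shortened fragment of the job definition, with the namespace declared
# locally so that it can be parsed on its own.
FRAGMENT = (
    '<jsdlapachespark:connectionInfo '
    'xmlns:jsdlapachespark="http://www.ibm.com/xmlns/prod/scheduling/1.0/jsdlapachespark">'
    '<jsdlapachespark:url>{url}</jsdlapachespark:url>'
    '<jsdlapachespark:sparkurl>{sparkurl}</jsdlapachespark:sparkurl>'
    '</jsdlapachespark:connectionInfo>'
)

if __name__ == "__main__":
    filled = fill_template(
        FRAGMENT,
        {"url": "http://spark01:8080/json", "sparkurl": "http://spark01:6066"},
    )
    root = ET.fromstring(filled)  # raises ParseError if not well-formed
    for child in root:
        print(child.tag, child.text)
```

Parsing the result with `ET.fromstring` is a cheap well-formedness check before the definition is handed to any other tool.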
Job log content
For information about how to display the job log from the various supported interfaces, see Analyzing the job log.
For example, you can see the job log content by running conman sj <job_name>;stdlist, where <job_name> is the Apache Spark job name.
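If you want to run the conman sj commands on this page from a script, the argument list could be built as in the following sketch. It assumes that conman accepts the whole command as a single quoted argument, which you should verify for your installation; the helper names `conman_showjobs` and `as_shell` and the job name SPARKJOB are hypothetical.

```python
import shlex

# The two views used on this page: job properties and job log.
VIEWS = ("props", "stdlist")


def conman_showjobs(job_name: str, view: str) -> list:
    """Return an argument list for: conman "sj <job_name>;<view>"."""
    if view not in VIEWS:
        raise ValueError(f"unsupported view: {view!r}")
    if not job_name:
        raise ValueError("job name must not be empty")
    # The command is kept as one argument so that the shell never
    # interprets the semicolon as a command separator.
    return ["conman", f"sj {job_name};{view}"]


def as_shell(argv: list) -> str:
    """Render the argument list as a safely quoted shell command line."""
    return shlex.join(argv)


if __name__ == "__main__":
    print(as_shell(conman_showjobs("SPARKJOB", "props")))
    print(as_shell(conman_showjobs("SPARKJOB", "stdlist")))
```

The list form can be passed directly to `subprocess.run` on a workstation where conman is available; the quoted string form is meant for pasting into a shell.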
See also
From the Dynamic Workload Console you can perform the same task as described in
For more information about how to create and edit scheduling objects, see