WMSX
Basic usage
This is a program for mass job management on the Grid.
You will be using two programs: Provider and Requestor.
Provider
Provider is the background process which does the actual work. There can be only one provider running on a machine per user. Start with:
wmsx-provider.sh myworkdir [-v]
where:
-
myworkdir
- Location of the working directory. This is where all the logging and output goes. You may want to set this to a location inside your home directory.
-
-v
- Specifies that you want debug output on the console. Leave this out once you feel comfortable.
After the provider has started, the directories
log
,
debug
and
out
are created under the working directory. Logging information about the WMSX actions is written to a file under the
log
directory. The files
jobids.all
,
jobids.running
,
jobids.failed
and
jobids.done
under the working directory contain the unique job identifiers of the sent Grid jobs.
Note: On certain platforms, file and directory names given as command line arguments to
wmsx-provider.sh
or
wmsx-requestor.sh
are only recognized properly when specified as absolute paths.
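A minimal sketch of the note above: resolve the working directory to an absolute path before handing it to the provider. The directory name `wmsx-work` and the `WMSX_WORKDIR` override are hypothetical choices, and the provider call is guarded so that the snippet does nothing on a machine where `wmsx-provider.sh` is not on the `PATH`.

```shell
#!/bin/sh
# Hypothetical working directory inside the home directory
# (WMSX_WORKDIR is an illustrative override, not a WMSX setting).
workdir="${WMSX_WORKDIR:-${HOME:-/tmp}/wmsx-work}"
mkdir -p "$workdir"

# Turn it into an absolute, symlink-free path.
abs_workdir="$(cd "$workdir" && pwd -P)"
echo "Provider working directory: $abs_workdir"

# Start the provider with debug output, using the documented invocation.
if command -v wmsx-provider.sh >/dev/null 2>&1; then
    wmsx-provider.sh "$abs_workdir" -v
fi
```

The same approach applies to file names passed to `wmsx-requestor.sh`.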
Requestor
Requestor is an application to submit commands to the Provider system for execution. Start with:
wmsx-requestor.sh -option
A complete list of options is available with
-h
. Some options are:
-
-f
- Check if the provider is running.
-
-k
- Kill provider.
-
-backend yourbackend
- With this, you may specify the job submission backend. Here,
yourbackend
may be any of glitewms, edg, worker, local, fake, or gat. The default backend is glitewms. The backend is the system that actually takes care of your job submission. The backend concept is explained later.
-
-rememberafs
- Will ask for your AFS password and keep renewing your AFS tokens for as long as the provider is running. This only works if you keep the interactive terminal open (when you quit the interactive session, AFS automatically deletes your tokens).
-
-forgetafs
- Forget the AFS password.
-
-vo yourvo
- Set the VO for all future job submissions to
yourvo
.
-
-remembergrid
- Will ask for your Grid password and renew the Grid tokens until there are no more managed jobs. If your Grid key files (userkey.pem and usercert.pem, typically located under ~/.globus) are on AFS space, you should also use -rememberafs and should not quit the interactive terminal. If this is not feasible, copy the .globus directory to local disk space and point the environment variables X509_USER_CERT, X509_USER_KEY and X509_CERT_DIR to the absolute paths of usercert.pem, userkey.pem and the certificates directory, respectively. Also copy your ~/.glite directory to local disk space and point the environment variable GLITE_USER_HOME to the absolute path of that copy.
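The local-copy procedure for Grid credentials might be sketched as follows. This is an illustration under stated assumptions, not a prescribed setup: the target directory and the `LOCAL_CRED_DIR` override are hypothetical, the `certificates` directory is assumed to sit inside `.globus`, and `X509_USER_CERT` is the standard Globus name for the certificate variable.

```shell
#!/bin/sh
# Hypothetical private directory on local (non-AFS) disk.
local_cred="${LOCAL_CRED_DIR:-${TMPDIR:-/tmp}/wmsx-cred-$(id -u)}"
mkdir -p "$local_cred"
chmod 700 "$local_cred"   # the directory will hold a private key

# Copy the credential directories if they exist in the home directory.
for d in .globus .glite; do
    if [ -d "$HOME/$d" ]; then
        cp -Rp "$HOME/$d" "$local_cred/"
    fi
done

# Point the environment variables at the local copies (absolute paths).
# Assumption: the certificates directory is located inside .globus.
export X509_USER_CERT="$local_cred/.globus/usercert.pem"
export X509_USER_KEY="$local_cred/.globus/userkey.pem"
export X509_CERT_DIR="$local_cred/.globus/certificates"
export GLITE_USER_HOME="$local_cred/.glite"
```

The variables have to be set in the environment from which the provider is started.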
The possible job submission procedures are described in the next sections.
Submission of single jobs with traditional JDL files
Single jobs may be submitted and managed via the framework, by using traditional JDL files. A sample usage is:
wmsx-requestor.sh -j example.jdl -r resultDir [-o StdOutFile]
Where the options are:
-
-j
- Name of the JDL file.
-
-r
- When the job is done, results are retrieved and stored in the given directory.
-
-o
- If the JDL file has
JobType
set to "Interactive", then StdOut / StdError is retrieved while the job is running and stored in the given file.
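As an illustration, a single-job submission could look like the sketch below. The JDL content is a generic example built from common JDL attributes (`Executable`, `StdOutput`, `StdError`, `OutputSandbox`), not something prescribed by WMSX; the scratch directory and the `results` directory name are hypothetical. The submission itself is only attempted if `wmsx-requestor.sh` is found on the `PATH`.

```shell
#!/bin/sh
# Work in a scratch directory so that nothing else is overwritten.
jobdir="$(mktemp -d)"
cd "$jobdir" || exit 1

# A small traditional JDL file: run /bin/hostname and fetch its output.
cat > example.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF

# Submit through the requestor; absolute paths are used for both arguments.
if command -v wmsx-requestor.sh >/dev/null 2>&1; then
    wmsx-requestor.sh -j "$jobdir/example.jdl" -r "$jobdir/results"
fi
```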
Automated mass submission of jobs via ArgList files
By using ArgList files, one can submit many independent jobs and handle their outputs in a simple and efficient way.
The ArgList file can contain lines of the following format:
COMMAND parameters
where
COMMAND
refers to a job name, and
parameters
are the command line arguments of the executable of your job.
To submit the jobs, use:
-
-a args.file
- Submit many jobs with the ArgList file, named
args.file
.
-
-name runname
- Give the name
runname
to this execution run (optional). It is used as a subdirectory under the working directory of the provider to distinguish between different ArgList submissions.
-
-n nmaxjobs
- Set the maximal number of concurrently running jobs to
nmaxjobs
(optional).
The word
COMMAND
refers to a WMSX JDL file (not a traditional JDL file), named
COMMAND.wjdl
, which describes your job. If the
COMMAND.wjdl
is not present, default job specifications are assumed; these are described below.
The outputs of the jobs are written into generated directories under the
out
directory of the specified working directory of the provider.
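A hedged sketch of an ArgList submission follows. The job names `simulate` and `analyse`, their arguments and the run name `testrun` are hypothetical; combining `-a`, `-name` and `-n` in a single requestor call is an assumption based on the option descriptions above.

```shell
#!/bin/sh
rundir="$(mktemp -d)"
cd "$rundir" || exit 1

# Each line: a job name (COMMAND) followed by the arguments of its executable.
cat > args.file <<'EOF'
simulate --seed 1 --events 1000
simulate --seed 2 --events 1000
analyse run1.dat
EOF

# List the distinct COMMAND names; each one may have a COMMAND.wjdl file.
commands="$(awk '{print $1}' args.file | sort -u)"
echo "$commands"

# Submit all lines as one run, with at most 2 jobs running concurrently.
if command -v wmsx-requestor.sh >/dev/null 2>&1; then
    wmsx-requestor.sh -a "$rundir/args.file" -name testrun -n 2
fi
```

Here three jobs are submitted: two instances of `simulate` with different arguments and one instance of `analyse`.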
WMSX JDL files
For each
COMMAND
in the ArgList file, there may be a WMSX JDL file
COMMAND.wjdl
, to customize the properties of your job.
The structure of a WMSX JDL file is similar to that of traditional JDL files, but only the following variables are supported:
-
JobType
- If this variable is set to
"Interactive"
, the job is run interactively, in the sense that StdOut / StdError is retrieved on the fly, so you can see what your job is currently doing. If not set, the job is not interactive.
-
Archive
- Name of the program archive file, i.e. the tarball containing your program. If not specified, defaults to
"COMMAND.tar.gz"
. The archive must be in tar.gz format.
-
ProgramDir
- Name of the root directory inside the program archive file. The files and directories of your program archive are assumed to be under this directory, and their paths are assumed to be given relative to this directory. Setting this to
"."
means that the content of your tarball is not wrapped in a directory. If not specified, defaults to "COMMAND".
-
Executable
- Name of the executable to run inside the
ProgramDir
. If not specified, defaults to "COMMAND".
-
OutputDirectory
- The name of the directory under
ProgramDir
, where the output of your program is written. This is the directory that the framework retrieves as output. If not specified, defaults to "out".
-
Software
- List of software that must be present (executable) on the target machine. The special key
"AFS"
requires AFS and checks that it is present. Example: Software = {"AFS", "g++"}; If not set, defaults to empty.
-
Requirements
- Extra queue requirements, like in a traditional JDL file. If not set, defaults to empty.
If the WMSX JDL file is not present, the above default values are assumed (i.e. the tarball has to have the name
COMMAND.tar.gz
etc.).
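The following sketch builds a program archive and a matching WMSX JDL file for a hypothetical COMMAND called `myjob`. The script name `run.sh` and its content are illustrative only. `Archive`, `ProgramDir` and `OutputDirectory` are spelled out even though they equal the defaults, so only `Executable` actually deviates from them.

```shell
#!/bin/sh
pkgdir="$(mktemp -d)"
cd "$pkgdir" || exit 1

# Program directory with an executable that writes its results into ./out.
mkdir -p myjob
cat > myjob/run.sh <<'EOF'
#!/bin/sh
# Work relative to the program directory, wherever the job is started from.
cd "$(dirname "$0")" || exit 1
mkdir -p out
echo "arguments: $*" > out/result.txt
EOF
chmod +x myjob/run.sh

# Pack it so that "myjob" is the root directory inside the archive.
tar -czf myjob.tar.gz myjob

# WMSX JDL file for the COMMAND "myjob".
cat > myjob.wjdl <<'EOF'
Archive         = "myjob.tar.gz";
ProgramDir      = "myjob";
Executable      = "run.sh";
OutputDirectory = "out";
EOF
```

An ArgList line such as `myjob 1 2 3` would then refer to this `myjob.wjdl`.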
In what follows, it is convenient to introduce the notion of
AbsCOMMAND
: this is simply the full path to the
COMMAND.wjdl
file (or, if that is not present, to the
COMMAND.tar.gz
file), without the ".wjdl" (or ".tar.gz") extension.
Pre-execution and post-execution scripts
It is often useful to have pre-execution and post-execution scripts, for example to prepare input data files or to archive output data files. If present, they have to be called
COMMAND_preexec
and
COMMAND_postexec
. They must be executable. They will be run directly before submission and after job output is retrieved, respectively.
COMMAND_preexec
, if present, is automatically called by the framework with the
AbsCOMMAND
as the first argument, the name of the job output directory (automatically generated by the framework) as the second argument, and all the arguments given in the ArgList line as the remaining arguments.
COMMAND_postexec
, if present, is automatically called by the framework with the
AbsCOMMAND
as the first argument, the name of the job output directory (automatically generated by the framework) as the second argument, and all the arguments given in the ArgList line as the remaining arguments.
The retrieved outputs of your job will always be in the tarball
out.tar.gz
under the generated job output directory.
If
COMMAND_preexec
returns 0, nothing further happens and the Grid job is submitted. If it returns 1, the Grid job is not launched. This can be used to decide whether a job should be submitted at all.
If
COMMAND_postexec
returns 0, nothing further happens. If it returns 1, the script
COMMAND_chain
is called, if present, which can be used to launch further jobs.
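A minimal sketch of such a script pair for a hypothetical COMMAND `myjob` is given below. It assumes that the first ArgList argument names an input file; that convention is an illustration, not part of WMSX. The last lines emulate how the framework would call the pre-execution script, using `sh` so the demonstration does not depend on the mount options of the scratch directory.

```shell
#!/bin/sh
scriptdir="$(mktemp -d)"
cd "$scriptdir" || exit 1

# myjob_preexec: $1 = AbsCOMMAND, $2 = job output directory,
# remaining arguments = the arguments from the ArgList line.
cat > myjob_preexec <<'EOF'
#!/bin/sh
abs_command="$1"
shift 2
input_file="$1"
if [ -f "$input_file" ]; then
    exit 0    # input present: let the Grid job be submitted
fi
echo "skipping $abs_command: missing input $input_file" >&2
exit 1        # input missing: the Grid job is not launched
EOF

# myjob_postexec: unpack the retrieved tarball inside the job output directory.
cat > myjob_postexec <<'EOF'
#!/bin/sh
outdir="$2"
if [ -f "$outdir/out.tar.gz" ]; then
    tar -xzf "$outdir/out.tar.gz" -C "$outdir"
fi
exit 0        # nothing further to do, no chaining requested
EOF
chmod +x myjob_preexec myjob_postexec

# Emulate the framework's call for an input file that does not exist.
sh ./myjob_preexec "$scriptdir/myjob" "$scriptdir/jobout" no-such-input.dat 2>/dev/null
preexec_status=$?
echo "preexec returned $preexec_status"
```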
Job chaining
The running time of Grid jobs is limited, which is unavoidable for efficient control of resources. The time limit depends on the queue, but a typical value is three days. However, one often faces computing problems where the total running time cannot be estimated a priori, or is estimated to be very long. Job chaining is the solution for such cases: the program is split into shorter consecutive pieces, each with a limited running time. The program has to be written so that it limits its own running time internally (e.g. to one day) and, when this limit is reached, dumps its last state as output. The next copy of the program is then started with this last state as input. With such a chain of limited-lifetime jobs, one can imitate jobs of arbitrarily long lifetime. The script
COMMAND_chain
is the tool for launching further jobs when needed.
If the
COMMAND_postexec
script returns 1, the script
COMMAND_chain
is invoked, if present (it must be executable). It is automatically called by the framework with the
AbsCOMMAND
as the first argument, the name of the job output directory as the second argument, and all the arguments given in the ArgList line as the remaining arguments.
The output of
COMMAND_chain
is interpreted by the framework as ArgList lines, just as if they were lines from the initial ArgList file. Therefore, a finished job can use it to launch further jobs, depending on the decision of the
COMMAND_postexec
script. This is called job chaining. (The
COMMAND_chain
may output multiple lines. Each line is interpreted like a line from the ArgList file, so multiple jobs may be launched: the job chain may also fork.)
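The chaining mechanism can be illustrated with the sketch below, again for a hypothetical COMMAND `myjob`. The file name `state.dat`, the `--resume` argument and the layout of `out.tar.gz` are assumptions made for the example. Whether the framework expects a bare job name or a path as the first word of an emitted ArgList line is not stated in this document; the sketch emits the bare name. The last part emulates the framework's decision logic.

```shell
#!/bin/sh
chaindir="$(mktemp -d)"
cd "$chaindir" || exit 1

# myjob_postexec: return 1 (request chaining) if the retrieved tarball
# contains a saved state, otherwise return 0.
cat > myjob_postexec <<'EOF'
#!/bin/sh
outdir="$2"
if tar -tzf "$outdir/out.tar.gz" 2>/dev/null | grep -q 'state\.dat$'; then
    exit 1
fi
exit 0
EOF

# myjob_chain: print one ArgList line that restarts the job from the
# output of the previous link of the chain.
cat > myjob_chain <<'EOF'
#!/bin/sh
abs_command="$1"
outdir="$2"
echo "$(basename "$abs_command") --resume $outdir/out.tar.gz"
EOF
chmod +x myjob_postexec myjob_chain

# Emulate a finished job whose output tarball contains out/state.dat.
mkdir -p jobout/out
touch jobout/out/state.dat
tar -czf jobout/out.tar.gz -C jobout out

sh ./myjob_postexec "$chaindir/myjob" "$chaindir/jobout" --steps 100
postexec_status=$?
if [ "$postexec_status" -eq 1 ]; then
    next_line="$(sh ./myjob_chain "$chaindir/myjob" "$chaindir/jobout" --steps 100)"
    echo "$next_line"
fi
```

Printing several lines from `myjob_chain` would make the chain fork into several follow-up jobs.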
The backend concept
The backend is the system that actually takes care of your job submission. This section needs expansion...
Graphical User Interface
A simple graphical interface is provided to help monitor the job flow. Start with:
wmsx-gui.sh
--
AndrasLaszlo - 10 Jul 2009