WMSX
Basic usage
There are two programs you will be using: the Provider and the Requestor.
Provider
Provider is the background process which does the actual work.
There can be only one provider running on a machine per user. Start with:
wmsx-provider.sh -v /tmp/myworkdir
where:
- -v : Specifies that you want debug output on the console. Leave this
  out once you feel comfortable.
- /tmp/myworkdir : Location of the working directory. This is where all
  the output goes. You may want to set this to a location inside your
  home directory.
Requestor
Requestor is an application to submit commands to the Provider system
for execution. Start with:
wmsx-requestor.sh -option
A complete list of options is available with -h. Some options are:
- -f : Check if the provider is running.
- -k : Kill the provider.
- -rememberafs : Will ask for your AFS password and renew AFS tokens
  until you use -forgetafs.
- -forgetafs : Forget the AFS password.
- -remembergrid : Will ask for your Grid password and renew the Grid
  tokens until there are no more managed jobs.
- -vo yourvo : Set the VO for all future job submissions to yourvo.
The possible job submission procedures are described in the next sections.
Traditional Submission of Single Jobs with Traditional JDL Files
Single jobs may be submitted and managed via the framework by using
traditional JDL files. A sample usage is:
wmsx-requestor.sh -j example.jdl -o /tmp/StdOut -r /tmp/resultDir
where the options are:
- -j : Name of the JDL file.
- -o : If the JDL file has JobType set to "Interactive", then StdOut /
  StdError will be retrieved while the job is running and stored in the
  given file.
- -r : When the job is done, results are retrieved and stored in the
  given directory.
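For reference, a minimal traditional JDL file could look like the
following sketch (the file names and executable are illustrative, and
the exact set of supported attributes depends on your Grid middleware):

```
Executable    = "myprog.sh";
Arguments     = "input.dat";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"myprog.sh", "input.dat"};
OutputSandbox = {"stdout.log", "stderr.log"};
```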
Automated Mass Submission of Jobs via ArgList Files
By using ArgList files, one can submit many independent jobs and handle
their outputs in a simple and efficient way. The ArgList file can
contain lines of the following format:

COMMAND parameters

where COMMAND refers to a job name, and parameters are the command line
arguments of the executable of your job.
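As an illustration, an ArgList file for a hypothetical job named mysim,
submitting three independent jobs with different arguments, could look
like this:

```
mysim 1 input1.dat
mysim 2 input2.dat
mysim 3 input3.dat
```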
To submit the jobs, use:
- -a args.file : Submit many jobs with the ArgList file args.file.
- -name runname : Give the name runname to this execution run
  (optional). It is used as a subdirectory under the working directory
  to distinguish between different ArgList submissions.
- -n nmaxjobs : Set the maximal number of concurrently running jobs to
  nmaxjobs (optional).
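Putting these options together, a mass submission could look like the
following (the run name and job limit are illustrative):

```
wmsx-requestor.sh -a args.file -name run1 -n 50
```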
The word COMMAND refers to a WMSX JDL file (not a traditional JDL
file), named COMMAND.jdl, which describes your job. If COMMAND.jdl is
not present, default job specifications are assumed.
WMSX JDL Files
For each COMMAND in the ArgList file, there may be a JDL file
COMMAND.jdl to customize the properties of your job. The structure of a
WMSX JDL file is similar to traditional JDL files; however, the
supported variables are only:
- Archive : Name of the program archive file. This is the name of the
  tarball containing your program. If not specified, defaults to
  "COMMAND.tar.gz". (Must be of tar-gz format!)
- ProgramDir : Name of the root directory inside the program archive
  file. The files and directories of your program archive are assumed
  to be under this directory. Setting it to "." means that they are not
  wrapped in a directory. If not specified, defaults to "COMMAND".
- Executable : Name of the executable to run inside the ProgramDir. If
  not specified, defaults to "COMMAND".
- OutputDirectory : The name of the directory under ProgramDir where
  the output of your program is written. This is the directory which is
  going to be retrieved by the framework as output. If not specified,
  defaults to "out".
- JobType : If this variable is set to "Interactive", the jobs will be
  run interactively in the sense that StdOut / StdError is retrieved
  on-the-fly, so you are able to see what your job is currently doing.
  If not set, the job is not interactive by default.
- Software : List of software that must be present (executable) on the
  target machine. The special key "AFS" checks for AFS presence. E.g.:
  Software = {"AFS", "g++"}; . If not set, defaults to empty.
- Requirements : Extra queue requirements, like in a traditional JDL
  file. If not set, defaults to empty.
If the JDL file is not present, the above default values are assumed
(i.e. the tarball has to have the name COMMAND.tar.gz etc.).
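For example, a WMSX JDL file mysim.jdl for a job named mysim, whose
tarball unpacks into a directory mysim/ containing an executable mysim
that writes its results into mysim/results/, could look like this (all
names are illustrative):

```
Archive         = "mysim.tar.gz";
ProgramDir      = "mysim";
Executable      = "mysim";
OutputDirectory = "results";
JobType         = "Interactive";
Software        = {"AFS", "g++"};
```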
In the following, things are more easily explained if the notion of
AbsCOMMAND is introduced: this is simply the full path to the
COMMAND.jdl file (or, if that is not present, to the COMMAND.tar.gz
file), without the ".jdl" extension (or without the ".tar.gz"
extension, if the JDL file is not present).
It is often useful to have pre-execution and post-execution scripts.
These may be used e.g. for preparing the input data files, or for
archiving output data files. If present, they have to be called
COMMAND_preexec and COMMAND_postexec, and they must be executable. They
will be run directly before submission and after the job output is
retrieved, respectively.
PreExec, if present, is automatically called by the framework with the
AbsCOMMAND as first argument, and with all the given arguments from the
ArgList as following arguments. PostExec, if present, is automatically
called by the framework with the AbsCOMMAND as first argument, the name
of the job output directory as second argument (automatically generated
by the framework), and with all the given arguments from the ArgList as
following arguments.
The retrieved outputs of your job will always be in the tarball
OutputDirectory.tar.gz under the job output directory, where
OutputDirectory is what you specified in the JDL file (out by default).
If PostExec returns 0, nothing further happens. If PostExec returns 1,
the user script COMMAND_chain is called, which can be used to launch
further jobs.
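As a sketch, a COMMAND_postexec deciding whether to chain could look
like the following. The "not_finished" marker file is an assumed
convention of the job itself (e.g. written when the internal time limit
was hit and the state was dumped), not a WMSX feature:

```shell
#!/bin/sh
# Sketch of a COMMAND_postexec script.
# Arguments: $1 = AbsCOMMAND, $2 = job output directory,
# remaining arguments = the ArgList arguments of the job.
needs_chaining() {
    outdir="$2"
    # Assumed convention: the job leaves a "not_finished" marker in
    # its output directory when it stopped before completing.
    if [ -e "$outdir/not_finished" ]; then
        return 1    # ask the framework to invoke COMMAND_chain
    fi
    return 0        # job finished; nothing further happens
}

# The script's exit status is the decision reported to the framework.
needs_chaining "$@"
```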
Job Chaining
The running time of Grid jobs is limited. This is unavoidable for
efficient control of resources. The time limit depends on the given
queue, but a typical value is three days. However, one often faces
computing problems where the total running time of the jobs cannot be
estimated a priori, or is estimated to be very long. For such cases,
job chaining is the solution: the program has to be split up into
shorter pieces with limited running time. The program has to be written
in such a way that its running time is limited internally (e.g. to one
day), and when this time limit is exceeded, its last state should be
dumped as output. The next copy of the program then has to be started
with this last state as input; thus, by such a chain of
limited-lifetime jobs, one can imitate arbitrarily long-running jobs.
The script COMMAND_chain is the tool to launch further jobs when
needed.
If the COMMAND_postexec script returns 1, the script COMMAND_chain is
invoked (it must be executable). In this case, if present, it is
automatically called by the framework with the AbsCOMMAND as first
argument, the name of the job output directory as second argument, and
with all the given arguments from the ArgList as following arguments.
The output of COMMAND_chain is interpreted by the framework as ArgList
lines, just as if they were lines from the initial ArgList file.
Therefore, a finished job can launch further jobs, depending on the
decision of the COMMAND_postexec script. This is called job chaining.
(The COMMAND_chain script may have multiple lines as output. Each line
is interpreted like a line from the ArgList file, so multiple jobs may
also be launched: the job chain may also fork.)
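To make this concrete, here is a sketch of a COMMAND_chain that
resubmits the same job with an incremented iteration counter. The
counter-as-first-argument convention and the job name mysim are
illustrative, not part of WMSX:

```shell
#!/bin/sh
# Sketch of a COMMAND_chain script.
# Arguments: $1 = AbsCOMMAND, $2 = job output directory,
# $3... = the original ArgList arguments of the job.
# Every line printed on stdout is treated as a new ArgList line.
emit_next_job() {
    # $1 (AbsCOMMAND) and $2 (output dir) are available if the next
    # arguments should depend on the previous output; unused here.
    iter="$3"
    next=$((iter + 1))
    # Printing several lines here would fork the chain into
    # several parallel jobs.
    echo "mysim $next"
}

emit_next_job "$@"
```

For instance, called with the original ArgList argument 3, the script
prints "mysim 4", which the framework submits as the next job in the
chain.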