NEW POLICY FOR USERS' LSF BATCH JOBS AT CNAF


- A RECIPE TO RUN BATCH JOBS WITH SATISFACTION -



Introduction

Since mid-December 2005, LSF batch job submission has been subject to strict rules in order to avoid any improper load on the NFS server, which would result not only in less efficient jobs (namely, longer execution times) but also in a less satisfactory usage of the front-end machine by the whole community of BaBar users.

Moreover, these rules make it possible to carry out the NFS scalability tests performed in order to progressively increase the maximum number of jobs per queue and per user; higher limits mean that users' job production will be completed faster.

In the new farm configuration, the scratch area and the AWG areas (with the exclusion of the area AWG/Xrootd/work/users, dedicated to collections of reskimmed data) will be NFS-mounted by every batch worker-node in READ-ONLY mode (with the exclusion of the worker-nodes serving the babar_build queue). This is made possible by the fact that LSF uses its own file-transfer protocol, different from NFS.

Since reskimming activity may stress the NFS server, it is recommended not to run more than one hundred reskimming batch jobs at the same time.
To let us study the effect of reskimming activity on the NFS server, please let us know when you have submitted (or are running) reskimming jobs.
Thank you.

Recipe Topics:

- Job output
- Logfiles
- Tcl files
- Important Note


Job output

With "job output" I primarily mean here the '.root' or '.hbook' files produced by the batch job.
Tipically these wide-size files are written in the scratch area or in the area of a user within their AWG area.
Since these areas are now no more writable via NFS, the user should:

1) have LSF write the job output, while the job is running, to the local worker-node disk (in the "/data" directory of the worker-node);

2) have LSF transfer the job output, at the end of the job, to the chosen destination (either the scratch area or the AWG area).

This is obtained, correspondingly, in the following two steps:

1) set, somewhere in one of your tcl files, the path where the job output should be written; for instance:

set histFileName /data/$env(NAMEHBK).${BetaMiniTuple}

where $env(NAMEHBK).${BetaMiniTuple} represents the standard way to build the name of the job output file.

2) pass the "-f" option on the 'bsub' command line to tell LSF where the job output should be copied (with the operator "<") at the end of the job; for instance, if the final destination is the scratch area:

bsub [...]
-f "$BFROOT/work/u/user/analysis-30/$NAMEHBK.hbook < /data/$NAMEHBK.hbook"
[...]

In the past, users would typically just put the following line in one of their tcl files:
set histFileName $env(BFROOT)/work/u/user/$env(NAMEHBK).${BetaMiniTuple}
without using the "-f" option on the 'bsub' command line.
This is now wrong and no longer possible: any batch job configured in this way will crash (except reskimming jobs).
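
Putting the two steps together, a complete submission could look like the following sketch; the queue name ('<queue-name>'), the executable ('BetaMiniApp') and the tcl file name ('mini.tcl') are placeholders to be adapted to your own case, and NAMEHBK is assumed to be set by your submission script (LSF propagates it to the job's environment, where the tcl line of step 1 picks it up):

setenv NAMEHBK analysis-30-job001
bsub -q <queue-name> \
     -f "$BFROOT/work/u/user/analysis-30/$NAMEHBK.hbook < /data/$NAMEHBK.hbook" \
     BetaMiniApp mini.tcl

where 'mini.tcl' contains the histFileName line of step 1.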


Logfiles

With "logfile" I mean here:
- the standard output file (stdout) obtained by the 'bsub' option "-o";
- both standard output and standard error (stderr) files obtained correspondingly by 'bsub' options "-o" and "-e".

By default these two logfiles are written, at the end of the job, via NFS to the Current Working Directory (CWD), that is, the directory from which the user executed the 'bsub' command (typically the "workdir").

If you want to collect these files in a subdirectory of the "workdir" (e.g. one called "logfiles"), you should write:

bsub [...]
-o /home/BABAR/user/XXX/workdir/logfiles/stdout-filename
-e /home/BABAR/user/XXX/workdir/logfiles/stderr-filename
[...]

In the same way you may specify any other subdirectory of the user's "/home".

Typical productions are huge, and their logfiles may take up a significant fraction of the 1.5 GB-wide "/home". It is therefore very likely that a user will want instead to store the logfiles (temporarily) in the scratch area or, better, in their own space within the AWG area.
To obtain this you need:

1) to have LSF write the logfiles locally on the worker-node (in the "/data" directory), and...
2) to have LSF redirect them, at the end of the job, to the chosen destination:

bsub [...]
-o /data/stdout-filename -e /data/stderr-filename
-f "$BFROOT/work/AWG/charm/charmuser/stdout-filename < /data/stdout-filename"
-f "$BFROOT/work/AWG/charm/charmuser/stderr-filename < /data/stderr-filename"

[...]


Tcl files

Typically '.tcl' files reside somewhere in a subdirectory of the "workdir"; however, this is not a rule.
If a user prefers to keep them in their own AWG area, then:

1) they need to set:

setenv BDBINPUTFILE /data/filename.tcl

somewhere in the submission script before the line with the 'bsub' command;

2) one of the 'bsub' command's options must be:

-f "$BFROOT/work/AWG/charm/charmuser/filename.tcl > /data/filename.tcl"

Note that the '.tcl' file is copied onto the worker-node with the operator ">" (not "<"), since it must be transferred to the worker-node before the job starts.
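
For reference (this summarizes standard LSF behaviour rather than anything specific to this recipe), the general form of the option is -f "local_file operator [remote_file]", and the two operators used above work in opposite directions:

-f "local_file > remote_file"   copies local_file TO the worker-node before the job starts
-f "local_file < remote_file"   copies remote_file back FROM the worker-node after the job ends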

There may be slightly more elegant solutions but the previous one works and is rather easy.
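
As a final illustration, the three sections above can be combined into a single submission. In this sketch the 'charm' AWG paths are those of the previous examples, while the queue name and the executable 'BetaMiniApp' are placeholders to be adapted to your own case:

setenv BDBINPUTFILE /data/filename.tcl
bsub -q <queue-name> \
     -o /data/stdout-filename -e /data/stderr-filename \
     -f "$BFROOT/work/AWG/charm/charmuser/filename.tcl > /data/filename.tcl" \
     -f "$BFROOT/work/AWG/charm/charmuser/$NAMEHBK.hbook < /data/$NAMEHBK.hbook" \
     -f "$BFROOT/work/AWG/charm/charmuser/stdout-filename < /data/stdout-filename" \
     -f "$BFROOT/work/AWG/charm/charmuser/stderr-filename < /data/stderr-filename" \
     BetaMiniApp

Here the job reads its tcl file from /data (via BDBINPUTFILE), writes its output and logfiles to /data, and LSF copies everything back to the AWG area when the job ends.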


Important Note

Users do not need to worry about deleting their jobs' inputs and outputs: the CNAF LSF administrators have implemented an automatic deletion, on the "/data" partition of each worker-node, of files older than one week.
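
Purely as an illustration of what this policy amounts to (this is not necessarily the administrators' actual script), such a cleanup corresponds to periodically running, on each worker-node, something like:

find /data -type f -mtime +7 -exec rm -f {} \;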

Nevertheless, it is recommended that each user tag the names of the following files with something peculiar/specific, so as to avoid any confusion with other users' job files:

- logfile and errlogfile (if not directed to your "/home")
- output file ('.root' or '.hbook' file)
- tcl file (if not present in your "/home")

For this last purpose of avoiding any possible confusion, more skilled users may want to adopt a slightly more elegant solution which involves the '$LSB_JOBID' variable, as sketched below.
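
For instance (a sketch, not an official part of the recipe): LSF defines the environment variable $LSB_JOBID inside every running job and, in the file names given to the '-o'/'-e' options, replaces the token %J with the job ID, so that one can write:

bsub [...]
-o /home/BABAR/user/XXX/workdir/logfiles/stdout-%J
-e /home/BABAR/user/XXX/workdir/logfiles/stderr-%J
[...]

or, inside a tcl file:

set histFileName /data/$env(NAMEHBK)-$env(LSB_JOBID).${BetaMiniTuple}

Keep in mind that in the latter case the job ID is not known at submission time, so the matching '-f' copy-out option cannot be written down trivially; this is why the trick is left to more skilled users.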



This is a preliminary version of the recipe. For any comment/suggestion/correction please write to: alexis.pompili@cnaf.infn.it

Last update: 23-Jan-2006