Special Algorithms and Functionalities
The pilot executes several special algorithms for different tasks, described in the sections below.
The checksum of a file is calculated after stage-in or before stage-out and is then verified against a known value. The Adler32 checksum algorithm is implemented in the pilot since it is normally not available as a command on the worker nodes. The algorithm is standard, with the addition that the pilot makes sure that the returned string is always eight characters long, i.e. it pads the leading part with zeros (e.g. '3d' -> '0000003d').
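As an illustration of the zero-padding described above, here is a minimal sketch that uses Python's standard zlib module. It is not the pilot's own code: the function name adler32_hex and the block size are assumptions made for this example.

```python
import zlib


def adler32_hex(path, blocksize=2**20):
    """Return the Adler-32 checksum of a file as an 8-character hex string.

    Hypothetical helper for illustration; name and block size are assumptions.
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        # Read the file in chunks so that large files do not fill the memory
        for chunk in iter(lambda: handle.read(blocksize), b""):
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits and left-pad with zeros, e.g. '3d' -> '0000003d'
    return f"{value & 0xFFFFFFFF:08x}"
```

The format specification 08x performs the padding: a raw value of 0x3d is rendered as '0000003d'.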
The current CPU consumption time (system + user time) for a given process is calculated on the fly by looping over all of its child processes. After all child processes have been identified, the corresponding /proc/<pid>/stat files are parsed, and the utime, stime, cutime and cstime values are obtained by dividing the relevant fields from the stat files by the os.sysconf(os.sysconf_names['SC_CLK_TCK']) value. The CPU consumption time for each subprocess is the sum of these four values, and the CPU consumption time for the given process is the sum of the subprocess CPU consumption times.
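The calculation can be sketched as follows on Linux. This is an illustration, not the pilot's implementation: all function names are hypothetical, and finding child processes by scanning the parent PID field of every /proc/<pid>/stat file is an assumption, since the text does not say how the pilot identifies them.

```python
import os


def parse_stat_cpu_seconds(stat_line, clock_ticks):
    """Return utime + stime + cutime + cstime in seconds for one stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split after the last ')' before splitting on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime..cstime are fields 14..17.
    utime, stime, cutime, cstime = (int(value) for value in fields[11:15])
    return (utime + stime + cutime + cstime) / clock_ticks


def read_stat(pid):
    with open(f"/proc/{pid}/stat") as handle:
        return handle.read()


def descendants(root_pid):
    """Return the PIDs of all child processes of root_pid, recursively."""
    children = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            fields = read_stat(name).rsplit(")", 1)[1].split()
        except (OSError, IndexError):
            continue  # the process ended while /proc was being scanned
        children.setdefault(int(fields[1]), []).append(int(name))  # field 4: ppid
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def cpu_consumption_time(pid):
    """Sum the CPU consumption times of all subprocesses of pid, in seconds."""
    clock_ticks = os.sysconf(os.sysconf_names["SC_CLK_TCK"])
    total = 0.0
    for child in descendants(pid):
        try:
            total += parse_stat_cpu_seconds(read_stat(child), clock_ticks)
        except OSError:
            pass  # the child ended before its stat file could be read
    return total
```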
See also the Timing measurements section.
A job is considered to be looping if it has not updated any files in the work directory within the specified time. The pilot uses an internal time limit of 2h for both user analysis and production jobs. The mechanism can be turned off by using the noLoopingCheck task parameter (forwarded to the Pilot as loopingCheck=False). The internal limit can be changed in pilot/util/default.cfg.
To find the most recently touched files, the following command is executed once every 15 minutes (the interval is also configurable in the Pilot config file):
find <workdir> -mmin -<limit>
where <limit> is the time limit divided by 60, i.e. converted to minutes as expected by the -mmin option.
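The check can be approximated in pure Python as shown below. This is a sketch under assumptions: the function names are hypothetical, the limit is taken to be given in seconds, and only regular files are inspected, whereas the find command above also reports directories.

```python
import os
import time


def find_command(workdir, limit_seconds):
    """Build the find command line, converting the limit to minutes."""
    return f"find {workdir} -mmin -{limit_seconds // 60}"


def recently_touched_files(workdir, limit_seconds, now=None):
    """Return the files under workdir modified within the last limit_seconds."""
    now = time.time() if now is None else now
    cutoff = now - limit_seconds
    touched = []
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    touched.append(path)
            except OSError:
                continue  # the file disappeared during the scan
    return touched


def is_looping(workdir, limit_seconds=2 * 3600, now=None):
    """A job counts as looping if no file was touched within the limit."""
    return not recently_touched_files(workdir, limit_seconds, now)
```

With the 2h limit mentioned above, find_command("<workdir>", 7200) yields a command ending in -mmin -120.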
A troublesome job can be debugged live by turning on the special debug mode in the prodtask-dev page. The instruction is delivered to the pilot via the job update backchannel (i.e. in the return dictionary after an updateJob call). In this case, the pilot changes the frequency of updateJob calls to one every five minutes, and includes the tail of the most recently updated non-binary file found in the working directory. The uploaded tail is then made visible in the corresponding PanDA monitor job page.
The pilot monitors the payload for possible memory leaks, provided it has access to memory values returned by an external memory monitor tool. ATLAS currently uses the prmon tool. The pilot fits the PSS+SWAP values versus time with a straight line, and the slope gives a measure of the leakage rate. Tails are removed before the fit, and the Chi2 of the fit is also calculated.
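A straight-line fit of this kind can be sketched with an ordinary least-squares calculation. The example below is an illustration under assumptions, not the pilot's algorithm: the function name and the trim parameter are hypothetical, the tails are assumed to be a fixed number of points dropped at each end, and Chi2 is taken to be the unweighted sum of squared residuals.

```python
def fit_leak_rate(times, values, trim=0):
    """Fit values = intercept + slope * time and return (slope, intercept, chi2).

    trim is a hypothetical parameter: the number of points dropped at each end.
    """
    if trim:
        times, values = times[trim:-trim], values[trim:-trim]
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time values are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx  # leakage rate, in value units per time unit
    intercept = mean_v - slope * mean_t
    # Unweighted sum of squared residuals around the fitted line
    chi2 = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, values))
    return slope, intercept, chi2
```

A slope close to zero indicates stable memory usage, while a persistently positive slope points to a possible leak.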
If the StageOutClient fails to stage out a file, the pilot can retry the transfer at a secondary RSE. The mechanism is implemented in the function _do_stageout() in the pilot/control/data.py module.
This function orchestrates the stage-out process, including a fallback mechanism known as "alternative stage-out", which allows the pilot to attempt transferring output files to a secondary storage element (RSE) if the transfer to the primary one fails.
The alternative stage-out logic is as follows:
- Initial Transfer Attempt: The function first calls client.transfer() to attempt uploading all files to their primary destination RSEs. Crucially, if alternative stage-out is enabled for the job (job.allow_altstageout() is True), the transfer method is called with raise_exception=False. This ensures that if a transfer fails, the function does not immediately exit but continues execution, allowing for a retry.
- Failure Detection: After the initial attempt, the code checks for any files that were not successfully transferred by creating a remain_files list.
- Conditions for Retry: An alternative stage-out is attempted only if all of the following conditions are met:
  a. The job is configured to allow alternative stage-out (altstageout is True).
  b. There are files remaining that failed the first transfer attempt (remain_files is not empty).
  c. Every file that failed has an alternative destination defined (has_altstorage is True). The alternative destination is stored in the ddmendpoint_alt attribute of the file spec.
- Executing the Alternative Transfer: If the conditions are met, the function iterates through the list of failed files. For each file, it swaps the primary destination (entry.ddmendpoint) with the alternative one (entry.ddmendpoint_alt).
- Second Transfer Attempt: The client.transfer() method is called a second time. The StageOutClient will now attempt to transfer the previously failed files to their newly assigned alternative destinations. Files that were successfully transferred in the first attempt are ignored.
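The control flow of these steps can be sketched as follows. This is a simplified illustration, not the code in pilot/control/data.py: FileSpec and StubClient are hypothetical stand-ins for the pilot's file spec and StageOutClient, the 'transferred' status value is an assumption, and passing only the failed files to the second call with raise_exception=False is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class FileSpec:
    """Hypothetical, reduced file spec with only the attributes used here."""
    lfn: str
    ddmendpoint: str
    ddmendpoint_alt: str = ""
    status: str = ""


class StubClient:
    """Hypothetical client: transfers to RSEs listed as broken always fail."""

    def __init__(self, broken_rses):
        self.broken_rses = set(broken_rses)

    def transfer(self, files, raise_exception=True):
        failed = []
        for entry in files:
            if entry.ddmendpoint in self.broken_rses:
                entry.status = "failed"
                failed.append(entry.lfn)
            else:
                entry.status = "transferred"
        if failed and raise_exception:
            raise RuntimeError(f"stage-out failed for: {failed}")


def stageout_with_fallback(client, files, altstageout):
    """Try the primary RSEs first, then the alternative ones if allowed."""
    # Initial attempt: do not raise if a retry is still possible
    client.transfer(files, raise_exception=not altstageout)

    # Failure detection: collect everything that was not transferred
    remain_files = [entry for entry in files if entry.status != "transferred"]

    # Retry only if every failed file has an alternative destination
    has_altstorage = all(entry.ddmendpoint_alt for entry in remain_files)
    if altstageout and remain_files and has_altstorage:
        for entry in remain_files:
            # Swap the primary and alternative destinations
            entry.ddmendpoint, entry.ddmendpoint_alt = (
                entry.ddmendpoint_alt,
                entry.ddmendpoint,
            )
        # Second attempt, limited to the previously failed files
        client.transfer(remain_files, raise_exception=False)

    # Whatever is still not transferred has failed for good
    return [entry for entry in files if entry.status != "transferred"]
```

In this sketch, a file whose primary RSE is unavailable ends up with its ddmendpoint set to the former alternative RSE, while files that succeeded at the first attempt are left untouched.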