Running analysis pipelines on the EGEE Grid

Objectives

The aim of this note is to find a simple, lightweight solution for running LIGO/Virgo analysis pipelines on the European Grid.

For the impatient...

A straightforward solution is to submit a pool of pilot jobs to the EGEE Grid. Inside this pilot job pool one starts up and maintains a Condor metacluster. From a Condor submit machine pointed at this metacluster the pipelines can be submitted exactly as they are now submitted to the dedicated clusters...

Introduction

Current status of pipeline running at Virgo, LIGO

Currently, hierarchical execution pipelines are expressed either as Condor-specific DAGs or as batch-system-independent workflow descriptions, DAXs (DAG XML). These are executed on dedicated Condor clusters such as Atlas, Nemo, etc. At present these clusters are separate, but in principle they could be connected using Condor's flocking feature or Condor-C.
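
To make the above concrete, here is a minimal sketch (in Python, with made-up node names and trivial executables, not the real pipeline contents) of how such a Condor DAG is written out and handed to DAGMan on a submit machine:

import subprocess
import textwrap

# A toy two-stage pipeline: a "datafind" node feeding an "inspiral" node.
# The node names and executables are placeholders.
dag = textwrap.dedent("""\
    JOB  datafind  datafind.sub
    JOB  inspiral  inspiral.sub
    PARENT datafind CHILD inspiral
""")

submit_template = textwrap.dedent("""\
    universe   = vanilla
    executable = {exe}
    output     = {name}.out
    error      = {name}.err
    log        = pipeline.log
    queue
""")

with open("pipeline.dag", "w") as f:
    f.write(dag)
for name, exe in [("datafind", "/bin/true"), ("inspiral", "/bin/true")]:
    with open(name + ".sub", "w") as f:
        f.write(submit_template.format(name=name, exe=exe))

# condor_submit_dag hands the DAG to the DAGMan meta-scheduler on the submit machine.
subprocess.run(["condor_submit_dag", "pipeline.dag"], check=True)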

The EGEE Grid

The EGEE Grid is a set of heterogeneous computer clusters using various batch systems (Torque, SGE, LSF, Condor, BQS), each of which is front-ended by a Grid Computing Element (CE) whose task is to implement a uniform communication protocol among the various computer centres. Jobs are submitted from a Grid User Interface machine (UI), scheduled by the Workload Management System (WMS) and, once received by one of the CEs, executed on a Worker Node (WN) belonging to the CE's local cluster.
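
For illustration, a sketch of how a single job is handed to the WMS from a gLite UI; the executable and file names are placeholders, and a valid proxy plus a configured WMS endpoint are assumed:

import subprocess
import textwrap

# Classic gLite JDL attributes; run_payload.sh is a placeholder script.
jdl = textwrap.dedent("""\
    Executable    = "run_payload.sh";
    StdOutput     = "payload.out";
    StdError      = "payload.err";
    InputSandbox  = {"run_payload.sh"};
    OutputSandbox = {"payload.out", "payload.err"};
""")

with open("payload.jdl", "w") as f:
    f.write(jdl)

# The WMS matches the job to a suitable CE, which schedules it on one of its WNs.
subprocess.run(
    ["glite-wms-job-submit", "-a", "-o", "job_ids.txt", "payload.jdl"],
    check=True,
)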

EGEE is interoperable with OSG, NorduGrid and many other Grids.

It is not straightforward to execute Condor DAGs or DAXs in such a diverse computing environment without imposing extra difficulties. However, several solutions exist, and the choice depends on the nature of the specific problem.

Approach A

The first approach is to use the Globus GRAM service, which runs on each CE. A Condor-G service can communicate with Globus GRAM and understands DAX workflows, so a pipeline for which a DAX description is available can be submitted this way (a sketch of such a Condor-G submission follows the list below).
  • Advantages
    • Relatively simple to implement
  • Drawbacks
    • Globus GRAM will not be supported by the next generation of CEs, the CREAM CEs.
    • The DAX jobs will be executed only on the cluster belonging to the given CE, which could cause slow execution in some cases.
    • One always has to ensure that the two workflows, i.e. the DAGs (on the dedicated clusters) and the DAX (on the Grid), are exactly the same. This means extra work.
    • Since an intermediate step is involved, debugging could be more difficult.
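
A hedged sketch of what such a Condor-G submission to a GRAM gatekeeper could look like; the gatekeeper contact string and executable are placeholders, and in practice the submit files would be generated from the DAX rather than written by hand:

import subprocess
import textwrap

GATEKEEPER = "ce.example.org/jobmanager-pbs"  # placeholder GRAM contact string

# Condor-G "grid" universe submit description targeting a pre-WS GRAM (gt2) CE.
submit = textwrap.dedent(f"""\
    universe      = grid
    grid_resource = gt2 {GATEKEEPER}
    executable    = run_payload.sh
    output        = payload.out
    error         = payload.err
    log           = condor_g.log
    queue
""")

with open("payload_gram.sub", "w") as f:
    f.write(submit)

subprocess.run(["condor_submit", "payload_gram.sub"], check=True)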

Approach B

Another approach is to use Pegasus for the execution of the DAG workflows. Pegasus can interpret Condor DAGs and can send jobs to local clusters or to the Grid (a rough planning command is sketched after the list below).
  • Advantages
    • No need to generate a second workflow, DAGs can be used
    • Existing working examples
  • Disadvantages
    • An extra service, Pegasus, needs to be run.
    • Since an intermediate step is involved, debugging could be more difficult.
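
For orientation only, a rough sketch of how a DAX could be planned and submitted with pegasus-plan; the option names may differ between Pegasus releases, the site handle and DAX file are placeholders, and the site/replica/transformation catalogues must already be configured:

import subprocess

# pegasus-plan maps the abstract DAX onto concrete resources and, with --submit,
# hands the resulting DAG over to DAGMan. All names below are placeholders.
subprocess.run(
    [
        "pegasus-plan",
        "--dax", "pipeline.dax",   # abstract workflow description
        "--sites", "egee_site",    # execution site handle from the site catalogue
        "--dir", "runs",           # directory for the generated submit files
        "--submit",                # submit immediately after planning
    ],
    check=True,
)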

Note that I don't have much experience with Pegasus; this part has to be examined more carefully.

Approach C

In this approach one submits a set of pilot jobs to the Grid. (A pilot job is in fact an empty job which, after landing on a Worker Node, pulls the payload job in from a central place and executes it. After executing the payload job it can terminate or execute another payload job. For a description of pilot jobs see for example this.) Each of the submitted pilot jobs starts up a lightweight Condor client which is pointed at a Condor central manager. In this way the pilot jobs form a Condor metacluster. From this point on the user can submit the DAGs to this Condor metacluster exactly in the same way as (s)he does on the dedicated clusters.

It is called a Condor metacluster because it does not matter what kind of batch system the pilot job is executed on: since the pilot jobs start up a Condor cluster, the diversity of the EGEE Grid is hidden from the user.
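
A sketch of the worker side of this scheme, i.e. what the pilot's payload could do on the Worker Node; the central manager host and the Condor directory are placeholders, it is assumed the pilot has already unpacked a Condor release there, and security settings (ALLOW_WRITE etc.) are left out:

import os
import subprocess
import textwrap

CENTRAL_MANAGER = "condor-cm.example.org"   # placeholder central manager host
CONDOR_DIR = "/tmp/condor"                  # placeholder: unpacked Condor release

# Minimal execute-node configuration: run only the master and the startd and
# report to the metacluster's central manager.
config = textwrap.dedent(f"""\
    CONDOR_HOST = {CENTRAL_MANAGER}
    RELEASE_DIR = {CONDOR_DIR}
    LOCAL_DIR   = {CONDOR_DIR}/local
    DAEMON_LIST = MASTER, STARTD
    START       = True
""")

os.makedirs(CONDOR_DIR + "/local", exist_ok=True)
config_file = CONDOR_DIR + "/condor_config"
with open(config_file, "w") as f:
    f.write(config)

# condor_master (run in the foreground for the pilot's lifetime) starts the
# startd, which registers with the central manager and becomes one slot of
# the metacluster.
env = dict(os.environ, CONDOR_CONFIG=config_file)
subprocess.run([CONDOR_DIR + "/sbin/condor_master", "-f"], env=env, check=True)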

  • Advantages
    • The pilot job framework is already in use and has proved to be extremely useful.
    • Faster execution, since the communication overhead is minimal.
    • No need to maintain a second workflow.
    • In principle no change for the users.
    • With the metacluster we can ensure that jobs are executed on the same WN, so the software/data does not have to be copied over several times.

  • Disadvantages
    • Running a condor client on the Worker Node causes a slight performance decrease, which should be negligible (to be tested)

In each of the above approaches (A, B, C) one has to handle a set of issues, which I now address only for Approach C, but the problems are very similar in A and B.

The Condor metacluster

There are various ways of building up the Condor metacluster; option 1 is sketched after the list below.

  1. The central manager is running on the Grid User Interface machine.
  2. The central manager is run by a pilot job. (In this case the Condor metacluster is restricted to one site due to firewall issues.)
  3. The Condor metacluster is connected to an existing Condor pool (like ATLAS) via Condor glidein.
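
As a complement to the pilot-side sketch above, this is roughly what the central manager configuration for option 1 could look like on the Grid User Interface machine; the path is a placeholder and security settings are again omitted:

import textwrap

# The UI machine runs the master, the collector/negotiator (the central manager
# proper) and a schedd from which the DAGs are submitted. The path below is the
# conventional one; firewall and ALLOW_WRITE settings are omitted from the sketch.
cm_config = textwrap.dedent("""\
    CONDOR_HOST = $(FULL_HOSTNAME)
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
""")

with open("/etc/condor/condor_config.local", "w") as f:
    f.write(cm_config)
# After restarting condor_master here, the pilots' startds show up in
# condor_status and DAGs can be submitted with condor_submit_dag as usual.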

Availability of the software

The LIGO/Virgo software has to be available on the Worker Nodes (WNs) where the job is executed. There are four ways to achieve this:
  1. The software is pre-installed on each site and is visible to the WNs. (This requires a lot of communication between Virgo and the site administrators, and it is hard to ensure that each installation is really exactly the same.)
  2. The software is available through some wide-area network file system, like AFS. (The load generated on the AFS server has to be examined.)
  3. The job fetches the software, as sketched after this list. (The size of the pre-compiled version of the full software is around 1 GB; if the transfer time is small compared to the execution time, then it's OK.)
  4. Condor takes care of this.
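
A sketch of option 3, the payload fetching a pre-built software tarball before running the analysis; the URL and tarball layout are placeholders:

import os
import tarfile
import urllib.request

SOFTWARE_URL = "http://repository.example.org/lscsoft.tar.gz"  # placeholder URL
INSTALL_DIR = os.path.join(os.getcwd(), "lscsoft")

# Download the ~1 GB pre-built tarball and unpack it next to the job's work dir.
archive, _ = urllib.request.urlretrieve(SOFTWARE_URL, "lscsoft.tar.gz")
with tarfile.open(archive) as tar:
    tar.extractall(INSTALL_DIR)

# Make the fetched tools visible to the subsequent pipeline steps.
os.environ["PATH"] = os.path.join(INSTALL_DIR, "bin") + os.pathsep + os.environ["PATH"]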

Availability of the data

The data to be analysed should be available on the Worker Node. Again, the possibilities are:
  1. The data is replicated on each site. (This is almost impossible.)
  2. The data is available through some wide-area network file system, like AFS. (The load generated on the AFS server has to be examined.)
  3. The job fetches the data. (If the transfer time is small compared to the execution time, then it's OK.)
  4. Condor takes care of this via its built-in file transfer mechanism (see the sketch after this list).
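
A sketch of option 4, letting Condor's file transfer mechanism stage the input data to the Worker Node; the frame file names and the executable are placeholders for the real pipeline inputs:

import textwrap

# Vanilla-universe submit description relying on Condor's sandbox file transfer;
# the frame file and executable names are placeholders.
submit = textwrap.dedent("""\
    universe                = vanilla
    executable              = run_inspiral.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = H-H1_RDS-950000000-128.gwf, segments.txt
    output                  = inspiral.out
    error                   = inspiral.err
    log                     = inspiral.log
    queue
""")

with open("inspiral_transfer.sub", "w") as f:
    f.write(submit)
# Condor stages the listed files to the Worker Node slot before the job starts
# and transfers the outputs back when it exits.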

Execution plan and status

The realisations differ in simplicity and efficiency. One should start by experimenting with the simplest solution and later move on to the most efficient one. (In the labels below the approach is followed by the metacluster option as a Roman numeral, the software option as a letter and the data option as a number.)

  • The simplest solution is C/I/d on a Grid site where the data is already available (e.g. Lyon).
  • The second simplest solution in general is C/I/d/4 or C/I/b/4.
  • The most usable one is probably C/III/d/4.

Issues

  • The time of data transfer and data processing has to be compared. Probably the datafind jobs should not be sent to the Grid; they should be run locally.
  • Making interactive debugging possible. This is not possible even on the current dedicated clusters, but with tunnelled-out ssh sessions it is doable.
  • The issues are the same when submitting jobs to OSG through Pegasus, so one should examine how the above problems are addressed there.
  • One month of data is quite big (300 GB), so it is possible that not all kinds of searches are worth running on the Grid.
  • One has to distinguish between the case when the data is locally available and when it is transferred from elsewhere; the datafind utilities (findlscdata) may have to be modified/extended accordingly.


The RMKI Budapest Virgo Group

-- GergelyDebreczeni - 09 Apr 2009
