Chapter 11
N-tuple and Event Collection facilities
11.1 Overview
In this chapter we describe facilities available in Gaudi to create and retrieve n-tuples. We discuss how Event Collections, which can be considered an extension of n-tuples, can be used to make preselections of event data. Finally, we explore some possible tools for the interactive analysis of n-tuples.
11.2 N-tuples and the N-tuple Service
User data, so-called n-tuples, are very similar to event data. Of course, the scope may be different: a row of an n-tuple may correspond to a track, an event or a complete run. Nevertheless, user data must be accessible by interactive tools such as PAW or ROOT.
Gaudi n-tuples allow the user to format structures freely. Later, during the running phase of the program, data are accumulated and written to disk.
The transient image of an n-tuple is stored in a Gaudi data store which is connected to the n-tuple service. Its purpose is to store user created objects that have a lifetime of more than a single event.
As with the other data stores, all access to data is via a service interface. In this case it is via the INTupleSvc interface, which extends the IDataProviderSvc interface. In addition, the interface to the n-tuple service provides methods for creating n-tuples, saving the current row of an n-tuple or retrieving n-tuples from a file. The n-tuples are derived from DataObject in order to be storable, and are stored in the same type of tree structure as the event data. This inheritance allows n-tuples to be loaded and located on the store with the same smart pointer mechanism as is available for event data items (cf. Chapter 7).
11.2.1 Access to the N-tuple Service from an Algorithm.
The Algorithm base class defines a member function which returns a pointer to the INTupleSvc interface:
INTupleSvc* ntupleSvc()
The n-tuple service provides methods for the creation and manipulation of n-tuples and the location of n-tuples within the persistent store.
The top level directory of the n-tuple transient data store is called "/NTUPLES". The next directory layer is connected to the different output streams: e.g. "/NTUPLES/FILE1", where FILE1 is the logical name of the requested output file for a given stream. There can be several output streams connected to the service. In case of persistency using HBOOK, "FILE1" corresponds to the top level RZ directory of the file (the name given to HROPEN). From then on the tree structure is reflected with normal RZ directories. Caveat: HBOOK only accepts directory names with fewer than 8 characters! It is recommended to keep directory names to fewer than 8 characters even when using another technology (e.g. ROOT) for persistency, to make the code independent of the persistency choice.
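The directory name limit can be checked mechanically before an n-tuple is booked. The following is a minimal sketch and not part of Gaudi: the helper names splitPath and hbookSafePath are hypothetical, and the sketch simply assumes that every component of the store path is subject to the limit.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a store path such as "/NTUPLES/FILE1/MC/1" into its components.
std::vector<std::string> splitPath(const std::string& path) {
  std::vector<std::string> parts;
  std::istringstream in(path);
  std::string item;
  while (std::getline(in, item, '/')) {
    if (!item.empty()) parts.push_back(item);
  }
  return parts;
}

// Hypothetical helper (not a Gaudi function): returns true if every
// component of the path has fewer than 8 characters.
bool hbookSafePath(const std::string& path) {
  for (const std::string& name : splitPath(path)) {
    if (name.size() >= 8) return false;
  }
  return true;
}
```

Calling hbookSafePath on a candidate path during initialization gives an early warning, whichever persistency technology is finally selected in the job options.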
11.2.2 Using the N-tuple Service.
When defining an n-tuple the following steps must be performed:
· The n-tuple tags must be defined.
· The n-tuple must be booked and the tags must be declared to the n-tuple.
· The n-tuple entries have to be filled.
· The filled row of the n-tuple must be committed.
· Persistent aspects are steered by the job options.
In the following an attempt is made to explain the different steps. Please note that when using HBOOK for persistency, the n-tuple number must be unique and, in particular, that it must be different from any histogram number. This is a limitation imposed by HBOOK. It is recommended to keep this number unique even when using another technology (e.g. ROOT) for persistency, to make the code independent of the persistency choice.
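One way to respect the uniqueness rule is to draw histogram and n-tuple numbers from a single shared pool. The sketch below illustrates the idea only; the class IdRegistry is hypothetical and does not exist in Gaudi or HBOOK.

```cpp
#include <set>

// Hypothetical bookkeeping helper (not part of Gaudi): histograms and
// n-tuples reserve their numeric identifiers from one shared pool.
class IdRegistry {
public:
  // Returns true and records the ID if it is still free; returns false if
  // a histogram or n-tuple has already claimed it.
  bool reserve(int id) { return m_used.insert(id).second; }

private:
  std::set<int> m_used;
};
```

An algorithm would call reserve() with the intended number before booking and treat a false return value as a configuration error.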
11.2.2.1 Defining N-tuple tags
When creating an n-tuple it is necessary to first define the tags to be filled in the n-tuple. Typically the tags belong to the filling algorithm and hence should be provided in the Algorithm's header file. Currently the following data types are supported: bool, long, float and double. double types (Fortran REAL*8) need special attention if using HBOOK for persistency: the n-tuple structure must be defined in a way that aligns double types to 8 byte boundaries, otherwise HBOOK will complain. In addition PAW cannot understand double types. Listing 11.1 illustrates how to define n-tuple tags:
11.2.2.2 Booking and Declaring Tags to the N-tuple
When booking the n-tuple, the previously defined tags must be declared to the n-tuple. Before booking, the proper output stream (file) must be accessed. The target directory is defined automatically.
Tags which are not declared to the n-tuple are invalid and will cause an access violation at run-time.
11.2.2.3 Filling the N-tuple
Tags are usable just like normal data items, where
· Item<TYPE> is the equivalent of a number: bool, long, float.
· Array<TYPE> is the equivalent of a 1-dimensional array: bool[size], long[size], float[size].
· Matrix<TYPE> is the equivalent of an array of arrays or matrix: bool[dim1][dim2].
Implicit bounds checking is not possible without a rather large overhead at run-time. Hence it is up to the user to ensure the arrays do not overflow.
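Since overflow protection is left to the user, one common approach is to clamp the number of filled entries to the capacity that was declared for the array. The sketch below uses a plain std::array as a stand-in for an n-tuple array tag; the names kMaxTracks and fillMomenta are hypothetical and not part of Gaudi.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Assumed capacity declared when the array tag was booked (hypothetical).
constexpr std::size_t kMaxTracks = 100;

// Copy at most kMaxTracks values into the fixed-size buffer and return the
// number of entries actually written. Surplus input values are dropped
// instead of overflowing the buffer.
std::size_t fillMomenta(const std::vector<float>& input,
                        std::array<float, kMaxTracks>& buffer) {
  const std::size_t n = std::min(input.size(), buffer.size());
  std::copy_n(input.begin(), n, buffer.begin());
  return n;
}
```

The returned count is the value that would be stored in the index item of the array, so that the number of valid entries and the filled data stay consistent.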
When all entries are filled, the row must be committed, i.e. the record of the n-tuple must be written.
11.2.2.4 Reading N-tuples
Although n-tuples are intended for interactive analysis, they can also be read by a regular program. An example of reading back such an n-tuple is given in Listing 11.4. Notice line 8, where an example is given of preselecting rows of the n-tuple according to given criteria. This option is only possible if supported by the underlying database used to make the n-tuple persistent. Currently it is possible to preselect rows from n-tuples written in ROOT format and from relational databases using ODBC [1]. Note that the syntax of the query is also affected by the underlying technology: while an ODBC database will accept any SQL query, the ROOT implementation understands only the "And" and "Or" SQL operators, but it does understand the full C++ syntax (an example is given in section 11.3.2).
11.2.3 N-tuple Persistency
11.2.3.1 Input and Output File Specification
Conversion services exist to convert n-tuple objects into a form suitable for persistent storage in a number of storage technologies. In order to use this facility it is necessary to add the following line in the job options file:
NTupleSvc.Output = {"FILE1 DATAFILE='tuples.hbook' TYP='HBOOK' OPT='NEW'",
"FILE2 ...",
...
"FILEN ..."};
where tuples.hbook should be replaced by the name of the file to which you wish to write the n-tuple. FILE1 is the logical name of the output file; it could be any other string. A similar option, NTupleSvc.Input, exists for n-tuple input.
The following is a complete overview of all possible options:
· DATAFILE='<file-specs>'
Specifies the datafile (file name) of the output stream.
· TYP='<typ-spec>'
Specifies the type of the output stream. Currently supported types are:
  · HBOOK: Write in HBOOK RZ format.
  · ROOT: Write as a ROOT tree.
  · MS Access: Write as a Microsoft Access database [2].
There is also weak support for the following database types [1]:
  · SQL Server
  · MySQL
  · Oracle ODBC
These database technologies are supported through their ODBC interface. They were tested privately on local installations. However, all these types need special setup to grant access to the database.
Except for the HBOOK data format, you need to specify the use of the technology-specific persistency package (e.g. GaudiRootDb) in your CMT requirements file and to load explicitly in the job options the DLLs containing the generic (GaudiDb) and technology-specific (GaudiRootDb) implementations of the database access drivers:
ApplicationMgr.DLLs += { "GaudiDb", "GaudiRootDb" };
· OPT='<opt-spec>'
  · NEW, CREATE, WRITE: Create a new data file. Not all implementations allow existing files to be overwritten.
  · OLD, READ: Access an existing file for reading.
  · UPDATE: Open an existing file and add records. It is not possible to update already existing records.
· SVC='<service-spec>' (optional)
Connect this file directly to an existing conversion service. This option however needs special care. It should only be used to replace default services.
· AUTHENTICATION='<authentication-specs>' (optional)
Data files (e.g. Microsoft Access) may be password protected. In this case the authentication string makes it possible to connect to these databases. The connection string is the string that must be passed to ODBC, for example: AUTH='SERVER=server_host;UID=user_name;PWD=my_password;'
· All other options are passed without any interpretation directly to the conversion service responsible for handling the specified output file.
For all option names only the first three characters are significant: DATAFILE=<...>, DATABASE=<...> or simply DATA=<...> all lead to the same result.
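The structure of such a stream specification, and the three-character rule for option names, can be illustrated with a small parser. This is a sketch written for this explanation, not the Gaudi implementation: the names StreamSpec and parseStreamSpec are hypothetical, and the sketch assumes that every value is enclosed in single quotes as in the examples above.

```cpp
#include <cstddef>
#include <map>
#include <string>

struct StreamSpec {
  std::string logicalName;                     // e.g. "FILE1"
  std::map<std::string, std::string> options;  // keyed by 3-character prefix
};

// Illustrative parser (not the Gaudi implementation) for strings such as
//   FILE1 DATAFILE='tuples.hbook' TYP='HBOOK' OPT='NEW'
// Only the first three characters of each option name are kept, so
// DATAFILE, DATABASE and DATA all map to the key "DAT".
StreamSpec parseStreamSpec(const std::string& text) {
  StreamSpec spec;
  std::size_t pos = text.find(' ');
  spec.logicalName = text.substr(0, pos);
  while (pos != std::string::npos) {
    const std::size_t keyStart = text.find_first_not_of(' ', pos);
    if (keyStart == std::string::npos) break;
    const std::size_t eq = text.find("='", keyStart);
    if (eq == std::string::npos) break;
    const std::size_t close = text.find('\'', eq + 2);
    if (close == std::string::npos) break;
    const std::string key = text.substr(keyStart, eq - keyStart).substr(0, 3);
    spec.options[key] = text.substr(eq + 2, close - eq - 2);
    pos = close + 1;
  }
  return spec;
}
```

With this sketch, the strings "FILE1 DATAFILE='tuples.hbook'" and "FILE1 DATA='tuples.hbook'" produce the same options map, which mirrors the behaviour described above.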
The handling of row-wise n-tuples does not differ. However, only individual items (class NTuple::Item) can be filled, not arrays or matrices. Since the persistent representation of row-wise n-tuples in HBOOK uses floats only, the first row of each row-wise n-tuple contains the type information. When looking at a row-wise n-tuple with PAW, make sure to start at the second event!
11.3 Event Collections
Event collections or, to be more precise, event tag collections, are used to minimize data access by performing preselections based on small amounts of data. Event tag data contain flexible event classification information according to the physics needs. This information could either be stored as flags indicating that the particular event has passed some preselection criteria, or as a small set of parameters which describe basic attributes of the event. Fast access is required for this type of event data.
Event tag collections can exist in several versions:
· Collections recorded during event processing stages such as online, reconstruction, reprocessing, etc.
· Event collections defined by analysis groups with pre-computed items of special interest to a given group.
· Private user defined event collections.
Starting from this definition, an event tag collection can be interpreted as an n-tuple which allows access to the data used to create it. Using this approach, any n-tuple which allows access to the data is an event collection.
Event collections allow pre-selections of event data. These pre-selections depend on the underlying storage technology.
First-stage pre-selections are based on scalar components of the event collection. The first-stage pre-selection is not necessarily executed on your computer; it may run on a database server, e.g. when using ORACLE. Only the accessed columns are read from the event collection. If the criteria are fulfilled, the n-tuple data are returned to the user process. Pre-selection criteria are set through job options, as described in section 11.3.2.
The second-stage pre-selection is triggered for all items which passed the first-stage pre-selection criteria. For this pre-selection, which is performed on the client computer, all data in the n-tuple can be used. This further pre-selection is implemented in a user-defined function object (functor) as described in section 11.3.2. Gaudi algorithms are called only when this pre-selector also accepts the event, and normal event processing can start.
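The two stages can be pictured with a small stand-alone model. This is only a sketch of the control flow: the struct TagRow, its fields and the function preselect are hypothetical and do not correspond to Gaudi classes, and both stages run locally here, whereas the real first stage may be executed by the database.

```cpp
#include <functional>
#include <vector>

// Minimal stand-in for one row of an event collection (hypothetical fields).
struct TagRow {
  int nTrack;
  float energy;
  std::vector<float> trackPt;  // non-scalar data, only used in stage two
};

// Return the rows that pass both stages. Stage one sees only scalar
// columns; stage two is a user-supplied function object that may inspect
// the whole row and is only called for rows that passed stage one.
std::vector<TagRow> preselect(
    const std::vector<TagRow>& rows,
    const std::function<bool(int, float)>& stageOne,
    const std::function<bool(const TagRow&)>& stageTwo) {
  std::vector<TagRow> accepted;
  for (const TagRow& row : rows) {
    if (!stageOne(row.nTrack, row.energy)) continue;  // cheap scalar cut
    if (!stageTwo(row)) continue;                     // full-row functor
    accepted.push_back(row);
  }
  return accepted;
}
```

The ordering matters for cost: the inexpensive scalar cut rejects most rows before the more expensive functor is evaluated.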
11.3.1 Writing Event Collections
Event collections are written to the data file using a Gaudi sequencer. A sequencer calls a series of algorithms, as discussed in section 5.2. The execution of these algorithms may terminate at any point of the series (in which case the event is not selected for the collection) if one of the algorithms in the sequence fails to pass a filter.
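The early termination of a sequence can be modelled in a few lines. The types Event and Step and the function runSequence below are toy stand-ins chosen for this illustration; they are not the Gaudi Sequencer or Algorithm classes.

```cpp
#include <functional>
#include <vector>

// Toy event type; a stand-in, not a Gaudi class.
struct Event {
  int nTrack;
};

// A step processes the event and returns its filter decision.
using Step = std::function<bool(const Event&)>;

// Run the steps in order and stop at the first one whose filter fails.
// Returns true only if every step passed.
bool runSequence(const std::vector<Step>& steps, const Event& evt) {
  for (const Step& step : steps) {
    if (!step(evt)) return false;
  }
  return true;
}
```

If the last step of the sequence is the one that writes the collection entry, an event rejected by an earlier step never reaches it and is therefore not recorded.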
11.3.1.1 Defining the Address Tag
The event data is accessed using a special n-tuple tag of the type
It is defined in the algorithm's header file in addition to any other ordinary n-tuple tags, as described in section 11.2.2.1. When booking the n-tuple, the address tag must be declared like any other tag, as shown in Listing 11.1. It is recommended to use the name "Address" for this tag.
The usage of this tag is identical to any other tag except that it only accepts variables of type IOpaqueAddress - the information necessary to retrieve the event data.
11.3.1.2 Filling the Event Collection
At fill time the address of the event must be supplied to the Address item. Otherwise the n-tuple may be written, but the information to retrieve the corresponding event data later will be lost. Listing 11.2 also demonstrates the setting of a filter to steer whether the event is written out to the event collection.
11.3.1.3 Writing out the Event Collection
The event collection is written out by an EvtCollectionStream, which is the last member of the event collection Sequencer. Listing 11.3 (which is taken from the job options of the EvtCollection example) shows how to set up such a sequence consisting of a user-written Selector algorithm (which could for example contain the code in Listing 11.2) and of the EvtCollectionStream.
11.3.2 Reading Events using Event Collections
Reading event collections as the input for further event processing in Gaudi is transparent. The main change is the specification of the input data to the event selector:
These tags need some explanation:
· COLLECTION
Specifies the sub-path of the n-tuple used to write the collection. If the n-tuple which was written was called e.g. "/NTUPLES/FILE1/Collection", the value of this tag must be "Collection".
· ADDRESS (optional)
Specifies the name of the n-tuple tag which was used to store the opaque address to be used to retrieve the event data later. This tag is optional; the default value is "Address". Please use this default value when writing: conventions are useful!
· SEL (optional)
Specifies the selection string used for the first stage pre-selection. The syntax depends on the database implementation; it can be:
· SQL like, if the event collection was written using ODBC.
Example: (NTrack>200 AND Energy>200)
· C++ like, if the event collection was written using ROOT.
Example: (NTrack>200 && Energy>200).
Note that event collections written with ROOT also accept the SQL operators 'AND' instead of '&&' as well as 'OR' instead of '||'. Other SQL operators are not supported.
· FUN (optional)
Specifies the name of a function object used for the second-stage pre-selection. An example of such a function object is shown in Listing 11.5. Note that the factory declaration on line 16 is mandatory in order to allow Gaudi to instantiate the function object.
· The DATAFILE and TYP tags, as well as additional optional tags, have the same meaning and syntax as for n-tuples, as described in section 11.2.3.1.
11.4 Interactive Analysis using N-tuples
N-tuples are of special interest to the end-user, because they can be accessed using commonly known tools such as PAW, ROOT or Java Analysis Studio (JAS). In the past it was not a particular strength of the software used in HEP to plug into many possible persistent data representations. Except for JAS, each tool understands only its own proprietary data format. For this reason the choice of the output format of the data depends on the preferred analysis tool/viewer. The following gives an overview of the possible data formats.
In the examples below the output of the GaudiExample/NTuple.write program was used.
11.4.1 HBOOK
This data format is used by PAW. PAW can understand this and only this data format. Files of this type can be converted to the ROOT format using the h2root data conversion program. The use of PAW in the long term is deprecated.
11.4.2 ROOT
This data format is used by the interactive ROOT program. Using the helper library TBlob located in the package GaudiRootDb it is possible to interactively analyse the n-tuples written in ROOT format. However, access is only possible to scalar items (int, float, ...) not to arrays.
Analysis is possible through directly plotting variables:
root [1] gSystem->Load("D:/mycmt/GaudiRootDb/v3/Win32Debug/TBlob");
root [2] TFile* f = new TFile("tuple.root");
root [3] TTree* t = (TTree*)f->Get("<local>_MC_ROW_WISE_2");
root [4] t->Draw("pz");
or using a ROOT macro interpreted by ROOT's C/C++ interpreter (see for example the code fragment interactive.C shown in Listing 11.6):
root [0] gSystem->Load("D:/mycmt/GaudiRootDb/v3/Win32Debug/TBlob");
root [1] .L ./v8/NTuples/interactive.C
root [2] interactive("./v8/NTuples/tuple.root");
More detailed explanations can be found in the ROOT tutorials (http://root.cern.ch).
11.4.3 ODBC [3]
Open DataBase Connectivity (ODBC), developed by Microsoft, allows access to a very wide range of relational databases using the same callable interface. A Gaudi interface to store and retrieve data from ODBC tables was developed; it makes the entire range of MS Office applications available for accessing these data. The small Visual Basic program in Listing 11.7 shows how to fill an Excel spreadsheet using n-tuple data from an Access database. Apparently access to ODBC-compliant databases using ROOT is also possible, but this was not tested.
11.5 Known Problems
Nothing is perfect and there are always things to be sorted out....
· When building the GaudiRootDb package on Linux using CMT you must first set up the ROOT environment, by sourcing the setup.csh file
[1] The ODBC implementation exists in the LHCb extensions to Gaudi. It is not distributed with Gaudi v6.
[2] The implementation for MS Access and other ODBC-compliant databases is available in the LHCb extensions to Gaudi. It is not distributed with Gaudi v6.
[3] The ODBC implementation exists in the LHCb extensions to Gaudi. It is not distributed with Gaudi v6.