XML is the "Extensible Markup Language". XML is not a single, predefined markup language: it is a metalanguage (a language for describing other languages) which lets you design your own markup. A predefined markup language like HTML defines a way to describe information in one specific class of documents only; XML lets you define your own customized markup languages for any number of different classes of document.
In LHCb, we use XML in various domains: Detector Description, Data Dictionaries and Event Display. The XML parser we use is Xerces-C.
XML is extensible, meaning that users can define their own markup language with their own tags. The tags define the meaning of the data they represent or contain; they act as data descriptors. For example, the following XML file describes an e-mail.
<?xml version='1.0' encoding='UTF-8'?>
<!-- This is an example of XML -->
<Email>
  <TimeStamp time="11:38:43" date="22/11/1999" />
  <Sender>sender@cern.ch</Sender>
  <Recipient>recipient@cern.ch</Recipient>
  <Subject>Lunch...</Subject>
  <Body>
    Could we meet at 14:00?
    <Signature>Sender's signature</Signature>
  </Body>
</Email>
At first glance this markup language looks like malformed HTML. This is because both HTML and XML have their roots in SGML, although they are used for different purposes. The example above does not say how the data should be presented or visualized. What is clear, however, is the meaning of the data items encoded in XML: one can easily recognize the data items and guess what they mean. It is likewise relatively easy to instruct a computer program what to do with a given data item according to the XML elements that enclose it. Let us analyse the example shown above.
The XML declaration, when present, must be the very first thing in an XML document. It is the first line in the example. It says that this file is an XML file conforming to version 1.0 of the XML standard and that it is encoded in UTF-8. The encoding is very important because XML has been designed to describe data using the Unicode standard for text. This means that an XML document is treated as a sequence of Unicode characters rather than as plain ASCII bytes. So, even if you write your XML files in 7-bit ASCII or in an 8-bit encoding, XML applications work with the content as Unicode internally (Xerces-C, for instance, represents text as 16-bit UTF-16 code units). The encoding information matters, for example, when an XML document is transferred over the Internet to a country where a different encoding is in common use. If the receiving application can determine the encoding from the received file, it can transcode the file into the proper local encoding, thus preserving the readability of the data.
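The role of the encoding declaration can be illustrated with a minimal sketch. It uses Python's standard `xml.etree.ElementTree` module instead of Xerces-C, purely to keep the example self-contained; the subject text and the helper `encode_document` are hypothetical. The same one-element document is written out in two encodings, and the parser uses the declaration to recover the same Unicode text from both.

```python
import xml.etree.ElementTree as ET

TEXT = "Déjeuner à 14:00?"  # hypothetical subject with non-ASCII characters


def encode_document(encoding: str) -> bytes:
    """Build the same one-element document as bytes in the given encoding."""
    xml = f"<?xml version='1.0' encoding='{encoding}'?><Subject>{TEXT}</Subject>"
    return xml.encode(encoding)


utf8_bytes = encode_document("UTF-8")
latin1_bytes = encode_document("ISO-8859-1")

# The byte sequences differ, but the parser reads the encoding named in the
# XML declaration and decodes each one into the same Unicode string.
subject_from_utf8 = ET.fromstring(utf8_bytes).text
subject_from_latin1 = ET.fromstring(latin1_bytes).text
```

Both `subject_from_utf8` and `subject_from_latin1` hold the original text, even though the files on disk would not be byte-for-byte identical.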
XML comments look like comments in SGML or HTML. They start with <!-- and end with -->. Comments in XML cannot be nested.
XML elements are the markup components describing data in XML. In the example we had the following XML elements: Email, TimeStamp, Sender, Recipient, Subject, Body, Signature. The very basic and mandatory rule of XML is that all XML element tags must nest properly and there must be only one root XML element at the top level of each XML document, which contains all the others. Proper nesting means that each XML element has its opening and closing tag and the closing tag must appear before the parent element's closing tag, as shown in the following listing. Following these rules one can always produce well-formed XML documents.
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<!-- Root tag is top level root element of XML file -->
<Root>
  <!-- Elements which are empty -->
  <EmptyElement />
  <EmptyWithAttributes attr1="first" attr2='second' />
  <!-- Elements having content model -->
  <ProperNesting>
    <Something>In</Something>
  </ProperNesting>
  <!-- Deliberately wrong nesting: this part is NOT well-formed -->
  <WRONGNESTING>
    <BADTHING>huhu
  </WRONGNESTING>
  </BADTHING>
</Root>
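The nesting and single-root rules can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module rather than Xerces-C (an assumption made only for brevity); the helper `is_well_formed` and the three sample strings are hypothetical and simply condense the cases in the listing.

```python
import xml.etree.ElementTree as ET

PROPER = "<Root><ProperNesting><Something>In</Something></ProperNesting></Root>"
WRONG = "<Root><WRONGNESTING><BADTHING>huhu</WRONGNESTING></BADTHING></Root>"
TWO_ROOTS = "<Root/><Root/>"


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # Raised for mismatched tags, a second root element, and so on.
        return False
    return True


results = {
    "proper": is_well_formed(PROPER),
    "wrong nesting": is_well_formed(WRONG),
    "two roots": is_well_formed(TWO_ROOTS),
}
```

Only the properly nested document with a single root is accepted; the other two are rejected before any of their data reaches the application.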
XML elements can have attributes and content. Attributes usually describe the properties of the given element. The value of an attribute follows the equals sign and is enclosed in double or single quotes. The content can hold text data or other properly nested elements, and text data and nested elements can be mixed inside the content.
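As an illustration of how attributes and mixed content appear to a program, here is a minimal sketch based on a shortened version of the e-mail example. It assumes Python's standard `xml.etree.ElementTree` module in place of Xerces-C, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

EMAIL = """<Email>
  <TimeStamp time="11:38:43" date='22/11/1999' />
  <Body>Could we meet at 14:00?<Signature>Sender's signature</Signature></Body>
</Email>"""

root = ET.fromstring(EMAIL)

# Attributes are exposed as a dictionary; the quote style used in the file
# (single or double) makes no difference to the parsed value.
stamp = root.find("TimeStamp")
attributes = dict(stamp.attrib)

# Mixed content: Body holds text data followed by a nested element.
body = root.find("Body")
body_text = body.text
children = [child.tag for child in body]
signature = body.find("Signature").text
```

Here `attributes` contains the `time` and `date` values, `body_text` is the text that precedes the nested element, and `children` lists the single `Signature` element inside `Body`.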
A well-formed XML document is any XML document which follows the rules described in the previous section. However, this is not sufficient in some cases. A well-formed XML document can contain any XML tag you can imagine. This is not harmful to XML itself, but it makes such a document practically impossible to process with a computer program, and it makes the program very hard to maintain, because it has to be kept synchronised with every new tag users invent. Well-formedness alone does not allow one to check that a document contains only tags which make sense, or that those tags have valid content. For that purpose there exists a notation called Document Type Definition (DTD), which comes from the SGML world. This notation permits the definition of a grammar (a valid set of tags, their contents and attributes) against which XML documents can be validated, i.e. checked to contain only XML tags that belong to the set defined by the associated DTD. In addition, the validating application can check whether the tags contain only allowed tags or text data, and whether the tags use only the attributes defined for them by the DTD. The validation process can also perform normalization of attributes, i.e. assign default values to attributes that the user has omitted because they are declared in the DTD as optional or as having a fixed value.
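The attribute defaulting described above can be sketched as follows. Python's standard library does not include a validating parser, so this example does not check the document against the grammar; it only shows the expat parser filling in a default value declared in an internal DTD subset. The DTD, the `zone` attribute and the `on_start` handler are all hypothetical, written for the e-mail example rather than taken from any LHCb DTD.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE Email [
  <!ELEMENT Email (TimeStamp, Subject)>
  <!ELEMENT TimeStamp EMPTY>
  <!ELEMENT Subject (#PCDATA)>
  <!ATTLIST TimeStamp
            time CDATA #REQUIRED
            zone CDATA "UTC">
]>
<Email>
  <TimeStamp time="11:38:43" />
  <Subject>Lunch...</Subject>
</Email>"""

seen = {}


def on_start(name, attrs):
    """Record the attributes the parser reports for each element."""
    seen[name] = dict(attrs)


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(DOCUMENT, True)

# The document never mentions "zone", yet the parser reports it with the
# default value taken from the ATTLIST declaration.
timestamp_attributes = seen["TimeStamp"]
```

A validating parser such as Xerces-C would additionally reject a document whose elements or attributes do not match the declarations.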
Important note: the default behaviour of the validating application, recommended by the XML standard, is to stop parsing of an XML document in case of an error. This is because the XML files describe data and an error in XML means corrupted data.