A Quick Introduction to XML

This document provides a quick introduction to some of the terms and concepts used in the analysis of the XML documents in the Tutorial section of the CellML website. The terms are taken from the original XML specification published in February 1998 by the World Wide Web Consortium.

The following online resources provide more thorough documentation on XML:

http://www.w3.org/XML/ — the W3C's XML page.
http://www.ucc.ie/xml/ — the official XML FAQ.
http://www.xml.com/axml/testaxml.htm — the annotated XML specification.
http://www.oasis-open.org/cover/xml.html — the XML Cover pages.

The following list of terms is by no means exhaustive, and the definitions are in some cases incomplete:

XML

XML stands for eXtensible Markup Language, and it is a standard for structured text documents developed by the World Wide Web Consortium (W3C). The W3C represents about 500 paying member companies and is responsible for many of the standards relating to the internet, including HTML. XML can be used to structure text in such a way that it is readable by both humans and machines, and it presents a simple format for the exchange of information across the internet between computers. As such, electronic commerce is one of the principal application areas for XML.

XML is a simplification (or subset) of the Standard Generalized Markup Language (SGML) which was developed in the 1970s for the large-scale storage of structured text documents.

XML document

An XML document contains a prolog and a body. The prolog consists of an XML declaration, possibly followed by a document type declaration. The body is made up of a single root element, possibly with some comments and/or processing instructions. An XML document is typically a computer file whose contents meet the requirements laid out in the XML specification. However, XML documents may also be generated "on the fly" by a computer responding to a request from another computer. For instance, an XML document may be dynamically compiled from information contained in a database.
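As a rough illustration of a document generated "on the fly", the following Python sketch assembles a small document from an in-memory table and serializes it with a prolog. It uses the standard library's xml.etree.ElementTree module. The records, and the element and attribute names (catalogue, item, id), are hypothetical and simply stand in for whatever a real database would supply.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for the result of a database query.
records = [
    {"id": "1", "label": "first entry"},
    {"id": "2", "label": "second entry"},
]

# The body of the document is a single root element.
root = ET.Element("catalogue")
for record in records:
    item = ET.SubElement(root, "item", id=record["id"])
    item.text = record["label"]

# Serialize to bytes, asking for an XML declaration in the prolog.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
document = buffer.getvalue()

print(document.decode("utf-8"))
```

The resulting bytes could be written to a file or sent in response to a request; either way the receiver sees an ordinary XML document.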

XML Declaration

An XML document should begin with an XML declaration, which, if present, must be the very first thing in the document. The declaration is used by the processing software to work out how to deal with the subsequent XML content. A typical XML declaration is shown below. Declaring the encoding of a document is particularly important, as XML processors will assume UTF-8 when reading an 8-bit-per-character document that does not declare an encoding. This will cause characters outside the ASCII range to be read incorrectly, or rejected, if the document actually uses the Latin-1 encoding (ISO-8859-1). XML processing applications are also required to handle 16-bit-per-character documents in the UTF-16 Unicode encoding, which makes XML a truly international format, able to handle most modern languages.

<?xml version="1.0" encoding="iso-8859-1"?>
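The effect of the encoding declaration can be seen in a short Python sketch using the standard library's xml.etree.ElementTree module. The sample text ("café") is an assumption chosen only because it contains one non-ASCII character; the behaviour shown for the undeclared case is that of Python's parser, and other processors may report the problem differently.

```python
import xml.etree.ElementTree as ET

# "caf\u00e9" contains one non-ASCII character (e with an acute accent),
# which is the single byte 0xE9 in ISO-8859-1.
declared = '<?xml version="1.0" encoding="iso-8859-1"?><my_tag>caf\u00e9</my_tag>'
root = ET.fromstring(declared.encode("iso-8859-1"))
print(root.text)

# The same bytes without a declaration are assumed to be UTF-8, where a
# lone 0xE9 byte is not valid, so this parser reports an error.
undeclared = "<my_tag>caf\u00e9</my_tag>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_parsed = True
except ET.ParseError:
    undeclared_parsed = False
print(undeclared_parsed)
```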

Document Type Declaration

A document author can use an optional document type declaration after the XML declaration to indicate what the root element of the XML document will be and possibly to point to a document type definition. A typical document type declaration for a CellML document is shown below. Note that the document type declaration facility defined in the XML specification provides a lot more functionality than what is discussed or shown here.

<!DOCTYPE model SYSTEM "http://www.cellml.org/cellml/cellml_1_1.dtd">

Start / End Tag

The simplest way of encoding the meaning of a piece of text in XML is to enclose it inside start and end tags. A start tag consists of the tag-name in between less-than and greater-than signs, and the matching end tag has a slash preceding the tag-name, as shown below. A well-formed XML document has an end-tag that matches every start-tag.

<my_tag> the text data </my_tag>
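A parser can be used to check that start and end tags match. The Python sketch below wraps the standard library's xml.etree.ElementTree parser in a small helper; the name is_well_formed and the mismatched tag name other_tag are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Matching start and end tags.
print(is_well_formed("<my_tag> the text data </my_tag>"))
# End tag does not match the start tag.
print(is_well_formed("<my_tag> the text data </other_tag>"))
# End tag is missing altogether.
print(is_well_formed("<my_tag> the text data "))
```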

Element

The combination of start-tag, data and end-tag is known as an element. The data may be plain text (as in the example above), further elements (sub-elements), or a combination of text and sub-elements. A document is usually made up of a tree of elements with a single root element as shown below.

<root_element>
  <sub_element_1> data for sub-element 1 </sub_element_1>
  <sub_element_2> data for sub-element 2 </sub_element_2>
</root_element>
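The tree structure becomes visible once the document is parsed. The following Python sketch, which assumes the standard library's xml.etree.ElementTree module, reads the example above and walks from the root element to its sub-elements.

```python
import xml.etree.ElementTree as ET

doc = """<root_element>
  <sub_element_1> data for sub-element 1 </sub_element_1>
  <sub_element_2> data for sub-element 2 </sub_element_2>
</root_element>"""

root = ET.fromstring(doc)

# Iterating over an element yields its direct sub-elements in order.
children = [(child.tag, child.text.strip()) for child in root]

print(root.tag)
for tag, text in children:
    print(tag, "->", text)
```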

Attribute

Another way of putting data into an XML document is by adding attributes to start tags. The value of the attribute is usually intended to be data relevant to the content of the current element. Whitespace is used to separate attributes from the tag-name and each other. Each attribute has a name followed by an equals sign and the value of the attribute. The value of the attribute is enclosed in single or double quotes. In the example below, <my_tag> has two attributes: att_1 and att_2.

<my_tag att_1="1" att_2="2"> the text data </my_tag>
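Once parsed, attributes are typically exposed as name and value pairs. In this Python sketch (standard library xml.etree.ElementTree), one attribute value is written with double quotes and the other with single quotes to show that both are accepted; att_3 is a hypothetical name used to show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

# att_1 uses double quotes, att_2 uses single quotes; both are allowed.
element = ET.fromstring("<my_tag att_1=\"1\" att_2='2'> the text data </my_tag>")

# Attributes are available as a dictionary of strings.
print(element.attrib)
print(element.get("att_1"))
# A missing attribute yields None rather than an error.
print(element.get("att_3"))
```

Note that attribute values are always strings as far as the parser is concerned; interpreting "1" as a number is left to the application.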

Empty Element

Element content is anything that the XML specification allows to appear between a start-tag and an end-tag, such as text, sub-elements, comments and processing instructions. If an element has no content, the end-tag can be left out. In this case, a slash is added to the end of the start-tag to indicate that this is an empty element. An empty element may still have attributes, as shown below.

<my_empty_element att_1="1" att_2="2" />
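To a parser, the empty-element form is equivalent to a start-tag immediately followed by an end-tag. A minimal Python sketch using the standard library's xml.etree.ElementTree module compares the two spellings.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<my_empty_element att_1="1" att_2="2" />')
long_form = ET.fromstring('<my_empty_element att_1="1" att_2="2"></my_empty_element>')

# Both forms have the same name and attributes, no text and no sub-elements.
for element in (short_form, long_form):
    print(element.tag, element.attrib, element.text, len(element))
```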

Document Type Definition

The Uniform Resource Identifier (URI) in a document type declaration can point to a document known as a document type definition (DTD). The format for a DTD is defined in the XML Specification and is not the same as for an XML document. A DTD may contain a set of rules that specify how the different tags in an XML document can be used together and the attributes that may belong to each tag. Most XML processors provide checking of XML documents against a DTD, allowing applications to quickly and painlessly check that the structure of an XML document is roughly correct.

DTDs do not allow the specification of constraints on element and attribute content like “the value of the att_1 attribute must be a number”. This kind of validation can be handled by using XML Schema, the successor to DTDs which defines an XML-based file format.

Comment

A document author can place comments in XML documents to add annotations intended for other humans reading the document. The contents of a comment are not regarded as part of the document's data. A comment is started with a less-than sign, exclamation mark, and two hyphens, and is ended with two hyphens and a greater-than sign, as shown below. Comments may not be placed inside start- or end-tags.

<my_tag> content <!-- this is a comment --> </my_tag>
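The two rules about comments (they are not part of the data, and they may not appear inside a tag) can be checked with a parser. The Python sketch below uses the standard library's xml.etree.ElementTree module, whose default parser discards comments; the comment text itself is made up for the example.

```python
import xml.etree.ElementTree as ET

# A comment between the tags is accepted and dropped from the parsed data.
root = ET.fromstring("<my_tag> content <!-- a note for human readers --> </my_tag>")
print(root.text.split())
print(ET.tostring(root, encoding="unicode"))

# A comment placed inside a start-tag makes the document ill-formed.
try:
    ET.fromstring("<my_tag <!-- misplaced --> > content </my_tag>")
    comment_in_tag_accepted = True
except ET.ParseError:
    comment_in_tag_accepted = False
print(comment_in_tag_accepted)
```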

XML Namespace

Namespaces in XML is a companion specification to the main XML specification. It provides a facility for associating the elements and/or attributes in all or part of a document with a particular schema, as indicated by a URI. The key aspect of the URI is that it is unique. The value of the URI need not have anything to do with the XML document that uses it, although typically it would be a good location for the XML Schema or DTD that defines the rules for the document type. The URI may be mapped to a prefix which may then be used in front of tag and attribute names, separated by a colon. If not mapped to a prefix, the URI sets the default schema for the current element and all of its children.

A namespace declaration looks like an attribute on a start tag, but can be identified by the keyword xmlns. In the following example, the default namespace is set to the CellML namespace, and the MathML namespace is declared and mapped to the mathml prefix, which is then used on a <math> element. Note that the <model> element and any child elements with no default namespace declaration or namespace prefix (such as the <component> element) will be in the CellML namespace.

<model xmlns="http://www.cellml.org/cellml/1.1#" xmlns:mathml="http://www.w3.org/1998/Math/MathML">
  <component> ... </component>
  <mathml:math> ... math goes here ... </mathml:math>
</model>
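A namespace-aware parser resolves prefixes and defaults to the underlying URI. In the Python sketch below (standard library xml.etree.ElementTree), each parsed tag name is reported in the form {URI}name, which shows that <model> and <component> end up in the CellML namespace while <math> is in the MathML namespace. The short prefix m used in the lookup is an arbitrary choice for this example: only the URI it maps to matters.

```python
import xml.etree.ElementTree as ET

doc = """<model xmlns="http://www.cellml.org/cellml/1.1#"
       xmlns:mathml="http://www.w3.org/1998/Math/MathML">
  <component> ... </component>
  <mathml:math> ... math goes here ... </mathml:math>
</model>"""

root = ET.fromstring(doc)

# Tags are reported with the namespace URI in braces, not the prefix.
tags = [root.tag] + [child.tag for child in root]
for tag in tags:
    print(tag)

# A lookup may use any prefix, as long as it maps to the same URI.
math = root.find("m:math", {"m": "http://www.w3.org/1998/Math/MathML"})
print(math.text.strip())
```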