What is XML?

XML (eXtensible Markup Language) is a flexible, text-based format for storing and transporting structured data. It was defined by the World Wide Web Consortium (W3C) and first published as a W3C Recommendation on February 10, 1998.

For a comprehensive overview, see the Wikipedia article on XML.

Anatomy of an XML Document

A complete XML document contains several types of nodes. Here is an example with every major component labeled:

<?xml version="1.0" encoding="UTF-8"?>                             <!-- (1) -->
<!DOCTYPE library SYSTEM "library.dtd">                            <!-- (2) -->
<library name="Central">                                           <!-- (3) -->
    <!-- This is a comment -->                                     <!-- (4) -->
    <book id="1" category="fiction">                               <!-- (5) -->
        <title>The Great Gatsby</title>                            <!-- (6) -->
        <author>F. Scott Fitzgerald</author>                       <!-- (7) -->
        <notes><![CDATA[Said to be <inspired> by real events]]></notes>  <!-- (8) -->
        <?custom-processor run-at="save"?>                         <!-- (9) -->
    </book>
    <book id="2"/>                                                 <!-- (10) -->
</library>

Let’s go through each numbered component:

(1) XML Declaration — <?xml version="1.0" encoding="UTF-8"?>

The optional first line that identifies the document as XML and specifies the version and character encoding. If present, it must be the very first thing in the document.

(2) DOCTYPE Declaration — <!DOCTYPE library SYSTEM "library.dtd">

References a Document Type Definition (DTD) that defines the allowed structure and elements. An XML document that conforms to its DTD is called valid.

(3) Root Element — <library name="Central">

Every XML document has exactly one root (top-level) element that contains all other elements. Elements can have attributes (name="Central"), which are name-value pairs.

(4) Comments — <!-- This is a comment -->

Human-readable annotations that parsers can optionally preserve or skip. Comments carry no data for applications, although a parser that preserves them exposes them as comment nodes in the tree.

(5) Element with Attributes — <book id="1" category="fiction">

Elements are the building blocks of XML. Each element has a tag name (book) and zero or more attributes (id, category).

(6) Text Node (PCDATA) — The Great Gatsby

Parsed Character Data: the actual text content inside an element. “PCDATA” means the text is parsed, so entity references like &amp; are expanded.

(7) Child Element — <author> inside <book>

Elements nest inside parent elements, forming a tree structure.

(8) CDATA Section — <![CDATA[...]]>

Character Data sections allow you to include text that would otherwise be treated as markup. Inside CDATA, characters like <, >, and & lose their special meaning and are treated literally.

(9) Processing Instruction (PI) — <?custom-processor run-at="save"?>

Directives for applications processing the document. They carry application-specific information rather than document content.

(10) Empty (Self-Closing) Element — <book id="2"/>

An element with no content can be written as a self-closing tag with a trailing /> instead of a separate closing tag.

Elements

Elements are containers that hold data. Every element consists of:

A start tag — <tagname>
Zero or more attributes
Content — child elements, text, comments, etc.
An end tag — </tagname>

<book id="1">          <!-- start tag with attribute -->
    <title>Gatsby</title>  <!-- child element -->
</book>                <!-- end tag -->

Empty elements can use the shorthand syntax:

<br/>                  <!-- self-closing -->
<img src="photo.jpg"/>
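The parts of an element listed above can be inspected with any tree-building parser. The sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in (it is not the pygixml API) and also checks that the shorthand and long forms of an empty element parse to the same thing.

```python
import xml.etree.ElementTree as ET

# Parse the element shown above and inspect its parts.
book = ET.fromstring('<book id="1"><title>Gatsby</title></book>')
print(book.tag)                          # book
print(book.attrib)                       # {'id': '1'}
print([child.tag for child in book])     # ['title']
print(book.find("title").text)           # Gatsby

# Both spellings of an empty element yield no text and no children.
short_form = ET.fromstring("<br/>")
long_form = ET.fromstring("<br></br>")
print(short_form.text, len(short_form))  # None 0
print(long_form.text, len(long_form))    # None 0
```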

Attributes

Attributes are name-value pairs attached to elements:

<book id="1" category="fiction" lang="en">

Attribute names must be unique within an element
Attribute values must always be quoted (single or double quotes)
Order of attributes does not matter

Entity references are escape sequences for special characters. The five predefined ones can appear in attribute values and text:

Entity	Renders As
`&amp;`	`&`
`&lt;`	`<`
`&gt;`	`>`
`&quot;`	`"`
`&apos;`	`'`

Example: in <tag attr="x &amp; y"/>, the parsed value of attr is x & y.
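As a rough illustration of how entity references behave in both directions, the sketch below parses escaped input and then escapes raw strings for output. It uses Python's standard library (xml.etree.ElementTree and xml.sax.saxutils) rather than pygixml, and the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Reading: the parser expands entity references in attributes and text.
elem = ET.fromstring('<tag attr="x &amp; y">a &lt; b</tag>')
attr_value = elem.get("attr")
text_value = elem.text
print(attr_value)   # x & y
print(text_value)   # a < b

# Writing: escape raw strings before placing them in markup.
escaped_text = escape("a < b & c")
quoted_attr = quoteattr("x & y")   # adds the surrounding quotes as well
print(escaped_text)  # a &lt; b &amp; c
print(quoted_attr)   # "x &amp; y"
```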

Text Nodes: PCDATA vs CDATA

PCDATA (Parsed Character Data)

The default mode for text content. The parser interprets special characters:

<message>5 &lt; 10 &amp; 10 &gt; 3</message>

After parsing, the application receives: 5 < 10 & 10 > 3

CDATA (Character Data)

A section where the parser treats everything as literal text. No entity processing, no markup recognition:

<code><![CDATA[
    if (x < 10 && y > 3) {
        return x & y;
    }
]]></code>

The parser sees the text exactly as written — no escaping needed. CDATA sections begin with <![CDATA[ and end with ]]>.

When to use CDATA:

Embedding code snippets (HTML, JavaScript, SQL)
Including text that contains many < or & characters
Avoiding the tedious process of escaping special characters

When NOT to use CDATA:

When you need entity references to be expanded
Inside attribute values (CDATA sections are only valid in element content)
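To make the PCDATA/CDATA distinction concrete: the two forms are different ways of writing the same text. The sketch below, using Python's standard-library ElementTree (not pygixml), parses an escaped version and a CDATA version of one line of code and compares the results.

```python
import xml.etree.ElementTree as ET

# Same content, written once with entity references and once as CDATA.
escaped = ET.fromstring("<code>if (x &lt; 10 &amp;&amp; y &gt; 3)</code>")
cdata = ET.fromstring("<code><![CDATA[if (x < 10 && y > 3)]]></code>")

print(escaped.text)                 # if (x < 10 && y > 3)
print(escaped.text == cdata.text)   # True
```

ElementTree folds CDATA sections into ordinary text, so the distinction disappears after parsing. Parsers that keep a separate CDATA node type preserve it.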

Comments

Comments are annotations ignored by applications:

<!-- This is a comment -->
<!--
    Multi-line comments
    are also supported
-->

Comments cannot appear inside attribute values
Comments cannot be nested (<!-- <!-- --> --> is invalid)
The string -- cannot appear inside a comment
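Whether comments survive parsing depends on the parser configuration. As a sketch with Python's standard-library ElementTree (not pygixml; the insert_comments option assumes Python 3.8 or later), the default parser drops comments while an explicitly configured one keeps them as nodes.

```python
import xml.etree.ElementTree as ET

xml_text = "<library><!-- This is a comment --><book id='1'/></library>"

# Default behaviour: the comment is skipped.
default_root = ET.fromstring(xml_text)
default_children = [child.tag for child in default_root]
print(default_children)   # ['book']

# Opt in to keeping comments in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_root = ET.fromstring(xml_text, parser=parser)
comments = [child.text for child in kept_root if child.tag is ET.Comment]
print(comments)           # [' This is a comment ']
```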

Processing Instructions (PIs)

Processing instructions provide application-specific data:

<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<?custom-app mode="debug"?>

The target (e.g. xml-stylesheet) identifies the application
The data is passed as-is to that application
Targets beginning with xml (in any letter case) are reserved by the XML specification for W3C use

See the W3C specification on PIs.
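The target/data split can be seen by reading processing instructions back from a parsed document. This sketch uses Python's standard-library xml.dom.minidom, not pygixml. The collect_pis helper is a hypothetical name introduced for the example.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    '<root><?custom-app mode="debug"?></root>'
)

def collect_pis(node):
    """Return (target, data) pairs for every PI below node, in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append((child.target, child.data))
        found.extend(collect_pis(child))
    return found

pis = collect_pis(doc)
for target, data in pis:
    print(target, "->", data)
```

The data part is an uninterpreted string: although it often looks like attributes, the receiving application is responsible for parsing it.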

Namespaces

XML namespaces prevent name collisions when combining documents from different vocabularies:

<root xmlns:book="http://example.com/books"
      xmlns:store="http://example.com/store">
    <book:title>XML Guide</book:title>
    <store:title>Store Name Here</store:title>
</root>

xmlns:prefix="URI" declares a namespace
Elements and attributes can be qualified with a prefix
The default namespace (no prefix) applies to unprefixed elements, but not to unprefixed attributes: xmlns="http://example.com/default"

See W3C Namespaces in XML.
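A namespace-aware parser identifies an element by its namespace URI plus local name. The prefix is only shorthand. The sketch below shows this with Python's standard-library ElementTree (not pygixml), which writes qualified names as {URI}local. The prefix b in the lookup is an arbitrary choice for the example.

```python
import xml.etree.ElementTree as ET

xml_text = """<root xmlns:book="http://example.com/books"
      xmlns:store="http://example.com/store">
    <book:title>XML Guide</book:title>
    <store:title>Store Name Here</store:title>
</root>"""

root = ET.fromstring(xml_text)

# Prefixes are resolved to their URIs during parsing.
tags = [child.tag for child in root]
print(tags)

# Lookups may use any prefix, as long as it maps to the same URI.
ns = {"b": "http://example.com/books"}
book_title = root.find("b:title", ns).text
print(book_title)   # XML Guide
```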

The XML Tree Model

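An XML document can be represented as a tree of nodes, as in the DOM (Document Object Model). A simplified tree for the library example, omitting the notes element and the processing instruction: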

Document
└── Element: library
    ├── Comment: "This is a comment"
    ├── Element: book  (id="1")
    │   ├── Element: title
    │   │   └── Text: "The Great Gatsby"
    │   └── Element: author
    │       └── Text: "F. Scott Fitzgerald"
    └── Element: book  (id="2")

Every node in the tree has a type:

Node Type	Description
Document	The root of the tree (the entire document)
Element	A tag like `<book>` or `<title>`
Text	Character data inside an element
Comment	`<!-- ... -->` content
CDATA	Content inside a `<![CDATA[...]]>` section
PI	Processing instruction data
Declaration	The `<?xml ...?>` declaration
DOCTYPE	The `<!DOCTYPE ...>` declaration

In pygixml, you access the type via the node.type property.
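To see node types in practice without assuming anything about the pygixml API, the sketch below walks a small document with Python's standard-library xml.dom.minidom and prints each node's type. The NODE_TYPE_NAMES table and the walk helper are names made up for this example.

```python
from xml.dom import minidom, Node

# Map minidom's numeric node types to the names used in the table above.
NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.TEXT_NODE: "Text",
    Node.COMMENT_NODE: "Comment",
    Node.CDATA_SECTION_NODE: "CDATA",
    Node.PROCESSING_INSTRUCTION_NODE: "PI",
}

def walk(node, depth=0):
    """Yield (depth, type name, node name) for node and all its descendants."""
    yield depth, NODE_TYPE_NAMES.get(node.nodeType, "Other"), node.nodeName
    for child in node.childNodes:
        yield from walk(child, depth + 1)

doc = minidom.parseString(
    '<library><!-- This is a comment -->'
    '<book id="1"><title>The Great Gatsby</title>'
    '<notes><![CDATA[raw <text>]]></notes></book>'
    '<book id="2"/></library>'
)

for depth, type_name, name in walk(doc):
    print("    " * depth + f"{type_name}: {name}")
```

Attributes are not visited here because the DOM stores them on their element rather than as child nodes.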

XPath — Querying XML

XPath (XML Path Language) is a query language for selecting nodes from an XML document. It was developed by the W3C and reached version 1.0 in November 1999.

XPath lets you navigate the tree using path expressions:

Expression	Meaning
`/library`	Root element `library`
`/library/book`	All `book` children of `library`
`//book`	All `book` elements anywhere in the document
`book[@id='1']`	`book` elements with `id="1"`
`book[year > 1950]/title`	Titles of books published after 1950
`book[1]/author`	Author of the first book
`count(//book)`	Total number of books
`sum(book/price) div count(book)`	Average book price

pygixml supports the full XPath 1.0 specification. See XPath Support for a detailed guide.
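For a feel of how such expressions behave, here is a sketch using Python's standard-library ElementTree rather than pygixml. ElementTree implements only a subset of XPath 1.0 (no count(), sum(), or numeric comparisons), so those cases are done in plain Python here. The second book and the year elements are sample data added for the example.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring("""
<library>
    <book id="1"><title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author><year>1925</year></book>
    <book id="2"><title>Dune</title>
        <author>Frank Herbert</author><year>1965</year></book>
</library>""")

# //book : every book element anywhere below the root
all_books = library.findall(".//book")

# book[@id='1'] : attribute predicate
first_by_id = library.find("book[@id='1']")

# book[1]/author : positional predicate (1-based)
first_author = library.find("book[1]/author").text

# book[year > 1950]/title and count(//book), emulated in Python
recent_titles = [
    book.findtext("title")
    for book in library.findall("book")
    if int(book.findtext("year")) > 1950
]
book_count = len(all_books)

print(first_by_id.findtext("title"), first_author, recent_titles, book_count)
```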

Well-Formed vs Valid XML

Well-formed XML satisfies the basic syntactic rules:

Every start tag has a matching end tag (or is self-closing)
Elements are properly nested (no overlapping)
Attribute values are quoted
Exactly one root element
Entity references are properly formed
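Each rule above can be checked by handing a deliberately broken document to a parser. The sketch below uses Python's standard-library ElementTree (not pygixml). The is_well_formed helper and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses without a well-formedness error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

samples = {
    "well-formed": "<a><b/></a>",
    "missing end tag": "<a><b></a>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute value": "<a x=1/>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
    "malformed entity reference": "<a>AT&T</a>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'ok' if ok else 'rejected'}")
```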

Valid XML is well-formed and conforms to a DTD or XML Schema that defines its allowed structure:

<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to     (#PCDATA)>
  <!ELEMENT from   (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body    (#PCDATA)>
]>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

Note

pygixml checks for well-formedness only. It does not validate against DTDs or XML Schemas.
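The difference matters in practice: a non-validating parser accepts a document that breaks its own DTD. As an illustration with Python's standard-library ElementTree (also non-validating, and not the pygixml API), the note below omits three elements the DTD requires, yet it parses without error.

```python
import xml.etree.ElementTree as ET

# Well-formed, but invalid: the DTD requires to, from, heading and body.
incomplete_note = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to      (#PCDATA)>
  <!ELEMENT from    (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body    (#PCDATA)>
]>
<note><to>Tove</to></note>"""

note = ET.fromstring(incomplete_note)   # no error is raised
children = [child.tag for child in note]
print(children)   # ['to']
```

If required structure matters, check it in application code or run a separate validating tool over the document.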

Real-World Applications

XML is used across many industries. Key examples:

Web services — SOAP, RSS, Atom
Office files — .docx, .xlsx (Office Open XML)
Vector graphics — SVG
Build systems — Maven (pom.xml), MSBuild (.csproj)
Configuration — Android (AndroidManifest.xml), Spring, Apache
Scientific — MathML, SBML
Documentation — DocBook, DITA

Alternatives to XML

JSON — lighter syntax, dominant in web APIs (RFC 8259)
YAML — human-readable, used for configuration (yaml.org)
Protocol Buffers — binary serialization by Google (protobuf)

XML’s strengths remain: schema validation, namespaces, XPath/XQuery, and mature tooling across all platforms.
