What is XML?¶
XML (eXtensible Markup Language) is a flexible, text-based format for storing and transporting structured data. It was defined by the World Wide Web Consortium (W3C) and first published as a W3C Recommendation on February 10, 1998.
For a comprehensive overview, see the Wikipedia article on XML.
Anatomy of an XML Document¶
A complete XML document contains several types of nodes. Here is an example with every major component labeled:
<?xml version="1.0" encoding="UTF-8"?> <!-- (1) -->
<!DOCTYPE library SYSTEM "library.dtd"> <!-- (2) -->
<library name="Central"> <!-- (3) -->
  <!-- This is a comment --> <!-- (4) -->
  <book id="1" category="fiction"> <!-- (5) -->
    <title>The Great Gatsby</title> <!-- (6) -->
    <author>F. Scott Fitzgerald</author> <!-- (7) -->
    <notes><![CDATA[Said to be <inspired> by real events]]></notes> <!-- (8) -->
    <?custom-processor run-at="save"?> <!-- (9) -->
  </book>
  <book id="2"/> <!-- (10) -->
</library>
Let’s go through each numbered component:
1. XML Declaration: <?xml version="1.0" encoding="UTF-8"?>
   The optional first line that identifies the document as XML and specifies the version and character encoding. If present, it must be the very first thing in the document.
2. DOCTYPE Declaration: <!DOCTYPE library SYSTEM "library.dtd">
   References a Document Type Definition (DTD) that defines the allowed structure and elements. A well-formed XML document that also conforms to its DTD is called valid.
3. Root Element: <library name="Central">
   Every XML document has exactly one root (top-level) element that contains all other elements. Elements can have attributes (name="Central"), which are name-value pairs.
4. Comments: <!-- This is a comment -->
   Human-readable annotations that parsers can optionally preserve or skip. Comments carry no application data, although tree models such as the DOM can expose them as comment nodes.
5. Element with Attributes: <book id="1" category="fiction">
   Elements are the building blocks of XML. Each element has a tag name (book) and zero or more attributes (id, category).
6. Text Node (PCDATA): The Great Gatsby
   Parsed Character Data is the actual text content inside an element. “PCDATA” means the text is parsed, so entity references like &amp; are expanded.
7. Child Element: <author> inside <book>
   Elements nest inside parent elements, forming a tree structure.
8. CDATA Section: <![CDATA[...]]>
   Character Data sections allow you to include text that would otherwise be treated as markup. Inside CDATA, characters like <, >, and & lose their special meaning and are treated literally.
9. Processing Instruction (PI): <?custom-processor run-at="save"?>
   Directives for applications processing the document. They carry application-specific information rather than document content.
10. Empty (Self-Closing) Element: <book id="2"/>
   An element with no content can be written as a self-closing tag with a trailing /> instead of a separate closing tag.
Elements¶
Elements are containers that hold data. Every element consists of:
A start tag: <tagname>
Zero or more attributes
Content: child elements, text, comments, etc.
An end tag: </tagname>
<book id="1"> <!-- start tag with attribute -->
  <title>Gatsby</title> <!-- child element -->
</book> <!-- end tag -->
Empty elements can use the shorthand syntax:
<br/> <!-- self-closing -->
<img src="photo.jpg"/>
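As a rough illustration of how these parts surface after parsing, the sketch below uses Python's standard-library xml.etree.ElementTree (not pygixml, whose API is not shown in this section). The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Parse the small <book> fragment from above.
book = ET.fromstring('<book id="1"><title>Gatsby</title></book>')

tag_name = book.tag                          # name taken from the start tag
attributes = dict(book.attrib)               # attributes as a plain dict
child_tags = [child.tag for child in book]   # names of the child elements
title_text = book.find("title").text         # text content of <title>

# An empty element written as <br/> or as <br></br> parses to the same thing.
short_form = ET.fromstring("<br/>")
long_form = ET.fromstring("<br></br>")
same_element = (
    short_form.tag == long_form.tag
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)

print(tag_name, attributes, child_tags, title_text, same_element)
```

The self-closing form is purely a spelling choice: a parser reports the same empty element either way.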
Attributes¶
Attributes are name-value pairs attached to elements:
<book id="1" category="fiction" lang="en">
Attribute names must be unique within an element
Attribute values must always be quoted (single or double quotes)
Order of attributes does not matter
Entities — special character references that can appear in attribute values and text:
| Entity | Renders As |
|---|---|
| &lt; | < |
| &gt; | > |
| &amp; | & |
| &quot; | " |
| &apos; | ' |
Example: <tag attr="x &amp; y"/> renders as x & y.
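The round trip between entity references and literal characters can be sketched with Python's standard library. This is an illustration using xml.etree.ElementTree and xml.sax.saxutils, not pygixml; the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Reading: the parser expands predefined entities in attributes and text.
elem = ET.fromstring('<tag attr="x &amp; y">1 &lt; 2</tag>')
attr_value = elem.get("attr")
text_value = elem.text

# Writing: escape raw strings before placing them into markup.
escaped_text = escape("5 < 10 & 10 > 3")
# quoteattr() escapes the value and picks a quote style that fits it.
quoted_attr = quoteattr('say "hi" & bye')

# The quoted attribute can be dropped straight into a tag and read back.
round_trip = ET.fromstring(f"<t a={quoted_attr}/>").get("a")

print(attr_value, text_value, escaped_text, quoted_attr, round_trip)
```

Note that quoteattr() switches to single quotes when the value contains a double quote, which is why both quote styles are allowed around attribute values.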
Text Nodes: PCDATA vs CDATA¶
PCDATA (Parsed Character Data)¶
The default mode for text content. The parser interprets special characters:
<message>5 &lt; 10 &amp; 10 &gt; 3</message>
The parser sees: 5 < 10 & 10 > 3
CDATA (Character Data)¶
A section where the parser treats everything as literal text. No entity processing, no markup recognition:
<code><![CDATA[
if (x < 10 && y > 3) {
return x & y;
}
]]></code>
The parser sees the text exactly as written — no escaping needed.
CDATA sections begin with <![CDATA[ and end with ]]>.
When to use CDATA:
Embedding code snippets (HTML, JavaScript, SQL)
Including text that contains many < or & characters
Avoiding the tedious process of escaping special characters
When NOT to use CDATA:
When you need entity references to be expanded
Inside attribute values (CDATA sections are only valid in element content)
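To see that PCDATA with entity references and a CDATA section are two spellings of the same text, here is a small sketch using Python's standard-library xml.etree.ElementTree (an assumption for illustration; pygixml may expose CDATA differently, for example as its own node type).

```python
import xml.etree.ElementTree as ET

# Same content written twice: once escaped, once inside a CDATA section.
escaped = ET.fromstring("<code>if (x &lt; 10 &amp;&amp; y &gt; 3)</code>")
cdata = ET.fromstring("<code><![CDATA[if (x < 10 && y > 3)]]></code>")

# ElementTree reports plain text in both cases, so the results compare equal.
same_text = escaped.text == cdata.text

print(repr(escaped.text), repr(cdata.text), same_text)
```

The choice between the two forms affects only how the document is written, not the character data an application receives.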
Processing Instructions (PIs)¶
Processing instructions provide application-specific data:
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<?custom-app mode="debug"?>
The target (e.g. xml-stylesheet) identifies the application
The data is passed as-is to that application
PIs starting with xml (case-insensitive) are reserved for W3C use
See the W3C specification on PIs.
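The target/data split can be observed with Python's standard-library xml.dom.minidom. This is a sketch for illustration only (not pygixml); collect_pis is a hypothetical helper name.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    '<root><?custom-app mode="debug"?></root>'
)

def collect_pis(node):
    """Return (target, data) pairs for every PI at or below the given node."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
            found.append((child.target, child.data))
        found.extend(collect_pis(child))
    return found

pis = collect_pis(doc)
for target, data in pis:
    print(target, "->", data)
```

The data arrives as one uninterpreted string. Anything that looks like attributes inside a PI (such as type="text/xsl") has to be parsed by the receiving application itself.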
Namespaces¶
XML namespaces prevent name collisions when combining documents from different vocabularies:
<root xmlns:book="http://example.com/books"
      xmlns:store="http://example.com/store">
  <book:title>XML Guide</book:title>
  <store:title>Store Name Here</store:title>
</root>
xmlns:prefix="URI" declares a namespace and binds it to a prefix
Elements and attributes can be qualified with a prefix
The default namespace (no prefix) applies to unprefixed elements, but not to unprefixed attributes:
xmlns="http://example.com/default"
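A namespace-aware parser identifies an element by its namespace URI plus local name, not by the prefix used in the file. The sketch below shows this with Python's standard-library xml.etree.ElementTree (used here for illustration, not pygixml); the prefix "b" in the lookup is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<root xmlns:book="http://example.com/books"
      xmlns:store="http://example.com/store">
  <book:title>XML Guide</book:title>
  <store:title>Store Name Here</store:title>
</root>
""")

# ElementTree spells qualified names as "{namespace-uri}local-name".
qualified_tags = [child.tag for child in root]

# The prefix used for lookup is independent of the prefix in the document.
ns = {"b": "http://example.com/books"}
book_title = root.find("b:title", ns).text

# A default namespace applies to the unprefixed elements in its scope.
default_doc = ET.fromstring('<root xmlns="http://example.com/default"><item/></root>')
default_tags = [default_doc.tag, default_doc[0].tag]

print(qualified_tags, book_title, default_tags)
```

Both title elements share a local name but remain distinct because their namespace URIs differ.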
The XML Tree Model¶
An XML document is represented as a DOM tree (Document Object Model):
Document
└── Element: library
├── Comment: "This is a comment"
├── Element: book (id="1")
│ ├── Element: title
│ │ └── Text: "The Great Gatsby"
│ └── Element: author
│ └── Text: "F. Scott Fitzgerald"
└── Element: book (id="2")
Every node in the tree has a type:
| Node Type | Description |
|---|---|
| Document | The root of the tree (the entire document) |
| Element | A tag like <book> |
| Text | Character data inside an element |
| Comment | <!-- ... --> |
| CDATA | Content inside a <![CDATA[...]]> section |
| PI | Processing instruction data |
| Declaration | The <?xml ...?> declaration |
| DOCTYPE | The <!DOCTYPE ...> declaration |
In pygixml, you access the type via the node.type property.
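For comparison, the same idea of typed nodes can be sketched with Python's standard-library xml.dom.minidom. This is not pygixml's API: minidom uses numeric nodeType constants and has no separate node for the XML declaration. The NODE_TYPE_NAMES mapping and node_types helper are hypothetical names for this example.

```python
from xml.dom import minidom, Node

# Map minidom's numeric node types onto the names used in the table above.
NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "Document",
    Node.DOCUMENT_TYPE_NODE: "DOCTYPE",
    Node.ELEMENT_NODE: "Element",
    Node.TEXT_NODE: "Text",
    Node.COMMENT_NODE: "Comment",
    Node.CDATA_SECTION_NODE: "CDATA",
    Node.PROCESSING_INSTRUCTION_NODE: "PI",
}

xml_text = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE library>'
    '<library name="Central">'
    '<!-- This is a comment -->'
    '<book id="1"><title>The Great Gatsby</title>'
    '<notes><![CDATA[Said to be <inspired> by real events]]></notes>'
    '<?custom-processor run-at="save"?>'
    '</book>'
    '<book id="2"/>'
    '</library>'
)

def node_types(node):
    """Return type names for a node and its descendants, in document order."""
    names = [NODE_TYPE_NAMES.get(node.nodeType, "Other")]
    for child in node.childNodes:
        names.extend(node_types(child))
    return names

types = node_types(minidom.parseString(xml_text))
print(types)
```

Attributes do not appear in the walk because they are attached to their element rather than stored as child nodes.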
XPath — Querying XML¶
XPath (XML Path Language) is a query language for selecting nodes from an XML document. It was developed by the W3C and reached version 1.0 in November 1999.
XPath lets you navigate the tree using path expressions:
| Expression | Meaning |
|---|---|
| | Root element |
| | All |
| | All |
| | Titles of books published after 1950 |
| | Author of the first book |
| | Total number of books |
| | Average book price |
pygixml supports the full XPath 1.0 specification. See XPath Support for a detailed guide.
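The kinds of queries listed above can be approximated with Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath: simple paths and predicates work, while comparisons and functions such as count() are done in Python instead. This is a sketch, not pygixml code, and the sample document (including the year and price values) is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for this example.
library = ET.fromstring("""
<library>
  <book id="1" year="1925"><title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author><price>10.00</price></book>
  <book id="2" year="1960"><title>To Kill a Mockingbird</title>
    <author>Harper Lee</author><price>14.00</price></book>
</library>
""")

# Path expressions supported directly by ElementTree.
all_titles = [t.text for t in library.findall(".//book/title")]
first_author = library.find("./book[1]/author").text
title_by_id = library.find(".//book[@id='2']/title").text

# XPath 1.0 would express this as //book[@year > 1950]/title.
recent_titles = [
    b.findtext("title") for b in library.iter("book") if int(b.get("year")) > 1950
]

# XPath 1.0: count(//book) and sum(//book/price) div count(//book).
books = library.findall("book")
book_count = len(books)
average_price = sum(float(b.findtext("price")) for b in books) / book_count

print(all_titles, first_author, title_by_id, recent_titles, book_count, average_price)
```

A full XPath 1.0 engine evaluates the expressions in the comments directly, without the Python-side filtering and arithmetic.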
Well-Formed vs Valid XML¶
Well-formed XML satisfies the basic syntactic rules:
Every start tag has a matching end tag (or is self-closing)
Elements are properly nested (no overlapping)
Attribute values are quoted
Exactly one root element
Entity references are properly formed
Valid XML is well-formed and conforms to a DTD or XML Schema that defines its allowed structure:
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
Note
pygixml checks for well-formedness only. It does not validate against DTDs or XML Schemas.
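The difference between the two checks can be sketched with a non-validating parser from Python's standard library (xml.etree.ElementTree, used here for illustration rather than pygixml). The is_well_formed helper and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses; a non-validating parser checks nothing more."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

samples = {
    "ok": "<note><to>Tove</to></note>",
    "unclosed tag": "<note><to>Tove</note>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute": "<note id=1/>",
    "two root elements": "<a/><b/>",
    "bare ampersand": "<a>x & y</a>",
    # Breaks its own DTD (<from> is not allowed), yet still well-formed.
    "invalid but well-formed": "<!DOCTYPE note [<!ELEMENT note (to)>]><note><from/></note>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

The last sample parses successfully even though it violates its DTD, which is exactly the gap between well-formedness checking and validation.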
Real-World Applications¶
XML is used in virtually every industry. Key examples:
Alternatives to XML¶
JSON — lighter syntax, dominant in web APIs (RFC 8259)
YAML — human-readable, used for configuration (yaml.org)
Protocol Buffers — binary serialization by Google (protobuf)
XML’s strengths remain: schema validation, namespaces, XPath/XQuery, and mature tooling across all platforms.
Comments¶
Comments are annotations that applications normally ignore:
Comments cannot appear inside attribute values
Comments cannot be nested (<!-- <!-- nested --> --> is not well-formed)
The string -- cannot appear inside a comment