Advanced Tips

Performance tuning, parse flags, and advanced usage patterns for pygixml.

Parse Flags

All parse functions accept a ParseFlags value that controls how pugixml processes the input. By default, pygixml uses ParseFlags.DEFAULT, which enables all standard XML processing. You can trade strictness for speed when you know your input is clean.

Quick Example

import pygixml

# Fastest possible parse — skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)

# Combine specific flags with bitwise OR
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)

The same flags apply to parse_string(), parse_file(), load_string(), and load_file().
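Combining flags with | works the way bit flags usually do in Python. The sketch below uses a hypothetical DemoFlags enum built on the standard library's enum.IntFlag to show how a combined value can be built and inspected. The member names and values are illustrative assumptions, and the real pygixml.ParseFlags type may be implemented differently.

```python
from enum import IntFlag


class DemoFlags(IntFlag):
    """Hypothetical stand-in; real pygixml.ParseFlags values may differ."""
    MINIMAL = 0
    COMMENTS = 1
    CDATA = 2
    ESCAPES = 4


# Build a combined value with bitwise OR, as in the example above.
flags = DemoFlags.COMMENTS | DemoFlags.CDATA

# Test for individual bits with bitwise AND or the `in` operator.
has_cdata = bool(flags & DemoFlags.CDATA)
has_escapes = DemoFlags.ESCAPES in flags

# Clear one bit while keeping the others.
without_comments = flags & ~DemoFlags.COMMENTS

print(int(flags), has_cdata, has_escapes)  # 3 True False
```

Because each flag occupies its own bit, the order in which flags are combined does not matter.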

Available Flags

| Flag | What it enables |
| --- | --- |
| ParseFlags.MINIMAL | No optional processing; the fastest parse. Skips escapes, EOL normalization, and all whitespace handling. |
| ParseFlags.COMMENTS | Parse <!--comment--> nodes. Without this flag, comments are silently skipped. |
| ParseFlags.CDATA | Parse <![CDATA[...]]> sections. Without this flag, CDATA content is treated as regular PCDATA. |
| ParseFlags.PI | Parse processing instructions (<?target data?>). |
| ParseFlags.ESCAPES | Process entity references (&amp;, &lt;, &quot;, etc.). Disabling this leaves &amp; as literal text. |
| ParseFlags.EOL | Normalize line endings (\r\n, \r) to \n. |
| ParseFlags.WS_PCDATA | Convert whitespace characters in PCDATA to spaces. |
| ParseFlags.WS_PCDATA_SINGLE | Collapse consecutive whitespace in PCDATA to a single space. |
| ParseFlags.WCONV_ATTRIBUTE | Convert attribute whitespace (tabs, newlines) to spaces. |
| ParseFlags.WNORM_ATTRIBUTE | Normalize attribute whitespace (trim leading/trailing, collapse consecutive). |
| ParseFlags.DECLARATION | Parse the <?xml version="1.0" ...?> declaration node. |
| ParseFlags.DOCTYPE | Parse the <!DOCTYPE ...> node. |
| ParseFlags.TRIM_PCDATA | Trim leading and trailing PCDATA whitespace. |
| ParseFlags.FRAGMENT | Parse XML fragments that lack a root element. Useful for processing partial documents. |
| ParseFlags.EMBED_PCDATA | Parse embedded PCDATA as markup. Handles cases where escaped XML appears inside text content. |
| ParseFlags.MERGE_PCDATA | Merge adjacent PCDATA nodes into a single node. |
| ParseFlags.DEFAULT | All standard processing enabled. This is the default when no flag is specified. |
| ParseFlags.FULL | Same as DEFAULT: full XML compliance. |

When to Use MINIMAL

ParseFlags.MINIMAL is the fastest parse mode. It skips:

  • Escape processing (&amp; stays as &amp;)

  • EOL normalization

  • Attribute whitespace conversion/normalization

  • PCDATA whitespace handling
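If escape processing is skipped but a few fields still need decoded text, one option is to decode them on demand in Python. The sketch below uses the standard library's xml.sax.saxutils.unescape on a sample string. The helper name decode_entities and the sample text are assumptions for illustration, and numeric character references such as &#38; are not handled here.

```python
from xml.sax.saxutils import unescape

# unescape() always handles &amp;, &lt; and &gt;. The extra mapping adds
# the two remaining predefined XML entities.
_EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def decode_entities(raw_text):
    """Decode the five predefined XML entities in otherwise unprocessed text."""
    return unescape(raw_text, _EXTRA_ENTITIES)


# Sample of text as it might look when escapes were left unprocessed (assumed).
raw = "Tom &amp; Jerry say &quot;hi&quot; &lt;3"
decoded = decode_entities(raw)
print(decoded)  # Tom & Jerry say "hi" <3
```

Decoding only the values you actually read keeps the cost proportional to the fields used rather than to the size of the document.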

Use it when:

  • You control the XML source and know it has no escapes

  • You only need element structure, not text formatting

  • You’re processing large documents in a hot path

On real-world XML with lots of escaped content, MINIMAL can be up to ~16% faster than DEFAULT.
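The gain depends heavily on the input, so it is worth measuring on your own documents. Below is a minimal timing sketch built on the standard library's timeit. The helper name time_parsers and the synthetic payload are assumptions, and two standard-library parsers are used as stand-ins so the block runs without pygixml installed.

```python
import timeit
import xml.etree.ElementTree as ET
from xml.dom import minidom


def time_parsers(parsers, payload, repeat=3, number=5):
    """Return {label: best seconds per call} for each parser callable."""
    results = {}
    for label, parse in parsers.items():
        runs = timeit.repeat(lambda: parse(payload), repeat=repeat, number=number)
        results[label] = min(runs) / number  # best run, averaged per call
    return results


# Synthetic payload with many escaped ampersands (assumed workload).
payload = "<root>" + "<item>a &amp; b</item>" * 500 + "</root>"

# Stand-in parsers. With pygixml installed you could pass, for example:
#   {"DEFAULT": lambda s: pygixml.parse_string(s),
#    "MINIMAL": lambda s: pygixml.parse_string(s, pygixml.ParseFlags.MINIMAL)}
timings = time_parsers({"etree": ET.fromstring, "minidom": minidom.parseString}, payload)

for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{label}: {seconds * 1e6:.1f} us per parse")
```

Taking the minimum over several repeats reduces noise from other processes; absolute numbers will vary by machine and input.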

Working with Text: value, child_value(), and text()

pygixml hides pugixml’s internal text-node structure behind a simpler interface. In pugixml, elements do not hold text directly; they contain child text nodes. pygixml manages those nodes for you and offers several ways to get and set text, depending on your needs.

Setting Text Content

You can set text on an element, and pygixml will automatically create or replace the underlying text node:

item = root.append_child("item")
item.value = "Hello World"  # Automatically creates/replaces a text child

You can also modify the text node directly. Both approaches work because the element-level accessor operates on that same underlying text node:

# Setting via element (creates/replaces child)
item.value = "Hello"

# Setting via the text node directly (equivalent result)
item.first_child().value = "World"

Reading Text Content

Depending on how much of the tree you need to read, choose the right accessor:

value — direct access

Returns the raw value of the node. For text nodes, this is the content. For element nodes, this returns the value of the first text/CDATA child (or None if no text exists). This is a convenient shortcut for reaching the underlying child node.

item.value  # Returns "Hello" (from first text child)

child_value() — targeted single child

Returns the text of the first child element, or the child with a specific tag. It does not recurse. Best for simple key-value XML where elements hold a single text node.

doc.root.child_value()          # "Hello"
doc.root.child_value("title")   # "Python 101"

text() — full recursive extraction

Walks the entire subtree, collecting all text and CDATA nodes, and joins them. Use this when you need to extract all text from mixed content.

doc.root.text()             # "Hello\nworld!\nThis is\nrich\ntext."
doc.root.text(join=" ")     # "Hello world! This is rich text."
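To make the idea of a recursive walk concrete, here is a rough analogy written against the standard library's xml.etree.ElementTree rather than pygixml. The function name collect_text and the choice to strip each chunk and drop empty ones are assumptions made for illustration; they are not a description of how pygixml implements text().

```python
import xml.etree.ElementTree as ET


def collect_text(element, join="\n"):
    """Gather every text chunk in the subtree, in document order, and join them."""
    chunks = []

    def walk(node):
        if node.text:
            chunks.append(node.text)       # text before the first child
        for child in node:
            walk(child)                    # descend into the child subtree
            if child.tail:
                chunks.append(child.tail)  # text that follows the child

    walk(element)
    stripped = (chunk.strip() for chunk in chunks)
    return join.join(chunk for chunk in stripped if chunk)


root = ET.fromstring("<p>Hello <b>world!</b> This is <i>rich</i> text.</p>")
print(collect_text(root, join=" "))  # Hello world! This is rich text.
```

The walk visits every descendant, so its cost grows with the size of the subtree; for a single known child, a targeted accessor is cheaper.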

Summary Table

| Method | Scope | Best For |
| --- | --- | --- |
| element.value | First text child (or None) | Quick read/write of simple element text |
| element.first_child().value | The text node directly | Direct manipulation of the text node |
| child_value() | First child element’s text | Key-value XML structures |
| text() | Entire subtree (recursive) | Mixed content, documents, rich text |

XPathQuery for Repeated Queries

When running the same XPath query multiple times, use XPathQuery to compile once and evaluate many times:

# ✅ Good: compile once, evaluate many times
query = pygixml.XPathQuery("book[@category='fiction']")
for i in range(1000):
    results = query.evaluate_node_set(root)

# ❌ Bad: re-compile every iteration
for i in range(1000):
    results = root.select_nodes("book[@category='fiction']")
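When the same expressions are used from several places in a program, a small cache can make sure each one is compiled only once. The sketch below wraps any compile function with functools.lru_cache. The helper name make_query_cache and the counting stand-in fake_compile are hypothetical; with pygixml you would presumably pass pygixml.XPathQuery as the compile function.

```python
from functools import lru_cache


def make_query_cache(compile_query, maxsize=128):
    """Return a lookup function that compiles each expression at most once."""

    @lru_cache(maxsize=maxsize)
    def get_query(expression):
        return compile_query(expression)

    return get_query


# Stand-in compiler that records how often it is called (illustration only).
compile_calls = []


def fake_compile(expression):
    compile_calls.append(expression)
    return ("compiled", expression)


get_query = make_query_cache(fake_compile)

for _ in range(1000):
    query = get_query("book[@category='fiction']")

print(len(compile_calls))  # 1
```

The cache is keyed on the expression string, so two spellings of the same query are compiled separately.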

Be Specific in XPath Expressions

Avoid the descendant-axis search (//) when you know the structure:

# ✅ Good: specific path
books = root.select_nodes("library/book")

# ❌ Bad: scans entire document
books = root.select_nodes("//book")

Use Attributes for Filtering

Attribute comparisons are faster than text-node comparisons:

# ✅ Good: fast attribute comparison
books = root.select_nodes("book[@id='123']")

# ❌ Bad: slower text comparison
books = root.select_nodes("book[id='123']")

Limit Result Sets

Limit results in the query rather than slicing in Python:

# ✅ Good: limit in the query
first_10 = root.select_nodes("book[position() <= 10]")

# ❌ Bad: fetch all then slice
all_books = root.select_nodes("book")
first_10 = all_books[:10]

Document Reuse

Reuse an XMLDocument with reset() to avoid repeated allocations when processing many files:

doc = pygixml.XMLDocument()

for filename in file_list:
    doc.reset()              # Clear existing content
    doc.load_file(filename)
    # ... process ...

Processing Large Files

For documents with thousands of elements, use XPath to select only what you need rather than walking the entire tree in Python:

# ✅ Fast: let XPath filter in C++
fiction = root.select_nodes("book[@category='fiction']")

# ❌ Slower: filter in Python with per-node calls
fiction = []                 # collect matches in a plain Python list
book = root.first_child()
while book:
    cat = book.attribute("category")
    if cat and cat.value == "fiction":
        fiction.append(book)
    book = book.next_sibling

Every .child(), .attribute(), and .text() call crosses the Python↔Cython boundary. Minimizing these calls in tight loops has the biggest impact on traversal speed.

Node Identity and Fast Lookup

Each XMLNode exposes a mem_id — a unique numeric identifier derived from the node’s internal address. Unlike pugixml, which works exclusively with C++ object references, pygixml makes this identifier available as a plain Python integer.

Because mem_id is hashable, it is ideal for use as a dictionary key — a common pattern when building indexes, caches, or associating extra data with specific nodes:

# Cache node metadata by mem_id
cache = {}
for node in doc:
    cache[node.mem_id] = {
        "xpath": node.xpath,
        "depth": node.xpath.count("/"),
    }

There are two ways to look up a node by its identifier:

find_mem_id() — safe, O(n)

Walks the tree from the current node, comparing identifiers. Returns None if the node is not found.

node_id = item.mem_id
found = root.find_mem_id(node_id)   # safe, but O(n)

from_mem_id_unsafe() — instant, O(1)

Reconstructs an XMLNode directly from the identifier. No tree traversal — the lookup is instantaneous.

node_id = item.mem_id
node = pygixml.XMLNode.from_mem_id_unsafe(node_id)  # O(1)

⚠️ Warning: If the identifier is stale (the document was freed or the node was deleted), calling methods on the returned object may cause a segmentation fault. Use this only when you are certain the identifier still belongs to a live node.

Which to choose? For most code, find_mem_id is the right choice — it’s safe and fast enough for typical use. from_mem_id_unsafe is reserved for performance-critical hot paths where you’ve profiled and confirmed that the O(n) tree walk is a bottleneck.
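A possible middle ground, when many lookups are needed against a document that is not being modified, is to build a dictionary from mem_id to node once and look nodes up in it afterwards. The sketch below shows the pattern with a hypothetical FakeNode class standing in for XMLNode so that it runs without pygixml; build_index is also an illustrative name, not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for XMLNode so the sketch runs without pygixml."""
    mem_id: int
    name: str


def build_index(nodes):
    """Map each node's mem_id to the node in a single O(n) pass."""
    return {node.mem_id: node for node in nodes}


nodes = [FakeNode(1001, "library"), FakeNode(1002, "book"), FakeNode(1003, "title")]

# With pygixml this might be build_index(doc), assuming the document can be
# iterated as in the caching example above.
index = build_index(nodes)

print(index[1002].name)  # book
print(index.get(9999))   # None
```

The index pays the O(n) walk once and answers later lookups in constant time, and unknown identifiers return None instead of producing an invalid node. It reflects the tree only as it was when built, so it should be rebuilt after nodes are added or removed.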