Advanced Tips¶

Performance tuning, parse flags, and advanced usage patterns for pygixml.

Parse Flags¶

All parse functions accept a ParseFlags enum to control exactly how pugixml processes the input. By default, pygixml uses ParseFlags.DEFAULT, which enables all standard XML processing. You can trade strictness for speed when you know your input is clean.

Quick Example¶

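import pygixml

# Sample input; any XML string works here
xml = "<root><!-- note --><item><![CDATA[raw]]></item></root>"

# Fastest possible parse: skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)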

# Combine specific flags with bitwise OR
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)

The same flags apply to parse_string(), parse_file(), load_string(), and load_file().
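To make the bitwise mechanics in the Quick Example concrete, here is a minimal sketch built on Python's standard `enum.IntFlag`. The `DemoFlags` class and its numeric values are hypothetical stand-ins, not pygixml's actual definitions. Only the OR, membership, and AND-NOT behaviour is meant to carry over, assuming ParseFlags behaves like an ordinary flag enum.

```python
from enum import IntFlag


class DemoFlags(IntFlag):
    """Hypothetical stand-in for a parse-flag enum; values are illustrative."""
    MINIMAL = 0
    COMMENTS = 1
    CDATA = 2
    ESCAPES = 4


# Combine flags with bitwise OR, as in the Quick Example.
flags = DemoFlags.COMMENTS | DemoFlags.CDATA

# Test whether a particular flag is part of the combination.
has_comments = DemoFlags.COMMENTS in flags
has_escapes = DemoFlags.ESCAPES in flags

# Remove a flag again with AND-NOT.
without_cdata = flags & ~DemoFlags.CDATA
```

Starting from a broad set and masking out individual flags is often easier to read than listing every flag you want to keep.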

Available Flags¶

Flag	What it enables
`ParseFlags.MINIMAL`	No optional processing — fastest parse. Skips escapes, EOL normalization, and all whitespace handling.
`ParseFlags.COMMENTS`	Parse `<!--comment-->` nodes. Without this flag, comments are silently skipped.
`ParseFlags.CDATA`	Parse `<![CDATA[...]]>` sections into CDATA nodes. In pugixml, CDATA sections are left out of the tree when this flag is off.
`ParseFlags.PI`	Parse processing instructions (`<?target data?>`).
`ParseFlags.ESCAPES`	Process entity references (`&amp;`, `&lt;`, `&quot;`, etc.). Disabling this leaves `&amp;` as literal text.
`ParseFlags.EOL`	Normalize line endings (`\r\n`, `\r`) to `\n`.
`ParseFlags.WS_PCDATA`	Keep PCDATA nodes that consist only of whitespace. In pugixml, such nodes are dropped from the tree when this flag is off.
`ParseFlags.WS_PCDATA_SINGLE`	Keep a whitespace-only PCDATA node when it is the only child of its parent.
`ParseFlags.WCONV_ATTRIBUTE`	Convert attribute whitespace (tabs, newlines) to spaces.
`ParseFlags.WNORM_ATTRIBUTE`	Normalize attribute whitespace (trim leading/trailing, collapse consecutive).
`ParseFlags.DECLARATION`	Parse the `<?xml version="1.0" ...?>` declaration node.
`ParseFlags.DOCTYPE`	Parse the `<!DOCTYPE ...>` node.
`ParseFlags.TRIM_PCDATA`	Trim leading and trailing PCDATA whitespace.
`ParseFlags.FRAGMENT`	Parse XML fragments that lack a root element. Useful for processing partial documents.
`ParseFlags.EMBED_PCDATA`	Store an element's leading PCDATA as the element's own value instead of creating a separate text node. In pugixml, this applies only when the PCDATA is the element's first child.
`ParseFlags.MERGE_PCDATA`	Merge adjacent PCDATA nodes into a single node.
`ParseFlags.DEFAULT`	All standard processing enabled. This is the default when no flag is specified.
`ParseFlags.FULL`	Full XML compliance. In pugixml, `parse_full` is `parse_default` plus the comment, PI, declaration, and DOCTYPE node types.

When to Use MINIMAL¶

ParseFlags.MINIMAL is the fastest parse mode. It skips:

Escape processing (&amp;amp; stays as &amp;amp;)
EOL normalization
Attribute whitespace conversion/normalization
PCDATA whitespace handling

Use it when:

You control the XML source and know it has no escapes
You only need element structure, not text formatting
You’re processing large documents in a hot path
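If you are unsure whether an input is free of escapes, a cheap pre-check can decide whether skipping escape processing is safe. The sketch below uses only the standard library. The helper name `has_entity_references` is hypothetical, and the pattern is a deliberately simple approximation: it can report a false positive for text inside a CDATA section, which errs on the side of keeping escape processing on.

```python
import re

# Matches named (&amp;), decimal (&#38;), and hexadecimal (&#x26;) references.
_ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def has_entity_references(xml_text: str) -> bool:
    """Return True if the text contains at least one entity or character reference."""
    return _ENTITY_RE.search(xml_text) is not None
```

A caller could then fall back to the default flags whenever the check returns True and use the minimal flags otherwise.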

On real-world XML with lots of escaped content, MINIMAL can be up to ~16% faster than DEFAULT.
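Speedups of this kind depend heavily on the input, so it is worth measuring on your own documents. The following is a small timing sketch using the standard `timeit` module. The helper names `best_time` and `speedup_percent` are hypothetical, and the parse function is passed in as a callable so the harness makes no assumptions about the parser.

```python
import timeit
from typing import Callable


def best_time(parse: Callable[[str], object], xml_text: str,
              repeat: int = 5, number: int = 20) -> float:
    """Return the best per-call time in seconds over several timing runs."""
    runs = timeit.repeat(lambda: parse(xml_text), repeat=repeat, number=number)
    return min(runs) / number


def speedup_percent(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is faster than `baseline`."""
    return (baseline - candidate) / baseline * 100.0
```

With pygixml, the two callables would presumably be `lambda s: pygixml.parse_string(s)` and `lambda s: pygixml.parse_string(s, pygixml.ParseFlags.MINIMAL)`. Taking the minimum over several runs reduces noise from other processes.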

Working with Text: `value`, `child_value()`, and `text()`¶

pygixml hides pugixml’s internal text-node structure. In pugixml, elements do not hold text directly; they contain child text nodes. pygixml handles this for you and offers several ways to get and set text, depending on your needs.

Setting Text Content¶

You can set text on an element, and pygixml will automatically create or replace the underlying text node:

item = root.append_child("item")
item.value = "Hello World"  # Automatically creates/replaces a text child

You can also modify the text node directly. Both approaches work because pygixml maps element-level access onto the underlying text node:

# Setting via element (creates/replaces child)
item.value = "Hello"

# Setting via the text node directly (equivalent result)
item.first_child().value = "World"

Reading Text Content¶

Depending on how much of the tree you need to read, choose the right accessor:

value — direct access

Returns the raw value of the node. For text nodes, this is the content. For element nodes, this returns the value of the first text/CDATA child (or None if no text exists). This is a convenient shortcut that saves you from accessing the text child yourself.

item.value  # Returns "Hello" (from first text child)

child_value() — targeted single child

Returns the text of the first child element, or the child with a specific tag. It does not recurse. Best for simple key-value XML where elements hold a single text node.

doc.root.child_value()          # "Hello"
doc.root.child_value("title")   # "Python 101"

text() — full recursive extraction

Walks the entire subtree, collecting all text and CDATA nodes, and joins them. Use this when you need to extract all text from mixed content.

doc.root.text()             # "Hello\nworld!\nThis is\nrich\ntext."
doc.root.text(join=" ")     # "Hello world! This is rich text."

Summary Table¶

Method	Scope	Best For
`element.value`	First text child (or `None`)	Quick read/write of simple element text
`element.first_child().value`	The text node directly	Direct manipulation of the text node
`child_value()`	First child element’s text	Key-value XML structures
`text()`	Entire subtree (recursive)	Mixed content, documents, rich text
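The difference between reading only the first text child and collecting text recursively can be seen with any DOM-like API. The sketch below uses the standard library's `xml.etree.ElementTree` purely as an analogy. It does not call pygixml, and ElementTree's `text` and `itertext()` are not the same functions as pygixml's accessors.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>world</b>! This is <i>rich</i> text.</p>")

# Analogue of reading only the first text child of the element.
first_text = root.text

# Analogue of a recursive walk that joins every text node in the subtree.
all_text = "".join(root.itertext())
```

For mixed content like this, the first-text-child view stops at the first nested element, which is why a recursive accessor is the better fit for rich text.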

XPathQuery for Repeated Queries¶

When running the same XPath query multiple times, use XPathQuery to compile once and evaluate many times:

# ✅ Good: compile once, evaluate many times
query = pygixml.XPathQuery("book[@category='fiction']")
for i in range(1000):
    results = query.evaluate_node_set(root)

# ❌ Bad: re-compile every iteration
for i in range(1000):
    results = root.select_nodes("book[@category='fiction']")
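When query strings are built in several places, a small memoizing wrapper is one way to keep the compile-once behaviour without passing query objects around. This is a sketch, not part of pygixml: `make_query_cache` is a hypothetical helper, and the compile function is injected so the example runs without the library.

```python
from typing import Callable, Dict, TypeVar

Q = TypeVar("Q")


def make_query_cache(compile_fn: Callable[[str], Q]) -> Callable[[str], Q]:
    """Wrap `compile_fn` so each distinct expression is compiled only once."""
    cache: Dict[str, Q] = {}

    def get(expression: str) -> Q:
        # Compile on first use; later calls reuse the stored object.
        if expression not in cache:
            cache[expression] = compile_fn(expression)
        return cache[expression]

    return get
```

With pygixml, one would presumably call `make_query_cache(pygixml.XPathQuery)` and then evaluate the returned query objects as shown in the good example. The cache is unbounded, which is fine for a fixed set of expressions but not for expressions built from user input.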

Be Specific in XPath Expressions¶

Avoid the // abbreviation, which searches every descendant, when you know the structure:

# ✅ Good: specific path
books = root.select_nodes("library/book")

# ❌ Bad: scans entire document
books = root.select_nodes("//book")

Use Attributes for Filtering¶

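Attribute comparisons are faster than comparisons against a child element’s text. Note that the two queries below match different document shapes (an id attribute versus an <id> child element), so prefer attributes when you design the schema: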

# ✅ Good: fast attribute comparison
books = root.select_nodes("book[@id='123']")

# ❌ Bad: slower text comparison
books = root.select_nodes("book[id='123']")

Limit Result Sets¶

Limit results in the query rather than slicing in Python:

# ✅ Good: limit in the query
first_10 = root.select_nodes("book[position() <= 10]")

# ❌ Bad: fetch all then slice
all_books = root.select_nodes("book")
first_10 = all_books[:10]

Document Reuse¶

Reuse an XMLDocument with reset() to avoid repeated allocations when processing many files:

doc = pygixml.XMLDocument()

for filename in file_list:
    doc.reset()              # Clear existing content
    doc.load_file(filename)
    # ... process ...

Processing Large Files¶

For documents with thousands of elements, use XPath to select only what you need rather than walking the entire tree in Python:

# ✅ Fast: let XPath filter in C++
fiction = root.select_nodes("book[@category='fiction']")

# ❌ Slower: filter in Python with per-node calls
fiction = []
book = root.first_child()
while book:
    cat = book.attribute("category")
    if cat and cat.value == "fiction":
        fiction.append(book)
    book = book.next_sibling

Every .child(), .attribute(), and .text() call crosses the Python↔Cython boundary. Minimizing these calls in tight loops has the biggest impact on traversal speed.

Node Identity and Fast Lookup¶

Each XMLNode exposes a mem_id — a unique numeric identifier derived from the node’s internal address. Unlike pugixml, which works exclusively with C++ object references, pygixml makes this identifier available as a plain Python integer.

Because mem_id is hashable, it is ideal for use as a dictionary key — a common pattern when building indexes, caches, or associating extra data with specific nodes:

# Cache node metadata by mem_id
cache = {}
for node in doc:
    cache[node.mem_id] = {
        "xpath": node.xpath,
        "depth": node.xpath.count("/"),
    }

There are two ways to look up a node by its identifier:

find_mem_id() — safe, O(n)

Walks the tree from the current node, comparing identifiers. Returns None if the node is not found.

node_id = item.mem_id
found = root.find_mem_id(node_id)   # safe, but O(n)

from_mem_id_unsafe() — instant, O(1)

Reconstructs an XMLNode directly from the identifier. No tree traversal — the lookup is instantaneous.

node_id = item.mem_id
node = pygixml.XMLNode.from_mem_id_unsafe(node_id)  # O(1)

⚠️ Warning: If the identifier is stale (the document was freed or the node was deleted), calling methods on the returned object may cause a segmentation fault. Use this only when you are certain the identifier still belongs to a live node.

Which to choose? For most code, find_mem_id is the right choice — it’s safe and fast enough for typical use. from_mem_id_unsafe is reserved for performance-critical hot paths where you’ve profiled and confirmed that the O(n) tree walk is a bottleneck.
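If you do need the O(1) path, one way to reduce the risk of stale identifiers is to track which identifiers you still consider live and refuse to reconstruct anything else. The sketch below is an assumption-laden pattern, not a pygixml feature: `LiveIdRegistry` is a hypothetical helper, and the reconstruct function is injected so the example runs without the library.

```python
from typing import Any, Callable, Optional, Set


class LiveIdRegistry:
    """Tracks identifiers the caller believes still belong to live nodes."""

    def __init__(self, reconstruct: Callable[[int], Any]) -> None:
        self._reconstruct = reconstruct
        self._live: Set[int] = set()

    def register(self, node_id: int) -> None:
        self._live.add(node_id)

    def forget(self, node_id: int) -> None:
        # Call this before deleting the node the identifier refers to.
        self._live.discard(node_id)

    def clear(self) -> None:
        # Call this before resetting or freeing the document.
        self._live.clear()

    def lookup(self, node_id: int) -> Optional[Any]:
        # Only reconstruct identifiers that were registered and not forgotten.
        if node_id not in self._live:
            return None
        return self._reconstruct(node_id)
```

With pygixml, the injected function would presumably be `pygixml.XMLNode.from_mem_id_unsafe`. The registry only knows what you tell it: it cannot detect a node that was deleted without a matching `forget()` call, so it narrows the risk rather than removing it.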