Advanced Tips¶
Performance tuning, parse flags, and advanced usage patterns for pygixml.
Parse Flags¶
All parse functions accept a ParseFlags enum to control
exactly how pugixml processes the input. By default pygixml uses
ParseFlags.DEFAULT which enables all standard XML processing. You can
trade strictness for speed when you know your input is clean.
Quick Example¶
import pygixml
xml = "<root><item>text</item></root>"  # any XML string
# Fastest possible parse: skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)
# Combine specific flags with bitwise OR
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)
The same flags apply to parse_string(),
parse_file(), load_string(),
and load_file().
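Combining flags with | behaves like any bit-flag enum. The sketch below uses a stand-in enum.IntFlag to show how a combined value is built and inspected. The class name DemoFlags and its bit values are made up for illustration; they are not pygixml's actual ParseFlags definitions.

```python
from enum import IntFlag


class DemoFlags(IntFlag):
    """Hypothetical stand-in for a parse-flag enum; values are illustrative only."""
    MINIMAL = 0
    COMMENTS = 1
    CDATA = 2
    ESCAPES = 4


# Bitwise OR sets each flag's bit in the combined value.
flags = DemoFlags.COMMENTS | DemoFlags.CDATA

# Membership tests check whether a given flag's bit is set.
has_comments = DemoFlags.COMMENTS in flags
has_escapes = DemoFlags.ESCAPES in flags

# OR-ing with a zero-valued flag leaves the combination unchanged.
unchanged = (flags | DemoFlags.MINIMAL) == flags
```

If ParseFlags follows the same pattern, a combined value enables exactly the listed features and nothing else, which is why a flag set built this way differs from DEFAULT.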
Available Flags¶
| Flag | What it enables |
|---|---|
| ParseFlags.MINIMAL | No optional processing; fastest parse. Skips escapes, EOL normalization, and all whitespace handling. |
|  | Parse |
|  | Parse |
|  | Parse processing instructions ( |
|  | Process entity references ( |
|  | Normalize line endings ( |
|  | Convert whitespace characters in PCDATA to spaces. |
|  | Collapse consecutive whitespace in PCDATA to a single space. |
|  | Convert attribute whitespace (tabs, newlines) to spaces. |
|  | Normalize attribute whitespace (trim leading/trailing, collapse consecutive). |
|  | Parse the |
|  | Parse the |
|  | Trim leading and trailing PCDATA whitespace. |
|  | Parse XML fragments that lack a root element. Useful for processing partial documents. |
|  | Parse embedded PCDATA as markup. Handles cases where escaped XML appears inside text content. |
|  | Merge adjacent PCDATA nodes into a single node. |
| ParseFlags.DEFAULT | All standard processing enabled. This is the default when no flag is specified. |
|  | Same as |
When to Use MINIMAL¶
ParseFlags.MINIMAL is the fastest parse mode. It skips:
Escape processing (&amp; stays as &amp;)
EOL normalization
Attribute whitespace conversion/normalization
PCDATA whitespace handling
Use it when:
You control the XML source and know it has no escapes
You only need element structure, not text formatting
You’re processing large documents in a hot path
On real-world XML with lots of escaped content, MINIMAL can be up to ~16% faster than DEFAULT.
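One way to decide whether a given input is a candidate for MINIMAL is a cheap pre-check on the raw text. The helper below is a hypothetical sketch, not part of pygixml: it only looks for the two things the list above says MINIMAL leaves unprocessed that are easy to spot in a string, namely entity references and carriage returns. It is deliberately conservative and says nothing about whitespace handling.

```python
def looks_safe_for_minimal(xml: str) -> bool:
    """Conservative heuristic (hypothetical helper, not a pygixml API).

    Returns True only when the text contains no ampersand (so there are no
    entity or character references to expand) and no carriage return (so
    there are no line endings that would normally be normalized).
    """
    return "&" not in xml and "\r" not in xml


clean = "<items><item id='1'>plain text</item></items>"
escaped = "<items><item>fish &amp; chips</item></items>"
windows_eol = "<items>\r\n<item>x</item>\r\n</items>"

for label, sample in [("clean", clean), ("escaped", escaped), ("windows_eol", windows_eol)]:
    print(label, looks_safe_for_minimal(sample))
```

Because an ampersand inside a CDATA section also makes the check return False, the helper can reject inputs that would in fact be fine; it should never accept one that needs escape processing.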
Working with Text: value, child_value(), and text()¶
pygixml hides pugixml’s internal text-node structure. In pugixml, elements do not hold text directly; they contain child text nodes. pygixml handles this for you and offers several ways to get and set text, depending on your needs.
Setting Text Content¶
You can set text on an element, and pygixml will automatically create or replace the underlying text node:
item = root.append_child("item")
item.value = "Hello World" # Automatically creates/replaces a text child
You can also modify the text node directly. Both approaches give the same result because pygixml maps element-level access onto the underlying text node:
# Setting via element (creates/replaces child)
item.value = "Hello"
# Setting via the text node directly (equivalent result)
item.first_child().value = "World"
Reading Text Content¶
Depending on how much of the tree you need to read, choose the right accessor:
value: direct access
Returns the raw value of the node. For text nodes, this is the content. For element nodes, it returns the value of the first text/CDATA child (or None if no text exists), which saves you from accessing that child yourself.
item.value  # Returns "Hello" (from first text child)
child_value(): targeted single child
Returns the text of the first child element, or of the child with a specific tag. It does not recurse. Best for simple key-value XML where elements hold a single text node.
doc.root.child_value()  # "Hello"
doc.root.child_value("title")  # "Python 101"
text(): full recursive extraction
Walks the entire subtree, collecting all text and CDATA nodes, and joins them. Use this when you need to extract all text from mixed content.
doc.root.text()  # "Hello\nworld!\nThis is\nrich\ntext."
doc.root.text(join=" ")  # "Hello world! This is rich text."
Summary Table¶
| Method | Scope | Best For |
|---|---|---|
| value (on an element) | First text child (or None) | Quick read/write of simple element text |
| value (on a text node) | The text node directly | Direct manipulation of the text node |
| child_value() | First child element’s text | Key-value XML structures |
| text() | Entire subtree (recursive) | Mixed content, documents, rich text |
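The difference between reading one text child and collecting a whole subtree is not specific to pygixml. As an analogy only, the sketch below shows the same three scopes using the standard library's xml.etree.ElementTree, so it runs without pygixml installed. The ElementTree calls are not pygixml's API, and details such as separators and whitespace may differ from what pygixml's accessors return.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc>Hello <b>world</b>!<title>Python 101</title></doc>")

# Scope 1: only the text that sits directly before the first child element.
first_text = root.text

# Scope 2: the text of one named child, without recursing further.
title = root.findtext("title")

# Scope 3: every text fragment in the subtree, in document order.
all_text = "".join(root.itertext())

print(repr(first_text))
print(repr(title))
print(repr(all_text))
```

The narrow accessors are cheaper and more predictable; the recursive one is the only option that sees text nested inside inline elements such as the b element here.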
XPathQuery for Repeated Queries¶
When running the same XPath query multiple times, use
XPathQuery to compile once and evaluate many times:
# ✅ Good: compile once, evaluate many times
query = pygixml.XPathQuery("book[@category='fiction']")
for _ in range(1000):
    results = query.evaluate_node_set(root)
# ❌ Bad: re-compile every iteration
for _ in range(1000):
    results = root.select_nodes("book[@category='fiction']")
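How much compiling once actually saves depends on the query and the document, so it is worth measuring on your own data. The helper below is a minimal timing sketch built on the standard library's timeit; time_variants is a hypothetical name, and the two lambdas are stand-in workloads so the block runs without pygixml.

```python
import timeit


def time_variants(variants, number=1000):
    """Run each zero-argument callable `number` times; return total seconds per name."""
    return {name: timeit.timeit(fn, number=number) for name, fn in variants.items()}


# Stand-in workloads. With pygixml you might instead pass, for example:
#   "compiled":  lambda: query.evaluate_node_set(root)
#   "recompile": lambda: root.select_nodes("book[@category='fiction']")
timings = time_variants({
    "precomputed": lambda: 42,
    "recomputed": lambda: sum(range(50)),
})

# Print the variants from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f}s")
```

Timings vary between runs and machines, so compare variants within one run rather than against numbers quoted elsewhere.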
Be Specific in XPath Expressions¶
Avoid the descendant search (//) when you know the structure; a path that names each step lets the engine skip unrelated subtrees:
# ✅ Good: specific path
books = root.select_nodes("library/book")
# ❌ Bad: scans entire document
books = root.select_nodes("//book")
Use Attributes for Filtering¶
Attribute comparisons are faster than comparisons against a child element’s text:
# ✅ Good: fast attribute comparison
books = root.select_nodes("book[@id='123']")
# ❌ Bad: slower text comparison
books = root.select_nodes("book[id='123']")
Limit Result Sets¶
Limit results in the query rather than slicing in Python:
# ✅ Good: limit in the query
first_10 = root.select_nodes("book[position() <= 10]")
# ❌ Bad: fetch all then slice
all_books = root.select_nodes("book")
first_10 = all_books[:10]
Document Reuse¶
Reuse an XMLDocument with reset() to avoid
repeated allocations when processing many files:
doc = pygixml.XMLDocument()
for filename in file_list:
    doc.reset()  # Clear existing content
    doc.load_file(filename)
    # ... process ...
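The reset-and-load loop can be wrapped in a small generator so that calling code only sees loaded documents. This is a sketch under assumptions: iter_loaded is a hypothetical helper, and it relies only on the reset() and load_file() methods used in the loop above. FakeDocument is a test double that records calls, so the block runs without pygixml.

```python
def iter_loaded(doc, filenames):
    """Yield (filename, doc) pairs while reusing a single document object.

    `doc` is assumed to expose reset() and load_file(path).
    """
    for filename in filenames:
        doc.reset()              # drop the previous file's content
        doc.load_file(filename)  # load the next file into the same object
        yield filename, doc


class FakeDocument:
    """Test double standing in for pygixml.XMLDocument; it only records calls."""

    def __init__(self):
        self.calls = []

    def reset(self):
        self.calls.append("reset")

    def load_file(self, filename):
        self.calls.append(f"load:{filename}")


fake = FakeDocument()
seen = [name for name, _ in iter_loaded(fake, ["a.xml", "b.xml"])]
print(seen)
print(fake.calls)
```

With the real library you would pass pygixml.XMLDocument() as doc. Since the same object is reused, anything taken from one file should be copied out before the next iteration resets the document.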
Processing Large Files¶
For documents with thousands of elements, use XPath to select only what you need rather than walking the entire tree in Python:
# ✅ Fast: let XPath filter in C++
fiction = root.select_nodes("book[@category='fiction']")
# ❌ Slower: filter in Python with per-node calls
fiction = []
book = root.first_child()
while book:
    cat = book.attribute("category")
    if cat and cat.value == "fiction":
        fiction.append(book)
    book = book.next_sibling
Every .child(), .attribute(), and .text() call crosses the
Python↔Cython boundary. Minimizing these calls in tight loops has the
biggest impact on traversal speed.
Node Identity and Fast Lookup¶
Each XMLNode exposes a mem_id —
a unique numeric identifier derived from the node’s internal address.
Unlike pugixml, which works exclusively with C++ object references, pygixml
makes this identifier available as a plain Python integer.
Because mem_id is hashable, it is ideal for use as a dictionary key
— a common pattern when building indexes, caches, or associating extra data
with specific nodes:
# Cache node metadata by mem_id
cache = {}
for node in doc:
    cache[node.mem_id] = {
        "xpath": node.xpath,
        "depth": node.xpath.count("/"),
    }
There are two ways to look up a node by its identifier:
find_mem_id(): safe, O(n)
Walks the tree from the current node, comparing identifiers. Returns None if the node is not found.
node_id = item.mem_id
found = root.find_mem_id(node_id)  # safe, but O(n)
from_mem_id_unsafe(): instant, O(1)
Reconstructs an XMLNode directly from the identifier. No tree traversal is needed, so the lookup takes constant time.
node_id = item.mem_id
node = pygixml.XMLNode.from_mem_id_unsafe(node_id)  # O(1)
⚠️ Warning: If the identifier is stale (the document was freed or the node was deleted), calling methods on the returned object may cause a segmentation fault. Use this only when you are certain the identifier still belongs to a live node.
Which to choose? For most code, find_mem_id is the right choice —
it’s safe and fast enough for typical use. from_mem_id_unsafe is
reserved for performance-critical hot paths where you’ve profiled and
confirmed that the O(n) tree walk is a bottleneck.
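If you do use the unsafe lookup, one possible safeguard is to keep your own record of which identifiers you still consider live and refuse to resolve anything else. The class below is a hypothetical sketch, not part of pygixml. It takes the resolver as a parameter, so with the real library you would pass pygixml.XMLNode.from_mem_id_unsafe, while the demonstration uses a plain dictionary lookup and fake nodes.

```python
from types import SimpleNamespace


class LiveNodeRegistry:
    """Bookkeeping for identifiers the caller believes are still live."""

    def __init__(self, resolver):
        # resolver: callable that maps an identifier to a node object.
        self._resolver = resolver
        self._live = set()

    def register(self, node):
        """Remember a node's identifier and return it."""
        self._live.add(node.mem_id)
        return node.mem_id

    def forget(self, node_id):
        """Call before deleting the node or freeing its document."""
        self._live.discard(node_id)

    def resolve(self, node_id):
        """Resolve only registered identifiers; return None otherwise."""
        if node_id not in self._live:
            return None
        return self._resolver(node_id)


# Fake nodes and a dictionary-backed resolver keep the sketch self-contained.
fake_nodes = {101: SimpleNamespace(mem_id=101, name="item")}
registry = LiveNodeRegistry(resolver=fake_nodes.__getitem__)

node_id = registry.register(fake_nodes[101])
found = registry.resolve(node_id)
registry.forget(node_id)
missing = registry.resolve(node_id)
unknown = registry.resolve(999)
```

This is only as reliable as the calls to forget(): the registry cannot detect that a node was deleted behind its back, so it narrows the risk described in the warning rather than removing it.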