Performance¶

pygixml is designed for high-performance XML processing, leveraging the power of pugixml’s C++ implementation through Cython.

Benchmarks¶

All numbers below come from the included benchmark suite (benchmarks/full_benchmark.py) comparing pygixml, lxml, and xml.etree.ElementTree on the same machine.

Parsing Performance¶

XML Parsing Performance (warmed-up, 50 iterations)¶
Size	pygixml (best of default/minimal)	lxml	ElementTree
100	0.000008 s	0.000081 s	0.000105 s
500	0.000096 s	0.000442 s	0.000643 s
1 000	0.000152 s	0.000764 s	0.001282 s
2 500	0.000440 s	0.001944 s	0.003395 s
5 000	0.000899 s	0.004096 s	0.008256 s
10 000	0.001880 s	0.009338 s	0.016710 s

Measured with ParseFlags.MINIMAL (pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)). Skips escape processing, EOL normalization, and attribute whitespace conversion for maximum throughput. Use the default (ParseFlags.DEFAULT) when you need full XML compliance.

Speedup vs ElementTree¶

Parsing Speedup (how many times faster than ElementTree)¶
Size	pygixml
100	13.6×
500	6.7×
1 000	8.5×
2 500	7.7×
5 000	9.2×
10 000	8.9×

pygixml consistently outperforms lxml by ~2× and ElementTree by 7–14× depending on document size. Each row shows the faster of ParseFlags.DEFAULT and ParseFlags.MINIMAL.

Traversal Performance¶

Traversal is measured as walking each top-level child, reading two sub-elements and extracting their text content.

Traversal (seconds)¶
Size	pygixml	lxml	ElementTree
100	0.000026 s	0.000207 s	0.000009 s
500	0.000108 s	0.001002 s	0.000042 s
1 000	0.000213 s	0.002014 s	0.000085 s
5 000	0.001063 s	0.010307 s	0.000421 s
10 000	0.002168 s	0.020971 s	0.000859 s

pygixml traversal is ~10× faster than lxml but slower than ElementTree in absolute terms. This is because every .child() and .child_value() call crosses the Python↔Cython boundary. Best practice: use XPath for bulk selection (which stays in C++) rather than walking nodes manually.

Memory Usage¶

Peak memory during parsing, measured via tracemalloc:

Peak Memory (MB)¶
Size	pygixml	lxml	ElementTree
1 000	0.13 MB	0.13 MB	1.01 MB
5 000	0.67 MB	0.67 MB	4.84 MB
10 000	1.34 MB	1.34 MB	9.68 MB

pygixml and lxml have nearly identical memory footprints (both backed by C/C++ parsers), while ElementTree uses ~7× more memory due to creating full Python objects for every node and attribute.

Package Size¶

Installed Package Size¶
Package	Size
pygixml	0.43 MB
lxml	5.48 MB

pygixml is 12.7× smaller than lxml in installed size according to pip-size package.

Performance Tips¶

Use XPathQuery for Repeated Queries¶

# ✅ Good: compile once, evaluate many times
query = pygixml.XPathQuery("book[@category='fiction']")
for _ in range(1000):
    results = query.evaluate_node_set(root)

# ❌ Bad: re-compile every iteration
for _ in range(1000):
    results = root.select_nodes("book[@category='fiction']")

Be Specific in XPath Expressions¶

# ✅ Good: specific path
books = root.select_nodes("library/book")

# ❌ Bad: descendant-axis search
books = root.select_nodes("//book")

Use Attributes for Filtering¶

# ✅ Good: fast attribute comparison
books = root.select_nodes("book[@id='123']")

# ❌ Bad: slower text-node comparison
books = root.select_nodes("book[id='123']")

Limit Result Sets¶

# ✅ Good: limit in the query
first_10 = root.select_nodes("book[position() <= 10]")

# ❌ Bad: fetch all then slice
all_books = root.select_nodes("book")
first_10 = all_books[:10]

Memory Management¶

Automatic Cleanup¶

pygixml automatically manages memory through C++ destructors:

# Memory is automatically freed when objects go out of scope
def process_large_xml():
    doc = pygixml.parse_file("large_file.xml")
    # ... process XML ...
    # Memory automatically freed when function returns

Document Reset¶

# Reuse document to avoid reallocation
doc = pygixml.XMLDocument()

for filename in large_file_list:
    doc.reset()          # Clear existing content
    doc.load_file(filename)
    # ... process ...

Optimization Checklist¶

[ ] Use XPathQuery for repeated queries
[ ] Prefer attribute filtering over text filtering
[ ] Be specific in XPath expressions (avoid //)
[ ] Limit result sets with positional predicates
[ ] Reuse XMLDocument objects with reset()
[ ] Use XPath for bulk selection, iterate results in Python
[ ] Avoid unnecessary string conversions

Running Benchmarks¶

Reproduce the numbers on your own machine:

# Full suite: parsing, memory, package size across 6 XML sizes
python benchmarks/full_benchmark.py

# Legacy parsing-only benchmark
python benchmarks/benchmark_parsing.py

The full suite tests 100 – 10 000 element documents over 5 iterations, measures peak memory at 1 000 / 5 000 / 10 000 elements, and reports installed package sizes.