Performance¶
pygixml is designed for high-performance XML processing, leveraging the power of pugixml’s C++ implementation through Cython.
Benchmarks¶
All numbers below come from the included benchmark suite
(benchmarks/full_benchmark.py) comparing pygixml, lxml, and
xml.etree.ElementTree on the same machine.
Parsing Performance¶
| Size | pygixml (best of default/minimal) | lxml | ElementTree |
|---|---|---|---|
| 100 | 0.000008 s | 0.000081 s | 0.000105 s |
| 500 | 0.000096 s | 0.000442 s | 0.000643 s |
| 1 000 | 0.000152 s | 0.000764 s | 0.001282 s |
| 2 500 | 0.000440 s | 0.001944 s | 0.003395 s |
| 5 000 | 0.000899 s | 0.004096 s | 0.008256 s |
| 10 000 | 0.001880 s | 0.009338 s | 0.016710 s |
Measured with ParseFlags.MINIMAL (pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)),
which skips escape processing, EOL normalization, and attribute whitespace conversion
for maximum throughput. Use the default (ParseFlags.DEFAULT) when you need
full XML compliance.
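To make the measurement concrete, here is a minimal timing sketch. It is not the bundled benchmark script: the document shape produced by `make_xml`, the helper `best_of`, and the choice of taking the minimum over 5 runs are all assumptions made for illustration. Only the `pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)` call is taken from the text above, and it is timed only when pygixml is installed.

```python
import time
import xml.etree.ElementTree as ET


def make_xml(n):
    """Build a synthetic document with n <item> elements (hypothetical shape)."""
    items = "".join(
        f'<item id="{i}"><name>Item {i}</name><value>{i}</value></item>'
        for i in range(n)
    )
    return f"<root>{items}</root>"


def best_of(func, arg, iterations=5):
    """Return the fastest wall-clock time of func(arg) over several runs."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings)


# ElementTree is always available; pygixml is added only if it can be imported.
parsers = {"ElementTree": ET.fromstring}
try:
    import pygixml

    parsers["pygixml (MINIMAL)"] = lambda xml: pygixml.parse_string(
        xml, pygixml.ParseFlags.MINIMAL
    )
except ImportError:
    pass

results = {}
for size in (100, 1000):
    xml = make_xml(size)
    for name, parse in parsers.items():
        results[(name, size)] = best_of(parse, xml)
        print(f"{name:>20} {size:>6} {results[(name, size)]:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise, which matters when a single parse takes only a few microseconds.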
Speedup vs ElementTree¶
| Size | pygixml |
|---|---|
| 100 | 13.6× |
| 500 | 6.7× |
| 1 000 | 8.5× |
| 2 500 | 7.7× |
| 5 000 | 9.2× |
| 10 000 | 8.9× |
pygixml outperforms lxml by roughly 4–10× and ElementTree by 7–14×,
depending on document size. Each row shows the faster of
ParseFlags.DEFAULT and ParseFlags.MINIMAL.
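The speedup figures are ratios of the timings in the parsing table, so they can be recomputed directly. The sketch below does this from the rounded values printed in that table; the helper name `speedups` is hypothetical.

```python
# Timings in seconds, copied from the parsing table.
sizes = [100, 500, 1000, 2500, 5000, 10000]
pygixml_s = [0.000008, 0.000096, 0.000152, 0.000440, 0.000899, 0.001880]
lxml_s = [0.000081, 0.000442, 0.000764, 0.001944, 0.004096, 0.009338]
etree_s = [0.000105, 0.000643, 0.001282, 0.003395, 0.008256, 0.016710]


def speedups(baseline, candidate):
    """Return baseline/candidate for each row: how many times faster candidate is."""
    return [b / c for b, c in zip(baseline, candidate)]


vs_lxml = speedups(lxml_s, pygixml_s)
vs_etree = speedups(etree_s, pygixml_s)

for size, l, e in zip(sizes, vs_lxml, vs_etree):
    print(f"{size:>6}  vs lxml: {l:4.1f}x  vs ElementTree: {e:4.1f}x")
```

Because the table rounds to the nearest microsecond, the 100-element timing has only one significant digit, so the recomputed ratio for that row can differ from the published figure by a few percent.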
Traversal Performance¶
Traversal is measured as walking each top-level child, reading two sub-elements and extracting their text content.
| Size | pygixml | lxml | ElementTree |
|---|---|---|---|
| 100 | 0.000026 s | 0.000207 s | 0.000009 s |
| 500 | 0.000108 s | 0.001002 s | 0.000042 s |
| 1 000 | 0.000213 s | 0.002014 s | 0.000085 s |
| 5 000 | 0.001063 s | 0.010307 s | 0.000421 s |
| 10 000 | 0.002168 s | 0.020971 s | 0.000859 s |
pygixml traversal is roughly 8–10× faster than lxml but about 2.5–3× slower
than ElementTree. The gap to ElementTree arises because every .child() and
.child_value() call crosses the Python↔Cython boundary. Best practice: use XPath for
bulk selection (which stays in C++) rather than walking nodes manually.
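For reference, the traversal workload described at the top of this section might look like the following on the ElementTree side. This is a sketch, not the benchmark's code: the element names `item`, `name`, and `value` and the helper `walk` are assumptions.

```python
import xml.etree.ElementTree as ET

# Small synthetic document; the real benchmark documents may be shaped differently.
xml = "<root>" + "".join(
    f"<item><name>Item {i}</name><value>{i}</value></item>" for i in range(3)
) + "</root>"


def walk(root):
    """Visit each top-level child and read the text of two sub-elements."""
    rows = []
    for item in root:
        name = item.findtext("name")
        value = item.findtext("value")
        rows.append((name, value))
    return rows


rows = walk(ET.fromstring(xml))
print(rows)  # [('Item 0', '0'), ('Item 1', '1'), ('Item 2', '2')]
```

Each loop iteration makes two lookups per element, which is the per-node call pattern that the timings above are sensitive to.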
Memory Usage¶
Peak memory during parsing, measured via tracemalloc:
| Size | pygixml | lxml | ElementTree |
|---|---|---|---|
| 1 000 | 0.13 MB | 0.13 MB | 1.01 MB |
| 5 000 | 0.67 MB | 0.67 MB | 4.84 MB |
| 10 000 | 1.34 MB | 1.34 MB | 9.68 MB |
pygixml and lxml have nearly identical memory footprints (both backed by C/C++ parsers), while ElementTree uses ~7× more memory due to creating full Python objects for every node and attribute.
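A peak-memory measurement with tracemalloc can be sketched as follows. The helper name `peak_parse_memory_mb` and the synthetic document are assumptions; the bundled benchmark may structure its measurement differently.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def peak_parse_memory_mb(parse, xml):
    """Return the peak traced memory (in MB) while parse(xml) runs."""
    tracemalloc.start()
    try:
        tree = parse(xml)  # keep a reference so the tree is alive when we sample
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del tree
    return peak / (1024 * 1024)


xml = "<root>" + "".join(
    f'<item id="{i}"><name>Item {i}</name></item>' for i in range(1000)
) + "</root>"

print(f"ElementTree: {peak_parse_memory_mb(ET.fromstring, xml):.2f} MB")
```

Keep in mind that tracemalloc reports memory allocated through Python's own allocators, so memory that a native library obtains directly from the system allocator may not show up in its figures.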
Package Size¶
| Package | Size |
|---|---|
| pygixml | 0.43 MB |
| lxml | 5.48 MB |
pygixml is 12.7× smaller than lxml in installed size, as reported by the pip-size package.
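If pip-size is not at hand, a rough installed-size check is possible with the standard library alone. The sketch below is a stand-in, not a description of how pip-size works: the helper `installed_size_mb` is hypothetical, and it sums only the files a distribution lists in its own metadata, so its result may differ from the figures above.

```python
from importlib import metadata
from pathlib import Path


def installed_size_mb(dist_name):
    """Sum the on-disk size of a distribution's recorded files, in MB.

    Returns None when the distribution is not installed or records no files.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:
        return None
    total = 0
    for entry in files:
        path = Path(entry.locate())
        if path.is_file():
            total += path.stat().st_size
    return total / (1024 * 1024)


for name in ("pygixml", "lxml"):
    size = installed_size_mb(name)
    print(name, "not installed" if size is None else f"{size:.2f} MB")
```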
Performance Tips¶
Use XPathQuery for Repeated Queries¶
# ✅ Good: compile once, evaluate many times
query = pygixml.XPathQuery("book[@category='fiction']")
for _ in range(1000):
    results = query.evaluate_node_set(root)

# ❌ Bad: re-compile every iteration
for _ in range(1000):
    results = root.select_nodes("book[@category='fiction']")
Be Specific in XPath Expressions¶
# ✅ Good: specific path
books = root.select_nodes("library/book")
# ❌ Bad: descendant-axis search
books = root.select_nodes("//book")
Use Attributes for Filtering¶
# ✅ Good: fast attribute comparison
books = root.select_nodes("book[@id='123']")
# ❌ Bad: slower text-node comparison
books = root.select_nodes("book[id='123']")
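Note that the two predicates are not interchangeable: `@id` tests an attribute, while `id` tests a child element, so the tip applies when you control how the document models its identifiers. The sketch below illustrates the difference using ElementTree's XPath subset as a stand-in, since it accepts both predicate forms; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

xml = """
<library>
  <book id="123"><title>Attribute id</title></book>
  <book><id>123</id><title>Child-element id</title></book>
</library>
"""
root = ET.fromstring(xml)

# Matches books whose id *attribute* equals '123'.
by_attribute = root.findall("book[@id='123']")
# Matches books that have an <id> *child element* with text '123'.
by_child = root.findall("book[id='123']")

print([b.findtext("title") for b in by_attribute])  # ['Attribute id']
print([b.findtext("title") for b in by_child])      # ['Child-element id']
```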
Limit Result Sets¶
# ✅ Good: limit in the query
first_10 = root.select_nodes("book[position() <= 10]")
# ❌ Bad: fetch all then slice
all_books = root.select_nodes("book")
first_10 = all_books[:10]
Memory Management¶
Automatic Cleanup¶
pygixml automatically manages memory through C++ destructors:
# Memory is automatically freed when objects go out of scope
def process_large_xml():
    doc = pygixml.parse_file("large_file.xml")
    # ... process XML ...
    # Memory automatically freed when function returns
Document Reset¶
# Reuse document to avoid reallocation
doc = pygixml.XMLDocument()
for filename in large_file_list:
    doc.reset()  # Clear existing content
    doc.load_file(filename)
    # ... process ...
Optimization Checklist¶
[ ] Use XPathQuery for repeated queries
[ ] Prefer attribute filtering over text filtering
[ ] Be specific in XPath expressions (avoid //)
[ ] Limit result sets with positional predicates
[ ] Reuse XMLDocument objects with reset()
[ ] Use XPath for bulk selection, iterate results in Python
[ ] Avoid unnecessary string conversions
Running Benchmarks¶
Reproduce the numbers on your own machine:
# Full suite: parsing, memory, package size across 6 XML sizes
python benchmarks/full_benchmark.py
# Legacy parsing-only benchmark
python benchmarks/benchmark_parsing.py
The full suite tests 100 – 10 000 element documents over 5 iterations, measures peak memory at 1 000 / 5 000 / 10 000 elements, and reports installed package sizes.