# Performance
pygixml is designed for high-performance XML processing, leveraging the power of pugixml’s C++ implementation through Cython.
## Benchmarks

### Parsing Performance
pygixml parses XML significantly faster than the commonly used Python XML libraries (ElementTree from the standard library, and the third-party lxml):
| Library | Time (ms) | Relative Speed |
|---|---|---|
| pygixml | 63 | 1.0x |
| lxml | 125 | 2.0x slower |
| ElementTree | 1,000 | 15.9x slower |
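Figures like these depend heavily on document shape, hardware, and library versions, so it is worth reproducing them on your own data. A minimal sketch of such a comparison, assuming a local `sample.xml` test file (a placeholder path) and that lxml is installed:

```python
# Minimal parsing micro-benchmark (sample.xml is a placeholder path;
# lxml must be installed separately)
import time
import xml.etree.ElementTree as ET

import lxml.etree
import pygixml

with open("sample.xml", "rb") as f:
    xml_bytes = f.read()
xml_text = xml_bytes.decode("utf-8")

def timed(label, fn, repeats=10):
    # Average wall-clock time over several runs
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    avg_ms = (time.perf_counter() - start) / repeats * 1000
    print(f"{label}: {avg_ms:.1f} ms")

timed("pygixml", lambda: pygixml.parse_string(xml_text))
timed("lxml", lambda: lxml.etree.fromstring(xml_bytes))
timed("ElementTree", lambda: ET.fromstring(xml_text))
```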
### Memory Usage

| Library | Memory (MB) | Relative Usage |
|---|---|---|
| pygixml | 45 | 1.0x |
| lxml | 78 | 1.7x more |
| ElementTree | 120 | 2.7x more |
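Memory comparisons are harder to make precise from Python, because pugixml allocates on the C++ heap where Python-level tools such as `tracemalloc` cannot see it. One rough, Unix-only check is to watch the process's peak RSS before and after parsing; a sketch (again with `sample.xml` as a placeholder):

```python
# Rough peak-memory check (Unix-only; on Linux ru_maxrss is in KiB)
import resource

import pygixml

with open("sample.xml", encoding="utf-8") as f:
    xml_text = f.read()

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
doc = pygixml.parse_string(xml_text)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Peak RSS only ever grows, so the delta is an upper-bound
# indicator of parsing cost, not an exact measurement
print(f"Approx. memory growth: {(after - before) / 1024:.1f} MB")
```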
### XPath Performance

| Library | Queries/sec | Relative Speed |
|---|---|---|
| pygixml | 15,200 | 1.0x |
| lxml | 8,500 | 1.8x slower |
| ElementTree | 950 | 16.0x slower |
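Query throughput can be estimated with a compiled `XPathQuery` in a tight loop; a sketch, again assuming a local `sample.xml` containing `book` elements:

```python
# Rough XPath throughput (queries/sec) using a compiled query
import time

import pygixml

doc = pygixml.parse_file("sample.xml")  # placeholder test file
root = doc.first_child()
query = pygixml.XPathQuery("book[@category='fiction']")

n = 10_000
start = time.perf_counter()
for _ in range(n):
    query.evaluate_node_set(root)
elapsed = time.perf_counter() - start

print(f"{n / elapsed:,.0f} queries/sec")
```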
## Performance Tips

### Use `XPathQuery` for Repeated Queries
```python
# ✅ Good: compile the query once, reuse it many times
query = pygixml.XPathQuery("book[@category='fiction']")
for i in range(1000):
    results = query.evaluate_node_set(root)

# ❌ Bad: recompile the expression on every call
for i in range(1000):
    results = root.select_nodes("book[@category='fiction']")
```
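To quantify the gap on your own documents, time both variants against the same tree; a sketch, assuming `root` has already been parsed as in the examples above:

```python
# Compare compiled vs. ad-hoc query cost on the same document
import time

import pygixml

expr = "book[@category='fiction']"
query = pygixml.XPathQuery(expr)

start = time.perf_counter()
for _ in range(1000):
    query.evaluate_node_set(root)
compiled_s = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1000):
    root.select_nodes(expr)
ad_hoc_s = time.perf_counter() - start

print(f"compiled: {compiled_s:.3f}s  ad-hoc: {ad_hoc_s:.3f}s")
```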
### Be Specific in XPath Expressions
```python
# ✅ Good: specific path
books = root.select_nodes("library/book")

# ❌ Bad: descendant search over the whole tree
books = root.select_nodes("//book")

# ✅ Good: attribute filtering
fiction_books = root.select_nodes("book[@category='fiction']")

# ❌ Bad: child-element text filtering
fiction_books = root.select_nodes("book[category='fiction']")
```
### Use Attributes for Filtering
```python
# ✅ Good: fast attribute comparison
books = root.select_nodes("book[@id='123']")

# ❌ Bad: slower child-element text comparison
books = root.select_nodes("book[id='123']")
```
### Limit Result Sets
```python
# ✅ Good: limit results in the query itself
first_10_books = root.select_nodes("book[position() <= 10]")

# ❌ Bad: select everything, then slice
all_books = root.select_nodes("book")
first_10_books = all_books[:10]
```
### Avoid Unnecessary String Conversions
```python
# ✅ Good: work with nodes directly
book = root.select_node("book[1]")
title = book.node().child("title").child_value()

# ❌ Bad: serialize the document and process it as text
xml_string = doc.to_string()
# ... string processing ...
```
## Memory Management

### Automatic Cleanup
pygixml automatically manages memory through C++ destructors:
```python
# Memory is freed automatically when objects go out of scope
def process_large_xml():
    doc = pygixml.parse_file("large_file.xml")  # memory allocated
    # ... process XML ...
    # memory freed automatically when the function returns
```
### Document Reset
```python
# Reuse one document object to avoid repeated allocation
doc = pygixml.XMLDocument()
for filename in large_file_list:
    doc.reset()  # clear existing content
    doc.load_file(filename)
    # ... process ...
```
## Large File Handling

### Chunked Processing

For very large files, process nodes in fixed-size chunks (note that the whole document is still parsed into memory first):
```python
def process_large_xml_in_chunks(filename, chunk_size=1000):
    doc = pygixml.parse_file(filename)
    root = doc.first_child()

    # Process books in fixed-size chunks
    books = root.select_nodes("book")
    for i in range(0, len(books), chunk_size):
        chunk = books[i:i + chunk_size]
        process_chunk(chunk)
        # Drop the reference so the chunk's wrappers can be collected
        del chunk
```
### Memory-Efficient Iteration
```python
# Walk siblings one at a time instead of loading all nodes into a list
def iterate_books_efficiently(root):
    book = root.first_child()
    while book:
        process_book(book)
        book = book.next_sibling()
```
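The same sibling walk can be wrapped in a generator so callers get a plain `for` loop; a sketch, assuming nodes expose a pugixml-style `name()` accessor:

```python
def iter_children(node, name=None):
    """Yield children one at a time, optionally filtered by tag name."""
    child = node.first_child()
    while child:
        # name() is assumed here, mirroring pugixml's node.name()
        if name is None or child.name() == name:
            yield child
        child = child.next_sibling()

# Usage: stream over books without materializing a list
# for book in iter_children(root, "book"):
#     process_book(book)
```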
## Real-World Performance Examples

### High-Volume Data Processing
```python
import pygixml
import time

def benchmark_processing():
    # Large dataset (10,000 books)
    large_xml = generate_large_xml(10000)

    start_time = time.perf_counter()

    doc = pygixml.parse_string(large_xml)
    root = doc.first_child()

    # Process all books with XPath
    fiction_books = root.select_nodes("book[@category='fiction']")
    expensive_books = root.select_nodes("book[price > 20]")
    recent_books = root.select_nodes("book[year >= 2020]")

    # Complex filtering
    target_books = root.select_nodes(
        "book[@category='fiction' and price < 15 and year >= 2010]"
    )

    end_time = time.perf_counter()

    print(f"Processed {len(fiction_books)} fiction books")
    print(f"Processed {len(expensive_books)} expensive books")
    print(f"Processed {len(recent_books)} recent books")
    print(f"Found {len(target_books)} target books")
    print(f"Total time: {end_time - start_time:.3f} seconds")
```
### Web Application Scenario
```python
from flask import Flask, request
import pygixml

app = Flask(__name__)

@app.route('/api/books/filter', methods=['POST'])
def filter_books():
    xml_data = request.data.decode('utf-8')

    # Parse XML (fast)
    doc = pygixml.parse_string(xml_data)
    root = doc.first_child()

    # Extract filter parameters
    category = request.args.get('category')
    max_price = float(request.args.get('max_price', 1000))
    min_year = int(request.args.get('min_year', 0))

    # Build a dynamic XPath query
    # NOTE: validate or escape user input before interpolating it
    # into an XPath expression to avoid injection
    conditions = []
    if category:
        conditions.append(f"@category='{category}'")
    if max_price < 1000:
        conditions.append(f"price <= {max_price}")
    if min_year > 0:
        conditions.append(f"year >= {min_year}")

    xpath_query = "book"
    if conditions:
        xpath_query += f"[{' and '.join(conditions)}]"

    # Execute the query (very fast)
    results = root.select_nodes(xpath_query)

    # Format the response
    books = []
    for result in results:
        book_node = result.node()
        books.append({
            'title': book_node.child('title').child_value(),
            'author': book_node.child('author').child_value(),
            'price': float(book_node.child('price').child_value()),
            'year': int(book_node.child('year').child_value())
        })

    return {'books': books, 'count': len(books)}
```
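One caveat with this route: interpolating raw user input into an XPath string lets a stray quote break, or subvert, the expression. A minimal guard is to build XPath string literals explicitly; `xpath_literal` below is a hypothetical helper, not part of pygixml:

```python
def xpath_literal(value: str) -> str:
    """Return value as a safely quoted XPath 1.0 string literal
    (hypothetical helper, not part of pygixml)."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Mixed quotes: stitch the pieces together with concat()
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"

# e.g. conditions.append(f"@category={xpath_literal(category)}")
```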
## Comparison with Other Libraries

### vs. lxml
**Advantages of pygixml:**

- ~2x faster parsing
- Lower memory usage
- Simpler API
- No external dependencies

**When to use lxml:**

- You need XML Schema validation
- You need XSLT transformation
- You need EXSLT or custom XPath extension functions
### vs. ElementTree
**Advantages of pygixml:**

- ~16x faster parsing
- ~2.7x less memory
- Full XPath 1.0 support
- Better performance on large files

**When to use ElementTree:**

- You must stick to the standard library
- You only have simple XML tasks
- Performance is not a concern
## Performance Testing
You can run the included benchmarks:
```bash
# Run the performance tests
python benchmarks/benchmark_parsing.py

# Generate a performance report
python benchmarks/clean_visualization.py
```
The benchmarks compare pygixml against lxml and ElementTree across various metrics including parsing speed, memory usage, and XPath performance.
## Optimization Checklist
- [ ] Use `XPathQuery` for repeated queries
- [ ] Prefer attribute filtering over text filtering
- [ ] Be specific in XPath expressions (avoid `//`)
- [ ] Limit result sets with positional predicates
- [ ] Reuse `XMLDocument` objects with `reset()`
- [ ] Process large files in chunks
- [ ] Use iterators for large node sets
- [ ] Avoid unnecessary string conversions
By following these guidelines, you can achieve optimal performance with pygixml in your applications.