Performance

pygixml is designed for high-performance XML processing: it wraps pugixml’s C++ implementation through Cython, so parsing and XPath evaluation run in native code.

Benchmarks

Parsing Performance

In the project’s benchmarks, pygixml parses faster than both lxml and the standard library’s ElementTree:

XML Parsing Performance (lower is better)

Library        Time (ms)    Relative Speed
pygixml        63           1.0x
lxml           125          2.0x slower
ElementTree    1,000        15.9x slower
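
The input document and hardware behind these figures are not stated here, so absolute numbers will differ from machine to machine. The following is a minimal sketch for taking a comparable measurement yourself. The helper name time_parse and the synthetic sample are illustrative assumptions, not part of pygixml. ElementTree is used so the snippet runs without extra packages; pygixml.parse_string, which appears elsewhere on this page, could be passed in the same way.

```python
import time
import xml.etree.ElementTree as ET


def time_parse(parse, xml_text, repeats=5):
    """Return the best wall-clock time in milliseconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse(xml_text)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


# Small synthetic document; the size is arbitrary.
sample = "<library>" + "<book id='1'><title>T</title></book>" * 1000 + "</library>"

elapsed_ms = time_parse(ET.fromstring, sample)
print(f"ElementTree: {elapsed_ms:.2f} ms")
```

Taking the best of several runs reduces noise from other processes; it does not remove it, so treat small differences with caution.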

Memory Usage

Memory Usage Comparison

Library        Memory (MB)    Relative Usage
pygixml        45             1.0x
lxml           78             1.7x more
ElementTree    120            2.7x more

XPath Performance

XPath Query Performance (queries/second, higher is better)

Library        QPS       Relative Speed
pygixml        15,200    1.0x
lxml           8,500     1.8x slower
ElementTree    950       16x slower
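
Queries per second can be estimated by running one query in a loop for a fixed time window. The sketch below shows one way to do that. The helper queries_per_second and the sample document are assumptions made for illustration, and ElementTree stands in for the library under test so the snippet runs on a plain Python install.

```python
import time
import xml.etree.ElementTree as ET


def queries_per_second(run_query, min_seconds=0.2):
    """Call run_query repeatedly for at least min_seconds; return calls per second."""
    count = 0
    start = time.perf_counter()
    while True:
        run_query()
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return count / elapsed


root = ET.fromstring(
    "<library>"
    + "<book category='fiction'/><book category='science'/>" * 500
    + "</library>"
)

qps = queries_per_second(lambda: root.findall("book[@category='fiction']"))
print(f"ElementTree: {qps:,.0f} queries/second")
```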

Performance Tips

Use XPathQuery for Repeated Queries

# ✅ Good: Compile once, use many times
query = pygixml.XPathQuery("book[@category='fiction']")
for i in range(1000):
    results = query.evaluate_node_set(root)

# ❌ Bad: Compile every time
for i in range(1000):
    results = root.select_nodes("book[@category='fiction']")
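
When the same expressions recur across many call sites, compile-once can be enforced with a small cache keyed by the expression string. This is only a sketch: QueryCache is a hypothetical helper, not a pygixml class, and the compiler is passed in as a parameter so the example runs without pygixml installed. In real use you would presumably pass pygixml.XPathQuery.

```python
class QueryCache:
    """Cache compiled queries by expression string."""

    def __init__(self, compile_query):
        self._compile = compile_query  # e.g. pygixml.XPathQuery (assumption)
        self._cache = {}

    def get(self, expression):
        query = self._cache.get(expression)
        if query is None:
            # First request for this expression: compile and remember it.
            query = self._compile(expression)
            self._cache[expression] = query
        return query


# Stand-in compiler that records how often it is called.
compiled = []


def fake_compile(expression):
    compiled.append(expression)
    return ("compiled", expression)


cache = QueryCache(fake_compile)
for _ in range(1000):
    query = cache.get("book[@category='fiction']")

print(len(compiled))  # 1
```

The cache grows with the number of distinct expressions, so it suits a fixed set of queries better than expressions built from user input.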

Be Specific in XPath Expressions

# ✅ Good: Specific path
books = root.select_nodes("library/book")

# ❌ Bad: Descendant search (// scans the whole document)
books = root.select_nodes("//book")

# ✅ Good: Attribute filtering
fiction_books = root.select_nodes("book[@category='fiction']")

# ❌ Bad: Filtering on a child element's text
fiction_books = root.select_nodes("book[category='fiction']")

Use Attributes for Filtering

# ✅ Good: Fast attribute comparison
books = root.select_nodes("book[@id='123']")

# ❌ Bad: Slower comparison against a child element's text
books = root.select_nodes("book[id='123']")

Limit Result Sets

# ✅ Good: Limit results
first_10_books = root.select_nodes("book[position() <= 10]")

# ❌ Bad: Get all then slice
all_books = root.select_nodes("book")
first_10_books = all_books[:10]

Avoid Unnecessary String Conversions

# ✅ Good: Work with nodes directly
book = root.select_node("book[1]")
title = book.node().child("title").child_value()

# ❌ Bad: Convert to string and parse
xml_string = doc.to_string()
# ... string processing ...

Memory Management

Automatic Cleanup

pygixml releases the underlying document through C++ destructors once the Python object is no longer referenced:

# Memory is automatically freed when objects go out of scope
def process_large_xml():
    doc = pygixml.parse_file("large_file.xml")  # Memory allocated
    # ... process XML ...
    # Memory automatically freed when function returns

Document Reset

# Reuse document to avoid reallocation
doc = pygixml.XMLDocument()

for filename in large_file_list:
    doc.reset()  # Clear existing content
    doc.load_file(filename)
    # ... process ...

Large File Handling

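Chunked Processing

For very large files, process the selected nodes in chunks. The document is still parsed in full; chunking only bounds how much work is done per batch: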

def process_large_xml_in_chunks(filename, chunk_size=1000):
    doc = pygixml.parse_file(filename)
    root = doc.first_child()

    # Process books in chunks
    books = root.select_nodes("book")
    for i in range(0, len(books), chunk_size):
        chunk = books[i:i + chunk_size]
        process_chunk(chunk)

        # Drop the reference to the processed slice; books still holds every node
        del chunk
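
The slicing loop above can be factored into a reusable generator. The sketch below assumes only that the input supports len() and slicing, which the node set is used for in the function above. The name batched is a hypothetical helper, demonstrated here on a plain list.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items from a sequence."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


sizes = [len(batch) for batch in batched(list(range(2500)), 1000)]
print(sizes)  # [1000, 1000, 500]
```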

Memory-Efficient Iteration

# Walk siblings one at a time instead of building a full node list
def iterate_books_efficiently(root):
    book = root.first_child()
    while book:
        process_book(book)
        book = book.next_sibling()
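
The same traversal can be wrapped in a Python generator so callers can use a for loop. This sketch relies on the same assumption as the loop above: next_sibling() returns something falsy once the siblings run out. iter_siblings and FakeNode are illustrative names; FakeNode is a stand-in so the example runs without pygixml.

```python
def iter_siblings(first):
    """Yield `first` and each following sibling until the chain ends."""
    node = first
    while node:
        yield node
        node = node.next_sibling()


class FakeNode:
    """Stand-in exposing the one method the generator needs."""

    def __init__(self, name, nxt=None):
        self.name = name
        self._next = nxt

    def next_sibling(self):
        return self._next


chain = FakeNode("a", FakeNode("b", FakeNode("c")))
print([node.name for node in iter_siblings(chain)])  # ['a', 'b', 'c']
```

With a real document, the call would presumably look like `for book in iter_siblings(root.first_child()): process_book(book)`.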

Real-World Performance Examples

High-Volume Data Processing

import pygixml
import time

def benchmark_processing():
    # Large dataset (10,000 books); generate_large_xml is assumed to return an XML string
    large_xml = generate_large_xml(10000)

    # perf_counter is a monotonic, high-resolution clock suited to timing
    start_time = time.perf_counter()

    doc = pygixml.parse_string(large_xml)
    root = doc.first_child()

    # Process all books with XPath
    fiction_books = root.select_nodes("book[@category='fiction']")
    expensive_books = root.select_nodes("book[price > 20]")
    recent_books = root.select_nodes("book[year >= 2020]")

    # Complex filtering
    target_books = root.select_nodes(
        "book[@category='fiction' and price < 15 and year >= 2010]"
    )

    end_time = time.perf_counter()

    print(f"Processed {len(fiction_books)} fiction books")
    print(f"Processed {len(expensive_books)} expensive books")
    print(f"Processed {len(recent_books)} recent books")
    print(f"Found {len(target_books)} target books")
    print(f"Total time: {end_time - start_time:.3f} seconds")
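
The benchmark calls generate_large_xml without defining it. A minimal stand-in might look like the sketch below. The element and attribute names are inferred from the XPath expressions in the benchmark; the value ranges, category list, and seed parameter are assumptions made for illustration.

```python
import random
import xml.etree.ElementTree as ET


def generate_large_xml(count, seed=0):
    """Build a <library> document containing `count` <book> elements."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    categories = ["fiction", "science", "history"]
    parts = ["<library>"]
    for i in range(count):
        category = rng.choice(categories)
        price = rng.uniform(5, 50)
        year = rng.randint(1990, 2024)
        parts.append(
            f'<book id="{i}" category="{category}">'
            f"<title>Book {i}</title>"
            f"<author>Author {i % 100}</author>"
            f"<price>{price:.2f}</price>"
            f"<year>{year}</year>"
            "</book>"
        )
    parts.append("</library>")
    return "".join(parts)


xml_text = generate_large_xml(10000)
# Sanity check with the standard library parser.
print(len(ET.fromstring(xml_text).findall("book")))  # 10000
```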

Web Application Scenario

from flask import Flask, request
import pygixml

app = Flask(__name__)

@app.route('/api/books/filter', methods=['POST'])
def filter_books():
    xml_data = request.data.decode('utf-8')

    # Parse XML (fast)
    doc = pygixml.parse_string(xml_data)
    root = doc.first_child()

    # Extract filter parameters
    category = request.args.get('category')
    max_price = float(request.args.get('max_price', 1000))
    min_year = int(request.args.get('min_year', 0))

    # Build dynamic XPath query
    conditions = []
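    if category:
        # Reject single quotes so the value cannot break out of the XPath string literal
        if "'" in category:
            return {'error': 'invalid category'}, 400
        conditions.append(f"@category='{category}'")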
    if max_price < 1000:
        conditions.append(f"price <= {max_price}")
    if min_year > 0:
        conditions.append(f"year >= {min_year}")

    xpath_query = "book"
    if conditions:
        xpath_query += f"[{' and '.join(conditions)}]"

    # Execute query (very fast)
    results = root.select_nodes(xpath_query)

    # Format response
    books = []
    for result in results:
        book_node = result.node()
        books.append({
            'title': book_node.child('title').child_value(),
            'author': book_node.child('author').child_value(),
            'price': float(book_node.child('price').child_value()),
            'year': int(book_node.child('year').child_value())
        })

    return {'books': books, 'count': len(books)}
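
Because the handler builds its XPath from request parameters, string values need care: XPath 1.0 has no escape syntax inside string literals. One common workaround is to pick the quote character the value does not contain, and fall back to concat() when it contains both. The sketch below uses a hypothetical helper, xpath_literal, which is not part of pygixml.

```python
def xpath_literal(value):
    """Return an XPath 1.0 expression that evaluates to the string `value`."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: split on single quotes and rejoin with concat().
    pieces = []
    for index, part in enumerate(value.split("'")):
        if index > 0:
            pieces.append('"\'"')  # a single quote, wrapped in double quotes
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


print(xpath_literal("fiction"))   # 'fiction'
print(xpath_literal("O'Reilly"))  # "O'Reilly"
```

In the handler, the category condition could then be written as `conditions.append(f"@category={xpath_literal(category)}")`.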

Comparison with Other Libraries

vs. lxml

Advantages of pygixml:

  • 2x faster parsing

  • Lower memory usage

  • Simpler API

  • No external dependencies

When to use lxml:

  • Need XPath 2.0/3.0 features

  • Require XML Schema validation

  • Need XSLT transformation

vs. ElementTree

Advantages of pygixml:

  • 16x faster parsing

  • 2.7x less memory

  • Full XPath 1.0 support

  • Better performance with large files

When to use ElementTree:

  • Standard library requirement

  • Simple XML tasks only

  • No performance requirements

Performance Testing

You can run the included benchmarks:

# Run performance tests
python benchmarks/benchmark_parsing.py

# Generate performance report
python benchmarks/clean_visualization.py

The benchmarks compare pygixml against lxml and ElementTree across various metrics including parsing speed, memory usage, and XPath performance.

Optimization Checklist

  • [ ] Use XPathQuery for repeated queries

  • [ ] Prefer attribute filtering over text filtering

  • [ ] Be specific in XPath expressions (avoid //)

  • [ ] Limit result sets with positional predicates

  • [ ] Reuse XMLDocument objects with reset()

  • [ ] Process large files in chunks

  • [ ] Use iterators for large node sets

  • [ ] Avoid unnecessary string conversions

By following these guidelines, you can achieve optimal performance with pygixml in your applications.