Performance

liburlparser is designed to be fast and efficient; in the benchmarks below it outperforms comparable Python domain-extraction libraries. This page presents those benchmarks along with tips for optimizing your code when working with liburlparser.

Benchmarks

Extract From Host

Tests were run on a file containing 10 million random domains from various top-level domains:

Library            Function                    Time
liburlparser       liburlparser.Host           1.12s
PyDomainExtractor  pydomainextractor.extract   1.50s
publicsuffix2      publicsuffix2.get_sld       9.92s
tldextract         __call__                    29.23s
tld                tld.parse_tld               34.48s

Extract From URL

Tests were run on a file containing 1 million random URLs:

Library            Function                             Time
liburlparser       liburlparser.Host.from_url           2.10s
PyDomainExtractor  pydomainextractor.extract_from_url   2.24s
publicsuffix2      publicsuffix2.get_sld                10.84s
tldextract         __call__                             36.04s
tld                tld.parse_tld                        57.87s
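
The benchmark harness itself isn't shown here, but a minimal sketch along these lines can reproduce this kind of measurement (the file names and the one-entry-per-line input format are assumptions):

import time
from liburlparser import Host

def benchmark(label, path, parse):
    # Assumed input format: one domain or URL per line
    count = 0
    start = time.perf_counter()
    with open(path) as f:
        for line in f:
            parse(line.strip())
            count += 1
    print(f"{label}: {count} entries in {time.perf_counter() - start:.2f}s")

# Placeholder file names for the inputs described above
benchmark("liburlparser.Host", "domains.txt", Host)
benchmark("liburlparser.Host.from_url", "urls.txt", Host.from_url)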

Performance Optimization Tips

1. Choose the Right Method

liburlparser provides several methods for extracting domain information, each with different performance characteristics:

from liburlparser import Url, Host
import time

url_str = "https://mail.google.com/about"

# Method 1: Full Url object creation (slowest, but provides all URL components)
start = time.perf_counter()
url = Url(url_str)
domain = url.domain
print(f"Method 1 time: {time.perf_counter() - start:.6f}s")

# Method 2: Host object from URL (faster, provides all host components)
start = time.perf_counter()
host = Host.from_url(url_str)
domain = host.domain
print(f"Method 2 time: {time.perf_counter() - start:.6f}s")

# Method 3: Extract host string only (very fast)
start = time.perf_counter()
host_str = Url.extract_host(url_str)
print(f"Method 3 time: {time.perf_counter() - start:.6f}s")

# Method 4: Extract components directly (fastest for domain extraction)
start = time.perf_counter()
components = Host.extract_from_url(url_str)
domain = components["domain"]
print(f"Method 4 time: {time.perf_counter() - start:.6f}s")

2. Batch Processing

For processing large numbers of URLs, use the fastest method appropriate for your needs:

from liburlparser import Host
import time

# Sample list of URLs
urls = ["https://example.com", "https://google.com", "https://github.com"] * 1000

# Method 1: Creating full Host objects
start = time.perf_counter()
domains1 = []
for url in urls:
    host = Host.from_url(url)
    domains1.append(host.domain)
print(f"Method 1 time: {time.perf_counter() - start:.4f}s")

# Method 2: Using extract_from_url (faster)
start = time.perf_counter()
domains2 = []
for url in urls:
    info = Host.extract_from_url(url)
    domains2.append(info["domain"])
print(f"Method 2 time: {time.perf_counter() - start:.4f}s")

3. Memory Optimization

If you're processing millions of URLs and memory usage is a concern, use the extraction methods instead of creating full objects:

# Memory-efficient processing
from liburlparser import Host

def process_url_file(input_file, output_file):
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            url = line.strip()
            try:
                # Use extract_from_url instead of creating Host objects
                info = Host.extract_from_url(url)
                outfile.write(f"{url},{info['domain']},{info['suffix']}\n")
            except Exception:
                outfile.write(f"{url},error,error\n")

4. Profiling Your Code

You can use Python's built-in profiling tools to identify performance bottlenecks:

import cProfile
from liburlparser import Host

def process_urls(urls):
    results = []
    for url in urls:
        info = Host.extract_from_url(url)
        results.append(info["domain"])
    return results

# Generate sample data
urls = ["https://example.com", "https://google.com", "https://github.com"] * 1000

# Profile the function
cProfile.run('process_urls(urls)')
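
To focus on the hottest functions, you can also drive the profiler directly and sort the report with the standard-library pstats module (this continues the example above):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
process_urls(urls)
profiler.disable()

# Show the ten entries with the highest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)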

Comparison with Other Libraries

liburlparser is significantly faster than other Python domain extraction libraries because:

  1. It's implemented in C++ with Python bindings
  2. It uses efficient data structures for the Public Suffix List
  3. It provides specialized methods for different use cases

If you're migrating from another library, here's how liburlparser compares:

# tldextract
import tldextract
extracted = tldextract.extract("mail.google.com")
domain = extracted.domain
suffix = extracted.suffix
subdomain = extracted.subdomain

# liburlparser equivalent (much faster)
from liburlparser import Host
host = Host("mail.google.com")
domain = host.domain
suffix = host.suffix
subdomain = host.subdomain

# publicsuffix2
from publicsuffix2 import get_sld
domain = get_sld("mail.google.com")

# liburlparser equivalent
from liburlparser import Host
domain = Host("mail.google.com").domain

# pydomainextractor
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
result = extractor.extract("mail.google.com")

# liburlparser equivalent
from liburlparser import Host
result = Host.extract("mail.google.com")
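
When migrating, it can help to run the old and new libraries side by side on a sample of your data. A minimal sketch, assuming tldextract is installed; note that field semantics may differ between libraries (for example, how the registrable domain is split), so compare the printed fields rather than assuming one-to-one equality:

import tldextract
from liburlparser import Host

for host_str in ["mail.google.com", "www.bbc.co.uk"]:
    te = tldextract.extract(host_str)
    lp = Host(host_str)
    print(host_str)
    print("  tldextract  :", te.subdomain, te.domain, te.suffix)
    print("  liburlparser:", lp.subdomain, lp.domain, lp.suffix)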