Loading...
Searching...
No Matches
liburlparser Library

Logo

Fastest domain extractor library for C++

Complete library for parsing URLs with C++, Python and Command Line

Introduction

liburlparser is a high-performance URL parsing library written in C++ with Python bindings. It provides efficient URL parsing and domain extraction capabilities while maintaining a consistent interface across both C++ and Python. This page introduces the core concepts, architecture, and features of liburlparser.

For installation instructions, see Installation. For usage examples, see Quick Start Guide.

Purpose and Scope

The primary purpose of liburlparser is to parse URLs and extract their components with a focus on accurate domain extraction. The library is designed to:

  • Parse URLs into their component parts (protocol, host, path, query parameters, etc.)
  • Extract domain information from hostnames, correctly handling public suffixes
  • Provide a consistent API across both C++ and Python
  • Offer superior performance compared to other URL parsing solutions

Key Features

  • High Performance: Optimized C++ implementation for fast URL parsing
  • Comprehensive URL Parsing: Extract all components of a URL including protocol, domain, subdomain, suffix, path, query parameters, and fragments
  • Public Suffix List Support: Accurate domain recognition using the public suffix list
  • Clean API Design: Separate TLD::Url and TLD::Host classes for better code organization
  • Cross-Platform Compatibility: Works on Windows, Linux, and macOS
  • Automatic PSL Updates: Updates the public suffix list automatically during build

System Architecture

liburlparser follows a layered architecture with C++ core components providing the parsing functionality and language bindings making this functionality available in Python.

Architecture Diagram

URL Parsing Process

When parsing a URL, the library follows this general flow:

URL Parsing Process

Core Components

The library is built around two primary classes that work together to provide URL parsing functionality:

TLD::Url Class

The TLD::Url class is responsible for parsing complete URLs and extracting all their components. It provides methods to access:

  • Protocol (e.g., "https")
  • User information
  • Host (which is represented as a Host object)
  • Port
  • Path
  • Query parameters
  • Fragment

Key Methods and Properties

  • protocol(): Returns the protocol (e.g., "http", "https")
  • userinfo(): Returns the user information part of the URL
  • host(): Returns a TLD::Host object representing the host part
  • port(): Returns the port number
  • abspath(): Returns the absolute path
  • query(): Returns the query string
  • params(): Returns the query parameters as a map
  • fragment(): Returns the fragment (anchor)
  • str(): Returns the complete URL as a string

TLD::Host Class

The TLD::Host class focuses on parsing and extracting domain information from hostnames. It leverages the Public Suffix List (PSL) to correctly handle domain extraction, even for complex cases like "co.uk". It provides methods to access:

  • Subdomain (e.g., "www" in "www.example.com")
  • Domain (e.g., "example" in "www.example.com")
  • Domain name (the combination of domain and suffix)
  • Suffix (e.g., "com" in "www.example.com" or "co.uk" in "example.co.uk")

Key Methods and Properties

  • domain(): Returns the domain part
  • subdomain(): Returns the subdomain part
  • suffix(): Returns the suffix (TLD)
  • domainName(): Returns the domain name (domain + suffix)
  • fulldomain(): Returns the full domain (subdomain + domain + suffix)
  • str(): Returns the host as a string
  • static fromUrl(const std::string& url, bool ignore_www = false): Creates a Host object from a URL

Usage Examples

Basic URL Parsing

#include <iostream>
#include "urlparser.h"
int main() {
// Parse a URL
TLD::Url url("https://www.example.com/path?param=value#section");
// Access URL components
std::cout << "Protocol: " << url.protocol() << std::endl;
std::cout << "Domain: " << url.domain() << std::endl;
std::cout << "Subdomain: " << url.subdomain() << std::endl;
std::cout << "Suffix: " << url.suffix() << std::endl;
std::cout << "Path: " << url.abspath() << std::endl;
std::cout << "Query: " << url.query() << std::endl;
std::cout << "Fragment: " << url.fragment() << std::endl;
return 0;
}
Represents a URL.
Definition urlparser.h:67
Defines classes for parsing and handling URLs and hosts.

Working with Hosts

#include <iostream>
#include "urlparser.h"
int main() {
// Create a Host object directly
TLD::Host host1("www.example.com");
// Create a Host object from a URL
TLD::Host host2 = TLD::Host::fromUrl("https://www.example.com/path?param=value");
// Access Host components
std::cout << "Domain: " << host1.domain() << std::endl;
std::cout << "Subdomain: " << host1.subdomain() << std::endl;
std::cout << "Suffix: " << host1.suffix() << std::endl;
std::cout << "Domain Name: " << host1.domainName() << std::endl;
// Compare hosts
if (host1 == host2) {
std::cout << "Hosts are equal" << std::endl;
}
return 0;
}
Represents a Host part of a URL.
Definition urlparser.h:200
static Host fromUrl(const std::string &url, const bool ignore_www=DEFAULT_IGNORE_WWW)
Create a Host object from a URL string.

Advanced Example

#include <iostream>
#include "urlparser.h"
int main() {
// Parse a complex URL
const TLD::Url url(
"https://user:password@www.subdomain.example.co.uk:8080/path/to/resource?param1=value1&param2=value2#section",
true // ignore_www parameter
);
// Access URL components
std::cout << "Protocol: " << url.protocol() << std::endl;
std::cout << "User Info: " << url.userinfo() << std::endl;
std::cout << "Domain: " << url.domain() << std::endl;
std::cout << "Subdomain: " << url.subdomain() << std::endl;
std::cout << "Suffix: " << url.suffix() << std::endl;
std::cout << "Port: " << url.port() << std::endl;
std::cout << "Path: " << url.abspath() << std::endl;
std::cout << "Query: " << url.query() << std::endl;
std::cout << "Fragment: " << url.fragment() << std::endl;
// Access query parameters
auto params = url.params();
for (const auto& param : params) {
std::cout << "Parameter: " << param.first << " = " << param.second << std::endl;
}
// Get the host object
TLD::Host host = url.host();
std::cout << "Full Domain: " << host.fulldomain() << std::endl;
return 0;
}
const std::string & fulldomain() const noexcept
Get the full domain of the host.

Public Suffix List

The library uses the Public Suffix List (PSL) to accurately identify domain suffixes. The PSL is automatically downloaded during the build process, but you can also load it manually:

#include "urlparser.h"
int main() {
// Check if PSL is loaded
bool pslLoaded = TLD::Host::isPslLoaded();
// Load PSL from a file
if (!pslLoaded) {
TLD::Host::loadPslFromPath("/path/to/public_suffix_list.dat");
}
// Or load PSL from a string
std::string pslContent = "..."; // PSL content
return 0;
}
static void loadPslFromPath(const std::string &filepath)
Load the Public Suffix List from a file.
static void loadPslFromString(const std::string &filestr)
Load the Public Suffix List from a string.
static bool isPslLoaded() noexcept
Check if the Public Suffix List (PSL) is loaded.

Building and Installation

Prerequisites

  • C++17 compatible compiler
  • CMake 3.19 or higher

Build Steps

git clone https://github.com/mohammadraziei/liburlparser
mkdir -p build && cd build
cmake ..
make
sudo make install

Running Tests

make test

Generating Documentation

make docs

Performance Comparison

URL Parser is designed for high performance. In benchmarks, it consistently outperforms other URL parsing libraries:

Library Function Time (Extract from Host) Time (Extract from URL)
liburlparser Host 1.12s 2.10s
Other libraries - 1.50s - 34.48s 2.24s - 57.87s

Tests were run on a file containing 10 million random domains and 1 million random URLs.

License

This library is distributed under the MIT License. See the LICENSE file for more information.