XMLStreamer: A Memory-Efficient Tool for Parsing Large XML Feeds
Parsing large RSS or Atom feeds often causes memory problems. While working on XML-based projects, I encountered repeated failures with XMLFeedSpider when processing large feeds on ScrapingHub (now Zyte). The OOM killer would terminate spiders that exceeded memory limits.
I developed XMLStreamer to address this problem. The library handles XML feeds of several gigabytes without loading the entire file into memory. This post covers the design decisions behind XMLStreamer and compares its approach to XMLFeedSpider.
Design approach
Most XML parsers load large chunks of the document, or the entire file, into memory before processing. With multi-gigabyte files or continuous streams, this results in a high memory footprint and frequent crashes in resource-constrained environments.
Scrapy’s XMLFeedSpider works for moderate use cases but struggles with large or compressed feeds. XMLStreamer addresses these limitations through:
- Streaming architecture: Processes XML incrementally rather than loading it into memory.
- GZIP support: Decompresses data on the fly without intermediate storage.
- Attribute-based filtering: Tokenizes and filters XML items based on specified attributes.
- Configurable limits: Buffer size and runtime limits can be tuned for different operational requirements.
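To make the streaming idea concrete before looking at XMLStreamer itself, here is a minimal sketch using only the Python standard library. The URL and tag name are placeholders, and this is not XMLStreamer's actual implementation, just the general pattern it builds on:

```python
# Illustrative only: incremental parsing with the standard library, to show
# the general streaming idea (not XMLStreamer's internals).
import urllib.request
import xml.etree.ElementTree as ET

url = "https://example.com/large-feed.xml"  # placeholder URL

with urllib.request.urlopen(url) as response:
    # iterparse() pulls data from the file-like response incrementally,
    # so the whole document is never held in memory at once.
    for _event, elem in ET.iterparse(response, events=("end",)):
        if elem.tag == "item":
            # ... process the item here ...
            elem.clear()  # drop the element to keep memory usage flat
```

The same pattern extends to compressed feeds by wrapping the response in `gzip.GzipFile(fileobj=response)`, so decompression also happens on the fly; XMLStreamer bundles that kind of handling instead of leaving it to the caller.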
Implementation
XMLStreamer uses a SAX-based approach, processing each XML element as it arrives from HTTP, HTTPS, FTP, or local files. This chunked processing prevents memory inflation with large feeds.
```python
from xmlstreamer import StreamInterpreter
import pprint

url = "https://example.com/large-feed.xml"
separator_tag = "item"  # The tag to identify individual items

interpreter = StreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    buffer_size=1024 * 128,  # 128 KB buffer
    max_running_time=600     # 10 minutes
)

for item in interpreter:
    pprint.pprint(item)
```
The separator_tag parameter defines the XML element boundary for individual items. Buffer size controls memory usage per read operation, while max_running_time provides a safeguard against runaway processes.
Example: Processing a large pricing feed
Consider an XML feed containing thousands of product entries, each with id, name, and price fields. XMLStreamer processes each <product> element individually as it arrives, rather than loading the entire feed into memory.
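For illustration, such a feed might look like the following; this is a hypothetical structure whose fields match the parsed example below:

```xml
<!-- Hypothetical feed layout; field names mirror the parsed example below -->
<products>
  <product>
    <id>42</id>
    <sku>31337</sku>
    <name>Your favorite product</name>
    <currency>USD</currency>
    <price>19.99</price>
  </product>
  <!-- ...thousands more <product> entries... -->
</products>
```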
A parsed <product> element in Python:
```python
{
    'id': '42',
    'sku': '31337',
    'name': 'Your favorite product',
    'currency': 'USD',
    'price': '19.99'
}
```
Each entry is processed and discarded before the next one is loaded, maintaining constant memory usage regardless of feed size. This approach works well for scenarios where items can be processed independently.
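As a sketch of that pattern, the loop below streams the hypothetical pricing feed and writes each product straight to a CSV file. The feed URL, output path, and dict-style item access are assumptions based on the parsed example above, not a prescribed workflow:

```python
import csv
from decimal import Decimal

from xmlstreamer import StreamInterpreter

# Parameters mirror the earlier example; the feed URL and output path
# are placeholders.
interpreter = StreamInterpreter(
    url="https://example.com/pricing-feed.xml",
    separator_tag="product",
    buffer_size=1024 * 128,
    max_running_time=600
)

with open("prices.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["id", "sku", "name", "price"])
    for item in interpreter:
        # Each product is handled and written out immediately, then dropped,
        # so memory stays flat no matter how long the feed is.
        writer.writerow([
            item.get("id"),
            item.get("sku"),
            item.get("name"),
            Decimal(item.get("price", "0")),
        ])
```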
Comparison with XMLFeedSpider
XMLFeedSpider works well for typical XML feeds but has limitations with large or continuous streams:
| Feature | XMLFeedSpider | XMLStreamer |
|---|---|---|
| Memory Efficiency | Parses larger chunks (risk of OOM) | Streams data in small chunks, minimizing memory usage |
| Compression Handling | May require manual decompression | Built-in GZIP support |
| Tokenization | Basic item extraction | SAX-based approach with custom filters and attributes |
| Buffer/Runtime Control | No configurable buffer or time limit | User-defined buffer size and max running time |
| Ease of Integration | Native to Scrapy | Standalone library (easily used with Scrapy too) |
XMLStreamer prioritizes memory efficiency over simplicity. For large data sources or compressed streams in resource-constrained environments, the streaming approach prevents OOM errors that would terminate XMLFeedSpider-based spiders.
Use cases
- News aggregation: Processing large RSS/Atom feeds from multiple sources without memory spikes.
- Large-scale scraping: Handling XML datasets of several gigabytes where traditional parsing would exceed available memory.
- Resource-constrained environments: Operating within strict memory limits on platforms like Zyte or limited VPS instances.
Acknowledgments
Ruslan Spivak’s “Let’s Build A Simple Interpreter” tutorial series influenced the design of XMLStreamer. His work on lexical analysis and parsing informed the architecture of the Tokenizer and StreamInterpreter classes. The tokenizer breaks XML feeds into manageable chunks similar to how a lexer processes input, while the interpreter orchestrates parsing in a streaming fashion.
Reference: Let’s Build A Simple Interpreter
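To illustrate the lexer analogy, here is a toy sketch of splitting a streamed feed into raw per-item chunks. It is purely illustrative and not the library's actual Tokenizer, which also handles GZIP, attribute filters, and buffer management:

```python
# Toy example of a lexer-like split on a separator tag (not XMLStreamer's code).
def iter_raw_items(stream, separator_tag="item", buffer_size=64 * 1024):
    """Yield raw <item>...</item> chunks from a text-mode stream."""
    open_tag = f"<{separator_tag}"
    close_tag = f"</{separator_tag}>"
    buffer = ""
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete item accumulated so far and keep the remainder.
        while True:
            start = buffer.find(open_tag)
            end = buffer.find(close_tag)
            if start == -1 or end == -1:
                break
            end += len(close_tag)
            yield buffer[start:end]
            buffer = buffer[end:]
```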
XMLStreamer solves a specific problem: processing large XML feeds without exhausting available memory. The streaming architecture and compression support make it suitable for resource-constrained environments where XMLFeedSpider would fail. The library remains open to contributions that align with UNIX philosophy principles.
Explore XMLStreamer on GitHub:
https://github.com/carlosplanchon/xmlstreamer