Parsing large RSS or Atom feeds often causes memory problems. While working on XML-based projects, I encountered repeated failures with XMLFeedSpider when processing large feeds on ScrapingHub (now Zyte). The OOM killer would terminate spiders that exceeded memory limits.

I developed XMLStreamer to address this problem. The library handles XML feeds of several gigabytes without loading the entire file into memory. This post covers the design decisions behind XMLStreamer and compares its approach to XMLFeedSpider.


Design approach

Most XML parsers load large chunks, or even the entire document, into memory before processing. With multi-gigabyte files or continuous streams, this results in a high memory footprint and frequent crashes in resource-constrained environments.
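
To make this concrete, here is a minimal sketch using only Python's standard library (not XMLStreamer); the local large-feed.xml path and the <item>/<title> tags are assumptions for illustration:

import xml.etree.ElementTree as ET

# Whole-document parsing: the complete tree lives in memory at once,
# which is what blows up on multi-gigabyte feeds.
root = ET.parse("large-feed.xml").getroot()

# Incremental parsing: each <item> is handled and released as soon as it
# has been read, so memory use stays roughly flat.
for _event, elem in ET.iterparse("large-feed.xml", events=("end",)):
    if elem.tag == "item":
        print(elem.findtext("title"))
        elem.clear()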

Scrapy’s XMLFeedSpider works for moderate use cases but struggles with large or compressed feeds. XMLStreamer addresses these limitations through:

Streaming architecture: Processes XML incrementally rather than loading it into memory.

GZIP support: Decompresses data on the fly without intermediate storage (see the sketch after this list).

Attribute-based filtering: Tokenizes and filters XML items based on specified attributes.

Configurable limits: Buffer size and runtime limits can be tuned for different operational requirements.
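
As a rough illustration of the GZIP point, decompression can be done chunk by chunk with the standard library; this is a generic sketch of the idea, not XMLStreamer's actual internals, and the gzipped feed URL is hypothetical:

import urllib.request
import zlib

url = "https://example.com/large-feed.xml.gz"  # hypothetical gzipped feed

# wbits=47 lets zlib auto-detect gzip or zlib headers.
decompressor = zlib.decompressobj(wbits=47)
with urllib.request.urlopen(url) as response:
    while True:
        chunk = response.read(1024 * 128)  # read in 128 KB slices
        if not chunk:
            break
        xml_bytes = decompressor.decompress(chunk)
        # feed xml_bytes to an incremental XML parser here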


Implementation

XMLStreamer uses a SAX-based approach, processing each XML element as it arrives from HTTP, HTTPS, FTP, or local files. This chunked processing prevents memory inflation with large feeds.

from xmlstreamer import StreamInterpreter
import pprint

url = "https://example.com/large-feed.xml"
separator_tag = "item" # The tag to identify individual items
interpreter = StreamInterpreter(
    url=url,
    separator_tag=separator_tag,
    buffer_size=1024 * 128, # 128 KB buffer
    max_running_time=600 # 10 minutes
)

for item in interpreter:
    pprint.pprint(item)

The separator_tag parameter defines the XML element boundary for individual items. Buffer size controls memory usage per read operation, while max_running_time provides a safeguard against runaway processes.
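
To show how this kind of streaming works in principle, here is a simplified sketch built on the standard library's XMLPullParser; it illustrates the chunked, separator-driven idea but is not XMLStreamer's actual implementation:

import urllib.request
import xml.etree.ElementTree as ET

def iter_items(url, separator_tag, buffer_size=1024 * 128):
    """Yield a dict for each closing <separator_tag> element as chunks arrive."""
    parser = ET.XMLPullParser(events=("end",))
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(buffer_size)
            if not chunk:
                break
            parser.feed(chunk)
            for _event, elem in parser.read_events():
                if elem.tag == separator_tag:
                    yield {child.tag: child.text for child in elem}
                    elem.clear()  # release the element so memory stays flat

for item in iter_items("https://example.com/large-feed.xml", "item"):
    print(item)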


Example: Processing a large pricing feed

Consider an XML feed containing thousands of product entries, each with id, name, and price fields. XMLStreamer processes each <product> element individually as it arrives, rather than loading the entire feed into memory.

A parsed <product> element in Python:

{
  'id': '42',
  'sku': '31337',
  'name': 'Your favorite product',
  'currency': 'USD',
  'price': '19.99'
}

Each entry is processed and discarded before the next one is loaded, maintaining constant memory usage regardless of feed size. This approach works well for scenarios where items can be processed independently.
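
Assuming each parsed <product> arrives as a dict like the one above, a processing loop with XMLStreamer could look like this (the feed URL is hypothetical, and the exact keys depend on the feed):

from decimal import Decimal

from xmlstreamer import StreamInterpreter

interpreter = StreamInterpreter(
    url="https://example.com/products.xml",  # hypothetical pricing feed
    separator_tag="product",
    buffer_size=1024 * 128,
    max_running_time=600
)

total = Decimal("0")
for product in interpreter:
    # Each item is handled and then discarded, keeping memory usage flat.
    price = Decimal(product["price"])
    total += price
    print(f"{product['name']}: {price} {product['currency']}")

print(f"Sum of listed prices: {total}")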


Comparison with XMLFeedSpider

XMLFeedSpider works well for typical XML feeds but has limitations with large or continuous streams:

Feature               | XMLFeedSpider                      | XMLStreamer
Memory efficiency     | Parses larger chunks (risk of OOM) | Streams data in small chunks, minimizing memory usage
Compression handling  | May require manual decompression   | Built-in GZIP support
Tokenization          | Basic item extraction              | SAX-based, with custom filters and attributes
Buffer/runtime limits | No configurable buffer             | User-defined buffer size and max running time
Integration           | Native to Scrapy                   | Standalone library, easily used with Scrapy

XMLStreamer prioritizes memory efficiency over simplicity. For large data sources or compressed streams in resource-constrained environments, the streaming approach prevents OOM errors that would terminate XMLFeedSpider-based spiders.


Use cases

News aggregation: Processing large RSS/Atom feeds from multiple sources without memory spikes (see the sketch after this list).

Large-scale scraping: Handling XML datasets of several gigabytes where traditional parsing would exceed available memory.

Resource-constrained environments: Operating within strict memory limits on platforms like Zyte or limited VPS instances.
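
As a sketch of the news aggregation case, several feeds can be drained one after another with the same StreamInterpreter API; the feed URLs are hypothetical, and each parsed item is assumed to be a dict as in the pricing example above:

from xmlstreamer import StreamInterpreter

feed_urls = [
    "https://example.com/news/rss.xml",
    "https://example.org/world/feed.xml",
]

for url in feed_urls:
    interpreter = StreamInterpreter(
        url=url,
        separator_tag="item",  # RSS entries are wrapped in <item> tags
        buffer_size=1024 * 64,
        max_running_time=300
    )
    for entry in interpreter:
        print(entry.get("title"), entry.get("link"))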


Acknowledgments

Ruslan Spivak’s “Let’s Build A Simple Interpreter” tutorial series influenced the design of XMLStreamer. His work on lexical analysis and parsing informed the architecture of the Tokenizer and StreamInterpreter classes. The tokenizer breaks XML feeds into manageable chunks similar to how a lexer processes input, while the interpreter orchestrates parsing in a streaming fashion.

Reference: Let’s Build A Simple Interpreter


XMLStreamer solves a specific problem: processing large XML feeds without exhausting available memory. The streaming architecture and compression support make it suitable for resource-constrained environments where XMLFeedSpider would fail. The library remains open to contributions that align with UNIX philosophy principles.


Explore XMLStreamer on GitHub:
https://github.com/carlosplanchon/xmlstreamer