Building a Multi-Chain Blockchain Data Harvesting Platform: A Journey from Frustration to Solution

Jan 13, 2026 · Akshay Gupta · 8 min read


How a side project driven by personal frustration evolved into a unified platform handling Solana-scale data volumes—all in Python.


🔍 The Challenge: Blockchain Data is Everywhere, Yet Hard to Access

Here’s a paradox: blockchain data is the most transparent in the world, yet one of the hardest to actually use.

Every transaction, every block, every state change is recorded on an immutable ledger for anyone to verify. Yet accessing this data at scale remains one of the most frustrating challenges in the blockchain ecosystem.

Whether you’re building analytics dashboards, compliance systems, or research tools, you inevitably face the same problem: how do you reliably extract, transform, and load blockchain data across multiple networks without drowning in operational complexity?

This is the story of how a personal side project—born from frustration with exactly this problem—evolved into a unified data harvesting platform capable of handling everything from Ethereum to Solana-scale throughput.


💡 The Origin Story: A Simple Problem That Shouldn’t Be This Hard

When I joined, the data engineering team was already using blockchain-etl—an excellent open-source project that pioneered blockchain data extraction. It worked well, and the team had built solid pipelines around it.

But as we expanded to support more blockchain networks, something started to feel off. Each new chain meant a new repository: ethereum-etl, bitcoin-etl, solana-etl, and so on. Each had its own patterns, configurations, and quirks.

The operational overhead was growing:

  • Cognitive load: Understanding one repo didn’t mean you understood another. Each had slightly different abstractions.
  • Maintenance burden: A bug fix or improvement meant implementing it across multiple repositories.
  • Onboarding friction: New team members had to learn N different codebases instead of one.

I kept thinking: This shouldn’t be this hard. At its core, blockchain data extraction is a straightforward problem—fetch data from nodes, transform it, export it somewhere. Why does it require so much complexity?

So I started working on it during my personal time. Not because I thought I could build something revolutionary, but because I believed it was fundamentally a simple problem that could be solved with basics.

The turning point came during Diwali break last year. With a few uninterrupted days away from the daily grind, I finally had the space to step back and see the bigger picture. What emerged wasn’t a rewrite—it was a consolidation. The patterns were there all along; they just needed someone to connect them.


🏗️ Standing on Giants’ Shoulders: The Consolidation Journey

The blockchain-etl project deserves enormous credit. It proved that blockchain data could be systematically extracted and made available for analysis. Google BigQuery’s public blockchain datasets, built on this foundation, democratized access to on-chain data for researchers and analysts worldwide.

But the multi-repository architecture, while sensible for an open-source project with different maintainers for each chain, created challenges for operational teams managing multiple blockchains:

```mermaid
flowchart TB
    subgraph before [Before: Separate Repositories]
        eth[ethereum-etl]
        btc[bitcoin-etl]
        sol[solana-etl]
        cel[celestia-etl]
        dots[... more repos]
    end
    subgraph after [After: Unified Platform]
        unified[Single Codebase]
        unified --> shared[Shared Abstractions]
        unified --> consistent[Consistent Patterns]
        unified --> extensible[Easy Extension]
    end
    before -->|Consolidation| after
```

The consolidation wasn’t about replacing blockchain-etl—it was about adapting the proven concepts into a unified architecture that worked better for teams operating at scale. One codebase. Shared abstractions. Consistent patterns across all chains.


📊 The 3 Vs Challenge: Volume, Velocity, and Variety

Blockchain data presents a unique combination of big data challenges—what I call the “3 Vs” of blockchain:

Volume: Modern blockchains generate massive amounts of data. A single day of Ethereum activity can produce gigabytes of transaction data, logs, and traces. Historical backfills for chains that have been running for years require processing terabytes.

Velocity: High-throughput chains don’t wait for your pipeline to catch up. Solana produces blocks every 400 milliseconds. Fall behind, and you’re suddenly processing a backlog that grows faster than you can clear it.

Variety: Every blockchain is different. EVM chains share common patterns, but even among them, there are variations in trace formats, receipt structures, and RPC behaviors. Non-EVM chains like Solana, Stellar, or Celestia have entirely different data models.

Any solution that works for one chain but fails at another isn’t really a solution—it’s a workaround.


⚡ Solana-Scale Engineering: Where Theory Meets Reality

Solana was the proving ground. With thousands of transactions per second and ~400ms block times, it exposed every weakness in naive approaches:

  • Sequential processing? You’ll never catch up.
  • Unbounded concurrency? You’ll overwhelm RPC nodes and get rate-limited.
  • Memory-hungry approaches? You’ll OOM before finishing a day’s worth of data.

The solution required rethinking the pipeline architecture:

  • Pipelined processing: Overlap extraction, transformation, and export phases so no stage blocks another.
  • Adaptive batching: Adjust batch sizes based on data density and RPC response times.
  • Backpressure handling: Gracefully slow down when downstream systems can’t keep up.
  • Memory-conscious design: Process data in streams rather than loading entire datasets into memory.
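To make the pipelining and backpressure ideas concrete, here is a minimal sketch using bounded `asyncio` queues. All names are illustrative, not the platform's actual API: when the exporter falls behind, `put` blocks on the full queue and the extractor naturally slows down instead of ballooning memory.

```python
import asyncio

# Pipelined extract -> transform -> export, with backpressure via bounded queues.
# A full queue makes `put` await, so upstream stages slow to match downstream.

async def extract(block_numbers, out_q: asyncio.Queue):
    for number in block_numbers:
        raw = {"number": number, "txs": list(range(number % 5))}  # stand-in for an RPC call
        await out_q.put(raw)   # blocks when the queue is full (backpressure)
    await out_q.put(None)      # sentinel: no more blocks

async def transform(in_q: asyncio.Queue, out_q: asyncio.Queue):
    while (raw := await in_q.get()) is not None:
        await out_q.put({"block": raw["number"], "tx_count": len(raw["txs"])})
    await out_q.put(None)

async def export(in_q: asyncio.Queue, sink: list):
    while (record := await in_q.get()) is not None:
        sink.append(record)    # stand-in for a file or Kafka write

async def run_pipeline(block_numbers):
    raw_q = asyncio.Queue(maxsize=64)    # bounded => memory stays flat
    ready_q = asyncio.Queue(maxsize=64)
    sink: list = []
    # All three stages run concurrently; no stage blocks another end-to-end.
    await asyncio.gather(
        extract(block_numbers, raw_q),
        transform(raw_q, ready_q),
        export(ready_q, sink),
    )
    return sink

records = asyncio.run(run_pipeline(range(100, 110)))
print(len(records))  # 10
```

The queue size (64 here) is the tuning knob: larger buffers smooth over bursty RPC latency, smaller ones cap memory more aggressively.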

If the system could handle Solana, it could handle anything.


🐍 The Power of Basics: Problem-Solving Over Fancy Tooling

When facing high-throughput data challenges, it’s tempting to reach for complex solutions—specialized frameworks, exotic dependencies, or exploring other languages. I took a different approach: understand the problem deeply, then solve it with fundamentals.

The bottleneck in blockchain data extraction isn’t CPU-bound computation—it’s I/O. Network calls to RPC nodes, disk writes, downstream system latency. Once I understood this, the solution became clear.

The toolkit is simple:

  • async/await: For concurrent RPC calls without thread overhead
  • ProcessPoolExecutor: For CPU-bound transformation work that benefits from parallelism
  • threading: For background I/O operations that don’t block the main event loop

No exotic frameworks. No complex dependencies. Just standard library primitives, used correctly.
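A rough sketch of how these primitives fit together, with illustrative names of my own (not the project's actual code): the event loop fans out I/O-bound RPC calls concurrently, and CPU-bound decoding is handed to a `ProcessPoolExecutor` via `run_in_executor` so it never blocks the loop.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def decode_block(raw: dict) -> dict:
    # CPU-bound transformation (think ABI decoding) — runs in a worker process.
    return {"number": raw["number"], "tx_count": len(raw["txs"])}

async def fetch_block(number: int) -> dict:
    await asyncio.sleep(0.01)  # stand-in for an awaited RPC round trip
    return {"number": number, "txs": [f"tx{i}" for i in range(3)]}

async def harvest(numbers):
    loop = asyncio.get_running_loop()
    # I/O-bound: fetch all blocks concurrently on one event loop.
    raws = await asyncio.gather(*(fetch_block(n) for n in numbers))
    # CPU-bound: decode in parallel worker processes without blocking the loop.
    with ProcessPoolExecutor() as pool:
        decoded = await asyncio.gather(
            *(loop.run_in_executor(pool, decode_block, raw) for raw in raws)
        )
    return list(decoded)

if __name__ == "__main__":
    result = asyncio.run(harvest(range(5)))
    print(result[0])  # {'number': 0, 'tx_count': 3}
```

The split mirrors the bottleneck analysis above: waiting on the network costs almost nothing per task with `async/await`, while the comparatively rare CPU-heavy work gets real parallelism from processes.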

The result? A system that handles Solana-scale throughput while remaining readable, maintainable, and accessible to any developer. Good architecture beats complexity every time.


🧱 Architecture: Simple by Design

The architecture follows a straightforward three-layer pattern:

```mermaid
flowchart LR
    subgraph sources [Blockchain Networks]
        evm[EVM Chains]
        solana[Solana]
        stellar[Stellar]
        others[Other Networks]
    end
    subgraph core [Core Pipeline]
        ext[Extractors]
        proc[Processors]
        exp[Exporters]
    end
    subgraph outputs [Output Destinations]
        file[File System]
        kafka[Kafka]
        console[Console]
    end
    sources --> ext
    ext --> proc
    proc --> exp
    exp --> outputs
```

Extractors handle RPC communication with blockchain nodes. They know how to fetch raw data efficiently—batching requests, handling rate limits, retrying failures.

Processors transform raw blockchain data into structured formats. They normalize the chaos of different RPC responses into consistent, well-defined schemas.

Exporters save processed data to various destinations. File system for batch processing, Kafka for streaming pipelines, console for debugging.

Each layer has a clean interface. Adding a new blockchain means implementing an Extractor and Processor. Adding a new output destination means implementing an Exporter. The core pipeline doesn’t change.
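The layer contracts described above can be sketched with `typing.Protocol` — the interface names and signatures here are my assumptions for illustration, not the platform's real API. The point is the shape: the pipeline only knows the three interfaces, so new chains and destinations plug in without touching it.

```python
from typing import Iterable, Protocol

class Extractor(Protocol):
    def extract(self, start: int, end: int) -> Iterable[dict]: ...

class Processor(Protocol):
    def process(self, raw: dict) -> dict: ...

class Exporter(Protocol):
    def export(self, record: dict) -> None: ...

class Pipeline:
    """Core loop: depends only on the three interfaces, never on a chain."""
    def __init__(self, extractor: Extractor, processor: Processor, exporter: Exporter):
        self.extractor, self.processor, self.exporter = extractor, processor, exporter

    def run(self, start: int, end: int) -> None:
        for raw in self.extractor.extract(start, end):
            self.exporter.export(self.processor.process(raw))

# Minimal fakes to show the wiring:
class FakeExtractor:
    def extract(self, start, end):
        return ({"number": n} for n in range(start, end))

class PassThroughProcessor:
    def process(self, raw):
        return {"block": raw["number"]}

class ListExporter:
    def __init__(self):
        self.records = []
    def export(self, record):
        self.records.append(record)

exporter = ListExporter()
Pipeline(FakeExtractor(), PassThroughProcessor(), exporter).run(0, 3)
print(exporter.records)  # [{'block': 0}, {'block': 1}, {'block': 2}]
```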


🌐 Multi-Chain Capabilities: One Platform, Many Networks

The platform currently supports:

  • EVM-compatible chains: Ethereum, Polygon, BSC, Arbitrum, Optimism, Avalanche, Base, Linea, and others
  • High-throughput L1s: Solana with its unique account model and transaction structure
  • Alternative networks: Stellar, Celestia, Ripple, TON, Zilliqa, Aleo

Each chain has its quirks, but the unified architecture means operational knowledge transfers. Learn how the platform handles one chain, and you understand 80% of how it handles any chain.


✨ Key Features: Real-Time and Historical

Real-time streaming: Follow the chain tip with configurable lag, processing blocks as they’re produced. Automatic sync tracking enables seamless resume after interruptions.

Historical extraction: Backfill any date range or block range. Parallel processing makes short work of large historical datasets.
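One common way to parallelize a backfill — and a plausible reading of the approach, though the chunk sizes, names, and worker model here are purely illustrative — is to split the block range into fixed-size chunks and fan them out to workers:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_range(start: int, end: int, size: int):
    """Yield (chunk_start, chunk_end) pairs covering [start, end] inclusive."""
    for lo in range(start, end + 1, size):
        yield lo, min(lo + size - 1, end)

def backfill_chunk(bounds):
    lo, hi = bounds
    # Stand-in for extract/process/export of this sub-range;
    # returns the number of blocks handled.
    return hi - lo + 1

def backfill(start: int, end: int, size: int = 1000, workers: int = 4) -> int:
    # Each chunk is independent, so workers can process them in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(backfill_chunk, chunk_range(start, end, size)))

print(backfill(0, 9999))  # 10000
```

Independent chunks also make resume cheap: record which chunks finished, and an interrupted backfill restarts only the incomplete ones.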

Flexible output: Same data, multiple destinations. Write to local files during development, stream to Kafka in production, export to cloud storage for data lake integration.

Pipelined processing: The newer streaming architecture overlaps extraction, processing, and export for 2-3x throughput improvements on high-frequency chains.


🛡️ Reliability at Scale: When Things Go Wrong

Distributed systems fail. RPC nodes time out. Networks hiccup. The question isn’t whether failures happen—it’s how the system responds.

Automatic retry with exponential backoff: Transient failures resolve themselves without manual intervention.
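A generic retry-with-backoff helper looks roughly like this (a minimal sketch, not the project's actual implementation). Jitter matters at scale: it spreads retries out so concurrent workers don't hammer a recovering RPC node in lockstep.

```python
import random
import time

def with_retries(fn, *, attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying transient errors with capped, jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff

# Demo: an RPC stand-in that fails twice, then succeeds.
calls = {"n": 0}
def flaky_rpc():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "block 123"

print(with_retries(flaky_rpc, base_delay=0.01))  # block 123
```

Only transient error types are caught; a genuine bug (say, a `ValueError` from malformed data) should fail fast rather than be retried.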

Progress tracking and resume: Long-running extractions can be interrupted and resumed without losing progress or duplicating work.

Atomic batch semantics: Either a batch succeeds completely, or it’s retried entirely. No partial state to debug.

Graceful degradation: When RPC nodes struggle, the system slows down rather than crashing. It catches up when conditions improve.


💎 Key Takeaways

  • Understand the problem first — The bottleneck was I/O, not CPU. This shaped everything.
  • Basics beat complexity — async/await, process pools, threading. No magic required.
  • Unify, don’t multiply — One codebase > N repositories, always.
  • Design for the hardest case — If it handles Solana, it handles anything.

🎯 Conclusion: Building for the Multi-Chain Future

The blockchain ecosystem isn’t converging on a single winner. It’s heading toward a multi-chain world where value and activity flow across dozens of networks. Any data infrastructure built for this future needs to handle that diversity without proportional complexity.

What started as a side project to scratch a personal itch became something more: a proof that blockchain data extraction doesn’t require exotic technology. Python, properly architected, handles Solana-scale throughput. A unified codebase, thoughtfully designed, supports a dozen different blockchain architectures.

The problem was never that hard. It just needed someone frustrated enough to simplify it.


This article was also published officially on the Merkle Science blog, where I am the author. This personal version carries the same core content with more of the behind-the-scenes story.

This article describes high-level architectural approaches to blockchain data extraction. Specific implementation details remain proprietary.


Have you faced similar challenges with blockchain data infrastructure? I’d love to hear your approach—reach out and let’s exchange notes.

Akshay Gupta
Lead Data Engineer turning complex data chaos into reliable, cost-efficient infrastructure. IIT Guwahati Alum.