Apache Arrow DataFusion - A Primer

Apr 8, 2024
Apache Arrow DataFusion - A Primer
Interested in reading more?

Sign up for our Enterprise Weekly Newsletter.

We'll send you our top, curated content straight to your inbox (along with top industry news, events, and fundings).

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

This post was originally published on The Data Source on April 8th, 2024, my monthly newsletter covering the top innovation in data infrastructure, engineering and developer-first tooling. Subscribe here!

Apache Arrow DataFusion is a query engine built by Andy Grove, designed for building robust data-centric systems. Developed in Rust, a language known for its performance and memory safety, DataFusion serves as a foundational component for processing extensive datasets.

In the evolving landscape of data processing and analytics platforms, DataFusion distinguishes itself with its seamless integration with the Apache Arrow format. Apache Arrow, a cross-language development platform for in-memory data, is designed to expedite analytical processing. Leveraging Arrow's columnar memory layout and zero-copy data exchange capabilities, DataFusion enhances performance and efficiency in handling large datasets. This integration minimizes data movement and serialization overhead, thereby facilitating faster query execution and reducing computational costs.

Within the competitive market, various platforms offer unique strengths. Spark provides versatility and scalability alongside a rich ecosystem of libraries. Presto excels in interactive SQL querying over large datasets. Flink prioritizes real-time stream processing with low latency and high throughput. Dask competes in scalable data processing, particularly for Python-centric workflows. ClickHouse stands out for its prowess in real-time analytics.

The growing community around DataFusion suggests that more advancements will be made in the project. What I’m most excited about is seeing whether DataFusion ends up challenging Apache Spark's dominance, given Spark's widespread adoption and associated complexities around data movement and resource provisioning.

Currently, DataFusion is widely adopted across diverse data processing use cases:

  • Specialized analytical database systems like HoraeDB leverage DataFusion for in-depth data analysis.
  • General-purpose systems such as Ballista rely on DataFusion for diverse data processing tasks similar to Apache Spark.
  • Projects like Blaze use DataFusion as a native Spark runtime replacement, offering improved performance.
  • Streaming data platforms like Synnada benefit from DataFusion's capabilities in processing continuous data streams, suited for transaction-oriented systems.
  • Research platforms like Flock utilize DataFusion for experimenting with new ideas in data storage and analysis.

Tools To Know

Arroyo, a Rust-based distributed stream processing engine

We built a new SQL Engine on Arrow and DataFusion: Arroyo unveiled a revamped SQL engine powered by Apache Arrow and DataFusion SQL toolkit. Initially relying on DataFusion solely for SQL parsing, Arroyo now integrates its physical plans, operators, and expressions, enhancing flexibility and compatibility, especially for streaming operations. This shift mirrors the momentum within the Rust data community, with DataFusion spearheading efforts to bolster streaming capabilities alongside its established batch processing prowess.

Comet, plugin for Apache Spark enabling native query execution

Apple’s Comet Brings Fast Vector Processing to Apache Spark: Apple has released a plug-in designed to enhance Apache Spark's execution of vector searches, thus augmenting the platform's suitability for large-scale machine learning data analysis. Comet, built upon the adaptable Apache DataFusion query engine, written in Rust, and utilizing the Arrow columnar data format, aims to expedite Spark query execution. 

InfluxDB, a Time Series Database

Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0: InfluxDB 3.0 faces the challenge of efficiently handling incoming data and making it queryable. Instead of building a new engine from scratch, InfluxData chose to use Apache Arrow DataFusion for querying time series data using SQL. This enables them to implement custom features like late arrival resolution and time series gap filling with ease, and extend functionalities like the Flux language in their cloud environment.

ParadeDB, PostgreSQL for Search & Analytics

pg_analytics: Transforming Postgres into a Fast OLAP Database: Building an advanced analytical database in Postgres is expensive and challenging. Early attempts like Greenplum and subsequent products from Citus and Timescale have struggled to match the performance of non-Postgres databases. This has led many companies to favor alternatives like Elasticsearch. However, with the rise of embeddable query engines like DataFusion, the ParadeDB team believes the project can outpace many OLAP databases in query speed, indicating a shift away from building query engines from scratch within databases. By integrating with DataFusion, there can be continuous improvements made to the database performance.

If you’re a data practitioner leveraging DataFusion or interested in chatting about the broader landscape of data processing / analytics tools, please reach out as I’d love to swap notes!