SiMoD 2024

Abstract: Andrew will discuss engineering tradeoffs made when building Apache DataFusion, an open source and extensible query engine used as the basis of many commercial and open source projects. These decisions (mostly) favored simplicity and worked better than initially expected. He will cover the rationale for which parts of DataFusion use pre-existing standards, such as Arrow and Parquet, and which parts are built “from scratch”, such as vectorized hashing and normalized sort keys. He will also discuss DataFusion’s design philosophy of extensible APIs paired with simple default implementations. Finally, he will offer lessons learned, enumerating what worked well and what could have been improved.

Bio: Andrew Lamb has experience in environments ranging from two developers in a VC's office to large multinational corporations and distributed open source projects. He focuses on systems programming (e.g., databases) and platform engineering, and has paid leadership dues as both an architect and a manager/VP. As a Staff Engineer at InfluxData, he works on InfluxDB 3.0's IOx Engine, a new time series database written in Rust. He is a Member of the Apache Software Foundation and a member and past chair of the Apache Arrow PMC, and he actively contributes to the Apache Arrow DataFusion query engine and the Apache Arrow Rust implementation.

Invited Talk - A Simple Approach to Transactions in Database Systems

Speaker: José Orlando Pereira, Associate Professor at U. Minho and Senior Researcher at INESCTEC (Portugal)

Abstract: Reversing the trend of early designs, current NoSQL and NewSQL data management systems increasingly provide some form of multi-item transactional isolation, recovery, and replication. Although these capabilities are more diverse than in traditional SQL systems, they are still implemented much as in those systems, at the storage management layer. We report our experience with a simpler and more flexible approach: implementing multi-item transactions logically at the query processing level and with a high-level query language. We detail how this idea came to be, summarize current results, and discuss the technical requirements for a database management system to make this possible and efficient.

Bio: José Orlando Pereira (https://jopereira.github.io/) is an Associate Professor at the U. Minho and a Senior Researcher at INESCTEC in Portugal. His work focuses on dependable distributed systems, mainly in data management, including storage systems, replication and transactions, polyglot systems and polystores, and group communication, including consensus and gossip-based protocols for large-scale systems (e.g., HyParView and PlumTree). He is also interested in tools for testing, evaluation, and monitoring.

Keynote - DataFusion: The Case for Building Data Systems using Open Standards
