
# Introduction
There are numerous tools for processing datasets today. They all claim — of course they do — that they’re the best and the right choice for you. But are they? There are two main requirements these tools should satisfy: they should easily perform everyday data analysis operations and do so quickly, even under the pressure of large datasets.
To determine which of DuckDB, SQLite, and Pandas comes out on top, we tested all three against these two requirements.
First, we gave them only everyday analytical tasks: summing values, grouping by categories, filtering with conditions, and multi-field aggregations. This reflects how analysts actually work with real datasets, rather than contrived scenarios designed to showcase a tool's best characteristics.
Second, we performed those operations on a Kaggle dataset with over 1 million rows. It’s a realistic tipping point — small enough to run on a single machine, yet large enough that memory pressure and query speed start to reveal clear differences between tools.
Let’s see how those tests went.
# The Dataset We Used
// Dataset Overview
We used the Bank dataset from Kaggle. This dataset contains over 1 million rows, comprising five columns:
| Column Name | Description |
| --- | --- |
| Date | The date the transaction occurred |
| Domain | The business category or type (RETAIL, RESTAURANT) |
| Location | Geographic region (Goa, Mathura) |
| Value | Transaction value |
| Transaction_count | The total number of transactions on that day |
This dataset was generated synthetically with Python. While it may not fully resemble real-life data, its size and structure are sufficient to test and compare performance differences between the tools.
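As a rough illustration of that structure, a similar frame could be generated along the following lines. This is only a sketch: the category lists, value ranges, and random seed here are made up and will not match the actual Kaggle file.
import numpy as np
import pandas as pd

# Illustrative only: build a frame with the same five columns and roughly 1M rows
rng = np.random.default_rng(42)
n = 1_000_000

synthetic = pd.DataFrame({
    "Date": pd.Timestamp("2022-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "Domain": rng.choice(["RETAIL", "RESTAURANT", "MEDICAL"], n),   # example categories
    "Location": rng.choice(["Goa", "Mathura", "Delhi"], n),         # example regions
    "Value": rng.integers(1_000, 100_000, n),
    "Transaction_count": rng.integers(1, 1_000, n),
})
print(synthetic.shape)  # (1000000, 5)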
// Peeking Into the Data with Pandas
We used Pandas to load the dataset into a Jupyter notebook and examine its general structure, dimensions, and null values. Here is the code.
import pandas as pd
# Reading an .xlsx file requires the openpyxl package to be installed
df = pd.read_excel('bankdataset.xlsx')
print("Dataset shape:", df.shape)
df.head()
Here is the output.
If you want a quick reference to common operations when exploring datasets, check out this handy Pandas Cheat Sheet.
Before benchmarking, let’s see how to set up the environment.
# Setting Up a Fair Testing Environment
All three tools, DuckDB, SQLite, and Pandas, were set up and run in the same Jupyter Notebook environment to keep the comparison fair. This kept runtime conditions and available memory consistent across all tests.
First, we installed and loaded the necessary packages.
Here are the tools we needed:
- pandas: for standard DataFrame operations
- duckdb: for SQL execution on a DataFrame
- sqlite3: for managing an embedded SQL database
- time: for capturing execution time
- memory_profiler: to measure memory allocation
# Install if any of them are not in your environment
!pip install duckdb memory_profiler openpyxl --quiet
import pandas as pd
import duckdb
import sqlite3
import time
from memory_profiler import memory_usage
Now let’s prepare the data in a format that can be shared across all three tools.
// Loading Data into Pandas
We’ll use Pandas to load the dataset once, then register it with DuckDB and copy it into SQLite.
df = pd.read_excel('bankdataset.xlsx')
df.head()
Here is the output to validate.
// Registering Data with DuckDB
DuckDB lets you directly access Pandas DataFrames. You don’t have to convert anything; just register the DataFrame and query it. Here is the code.
# Register DataFrame as a DuckDB table
duckdb.register("bank_data", df)
# Query via DuckDB
duckdb.query("SELECT * FROM bank_data LIMIT 5").to_df()
Here is the output.
// Preparing Data for SQLite
Since SQLite doesn’t read Excel files directly, we started by loading the Pandas DataFrame into an in-memory database. After that, we used a simple query to examine the data format.
conn_sqlite = sqlite3.connect(":memory:")
df.to_sql("bank_data", conn_sqlite, index=False, if_exists="replace")
pd.read_sql_query("SELECT * FROM bank_data LIMIT 5", conn_sqlite)
Here is the output.
# How We Benchmarked the Tools
We used the same four queries on DuckDB, SQLite, and Pandas to compare their performance. Each query was designed to address a common analytical task that mirrors how data analysis is applied in the real world.
// Ensuring Consistent Setup
All three tools worked from the same in-memory dataset:
- Pandas queried the DataFrame directly
- DuckDB executed SQL queries directly against the DataFrame
- SQLite stored a copy of the DataFrame in an in-memory database and ran SQL queries on it
This method ensured that all three tools used the same data and operated with the same system settings.
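As a quick sanity check before benchmarking (not part of the timed runs), we can confirm that all three access paths see the same number of rows. A minimal sketch, reusing the objects created above:
# Row counts should match across the three access paths
print("Pandas:", len(df))
print("DuckDB:", duckdb.query("SELECT COUNT(*) AS n FROM bank_data").to_df()["n"][0])
print("SQLite:", pd.read_sql_query("SELECT COUNT(*) AS n FROM bank_data", conn_sqlite)["n"][0])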
// Measuring Execution Time
To track query duration, we wrapped each query in a simple start/end timer using Python’s time module. Only the query execution time was recorded; data-loading and preparation steps were excluded.
// Tracking Memory Usage
Along with processing time, memory usage indicates how well each engine performs with large datasets.
We sampled memory usage with memory_profiler immediately before and after each query to estimate its incremental RAM consumption.
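The per-query code in the following sections repeats this same timing and memory pattern explicitly. It could just as well be wrapped in a small helper like the sketch below; the name run_benchmark is ours, and the rest of the article keeps the explicit version so each step stays visible.
import time
from memory_profiler import memory_usage

def run_benchmark(engine, query_name, fn, results):
    """Time fn() and record the memory delta around it (in MB)."""
    mem_before = memory_usage(-1)[0]   # current memory of this process
    start = time.time()
    fn()                               # run the query; the result itself is discarded
    end = time.time()
    mem_after = memory_usage(-1)[0]
    results.append({
        "engine": engine,
        "query": query_name,
        "time": round(end - start, 4),
        "memory": round(mem_after - mem_before, 4),
    })

# Example usage (hypothetical): run_benchmark("Pandas", "Total transaction value", lambda: df['Value'].sum(), [])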
// The Benchmark Queries
We tested each engine on the same four everyday analytical tasks:
- Total transaction value: summing a numeric column
- Group by domain: aggregating transaction counts per category
- Filter by location: filtering rows by a condition before aggregation
- Group by domain & location: multi-field aggregation with averages
# Benchmark Results
// Query 1: Total Transaction Value
Here we measure how Pandas, DuckDB, and SQLite perform when summing the Value column across the dataset.
// Pandas Performance
We calculate the total transaction value using .sum() on the Value column. Here is the code.
pandas_results = []
def pandas_q1():
return df['Value'].sum()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q1()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Total transaction value",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
pandas_results
Here is the output.
// DuckDB Performance
We calculate the total transaction value using a full-column aggregation. Here is the code.
duckdb_results = []
def duckdb_q1():
return duckdb.query("SELECT SUM(value) FROM bank_data").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q1()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Total transaction value",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
duckdb_results
Here is the output.
// SQLite Performance
We calculate the total transaction value by summing the value column. Here is the code.
sqlite_results = []
def sqlite_q1():
return pd.read_sql_query("SELECT SUM(value) FROM bank_data", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q1()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Total transaction value",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
sqlite_results
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import matplotlib.pyplot as plt
all_q1 = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
fig, axes = plt.subplots(1, 2, figsize=(10,4))
all_q1.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
all_q1.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
Pandas is by far the fastest and most memory-efficient here, completing almost instantly with minimal RAM usage. DuckDB is slightly slower and uses more memory but remains efficient, while SQLite is both the slowest and the heaviest in terms of memory consumption.
// Query 2: Group by Domain
Here we measure how Pandas, DuckDB, and SQLite perform when grouping transactions by Domain and summing their counts.
// Pandas Performance
We calculate the total transaction count per domain using .groupby() on the Domain column.
def pandas_q2():
return df.groupby('Domain')['Transaction_count'].sum()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q2()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Group by domain",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain"]
Here is the output.
// DuckDB Performance
We calculate the total transaction count per domain using a SQL GROUP BY on the domain column.
def duckdb_q2():
return duckdb.query("""
SELECT domain, SUM(transaction_count)
FROM bank_data
GROUP BY domain
""").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q2()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Group by domain",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain"]
Here is the output.
// SQLite Performance
We calculate the total transaction count per domain using SQL GROUP BY on the in-memory table.
def sqlite_q2():
return pd.read_sql_query("""
SELECT domain, SUM(transaction_count) AS total_txn
FROM bank_data
GROUP BY domain
""", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q2()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Group by domain",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain"]
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
groupby_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                   if r["query"] == "Group by domain"]
df_groupby = pd.DataFrame(groupby_results)
fig, axes = plt.subplots(1, 2, figsize=(10,4))
df_groupby.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_groupby.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB is the fastest; Pandas trades a bit more time for lower memory; and SQLite is both the slowest and the most memory-hungry.
// Query 3: Filter by Location (Goa)
Here we measure how Pandas, DuckDB, and SQLite perform when filtering the dataset for Location = 'Goa' and summing the transaction values.
// Pandas Performance
We filter rows for Location == 'Goa' and sum their values. Here is the code.
def pandas_q3():
return df[df['Location'] == 'Goa']['Value'].sum()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q3()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Filter by location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Filter by location"]
Here is the output.
// DuckDB Performance
We filter transactions for Location = 'Goa' and calculate their total value. Here is the code.
def duckdb_q3():
return duckdb.query("""
SELECT SUM(value)
FROM bank_data
WHERE location = 'Goa'
""").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q3()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Filter by location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Filter by location"]
Here is the output.
// SQLite Performance
We filter transactions for Location = 'Goa' and sum their values. Here is the code.
def sqlite_q3():
return pd.read_sql_query("""
SELECT SUM(value) AS total_value
FROM bank_data
WHERE location = 'Goa'
""", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q3()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Filter by location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Filter by location"]
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
filter_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
if r["query"] == "Filter by location"]
df_filter = pd.DataFrame(filter_results)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_filter.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_filter.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB is the fastest and most efficient; Pandas is slower with higher memory usage; and SQLite is the slowest but lighter on memory.
// Query 4: Group by Domain & Location
Here we measure how Pandas, DuckDB, and SQLite perform when calculating the average transaction value grouped by both Domain and Location.
// Pandas Performance
We calculate the average transaction value grouped by both Domain and Location. Here is the code.
def pandas_q4():
return df.groupby(['Domain', 'Location'])['Value'].mean()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q4()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Group by domain & location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain & location"]
Here is the output.
// DuckDB Performance
We calculate the average transaction value grouped by both domain and location. Here is the code.
def duckdb_q4():
return duckdb.query("""
SELECT domain, location, AVG(value) AS avg_value
FROM bank_data
GROUP BY domain, location
""").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q4()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Group by domain & location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain & location"]
Here is the output.
// SQLite Performance
We calculate the average transaction value grouped by both domain and location. Here is the code.
def sqlite_q4():
return pd.read_sql_query("""
SELECT domain, location, AVG(value) AS avg_value
FROM bank_data
GROUP BY domain, location
""", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q4()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Group by domain & location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain & location"]
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
gdl_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
if r["query"] == "Group by domain & location"]
df_gdl = pd.DataFrame(gdl_results)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_gdl.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False,
title="Execution Time (s)")
df_gdl.plot(x="engine", y="memory", kind="barh", ax=axes[1], legend=False,
title="Memory Usage (MB)", color="salmon")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB handles multi-field group-bys fastest with moderate memory use, Pandas is slower with very high memory usage, and SQLite is the slowest with substantial memory consumption.
# Final Comparison Across All Queries
We’ve compared the three engines query by query in terms of speed and memory. Now let’s bring the execution times for all four queries together in a single view. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
measure_order = [
"Total transaction value",
"Group by domain",
"Filter by location",
"Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for i, q in enumerate(measure_order):
d = all_results[all_results["query"] == q]
axes[i].barh(d["engine"], d["time"],
color=[engine_colors[e] for e in d["engine"]])
for y, v in enumerate(d["time"]):
axes[i].text(v, y, f" {v:.3f}", va="center")
axes[i].set_title(q, fontsize=10)
axes[i].set_xlabel("Seconds")
fig.suptitle("Per-Measure Comparison — Execution Time", fontsize=14)
plt.tight_layout()
plt.show()
Here is the output.
This chart shows that DuckDB consistently maintains the lowest execution times for almost all queries, except for the total transaction value where Pandas edges it out; SQLite is the slowest by a wide margin across the board. Let’s check memory next. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
measure_order = [
"Total transaction value",
"Group by domain",
"Filter by location",
"Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for i, q in enumerate(measure_order):
d = all_results[all_results["query"] == q]
axes[i].barh(d["engine"], d["memory"],
color=[engine_colors[e] for e in d["engine"]])
for y, v in enumerate(d["memory"]):
axes[i].text(v, y, f" {v:.1f}", va="center")
axes[i].set_title(q, fontsize=10)
axes[i].set_xlabel("MB")
fig.suptitle("Per-Measure Comparison — Memory Usage", fontsize=14)
plt.tight_layout()
plt.show()
Here is the output.
This chart shows that SQLite swings between the best and the worst in memory usage, Pandas sits at both extremes with two best and two worst cases, and DuckDB stays consistently in the middle across all queries. As a result, DuckDB proves to be the most balanced choice overall, delivering consistently fast performance with moderate memory usage. Pandas shows extremes, sometimes the fastest and sometimes the heaviest, while SQLite struggles with speed and often ends up on the inefficient side for memory.
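If you prefer a numeric recap over charts, the same result lists can be pivoted into compact summary tables. A minimal sketch using the lists built throughout the benchmark:
summary = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

# One row per query, one column per engine
time_table = summary.pivot(index="query", columns="engine", values="time")
memory_table = summary.pivot(index="query", columns="engine", values="memory")

print(time_table.round(4))
print(memory_table.round(1))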
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.