
# Introduction
There are numerous tools for processing datasets today. They all claim — of course they do — that they’re the best and the right choice for you. But are they? There are two main requirements these tools should satisfy: they should easily perform everyday data analysis operations and do so quickly, even under the pressure of large datasets.
To determine which of DuckDB, SQLite, and Pandas comes out on top, we tested all three against these two requirements.
First, we gave them only everyday analytical tasks: summing values, grouping by categories, filtering with conditions, and multi-field aggregations. This reflects how analysts actually work with real datasets, rather than contrived scenarios designed to showcase a tool's best characteristics.
Second, we performed those operations on a Kaggle dataset with over 1 million rows. It’s a realistic tipping point — small enough to run on a single machine, yet large enough that memory pressure and query speed start to reveal clear differences between tools.
Let’s see how those tests went.
# The Dataset We Used
// Dataset Overview
We used the Bank dataset from Kaggle. This dataset contains over 1 million rows, comprising five columns:
| Column Name | Description |
| --- | --- |
| Date | The date the transaction occurred |
| Domain | The business category or type (RETAIL, RESTAURANT) |
| Location | Geographic region (Goa, Mathura) |
| Value | Transaction value |
| Transaction_count | The total number of transactions on that day |
This dataset was generated synthetically with Python. While it may not fully resemble real-life data, its size and structure are sufficient to test and compare performance differences between the tools.
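As a rough illustration of that structure, a similar frame could be generated along the following lines. This is only a sketch: the category lists, value ranges, and random seed here are made up and will not match the actual Kaggle file.
import numpy as np
import pandas as pd

# Illustrative only: build a frame with the same five columns and roughly 1M rows
rng = np.random.default_rng(42)
n = 1_000_000

synthetic = pd.DataFrame({
    "Date": pd.Timestamp("2022-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "Domain": rng.choice(["RETAIL", "RESTAURANT", "MEDICAL"], n),   # example categories
    "Location": rng.choice(["Goa", "Mathura", "Delhi"], n),         # example regions
    "Value": rng.integers(1_000, 100_000, n),
    "Transaction_count": rng.integers(1, 1_000, n),
})
print(synthetic.shape)  # (1000000, 5)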
// Peeking Into the Data with Pandas
We used Pandas to load the dataset into a Jupyter notebook and examine its general structure, dimensions, and null values. Here is the code.
import pandas as pd
# Reading an .xlsx file requires the openpyxl package to be installed
df = pd.read_excel('bankdataset.xlsx')
print("Dataset shape:", df.shape)
df.head()
Here is the output.
If you want a quick reference to common operations when exploring datasets, check out this handy Pandas Cheat Sheet.
Before benchmarking, let’s see how to set up the environment.
# Setting Up a Fair Testing Environment
All three tools, DuckDB, SQLite, and Pandas, were set up and run in the same Jupyter Notebook environment to keep the comparison fair. This kept runtime conditions and available memory consistent across all tests.
First, we installed and loaded the necessary packages.
Here are the tools we needed:
- pandas: for standard DataFrame operations
- duckdb: for SQL execution on a DataFrame
- sqlite3: for managing an embedded SQL database
- time: for capturing execution time
- memory_profiler: to measure memory allocation
# Install if any of them are not in your environment
!pip install duckdb memory_profiler openpyxl --quiet
import pandas as pd
import duckdb
import sqlite3
import time
from memory_profiler import memory_usage
Now let’s prepare the data in a format that can be shared across all three tools.
// Loading Data into Pandas
We’ll use Pandas to load the dataset once, then register it with DuckDB and copy it into SQLite.
df = pd.read_excel('bankdataset.xlsx')
df.head()
Here is the output to validate.
// Registering Data with DuckDB
DuckDB lets you directly access Pandas DataFrames. You don’t have to convert anything; just register the DataFrame and query it. Here is the code.
# Register DataFrame as a DuckDB table
duckdb.register("bank_data", df)
# Query via DuckDB
duckdb.query("SELECT * FROM bank_data LIMIT 5").to_df()
Here is the output.
// Preparing Data for SQLite
Since SQLite doesn’t read Excel files directly, we started by loading the Pandas DataFrame into an in-memory database. After that, we used a simple query to examine the data format.
conn_sqlite = sqlite3.connect(":memory:")
df.to_sql("bank_data", conn_sqlite, index=False, if_exists="replace")
pd.read_sql_query("SELECT * FROM bank_data LIMIT 5", conn_sqlite)
Here is the output.
# How We Benchmarked the Tools
We used the same four queries on DuckDB, SQLite, and Pandas to compare their performance. Each query was designed to address a common analytical task that mirrors how data analysis is applied in the real world.
// Ensuring Consistent Setup
All three tools worked from the same in-memory dataset:
- Pandas queried the DataFrame directly
- DuckDB executed SQL queries directly against the DataFrame
- SQLite stored a copy of the DataFrame in an in-memory database and ran SQL queries on it
This method ensured that all three tools used the same data and operated with the same system settings.
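As a quick sanity check before benchmarking (not part of the timed runs), we can confirm that all three access paths see the same number of rows. A minimal sketch, reusing the objects created above:
# Row counts should match across the three access paths
print("Pandas:", len(df))
print("DuckDB:", duckdb.query("SELECT COUNT(*) AS n FROM bank_data").to_df()["n"][0])
print("SQLite:", pd.read_sql_query("SELECT COUNT(*) AS n FROM bank_data", conn_sqlite)["n"][0])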
// Measuring Execution Time
To track query duration, we wrapped each query in a simple start/end timer using Python’s time module. Only the query execution time was recorded; data-loading and preparation steps were excluded.
// Tracking Memory Usage
Along with processing time, memory usage indicates how well each engine performs with large datasets.
We sampled memory usage with memory_profiler immediately before and after each query to estimate its incremental RAM consumption.
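The per-query code in the following sections repeats this same timing and memory pattern explicitly. It could just as well be wrapped in a small helper like the sketch below; the name run_benchmark is ours, and the rest of the article keeps the explicit version so each step stays visible.
import time
from memory_profiler import memory_usage

def run_benchmark(engine, query_name, fn, results):
    """Time fn() and record the memory delta around it (in MB)."""
    mem_before = memory_usage(-1)[0]   # current memory of this process
    start = time.time()
    fn()                               # run the query; the result itself is discarded
    end = time.time()
    mem_after = memory_usage(-1)[0]
    results.append({
        "engine": engine,
        "query": query_name,
        "time": round(end - start, 4),
        "memory": round(mem_after - mem_before, 4),
    })

# Example usage (hypothetical): run_benchmark("Pandas", "Total transaction value", lambda: df['Value'].sum(), [])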
// The Benchmark Queries
We tested each engine on the same four everyday analytical tasks:
- Total transaction value: summing a numeric column
- Group by domain: aggregating transaction counts per category
- Filter by location: filtering rows by a condition before aggregation
- Group by domain & location: multi-field aggregation with averages
# Benchmark Results
// Query 1: Total Transaction Value
Here we measure how Pandas, DuckDB, and SQLite perform when summing the Value column across the dataset.
// Pandas Performance
We calculate the total transaction value using .sum() on the Value column. Here is the code.
pandas_results = []
def pandas_q1():
return df['Value'].sum()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q1()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Total transaction value",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
pandas_results
Here is the output.
// DuckDB Performance
We calculate the total transaction value using a full-column aggregation. Here is the code.
duckdb_results = []
def duckdb_q1():
return duckdb.query("SELECT SUM(value) FROM bank_data").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q1()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Total transaction value",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
duckdb_results
Here is the output.
// SQLite Performance
We calculate the total transaction value by summing the value column. Here is the code.
sqlite_results = []
def sqlite_q1():
return pd.read_sql_query("SELECT SUM(value) FROM bank_data", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q1()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Total transaction value",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
sqlite_results
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import matplotlib.pyplot as plt
all_q1 = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
fig, axes = plt.subplots(1, 2, figsize=(10,4))
all_q1.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
all_q1.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
Pandas is by far the fastest and most memory-efficient here, completing almost instantly with minimal RAM usage. DuckDB is slightly slower and uses more memory but remains efficient, while SQLite is both the slowest and the heaviest in terms of memory consumption.
// Query 2: Group by Domain
Here we measure how Pandas, DuckDB, and SQLite perform when grouping transactions by Domain and summing their counts.
// Pandas Performance
We calculate the total transaction count per domain using .groupby() on the Domain column.
def pandas_q2():
return df.groupby('Domain')['Transaction_count'].sum()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q2()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Group by domain",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain"]
Here is the output.
// DuckDB Performance
We calculate the total transaction count per domain using a SQL GROUP BY on the domain column.
def duckdb_q2():
return duckdb.query("""
SELECT domain, SUM(transaction_count)
FROM bank_data
GROUP BY domain
""").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q2()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Group by domain",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain"]
Here is the output.
// SQLite Performance
We calculate the total transaction count per domain using SQL GROUP BY on the in-memory table.
def sqlite_q2():
return pd.read_sql_query("""
SELECT domain, SUM(transaction_count) AS total_txn
FROM bank_data
GROUP BY domain
""", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q2()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Group by domain",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain"]
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
groupby_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                   if r["query"] == "Group by domain"]
df_groupby = pd.DataFrame(groupby_results)
fig, axes = plt.subplots(1, 2, figsize=(10,4))
df_groupby.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_groupby.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB is the fastest; Pandas trades a bit more time for lower memory; and SQLite is both the slowest and the most memory-hungry.
// Query 3: Filter by Location (Goa)
Here we measure how Pandas, DuckDB, and SQLite perform when filtering the dataset for Location = 'Goa' and summing the transaction values.
// Pandas Performance
We filter rows for Location == 'Goa' and sum their values. Here is the code.
def pandas_q3():
return df[df['Location'] == 'Goa']['Value'].sum()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q3()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Filter by location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Filter by location"]
Here is the output.
// DuckDB Performance
We filter transactions for Location = 'Goa' and calculate their total value. Here is the code.
def duckdb_q3():
return duckdb.query("""
SELECT SUM(value)
FROM bank_data
WHERE location = 'Goa'
""").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q3()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Filter by location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Filter by location"]
Here is the output.
// SQLite Performance
We filter transactions for Location = 'Goa' and sum their values. Here is the code.
def sqlite_q3():
return pd.read_sql_query("""
SELECT SUM(value) AS total_value
FROM bank_data
WHERE location = 'Goa'
""", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q3()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Filter by location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Filter by location"]
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
filter_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
if r["query"] == "Filter by location"]
df_filter = pd.DataFrame(filter_results)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_filter.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_filter.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB is the fastest and most efficient; Pandas is slower with higher memory usage; and SQLite is the slowest but lighter on memory.
// Query 4: Group by Domain & Location
Here we measure how Pandas, DuckDB, and SQLite perform when calculating the average transaction value grouped by both Domain and Location.
// Pandas Performance
We calculate the average transaction value grouped by both Domain and Location. Here is the code.
def pandas_q4():
return df.groupby(['Domain', 'Location'])['Value'].mean()
mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q4()
end = time.time()
mem_after = memory_usage(-1)[0]
pandas_results.append({
"engine": "Pandas",
"query": "Group by domain & location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain & location"]
Here is the output.
// DuckDB Performance
We calculate the average transaction value grouped by both domain and location. Here is the code.
def duckdb_q4():
return duckdb.query("""
SELECT domain, location, AVG(value) AS avg_value
FROM bank_data
GROUP BY domain, location
""").to_df()
mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q4()
end = time.time()
mem_after = memory_usage(-1)[0]
duckdb_results.append({
"engine": "DuckDB",
"query": "Group by domain & location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain & location"]
Here is the output.
// SQLite Performance
We calculate the average transaction value grouped by both domain and location. Here is the code.
def sqlite_q4():
return pd.read_sql_query("""
SELECT domain, location, AVG(value) AS avg_value
FROM bank_data
GROUP BY domain, location
""", conn_sqlite)
mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q4()
end = time.time()
mem_after = memory_usage(-1)[0]
sqlite_results.append({
"engine": "SQLite",
"query": "Group by domain & location",
"time": round(end - start, 4),
"memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain & location"]
Here is the output.
// Overall Performance Analysis
Now let’s compare execution time and memory usage. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
gdl_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
if r["query"] == "Group by domain & location"]
df_gdl = pd.DataFrame(gdl_results)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df_gdl.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False,
title="Execution Time (s)")
df_gdl.plot(x="engine", y="memory", kind="barh", ax=axes[1], legend=False,
title="Memory Usage (MB)", color="salmon")
plt.tight_layout()
plt.show()
Here is the output.
DuckDB handles multi-field group-bys fastest with moderate memory use, Pandas is slower with very high memory usage, and SQLite is the slowest with substantial memory consumption.
# Final Comparison Across All Queries
We’ve compared the three engines query by query in terms of speed and memory. Now let’s bring the execution times for all four queries together in a single view. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
measure_order = [
"Total transaction value",
"Group by domain",
"Filter by location",
"Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for i, q in enumerate(measure_order):
d = all_results[all_results["query"] == q]
axes[i].barh(d["engine"], d["time"],
color=[engine_colors[e] for e in d["engine"]])
for y, v in enumerate(d["time"]):
axes[i].text(v, y, f" {v:.3f}", va="center")
axes[i].set_title(q, fontsize=10)
axes[i].set_xlabel("Seconds")
fig.suptitle("Per-Measure Comparison — Execution Time", fontsize=14)
plt.tight_layout()
plt.show()
Here is the output.
This chart shows that DuckDB consistently maintains the lowest execution times for almost all queries, except for the total transaction value where Pandas edges it out; SQLite is the slowest by a wide margin across the board. Let’s check memory next. Here is the code.
import pandas as pd
import matplotlib.pyplot as plt
all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
measure_order = [
"Total transaction value",
"Group by domain",
"Filter by location",
"Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for i, q in enumerate(measure_order):
d = all_results[all_results["query"] == q]
axes[i].barh(d["engine"], d["memory"],
color=[engine_colors[e] for e in d["engine"]])
for y, v in enumerate(d["memory"]):
axes[i].text(v, y, f" {v:.1f}", va="center")
axes[i].set_title(q, fontsize=10)
axes[i].set_xlabel("MB")
fig.suptitle("Per-Measure Comparison — Memory Usage", fontsize=14)
plt.tight_layout()
plt.show()
Here is the output.
This chart shows that SQLite swings between the best and the worst in memory usage, Pandas sits at both extremes with two best and two worst cases, and DuckDB stays consistently in the middle across all queries. As a result, DuckDB proves to be the most balanced choice overall, delivering consistently fast performance with moderate memory usage. Pandas shows extremes, sometimes the fastest and sometimes the heaviest, while SQLite struggles with speed and often ends up on the inefficient side for memory.
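If you prefer a numeric recap over charts, the same result lists can be pivoted into compact summary tables. A minimal sketch using the lists built throughout the benchmark:
summary = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

# One row per query, one column per engine
time_table = summary.pivot(index="query", columns="engine", values="time")
memory_table = summary.pivot(index="query", columns="engine", values="memory")

print(time_table.round(4))
print(memory_table.round(1))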
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.