
# Introduction
While groupby().sum() and groupby().mean() are fine for quick checks, production-level metrics require more robust solutions. Real-world tables often involve multiple keys, time-series data, weights, and various conditions like promotions, returns, or outliers.
This means you frequently need to compute totals and rates, rank items within each segment, roll up data by calendar buckets, and then merge group statistics back to the original rows for modeling. This article will guide you through advanced grouping techniques using the Pandas library to handle these complex scenarios effectively.
# Picking the Right Mode
// Using agg to Reduce Groups to One Row
Use agg when you want one record per group, such as totals, means, medians, min/max values, and custom vectorized reductions.
out = (
    df.groupby(['store', 'cat'], as_index=False, sort=False)
      .agg(sales=('rev', 'sum'),
           orders=('order_id', 'nunique'),
           avg_price=('price', 'mean'))
)
This is good for Key Performance Indicator (KPI) tables, weekly rollups, and multi-metric summaries.
// Using transform to Broadcast Statistics Back to Rows
The transform method returns a result with the same shape as the input. It is ideal for creating features you need on each row, such as z-scores, within-group shares, or groupwise fills.
g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
df['rev_share'] = df['rev'] / g.transform('sum')
This is good for modeling features, quality assurance ratios, and imputations.
// Using apply for Custom Per-Group Logic
Use apply only when the required logic cannot be expressed with built-in functions. It is slower and harder to optimize, so you should try agg or transform first.
def capped_mean(s):
    q1, q3 = s.quantile([.25, .75])
    return s.clip(q1, q3).mean()

df.groupby('store')['rev'].apply(capped_mean)
This is good for bespoke rules and small groups.
// Using filter to Keep or Drop Entire Groups
The filter method allows entire groups to pass or fail a condition. This is handy for data quality rules and thresholding.
big = df.groupby('store').filter(lambda g: g['order_id'].nunique() >= 100)
This is good for minimum-size cohorts and for removing sparse categories before aggregation.
# Multi-Key Grouping and Named Aggregations
// Grouping by Multiple Keys
You can control the output shape and order so that results can be dropped straight into a business intelligence tool.
g = df.groupby(['store', 'cat'], as_index=False, sort=False, observed=True)
- as_index=False returns a flat DataFrame, which is easier to join and export
- sort=False avoids reordering groups, which saves work when order is irrelevant
- observed=True (with categorical columns) drops unused category pairs
// Using Named Aggregations
Named aggregations produce readable, SQL-like column names.
out = (
    df.groupby(['store', 'cat'])
      .agg(sales=('rev', 'sum'),
           orders=('order_id', 'nunique'),  # use your id column here
           avg_price=('price', 'mean'))
)
// Tidying Columns
If you stack multiple aggregations, you will get a MultiIndex. Flatten it once and standardize the column order.
out = out.reset_index()
out.columns = [
    '_'.join(c) if isinstance(c, tuple) else c
    for c in out.columns
]
# optional: ensure business-friendly column order
cols = ['store', 'cat', 'orders', 'sales', 'avg_price']
out = out[cols]
# Conditional Aggregations Without apply
// Using Boolean-Mask Math Inside agg
When a mask depends on other columns, align the data by its index.
# promo sales and promo rate by (store, cat)
cond = df['is_promo']
out = df.groupby(['store', 'cat']).agg(
    promo_sales=('rev', lambda s: s[cond.loc[s.index]].sum()),
    promo_rate=('is_promo', 'mean')  # proportion of promo rows
)
// Calculating Rates and Proportions
A rate is simply sum(mask) / size, which is equivalent to the mean of a boolean column.
df['is_return'] = df['status'].eq('returned')
rates = df.groupby('store').agg(return_rate=('is_return', 'mean'))
// Creating Cohort-Style Windows
First, precompute masks with date bounds, and then aggregate the data.
# example: repeat purchase within 30 days of first purchase per customer cohort
first_ts = df.groupby('customer_id')['ts'].transform('min')
df['within_30'] = (df['ts'] <= first_ts + pd.Timedelta('30D')) & (df['ts'] > first_ts)
# customer cohort = month of first purchase
df['cohort'] = first_ts.dt.to_period('M').astype(str)
repeat_30_rate = (
    df.groupby('cohort')
      .agg(repeat_30_rate=('within_30', 'mean'))
      .rename_axis(None)
)
# Weighted Metrics Per Group
// Implementing a Weighted Average Pattern
Vectorize the math and guard against zero-weight divisions.
import numpy as np
tmp = df.assign(wx=df['price'] * df['qty'])
agg = tmp.groupby(['store', 'cat']).agg(wx=('wx', 'sum'), w=('qty', 'sum'))
# weighted average price per (store, cat)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)
// Handling NaN Values Safely
Decide what to return for empty groups or all-NaN values. Two common choices are:
# 1) Return NaN (transparent, safest for downstream stats)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)
# 2) Fallback to unweighted mean if all weights are zero (explicit policy)
mean_price = df.groupby(['store', 'cat'])['price'].mean()
agg['wavg_price_safe'] = np.where(
    agg['w'] > 0, agg['wx'] / agg['w'], mean_price.reindex(agg.index).to_numpy()
)
# Time-Aware Grouping
// Using pd.Grouper with a Frequency
Respect calendar boundaries for KPIs by grouping time-series data into specific intervals.
weekly = df.groupby(['store', pd.Grouper(key='ts', freq='W')], observed=True).agg(
    sales=('rev', 'sum'), orders=('order_id', 'nunique')
)
// Applying Rolling/Expanding Windows Per Group
Always sort your data first and align on the timestamp column.
df = df.sort_values(['customer_id', 'ts'])
df['rev_30d_mean'] = (
    df.groupby('customer_id')
      .rolling('30D', on='ts')['rev'].mean()
      .reset_index(level=0, drop=True)
)
// Avoiding Data Leakage
Keep chronological order and ensure that windows only “see” past data. Do not shuffle time-series data, and do not compute group statistics on the full dataset before splitting it for training and testing.
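As a minimal sketch of both rules (reusing the df, customer_id, ts, rev, and store columns from the examples above; the cutoff date and new column names are hypothetical, not part of the original example), you could lag the groupwise feature and fit group statistics on the training window only:
# lag the rolling feature so each row's value excludes the row itself
df['rev_30d_mean_lagged'] = df.groupby('customer_id')['rev_30d_mean'].shift(1)
# compute group statistics on the training window only, then map them onto all rows
cutoff = pd.Timestamp('2024-01-01')  # hypothetical train/test split date
train_store_mean = df.loc[df['ts'] < cutoff].groupby('store')['rev'].mean()
df['store_rev_mean_train'] = df['store'].map(train_store_mean)
The shift keeps each rolling window strictly in the past, and mapping train-only statistics avoids leaking test-period information into features.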
# Ranking and Top-N Within Groups
// Finding the Top-k Rows Per Group
Here are two practical options for selecting the top N rows from each group.
# Sort + head
top3 = (df.sort_values(['cat', 'rev'], ascending=[True, False])
          .groupby('cat')
          .head(3))
# Per-group nlargest on one metric
top3_alt = (df.groupby('cat', group_keys=False)
              .apply(lambda g: g.nlargest(3, 'rev')))
// Using Helper Functions
Pandas provides several helper functions for ranking and selection.
rank — Controls how ties are handled (e.g., method='dense' or 'first') and can calculate percentile ranks with pct=True.
df['rev_rank_in_cat'] = df.groupby('cat')['rev'].rank(method='dense', ascending=False)
cumcount — Provides the 0-based position of each row within its group.
df['pos_in_store'] = df.groupby('store').cumcount()
nth — Picks the k-th row per group without sorting the entire DataFrame.
second_row = df.groupby('store').nth(1) # the second row present per store
# Broadcasting Features with transform
// Performing Groupwise Normalization
Standardize a metric within each group so that rows become comparable across different groups.
g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
// Imputing Missing Values
Fill missing values with a group statistic. This often keeps distributions closer to reality than using a global fill value.
df['price'] = df['price'].fillna(df.groupby('cat')['price'].transform('median'))
// Creating Share-of-Group Features
Turn raw numbers into within-group proportions for cleaner comparisons.
df['rev_share_in_store'] = df['rev'] / df.groupby('store')['rev'].transform('sum')
# Handling Categories, Empty Groups, and Missing Data
// Improving Speed with Categorical Types
If your keys come from a fixed set (e.g., stores, regions, product categories), cast them to a categorical type once. This makes GroupBy operations faster and more memory-efficient.
from pandas.api.types import CategoricalDtype
store_type = CategoricalDtype(categories=sorted(df['store'].dropna().unique()), ordered=False)
df['store'] = df['store'].astype(store_type)
cat_type = CategoricalDtype(categories=['Grocery', 'Electronics', 'Home', 'Clothing', 'Sports'])
df['cat'] = df['cat'].astype(cat_type)
// Dropping Unused Combinations
When grouping on categorical columns, setting observed=True excludes category pairs that do not actually occur in the data, resulting in cleaner outputs with less noise.
out = df.groupby(['store', 'cat'], observed=True).size().reset_index(name="n")
// Grouping with NaN Keys
Be explicit about how you handle missing keys. By default, Pandas drops NaN groups; keep them only if it helps with your quality assurance process.
# Default: NaN keys are dropped
by_default = df.groupby('region').size()
# Keep NaN as its own group when you need to audit missing keys
kept = df.groupby('region', dropna=False).size()
# Quick Cheatsheet
// Calculating a Conditional Rate Per Group
# mean of a boolean is a rate
df.groupby(keys).agg(rate=('flag', 'mean'))
# or explicitly: sum(mask)/size
df.groupby(keys).agg(rate=('flag', lambda s: s.sum() / s.size))
// Calculating a Weighted Mean
(df.assign(wx=df[x] * df[w])
   .groupby(keys)
   .apply(lambda g: g['wx'].sum() / g[w].sum() if g[w].sum() else np.nan)
   .rename('wavg'))
// Finding the Top-k Per Group
(df.sort_values([key, metric], ascending=[True, False])
   .groupby(key)
   .head(k))
# or
df.groupby(key, group_keys=False).apply(lambda g: g.nlargest(k, metric))
// Calculating Weekly Metrics
df.groupby([key, pd.Grouper(key='ts', freq='W')], observed=True).agg(...)
// Performing a Groupwise Fill
df[col] = df[col].fillna(df.groupby(keys)[col].transform('median'))
// Calculating Share Within a Group
df['share'] = df[val] / df.groupby(keys)[val].transform('sum')
# Wrapping Up
First, choose the right mode for your task: use agg to reduce, transform to broadcast, and reserve apply for when vectorization is not an option. Lean on pd.Grouper for time-based buckets and ranking helpers for top-N selections. By favoring clear, vectorized patterns, you can keep your outputs flat, named, and easy to test, ensuring your metrics stay correct and your notebooks run fast.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.