
# Introduction
 
Stress testing is crucial for understanding how your application behaves under heavy load. For machine learning-powered APIs, it is especially important because model inference can be CPU-intensive. By simulating a large number of users, we can identify performance bottlenecks, determine the capacity of our system, and ensure reliability.
In this tutorial, we will be using:
- FastAPI: A modern, fast (high-performance) web framework for building APIs with Python.
- Uvicorn: An ASGI server to run our FastAPI application.
- Locust: An open-source load testing tool. You define user behavior with Python code, and swarm your system with hundreds of simultaneous users.
- Scikit-learn: For our example machine learning model.
# 1. Project Setup and Dependencies
 
Set up the project structure and install the necessary dependencies.
- Create a requirements.txt file and add the following Python packages:

```
fastapi==0.115.12
locust==2.37.10
numpy==2.3.0
pandas==2.3.0
pydantic==2.11.5
scikit-learn==1.7.0
uvicorn==0.34.3
orjson==3.10.18
```

- Open your terminal, create a virtual environment, and activate it:

```
python -m venv venv
venv\Scripts\activate
```

- Install all the Python packages using the requirements.txt file:

```
pip install -r requirements.txt
```
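A quick way to confirm the pinned packages were installed correctly (an optional sanity check, not part of the project code):

```python
import fastapi, locust, sklearn, uvicorn

# Should print the versions pinned in requirements.txt
print(fastapi.__version__, locust.__version__, sklearn.__version__, uvicorn.__version__)
```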
# 2. Building the FastAPI Application
 
In this section, we will create three files: one that trains and loads the regression model, one for the Pydantic models, and one for the FastAPI application itself.
The ml_model.py file handles the machine learning model. It uses a singleton pattern to ensure only one instance of the model is loaded. The model is a Random Forest Regressor trained on the California housing dataset. If a pre-trained model (model.pkl and scaler.pkl) doesn't exist, it trains and saves a new one.
app/ml_model.py:
```python
import os
import threading

import joblib
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class MLModel:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if not hasattr(self, "initialized"):
            self.model = None
            self.scaler = None
            self.model_path = "model.pkl"
            self.scaler_path = "scaler.pkl"
            self.feature_names = None
            self.initialized = True
            self.load_or_create_model()

    def load_or_create_model(self):
        """Load existing model or create a new one using California housing dataset"""
        if os.path.exists(self.model_path) and os.path.exists(self.scaler_path):
            self.model = joblib.load(self.model_path)
            self.scaler = joblib.load(self.scaler_path)
            housing = fetch_california_housing()
            self.feature_names = housing.feature_names
            print("Model loaded successfully")
        else:
            print("Creating new model...")
            housing = fetch_california_housing()
            X, y = housing.data, housing.target
            self.feature_names = housing.feature_names
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=42
            )
            self.scaler = StandardScaler()
            X_train_scaled = self.scaler.fit_transform(X_train)
            self.model = RandomForestRegressor(
                n_estimators=50,  # Reduced for faster predictions
                max_depth=8,  # Reduced for faster predictions
                random_state=42,
                n_jobs=1,  # Single thread for consistency
            )
            self.model.fit(X_train_scaled, y_train)
            joblib.dump(self.model, self.model_path)
            joblib.dump(self.scaler, self.scaler_path)
            X_test_scaled = self.scaler.transform(X_test)
            score = self.model.score(X_test_scaled, y_test)
            print(f"Model R² score: {score:.4f}")

    def predict(self, features):
        """Make prediction for house price"""
        features_array = np.array(features).reshape(1, -1)
        features_scaled = self.scaler.transform(features_array)
        prediction = self.model.predict(features_scaled)[0]
        return prediction * 100000

    def get_feature_info(self):
        """Get information about the features"""
        return {
            "feature_names": list(self.feature_names),
            "num_features": len(self.feature_names),
            "description": "California housing dataset features",
        }


# Initialize model as singleton
ml_model = MLModel()
```
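Because MLModel is a singleton, every import shares the same loaded model. Here is a quick sanity check you could run from the project root (a hypothetical snippet, not part of the app):

```python
from app.ml_model import MLModel, ml_model

# __new__ returns the existing instance, so no second model is loaded
assert MLModel() is ml_model

# Feature order: MedInc, HouseAge, AveRooms, AveBedrms,
# Population, AveOccup, Latitude, Longitude
price = ml_model.predict([8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23])
print(f"Predicted price: ${price:,.2f}")
```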
The pydantic_models.py file defines the Pydantic models for request and response data validation and serialization.
app/pydantic_models.py:
```python
from typing import List

from pydantic import BaseModel, Field


class PredictionRequest(BaseModel):
    features: List[float] = Field(
        ...,
        description="List of 8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude",
        min_length=8,
        max_length=8,
    )

    model_config = {
        "json_schema_extra": {
            "examples": [
                {"features": [8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]}
            ]
        }
    }
```
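Pydantic enforces the eight-feature constraint before a request ever reaches the endpoint. A minimal sketch of that behavior (assuming Pydantic v2, as pinned in requirements.txt):

```python
from pydantic import ValidationError
from app.pydantic_models import PredictionRequest

# Valid: exactly 8 floats
ok = PredictionRequest(features=[8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23])

# Invalid: fewer than 8 items fails min_length, so FastAPI
# would reject the request with a 422 before the endpoint runs
try:
    PredictionRequest(features=[1.0, 2.0])
except ValidationError as err:
    print(err)
```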
app/main.py: This file is the core FastAPI application, defining the API endpoints.
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from fastapi.responses import ORJSONResponse

from .ml_model import ml_model
from .pydantic_models import PredictionRequest


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pre-load the model at startup
    _ = ml_model.get_feature_info()
    yield


app = FastAPI(
    title="California Housing Price Prediction API",
    version="1.0.0",
    description="API for predicting California housing prices using Random Forest model",
    lifespan=lifespan,
    default_response_class=ORJSONResponse,
)


@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {"status": "healthy", "message": "Service is operational"}


@app.get("/model-info")
async def model_info():
    """Get information about the ML model"""
    try:
        feature_info = await asyncio.to_thread(ml_model.get_feature_info)
        return {
            "model_type": "Random Forest Regressor",
            "dataset": "California Housing Dataset",
            "features": feature_info,
        }
    except Exception:
        raise HTTPException(
            status_code=500, detail="Error retrieving model information"
        )


@app.post("/predict")
async def predict(request: PredictionRequest):
    """Make house price prediction"""
    if len(request.features) != 8:
        raise HTTPException(
            status_code=400,
            detail=f"Expected 8 features, got {len(request.features)}",
        )
    try:
        # Run the CPU-bound prediction in a worker thread so it
        # does not block the event loop
        prediction = await asyncio.to_thread(ml_model.predict, request.features)
        return {
            "prediction": float(prediction),
            "status": "success",
            "features_used": request.features,
        }
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception:
        raise HTTPException(status_code=500, detail="Prediction error")
```
Key points:
- lifespan manager: Ensures the ML model is loaded during application startup.
- asyncio.to_thread: This is crucial because scikit-learn's predict method is CPU-bound (synchronous). Running it in a separate thread prevents it from blocking FastAPI's asynchronous event loop, allowing the server to handle other requests concurrently, as sketched below.
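As a minimal illustration of the pattern (using the ml_model singleton from app/ml_model.py; not additional application code):

```python
import asyncio

from app.ml_model import ml_model


async def handle(features):
    # Blocking: would run on the event loop thread and stall
    # every other request until the prediction finishes
    # prediction = ml_model.predict(features)

    # Non-blocking: offloads the CPU-bound call to a worker thread,
    # so the event loop keeps serving other requests
    prediction = await asyncio.to_thread(ml_model.predict, features)
    return prediction
```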
Endpoints:
- /health: A simple health check.
- /model-info: Provides metadata about the ML model.
- /predict: Accepts a list of features and returns a house price prediction.
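Once the server is running, you can also sanity-check these endpoints from a short script (a sketch assuming the server is listening on localhost:8000; the requests library ships as a Locust dependency):

```python
import requests

BASE = "http://localhost:8000"

# Liveness check
print(requests.get(f"{BASE}/health").json())

# Model metadata
print(requests.get(f"{BASE}/model-info").json())

# Single prediction using the example feature vector from the schema
payload = {"features": [8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]}
print(requests.post(f"{BASE}/predict", json=payload).json())
```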
run_server.py: This script runs the FastAPI application with Uvicorn.
```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="localhost", port=8000, workers=4)
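```

Note that workers=4 starts four separate worker processes, and each process loads its own copy of the model; the singleton in ml_model.py is per-process, not shared across workers.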
All the files and configurations are available at the GitHub repository: kingabzpro/Stress-Testing-FastAPI
# 3. Writing the Locust Stress Test
 
Now, let’s create the stress test script using Locust.
tests/locustfile.py: This file defines the behavior of simulated users.
```python
import json
import logging
import random

from locust import HttpUser, between, task

# Reduce logging noise to improve performance
logging.getLogger("urllib3").setLevel(logging.WARNING)


class HousingAPIUser(HttpUser):
    # Each simulated user waits 0.5-2 seconds between tasks
    wait_time = between(0.5, 2)

    def generate_random_features(self):
        """Generate random but realistic California housing features"""
        return [
            round(random.uniform(0.5, 15.0), 4),  # MedInc
            round(random.uniform(1.0, 52.0), 1),  # HouseAge
            round(random.uniform(2.0, 10.0), 2),  # AveRooms
            round(random.uniform(0.5, 2.0), 2),  # AveBedrms
            round(random.uniform(3.0, 35000.0), 0),  # Population
            round(random.uniform(1.0, 10.0), 2),  # AveOccup
            round(random.uniform(32.0, 42.0), 2),  # Latitude
            round(random.uniform(-124.0, -114.0), 2),  # Longitude
        ]

    @task(1)
    def model_info(self):
        """Test the model-info endpoint"""
        with self.client.get("/model-info", catch_response=True) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Model info failed: {response.status_code}")

    @task(3)
    def single_prediction(self):
        """Test the prediction endpoint"""
        features = self.generate_random_features()
        with self.client.post(
            "/predict", json={"features": features}, catch_response=True, timeout=10
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "prediction" in data:
                        response.success()
                    else:
                        response.failure("Invalid response format")
                except json.JSONDecodeError:
                    response.failure("Failed to parse JSON")
            elif response.status_code == 503:
                response.failure("Service unavailable")
            else:
                response.failure(f"Status code: {response.status_code}")
```
Key points:
- wait_time: each simulated user waits between 0.5 and 2 seconds between executing tasks.
- generate_random_features creates realistic random feature data for the prediction requests.
- The task weights make each user roughly three times as likely to run single_prediction as model_info; the sketch below shows the resulting request mix.
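To see what those weights mean in practice, here is a small illustrative sketch of the expected request mix:

```python
# Task weights from locustfile.py act as relative pick probabilities
weights = {"model_info": 1, "single_prediction": 3}
total = sum(weights.values())
for name, w in weights.items():
    print(f"{name}: ~{w / total:.0%} of all requests")
# model_info: ~25% of all requests
# single_prediction: ~75% of all requests
```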
# 4. Running the Stress Test
- To evaluate the performance of your application under load, begin by starting the FastAPI application in one terminal (for example, with python run_server.py). You should see output similar to:

```
Model loaded successfully
INFO:     Started server process [26216]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

- Open your browser and navigate to http://localhost:8000/docs. Use the interactive API documentation to test your endpoints and ensure they are functioning correctly.
- Open a new terminal window, activate the virtual environment, and navigate to your project's root directory to run Locust with the web UI:

```
locust -f tests/locustfile.py --host http://localhost:8000
```

- Access the Locust web UI at http://localhost:8089 in your browser.
- In the Locust web UI, set the total number of users to 500 and the spawn rate to 10 users per second, then run the test for a minute.
- During the test, Locust will display real-time statistics, including the number of requests, failures, and response times for each endpoint.
- Once the test is complete, click on the Charts tab to view interactive graphs showing the number of users, requests per second, and response times.
- To run Locust without the web UI and automatically generate an HTML report, use the following command:

```
locust -f tests/locustfile.py --host http://localhost:8000 --users 500 --spawn-rate 10 --run-time 60s --headless --html report.html
```

- After the test finishes, an HTML report named report.html will be saved in your project directory for later review.

# Final Thoughts
 
Our app can handle a large number of users because we are using a simple machine learning model. Notably, the results show that the model-info endpoint has a higher response time than the predict endpoint, which is impressive for an inference API. Running a stress test like this locally is the best way to validate your application before pushing it to production.
If you would like to experience this setup firsthand, please visit the kingabzpro/Stress-Testing-FastAPI repository and follow the instructions in the documentation.
 
 
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.