
TPOT: Automating ML Pipelines with Genetic Algorithms in Python – KDnuggets


Image by Author

#Introduction

Constructing a machine learning model manually involves a long chain of decisions: cleaning the data, choosing the right algorithm, and tuning hyperparameters to achieve good results. This trial-and-error process often takes hours or even days. The Tree-based Pipeline Optimization Tool, or TPOT, can automate much of it.

TPOT is a Python library that uses genetic algorithms to automatically search for the best machine learning pipeline. It treats pipelines like a population in nature: it tries many combinations, evaluates their performance, and “evolves” the best ones over multiple generations. This automation allows you to focus on solving your problem while TPOT handles the technical details of model selection and optimization.

#How TPOT Works

TPOT is built on genetic programming (GP), a type of evolutionary algorithm inspired by natural selection in biology. Instead of evolving organisms, GP evolves computer programs or workflows to solve a problem. In TPOT, the “programs” being evolved are machine learning pipelines.

TPOT works in four main steps:

  1. Generate Pipelines: It starts with a random population of machine learning pipelines, including preprocessing methods and models.
  2. Evaluate Fitness: Each pipeline is trained and evaluated on the data to measure performance.
  3. Selection & Evolution: The best-performing pipelines are selected to “reproduce” and create new pipelines through crossover and mutation.
  4. Iterate Over Generations: This process repeats for multiple generations until TPOT identifies the pipeline with the best performance.

The process is visualized in the diagram below:

Next, we will look at how to set up and use TPOT in Python.

#1. Installing TPOT

To install TPOT, run the following command:
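TPOT is distributed on PyPI, so a standard pip install works:

```shell
pip install tpot
```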

#2. Importing Libraries

Import the necessary libraries:

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#3. Loading and Splitting Data

We will use the popular Iris dataset for this example:

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The load_iris() function provides the features X and labels y, and train_test_split holds out a test set so you can measure final performance on unseen data. TPOT will train and validate every candidate pipeline on the training portion only.

Note: TPOT uses internal cross-validation during the fitness evaluation.
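To make that note concrete, here is roughly what the internal fitness evaluation of a single candidate looks like, sketched with scikit-learn's cross_val_score. TPOT's real evaluation machinery is more involved, and the scaler-plus-logistic-regression candidate below is just an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One candidate pipeline, scored the way TPOT scores candidates:
# k-fold cross-validation on the training portion only.
candidate = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(candidate, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())
```

The mean of the fold scores becomes the candidate's fitness; the held-out test set never enters this step.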

#4. Initializing TPOT

Initialize TPOT as follows:

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    random_state=42
)

You can control how long and how widely TPOT searches for a good pipeline. For example:

  • generations=5 means TPOT will run five cycles of evolution. In each cycle, it creates a new set of candidate pipelines based on the previous generation.
  • population_size=20 means 20 candidate pipelines exist in each generation.
  • random_state ensures the results are reproducible.

#5. Training the Model

Train the model by running this command:

tpot.fit(X_train, y_train)

When you run tpot.fit(X_train, y_train), TPOT starts its search for the best pipeline. It creates a group of candidate pipelines, trains each one to see how well it performs (usually using cross-validation), and keeps the top performers. Then, it mixes and slightly changes them to make a new group. This cycle repeats for the number of generations you set. TPOT always remembers which pipeline performed best so far.

#6. Evaluating Accuracy

This is your final check on how the selected pipeline behaves on unseen data. You can calculate the accuracy as follows:

y_pred = tpot.fitted_pipeline_.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

#7. Exporting the Best Pipeline

You can export the pipeline to a file for later use. Note that we must first import dump from joblib:

from joblib import dump

dump(tpot.fitted_pipeline_, "best_pipeline.pkl")
print("Pipeline saved as best_pipeline.pkl")

The dump() function stores the entire fitted pipeline as best_pipeline.pkl.

Output:

Pipeline saved as best_pipeline.pkl

You can load it later as follows:

from joblib import load

model = load("best_pipeline.pkl")
predictions = model.predict(X_test)

This makes your model reusable and easy to deploy.

#Wrapping Up

In this article, we saw how machine learning pipelines can be automated using genetic programming, and we also walked through a practical example of implementing TPOT in Python. For further exploration, please consult the documentation.

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
