Predict, Prevent, Recover: How Machine Learning is Transforming Loan Collection | by Gagana H P | May, 2025

Gagana H P

Loan defaults are a big problem for banks and lenders because they hurt profits and slow down cash flow. To deal with this, many financial companies now use smart systems that look at past loan data, borrower details, and how people have paid in the past. These systems help make loan recovery more efficient, cut down on costs, and get more money back. In this article, I’ll show you how to create a smart loan recovery system using machine learning.

Dataset Overview: Smart Loan Recovery System

To create a loan recovery system using machine learning, we’ll work with a dataset covering borrowers, their loans, and their payment history. It includes important details like:

  • Personal Details: This includes the borrower’s age, job type, income, and how many people depend on them financially.
  • Loan Info: Covers how much money was borrowed, for how long, the interest rate, and the value of any collateral (like property or a vehicle).
  • Payment Record: Looks at how often payments were missed, how late they were, and the regular monthly payment amount (EMI).
  • Recovery Actions: Tracks what steps were taken to collect the loan — like phone calls, visits, or legal action — and how many times those actions were tried.
  • Recovery Result: Shows whether the loan was fully paid back, partly paid or still unpaid.
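Before any analysis, it can help to confirm that the file actually contains the columns the rest of the walkthrough relies on. The sketch below is only an illustration: the column names are the ones used in the code later in this article, while the helper `missing_columns` and the tiny demo frame are made up for the example.

```python
import pandas as pd

# Column names referenced later in this article; the real CSV may contain more.
EXPECTED_COLUMNS = [
    "Borrower_ID", "Age", "Monthly_Income", "Loan_Amount", "Loan_Tenure",
    "Interest_Rate", "Collateral_Value", "Outstanding_Loan_Amount",
    "Monthly_EMI", "Num_Missed_Payments", "Days_Past_Due",
    "Recovery_Status", "Collection_Method", "Collection_Attempts",
    "Legal_Action_Taken",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]


# Made-up two-column frame, used only to demonstrate the check.
demo = pd.DataFrame({"Borrower_ID": ["B1"], "Age": [35]})
missing = missing_columns(demo)
print(missing)
```

On the real data you would call `missing_columns(df)` right after loading the CSV and stop if the list is not empty.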

Smarter Loan Recovery: A Machine Learning Approach

Now, let’s kick things off by loading the dataset we’ll use to build our smart loan recovery system with machine learning.

import pandas as pd
df = pd.read_csv(r"C:\Users\GAGANA\Documents\Data Analytics\Blog\loan-recovery.csv")
print(df.head())

Before we dive deeper, let’s take a quick look at the summary statistics to understand the overall structure and distribution of the data.

# Summary statistics of the data

df.describe()


Understanding Data Distribution and How Features Relate to Each Other

I’ll start by looking at how the loan amounts are distributed and how they relate to the borrowers’ monthly income.

import numpy as np
import plotly.graph_objects as go
import plotly.express as px

# Histogram of loan amounts with a violin plot in the margin
fig = px.histogram(df, x='Loan_Amount', nbins=30, marginal="violin", opacity=0.7,
                   title="Loan Size vs. Monthly Income Analysis",
                   labels={'Loan_Amount': "Loan Value ($)", 'Monthly_Income': "Monthly Earnings ($)"},
                   color_discrete_sequence=["royalblue"])

# Density curve: a px.histogram trace carries no computed y values (binning
# happens in the browser), so compute the binned density with NumPy instead
density, bin_edges = np.histogram(df['Loan_Amount'], bins=30, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

fig.add_trace(go.Scatter(
    x=bin_centers,
    y=density,
    mode='lines',
    name='Density Curve',
    line=dict(color='red', width=2)
))

# Scatter of loan amount against monthly income, overlaid on the same figure
scatter = px.scatter(df, x='Loan_Amount', y='Monthly_Income',
                     color='Loan_Amount', color_continuous_scale='Viridis',
                     size=df['Loan_Amount'], hover_name=df.index)

for trace in scatter.data:
    fig.add_trace(trace)

fig.update_layout(
    annotations=[
        dict(
            x=max(df['Loan_Amount']) * 0.8, y=max(df['Monthly_Income']),
            text="As income increases, the approved loan amount typically follows",
            showarrow=True,
            arrowhead=2,
            font=dict(size=12, color="red")
        )
    ],
    xaxis_title="Loan Amount (in $)",
    yaxis_title="Monthly Income (in $)",
    template="plotly_white",
    showlegend=True
)

fig.show()

Loan size v/s Monthly income analysis

The graph shows that as monthly income increases, loan amounts tend to be higher too. In other words, people who earn more usually get bigger loans. The violin plot on top shows how loan amounts are spread out, while the scatter points show that larger loans are more common among higher income groups.

This pattern suggests that lenders consider income when approving loans or profiling customers, making sure loan sizes match what borrowers can realistically repay.
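A chart gives the visual impression, but a correlation coefficient puts a number on it. Here is a minimal sketch of that check. The data is synthetic (generated so that loans grow with income), so the printed values say nothing about the real dataset; only the column names `Monthly_Income` and `Loan_Amount` come from the article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: loan size grows with income, plus noise.
rng = np.random.default_rng(0)
income = rng.uniform(2_000, 20_000, size=200)
loan = income * 5 + rng.normal(0, 5_000, size=200)
demo = pd.DataFrame({"Monthly_Income": income, "Loan_Amount": loan})

# Pearson measures linear association between the two columns.
pearson = demo["Monthly_Income"].corr(demo["Loan_Amount"])

# Spearman (rank correlation) is Pearson applied to the ranks; it is less
# sensitive to outliers and to non-linear but monotonic relationships.
ranks = demo.rank()
spearman = ranks["Monthly_Income"].corr(ranks["Loan_Amount"])

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

On the real data, the same two lines applied to `df` would show whether the relationship is as strong as the plot suggests.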

Analyzing the Payment History

Next, let’s explore the payment history. I’ll start by examining how a borrower’s past payment behavior influences the amount recovered on their loan.

How Payment History Affects Loan Recovery Status

Loans where payments are made on time usually get paid back in full. When payments are late, some loans are fully recovered, some only partially, and others may be written off. However, loans with missed payments have a much lower chance of full recovery — many end up partially paid or not recovered at all.
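The comparison described above can be reproduced as a simple cross-tabulation. This is a sketch under assumptions: the column name `Payment_History` and the category labels are guesses for illustration (the article does not show them), and the six rows are invented.

```python
import pandas as pd

# Invented rows; "Payment_History" and the category labels are assumptions.
demo = pd.DataFrame({
    "Payment_History": ["On-Time", "On-Time", "Delayed",
                        "Delayed", "Missed", "Missed"],
    "Recovery_Status": ["Fully Recovered", "Fully Recovered", "Fully Recovered",
                        "Partially Recovered", "Partially Recovered", "Written Off"],
})

# normalize="index" turns each row into shares that sum to 1, so the table
# reads as "of all loans with this payment history, what fraction ended how".
share = pd.crosstab(demo["Payment_History"], demo["Recovery_Status"],
                    normalize="index")
print(share.round(2))
```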

Now, let’s take a closer look at missed payments and see exactly how they impact loan recovery.

fig = px.box(df, x="Recovery_Status", y="Num_Missed_Payments",
title="How Missed Payments Affect Loan Recovery Status",
labels={"Recovery_Status": "Recovery Status", "Num_Missed_Payments": "Number of Missed Payments"},
color="Recovery_Status",
color_discrete_map={"Recovered": "green", "Not Recovered": "red"},
points="all")

fig.update_layout(
xaxis_title="Recovery Status",
yaxis_title="Number of Missed Payments",
template="plotly_white"
)

fig.show()

How Missed Payments Affect Loan Recovery Status

Loans that are only partially recovered usually have up to 4 missed payments. Loans that get fully paid back tend to have very few missed payments — mostly between 0 and 2. On the other hand, loans that are written off often have many missed payments, with some having more than 6. In short, the more missed payments there are, the less likely the loan will be fully recovered, and the greater the chance it will be written off.
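The numbers read off the box plot can also be computed directly. The sketch below groups missed payments by recovery status; the nine rows and the status labels are made up for illustration, so swap in `df` and the labels from your own file.

```python
import pandas as pd

# Made-up rows; the status labels are assumed, not taken from the real file.
demo = pd.DataFrame({
    "Recovery_Status": ["Fully Recovered"] * 3
                       + ["Partially Recovered"] * 3
                       + ["Written Off"] * 3,
    "Num_Missed_Payments": [0, 1, 2, 2, 3, 4, 5, 6, 8],
})

# Median and maximum missed payments per outcome, plus the group size.
summary = (demo.groupby("Recovery_Status")["Num_Missed_Payments"]
               .agg(["median", "max", "count"]))
print(summary)
```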

Analyzing Loan Recovery Based on Monthly Income

Next, let’s take a closer look at how a borrower’s monthly income relates to the amount of the loan that gets recovered. I’ll start by examining how both income and loan size influence loan recovery.

fig = px.scatter(df, x='Monthly_Income', y='Loan_Amount',
color='Recovery_Status', size='Loan_Amount',
hover_data={'Monthly_Income': True, 'Loan_Amount': True, 'Recovery_Status': True},
title="How Monthly Income and Loan Amount Affect Loan Recovery",
labels={"Monthly_Income": "Monthly Income ($)", "Loan_Amount": "Loan Amount ($)"},
color_discrete_map={"Recovered": "green", "Not Recovered": "red"})

fig.add_annotation(
x=max(df['Monthly_Income']), y=max(df['Loan_Amount']),
text="Higher loans may still get recovered if income is high",
showarrow=True,
arrowhead=2,
font=dict(size=12, color="red")
)

fig.update_layout(
xaxis_title="Monthly Income ($)",
yaxis_title="Loan Amount ($)",
template="plotly_white"
)

fig.show()

How Monthly Income and Loan Amount Affect Loan Recovery

People with higher incomes tend to pay back their loans in full, even when the loan amounts are large. On the other hand, borrowers with lower incomes are more likely to have loans that are only partly recovered or written off. This shows how important income is in loan recovery — higher earnings generally mean better chances of full repayment and fewer losses, even for bigger loans.
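One way to check this pattern without a chart is to split borrowers into income bands and compare the share of fully recovered loans in each. A minimal sketch, with invented rows and an assumed "Fully Recovered" label:

```python
import pandas as pd

# Invented rows; the status labels are assumptions for illustration.
demo = pd.DataFrame({
    "Monthly_Income": [2000, 2500, 3000, 3500, 8000, 9000, 10000, 12000],
    "Recovery_Status": ["Written Off", "Partially Recovered",
                        "Partially Recovered", "Fully Recovered",
                        "Fully Recovered", "Fully Recovered",
                        "Partially Recovered", "Fully Recovered"],
})

# Split borrowers at the median income into two equally sized bands.
demo["Income_Band"] = pd.qcut(demo["Monthly_Income"], q=2,
                              labels=["Lower half", "Upper half"])

# Share of fully recovered loans within each band.
is_full = demo["Recovery_Status"].eq("Fully Recovered")
full_rate = is_full.groupby(demo["Income_Band"], observed=True).mean()
print(full_rate)
```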

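Now, I’ll apply K-Means clustering to group borrowers by their financial profile and repayment behavior, not just income and loan size. The features are standardized first so that large-valued columns such as loan amount don’t dominate the distance calculation.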

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['Age', 'Monthly_Income', 'Loan_Amount', 'Loan_Tenure', 'Interest_Rate',
'Collateral_Value', 'Outstanding_Loan_Amount', 'Monthly_EMI', 'Num_Missed_Payments', 'Days_Past_Due']

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[features])
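The number of clusters used in the next step is set by hand. One common way to sanity-check such a choice is the silhouette score, which is higher when clusters are tight and well separated. The sketch below runs that check on synthetic data with four obvious groups; it is not how the value in this article was chosen, and on the real `df_scaled` the best k may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: four well-separated groups in (income, loan amount).
rng = np.random.default_rng(42)
centers = np.array([[3_000, 10_000], [3_000, 60_000],
                    [12_000, 10_000], [12_000, 60_000]])
points = np.vstack([
    center + rng.normal(0, [300, 2_000], size=(50, 2)) for center in centers
])

# Standardize so both columns contribute equally to the distances.
scaled = StandardScaler().fit_transform(points)

# Fit K-Means for several k and record the average silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k:", best_k)
```

To try this on the article's data, replace `scaled` with `df_scaled` and compare the scores around k = 4.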

Next, let’s fit K-Means with four clusters and create a visual to better understand these borrower groups and see how they differ.

optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['Borrower_Segment'] = kmeans.fit_predict(df_scaled)

fig = px.scatter(df, x='Monthly_Income', y='Loan_Amount',
color=df['Borrower_Segment'].astype(str), size='Loan_Amount',
hover_data={'Monthly_Income': True, 'Loan_Amount': True, 'Borrower_Segment': True},
title="Borrower Segments Based on Monthly Income and Loan Amount",
labels={"Monthly_Income": "Income per Month ($)", "Loan_Amount": "Total Loan Amount ($)", "Borrower_Segment": "Customer Segment"},
color_discrete_sequence=px.colors.qualitative.Vivid)

fig.add_annotation(
x=df['Monthly_Income'].mean(), y=df['Loan_Amount'].max(),
text="Specific income ranges show a higher concentration of large loan amounts",
showarrow=True,
arrowhead=2,
font=dict(size=12, color="red")
)

fig.update_layout(
xaxis_title="Monthly Income ($)",
yaxis_title="Loan Amount ($)",
template="plotly_white",
legend_title="Borrower Segment"
)

fig.show()

Borrower Segments based on Monthly Income and Loan Amount

Borrowers in Segment 1 tend to take moderate to high loan amounts, which suggests they’re fairly financially stable. Segment 0 includes borrowers with lower incomes and moderate loan sizes, indicating they might be under some financial pressure. Segment 2 is spread out more evenly, showing a balanced group that likely approaches loans cautiously. Lastly, Segment 3 borrowers are mostly found in the high loan amount range, often within higher income brackets, but they seem more at risk of default despite their earnings.
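Descriptions like these are easier to verify with a per-segment profile table. Here is a small sketch of one; the eight rows are invented, and the `loan_to_income` ratio is an extra column I added for illustration, not something computed in the article.

```python
import pandas as pd

# Invented rows: two borrowers per segment.
demo = pd.DataFrame({
    "Borrower_Segment": [0, 0, 1, 1, 2, 2, 3, 3],
    "Monthly_Income": [3000, 4000, 10000, 12000, 6000, 7000, 9000, 11000],
    "Loan_Amount": [30000, 40000, 50000, 60000, 20000, 30000, 90000, 110000],
    "Num_Missed_Payments": [3, 5, 0, 1, 1, 2, 4, 6],
})

# One row per segment: size and average of the key features.
profile = demo.groupby("Borrower_Segment").agg(
    borrowers=("Monthly_Income", "size"),
    avg_income=("Monthly_Income", "mean"),
    avg_loan=("Loan_Amount", "mean"),
    avg_missed=("Num_Missed_Payments", "mean"),
)

# Hypothetical helper column: average loan relative to average income.
profile["loan_to_income"] = profile["avg_loan"] / profile["avg_income"]
print(profile)
```

Running the same aggregation on `df` after clustering shows whether each segment's averages really match the name you are about to give it.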

Now, let’s give these segments clear names based on what they represent:

# updating Segment_Name

df['Segment_Name'] = df['Borrower_Segment'].map({
0: 'Moderate Income, High Loan Burden',
1: 'High Income, Low Default Risk',
2: 'Moderate Income, Medium Risk',
3: 'High Loan, Higher Default Risk'
})

Building an Early Warning System to Spot Loan Defaults Using Risk Scores

Next, we’ll use the borrower segments to create a model that flags borrowers who are likely to default on their loans. The two higher-risk segments serve as the target label, so the model learns to recognize borrowers who resemble those segments. Once the model flags those high-risk borrowers, we can tailor a loan recovery plan based on how risky each borrower is. Let’s start by training the model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

risk_segments = {
'High Loan, Higher Default Risk': 1,
'Moderate Income, High Loan Burden': 1
}
df['High_Risk_Flag'] = df['Segment_Name'].map(risk_segments).fillna(0).astype(int)

# selecting features for the model
features = ['Age', 'Monthly_Income', 'Loan_Amount', 'Loan_Tenure', 'Interest_Rate',
'Collateral_Value', 'Outstanding_Loan_Amount', 'Monthly_EMI', 'Num_Missed_Payments', 'Days_Past_Due']
X = df[features]
y = df['High_Risk_Flag']

# Stratified split: keeps the same class proportions in the training and testing sets
split_ratio = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_ratio, stratify=y, random_state=42)

# training the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Estimate high-risk probabilities using the trained classifier
probabilities = rf_model.predict_proba(X_test)
risk_scores = probabilities[:, 1] # estimated probability of borrower being high risk

# merge risk scores and risk status indicators with test set features
df_test = X_test.copy()
df_test['Risk_Score'] = risk_scores
df_test['Is_High_Risk'] = (df_test['Risk_Score'] > 0.49).astype(int) # Assign risk label based on probability threshold

# merge test results with borrower profile and recovery data
info_cols = ['Borrower_ID', 'Segment_Name', 'Recovery_Status',
'Collection_Method', 'Collection_Attempts', 'Legal_Action_Taken']
merged_df = df_test.merge(df[info_cols], left_index=True, right_index=True)

First, we labeled borrowers as high-risk based on the segments they belong to. Next, we selected key financial and behavioral indicators to help us build a Random Forest model. We divided the data into training and testing sets and used the training portion to teach the model to predict whether a borrower falls into a high-risk segment, which we treat as a proxy for default risk. Once trained, the model produced a risk score for each borrower in the test set. Using a probability threshold of 0.49, we then labeled each borrower as either high-risk or low-risk. Finally, we combined these predictions with borrower information to help design smart recovery plans.
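The walkthrough does not score the model on the test set, so here is a hedged sketch of how that could look using a classification report and ROC AUC. It runs on synthetic data from `make_classification`, so the printed numbers are not results for the loan dataset. One caveat worth keeping in mind: because the high-risk label is derived from clusters built on the same features, a strong score on the real data mostly shows that the forest can reproduce the segmentation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ten borrower features and the high-risk flag.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive (high-risk) class for each test row.
risk_scores = model.predict_proba(X_test)[:, 1]

# Same threshold as in the article's code.
predicted = (risk_scores > 0.49).astype(int)

auc = roc_auc_score(y_test, risk_scores)
print(classification_report(y_test, predicted, digits=3))
print(f"ROC AUC: {auc:.3f}")
```

With the article's variables, the equivalent calls would be `classification_report(y_test, df_test['Is_High_Risk'])` and `roc_auc_score(y_test, risk_scores)`.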

Next, we’ll add a new column that assigns a recovery strategy based on each borrower’s risk score.

# We're adding a new column that shows how to recover based on how risky each case is

def assign_recovery_strategy(risk_score):
    # Map a risk score (0 to 1) to one of three recovery actions
    if risk_score > 0.75:
        return "Send urgent legal warnings and initiate strict recovery procedures"
    elif 0.50 <= risk_score <= 0.75:
        return "Settlement offers & repayment plans"
    else:
        return "Automated reminders & monitoring"

df_test['Recovery_Strategy'] = df_test['Risk_Score'].apply(assign_recovery_strategy)

df_test.head()


We created a function that groups borrowers into three recovery strategies based on their risk scores:

  • High-risk borrowers (risk score above 0.75) are flagged for immediate legal action.
  • Moderate-risk borrowers (risk scores between 0.50 and 0.75) receive settlement offers and repayment plans.
  • Low-risk borrowers (risk score below 0.50) get automated payment reminders.

This function was applied to the test data to assign each borrower a recovery plan tailored to their risk level, helping to focus efforts where they matter most and save costs.
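For larger datasets, the same three-way rule can be applied to the whole column at once instead of row by row. The sketch below uses `np.select` with the same thresholds and strategy texts as the function above; the six scores are made up to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Made-up scores chosen to sit on and around the 0.50 and 0.75 boundaries.
scores = pd.Series([0.10, 0.49, 0.50, 0.75, 0.76, 0.95], name="Risk_Score")

# Conditions are checked in order, so the second only applies to scores
# that are not above 0.75, which gives the 0.50 to 0.75 band.
conditions = [scores > 0.75, scores >= 0.50]
choices = [
    "Send urgent legal warnings and initiate strict recovery procedures",
    "Settlement offers & repayment plans",
]
strategy = pd.Series(
    np.select(conditions, choices, default="Automated reminders & monitoring"),
    index=scores.index,
    name="Recovery_Strategy",
)

# How many borrowers land in each recovery track.
counts = strategy.value_counts()
print(counts)
```

The counts per strategy give a quick view of how much of the portfolio would go to each recovery track.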

And that’s how you can build a smart loan recovery system using machine learning!

Summary

By using borrower profiles, payment patterns, and clustering methods, we can create a smart loan recovery system that spots high-risk borrowers early and assigns recovery plans tailored to their risk levels. I hope this walkthrough on building a machine learning–powered loan recovery system was insightful and practical.
If you have any questions or feedback, feel free to drop a comment below — I’d genuinely love to hear your thoughts!

This blog came to life after hours of digging through loan data, fueled by curiosity, late-night coffee, and a dash of data science magic — made possible by the guidance and support from Imarticus Learning.
A big thank you to Imarticus Learning for helping me turn numbers into meaningful solutions and for giving me the tools to build data-driven systems like this smart loan recovery model!
