Background
ENSO events influence temperature, precipitation, and extreme weather patterns around the globe. According to NOAA, El Niño is characterized by warmer-than-average sea surface temperatures, particularly in the central and eastern tropical Pacific Ocean, while La Niña is the opposite.
Understanding climate variability and spatial patterns is important for anticipating regional impacts of climate change. Seasonal climate variations have significant effects on agriculture, water resources, ecosystems, and human health. Among the most significant year-to-year drivers of climate variability is the El Niño–Southern Oscillation (ENSO).
To capture these trends, I used k-means clustering to show how various regions are affected by ENSO conditions.
Machine learning methods like k-means clustering are a powerful way to capture trends and structures in data. By analyzing seasonal temperature anomalies alongside ENSO metrics, clustering can reveal regions with distinct responses to ENSO forcing.
In this project, data from Berkeley Earth is processed and clustering is then performed on multiple features, including ENSO indices and seasonal temperature anomalies, to create an interpretable map of global ENSO sensitivity patterns. This analysis can help with climate risk assessments and support more localized forecasting.
Methods
For this project, surface temperature data from Berkeley Earth was used in the form of a NetCDF file. This file contains monthly surface temperature data on a 1°×1° latitude-longitude grid across the Earth. The values are temperature anomalies, calculated relative to a baseline temperature for each grid point in the pre-industrial period. The data can be found here.
For exploratory data analysis, I used Python libraries such as matplotlib and seaborn for plotting, numpy and pandas for data handling, and netCDF4 to read the NetCDF file. For the clustering model, scikit-learn was used.
Before training the model, exploratory data analysis was performed on this dataset to better understand the structure, distribution, and quality of the dataset. This included analyzing the distribution of temperature values and using matplotlib and plotly to map this out.
Then, the dataset was filtered to remove missing values and scaled where necessary. Seasonal smoothing (smoothing over a 3-month period) was applied to reduce noise. Clustering was then performed on the temperature values, together with temporal and spatial information, for each ENSO condition, to capture similarities in temperature trends across time and geography. The results were then displayed on an interactive map presented via Streamlit, along with a summary of each cluster.
Handling data
First, I decided to flatten the NetCDF file into a flat, CSV-style table, as this makes it simple to manipulate the data, for example by adding ENSO labels and applying smoothing. scikit-learn also expects 2D tabular data rather than data in NetCDF form.
from netCDF4 import Dataset
nc_file = "/content/Complete_TAVG_LatLong1.nc"
dataset = Dataset(nc_file, mode="r")
dataset.variables

This lists all the variables contained in the file.
The variables of interest are: longitude, latitude, temperature, and time.
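Since the full variable listing is not reproduced here, a quick way to check the shapes and units of those variables (a minimal sketch, using the dataset object opened above) is:

# Print the shape and units of each variable of interest
for name in ['longitude', 'latitude', 'time', 'temperature']:
    var = dataset.variables[name]
    print(name, var.shape, getattr(var, 'units', 'no units attribute'))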
Then, I flattened the NetCDF file. The code below iterates through every combination of time, latitude, and longitude and appends the time, coordinates, and temperature at that point.
import os
from netCDF4 import Dataset
import pandas as pd
nc_file = "/content/Complete_TAVG_LatLong1 (1).nc"
dataset = Dataset(nc_file, mode="r")
latitudes = dataset.variables['latitude'][:]
longitudes = dataset.variables['longitude'][:]
times = dataset.variables['time'][:]
temperatures = dataset.variables['temperature'][:]
time_dim, lat_dim, lon_dim = temperatures.shape
data = []
for t in range(time_dim):
    for i in range(lat_dim):
        for j in range(lon_dim):
            data.append([latitudes[i], longitudes[j], times[t], temperatures[t, i, j]])
df2 = pd.DataFrame(data, columns=['latitude', 'longitude', 'time', 'temperature'])
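As an aside, the triple loop above can be quite slow on the full global grid. A vectorized sketch that builds the same table (assuming the same variable names; missing cells become NaN here rather than the '--' fill value, so they would be dropped with dropna() afterwards) could look like:

import numpy as np

# Vectorized alternative to the triple loop: build all (time, lat, lon)
# combinations at once with meshgrid, then flatten everything into columns.
time_grid, lat_grid, lon_grid = np.meshgrid(times, latitudes, longitudes, indexing='ij')
df2_alt = pd.DataFrame({
    'latitude': lat_grid.ravel(),
    'longitude': lon_grid.ravel(),
    'time': time_grid.ravel(),
    'temperature': np.ma.filled(temperatures, np.nan).ravel(),
})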
Running df2.head() returns:

From this, it can be assumed that the fill value for temperature (used where there is no measurement) is '--'. I used this to remove all rows with '--' as the temperature.
df2_cleaned = df2[df2['temperature'] != '--'].copy()
display(df2_cleaned.head())

I decided to split the time variable into years and months, as this makes separating the data points by month and year simpler. The time values are fractional years, so I took the integer part as the year and estimated the month from the fractional part, storing both in new columns.
df2_cleaned['year'] = df2_cleaned['time'].astype(int)
fractional_part = df2_cleaned['time'] - df2_cleaned['year']
days_in_year = 365.25
days_of_year = (fractional_part * days_in_year).round().astype(int)
# might not be perfectly accurate for all dates
df2_cleaned['month'] = ((days_of_year / days_in_year) * 12).round().astype(int)
df2_cleaned['month'] = df2_cleaned['month'].clip(1, 12) # month is between 1 and 12
df2_cleaned = df2_cleaned.drop('time', axis=1)

With this, data cleaning is complete.
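Some of the EDA code later on reads the cleaned table back as df_from_drive; a plausible sketch of saving and reloading it (the Drive path is an assumption, not the actual path used) is:

# Save the cleaned table so it can be reloaded later without re-flattening the NetCDF file
df2_cleaned.to_csv('/content/drive/MyDrive/cleaned_temperatures.csv', index=False)

# In a later session, read it back in
df_from_drive = pd.read_csv('/content/drive/MyDrive/cleaned_temperatures.csv')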
Exploratory Data Analysis (EDA)
First, I decided to display a graph of how many temperature measurements there were per year, as I suspected that there were many null values for earlier years.
import pandas as pd
import matplotlib.pyplot as plt

counts_per_year = df2_cleaned['year'].value_counts().sort_index()
plt.figure(figsize=(10,6))
plt.bar(counts_per_year.index, counts_per_year.values)
plt.xlabel('Year')
plt.ylabel('Number of Data Points')
plt.title('Number of Temperature Measurements per Year')
plt.show()
This graph shows that the majority of temperature values were missing until around 1850, after which the number of measurements began increasing rapidly. From the 1950s onward, every coordinate has a recorded temperature for every month (no missing values).
I also decided to visualize the temperature anomaly trend from 1750 to present day.
import matplotlib.pyplot as plt
average_annual_temperatures = df2_cleaned.groupby('year')['temperature'].mean()
plt.figure(figsize=(10, 6)) # Adjust figure size as needed
plt.plot(average_annual_temperatures.index, average_annual_temperatures.values)
plt.xlabel('Year')
plt.ylabel('Average Annual Temperature')
plt.title('Average Annual Temperature Over Time')
plt.grid(True)
plt.show()

This graph shows that the average temperature anomaly increases slowly but steadily from 1750 to 1950, then rises at a faster rate. The earlier part of the graph looks more jagged because of the smaller number of data points.
In addition, a correlation heatmap was plotted to capture relationships between features.
import seaborn as sns
corrs = df_from_drive.corr()
sns.heatmap(corrs, annot=True, cmap='coolwarm')

As the heatmap shows, temperature and year have a positive correlation. Year and latitude have a relatively strong negative correlation, which indicates that, as time passed, more of the global south's temperatures were recorded.
A box plot and histograms were also plotted to assess the distribution of temperatures.
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.boxplot(y=df_from_drive['temperature'])
plt.ylabel('Temperature')
plt.title('Box Plot of Temperature Values (from df_from_drive)')
plt.show()

plt.figure(figsize=(10, 6))
sns.histplot(df_from_drive['temperature'], kde=True, bins=50) # Using seaborn for a nice look with KDE
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Distribution of Temperature Values')
plt.show()

I made another histogram with only temperature values from after 1970.
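A minimal sketch of how that post-1970 histogram could be produced (assuming the same df_from_drive frame) is:

# Histogram restricted to years after 1970
recent = df_from_drive[df_from_drive['year'] > 1970]
plt.figure(figsize=(10, 6))
sns.histplot(recent['temperature'], kde=True, bins=50)
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Distribution of Temperature Values (after 1970)')
plt.show()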
This shows that most temperature anomaly values fall in the -5 to 5 range, and that the median is greater than 0, especially for years after 1970. I chose to graph years after 1970 because I expected the average temperature to have increased significantly by then. However, the box plot and the histograms also show some data points with temperature anomalies far outside the usual -5 to 5 range.
Clustering
For this clustering, I decided to implement seasonal smoothing (smoothing temperatures within a season) and to isolate only the winter months for the global north, to avoid mixing multiple seasons together. I also decided to filter out the years with no ENSO condition (neutral years) to reduce noise, as these lack the distinctive temperature-anomaly patterns of La Niña or El Niño years. I used scikit-learn's k-means clustering to group grid cells by their average temperature anomalies in DJF (December, January, February) during each ENSO phase.
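The clustering code below uses a temp_smooth column; a minimal sketch of how that smoothing could be computed (assuming a centered 3-month rolling mean per grid cell on the df frame used below, applied before the DJF filter) is:

# Assumed sketch of the seasonal smoothing step: a centered 3-month rolling
# mean of each grid cell's monthly anomalies, producing the temp_smooth
# column used by the clustering code below.
df = df.sort_values(['latitude', 'longitude', 'year', 'month'])
df['temp_smooth'] = (
    df.groupby(['latitude', 'longitude'])['temperature']
    .transform(lambda s: s.rolling(window=3, center=True, min_periods=1).mean())
)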
The ENSO label for each year (El Niño, La Niña, or neutral) was determined using data from The Long Paddock, an Australian government resource.
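The enso_labels mapping used in the code below comes from that source; a hypothetical sketch of how it could be built (the file name and column names are assumptions, not the actual data used) is:

# Hypothetical sketch: a small table with one row per year and its ENSO phase
# ('E' for El Niño, 'L' for La Niña, 'N' for neutral), transcribed from the
# Long Paddock classification.
enso_table = pd.read_csv('enso_years.csv')   # assumed columns: year, phase
enso_labels = dict(zip(enso_table['year'], enso_table['phase']))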
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# filter to the years for which ENSO data is available
clustering_df = df[(df['year'] >= 1890) & (df['year'] <= 2017)]
# filters out neutral years
clustering_df = clustering_df[clustering_df['year'].isin(enso_labels.keys())]
clustering_df['enso_label'] = clustering_df['year'].map(enso_labels)
clustering_df = clustering_df[clustering_df['month'].isin([12, 1, 2])]
clustering_df = clustering_df[clustering_df['enso_label'].isin(['E', 'L'])]
clustering_df = pd.get_dummies(clustering_df, columns=['enso_label'], prefix='enso')
This filters the dataset down to the years I have ENSO phase data for, and then keeps only the years that are either El Niño or La Niña.
melted = clustering_df.melt(
    id_vars=['latitude', 'longitude', 'temp_smooth'],
    value_vars=['enso_E', 'enso_L'],
    var_name='enso_phase',
    value_name='is_phase'
)
melted = melted[melted['is_phase'] == 1]
pivot = (
    melted.groupby(['latitude', 'longitude', 'enso_phase'])['temp_smooth']
    .mean()
    .unstack(fill_value=0)
    .reset_index()
)

Here, melting turns the two indicator columns (enso_E, enso_L) into two rows per observation, one per ENSO phase, which makes it easier to compute averages by phase. I then grouped the data by coordinates and ENSO phase and took the mean smoothed anomaly for each phase at each grid point.
clustering_cols = ['enso_E', 'enso_L']
X = pivot[clustering_cols]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=6, random_state=0, n_init=10)
pivot['cluster'] = kmeans.fit_predict(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(pivot['longitude'], pivot['latitude'], c=pivot['cluster'], cmap='tab10', s=10)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Clusters of Locations Based on DJF ENSO Temperature Response')
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
print(pivot.groupby('cluster')[clustering_cols].mean())
Here, I apply feature scaling and then run k-means clustering with 6 clusters. I then plot the results, along with a summary of each cluster, and display them in a Streamlit app.
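A minimal sketch of how such a Streamlit page could present the result (the file name and layout here are assumptions, not the actual app code) is:

import pandas as pd
import plotly.express as px
import streamlit as st

st.title('Global ENSO Sensitivity Clusters (DJF)')

# Load the pivot table with cluster assignments produced above (path is assumed)
pivot = pd.read_csv('pivot_with_clusters.csv')

# Interactive map of grid cells colored by cluster
fig = px.scatter_geo(
    pivot, lat='latitude', lon='longitude', color='cluster',
    title='Clusters of Locations Based on DJF ENSO Temperature Response',
)
st.plotly_chart(fig)

# Per-cluster mean anomalies during El Niño and La Niña
st.dataframe(pivot.groupby('cluster')[['enso_E', 'enso_L']].mean())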
Analysis
Cluster 1 encompasses northern Russia (Siberia) and the southern tip of Antarctica. The clustering indicates cooler anomalies, especially during El Niño events, suggesting these regions tend to cool in response to ENSO conditions.
Cluster 2 includes parts of Mongolia, Central Asia, and a band around the US Canada border. The pattern suggests modest cooling, with less extreme values compared to other clusters.
Cluster 3 spans most of Central and South America, Southeast Asia, and half of Africa, particularly around the equator. These regions are characterized by strong warming during El Niño, a classic ENSO response in the tropics.
Cluster 4 includes the southern half of Africa, parts of Oceania, eastern South America, and stretches across central Asia and parts of Europe. According to the clustering, these areas have intermediate to slightly positive temperature anomalies, without as strong a signal as clusters 0 or 3.
Cluster 5 encompasses western Russia, northern Canada, and various areas scattered across Russia’s interior. This cluster has strong negative anomalies, suggesting significant winter cooling, particularly in El Niño conditions.
In cluster 1, Antarctica and Siberia are grouped together, which seems to be an artifact of the clustering. Cluster 3 is very consistent with what we know of El Niño and La Niña, as those events affect these equatorial regions most strongly. Cluster 2 does not seem to fit, as areas such as Mongolia and southern Canada should not be strongly affected by El Niño. Cluster 4 is somewhat consistent, in that it appears largely unaffected by ENSO phase, but it also covers some parts of South America and Africa that should be affected; this may also be a clustering artifact.
Limitations
This clustering analysis does not account for the increase in global average temperatures, potentially leading to inaccurate temperature differences between ENSO conditions. In addition, the analysis focuses on the winter months, since El Niño and La Niña usually reach peak intensity then and this keeps temperatures from other seasons out of the features. However, this may bias the clusters towards regions with strong winter signals and away from regions whose ENSO response appears in other seasons.
ENSO events are also relatively infrequent, which limits statistical robustness. Furthermore, an unsupervised clustering method cannot distinguish correlation from causation: whether a temperature difference between El Niño and La Niña years is driven by ENSO conditions or by noise remains unclear.
Conclusions
Overall, this project shows that clustering provides a useful, if imperfect, lens for exploring global ENSO impacts, especially when constrained to the DJF months when teleconnections are strongest. Future work could investigate how these patterns shift under climate change. By combining data-driven methods with climate science, this approach highlights both the power and the challenges of using machine learning in climate variability research.
