Emotional well-being, often measured through the Happiness Index, is a fundamental indicator in studies of mental and social health. Understanding the variables that drive variations in happiness is crucial for designing effective and personalized interventions. This study aims to conduct an inferential analysis using Linear Regression (OLS) to identify and quantify the key variables that have a statistically significant relationship with the Happiness Index. The data encompass various dimensions of lifestyle — including mental, physical, and digital factors. A robust analytical model is expected to provide clear insights into priority areas for interventions aimed at enhancing overall well-being.
Data Preparation
- The dataset was obtained from Kaggle. Several preprocessing steps were performed to improve the data structure, including renaming columns for better readability and removing the “User_ID” column, which is not relevant to the model.
#IMPORT LIBRARIES
import pandas as pd
import numpy as np

#IMPORT DATASET
df = pd.read_csv('D:/Mental_Health_Dataset.csv', header=0)

#RENAME COLUMNS
df.rename(columns={
    df.columns[3]: 'Daily_Screen_Time',
    df.columns[4]: 'Sleep_Quality',
    df.columns[5]: 'Stress_Level',
    df.columns[7]: 'Exercise_Frequency',
    df.columns[9]: 'Happiness_Index'
}, inplace=True)

#DELETE IRRELEVANT COLUMN
del df['User_ID']
- Outlier Handling: A capping process was applied using the Interquartile Range (IQR) method on numerical columns to reduce the impact of extreme values on the regression coefficients.
Before capping, several variables show noticeable gaps between their typical values and their extremes, including Daily_Screen_Time, Stress_Level, and Exercise_Frequency; a quick way to inspect this spread is sketched below.
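The snippet below is a minimal sketch (not part of the original pipeline) that compares the quartiles and extremes of every numeric column, which is enough to see how far the tails reach.
#QUICK SPREAD CHECK (sketch): quartiles vs. extremes for all numeric columns
spread = df.select_dtypes(include='number').describe().T
print(spread[['min', '25%', '50%', '75%', 'max']])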
def cap_outliers(df, num_cols):
    """Cap values outside the IQR fences to the nearest fence."""
    df_capped = df.copy()
    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df_capped[col] = np.where(
            df_capped[col] < lower, lower,
            np.where(df_capped[col] > upper, upper, df_capped[col])
        )
    return df_capped

#Assumption: all numeric columns are capped; adjust the list if the target should be excluded
num_cols = df.select_dtypes(include='number').columns
df_capped = cap_outliers(df, num_cols)
#CHECK CAPPING RESULTS FOR ONE COLUMN
col = 'Exercise_Frequency'
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
print(f"Upper fence before capping: {upper}")
print("Maximum value before:", df[col].max())
print("Maximum value after:", df_capped[col].max())
- Categorical Data Handling: Non-numerical features such as Gender and Social_Media_Platform were converted into dummy variables using One-Hot Encoding to make them compatible with the linear regression model.
df_encod = pd.get_dummies(df_capped, columns=['Gender','Social_Media_Platform'], drop_first=True, dtype=int)
Model Testing Process
- OLS Regression Results Evaluation: an OLS model was fitted to estimate the relationship between the dependent variable and the independent variables.
import statsmodels.api as sm

y = df_encod['Happiness_Index']
X = df_encod.drop(columns='Happiness_Index')
X_cons = sm.add_constant(X)
X_model = sm.OLS(y, X_cons).fit()
X_model.summary()
It can be seen from the OLS regression summary above that several variables have P>|t| values below 0.05, namely Age, Daily_Screen_Time, Sleep_Quality, and Stress_Level. These independent variables have a statistically significant relationship with the dependent variable, Happiness_Index. Furthermore, their coefficients are noticeably different from zero, indicating a linear influence on the dependent variable.
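The same reading can be extracted programmatically from the fitted results object; the snippet below is a minimal sketch that filters the coefficients by the 0.05 threshold used above.
#SKETCH: list predictors with p-values below 0.05, together with their coefficients
signif_table = pd.DataFrame({'coef': X_model.params, 'p_value': X_model.pvalues})
signif_table = signif_table.drop(index='const', errors='ignore')
print(signif_table[signif_table['p_value'] < 0.05].sort_values('p_value'))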
- Data Splitting: The data were divided into a Training Set (80%) used to train the model and a Testing Set (20%) used to evaluate the model’s generalization capability.
#IMPORT SCIKIT-LEARN LIBRARIES
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

#SPLIT INTO TRAINING AND TESTING DATA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#LEARN THE RELATIONSHIP BETWEEN X AND Y
lm = LinearRegression()
lm.fit(X_train, y_train)
y_test_pred = lm.predict(X_test)
y_train_pred = lm.predict(X_train)
- R² Score Calculation: to evaluate the predictive capability of the model.
#R2 SCORE FOR TEST DATA
print(r2_score(y_test, y_test_pred))

#R2 SCORE FOR TRAIN DATA
print(r2_score(y_train, y_train_pred))
From the test results above, the R² score for the test data is 0.6642, and for the training data, it is 0.6441. This indicates that approximately 66.4% of the variability in the dependent variable within the test data can be explained or predicted by this regression model, and a similar level of explanatory power is observed in the training data.
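As a complementary check (not part of the original analysis), a scale-aware error measure such as the RMSE can be reported alongside R²; the following is a minimal sketch.
#SKETCH: RMSE on the test and training sets as a complementary error measure
from sklearn.metrics import mean_squared_error

rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE (test):  {rmse_test:.4f}")
print(f"RMSE (train): {rmse_train:.4f}")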
- Re-evaluation of OLS Regression Results: This evaluation was carried out using independent variables with P>|t| values less than 0.05, as identified in the previous OLS Regression test. In this stage, only four variables were included in the model: Age, Daily_Screen_Time, Sleep_Quality, and Stress_Level.
signifikan = ['Age', 'Daily_Screen_Time', 'Sleep_Quality', 'Stress_Level']

X_train_signif = X_train[signifikan]
X_test_signif = X_test[signifikan]
X_train_signif_cons = sm.add_constant(X_train_signif)
X_test_signif_cons = sm.add_constant(X_test_signif)
model_OLS = sm.OLS(y_train, X_train_signif_cons).fit()
print(model_OLS.summary())
After re-running the OLS regression, the table above shows differences in the P>|t| values compared with the previous results. Specifically, Age and Daily_Screen_Time now have P>|t| values above the 0.05 threshold. This suggests that their apparent significance in the full model depended on the presence of the other independent variables, since it disappeared once those variables were removed.
Meanwhile, the variables Sleep_Quality and Stress_Level maintained consistent P>|t| values, showing no increase. This indicates that Sleep_Quality and Stress_Level are the most significant variables that have a strong and independent relationship with the dependent variable (Target Label / Happiness_Index).
These variables serve as the main determining factors of the Happiness_Index. The variable Sleep_Quality has a coefficient value of 0.3078, indicating that the linear regression slope is upward (positive direction) — meaning that higher sleep quality is associated with higher happiness levels. In contrast, Stress_Level has a coefficient value of -0.4973, which indicates that the linear regression slope is downward (negative direction) — meaning that higher stress levels are associated with lower happiness levels.
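To illustrate what these coefficients imply in practice, the fitted reduced model can be used to score hypothetical profiles; the feature values below are invented for illustration only and should be kept within the ranges observed in the actual data.
#SKETCH: predict Happiness_Index for two hypothetical profiles (illustrative values only)
profiles = pd.DataFrame({
    'Age':               [30, 30],
    'Daily_Screen_Time': [4.0, 4.0],
    'Sleep_Quality':     [8.0, 5.0],   # better vs. poorer sleep
    'Stress_Level':      [3.0, 7.0],   # lower vs. higher stress
})
profiles_cons = sm.add_constant(profiles, has_constant='add')
print(model_OLS.predict(profiles_cons))
Holding Age and Daily_Screen_Time fixed, the first profile (better sleep, lower stress) should score noticeably higher than the second, consistent with the signs of the two coefficients.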
Conclusion
The final model results show that:
- The variable Sleep_Quality has a positive and significant effect on the Happiness_Index (p < 0.05), meaning that the better a person’s sleep quality, the higher their level of happiness tends to be.
- The variable Stress_Level has a negative and significant effect on the Happiness_Index (p < 0.05), indicating that higher stress levels are associated with lower happiness.
- Meanwhile, the variables Age and Daily_Screen_Time have p values greater than 0.05, suggesting that they do not have a statistically significant effect on the Happiness_Index in this model.
Thus, it can be concluded that the most influential factors for happiness in this model are sleep quality and stress level: better sleep quality and lower stress are associated with a substantially higher happiness index.
