Predicting Diabetes using Machine Learning: A Step-by-Step Guide
Diabetes is a prevalent chronic disease that affects millions of individuals globally. The early prediction of diabetes can lead to improved disease management and better patient outcomes. In this comprehensive guide, we’ll walk you through the process of developing a machine learning model to predict whether a patient has diabetes or not. We’ll be using the well-known Pima Indians Diabetes Dataset, sourced from the National Institute of Diabetes and Digestive and Kidney Diseases. This dataset includes diagnostic measurements for female individuals of Pima Indian heritage, aged 21 years and above, residing in Phoenix, Arizona.
Table of Contents
1. Introduction
2. Exploratory Data Analysis (EDA)
- Understanding the Dataset
- Identifying Numerical and Categorical Variables
- Analyzing Numerical and Categorical Variables
- Analyzing the Target Variable
- Handling Outliers and Missing Values
- Correlation Analysis
3. Feature Engineering
- Handling Missing Values
- Creating New Features
- Encoding Categorical Variables
- Standardizing Numerical Variables
4. Model Development
- Data Preparation
- Model-1: Logistic Regression
- Model-2: K-Nearest Neighbors
- Model-3: Random Forest Classifier
5. Feature Importance Analysis
6. Conclusion
7. Exploring Further on GitHub
1. Introduction
Diabetes is a common chronic disease that has a significant impact on global health. Early diagnosis and effective management are crucial to improving the quality of life for affected individuals. Machine learning models offer the potential to predict diabetes based on patterns and relationships within medical data.
2. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the foundation of any data analysis project. Let’s start by gaining a solid understanding of the structure of the Pima Indians Diabetes Dataset.
##################################
# OVERALL VIEW
##################################
# Assumes `df` already holds the Pima Indians Diabetes Dataset as a pandas DataFrame.
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)
df.head()
df.info()
Understanding the Dataset
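The dataset comprises 768 observations and 9 columns: 8 predictor features and 1 target variable. Column names are listed as they appear in the data:
- Pregnancies: Number of pregnancies
- Glucose: Plasma glucose concentration (2-hour oral glucose tolerance test)
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body Mass Index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: A function estimating diabetes likelihood based on family history
- Age: Age in years
- Outcome: Target variable indicating diabetes (1) or non-diabetes (0)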
Identifying Numerical and Categorical Variables
Properly identifying numerical and categorical variables is vital for effective analysis and feature engineering.
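The code in this guide refers to `num_cols` and `cat_cols` without showing how they are built. One common way to make the split is sketched below; the helper name `grab_col_names` and the threshold `cat_th=10` are assumptions for illustration, not necessarily what the original notebook uses. A numeric column with only a few distinct values (such as a 0/1 target) is treated as categorical.

```python
import pandas as pd


def grab_col_names(dataframe, cat_th=10):
    """Split columns into categorical and numerical lists.

    cat_th is an assumed threshold: a numeric column with fewer than
    cat_th distinct values is treated as categorical.
    """
    # Non-numeric columns (e.g. object or category dtype) are categorical
    cat_cols = [col for col in dataframe.columns
                if not pd.api.types.is_numeric_dtype(dataframe[col])]
    # Numeric columns with few distinct values behave like categories
    num_but_cat = [col for col in dataframe.columns
                   if pd.api.types.is_numeric_dtype(dataframe[col])
                   and dataframe[col].nunique() < cat_th]
    cat_cols = cat_cols + num_but_cat
    # Everything else is treated as a numerical variable
    num_cols = [col for col in dataframe.columns if col not in cat_cols]
    return cat_cols, num_cols


# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})
cat_cols, num_cols = grab_col_names(toy)
print("Categorical:", cat_cols)
print("Numerical:", num_cols)
```

On the toy frame, `Outcome` lands in the categorical list because it has only two distinct values, while `Glucose` and `Age` stay numerical.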
Analyzing Numerical and Categorical Variables
Conducting a thorough analysis of numerical and categorical variables involves calculating statistics, visualizing distributions, and understanding the unique characteristics of each feature.
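##################################
# NUMERIC VARIABLE ANALYSIS ACCORDING TO THE TARGET
##################################
def target_summary_with_num(dataframe, target, numerical_col):
    # Mean of the numerical column within each target class
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")

# Assumption: every column except the target is treated as numerical,
# since the guide does not show how num_cols is built.
num_cols = [col for col in df.columns if col != "Outcome"]

for col in num_cols:
    target_summary_with_num(df, "Outcome", col)
Analyzing the Target Variable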
A careful examination of the distribution of the target variable, “Outcome”, helps us understand the balance between positive and negative cases of diabetes.
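A minimal sketch of such a check is shown below. The helper name `target_summary` and the toy values are assumptions for illustration; on the real data the same call would be made with `df` and `"Outcome"`.

```python
import pandas as pd


def target_summary(dataframe, target):
    # Absolute count of each class
    counts = dataframe[target].value_counts()
    # Share of each class as a percentage of all rows
    ratios = dataframe[target].value_counts(normalize=True) * 100
    return pd.DataFrame({"count": counts, "ratio_pct": ratios.round(2)})


# Toy target column for illustration only
toy = pd.DataFrame({"Outcome": [0, 0, 0, 1, 0, 1, 0, 1]})
summary = target_summary(toy, "Outcome")
print(summary)
```

If one class clearly dominates, accuracy alone becomes a weak yardstick, and metrics such as recall, F1, and ROC AUC deserve more attention.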
Handling Outliers and Missing Values
Outliers can impact the quality of our analysis. Additionally, addressing missing values through suitable imputation strategies is crucial for accurate predictions.
Correlation Analysis
By performing correlation analysis, we gain insights into relationships between variables and identify potential multicollinearity.
##################################
# CORRELATION
##################################
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()
f, ax = plt.subplots(figsize=[18, 13])
sns.heatmap(corr, annot=True, fmt=".2f", ax=ax, cmap="magma")
ax.set_title("Correlation Matrix", fontsize=20)
plt.show(block=True)
3. Feature Engineering
Feature engineering is a critical step that involves creating new features and transforming existing ones to enhance model performance. Let’s explore the feature engineering tasks we’ll undertake:
Handling Missing Values
We’ll address missing values by identifying variables with zero values that should be treated as missing. Subsequently, we’ll impute these missing values using appropriate methods.
##################################
# MISSING VALUE ANALYSIS
##################################
import numpy as np
import pandas as pd

# A zero is not a plausible measurement for these columns, so it is treated as missing.
# Pregnancies and Outcome are excluded because zero is a valid value for them.
zero_columns = [col for col in df.columns if (df[col].min() == 0 and col not in ["Pregnancies", "Outcome"])]
zero_columns

for col in zero_columns:
    df[col] = np.where(df[col] == 0, np.nan, df[col])

# The helper must be defined before it is called
def missing_values_table(dataframe, na_name=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")
    if na_name:
        return na_columns

na_columns = missing_values_table(df, na_name=True)

# missing_vs_target is not defined in this guide (presumably a helper in the
# full notebook), so the call is left commented out here.
# missing_vs_target(df, "Outcome", na_columns)

# Impute the missing values of each column with that column's median
for col in zero_columns:
    df.loc[df[col].isnull(), col] = df[col].median()

df.isnull().sum()
##################################
# OUTLIER ANALYSIS
##################################
def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    # Uses the 5th and 95th percentiles by default instead of the classic quartiles
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    outside = (dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)
    return bool(outside.any())

def replace_with_thresholds(dataframe, variable, q1=0.05, q3=0.95):
    # Pass q1 and q3 through instead of hard-coding them
    low_limit, up_limit = outlier_thresholds(dataframe, variable, q1=q1, q3=q3)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

# Report outliers and cap them at the thresholds
for col in df.columns:
    print(col, check_outlier(df, col))
    if check_outlier(df, col):
        replace_with_thresholds(df, col)

# Verify that no outliers remain
for col in df.columns:
    print(col, check_outlier(df, col))
Creating New Features
To capture valuable information from existing variables, we’ll engineer new features. For example, we might categorize age groups, define BMI ranges, and calculate interactions between glucose and insulin levels.
Encoding Categorical Variables
Categorical variables will be encoded using label encoding for binary categories and one-hot encoding for others. This ensures that the machine learning model can effectively use these variables.
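The label-encoding step for binary categories can be sketched as follows. This is an illustration under assumptions: the helper name `label_encoder` and the toy values are hypothetical, while `NEW_AGE_CAT` and `NEW_BMI` mirror the engineered column names used in this guide. A column counts as binary here when it is non-numeric and has exactly two distinct values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encoder(dataframe, binary_col):
    # LabelEncoder maps the sorted class labels to 0 and 1
    encoder = LabelEncoder()
    dataframe[binary_col] = encoder.fit_transform(dataframe[binary_col])
    return dataframe


# Toy data for illustration only
toy = pd.DataFrame({
    "NEW_AGE_CAT": ["mature", "senior", "mature", "senior"],
    "NEW_BMI": ["Healthy", "Obese", "Overweight", "Obese"],
})

# Binary columns: non-numeric with exactly two distinct values
binary_cols = [col for col in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[col])
               and toy[col].nunique() == 2]

for col in binary_cols:
    toy = label_encoder(toy, col)

print(binary_cols)
print(toy)
```

Only `NEW_AGE_CAT` is label-encoded in this toy example; `NEW_BMI` has three categories and would go through one-hot encoding instead.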
##################################
# FEATURE ENGINEERING
##################################
# Creating a new age variable by categorizing age
df.loc[(df["Age"] >= 21) & (df["Age"] < 50), "NEW_AGE_CAT"] = "mature"
df.loc[(df["Age"] >= 50), "NEW_AGE_CAT"] = "senior"

# Categorizing BMI values as underweight, healthy, overweight, and obese
df['NEW_BMI'] = pd.cut(x=df['BMI'], bins=[0, 18.5, 24.9, 29.9, 100], labels=["Underweight", "Healthy", "Overweight", "Obese"])

# Converting Glucose value to a categorical variable
df["NEW_GLUCOSE"] = pd.cut(x=df["Glucose"], bins=[0, 140, 200, 300], labels=["Normal", "Prediabetes", "Diabetes"])

# Creating a categorical variable based on age and BMI
df.loc[(df["BMI"] < 18.5) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_BMI_NOM"] = "underweightmature"
# ... similar lines for other combinations

# Creating a categorical variable based on age and Glucose value
df.loc[(df["Glucose"] < 70) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_GLUCOSE_NOM"] = "lowmature"
# ... similar lines for other combinations

# Creating a categorical variable based on Insulin value
# (the function receives one row at a time because apply is called with axis=1)
def set_insulin(dataframe, col_name="Insulin"):
    if 16 <= dataframe[col_name] <= 166:
        return "Normal"
    else:
        return "Abnormal"

df["NEW_INSULIN_SCORE"] = df.apply(set_insulin, axis=1)
df["NEW_GLUCOSE*INSULIN"] = df["Glucose"] * df["Insulin"]

# Be careful with zero values!
df["NEW_GLUCOSE*PREGNANCIES"] = df["Glucose"] * df["Pregnancies"]

# Uppercasing column names
df.columns = [col.upper() for col in df.columns]

##################################
# ENCODING
##################################
# cat_cols and binary_cols are not defined in this guide. Assumption: cat_cols holds the
# categorical columns and binary_cols the two-class ones, which are label-encoded beforehand.
# One-Hot Encoding
cat_cols = [col for col in cat_cols if col not in binary_cols and col not in ["OUTCOME"]]

def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)
Standardizing Numerical Variables
Standardizing numerical variables puts them on the same scale. This matters most for scale-sensitive algorithms such as Logistic Regression and K-Nearest Neighbors; tree-based models such as Random Forest are largely unaffected by feature scaling.
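##################################
# STANDARDIZATION
##################################
from sklearn.preprocessing import StandardScaler

# num_cols must list the numerical columns under their upper-cased names
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
4. Model Development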
With our data prepared and features engineered, we move on to building machine learning models for diabetes prediction. We start with Logistic Regression and K-Nearest Neighbors as baselines, then train a Random Forest Classifier, chosen for its robustness and its ability to capture non-linear relationships.
Data Preparation
We’ll split the dataset into training and testing sets, and we’ll normalize the numerical variables to ensure fair comparisons between different features.
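The split-and-scale step can be sketched as follows. This is an illustration on synthetic data, not the author's exact pipeline: here the scaler is fitted on the training set only and then applied to the test set, a common design choice that keeps test-set statistics out of training. The `stratify=y` argument is also an assumption; it keeps the class ratio similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared dataset (illustration only)
rng = np.random.default_rng(17)
toy = pd.DataFrame({
    "GLUCOSE": rng.normal(120, 30, 100),
    "BMI": rng.normal(32, 7, 100),
    "OUTCOME": rng.integers(0, 2, 100),
})

y = toy["OUTCOME"]
X = toy.drop("OUTCOME", axis=1)

# 70% training, 30% testing, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=17, stratify=y
)

# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

After scaling, each training column has mean 0 and unit variance; the test columns are close to that but not exact, because they reuse the training statistics.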
Model-1: Logistic Regression
We start with Logistic Regression, a standard baseline for binary classification. The model is first trained on the full dataset and then used to predict that same data. These in-sample predictions show how well the model fits the patterns in the data, but they tend to overstate performance on unseen patients, which is why cross-validation follows.
To look more closely at performance, we use the confusion matrix, a table of predicted versus actual outcomes that shows where the model succeeds and where it errs. The classification report summarizes accuracy along with precision, recall, and F1-score for each class.
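To make the link between the confusion matrix and the reported metrics concrete, here is a small worked example on made-up labels (the values are illustrative, not results from the dataset). It derives accuracy, precision, recall, and F1 by hand from the four cells of the matrix and compares them with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The hand-computed values agree with scikit-learn
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))
```

In a screening setting, recall is often the metric to watch, since a false negative means a diabetic patient goes undetected.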
ROC AUC measures how well the model separates the two classes across all decision thresholds, which is useful when the classes are imbalanced. Before cross-validation, the data is split into feature variables (X) and the target variable (y). A 5-fold cross-validation then estimates performance on unseen data in terms of accuracy, precision, recall, F1-score, and ROC AUC.
After that, we move on to the K-Nearest Neighbors (KNN) model, tune its hyperparameters, and evaluate the tuned model.
# Model-1: Logistic Regression
# Fit a logistic regression model and evaluate its performance.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_validate

# Splitting data into features (X) and target (y); this has to happen before fitting
y = df["OUTCOME"]
X = df.drop(["OUTCOME"], axis=1)

# Model training
log_model = LogisticRegression().fit(X, y)

# Making predictions (on the same data the model was trained on)
y_pred = log_model.predict(X)

# Confusion matrix: the notebook's plot_confusion_matrix helper is not shown in this guide,
# so the raw matrix from scikit-learn is printed instead
print(confusion_matrix(y, y_pred))

# Printing classification report
print(classification_report(y, y_pred))

# Calculating ROC AUC from the predicted probability of the positive class
y_prob = log_model.predict_proba(X)[:, 1]
print(roc_auc_score(y, y_prob))

# Model Validation: 5-Fold Cross Validation
# cross_validate refits a fresh copy of the model on each fold
cv_results = cross_validate(log_model, X, y, cv=5, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])

# Printing cross-validation results
print(cv_results['test_accuracy'].mean())
print(cv_results['test_f1'].mean())
print(cv_results['test_roc_auc'].mean())
Model-2: K-Nearest Neighbors
The KNN workflow follows four steps:
- Fit: a K-Nearest Neighbors (KNN) classifier is trained on the features `X` and target labels `y`.
- In-sample check: the trained model predicts the same data, and the ROC AUC (area under the Receiver Operating Characteristic curve) score is computed from the predicted probabilities.
- Cross-validation: 5-fold cross-validation gives a more reliable estimate of accuracy, F1 score, and ROC AUC by testing the model on held-out folds.
- Hyperparameter optimization: an exhaustive grid search over the number of neighbors (2 to 49) uses cross-validation to select the best value.
Finally, the KNN model is refit with the best hyperparameters found by the grid search and evaluated again with cross-validation.
# Model-2: K-Nearest Neighbors (KNN)
# Same steps as for logistic regression, this time with a KNN classifier.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Fitting KNN model
knn_model = KNeighborsClassifier().fit(X, y)

# Making predictions and calculating ROC AUC
y_pred = knn_model.predict(X)
y_prob = knn_model.predict_proba(X)[:, 1]
print(roc_auc_score(y, y_prob))

# Cross-validation for KNN model
cv_results = cross_validate(knn_model, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])

# Hyperparameter Optimization
# Grid search over the number of neighbors
knn_params = {"n_neighbors": range(2, 50)}
knn_gs_best = GridSearchCV(knn_model, knn_params, cv=5, n_jobs=-1, verbose=1).fit(X, y)

# Best parameters from grid search
print(knn_gs_best.best_params_)

# Final Model
# Refit the KNN model with the best parameters and evaluate it
knn_final = knn_model.set_params(**knn_gs_best.best_params_).fit(X, y)

# Cross-validation for final KNN model
cv_results = cross_validate(knn_final, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])

# Printing cross-validation results for the final model
print(cv_results['test_accuracy'].mean())
print(cv_results['test_f1'].mean())
print(cv_results['test_roc_auc'].mean())
Model-3: Random Forest Classifier
We’ll train a Random Forest Classifier on the training data and evaluate its performance on the testing data. Metrics such as accuracy, recall, precision, F1 score, and ROC AUC will provide insights into the model’s effectiveness.
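##################################
# MODELING
##################################
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

y = df["OUTCOME"]
X = df.drop("OUTCOME", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)

rf_model = RandomForestClassifier(random_state=46).fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
# Probability of the positive class, used for ROC AUC instead of hard labels
y_prob = rf_model.predict_proba(X_test)[:, 1]

# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 2)}")
print(f"Recall: {round(recall_score(y_test, y_pred), 3)}")
print(f"Precision: {round(precision_score(y_test, y_pred), 2)}")
print(f"F1: {round(f1_score(y_test, y_pred), 2)}")
print(f"Auc: {round(roc_auc_score(y_test, y_prob), 2)}")
5. Feature Importance Analysis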
By analyzing the importance of features using the trained Random Forest model, we can understand which variables contribute the most to predicting diabetes. This analysis provides valuable insights into the driving factors behind diabetes prediction.
##################################
# FEATURE IMPORTANCE
##################################
def plot_importance(model, features, num=None, save=False):
    # num=None plots every feature; pass an integer to keep only the top `num`
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    print(feature_imp.sort_values("Value", ascending=False))
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('Features')
    plt.tight_layout()
    # Save before showing, otherwise the saved file can come out blank
    if save:
        plt.savefig('importances.png')
    plt.show(block=True)

plot_importance(rf_model, X)
6. Conclusion
In this comprehensive guide, we’ve taken you through the step-by-step process of developing a machine learning model for diabetes prediction using the Pima Indians Diabetes Dataset. From exploratory data analysis to feature engineering, model development, and feature importance analysis, each step contributes to accurate diabetes prediction.
Predicting diabetes using machine learning is a crucial application in the healthcare sector. By leveraging techniques such as data analysis and feature engineering, we can build accurate models that aid in the diagnosis and effective management of this chronic disease.
7. Exploring Further on GitHub
For the complete code, Jupyter notebooks, and additional resources, explore our Kaggle Link or GitHub Link.
By engaging with the resources provided, you can delve deeper into the dataset, experiment with code, and contribute to advancing diabetes prediction through machine learning.
