Using Machine Learning Models for Breast Cancer Diagnosis — A complete Machine Learning Project Workflow for Life Sciences

Ernest Bonat, Ph.D., Chad Turner

28 min read · Nov 8, 2024

1. Introduction
2. Research Question
3. Data Explanation
4. Data Preprocessing
5. Classification Models
6. ChatGPT Confusion Matrix Report Generation
7. Model Hyperparameter Optimization
8. Additional Classification Models
9. Correlation Analysis
10. Revising the Dataset
11. Balancing the Target Classes
12. Most Important Features
13. Multi-model Classification
14. Artificial Neural Networks
15. Convolutional Neural Networks
16. Conclusion

1. Introduction

The dataset used for this Machine Learning (ML) project is the “Breast Cancer Wisconsin (Diagnostic) Dataset,” which consists of diagnostic imaging data used to detect breast cancer. It was produced by the University of California, Irvine (UCI) Machine Learning Repository and can be obtained from Kaggle.com. The data is composed of computer-digitized images from fine needle aspirations (FNA) of breast masses. An FNA is a minimally invasive medical procedure in which samples of tissue or fluid are taken from areas of interest for diagnostic purposes. This is done by inserting a thin, hollow needle into the suspect tissue to collect samples, which are placed on a glass slide for analysis. The slides are digitized, and the boundary of each cell nucleus is defined through a graphical user interface using a contour model known as a “snake.” For each nucleus, ten characteristics are measured, and the mean, standard error, and largest (worst) value of each characteristic are computed per image, giving 30 features measured over 569 images. The data consists of 357 benign and 212 malignant diagnoses. The goal with this data is to build an accurate ML model in Python that predicts diagnoses based on feature values. To achieve this, supervised ML classification algorithms will be used.

2. Research Question

What features in cell nuclei have the greatest correlation with a malignant diagnosis for breast cancer?

3. Data Explanation

The target for this dataset is a column called “diagnosis.” This column contains two classes, “M” for malignant or “B” for benign, for each of the 569 images in the dataset.

Feature Descriptions:

1. Radius: The mean of distances from the centroid to points along the snake perimeter of the cell.

2. Texture: The standard deviation of grayscale intensities. This is calculated by analyzing the pixels for visual patterns or structures that indicate roughness, smoothness, or other textural properties. A high standard deviation in the grayscale values indicates a rough or complex texture, while a low standard deviation, where pixel values are tightly grouped, indicates a smooth, uniform texture.

3. Perimeter: The measured perimeter of the cell nucleus.

4. Area: The calculated area of the cell nucleus.

5. Compactness: The perimeter and area are combined into a single measure through the formula perimeter² / area − 1.0. This measurement can be biased toward higher values for smaller cells, as measurement accuracy decreases with cell size.

6. Smoothness: A measure of the local variation in radius lengths: the difference between the length of each radial line and the mean length of the radial lines surrounding it.

[Figure: smoothness measured as the local variation in radial line lengths]

7. Concavity: A measure of the number and severity of concavities, or indentations, in the cell nucleus. This is measured by drawing chords between non-adjacent snake points on the perimeter, then measuring how far the boundary of the nucleus dips inside each chord.

8. Concave Points: Similar to concavity, except it measures only the number of contour concavities rather than their extent.

9. Symmetry: To measure the symmetry of a cell, the longest chord through the center (the major axis) is first defined; then chords perpendicular to the major axis are drawn in each direction, splitting the cell into a grid. The difference in length between the perpendicular segments on either side of the major axis is then measured. If the major axis cuts through the cell’s perimeter because of a concavity, additional measurements are used.

10. Fractal Dimension: To obtain this measurement, a method known as “coastline approximation” is used. The perimeter is measured with increasingly larger scales, which decreases the precision as the scale size increases. These perimeter-versus-scale values are plotted on a log-log scale, and the downward slope of the resulting line gives an approximation of the fractal dimension, as sketched below.
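
As a rough illustration (not the dataset’s original implementation), the log-log slope can be estimated with NumPy; the “scales” and “perimeters” arrays below are hypothetical measurements:

import numpy as np

# hypothetical perimeter measurements taken at increasingly large ruler scales
scales = np.array([1.0, 2.0, 4.0, 8.0])
perimeters = np.array([310.0, 280.0, 250.0, 225.0])

# slope of log(perimeter) vs. log(scale); for a coastline-style measurement,
# measured length scales as scale^(1 - D), so D = 1 - slope
slope, _intercept = np.polyfit(np.log(scales), np.log(perimeters), 1)
fractal_dimension = 1 - slope
print(f"approximate fractal dimension: {fractal_dimension:.3f}")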

These 10 characteristics account for the 30 features in the dataset, as the mean, standard error, and worst value are calculated for each. The suffixes “_mean,” “_se,” and “_worst” are appended to each feature name (radius_mean, radius_se, and radius_worst). Features 1 through 10 are the mean values for the characteristics of a cell nucleus, features 11 through 20 are the standard error values, and features 21 through 30 are the worst values. There is also an additional column in this dataset called “id,” which is the patient’s identification number. Below is the DataFrame info.

4. Data Preprocessing

To begin the project, the CSV dataset is downloaded from Kaggle.com and the analysis of the data begins. A Python program is created in which the CSV file is imported using the Pandas “pd.read_csv” function with the file path as the argument, creating a DataFrame. Next, a function called “display_dataframe_info” is created to count and print the sum of duplicated rows, the sum of null values, the data types, and the DataFrame shape. The shape of the DataFrame is 33 columns and 569 rows. There are no duplicated rows, and the null count is 0 for all columns except one unnamed column with 569 null values. All the columns are numeric except “diagnosis,” which needs to be converted to numeric values in order to build an ML model. To solve this, the column is label encoded to binary using the Scikit-learn label encoder, creating a new column called “diagnosis_encoded.” The values were originally the strings “M” for malignant and “B” for benign; after encoding, “M” is “1” and “B” is “0.”

To deal with the column of null values, a “clean_dataframe” function is run, which drops null values and duplicates. This function also removes columns upon request. In this case, the original “diagnosis” column as well as “id” are dropped, as the patient’s ID number is unnecessary for an ML model, serving neither as a feature nor a target. This changes the shape of the DataFrame to 31 columns and 569 rows (1 target column and 30 feature columns). After confirming that the dataset is completely numeric, with no duplicates and no null values, the data is ready for modeling. A new CSV file called “breast_cancer_kaggle_cleaned.csv” is created with the Pandas “df.to_csv” function, passing the file name as the argument and setting “index=False” so that the DataFrame index is not written as an extra column. The preprocessing code is shown below.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

def main():
    # loading csv
    df = pd.read_csv("breast_cancer_kaggle.csv")

    # encoding categorical column
    label_encoder = LabelEncoder()
    df["diagnosis_encoded"] = label_encoder.fit_transform(df["diagnosis"])

    # printing dataframe information
    display_dataframe_info(df)

    # cleaning dataframe
    df = clean_dataframe(df, drop_columns=("id", "diagnosis"))

    # saving cleaned dataframe csv file
    df.to_csv("breast_cancer_kaggle_cleaned.csv", index=False)

    # printing final dataframe information
    display_dataframe_info(df)

    # plotting histograms of all features
    df.hist(bins=11, figsize=(10, 8))
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

if __name__ == "__main__":
    main()
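
The “display_dataframe_info” and “clean_dataframe” helpers are defined elsewhere in the project; a minimal sketch of what they might look like, based on the behavior described above, is:

import pandas as pd

def display_dataframe_info(df):
    # print duplicate count, null counts, data types, and shape
    print(f"Duplicated rows: {df.duplicated().sum()}")
    print(f"Null values:\n{df.isnull().sum()}")
    print(f"Data types:\n{df.dtypes}")
    print(f"Shape: {df.shape}")

def clean_dataframe(df, drop_columns=()):
    # drop requested columns, then all-null columns (such as the unnamed
    # column), then any remaining null rows and duplicates
    df = df.drop(columns=list(drop_columns), errors="ignore")
    df = df.dropna(axis=1, how="all")
    df = df.dropna().drop_duplicates()
    return df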

5. Classification Models

For the first ML classification model, Random Forest Classification is used, which employs multiple decision trees and combines their outputs for a final class prediction. The preprocessed dataset is loaded using a function called “load_data.” This function takes two arguments: a CSV file path, set to “breast_cancer_kaggle_cleaned.csv,” and a target column, defined as “diagnosis_encoded.” The function sets the target as “y” and the features as “X,” dropping the “diagnosis_encoded” column from “X.” The data is then split into training and testing sets using a function called “split_data,” which utilizes Scikit-learn’s “train_test_split” function. The test size is defined as “0.2,” producing an X_train DataFrame and a y_train Series that consist of 80% of randomly selected values from “X” and “y,” along with X_test and y_test, which hold the remaining 20%. The X training and testing data are then scaled for uniformity using the “scale_data” function, creating “X_train_scaled” and “X_test_scaled.” This function fits the scaler to the training data only and transforms the testing data without fitting it, which prevents data leakage and ensures the testing data remains truly unseen. The model is trained using the function “train_random_forest,” which takes “X_train_scaled” and “y_train” as arguments. To evaluate the classification task, the function “evaluate_model” is created, which makes predictions on the feature testing data and generates the accuracy score, classification report, and confusion matrix, which are printed using a “plot_results” function. Finally, a confusion matrix graph is created using a function called “plot_confusion_matrix.” The Random Forest model implementation is shown below.

def main():
    # load dataframe and define X and y
    X, y = load_data("breast_cancer_kaggle_cleaned.csv", "diagnosis_encoded")

    # split data for training and testing
    X_train, X_test, y_train, y_test = split_data(X, y)

    # scale training and testing data
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    # train model
    model = train_random_forest(X_train_scaled, y_train)

    # evaluate model
    accuracy, report, cm = evaluate_model(model, X_test_scaled, y_test)

    # print metrics
    plot_results(accuracy, report, cm)

    # plot confusion matrix
    plot_confusion_matrix(cm, xlabel="Diagnosis Predicted Labels", ylabel="Diagnosis True Labels",
                          title="Random Forest Diagnostic Confusion Matrix",
                          xtick=("Benign", "Malignant"),
                          ytick=("Benign", "Malignant"))

if __name__ == "__main__":
    main()
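
The helper functions used throughout this article are not reproduced in every listing. A minimal sketch of how “load_data,” “split_data,” “scale_data,” “train_random_forest,” “evaluate_model,” “plot_results,” and “plot_confusion_matrix” might be implemented, based on the behavior described above (the random_state values are assumptions), follows:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def load_data(file_path, target_column):
    # read the CSV and separate features from the target
    df = pd.read_csv(file_path)
    X = df.drop(columns=[target_column])
    y = df[target_column]
    return X, y

def split_data(X, y, test_size=0.2, random_state=42):
    # hold out 20% of the data for testing
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def scale_data(X_train, X_test):
    # fit the scaler on the training data only to prevent data leakage
    scaler = StandardScaler()
    return scaler.fit_transform(X_train), scaler.transform(X_test)

def train_random_forest(X_train, y_train):
    # fit a Random Forest with default hyperparameters
    model = RandomForestClassifier(random_state=50)
    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_test, y_test):
    # predict on the held-out test set and compute the metrics
    y_pred = model.predict(X_test)
    return (accuracy_score(y_test, y_pred),
            classification_report(y_test, y_pred),
            confusion_matrix(y_test, y_pred))

def plot_results(accuracy, report, cm):
    # print the evaluation metrics to the console
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Classification Report:\n{report}")
    print(f"Confusion Matrix:\n{cm}")

def plot_confusion_matrix(cm, xlabel, ylabel, title, xtick, ytick):
    # heatmap of the confusion matrix with labeled axes
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=xtick, yticklabels=ytick)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()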

Random Forest Classification

Model Performance

Accuracy: 0.9474

94.74% of the classifications are correct.

Classification Metrics

Class 0 is a benign breast cancer diagnosis. Class 1 is a malignant breast cancer diagnosis.

Average Metrics

[Figure: macro and weighted averages for precision, recall, F1-score, and support]

Confusion Matrix

  • True Positives (TP): 36 (Correctly predicted malignant diagnoses)
  • True Negatives (TN): 72 (Correctly predicted benign diagnoses)
  • False Positives (FP): 3 (Incorrectly predicted malignant diagnoses)
  • False Negatives (FN): 3 (Incorrectly predicted benign diagnoses)

Observations

The model performs well with 94.74% accuracy in predicting breast cancer diagnoses based on the dataset. The confusion matrix shows 3 false positives and 3 false negatives, indicating an area for improvement, especially for the false negatives.

6. ChatGPT Confusion Matrix Report Generation

Summary

  • Overall Accuracy: 94.7%
  • Precision: 92.3%
  • Recall: 92.3%
  • F1-Score: 92.3%

This model shows high accuracy and balanced precision and recall, indicating it is effective at distinguishing between malignant and benign diagnoses with minimal misclassification errors. However, it is important to monitor for any signs of overfitting if the model’s performance degrades on new data.

7. Model Hyperparameter Optimization

To see if the accuracy of this model can be improved, a copy of the previous program is created, incorporating grid search cross-validation from Scikit-learn. A new variable for the parameter grid is defined as “param_grid,” which specifies different parameters for the model to explore in order to identify the best settings for the dataset. The parameter grid is a dictionary containing the following parameters:

  • Number of Estimators: 50, 100, and 150
  • Max depth: None, 10, 20, and 30
  • Minimum Samples Split: 2, 5, and 10
  • Minimum Samples Leaf: 1, 2, and 4

The Random Forest Classifier is used as the estimator, with 5-fold cross-validation, the number of jobs set to -1 (using all CPU cores), and verbose set to 2. Grid search returns the optimum parameters for Random Forest Classification: max depth of None, minimum samples leaf of 1, minimum samples split of 2, and number of estimators set to 150. These tuned hyperparameters are added back into the Random Forest Classifier program, but the metrics show no improvement. The “GridSearchCV()” cross-validation code is provided below.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def main():
    # load dataframe and define X and y
    X, y = load_data("breast_cancer_kaggle_cleaned.csv", target_column="diagnosis_encoded")

    # split data for training and testing
    X_train, X_test, y_train, y_test = split_data(X, y)

    # scale training and testing data
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    # defining random forest classifier
    rf = RandomForestClassifier(random_state=50)

    # defining the parameter grid
    param_grid = {
        "n_estimators": [50, 100, 150],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4]
    }

    # defining grid search parameters for random forest
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

    # fitting model
    grid_search.fit(X_train_scaled, y_train)

    # finding and printing best parameters
    best_params = grid_search.best_params_
    print(f"\nBest Parameters:\n{best_params}")

    # best model
    best_model = grid_search.best_estimator_

    # evaluate model
    accuracy, report, cm = evaluate_model(best_model, X_test_scaled, y_test)

    # print metrics
    plot_results(accuracy, report, cm)

if __name__ == "__main__":
    main()

8. Additional Classification Models

To evaluate how well this dataset performs on different algorithms, a program is created to run multiple classification algorithms, assess their metrics, and plot their accuracy ratings against each other, allowing for the determination of which algorithm is the most effective at classifying diagnoses. To do this, the necessary libraries are imported with their modules and classes. The function “load_data” is used to load the dataset and define X and y. The “split_data” function splits the data into 80% training and 20% testing DataFrames. The “scale_data” function is applied to scale the feature training and testing data. A dictionary named “models” is created that includes several algorithms, each with its hyperparameters tuned through grid search cross-validation. The algorithms included are Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree Classification, Random Forest Classification, Naïve Bayes (GaussianNB), Support Vector Machine (SVM), LightGBM, and XGBoost. Two empty lists named “model_accuracy” and “model_f1” are created, and a for loop is established so that each algorithm is trained, makes predictions on the testing data, and prints its accuracy rating and F1 score before storing them in the lists. A bar plot is then created using Matplotlib to display each model’s accuracy rating and F1 score.

Logistic Regression and SVM perform the best, each achieving a 99.12% accuracy rating, up from 98% with untuned hyperparameters. With accuracy ratings this high, the models could be overfitting the data, or there may be other unforeseen issues with it.

import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

def main():
    # load dataframe and define X and y
    X, y = load_data("breast_cancer_kaggle_cleaned.csv", target_column="diagnosis_encoded")

    # split data for training and testing
    X_train, X_test, y_train, y_test = split_data(X, y)

    # scale training and testing data
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    # defining classification models with tuned hyperparameters
    models = {
        "Logistic Regression": LogisticRegression(C=0.1, penalty="l2", solver="liblinear", random_state=42),
        "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=11, metric="manhattan", weights="uniform"),
        "Decision Tree": DecisionTreeClassifier(max_depth=None, min_samples_leaf=4, min_samples_split=10),
        "Random Forest": RandomForestClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2,
                                                n_estimators=150, random_state=50),
        "Naive Bayes": GaussianNB(var_smoothing=1e-06),
        "Support Vector Machine": SVC(kernel="linear", C=1.0, max_iter=1000),
        "LightGBM": lgb.LGBMClassifier(learning_rate=0.1, max_depth=10, n_estimators=200, num_leaves=31),
        "XGBoost": XGBClassifier(max_depth=7, learning_rate=0.01, min_child_weight=1,
                                 n_estimators=150, subsample=0.8)
    }

    model_accuracy = []
    model_f1 = []

    # train each model, predict on the test set, and record the metrics
    for model_name, model in models.items():
        model.fit(X_train_scaled, y_train)
        y_predict = model.predict(X_test_scaled)

        accuracy = accuracy_score(y_test, y_predict)
        model_accuracy.append(accuracy * 100)

        f1 = f1_score(y_test, y_predict)
        model_f1.append(f1 * 100)

        print(f"{model_name} Accuracy: {accuracy:.4f}")
        print(f"{model_name} F1 Score: {f1:.4f}")

    # plotting accuracy and F1 scores for all models
    bar_width = 0.35
    index = np.arange(len(models))
    plt.figure(figsize=(10, 6))

    # plotting accuracy bars
    bars_accuracy = plt.bar(index, model_accuracy, bar_width, label="Accuracy", color="skyblue")

    # plotting F1 score bars
    bars_f1 = plt.bar(index + bar_width, model_f1, bar_width, label="F1 Score", color="salmon")

    # titles and labels
    plt.xlabel("Classification Models")
    plt.ylabel("Scores")
    plt.title("Classification Model Comparison: Accuracy and F1 Scores")
    plt.ylim(0, 100)
    plt.xticks(index + bar_width / 2, models.keys(), rotation=45)
    plt.grid(axis="y")
    plt.legend()

    # annotations
    plt.bar_label(bars_accuracy, fmt="%.2f", label_type="center")
    plt.bar_label(bars_f1, fmt="%.2f", label_type="center")
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    main()

9. Correlation Analysis

To check for correlations, a simple Python program is created that first loads the dataset as a Pandas DataFrame. Next, a variable called “corr_matrix” is set to “df.corr().round(2),” which calculates the pairwise correlations for all features and rounds the results to two decimal places. The results are printed, but with 30 features they are too difficult to examine, so a heatmap is created using the Seaborn library to better visualize the correlations in the data.

In this heatmap, nearly all the features show either a positive correlation or none, with no negative correlations present. However, there are several clusters of high correlations worth exploring. To inspect these more closely, a variable called “threshold” is created and set to “0.7.” Next, a variable named “filtered_corr” is created and set to corr_matrix[(corr_matrix >= threshold) | (corr_matrix <= -threshold)]. This checks each element of the correlation matrix against both conditions, keeping only values whose absolute correlation is at least 0.7 and masking the rest. The result is a refined matrix showing only the strong correlations. Finally, this new matrix is plotted, providing a clearer visualization of the significant relationships between features.

In this heatmap, setting the threshold to 0.7 gives a much clearer view of the highly correlated variables. The first pattern that emerges is dense clusters of high correlations, especially near the diagonal line of self-correlations. Another observation is that the radius, perimeter, and area have nearly a 1-to-1 correlation with one another. This makes sense: if the radius of a cell increases, the perimeter and area increase as well. The pattern repeats for the standard error and worst condition features. These clusters of high correlations indicate multicollinearity, where two or more independent variables are highly correlated and may provide redundant information. The multicollinearity suggests that the models may be overfitting, which could be affecting their performance.

Another interesting observation is that the radius mean, perimeter mean, area mean, and concavity mean tend to have a positive correlation with the diagnosis as they grow larger. This trend is repeated with the worst condition features, where concavity_worst shows the strongest correlation to diagnosis_encoded at 0.79. This indicates that as the cell nucleus grows in overall size and develops larger indentations, the likelihood of a malignant diagnosis also increases, effectively answering the research question. The correlation analysis code implementation is shown below.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def main():
    # reading csv
    df = pd.read_csv("breast_cancer_kaggle_cleaned.csv")

    # defining correlation matrix
    corr_matrix = df.corr().round(2)

    # creating a threshold and filtered matrix
    threshold = 0.7
    filtered_corr = corr_matrix[(corr_matrix >= threshold) | (corr_matrix <= -threshold)]

    # plotting the correlation matrix
    plt.figure(figsize=(10, 8))
    sns.set_context("notebook", font_scale=0.5)
    plot = sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", annot_kws={"size": 6})
    plt.xticks(rotation=45)
    plot.set_title("Breast Cancer Diagnostic Correlation Matrix", fontsize=10)

    # plotting the filtered correlation matrix
    plt.figure(figsize=(10, 8))
    sns.set_context("notebook", font_scale=0.5)
    plot = sns.heatmap(filtered_corr, annot=True, cmap="coolwarm", annot_kws={"size": 6})
    plt.xticks(rotation=45)
    plot.set_title("Strong Correlations Threshold: >.7", fontsize=10)
    plt.show()

    # removing highly correlated columns
    df = df.drop(df.columns[10:30], axis=1)
    df = df.drop(columns=["perimeter_mean", "area_mean", "concave points_mean", "compactness_mean"])

    # creating revised matrix
    corr_matrix2 = df.corr().round(2)
    filtered_corr2 = corr_matrix2[(corr_matrix2 >= threshold) | (corr_matrix2 <= -threshold)]

    # plotting the new correlation matrix
    plt.figure(figsize=(10, 8))
    sns.set_context("notebook", font_scale=1)
    plot = sns.heatmap(corr_matrix2, annot=True, cmap="coolwarm", annot_kws={"size": 10})
    plt.xticks(rotation=15)
    plt.yticks(rotation=45)
    plot.set_title("Revised Breast Cancer Diagnostic Correlation Matrix", fontsize=15)

    # plotting the new filtered correlation matrix
    plt.figure(figsize=(10, 8))
    sns.set_context("notebook", font_scale=1)
    plot = sns.heatmap(filtered_corr2, annot=True, cmap="coolwarm", annot_kws={"size": 10})
    plt.xticks(rotation=15)
    plt.yticks(rotation=45)
    plot.set_title("Revised Strong Correlations Threshold: >.7", fontsize=15)
    plt.show()

    # creating a revised dataset
    df.to_csv("breast_cancer_kaggle_cleaned-2.csv", index=False)

if __name__ == "__main__":
    main()

10. Revising the Dataset

To minimize complexity and reduce the risk of overfitting, the redundant standard error (“_se”) and worst condition (“_worst”) features are dropped from the dataset. To achieve this, the last twenty feature columns are indexed and dropped using “df.drop(df.columns[10:30], axis=1),” and the correlation matrix is reassessed. Despite this, some highly correlated values remain, so two additional columns, “perimeter_mean” and “area_mean,” are dropped, as they are nearly 1-to-1 correlated with “radius_mean.” While this significantly simplifies the data, a correlation of 0.92 remains between “concavity_mean” and “concave points_mean,” along with a correlation of 0.88 between “concave points_mean” and “compactness_mean.” To resolve this, “concave points_mean” and “compactness_mean” are dropped, leaving a total of 6 feature columns. A revised CSV file is then created by executing “df.to_csv("breast_cancer_kaggle_cleaned-2.csv", index=False).” This revised dataset is then run through the Random Forest model to ensure the metrics remain robust. These changes lower the complexity of the data and help the models fit better, addressing the overfitting in the original dataset that produced deceptively low error rates.

11. Balancing the Target Classes

Upon further inspection of the revised dataset, an imbalance in the classes is identified that could affect the performance of the ML models. The data skews toward the benign class, which contains 357 instances compared to only 212 instances of the malignant class. This imbalance introduces bias toward the benign class, potentially causing the models to be more effective at classifying benign diagnoses than malignant ones. It can also produce misleading accuracy ratings that favor the majority class while increasing the error rate for the minority class. To address the imbalance, a function called “apply_smote” is created. This function uses the synthetic minority oversampling technique (SMOTE) to generate synthetic data for the minority class from its k-nearest neighbors. It takes the scaled training data (X and y) and creates “X_train_balanced” and “y_train_balanced,” in which both classes have equal representation (568 training instances in total). The testing data remains untouched so that the model learns from the balanced training set while making predictions on the actual testing data. This function is integrated into the models and executed using the revised dataset.
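
The “apply_smote” function is not shown in the listings; a minimal sketch using the imbalanced-learn library’s SMOTE class, matching the behavior described (the random_state value is an assumption), could look like this:

from imblearn.over_sampling import SMOTE

def apply_smote(X_train, y_train, random_state=42):
    # oversample the minority class with synthetic samples generated
    # from its k-nearest neighbors
    smote = SMOTE(random_state=random_state)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
    return X_train_balanced, y_train_balanced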

Random Forest Classification

Model Performance

Accuracy: 0.9737

97.37% of the classifications are correct.

Classification Metrics

Class 0 is a benign breast cancer diagnosis. Class 1 is a malignant breast cancer diagnosis.

Average Metrics

[Figure: macro and weighted averages for precision, recall, F1-score, and support]

Confusion Matrix

  • True Positives (TP): 39 (Correctly predicted malignant diagnoses)
  • True Negatives (TN): 72 (Correctly predicted benign diagnoses)
  • False Positives (FP): 1 (Incorrectly predicted malignant diagnoses)
  • False Negatives (FN): 2 (Incorrectly predicted benign diagnoses)

Observations

This model now performs much better, achieving 97.37% accuracy in predicting breast cancer diagnoses after balancing the classes and revising the dataset. The confusion matrix indicates 1 false positive and 2 false negatives, meaning it mislabeled two malignant cases as benign and one benign case as malignant. Ideally, there would be no false negatives, as a false positive diagnosis is preferable to a false negative. The macro and weighted averages yield nearly the same results for precision, recall, and F1-score, demonstrating consistent performance across both classes. The code below shows how to fix the “diagnosis_encoded” imbalanced classes.

def main():
    # load dataframe and define X and y
    X, y = load_data("breast_cancer_kaggle_cleaned-2.csv", target_column="diagnosis_encoded")

    # split data for training and testing
    X_train, X_test, y_train, y_test = split_data(X, y)

    # scale training and testing data
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    # balance training data
    X_train_balanced, y_train_balanced = apply_smote(X_train_scaled, y_train)

    # train model
    model = train_random_forest(X_train_balanced, y_train_balanced)

    # evaluate model
    accuracy, report, cm = evaluate_model(model, X_test_scaled, y_test)

    # print metrics
    plot_results(accuracy, report, cm)

    # plot confusion matrix
    plot_confusion_matrix(cm, xlabel="Diagnosis Predicted Labels", ylabel="Diagnosis True Labels",
                          title="Random Forest Diagnostic Confusion Matrix",
                          xtick=("Benign", "Malignant"),
                          ytick=("Benign", "Malignant"))

if __name__ == "__main__":
    main()

12. Most Important Features

To further analyze the revised dataset and determine whether the best features were retained for the classification models, a program is created to visualize the features that the Random Forest algorithm identifies as the most important for classification. To accomplish this, a function called “rf_importance” is created that contains a Random Forest Classification program. The line “feature_importances = rf_model.feature_importances_” is added to this function to extract the feature importances. A Pandas DataFrame is then created to store the features along with their importances, sorted by importance, and the resulting DataFrame is plotted. In the main function, “rf_importance” is called twice to plot the feature importance for both the original dataset and the revised dataset.

It is found that the Random Forest algorithm ranks many of the features that were dropped as more important than those retained. Among the 6 features retained, only “radius_mean” and “concavity_mean” were in the top ten important features from the original dataset. To verify the assumptions from the correlation analysis, the Random Forest model is run using the top ten features identified from the feature importance of the original dataset. The model achieves an accuracy of 95.61% and a confusion matrix showing 3 false negatives and 2 false positives. In comparison, the Random Forest model with the revised dataset has an accuracy rating of 97.37% with only 2 false negatives and 1 false positive in the confusion matrix. This indicates that the choices made during the correlation analysis are better suited for the classification task.

While Random Forest feature importance reveals what the algorithm perceives as the most important features, it does so through statistical calculation, not domain logic. This highlights the importance of human oversight in supervised ML projects. The algorithm ranks as most important several features that show some of the highest correlations with one another, possibly because it finds patterns in those features that are not actually relevant to the task. While ML models are excellent tools for making predictions, a human is still needed to interpret those predictions, assess the quality of the data, and identify biases, especially in a sensitive application like breast cancer diagnosis where the stakes are high. The feature importances can be generated using the code below.

def main():
    # defining plot titles (before they are used)
    plt_title1 = "Random Forest Feature Importance (Original Dataset)"
    plt_title2 = "Random Forest Feature Importance (Revised Dataset)"

    # plotting feature importance for both datasets
    rf_importance("breast_cancer_kaggle_cleaned.csv", plt_title1)
    rf_importance("breast_cancer_kaggle_cleaned-2.csv", plt_title2)

if __name__ == "__main__":
    main()
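
The “rf_importance” function itself is not listed; a minimal sketch of how it might be implemented, based on the description above (the plotting details are assumptions), is:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def rf_importance(file_path, plt_title):
    # train a Random Forest and plot its feature importances
    X, y = load_data(file_path, target_column="diagnosis_encoded")
    X_train, X_test, y_train, y_test = split_data(X, y)
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    rf_model = RandomForestClassifier(random_state=50)
    rf_model.fit(X_train_scaled, y_train)

    # extract the importances and sort them in a DataFrame
    feature_importances = rf_model.feature_importances_
    importance_df = pd.DataFrame({"feature": X.columns,
                                  "importance": feature_importances})
    importance_df = importance_df.sort_values("importance")

    # horizontal bar plot of feature importance
    importance_df.plot(kind="barh", x="feature", y="importance", legend=False)
    plt.title(plt_title)
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()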

13. Multi-model Classification

To observe how the revised dataset performs across multiple models, it is run through the multi-model program, comparing the following models with their hyperparameters tuned using grid search cross validation: Logistic Regression, KNN, Decision Tree Classification, Random Forest Classification, Naïve Bayes, SVM, LightGBM, and XGBoost. Among these models, Random Forest Classification and XGBoost outperform the others, achieving equal metrics, with accuracy scores of 97.37% and F1-scores of 96.3%.

14. Artificial Neural Networks

Multilayer Perceptron

The next ML model used is a Multilayer Perceptron (MLP). An MLP is a type of artificial neural network that consists of an input layer followed by one or more layers of threshold logic units (TLUs), which function as basic neurons. It is a feedforward neural network (FNN): data passes forward through the input layer, then through one or more hidden layers, and finally out through the output layer. In the hidden layers, each neuron applies its weights and bias to the incoming features, along with an activation function. This transformation of the input data enables the network to learn complex patterns. Activation functions like the rectified linear unit (ReLU) or sigmoid introduce non-linearity, which helps in learning intricate relationships. The number of layers and neurons are hyperparameters that can be tuned using cross-validation. The model runs through multiple epochs, or iterations, until it converges, minimizing the loss (the difference between the predicted and target outputs), or until improvement stops.

To create this model, the necessary libraries are imported, and the revised dataset is loaded as a Pandas DataFrame. The target and features are defined using the “load_data” function. The data is then split using the “split_data” function, creating a training set with 80% of randomly selected rows and a testing set with the remaining 20% for validation. X_train and X_test are scaled with the “scale_data” function, and the scaled training data is balanced with “apply_smote.” The MLP model is created using the “MLPClassifier” from Scikit-learn, and the hyperparameters are defined. “hidden_layer_sizes” is set to (100,), creating one hidden layer with 100 neurons. For the activation function, “relu” is used to introduce non-linearity and identify complex patterns. To optimize the weights, the “solver” is set to “adam,” or adaptive moment estimation (a stochastic optimizer), and “max_iter” is set to 1000, allowing the optimizer enough iterations to converge and learn the patterns in the training data. The model is then fit using “X_train_balanced” and “y_train_balanced.” The “evaluate_model” function is used to make predictions on the test data and generate the accuracy score, classification report, and confusion matrix. These results are printed using the “plot_results” function, and the “plot_confusion_matrix” function displays the confusion matrix graph.

Model Performance

Accuracy: 0.9825

98.25% of the classifications are correct.

Classification Metrics

Class 0 is a benign breast cancer diagnosis. Class 1 is a malignant breast cancer diagnosis.

Average Metrics

[Figure: macro and weighted averages for precision, recall, F1-score, and support]

Confusion Matrix

  • True Positives (TP): 41 (Correctly predicted malignant diagnoses)
  • True Negatives (TN): 71 (Correctly predicted benign diagnoses)
  • False Positives (FP): 0 (Incorrectly predicted malignant diagnoses)
  • False Negatives (FN): 2 (Incorrectly predicted benign diagnoses)

Observations

This MLP model performs very well in classifying breast cancer diagnoses, achieving an accuracy score of 98.25%. It demonstrates high precision, recall, and F1 scores, indicating that it makes accurate predictions. The macro and weighted averages are nearly the same, suggesting that the dataset is well balanced and the model is consistent. The confusion matrix reveals that the MLP predicted the correct diagnoses except for 2 cases that were incorrectly classified as benign when they were actually malignant. The MLP classification model code can be found below.

from sklearn.neural_network import MLPClassifier

def main():
    # load dataframe and define X and y
    X, y = load_data("breast_cancer_kaggle_cleaned-2.csv", target_column="diagnosis_encoded")

    # split data for training and testing
    X_train, X_test, y_train, y_test = split_data(X, y)

    # scale training and testing data
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    # balance training data
    X_train_balanced, y_train_balanced = apply_smote(X_train_scaled, y_train)

    # creating model and defining hyperparameters
    model = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                          solver="adam", max_iter=1000, random_state=40)

    # fitting model to training data
    model.fit(X_train_balanced, y_train_balanced)

    # evaluate model
    accuracy, report, cm = evaluate_model(model, X_test_scaled, y_test)

    # print metrics
    plot_results(accuracy, report, cm)

    # plot confusion matrix
    plot_confusion_matrix(cm, xlabel="Diagnosis Predicted Labels", ylabel="Diagnosis True Labels",
                          title="MLP Diagnostic Confusion Matrix",
                          xtick=("Benign", "Malignant"),
                          ytick=("Benign", "Malignant"))

if __name__ == "__main__":
    main()

15. Convolutional Neural Networks

For the final ML model, a deep learning model known as a Convolutional Neural Network (CNN) is used. While similar to the MLP in its use of neurons and connected layers, a CNN has a more complex architecture that includes specialized layers, such as convolutional layers and pooling layers, in addition to fully connected layers. CNNs are usually used for more complex applications, such as image recognition, but can also be applied to simpler classification tasks like this one.

To prepare this model, the necessary libraries are imported, and TensorFlow’s random seed is set for reproducibility of the results. As with the previous models, the dataset is loaded, the target and features are defined, the data is split into 80% training and 20% testing sets, and the training data is balanced. The balanced training data (X_train_balanced) and the scaled testing data (X_test_scaled) then need to be reshaped for the CNN, as a one-dimensional CNN expects the format (samples, features, 1). For “X_train_reshaped,” the shape is (568, 6, 1): 568 is the number of training rows (after balancing), 6 is the number of features, and 1 is the number of channels per input sequence. For X_test_reshaped, the shape is (114, 6, 1), with 114 representing the 20% of the data held out for testing.

Building the Model

To create the CNN model, “models.Sequential()” is called for a linear stack of layers where one input gets one output. The layers inside the CNN are then defined, starting with “layers.Conv1D(),” which creates a one-dimensional convolutional layer for the input. For the activation function, “relu” is used, introducing non-linearity. For the pooling layer, “layers.MaxPooling1D()” is used, which retains the important features while reducing the dimensionality of the convolutional layer’s output. Next is “layers.Flatten(),” which flattens the output of the pooling layer into a 1D array, preparing it for the dense layers. The last two layers are dense (“layers.Dense”) layers: the first has 64 neurons with the “relu” activation function, and the second is the output layer, a single neuron with the “sigmoid” activation function, which maps any input to a value between 0 and 1.

Compiling and Fitting the Model

The model is then compiled using “model.compile” with the “adam” optimizer and the loss function set to “binary_crossentropy,” which measures the difference between predicted and actual class labels. To fit the model, a variable called “history” is created and set to “model.fit” so that the results can be plotted later, with the number of epochs set to 30. This allows the training enough time to converge while avoiding overfitting. The batch size is 32, meaning that during each epoch the model trains on batches of 32 training samples until all the data is used.

import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.metrics import classification_report, confusion_matrix

# setting the random seed for reproducibility (seed value assumed)
tf.random.set_seed(42)

def main():
    # loading dataset, defining target and features
    X, y = load_data(file_path="breast_cancer_kaggle_cleaned-2.csv", target_column="diagnosis_encoded")

    # splitting data into test and training sets
    X_train, X_test, y_train, y_test = split_data(X, y)

    # scaling data
    X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

    # balance training data
    X_train_balanced, y_train_balanced = apply_smote(X_train_scaled, y_train)

    # reshaping for the 1D CNN: (samples, features, 1)
    X_train_reshaped = X_train_balanced.reshape(X_train_balanced.shape[0], X.shape[1], 1)
    X_test_reshaped = X_test_scaled.reshape(X_test_scaled.shape[0], X.shape[1], 1)

    # defining the 1D CNN model with a single sigmoid output unit
    # for binary classification
    model = models.Sequential([
        layers.Conv1D(32, kernel_size=3, activation="relu", input_shape=(X_train_reshaped.shape[1], 1)),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid")
    ])

    # compiling the model
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()

    # fit the model
    history = model.fit(X_train_reshaped, y_train_balanced, epochs=30, batch_size=32,
                        validation_data=(X_test_reshaped, y_test))

    # making predictions
    y_pred_probs = model.predict(X_test_reshaped)
    y_pred = (y_pred_probs > 0.5).astype(int)

    # model metrics
    test_loss, test_accuracy = model.evaluate(X_test_reshaped, y_test)
    report = classification_report(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    # printing metrics
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test Loss: {test_loss:.4f}")
    print("Classification Report:")
    print(report)
    print(f"Confusion Matrix:\n{cm}")

    # plot training and loss history
    plot_history(history)

    # plot confusion matrix
    plot_confusion_matrix(cm, xlabel="Diagnosis Predicted Labels", ylabel="Diagnosis True Labels",
                          title="CNN Diagnostic Confusion Matrix",
                          xtick=("Benign", "Malignant"),
                          ytick=("Benign", "Malignant"))

if __name__ == "__main__":
    main()

Evaluating the Model

To create probabilities for the testing data and evaluate the model, a new variable called “y_pred_probs” is created and set to “model.predict(X_test_reshaped).” Next, the variable “y_pred” is created and set to “(y_pred_probs > .5).astype(int).” This sets a threshold for the predicted values, giving a Boolean True if the predicted probability is greater than 0.5 and False otherwise, then converts the Booleans to binary integers so they can be evaluated against the true labels. For the metrics, variables for loss and accuracy are created and printed, along with a classification report and confusion matrix. Two subplots are created using the “plot_history” function to visualize the loss and accuracy over the epochs for the training set against the testing set using Matplotlib. Finally, a confusion matrix graph is generated using the “plot_confusion_matrix” function.
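
The “plot_history” helper is not listed; a minimal sketch that plots training and validation loss and accuracy over the epochs, matching the description above, might be:

import matplotlib.pyplot as plt

def plot_history(history):
    # two subplots: loss and accuracy, training vs. validation
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 5))

    ax_loss.plot(history.history["loss"], label="Training Loss")
    ax_loss.plot(history.history["val_loss"], label="Validation Loss")
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    ax_acc.plot(history.history["accuracy"], label="Training Accuracy")
    ax_acc.plot(history.history["val_accuracy"], label="Validation Accuracy")
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.legend()

    plt.tight_layout()
    plt.show()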

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D)              (None, 4, 32)             128

max_pooling1d (MaxPooling1D) (None, 2, 32)             0

flatten (Flatten)            (None, 64)                0

dense (Dense)                (None, 64)                4160

dense_1 (Dense)              (None, 1)                 65

=================================================================
Total params: 4,353
Trainable params: 4,353
Non-trainable params: 0
_________________________________________________________________
Epoch 1/30
2024-11-08 09:41:04.274207: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
18/18 [==============================] - 5s 47ms/step - loss: 0.6278 - accuracy: 0.7324 - val_loss: 0.5166 - val_accuracy: 0.9123
Epoch 2/30
18/18 [==============================] - 0s 20ms/step - loss: 0.4727 - accuracy: 0.8697 - val_loss: 0.3795 - val_accuracy: 0.9035
Epoch 3/30
18/18 [==============================] - 0s 20ms/step - loss: 0.3566 - accuracy: 0.8838 - val_loss: 0.2880 - val_accuracy: 0.9298
Epoch 4/30
18/18 [==============================] - 0s 24ms/step - loss: 0.2757 - accuracy: 0.8996 - val_loss: 0.2349 - val_accuracy: 0.9298
Epoch 5/30
18/18 [==============================] - 0s 24ms/step - loss: 0.2234 - accuracy: 0.9313 - val_loss: 0.2106 - val_accuracy: 0.9298
Epoch 6/30
18/18 [==============================] - 0s 22ms/step - loss: 0.1898 - accuracy: 0.9419 - val_loss: 0.1934 - val_accuracy: 0.9386
Epoch 7/30
18/18 [==============================] - 0s 23ms/step - loss: 0.1714 - accuracy: 0.9489 - val_loss: 0.1823 - val_accuracy: 0.9474
Epoch 8/30
18/18 [==============================] - 0s 25ms/step - loss: 0.1569 - accuracy: 0.9437 - val_loss: 0.1705 - val_accuracy: 0.9474
Epoch 9/30
18/18 [==============================] - 0s 23ms/step - loss: 0.1467 - accuracy: 0.9472 - val_loss: 0.1598 - val_accuracy: 0.9474
Epoch 10/30
18/18 [==============================] - 0s 25ms/step - loss: 0.1368 - accuracy: 0.9507 - val_loss: 0.1531 - val_accuracy: 0.9474
Epoch 11/30
18/18 [==============================] - 0s 25ms/step - loss: 0.1295 - accuracy: 0.9525 - val_loss: 0.1455 - val_accuracy: 0.9474
Epoch 12/30
18/18 [==============================] - 0s 21ms/step - loss: 0.1243 - accuracy: 0.9560 - val_loss: 0.1397 - val_accuracy: 0.9474
Epoch 13/30
18/18 [==============================] - 0s 26ms/step - loss: 0.1201 - accuracy: 0.9560 - val_loss: 0.1353 - val_accuracy: 0.9474
Epoch 14/30
18/18 [==============================] - 0s 24ms/step - loss: 0.1167 - accuracy: 0.9560 - val_loss: 0.1299 - val_accuracy: 0.9561
Epoch 15/30
18/18 [==============================] - 0s 21ms/step - loss: 0.1136 - accuracy: 0.9577 - val_loss: 0.1257 - val_accuracy: 0.9561
Epoch 16/30
18/18 [==============================] - 0s 21ms/step - loss: 0.1113 - accuracy: 0.9577 - val_loss: 0.1234 - val_accuracy: 0.9561
Epoch 17/30
18/18 [==============================] - 0s 24ms/step - loss: 0.1096 - accuracy: 0.9595 - val_loss: 0.1233 - val_accuracy: 0.9561
Epoch 18/30
18/18 [==============================] - 0s 22ms/step - loss: 0.1069 - accuracy: 0.9577 - val_loss: 0.1203 - val_accuracy: 0.9649
Epoch 19/30
18/18 [==============================] - 1s 31ms/step - loss: 0.1064 - accuracy: 0.9613 - val_loss: 0.1210 - val_accuracy: 0.9649
Epoch 20/30
18/18 [==============================] - 0s 23ms/step - loss: 0.1077 - accuracy: 0.9613 - val_loss: 0.1155 - val_accuracy: 0.9649
Epoch 21/30
18/18 [==============================] - 0s 24ms/step - loss: 0.1022 - accuracy: 0.9630 - val_loss: 0.1202 - val_accuracy: 0.9737
Epoch 22/30
18/18 [==============================] - 0s 24ms/step - loss: 0.1013 - accuracy: 0.9648 - val_loss: 0.1163 - val_accuracy: 0.9649
Epoch 23/30
18/18 [==============================] - 0s 24ms/step - loss: 0.0994 - accuracy: 0.9648 - val_loss: 0.1150 - val_accuracy: 0.9649
Epoch 24/30
18/18 [==============================] - 0s 22ms/step - loss: 0.0998 - accuracy: 0.9648 - val_loss: 0.1138 - val_accuracy: 0.9649
Epoch 25/30
18/18 [==============================] - 0s 24ms/step - loss: 0.1012 - accuracy: 0.9613 - val_loss: 0.1128 - val_accuracy: 0.9737
Epoch 26/30
18/18 [==============================] - 0s 23ms/step - loss: 0.1018 - accuracy: 0.9630 - val_loss: 0.1110 - val_accuracy: 0.9737
Epoch 27/30
18/18 [==============================] - 0s 23ms/step - loss: 0.0959 - accuracy: 0.9648 - val_loss: 0.1112 - val_accuracy: 0.9561
Epoch 28/30
18/18 [==============================] - 0s 25ms/step - loss: 0.0962 - accuracy: 0.9665 - val_loss: 0.1136 - val_accuracy: 0.9737
Epoch 29/30
18/18 [==============================] - 0s 25ms/step - loss: 0.0945 - accuracy: 0.9683 - val_loss: 0.1119 - val_accuracy: 0.9737
Epoch 30/30
18/18 [==============================] - 0s 23ms/step - loss: 0.0960 - accuracy: 0.9613 - val_loss: 0.1116 - val_accuracy: 0.9737
4/4 [==============================] - 0s 4ms/step
4/4 [==============================] - 0s 9ms/step - loss: 0.1116 - accuracy: 0.9737
Test Accuracy: 0.9737
Test Loss: 0.1116
Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        73
           1       1.00      0.93      0.96        41

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114

Confusion Matrix:
[[73  0]
 [ 3 38]]

Model Performance

Accuracy: 0.9737

Loss: 0.1116

97.37% of the classifications are correct with a low loss value of 0.1116.

Classification Metrics

Class 0 is a benign breast cancer diagnosis. Class 1 is a malignant breast cancer diagnosis.

Average Metrics

[Figure: macro and weighted averages for precision, recall, F1-score, and support]

Confusion Matrix

  • True Positives (TP): 38 (Correctly predicted malignant diagnoses)
  • True Negatives (TN): 73 (Correctly predicted benign diagnoses)
  • False Positives (FP): 0 (Incorrectly predicted malignant diagnoses)
  • False Negatives (FN): 3 (Incorrectly predicted benign diagnoses)

Observations

This model performs well, with a 97.37% accuracy rating and a low loss of 0.1116, showing that it makes few prediction errors. It correctly predicted all diagnoses except for 3 instances in which it predicted benign when the correct label was malignant. An interesting observation is that this model’s accuracy score matches the results of the Random Forest and XGBoost models. This indicates that the data is now stable and the models are similarly matched in capacity, finding the same patterns in the dataset. To test this further, new unseen data would need to be introduced to each model.

The source code for this research paper can be found in the following GitHub repository https://github.com/cturner119/breast_cancer_diagnosis/tree/main.

16. Conclusion

For diagnosing breast cancer from an FNA procedure using ML models, the full set of provided features proved excessive. The mean values for the cell nucleus characteristics would suffice; the addition of the worst condition and standard error for each characteristic led to overfitting, increased complexity, and multicollinearity. Another issue was class imbalance, which was addressed using SMOTE to create synthetic data and give each class equal representation. Although the Random Forest algorithm suggested what it calculated to be the most important features, using those features reduced the accuracy rating by approximately 2% (95.61% versus 97.37%). This indicates that the algorithm was finding patterns in those features that were not applicable to classifying diagnoses, and it demonstrates the necessity of human oversight in guiding ML algorithms, despite their predictive capabilities. Random Forest Classification, XGBoost, and the CNN all yielded the same accuracy of 97.37%. These equal metrics may result from the simplicity of the dataset, which is not very large; they could indicate that the models are still overfitting; or they could imply that all the models captured the same underlying patterns in the data. The most effective model for classifying diagnoses from this dataset proved to be the MLP, which achieved an accuracy rating of 98.25%. While these models could be further tested and potentially deployed for diagnosing breast cancer, the CNN may not be the best option due to its complexity and resource use. The analysis of the correlation matrix revealed that the feature “concavity_worst” had the greatest correlation with a malignant diagnosis at 0.79, although this feature was not used in the final dataset. After dropping the unnecessary features, “radius_mean” exhibited the highest correlation at 0.73. These findings indicate that larger cell nuclei and greater indentations in the cell nucleus correlate with a higher likelihood of a malignant breast cancer diagnosis.


Written by Ernest Bonat, Ph.D.

I’m a Senior Machine Learning Developer. I work on Machine Learning application projects for Life Sciences using Python and the Python Data Ecosystem.