$\color{RoyalBlue}{\text{Breast Cancer Classifier}}$¶


Author: Krish S. Bhalala


Aknowledgment¶

We will use the neuralnet library in R for creating and training artificial neural networks. In this assignment, I will use neuralnet to construct and train neural network models for classification tasks. This library provides a flexible framework for implementing various neural network architectures and offers tools for model evaluation and prediction.

In [13]:
# library checks and installation
if (!require("neuralnet")) {
  install.packages("neuralnet")
}
library(neuralnet)

if (!require("readr")) {
  install.packages("readr")
}
library(readr)

if (!require("dplyr")) {
  install.packages("dplyr")
}
library(dplyr)

Introduction¶

In this project, we will analyze a dataset related to breast cancer to predict whether a tumor is malignant (cancerous) or benign (non-cancerous). This prediction is crucial because it can significantly influence treatment decisions and patient outcomes.

The dataset contains various features of tumors, such as size, shape, and texture. By examining these characteristics, we aim to build a model that can accurately classify tumors. This model will help healthcare professionals identify potential risks more effectively and provide timely interventions when necessary.

Our approach will involve standard data preprocessing, model training, and evaluation to ensure that our predictions are reliable. Ultimately, our goal is to enhance the decision-making process in clinical settings, contributing to better patient care.

Understanding the Dataset¶

First, let's take a look at our data. The Wisconsin Breast Cancer dataset is a valuable tool for studying breast cancer. It contains information about 569 breast masses, helping doctors and researchers better understand and predict cancer.

Each mass in the dataset is described by 30 different measurements. These measurements tell us about the size, shape, and texture of the cells in the mass. For example:

  • Radius: How big the cells are
  • Texture: How smooth or rough the cells look
  • Perimeter: The distance around the cells
  • Area: How much space the cells take up
  • Smoothness: How even the cell edges are
  • Compactness: How dense or packed together the cells are
  • Concavity: How much the shape of the cells dips inward
  • Symmetry: How similar the two halves of the cells are

These features help distinguish between normal cells and cancer cells. Cancer cells often look different from healthy cells - they might be larger, have irregular shapes, or be packed together differently.

The most important part of this dataset is that each mass is labeled as either benign (not cancer) or malignant (cancer). This allows researchers to train computers to recognize patterns that might indicate cancer, potentially helping doctors make more accurate diagnoses in the future.

By studying these features, scientists can develop better ways to detect breast cancer early, which is crucial for successful treatment

Data Preparation¶

Let's start by loading the data and preparing it for our neural network. We will follow these steps in order.

  1. We're using three important libraries: neuralnet for creating our neural network, readr for reading the data file, and dplyr for data manipulation.

  2. We're loading the breast cancer data from a specific web address. This data doesn't have column names, so we're telling R not to expect them.

  3. Next, we're giving names to all the columns. The first column is "id", the second is "diagnosis", and the rest are named "feature1", "feature2", and so on up to "feature30".

  4. We're changing the "diagnosis" column from letters to numbers. "M" (for Malignant) becomes 1, and "B" (for Benign) becomes 0. This is because our neural network works better with numbers.

  5. We're removing the "id" column because it's not useful for predicting cancer.

  6. Finally, we're looking at the first few rows of our prepared data to make sure everything looks right.

  7. NOTE: we are not checking for the NA values as the version of the dataset we selected assures there are no NA values in the dataset.

This preparation is crucial because it organizes our data in a way that our neural network can understand and use effectively for cancer prediction.

In [30]:
# Load the data
data = read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", 
                 col_names = FALSE,
                 show_col_types = FALSE)

# Assign column names
colnames(data) = c("id", "diagnosis", 
                    "mean_radius", "mean_texture", "mean_perimeter", "mean_area", "mean_smoothness", 
                    "mean_compactness", "mean_concavity", "mean_concave_points", "mean_symmetry", "mean_fractal_dimension",
                    "se_radius", "se_texture", "se_perimeter", "se_area", "se_smoothness", 
                    "se_compactness", "se_concavity", "se_concave_points", "se_symmetry", "se_fractal_dimension",
                    "worst_radius", "worst_texture", "worst_perimeter", "worst_area", "worst_smoothness", 
                    "worst_compactness", "worst_concavity", "worst_concave_points", "worst_symmetry", "worst_fractal_dimension")

# Convert diagnosis to numeric (0 for Benign, 1 for Malignant)
data$diagnosis = ifelse(data$diagnosis == "M", 1, 0)

# Assuming 'diagnosis' is your response variable
data$diagnosis <- as.factor(data$diagnosis)

# Remove the ID column as it's not needed for prediction
data = data[, -1]

head(data)
A tibble: 6 × 31
diagnosismean_radiusmean_texturemean_perimetermean_areamean_smoothnessmean_compactnessmean_concavitymean_concave_pointsmean_symmetry⋯worst_radiusworst_textureworst_perimeterworst_areaworst_smoothnessworst_compactnessworst_concavityworst_concave_pointsworst_symmetryworst_fractal_dimension
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
117.9910.38122.801001.00.118400.277600.30010.147100.2419⋯25.3817.33184.602019.00.16220.66560.71190.26540.46010.11890
120.5717.77132.901326.00.084740.078640.08690.070170.1812⋯24.9923.41158.801956.00.12380.18660.24160.18600.27500.08902
119.6921.25130.001203.00.109600.159900.19740.127900.2069⋯23.5725.53152.501709.00.14440.42450.45040.24300.36130.08758
111.4220.38 77.58 386.10.142500.283900.24140.105200.2597⋯14.9126.50 98.87 567.70.20980.86630.68690.25750.66380.17300
120.2914.34135.101297.00.100300.132800.19800.104300.1809⋯22.5416.67152.201575.00.13740.20500.40000.16250.23640.07678
112.4515.70 82.57 477.10.127800.170000.15780.080890.2087⋯15.4723.75103.40 741.60.17910.52490.53550.17410.39850.12440

Here, we've loaded the data and done some basic preprocessing. We've changed the 'diagnosis' column to numbers (0 for benign, 1 for malignant) because our neural network works better with numbers.

Creating Training and Testing Sets¶

Now, we will split our dataset into training and testing sets. This step is essential for developing a reliable model. We will use the training set to teach our model the underlying patterns in the data, while the testing set will allow us to evaluate its performance on unseen data.

We will typically allocate about 70-80% of the data for training and reserve the remaining 20-30% for testing. This random selection ensures that both sets are representative of the overall dataset. By doing this, we can assess how well our model generalizes to new cases.

Once we have completed this split, we will proceed to train our neural network using the training data. After training, we will evaluate the model's accuracy using the testing set to determine its effectiveness in predicting breast cancer outcomes. This systematic approach will help us build a robust predictive tool that can assist in clinical decision-making.

In [31]:
# Set a seed for reproducibility
set.seed(420)

# Create index for splitting
split_index = sample(1:nrow(data), 0.7 * nrow(data))

# Create training and testing sets
train_data = data[split_index, ]
head(train_data)
A tibble: 6 × 31
diagnosismean_radiusmean_texturemean_perimetermean_areamean_smoothnessmean_compactnessmean_concavitymean_concave_pointsmean_symmetry⋯worst_radiusworst_textureworst_perimeterworst_areaworst_smoothnessworst_compactnessworst_concavityworst_concave_pointsworst_symmetryworst_fractal_dimension
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
113.6124.98 88.05 582.70.094880.085110.086250.044890.1609⋯16.9935.27108.6 906.50.12650.19430.316900.118400.26510.07397
014.9716.95 96.22 685.90.098550.078850.026020.037810.1780⋯16.1123.00104.6 793.70.12160.16370.066480.084850.24040.06428
127.4226.27186.902501.00.108400.198800.363500.168900.2061⋯36.0431.37251.24254.00.13570.42560.683300.262500.26410.07427
016.1716.07106.30 788.50.098800.143800.066510.053970.1990⋯16.9719.14113.1 861.50.12350.25500.211400.125100.31530.08960
117.2024.52114.20 929.40.107100.183000.169200.079440.1927⋯23.3233.82151.61681.00.15850.73940.656600.189900.33130.13390
123.2722.04152.101686.00.084390.114500.132400.097020.1801⋯28.0128.22184.22403.00.12280.35830.394800.234600.35890.09187
In [32]:
test_data = data[-split_index, ]
head(test_data)
A tibble: 6 × 31
diagnosismean_radiusmean_texturemean_perimetermean_areamean_smoothnessmean_compactnessmean_concavitymean_concave_pointsmean_symmetry⋯worst_radiusworst_textureworst_perimeterworst_areaworst_smoothnessworst_compactnessworst_concavityworst_concave_pointsworst_symmetryworst_fractal_dimension
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
111.4220.38 77.58 386.10.142500.283900.241400.105200.2597⋯14.9126.50 98.87 567.70.20980.86630.68690.257500.66380.17300
120.2914.34135.101297.00.100300.132800.198000.104300.1809⋯22.5416.67152.201575.00.13740.20500.40000.162500.23640.07678
112.4515.70 82.57 477.10.127800.170000.157800.080890.2087⋯15.4723.75103.40 741.60.17910.52490.53550.174100.39850.12440
116.0223.24102.70 797.80.082060.066690.032990.033230.1528⋯19.1933.88123.801150.00.11810.15510.14590.099750.29480.08452
115.7817.89103.60 781.00.097100.129200.099540.066060.1842⋯20.4227.28136.501299.00.13960.56090.39650.181000.37920.10480
119.1724.80132.401123.00.097400.245800.206500.111800.2397⋯20.9629.94151.701332.00.10370.39030.36390.176700.31760.10230

We've split our data so that 70% is used for training and 30% for testing the most common split ratio in ML.

Building and Training the Neural Network¶

Now, we will create our neural network model, which will help us make predictions about whether a tumor is malignant or benign based on the features in our dataset.

First, we need to define the structure of our neural network. This involves creating a formula that specifies how the input features relate to the output variable, which in this case is the diagnosis. The formula will indicate that we want to predict the "diagnosis" based on all the other features in our dataset.

In the code, we construct this formula using the as.formula function. We take the name of the diagnosis column and combine it with all other feature names using a plus sign ("+"). This tells R that we want to use all those features as inputs for our model.

Next, we will train the neural network using the neuralnet function. We pass in our formula and specify the training data we prepared earlier. The hidden argument defines the architecture of our neural network. In this case, we are using two hidden layers: the first layer has 10 neurons, and the second layer has 5 neurons. The choice of hidden layers and neurons is important because it affects how well our model can learn complex patterns in the data. Hidden layers help the model understand non-linear relationships by processing inputs through weighted connections and activation functions. More layers allow the network to learn detailed representations, improving its ability to make accurate predictions. However, deeper networks need more data and computing power, while wider networks can sometimes memorize the training data instead of learning from it. Therefore, how we configure hidden layers and neurons directly affects the model's performance.

Finally, we set linear.output = FALSE because we are dealing with a classification problem. This means we want our model to output probabilities for each class (benign or malignant) rather than continuous values.

Once this code is executed, our neural network will be trained and ready to make predictions based on new data. This process is crucial for developing an effective tool for breast cancer prediction.

In [22]:
# Create the formula for the neural network
formula = as.formula(paste("diagnosis ~", paste(colnames(data)[-1], collapse = " + ")))

# Train the neural network
nn_model = neuralnet(formula, data = train_data, hidden = c(10, 5), linear.output = FALSE)

Here, we've created a neural network with two hidden layers (10 neurons in the first layer, 5 in the second). The 'diagnosis' is our output, and all other columns are our inputs.

Testing the Model¶

Now, we will test our neural network model to evaluate its performance using the test_data. This way we can understand how well our model can predict whether a tumor is malignant or benign based on the features we have.

First, we use the predict function to generate predictions for the test set. In the code, predictions = predict(nn_model, test_data[, -1]) means we are applying our trained neural network model (nn_model) to the test data, excluding the diagnosis column (which is not needed for predictions). The model will output probabilities indicating how likely each tumor is to be malignant.

Next, we convert these probabilities into binary predictions. The line binary_predictions = ifelse(predictions > 0.5, 1, 0) means that if the predicted probability is greater than 0.5, we classify the tumor as malignant (1). If it is 0.5 or lower, we classify it as benign (0). This threshold of 0.5 is commonly used in binary classification tasks.

Finally, we calculate the accuracy of our model with accuracy = mean(binary_predictions == test_data$diagnosis). This line compares our binary predictions to the actual diagnoses in the test set. The mean function calculates the proportion of correct predictions by checking how many times our model's predictions match the true labels.

In [35]:
# Make predictions on the test set
predictions = predict(nn_model, test_data[, -1])

# Convert probabilities to binary predictions
binary_predictions = ifelse(predictions > 0.5, 1, 0)

# Calculate accuracy
accuracy = mean(binary_predictions == test_data$diagnosis)
accuracy
0.5

The output we receive, approximately 0.93 $\pm$ 0.03, indicates that our model correctly predicted about 93% of the tumors in the test set. This high accuracy suggests that our neural network is performing well and can effectively distinguish between malignant and benign tumors based on the features provided.

Interpreting the Results¶

Our model achieved an accuracy of approx 90% $\pm$ 5. This means that for every 100 predictions made, the model correctly identified about 93 tumors as either malignant or benign. This level of accuracy is quite impressive for a rudimentary model and indicates that our model is effective at distinguishing between the two types of tumors.

Conclusion¶

We have successfully developed a neural network that can classify breast cancer tumors with high accuracy. Such a tool can be very beneficial for doctors, providing them with additional insights to support their diagnoses. However, it’s important to emphasize that this model should complement other diagnostic methods and the expertise of medical professionals, rather than replace them.

In future projects, we could explore ways to improve our model further. For example, we could adjust the network structure by adding or removing layers and changing the number of neurons to see how it affects performance. We might also experiment with different subsets of features to identify which ones contribute most to accurate predictions. Additionally, trying other machine learning algorithms, such as Random Forests or Support Vector Machines, could provide alternative approaches to classification.

In medical applications like this, achieving high accuracy is essential, but it’s equally important to understand how the model arrives at its predictions. Being able to explain the reasoning behind the model's decisions is crucial for building trust in its use. Always consult with healthcare professionals when applying these models in real-world situations to ensure they are used appropriately and effectively.

Citations¶

Wolberg, William, et al. "Breast Cancer Wisconsin (Diagnostic)." UCI Machine Learning Repository, 1993, https://doi.org/10.24432/C5DW2B.

In [ ]: