Cancer Type Identification Through Parameterization of Image Patterns Using Machine Learning

Please use this identifier to cite or link to this item: http://hdl.handle.net/123456789/28301

Title:	Cancer Type Identification Through Parameterization of Image Patterns Using Machine Learning
Authors:	SARAH TARIQ SHEIKH
Keywords:	Bioinformatics
Issue Date:	2023
Publisher:	Quaid-i-Azam University, Islamabad
Abstract:	Finding patterns in histopathology images of cancer is a current challenge. Traditionally, pathologists examine whole slides manually, looking for patterns. This approach is time-consuming and laborious, since a proper diagnosis can take days or even months, and it is prone to human error. Machine learning algorithms such as K-NN, logistic regression, random forests and decision trees are now used in healthcare and image processing to help doctors and scientists diagnose diseases faster. The goal of this study is to develop improved strategies for the various CAD phases that will play a critical role not only in minimizing the variability between and among observers, but also in reducing the overall time and cost. The colon cancer dataset was taken from the Kaggle database. The histopathology images were 768x768 pixels in size and in JPEG format. The images were first inspected manually for differences between benign and malignant samples. The nucleus regions differed markedly between the two classes, so they were selected as the regions of interest in this study. Preprocessing steps such as brightness and contrast normalization were applied to improve image quality for analysis. Color segmentation was then used to extract only the nucleus regions, with all other image content masked as white. Several nucleus features were extracted, namely nucleus mean area, nucleus area standard deviation, nucleus mean height, nucleus mean width and aspect ratio, and written to a data file. This file was then used to train and test a K-NN model, chosen for its simplicity. Graphical and statistical analysis showed that the nucleus mean area, mean height, aspect ratio and nucleus area standard deviation were considerably higher in malignant images than in benign images.
Correlation analysis between the features and malignancy further confirmed that the nucleus mean area, nucleus aspect ratio and nucleus height had a stronger effect on malignancy than the other features. Hence, only these features were retained for training, and the others were dropped from the feature vector. A total of 1090 images were used in this study, split approximately 80/20 into 870 images for training and 220 for testing. Different values of k were tried to find the value at which the model's accuracy was highest. The model achieved 90.91% accuracy at k=8, with an area under the ROC curve of 0.87, indicating good performance. Owing to its high accuracy, this software can be used for early screening, which would not only speed up diagnosis but could also become a standard technique in cancer diagnosis.
URI:	http://hdl.handle.net/123456789/28301
Appears in Collections:	M.Phil
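
The feature selection and K-NN tuning described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis code: the data are synthetic, and the feature names, class shifts, random seed and the range of k tried are all assumptions made for the example. K-NN is written out in plain Python rather than taken from any particular library.

```python
import math
import random
from collections import Counter

# Hypothetical feature names, loosely following the abstract.
FEATURES = ["mean_area", "area_std", "mean_height", "mean_width", "aspect_ratio"]
# Assumed class separation per feature (made up for this sketch).
SHIFT = [2.0, 0.5, 1.5, 0.0, 1.5]


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training rows nearest to query (Euclidean).

    On a tied vote, the label seen first among the nearest rows wins.
    """
    nearest = sorted(zip(train_X, train_y),
                     key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def make_synthetic(n, rng):
    """Synthetic rows: label 1 (malignant) is shifted upward on some features."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate benign (0) and malignant (1)
        X.append([rng.gauss(label * s, 1.0) for s in SHIFT])
        y.append(label)
    return X, y


rng = random.Random(0)
X, y = make_synthetic(200, rng)

# Rank features by absolute correlation with the label and keep the top three.
corr = {name: pearson([row[j] for row in X], y)
        for j, name in enumerate(FEATURES)}
selected = sorted(FEATURES, key=lambda f: abs(corr[f]), reverse=True)[:3]
cols = [FEATURES.index(f) for f in selected]
X_sel = [[row[j] for j in cols] for row in X]

# Shuffle the row indices, then split 80/20 into training and test sets.
idx = list(range(len(X_sel)))
rng.shuffle(idx)
cut = int(0.8 * len(idx))
train_X = [X_sel[i] for i in idx[:cut]]
train_y = [y[i] for i in idx[:cut]]
test_X = [X_sel[i] for i in idx[cut:]]
test_y = [y[i] for i in idx[cut:]]

# Try several values of k and keep the one with the highest test accuracy.
scores = {k: accuracy(train_X, train_y, test_X, test_y, k)
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)

print("selected features:", selected)
print("best k:", best_k, "accuracy:", round(scores[best_k], 4))
```

On real measurements the features would likely need to be scaled before distances are computed, since K-NN is sensitive to feature magnitude; the abstract does not say whether this was done. Choosing k on the test set, as in this sketch, also tends to overstate accuracy, so a separate validation set would normally be preferable.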

Files in This Item:

File	Description	Size	Format
BIO 7401.pdf	BIO 7401	1.22 MB	Adobe PDF
