Artificial Intelligence in Radiology





Key Points





  • The convergence of artificial intelligence and radiology was a natural development as these two fields evolved technologically.



  • Different types of artificial intelligence, such as supervised learning and deep learning, have applications in radiology.



  • Machines can interpret radiologic images through multiple techniques and can provide speed and accuracy.



  • The ability to process large numbers of images rapidly offers the radiologist some relief from the increasing demands of this specialty.



  • The future role of AI in radiology is evolving with pitfalls and promises, and radiologists should have a strong role in its development.



Background


Radiology


Radiology as a discipline has been at the forefront of advances in medicine and has always been an early adopter of revolutionary computing technologies. Since 1895, when Wilhelm Röntgen took the first x-ray of his wife’s hand, there have been significant developments in this field.


Digital Radiology


The use of digital technology in radiology started in 1967 with the advent of CT (computed tomography) scanners. The next major shift happened with the advent of MRI (magnetic resonance imaging) in the 1980s. Since then, additional technologies such as single-photon emission CT, positron emission CT, multidetector CT, dual-energy CT, and positron emission tomography with MRI have evolved, each with a significantly higher level of computing sophistication. As development continued, the concept of imaging networks evolved, mainly for the purpose of image storage and distribution; these PACS (picture archiving and communication systems) are now widely used throughout small and large medical enterprises.


Artificial Intelligence


The term “artificial intelligence” (AI) was introduced for the first time by John McCarthy of Dartmouth College at a conference in the summer of 1956, where he defined AI as “the science and engineering of making intelligent machines.”


Computerized Decision-making Tool


These early intelligent machines began to be used in many fields, and in 1972 one of the earliest computerized clinical decision tools in medicine was developed in the United Kingdom. This clinical decision system, called AAPHelp, helped in diagnosing the cause of abdominal pain based on clinical symptoms. The computer system was more accurate than the diagnoses made by senior physicians (91% vs. 79%); the problem at the time, however, was that the computer took a few hours to make the diagnosis.


Computer Vision and Image Recognition


Even with so much success achieved by intelligent machines, progress in the field of image recognition and analysis was relatively slow. Things changed in 2009, when researchers from Princeton University, under the leadership of Li Fei-Fei, developed a new database for computer vision and pattern recognition. They built this database using 3.2 million images and named it ImageNet.


After 2009, this revolutionary ImageNet database was used by multiple groups to improve computer vision and pattern recognition, and they presented their results at the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In 2012, one group took a significant leap when graphics processing units (GPUs) were used for the first time in a convolutional neural network (CNN) named AlexNet, which performed significantly better than previous entries at the annual ILSVRC.


In 2015, a team of computer scientists introduced a new type of CNN called ResNet (residual neural network), which won first prize at the annual ILSVRC and for the first time outperformed human vision on certain tasks. This neural network is fundamentally different from other CNNs in that not all of the layers are connected to each other, making it more time-efficient and accurate.


Technical Knowledge


Central Processing Unit Versus Graphics Processing Unit


A GPU is a special processor, different from the regular central processing unit (CPU) microprocessor as shown in Fig. 11.1, that has thousands of small cores able to handle multiple computations in parallel. GPUs were originally used in video games and were optimized so that each pixel value could be processed by a separate core. In a GPU, more hardware is allocated to computation and less to fast cache memory, the opposite of a CPU, where memory is much more important. A CPU is good at performing serial tasks that require a lot of memory, whereas a GPU performs many computations in parallel.
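
To make the serial-versus-parallel distinction concrete, the following minimal Python sketch (an illustration with invented data, not part of the chapter’s sources) contrasts an explicit loop, which processes one value at a time the way a single CPU core would, with a vectorized array operation, the style that parallel hardware such as a GPU can spread across thousands of cores.

```python
import numpy as np

pixels = np.random.rand(1_000_000)          # e.g., pixel values of a large image

# Serial style: one multiplication at a time, as a single core would iterate.
doubled_serial = [p * 2.0 for p in pixels]

# Vectorized style: one array-wide operation that parallel hardware
# (many CPU cores, or thousands of GPU cores) can split across units.
doubled_vectorized = pixels * 2.0

# On a GPU framework the same idea might look like this (assumes PyTorch with CUDA available):
# import torch
# doubled_gpu = torch.from_numpy(pixels).to("cuda") * 2.0
```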




Fig. 11.1


Architectural difference between a central processing unit (CPU) and a graphics processing unit (GPU); the GPU has multiple microprocessors versus the two or three in a CPU.



Computer Languages


Python and R are the two most used programming languages for data science in deep learning and AI.


Artificial Intelligence


AI is any computer algorithm that can perform a task at a human level. AI includes all types of machine learning (ML), ML includes all types of deep learning, and deep learning includes all types of CNNs ( Fig. 11.2 ).




Fig. 11.2


All types of convolutional neural network (CNN) are a subset of deep learning, all forms of deep learning are a subset of machine learning, and all machine learning is a subset of artificial intelligence.


Machine Learning





  • ML is defined by computer scientist Tom Mitchell as “the study of computer algorithms that improve automatically through experience.” In ML, data train the computer to perform tasks instead of the computer being explicitly programmed. ML can be supervised by humans, if needed.



  • In supervised ML ( Fig. 11.3 ), the data are labeled, which means the correct answer is already provided with the data. Regression and classification are the two types of supervised learning, based on the output variable. When the output is a real number, such as height, weight, or size, it is called regression. Classification is the other type, in which the output variable is a category or class, such as choosing between different colors (e.g., red or blue) ( Fig. 11.4 ).




    Fig. 11.3


    Differences between supervised and unsupervised learning in training, the resulting model, and how the model is applied to new input.



    Fig. 11.4


    Types of classical machine learning.



  • In unsupervised ML, the input data are not labeled, and developers let the machine discover answers on its own. Unsupervised learning can be of two types: clustering and association. Clustering is applied by developers when they want to identify structure or patterns in uncategorized data; for example, uncategorized data on thyroid nodules can be grouped based on shape, size, and calcifications. Association, on the other hand, is used when developers need to recognize associations between different data objects while evaluating large databases ( Fig. 11.5 ). A toy code sketch of supervised and unsupervised learning follows this list.




    Fig. 11.5


    Example of the association type of unsupervised learning, showing how a computer recognizes an association between two data objects.
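
As a concrete illustration of these categories, the toy scikit-learn sketch below (an assumption-laden example, not drawn from the chapter’s references) shows regression and classification as supervised learning on labeled data, and k-means clustering as unsupervised learning on unlabeled data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised regression: the output is a real number (e.g., predicting a size).
age = rng.uniform(20, 80, size=(100, 1))
size = 2.0 * age.ravel() + rng.normal(0, 5, size=100)        # labeled data
regressor = LinearRegression().fit(age, size)

# Supervised classification: the output is a category (e.g., "red" vs. "blue").
features = rng.normal(size=(100, 2))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)   # labeled data
classifier = LogisticRegression().fit(features, labels)

# Unsupervised clustering: no labels; the algorithm finds structure on its own
# (e.g., grouping thyroid-nodule measurements by shape, size, and calcification).
nodules = rng.normal(size=(100, 3))
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(nodules)
```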



Computer-Aided Diagnosis


Computer-aided diagnosis is a type of supervised ML. The earliest use of computer-aided diagnosis was in 1963 by Lodwick to evaluate chest x-rays. Current computers are 30 million times faster than those used in 1963.


In computer-aided diagnosis, images go through multiple steps ( Figs. 11.6 and 11.7 ): preprocessing, segmentation, candidate detection, feature extraction, and classification. A skeletal code sketch of such a pipeline appears after the figures below.



  • 1.

    Preprocessing means that radiology images go through recalibration and noise removal.


  • 2.

    Segmentation is the separation of image data based on anatomy. It is considered one of the most interesting and studied areas in medical imaging.


  • 3.

    During candidate detection, areas that need more attention are identified (for example, a density or calcification).


  • 4.

    In the fourth step, feature extraction is performed on the selected candidates. This step uses the vector space paradigm: each candidate is represented by a vector, a row of numbers in which each number represents one feature of the candidate. The number of features defines the dimension of the feature space and supports pattern classification, also known as ML.


  • 5.

    The last step is classification, in which features are classified as normal or abnormal.




Fig. 11.6


Steps of computer-aided diagnosis.



Fig. 11.7


Steps of computer-aided diagnosis of a lung nodule. CT, Computed tomography.
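
To make these five steps concrete, the skeletal Python sketch below walks a single image through the pipeline; every function body, threshold, and feature here is an illustrative assumption rather than the method of any cited system.

```python
import numpy as np
from scipy import ndimage

def preprocess(image):
    # Step 1: recalibration and noise removal (illustrated by Gaussian smoothing).
    return ndimage.gaussian_filter(image, sigma=1.0)

def segment(image, threshold=0.6):
    # Step 2: separate the image data of interest (illustrated by an intensity threshold).
    return image > threshold

def detect_candidates(mask):
    # Step 3: identify connected regions that deserve more attention.
    labeled, n_regions = ndimage.label(mask)
    return [np.argwhere(labeled == i) for i in range(1, n_regions + 1)]

def extract_features(image, candidate):
    # Step 4: represent each candidate as a vector (a row of numbers), one number per feature.
    values = image[candidate[:, 0], candidate[:, 1]]
    return [candidate.shape[0], values.mean(), values.std()]  # area, mean intensity, heterogeneity

image = preprocess(np.random.rand(128, 128))
feature_vectors = [extract_features(image, c) for c in detect_candidates(segment(image))]
# Step 5: classification would train a model (e.g., scikit-learn's RandomForestClassifier)
# on such feature vectors using "normal"/"abnormal" labels supplied during training.
```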



Deep Learning


Deep learning is a type of ML ( Fig. 11.8 ) in which data train a multilayered network, or deep neural network, to extract higher-level features. Deep neural networks are not the only way to perform deep learning, but they are the most commonly used method.




Fig. 11.8


Diagrammatic representation of the differences between machine learning and deep learning (which is itself a type of machine learning).



Neural networks consist of many interconnected processing elements that change their dynamic internal state in response to external inputs (data).


These concepts were first conceived in 1943. In the 1990s, neural networks were shallow and consisted primarily of only three layers: an input layer, a hidden layer, and an output layer. Each neuron in these layers receives input data, performs an operation, and exports the result to the next layer. Deep neural networks have more than two hidden layers and very commonly have 30 to 40 hidden layers.
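
The minimal NumPy sketch below (with invented layer sizes) illustrates that three-layer idea: each neuron takes weighted inputs, applies an operation, and passes the result to the next layer; a deep network simply stacks many more hidden layers in between.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shallow network: input layer (4 values), one hidden layer (3 neurons), output layer (2 neurons).
x = rng.normal(size=4)                            # input data
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)     # weights and biases, input -> hidden
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)     # weights and biases, hidden -> output

hidden = np.maximum(0, W1 @ x + b1)   # each hidden neuron: weighted sum plus a nonlinear activation (ReLU)
output = W2 @ hidden + b2             # the output layer exports the final result
```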


CNN


CNN is a type of artificial neural network that differs from the others because of its convolutional layers. Convolution in mathematics refers to producing a third function by combining two functions. In a neural network, convolution applies a kernel (filter) to the input data to create a feature map. In radiology, filters are commonly used to create edge sharpening, blurring, and other appearances of images. The next step is pooling, which can consist of averaging or taking the maximum of the values in the target area. In the fully connected layers, the features extracted by the convolutional and pooling layers are mapped onto the final outputs of the model. Activation is one of the most integral components of a CNN; it transforms the output of linear operations (convolutions) into a nonlinear output, allowing the network to learn and perform complex tasks.
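
A minimal PyTorch sketch of these components is shown below; the layer sizes and the two-class output are assumptions chosen only for illustration. A convolutional layer applies learned kernels to produce feature maps, an activation adds nonlinearity, pooling downsamples, and a fully connected layer maps the extracted features to the final outputs.

```python
import torch
from torch import nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # learned 3x3 kernels -> 8 feature maps
        self.act = nn.ReLU()                                   # nonlinear activation
        self.pool = nn.MaxPool2d(2)                            # pooling: keep the maximum in each 2x2 patch
        self.fc = nn.Linear(8 * 32 * 32, n_classes)            # fully connected layer -> final outputs

    def forward(self, x):                  # x: a batch of single-channel 64 x 64 images
        x = self.pool(self.act(self.conv(x)))
        return self.fc(x.flatten(1))

logits = TinyCNN()(torch.randn(4, 1, 64, 64))   # e.g., 4 grayscale 64 x 64 images -> 4 x 2 outputs
```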


CNN is the most widely used deep learning model in imaging for the following reasons:



  • 1.

    Unlike other fully connected neural networks, in which every single neuron must be connected to every single neuron of the next layer, in a CNN only some of the neurons are connected to some of the neurons of the next layer through a kernel or filter. This reduces training time; for example, with a 64 × 64 input layer and a 64 × 64 output layer, a total of 16,777,216 parameters need to be trained, whereas with a 3 × 3 kernel/filter between the two layers, only 9 parameters need to be trained (the arithmetic is checked in the sketch after this list).


  • 2.

    The features that are relevant can be learned from the images. Most of the filters used on regular photographs are predefined, whereas CNN filters are not predefined; they learn to perform certain tasks from raw image data. These trained kernels are then applied to the input images, resulting in different feature maps.


  • 3.

    It is very efficient for new tasks, as an existing trained CNN can be retrained for new tasks through transfer learning.


  • 4.

    CNNs have outperformed other algorithms in image analysis.
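
The parameter arithmetic in the first point can be checked directly, as in the short sketch below (illustrative only).

```python
# Fully connected: every one of the 64 x 64 input neurons connects to every one of the
# 64 x 64 output neurons, so the number of weights is (64*64) * (64*64).
fully_connected_params = (64 * 64) * (64 * 64)
print(fully_connected_params)   # 16777216

# Convolutional: one shared 3 x 3 kernel slides over the whole image,
# so only 3 * 3 weights are trained (bias terms ignored for simplicity).
conv_params = 3 * 3
print(conv_params)              # 9
```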



Residual Network


Evidence suggests that the depth of a network is very important, yet simply adding more layers to a plain network increases the error rate. Therefore, in a ResNet (residual network), weightage is assigned to each stacked layer and some features are allowed to skip layers, so that the high error rate seen with conventional plain multilayer systems can be reduced. In summary, adding layers is important, but doing so can increase errors, which can be corrected by adding weightage and skipping layers.
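
A hedged PyTorch sketch of a single residual block follows (the channel count is an assumption); the key idea is the shortcut that lets the input skip the stacked layers and be added back to their output.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.act(self.conv1(x)))   # the stacked, weighted layers
        return self.act(out + x)                    # shortcut: the input skips the layers and is added back

y = ResidualBlock()(torch.randn(1, 16, 32, 32))     # output keeps the same shape as the input
```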


Convolutional Neural Network–Recurrent Neural Network


A CNN–recurrent neural network is a type of CNN that is good at processing sequential information such as speech and language. It can be used to evaluate medical records or radiology reports. In one study, recurrent neural network–based natural language processing (NLP) was performed on musculoskeletal imaging reports to label fracture versus no fracture. Interestingly, this project was not based on vocabulary, grammar, or a dictionary; instead, the model generated annotations as output characters (not words), using these 11 characters: f, r, a, c, t, u, e, n, o, -, and EOS (end-of-sentence symbol). For example, a study was labeled as fracture from sentences such as “s/p intramedullary nailing in the femoral shaft” or “increased callus formation.”


Convolutional Neural Network–Generative Adversarial Network


In a CNN–generative adversarial network, two networks (a generator and a discriminator) work with and against each other. In radiology, this approach can help augment data by creating realistic-appearing synthetic medical images. In a 2018 publication, artificial brain images were created (from very preliminary raw data) that could not be identified as artificial, even by a well-trained neuroradiologist. A generative adversarial network–based model can be very helpful in a scenario involving a difficult, sick patient in which only 5 to 10 minutes are available to obtain all relevant images instead of the 20 to 30 minutes needed for a brain MRI.
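
The adversarial setup can be sketched as below; this is a toy PyTorch illustration with invented sizes, not the published brain-MRI model. The generator maps random noise to a synthetic image, the discriminator is trained to tell real from synthetic, and the generator is trained to fool it.

```python
import torch
from torch import nn

noise_dim, img_pixels = 64, 32 * 32
generator = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, img_pixels), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(img_pixels, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_images = torch.rand(8, img_pixels)              # stand-in for a batch of real images
fake_images = generator(torch.randn(8, noise_dim))   # synthetic images from random noise

# Discriminator step: push real toward 1 and synthetic toward 0 (the "against" part).
d_loss = bce(discriminator(real_images), torch.ones(8, 1)) + \
         bce(discriminator(fake_images.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator call synthetic images real (the "with" part).
g_loss = bce(discriminator(fake_images), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```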


Transfer Learning


In this process, a preexisting trained model is retrained with new data. This significantly reduces the data demands of CNN model development and can play a significant role in radiology, where abnormal medical images are very limited.
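
A common way to apply transfer learning in practice is sketched below (a hedged torchvision example; the two-class fracture-versus-no-fracture head is an assumption): a network pretrained on ImageNet is loaded, its feature-extraction layers are frozen, and only a new final layer is retrained on the smaller medical data set.

```python
import torch
from torch import nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")          # CNN pretrained on ImageNet (downloads weights)

for param in model.parameters():                    # freeze the pretrained feature-extraction layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)       # new trainable head, e.g., fracture vs. no fracture

# Only the new head's parameters are handed to the optimizer for retraining on medical images.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```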


Training a CNN Model


Although a child can distinguish a cat from a dog, describing the exact features that distinguish the two is difficult; both animals have teeth, ears, eyes, and a tongue. Similarly, a CNN can reliably identify lesions without needing specific morphologic parameters; instead, it needs many images labeled, for example, “fracture” or “no fracture.” Images used to train a CNN do not typically require preprocessing, segmentation, or feature extraction. CNN models are typically described as a “black box” that extracts differentiating features and arrives at weighted probabilities for a specific question. It is important to understand that a CNN needs a large number of images to train. If a CNN model is built on an insufficient number of images, it will suffer from overfitting, meaning it will not generalize to clinical images outside the training set. After the CNN model is trained on sufficient images, it can be evaluated on a separate validation data set. Finally, the model can be tested on a testing data set, which ideally should be different from both the training and validation data sets.
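
The data discipline described above can be sketched as follows; the array shapes, labels, and split fractions are assumptions for illustration. The labeled images are divided into training, validation, and testing sets so that the final performance estimate comes from data the model has never seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 small "images" with binary labels (1 = fracture, 0 = no fracture).
images = np.random.rand(1000, 64, 64)
labels = np.random.randint(0, 2, size=1000)

# Hold out 30% of the data, then split that portion into validation and test sets.
train_imgs, hold_imgs, train_lbls, hold_lbls = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
val_imgs, test_imgs, val_lbls, test_lbls = train_test_split(
    hold_imgs, hold_lbls, test_size=0.50, stratify=hold_lbls, random_state=42)

# Train on train_imgs, tune on val_imgs, and report final performance on test_imgs only.
```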


Radiomics


Radiographic image data combined with clinical outcomes have led to the expansion of the field of radiomics. Radiomics is a relatively new field in radiology that deals with the extraction of a large number of features, such as size, shape, texture, image intensity histograms, and image voxel relationships, from radiology images. These features include spatial information on pixel or voxel distribution and patterns and help in the diagnosis of diseases of bone, muscle, and other organs, which can lead to more precise treatment. AI combined with radiomics can manage enormous data sets more efficiently than traditional systems.
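
As a simplified illustration of feature extraction (not a full radiomics pipeline, for which dedicated packages exist), the sketch below computes a few first-order intensity and size features from the voxels inside a segmented region of an image volume; all names and values are invented.

```python
import numpy as np

# Placeholder data: a 3D image volume and a binary segmentation mask of a lesion.
volume = np.random.normal(100, 20, size=(64, 64, 32))
mask = np.zeros(volume.shape, dtype=bool)
mask[20:40, 20:40, 10:20] = True

voxels = volume[mask]                              # intensities inside the segmented region
features = {
    "volume_voxels": int(mask.sum()),              # a simple size feature
    "mean_intensity": float(voxels.mean()),        # image intensity histogram summary
    "intensity_std": float(voxels.std()),          # a crude heterogeneity/texture measure
    "skewness": float(((voxels - voxels.mean()) ** 3).mean() / voxels.std() ** 3),
}
```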


ML techniques have expanded the potential of radiomics to affect clinical care. One recent study applied ML-enhanced radiomics to differentiate sacral chordomas from sacral giant cell tumors on 3D CT. The researchers found that contrast-enhanced CT characteristics were more useful than those from unenhanced imaging for differentiating these two tumor types when comparing different feature selection and classification methods. ML-enhanced radiomics also has potential in evaluating other musculoskeletal (MSK) and non-MSK tumors, guiding more precise evaluations and treatments.


Dice Similarity Coefficient


The Dice similarity coefficient is a measure of the similarity between two samples. It is commonly used to measure the accuracy of image segmentation algorithms and is important to understand when reading the radiology literature, especially AI studies.
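
The coefficient is conventionally defined as Dice = 2|A ∩ B| / (|A| + |B|), where A and B are the two segmentations being compared. The short sketch below (with illustrative masks) computes it for a pair of binary segmentations.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    mask_a, mask_b = mask_a.astype(bool), mask_b.astype(bool)
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

# Example: compare an automated segmentation with a manual reference segmentation.
auto = np.zeros((10, 10), dtype=bool); auto[2:7, 2:7] = True
manual = np.zeros((10, 10), dtype=bool); manual[3:8, 3:8] = True
print(dice_coefficient(auto, manual))   # 0.64: 16 overlapping pixels, 25 pixels in each mask
```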


Applications of Artificial Intelligence in Radiology


AI is considered one of the most disruptive and transformative technologies of the present time, and its use in radiology is already felt to be groundbreaking. In the last 5 years alone, there have been more than 6000 citations in PubMed related to AI and radiology.


There is a strong argument that AI in medicine will be led by specialties like radiology because they generate billions of images every year. Images in radiology are all digital and are stored and distributed via powerful networks of computers such as PACS (picture archiving and communication systems).


The number of images provided to the radiologist for review for each patient has increased many times over the last few decades. The average number of images a radiologist reviews while reading a CT scan increased approximately 10-fold, from 82 images in 1999 to 679 images in 2010. For MRI, the average number of images increased from 164 to 570 over the same period. The total number of radiologists needed is far greater than the available workforce around the world. In the United States, radiologists as a share of the overall physician workforce decreased from 8.8% in 1995 to 4.0% in 2011. In Japan, there is only 1 radiologist per 40,000 people, compared with 1 physician per 500 people. Overall accuracy in interpreting images is also decreasing as medical error rates increase. Radiologist error rates are as high as 30% owing to the large number of images, search patterns, and the complexity of the examinations (for example, readers who interpret three to five liver MRI examinations a day versus those who read three to five such cases a week). Per year, 40,000 to 80,000 deaths are related to diagnostic errors, and the economic cost associated with these errors is more than $38 to $50 billion.


Use of AI can hopefully help provide much-needed efficiency and efficacy in the field of radiology.


The next few sections discuss some of the direct roles of the radiologist and how AI can assist, as well as some of the other potential roles of AI in an advanced radiology department.


Pathology Detection


One of the primary roles of a radiologist is to evaluate and interpret images. There are multiple pathologies for which AI can help the radiologist, but the next few paragraphs will focus on musculoskeletal pathologies.


Fracture Detection


Multiple published studies show that fractures can be detected using deep learning. Limitations of all these studies are that thousands of images of the same body part are needed to train a CNN, the images need labeling, and the ability to recognize fractures does not transfer easily to other body parts (unlike humans, who can transfer that knowledge). In addition, the results are binary, such as presence or absence of a fracture; a description or classification of the fracture type is currently not available.


In 2017, a group from Stanford introduced MURA (musculoskeletal radiographs), a large data set of musculoskeletal x-rays containing 40,561 images from 14,863 examinations, with each examination labeled manually by radiologists as normal or abnormal. They trained a 169-layer DenseNet baseline model to detect and localize abnormalities. Their model achieved an area under the receiver operating characteristic curve (AUC) of 0.929, with 0.815 sensitivity and 0.887 specificity. Model performance was comparable to the best radiologist performance in detecting abnormalities on finger and wrist studies. However, model performance was lower than the best radiologist performance in detecting abnormalities on elbow, forearm, hand, humerus, and shoulder examinations.
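
Metrics such as AUC, sensitivity, and specificity recur throughout this section; the sketch below (with invented toy predictions) shows how they are typically computed from a model’s outputs using scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Toy data: true labels (1 = abnormal) and a model's predicted probabilities of abnormality.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])

auc = roc_auc_score(y_true, y_prob)              # area under the receiver operating characteristic curve
y_pred = (y_prob >= 0.5).astype(int)             # threshold the probabilities into decisions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```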


Since publication, Stanford’s large MURA data set has been made public to encourage other groups to compete in developing fracture-detection algorithms. The performance of some of these models exceeded that of the original model, and as of August 3, 2021, the top eight models had k values ranging from 0.795 to 0.843, better than the best of the three radiologists in the original study (0.778).


Compression fractures can be subtle, and differences in bone density in the spine can be very difficult to evaluate. A well-researched article published in Radiology in 2017 described an automated computer system developed to detect, localize, and classify compression fractures and to measure bone density on spine CT scans. Sensitivity for detection or localization of compression fractures was 95.7%. Accuracy for classification by Genant type (anterior, middle, or posterior height loss) was 0.95, and accuracy for categorization by Genant height loss grade was 0.68. The average bone attenuation for the T12–L4 vertebrae was 146 Hounsfield units (HU) ± 29 (standard deviation) in case patients and 173 HU ± 42 in control patients; this difference was statistically significant.


A Swedish study published in 2017 classified 256,000 wrist, hand, and ankle x-rays for fracture, laterality, body part, and view. The researchers used five openly available deep learning neural networks, and the results demonstrated that all five networks performed at 90% accuracy for laterality, body part, and examination view. Accuracy for recognizing fractures was 83% for the best-performing network, comparable to senior orthopedic surgeons evaluating the images for fractures.


In 2019, a study performed on more than 20,000 images to detect hip fractures compared orthopedic surgeons with a deep CNN: accuracy was 92% versus 95%, sensitivity 88% versus 94%, and specificity 97% versus 97%, respectively; the deep CNN outperformed the orthopedic surgeons. This could develop into a very important application that helps detect hip fractures early and accurately, because missed hip fractures are associated with long-term pain and disability.


Cartilage Abnormality Detection


Cartilage abnormality detection is much more difficult than, and different from, fracture detection. It is usually performed on MRI. The cartilage first needs to be accurately segmented, and the segmented cartilage is then divided into small patches or sections for abnormality detection. Another issue is that cartilage is inherently small and thin.


A study published in 2018 evaluated knee MRI images for cartilage abnormality. In this study, the knee images passed through two separate CNNs, one for segmentation and the other for characterization (detection) ( Fig. 11.9 ). After segmentation, each knee cartilage surface was divided into small patches. Each patch was evaluated separately, so a total of 17,395 small patches from 175 patients were evaluated twice. The results showed a sensitivity of 84% and specificity of 85% for evaluation 1 and 80% and 88% for evaluation 2. The AUC was 0.917 and 0.914 for evaluations 1 and 2, respectively, and there was good agreement between the two evaluations, with a k of 0.76.




Fig. 11.9


Two separate convolutional neural networks (CNNs) used for cartilage lesion detection, one for segmentation (top) and one for classification (bottom).

From Liu F, Zhou Z, Samsonov A, et al. Deep learning approach for evaluating knee MR images: achieving high diagnostic performance for cartilage lesion detection. Radiology. 2018;289(1):160–169. doi:10.1148/radiol.2018172986.


Meniscal or Ligamentous Tear Detection


More work is being done to detect abnormalities involving ligaments and menisci. Bien et al. from Stanford reported an AUC of 0.847 for detecting meniscal tears with their CNN model (MRNet; 1130 training and 120 validation examinations). The network was also trained to recognize anterior cruciate ligament tears and achieved an AUC of 0.937 on the internal validation set. When tested on an external validation set of 183 examinations drawn from a public data set of 917 examinations from Croatia, the AUC was slightly lower (0.824) for detection of anterior cruciate ligament tears. However, when MRNet was trained on the external data set, the AUC improved to 0.911 on the external validation set.


Segmentation


Segmentation is the drawing of boundaries around normal anatomic structures or pathologic findings for the purpose of quantification (e.g., volumes). Performed manually, it is extremely time-consuming, with a high level of variation and inaccuracy, which can have very important clinical ramifications, especially if long-term changes need to be tracked. Use of AI for autosegmentation will reduce this variability and allow more accurate 3D volumes to be measured.


Muscle Segmentation


Most radiologists do not quantify much of the information available on radiologic images, because doing so is extremely time-consuming and inaccurate. Good examples are generalized muscle atrophy and decreased bone density. A study by Lee et al. attempted to quantify muscle on CT scans, and their best model achieved a Dice similarity coefficient of 0.93.


Cartilage Segmentation


Osteoarthritis is one of the leading causes of pain and disability, so early detection and quantification of cartilage loss can be helpful. Manual cartilage segmentation can be extremely time-consuming. One of the main hurdles for automated cartilage assessment tools is clearly identifying the cartilage; because cartilage is most easily recognized in the knee, most of the AI work has been performed on knee MRIs ( Fig. 11.10 ). In a study published in 2019 by Gaj et al., a model was trained and tested on knee MRI images from the Osteoarthritis Initiative data set. The model provided excellent segmentation performance for cartilage, with Dice coefficients ranging from 0.84 to 0.91; for meniscus segmentation, it achieved Dice coefficients of 0.87 to 0.89.




Fig. 11.10


Manual and automatic segmentation for femoral condyles (FC), lateral tibial (LTB), medial tibial cartilage (MTC), patellar cartilage (PC), lateral meniscus (LM), and medial meniscus (MM).

From Gaj S, Yang M, Nakamura K, Li X. Automated cartilage and meniscus segmentation of knee MRI with conditional generative adversarial networks. Magn Reson Med. 2019;84(1):437–449. https://doi.org/10.1002/mrm.28111.


A study published in 2019 used delayed gadolinium-enhanced MRI of cartilage and a deep learning approach to develop a fully automatic hip cartilage segmentation technique that enables accurate, reliable, and reproducible analysis of cartilage thickness, surface area, and volume. This extremely efficient and objective evaluation of biochemical cartilage composition and morphology has the potential to improve patient selection for femoroacetabular impingement surgery and to help surgeons with planning.


Nonmusculoskeletal Segmentation


AI can help radiologists accurately measure the size of different organs. Hu et al. performed multiorgan segmentation (liver, spleen, and both kidneys) on 140 abdominal CT images, employing deep, fully convolutional neural networks for organ detection. The average Dice overlap ratios for the liver, spleen, and both kidneys were 96.0%, 94.2%, and 95.4%, respectively, and the computation time for a CT volume was 125 seconds on average. They achieved accuracies comparable to state-of-the-art methods with much higher efficiency.


Classification


After a radiologist identifies a finding, the next step is to classify the disease and assess its severity. Multiple deep learning models are available to assess different benign and malignant diseases.


Automated Osteoarthritis Evaluation


Antony et al. in 2016 demonstrated that classification accuracy can be significantly improved using deep CNN models pretrained on ImageNet and fine-tuned on knee osteoarthritis images. They applied transfer learning techniques to an existing CNN and extracted features to create an automated knee osteoarthritis quantification model capable of classifying disease into mild, moderate, or severe grades.


A study published in 2020 analyzed hip joints on weight-bearing AP pelvic radiographs from participants in the Osteoarthritis Initiative. Femoral osteophytes, acetabular osteophytes, and joint space narrowing were graded as absent, mild, moderate, or severe, and subchondral sclerosis and subchondral cysts were graded as present or absent ( Fig. 11.11 ). The accuracy of the model for these five features was 86.7% for femoral osteophytes, 69.9% for acetabular osteophytes, 81.7% for joint space narrowing, 95.8% for subchondral sclerosis, and 97.6% for subchondral cysts in the internal test set of approximately 1500 hip joints, and 82.7% for femoral osteophytes, 65.4% for acetabular osteophytes, 80.8% for joint space narrowing, 88.5% for subchondral sclerosis, and 91.3% (95 of 104) for subchondral cysts in the external test set of approximately 100 hip joints.

