Artificial Intelligence Has Varied Diagnostic and Predictive Performance in Diagnosing Patellofemoral Osteoarthritis, Trochlear Dysplasia, and Patellofemoral Tracking Abnormalities: A Systematic Review

Purpose

To systematically review and evaluate the diagnostic efficacy and predictive power of artificial intelligence (AI) models in detecting patellofemoral (PF) compartment pathology and to compare their performance against ground-truth human clinical experts when applicable.

Methods

In accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines, the PubMed, Ovid/MEDLINE, and Cochrane Library databases were searched from inception through May 2024 for studies on AI methods for diagnosing trochlear dysplasia, PF osteoarthritis, or PF instability and tracking abnormalities on cross-sectional imaging. AI model choice, knee pathology, input/output data, performance metrics (accuracy, area under the curve [AUC], precision-recall curve average precision, sensitivity, specificity, positive predictive value, and negative predictive value), sample sizes of datasets, image modalities, and limitations were recorded.

Results

Of 68 studies screened, 17 met the inclusion criteria. Ten studies investigated AI diagnostics for PF osteoarthritis; four, PF tracking and/or instability; and three, trochlear dysplasia. Various deep learning architectures and machine learning algorithms were used. Input data included computed tomography scans, magnetic resonance imaging scans, and radiographs. Output data included anatomic landmark identification and diagnostic predictions. AUC values ranged from 0.664 to 0.990, and accuracy ranged from 74% to 99%. Model performance was moderate to excellent, with AI models consistently surpassing traditional methods in processing times. Common limitations included small sample size, single-center datasets, limited generalizability, and bias due to imbalanced datasets.

Conclusions

AI models showed variable diagnostic performance in identifying PF pathologies and predicting disease progression, with reported AUCs ranging from 0.664 to 0.990 and accuracies between 74% and 99%. Although some studies suggested that AI outperformed traditional diagnostic methods such as interpretation by musculoskeletal radiologists, manual segmentation, or arthroscopy, the degree of superiority was inconsistent and influenced by significant heterogeneity in model architectures, imaging modalities, and reference standards. Given the broad scope of this review and variability across studies, caution is warranted in interpreting these findings, and specific clinical recommendations cannot be made at this time.

Clinical Relevance

AI-based diagnostic tools show promise in supporting the evaluation of PF joint pathologies by potentially improving efficiency and consistency in image interpretation. However, because of the heterogeneity in current models and study designs, the clinical applicability of these tools remains limited. Further refinement and external validation of AI algorithms are needed before their integration into routine clinical decision making can be fully endorsed.

Artificial intelligence (AI), particularly through advances in deep learning (DL) and machine learning (ML), has impacted numerous sectors, including orthopaedic surgery. ML, a subset of AI, involves the development of algorithms that allow computers to learn from and make decisions based on data. DL, a more advanced subset of ML, uses neural networks with many layers to analyze complex patterns and features in large datasets. The evolution of these technologies has enabled DL algorithms to interpret intricate data patterns, whereas ML has enhanced predictive modeling capabilities. In orthopaedic surgery, AI has been proposed across various stages, from preoperative planning and intraoperative guidance to postoperative rehabilitation. Specifically, there has been an increased emphasis on the use of AI for automated image processing and analysis, which has significantly improved the efficiency of diagnostic processes. Within the field of orthopaedics, AI algorithms are being evaluated to aid clinicians in real-time fracture recognition, prognostication of tumor survivorship, postoperative assessment of implant positioning, and, more recently, the detection of soft-tissue knee injuries. To this point, the diagnostic potential of AI is particularly promising because it allows for detection and classification of musculoskeletal (MSK) abnormalities in imaging studies with superior speed compared with traditional ground-truth methods.

The impact of AI is especially relevant in the management of patellofemoral (PF) pathology, in which accurate assessment, diagnostic capability, and predictive modeling of outcomes can be crucial for effective and efficient treatment. Currently, the primary applications of AI involve using magnetic resonance imaging (MRI) and computed tomography (CT) images to detect subtle changes in cartilage, bone, and soft tissues that are indicative of disorders such as PF pain syndrome, chondromalacia patellae, osteoarthritis (OA), and patellar instability. ML models can predict the status of these conditions and the outcomes of various treatment modalities, aiding in the development of personalized treatment plans. For example, AI can assist in identifying patients who are likely to benefit from nonsurgical treatments versus those who may require surgical intervention, thereby optimizing clinical decision making. Furthermore, predictive models based on ML can assess image-based and clinically based patient-specific risk factors to forecast surgical outcomes.

Despite AI’s demonstrated benefits, its use with radiographic and cross-sectional imaging to diagnose knee conditions such as PF OA, trochlear dysplasia, and chondromalacia patellae, as well as PF tracking abnormalities, remains poorly understood. Therefore, the purpose of this study was to systematically review and evaluate the diagnostic efficacy and predictive power of AI models in detecting PF compartment pathology and to compare their performance against ground-truth human clinical experts when applicable. The hypothesis was that AI models would exhibit excellent performance characteristics in the identification and evaluation of PF pathology.

Methods

Study Selection

Two independent authors (J.T-K., M.A.B.) completed a query of the literature in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines and reviewed the search results, with each author blinded to the other’s results; a third author (E.H.) was available for arbitration on potential disagreements or discrepancies. Studies were deemed eligible for full-text review based on an initial approval screening of article titles and abstracts.

Search Criteria and Strategy

A systematic review was performed in accordance with the PRISMA guidelines using the PubMed, Ovid/MEDLINE, and Cochrane Library databases from inception through May 2024. A Boolean search syntax was used to capture the maximum number of articles for screening in the initial search: ((“Trochlea” OR “Patellofemoral” OR “Patellofemoral Instability” OR “Trochlear Dysplasia” OR “Knee Disorders” OR “Knee Abnormalities”) AND (artificial intelligence OR neural network∗ OR deep learning OR machine learning OR machine intelligence) AND (diagnostic performance OR diagnostic accuracy OR sensitivity OR specificity OR ROC curve OR area under the curve OR AUC OR predictive value of test OR score OR scores OR scoring system OR scoring systems OR observ∗ OR observer variation OR detect∗ OR evaluat∗ OR analy∗ OR assess∗ OR measure∗)).

Eligibility Criteria

Rigorous inclusion criteria were established to ensure the integrity and relevance of the selected literature. Articles were deemed eligible if they met 3 key criteria: the study investigated the development or application of AI specifically for detecting trochlear dysplasia or abnormalities in PF tracking using cross-sectional imaging techniques; the study was published in a peer-reviewed journal in the English language; and the full text of the study was available. Exclusion criteria included articles consisting solely of abstracts, technical papers, cadaveric or animal experiments, or letters to the editor (Fig 1). Finally, the bibliographies of all included studies were cross-referenced to ensure no relevant studies were overlooked.

Fig 1 (image not available)

Data Extraction

Full-text examination of articles passing the screening process was only undertaken after application of our strict inclusion and exclusion criteria. Furthermore, to ensure completeness, all references cited in the included studies were exhaustively reviewed. Two independent authors (J.T-K., M.A.B.) systematically compiled all pertinent data using a predefined Microsoft Excel data sheet (Microsoft, Redmond, WA) with a modified information extraction table. The columns for these extraction tables included the following: publication data; study title and design; study methodology; knee pathology and anatomic region; sample size (patients); dataset size; AI model; image and model input and output; ground truth; training set, validation set, and test set sizes; performance grading; accuracy grading; area under the curve (AUC) for the receiver operating characteristic curve; conclusions; and limitations.

Outcomes Analyzed and Statistical Methods

All data were qualitatively synthesized and reported in both narrative fashion and individual table formats. Data extracted were presented as means, medians, ranges, and confidence intervals as appropriate and as provided in respective studies. Outcome measures of interest included accuracy, AUC, average precision (AP), dispersion of data (mean absolute error [MAE], mean absolute deviation [MAD], and/or root-mean-square [RMS]), inter-rater reliability (κ value or intraclass correlation coefficient [ICC]), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and Dice coefficient. No regression modeling or predictive analytics were performed because the analysis was descriptive in nature and did not require inferential modeling. This absence is disclosed in accordance with reporting considerations for AI-related methodologies. All statistical analyses were performed using R (version 4.0.2, The R Foundation for Statistical Computing, Vienna, Austria), with pooled analysis for quantitative statistical analysis. P <.05 was considered statistically significant.
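As an illustration of how several of the outcome measures listed above are defined, the following minimal Python sketch computes accuracy, sensitivity, specificity, PPV, and NPV from confusion-matrix counts, the Dice coefficient from two sets of segmented pixels, and the AUC as a rank statistic. It is not the analysis code of this review or of any included study; the names `ConfusionCounts`, `diagnostic_metrics`, `dice_coefficient`, and `auc_from_scores` are hypothetical, and the numbers in the usage lines are invented for demonstration only.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from comparing model output with the reference standard."""
    tp: int  # pathology present, model predicts present
    fp: int  # pathology absent, model predicts present
    tn: int  # pathology absent, model predicts absent
    fn: int  # pathology present, model predicts absent


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions; assumes every denominator is nonzero."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def dice_coefficient(pred: set, truth: set) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two sets of segmented pixel indices."""
    if not pred and not truth:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def auc_from_scores(pos_scores: list, neg_scores: list) -> float:
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Invented example values, for demonstration only.
metrics = diagnostic_metrics(ConfusionCounts(tp=40, fp=10, tn=45, fn=5))
dice = dice_coefficient({1, 2, 3, 4}, {3, 4, 5, 6})
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
```

Because PPV and NPV depend on disease prevalence in the test set, values from studies with imbalanced datasets are not directly comparable even when sensitivity and specificity are similar.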

Results

A total of 69 studies were initially identified through the electronic database search. After the removal of duplicate records, the remaining articles were assessed according to predefined inclusion and exclusion criteria. After a thorough evaluation of full-text eligibility, 17 studies were ultimately selected for inclusion in this review for both quantitative and qualitative data analyses (Fig 1). All studies were classified as Level III and IV evidence, with an average Methodological Index for Non-randomized Studies (MINORS) score of 5.35 ± 0.68. Most of the studies (14 of 17, 82.4%) were retrospective in nature, and 64.7% (11 of 17) used predictive designs. Details of study characteristics and knee pathology of interest are provided in Table 1.

Table 1

Study Characteristics and Methodologic Quality Strength Assessment

Study	Knees (Patients), n	Pathology	Prospective or Retrospective	Predictive or Diagnostic	LOE	MINORS Score
Liu et al. (2023)	14,652 (483)	PF OA	Retrospective	Predictive	IV	5
Bayramoglu et al. (2022)	5,507	PF OA	Retrospective	Diagnostic	IV	5
Hu et al. (2022)	104	PF OA or cartilage injury	Prospective	Predictive	III	4
Xu et al. (2023)	464	Trochlear dysplasia	Retrospective	Predictive	IV	6
Shi et al. (2021)	41	PF pain syndrome	Retrospective	Diagnostic	IV	6
Yurova et al. (2024)	15	PF OA	Retrospective	Diagnostic	IV	5
Tuya et al. (2023)	1,280	PF OA	Retrospective	Diagnostic	IV	6
Tuya et al. (2023)	1,230	PF maltracking	Retrospective	Predictive	IV	6
Bayramoglu et al. (2021)	18,436 (2,803)	PF OA	Retrospective	Predictive	IV	5
Hu et al. (2023)	364 (182)	PF OA	Prospective	Predictive	III	5
Bayramoglu et al. (2024)	3,276 (1,832)	PF OA	Prospective	Predictive	III	5
Nagawa et al. (2024)	51 (49)	PF instability	Retrospective	Predictive	IV	5
Barbosa et al. (2024)	140 (95)	Trochlear dysplasia	Retrospective	Predictive	IV	6
Cerveri et al. (2018)		Trochlear dysplasia	Retrospective	Predictive	IV	4
Pedoia et al. (2019)	1,481 (302)	PF OA or cartilage injury	Retrospective	Predictive	IV	6
Cheng et al. (2020)	176 (93)	PF pain syndrome or OA	Retrospective	Diagnostic	IV	6
Hu et al. (2024)	600	PF OA	Retrospective	Diagnostic	IV	5

LOE, level of evidence; MINORS, Methodological Index for Non-randomized Studies; OA, osteoarthritis; PF, patellofemoral.

AI Model Choice

A comprehensive overview of the various AI models used in the included studies, detailing their respective image input types, image planes, and ground truth/reference standards, is presented in Table 2. As background on DL concepts, deep neural networks can suffer from the vanishing gradient problem, in which gradients become increasingly small as they propagate backward through many layers. This impairs the training of early network layers because the error signal they receive is too weak to drive meaningful weight updates. Skip connections, introduced in certain architectures such as residual networks (ResNets), help mitigate this issue by creating direct pathways or shortcuts between nonadjacent layers, allowing gradients to flow more effectively and enabling the training of much deeper networks.
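The effect of a skip connection on gradient flow can be illustrated with a toy scalar example. The sketch below is an assumption-laden illustration, not the architecture of any included study: each "layer" is a single tanh unit with one shared weight, and the functions `plain_layer`, `residual_layer`, and `input_gradient` are hypothetical names introduced here. By the chain rule, the gradient with respect to the input is the product of the local derivatives; in the plain stack each factor is at most the weight (0.5 here), whereas the residual form y = x + tanh(wx) contributes a factor of at least 1 when the weight is positive.

```python
import math


def plain_layer(x: float, w: float):
    """Plain layer y = tanh(w * x); returns (y, dy/dx)."""
    y = math.tanh(w * x)
    return y, w * (1.0 - y * y)


def residual_layer(x: float, w: float):
    """Residual layer y = x + tanh(w * x); the skip term adds 1 to dy/dx."""
    t = math.tanh(w * x)
    return x + t, 1.0 + w * (1.0 - t * t)


def input_gradient(layer, x: float, w: float, depth: int) -> float:
    """Chain rule through `depth` stacked layers: the gradient of the final
    output with respect to the input is the product of local derivatives."""
    grad = 1.0
    for _ in range(depth):
        x, local = layer(x, w)
        grad *= local
    return grad


# Toy settings chosen for illustration only.
plain_grad = input_gradient(plain_layer, x=0.5, w=0.5, depth=30)
residual_grad = input_gradient(residual_layer, x=0.5, w=0.5, depth=30)
print(f"plain stack: {plain_grad:.3e}   residual stack: {residual_grad:.3e}")
```

With these settings the plain-stack gradient shrinks toward zero as depth increases, whereas the residual-stack gradient cannot fall below 1. Real networks use weight matrices, normalization, and other components, so this shows the mechanism only qualitatively.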

Table 2

Overview of AI Model Parameters and Methodology for PF Pathology Studies

Study	AI Model	Image Input	Image Plane	Ground Truth/Reference Standard	Training Set	Validation Set	Testing Set	Model Output
Liu et al. (2023)	ResNet	CT	Axial	MSK radiologist	Not specified	Not specified	Not specified	Landmark prediction coordinates
Bayramoglu et al. (2022)	GBM-CNN	Radiography	Sagittal	Comparison of multiple algorithms (GBM)	Not specified	Not specified	Not specified	OARSI and KL grades
Hu et al. (2022)	MWRN	MRI	Sagittal, coronal, and axial	Arthroscopy	Not specified	Not specified	Not specified	Prediction of image reconstruction
Xu et al. (2023)	U-Net CNN	MRI	Axial	Radiologist and senior surgeons with >10 yr of experience	370	Not specified	94	Pixel-level regression prediction
Shi et al. (2021)	MI-CNN	Radiography	Dynamic	Single-input CNN	70%	Not specified	30%	Classification of PFPS
Yurova et al. (2024)	U-Net CNN	MRI and CT	Sagittal, coronal, and axial	Previous algorithm segmentations	75%	Not specified	25%	Creation of biomechanical model for patellar motion
Tuya et al. (2023)	HRNet	Radiography	Axial	2 MSK radiologists	1,280	187	129	KL classification of PF OA
Tuya et al. (2023)	U-Net CNN	Radiography	Sunrise	3 MSK radiologists	1,230	Not specified	201	Prediction of landmarks
Bayramoglu et al. (2021)	R-CNN	Radiography	Sagittal	2 Independent expert OARSI graders	596	5-Fold cross-validation		Prediction of PF OA status
Hu et al. (2023)	D-CNN	MRI	Sagittal, coronal, axial, and 3D reconstruction	Biomarker Consortium Database	Not specified	5-Fold cross-validation	Not specified	Prediction of PF OA
Bayramoglu et al. (2024)	D-CNN	Radiography	Sagittal	2 Independent radiologists	Not specified	5-Fold cross-validation	Not specified	Prediction of PF OA
Nagawa et al. (2024)	SVM	MRI	Sagittal, coronal, and axial	2 Radiologists with >5 yr of experience	Not specified	5-Fold cross-validation	Not specified	Prediction of PFI
Barbosa et al. (2024)	U-Net CNN	MRI	Sagittal, coronal, axial, and 3D reconstruction	Expert MSK radiologist	80%	20%	Not specified	Landmark prediction (6, 3, and 7 output channels)
Cerveri et al. (2018)	SSPA-NN–SSM	CT	Sagittal, coronal, axial, and 3D reconstruction	Previous algorithm segmentations	66	15	Not specified	Prediction of clinical conditions (3 outputs)
Pedoia et al. (2019)	U-Net CNN	MRI	Coronal	5 Radiologists with >5 yr of experience	65%	20%	15%	Prediction of cartilage lesions (2-class output)
Cheng et al. (2020)	HNN	MRI	Sagittal and 3D reconstruction	Manual segmentation, >15 yr of experience	80	9-Fold cross-validation	10	Probability maps for clinical conditions
Hu et al. (2024)	TRGCN	MRI	2D and 3D	MSK radiologist segmentations	155 OA, 325 control	Not specified	39 OA, 81 control	Simulated PF tracking

AI, artificial intelligence; CNN, convolutional neural network; CT, computed tomography; D-CNN, dilated convolutional neural network; GBM, gradient boosting machine; GBM-CNN, gradient boosting machine and convolutional neural network; HNN, hypercomplex neural network; HRNet, high-resolution network; KL, Kellgren-Lawrence; MI-CNN, multi-instance convolutional neural network; MRI, magnetic resonance imaging; MSK, musculoskeletal; MWRN, multi-wavelet residual network; OA, osteoarthritis; OARSI, Osteoarthritis Research Society International; PF, patellofemoral; PFI, patellofemoral instability; PFPS, patellofemoral pain syndrome; R-CNN, region-based convolutional neural network; ResNet, residual network; SSPA-NN–SSM, supervised spatiotemporal aggregation neural network and spatial structure mining; SVM, support vector machine; TRGCN, temporal relational graph convolutional network; 2D, 2-dimensional; 3D, 3-dimensional.
