Benjamin Lee BS, Christopher J Peterson MD, MS
Corresponding author: Christopher Peterson
Contact Information: Cjpeterson1@carilionclinic.org
DOI: 10.12746/swrccc.v10i45.1043
Artificial intelligence (AI) and machine learning (ML) have advanced rapidly in recent years and now have the potential to change medicine. This review provides an introduction to AI and its potential to affect medical practice. Specific examples of past milestones, particularly in the domain of critical care, are presented, including ML models that can interpret chest x-rays or predict clinical outcomes such as extubation failure or ICU mortality. Included is a brief general discussion of what AI is, how it is made, and how physicians will be involved with it. Arguments are then presented as to why AI will likely not leave physicians without a job: expectations often outpace reality, AI still requires human supervision, new discoveries bring new challenges, and AI cannot design itself. Far from displacing physicians, AI, if implemented well, stands poised to automate repetitive tasks, make physicians more accurate, and allow them to spend more time with patients.
Keywords: Machine learning; artificial intelligence; medicine; technology
The reader may have noticed the recent uptick in words like “artificial intelligence,” “machine learning,” and “deep learning” in both the press and academic publications. The technologies these words represent are distinct but are often referred to simply under the umbrella term “artificial intelligence” (AI). The field of AI has produced powerful tools that have found application in many modern products, such as Tesla’s self-driving cars, Apple’s Siri assistant, and Google Maps. Given the broad range of fields that have found uses for AI, many in the medical field may wonder, “What will be the impact of AI on medicine?”
To understand the future of AI, it may be helpful to understand its past. The earliest discussions about AI began in the mid-twentieth century with Alan Turing’s work on computational machines. In 1956, the term “Artificial Intelligence” was coined at a seminal meeting at Dartmouth College.1 Shortly thereafter, Frank Rosenblatt’s work laid the foundation for the neural network, which continues to shape AI even today. More recently, the image classification record set by Hinton and colleagues at the ImageNet competition in 2012 generated renewed interest in the field by popularizing what is now called deep learning.2,3 Since then, AI has been integrated into academia, industry, and everyday life, from predictive text messaging to recommending friends on social media.
Broadly defined, AI encompasses computer systems capable of performing a wide variety of tasks that usually require human-like intelligence.4 Currently, however, the field of AI could be better described as machine learning (ML): systems that can learn from data with minimal human intervention, excluding higher-order intelligent functions such as emotion or self-awareness.5 These tasks include functions like transcribing audio to text, object identification, data categorization, and recommendation systems for ads and purchases. This is the area of focus for the majority of modern research projects, many of which have the potential to benefit physicians. This paper will focus primarily on ML, the area most relevant to medicine in the near future.
Medicine has many areas amenable to ML, particularly given the high volume of high-quality medical data and advanced imaging available. Electronic medical records provide databases that physicians and data scientists can use to develop new tools to assist with patient care. Machine learning has the potential to create computational tools that detect patterns and insights in data that physicians cannot currently discern. For example, one ML model can detect breast cancer in mammograms that would otherwise be read as benign.6 Another model was able to outperform cardiology residents in recognizing electrocardiogram abnormalities.7 In his book Deep Medicine, cardiologist Eric Topol paints a vivid picture of the types of machine learning tools that physicians would most appreciate, such as Alexa-like speech recognition apps that can document encounters or place orders based on verbal commands.8 While the potential for these tools is great, physician insight and input will be needed both to develop these tools and to implement them in clinical practice. This review will provide an introduction to ML technology and a perspective on its future in medicine, particularly pulmonary critical care, and will encourage physicians to be involved in its integration into medicine.
Machine learning models consist of trained algorithms that attempt to identify underlying patterns in data sets and make predictions accordingly. An algorithm, in the machine learning context, may be defined as the mathematical formula and architecture used to perform iterative calculations on data to train a model. Algorithms are not magic; under the hood, they resemble a web of interconnected statistical and mathematical principles used to perform calculations that can learn from patterns in the data. Algorithms vary in complexity. One of the simplest is logistic regression, an algorithm that iteratively learns the likelihood of a binary (1 or 0) outcome, e.g., using time spent in the sun and altitude to predict whether an individual will be sunburned. More complex algorithms like neural networks can identify higher-order relationships, such as combinations of pixel colors and intensities that predict whether a photo was taken on a sunny day or a cloudy day. The model, or weights, can be thought of as a distillation of what an algorithm has learned after having been trained for a specific task. In the case of the logistic regression above, the model would be the final learned weights applied to time in the sun and altitude that best predict whether an individual will be sunburned. The accuracy of these models is then determined by how well they make predictions on unfamiliar data sets; most models still make numerous mistakes, but typically the model that makes the fewest mistakes is the one used.
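To make the sunburn example concrete, the short Python sketch below (using the scikit-learn library and entirely invented numbers) trains a logistic regression and prints the learned weights; the data and variable names are illustrative only, not drawn from any real dataset.

```python
# A minimal, illustrative sketch of the sunburn example above.
# The data are invented; scikit-learn is assumed to be installed.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [hours spent in the sun, altitude in meters]
X = np.array([[0.5, 100], [1.0, 200], [2.0, 1500], [3.0, 2500],
              [4.0, 3000], [0.2, 50],  [5.0, 2800], [1.5, 300]])
# 1 = sunburned, 0 = not sunburned (hypothetical labels)
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# "Training" iteratively adjusts the weights to best fit the data
model = LogisticRegression(max_iter=1000).fit(X, y)

# The trained model is essentially these learned weights plus an intercept
print("weights:", model.coef_, "intercept:", model.intercept_)

# Predict the probability of sunburn for a new, unseen individual
print(model.predict_proba([[2.5, 2000]]))
```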
Physicians already use their own types of predictive models when creating diagnostic differentials. For example, a patient presenting with cough, malaise, myalgia, and fever during the winter months would be suspected to have a respiratory virus. Likewise, an ML model could also assess these and other data points to diagnose influenza. Both experienced physicians and ML models identify which combination of factors produces the most accurate prediction and weight those factors accordingly. The constellation of symptoms and patterns that is consistently predictive of an illness is often developed into a model; in clinical medicine, this takes the form of diagnostic guidelines. These guidelines can be transferred among physicians and adapted as new research further refines understanding of the underlying clinical patterns.
There are also patterns beyond established guidelines that physicians can rely on, often referred to as “intuition” or “clinical gestalt.” Physicians develop this intuition based on years of experience, and these experiences are often assembled, sometimes subconsciously, into patterns that can direct clinical practice. While these cognitive “models” can be effective, they are limited by both the experience of the physician and the inability to be transferred among persons. Machine learning algorithms offer a way to analyze these types of patterns with much greater computational precision and then encapsulate these patterns with computer models, allowing others to add to and access the insights learned.
The ability of ML to deal with this complexity comes from both the large amounts of data fed into these algorithms and the flexibility of the algorithms themselves. The selection of each algorithm is crucial, as each has strengths and weaknesses and thus may perform differently on certain types of datasets. For example, convolutional neural networks (CNNs) are particularly effective at analyzing images, while decision tree methods, including Random Forests, can produce easily interpretable, flowchart-like rules and rankings of which features matter most. Other algorithms widely used in the literature include support vector machines (SVMs), artificial neural networks (ANNs), and Bayesian networks. In some cases, researchers have found that simpler models perform just as well at a task as larger, more computationally expensive deep learning models, allowing analyses to be completed much faster, even in real time, rather than taking minutes.9 To interpret the results of machine learning advances in the medical literature, it is important to recognize these model-specific behaviors and limitations.
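As an illustration of how candidate algorithms might be compared on a single dataset, the following sketch (Python, scikit-learn, using a bundled demonstration dataset purely for convenience) cross-validates three common algorithms; in practice, the “best” choice depends on the data and the clinical question, and this is only one plausible way to run such a comparison.

```python
# Illustrative comparison of several common algorithms on one dataset.
# Uses scikit-learn's bundled breast cancer dataset for convenience only.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
}

# 5-fold cross-validated AUC for each candidate algorithm
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```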
Selecting an algorithm is but the first step of many. Every model is trained, tested, and reworked by human users to ensure that it matches the specific parameters of the task at hand. If the necessary conditions, assumptions, and desired outcomes are not fully understood, the model’s performance may suffer, or the model may fail altogether. Another limitation is overfitting, in which the model becomes too specific to features found only in the dataset it was trained on. For example, a model that erroneously predicts that all 45-year-old males named John have influenza is focusing on details that will clearly not apply to other patients. The opposite is also true; overgeneralizing that every patient with a runny nose has the “flu” is also inaccurate, as there are many other causes of nasal discharge. The optimal model for predicting the “flu” will likely lie between these two extremes, and the skill of applying machine learning lies in locating the point at which the model is specific enough to be helpful but generalizes well enough to be used on other patients.
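The difference between memorizing a training set and generalizing to new patients can be made visible by holding out test data. In the illustrative sketch below (Python, scikit-learn, synthetic data only), an unconstrained decision tree scores nearly perfectly on the data it was trained on but noticeably worse on data it has never seen, while a depth-limited tree gives up some training accuracy in exchange for better generalization.

```python
# Illustrating overfitting: near-perfect training accuracy,
# worse accuracy on held-out data. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# An unconstrained tree can memorize the training set (overfitting)
overfit = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting tree depth forces the model to keep only general patterns
constrained = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", overfit), ("depth-limited", constrained)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
```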
As with statistics, results can be misleading and require scrutiny both before and after a model is deployed. Changing disease rates in populations may render a previously adequate model out of date. Furthermore, the complexity of medicine makes it unlikely that any model will be 100% accurate, necessitating ongoing reevaluation and fine-tuning.
As a particularly data-heavy specialty with numerous signals to monitor simultaneously, critical care may benefit greatly from ML. Many critical care physicians have experienced signal fatigue from monitoring numerous vital signs and laboratory results in seriously ill patients. This, combined with heavy workloads, is cited as a major component of the high burnout rates among critical care physicians.10 Machine learning applications offer a way for computers to assist in monitoring complex signal patterns and then alerting physicians where to direct their limited attention. Reliable models that can instantaneously interpret these patterns could serve as a stop-gap measure to prevent adverse outcomes that a physician might not otherwise discern. Increased accuracy in predicting who will require more intensive care could also help physicians make decisions earlier in a hospital stay and reduce the need for respiratory interventions. Many such ML tools are already under development, with no fewer than 85 studies of ML models specific to ICU care registered on the National Institutes of Health clinical trials registry (clinicaltrials.gov) as of March 2021.
Machine learning models have demonstrated particular strength in imaging-based tasks. A recent landmark model called CheXNet was trained on a database of over 100,000 frontal chest x-rays. This massive dataset, among the first of its kind, was labeled for 14 different diseases, and a subset was independently annotated by 4 radiologists for comparison. CheXNet performed better than the radiologists at identifying pneumonia on x-ray alone.11 Since those results in 2017, much more work has been done in this area,12 and last year the FDA approved an ML model, integrated into x-ray machines, that can detect pneumothorax.13 Several ML models for detecting pneumonia have demonstrated high levels of accuracy (up to 99%) and area under the receiver operating characteristic curve (AUC; up to 0.99).14,15
The sepsis early risk assessment (SERA) model has an 87% sensitivity and specificity for predicting sepsis 12 hours before its onset, a 32% improvement in accuracy compared with physician predictions.16 Many other models have been developed to predict sepsis17 and sepsis-induced coagulopathy.18 However, one sepsis alarm integrated into the Epic EMR (the Epic Sepsis Model), though shown to be effective,19 may have nonetheless contributed to alarm fatigue during the COVID-19 pandemic due to poor implementation.20 This early attempt to integrate ML into hospital systems provides several important lessons about why model deployments fail to scale. The model was later found to have a lower AUC on external validation than initially reported.21 Due to the cutoffs used, it ultimately caught only 7% of sepsis patients missed by a physician and missed sepsis in 67% of sepsis patients, despite generating alerts on 18% of all hospitalized patients.21 This underscores the importance of rigorously testing and scrutinizing a model before deployment, as well as having the input of physicians to make sure that the tool is clinically meaningful.
A model predicting methicillin-resistant Staphylococcus aureus infection in mechanically ventilated patients has been developed, with 98% sensitivity, 47% specificity, and a positive predictive value of 0.65. In this study, admission from the emergency department was the most predictive feature.22 Other models have been developed to diagnose bloodstream infections early23 and to predict their outcomes.24 In a study in Israel, an ML algorithm analyzed patient data and predicted patterns of bacterial resistance, enabling researchers to reduce the number of mismatched antibiotic prescriptions for urinary tract infections by 30–40%.25 It is possible that future models may assist providers with the detection of pathogens and the selection of appropriate antibiotics.
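Because the usefulness of an alert depends on how common the condition is, the same sensitivity and specificity can produce very different positive predictive values. The short sketch below applies the sensitivity and specificity quoted above purely as an example across several hypothetical prevalence values; the prevalences are invented and are not those of the cited study.

```python
# How PPV changes with prevalence for a fixed sensitivity/specificity.
# Prevalence values below are arbitrary and illustrative only.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.05, 0.20, 0.50):
    print(f"prevalence {prev:.0%}: PPV = {ppv(0.98, 0.47, prev):.2f}")
```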
A model that predicted extubation failure using 89 clinical and laboratory variables as inputs achieved an AUC of 0.83 when trained on the MIMIC-IV dataset and an AUC of 0.80 in prospective validation on an external dataset.26 Examining what the algorithm had learned, researchers found that mechanical ventilation (MV) duration and pressure support ventilation (PSV) levels were the most influential factors in the model’s predictions. Another model predicted prolonged mechanical ventilation with an AUC of 0.85 and tracheostomy placement with an AUC of 0.83 using six different severity of illness scores calculated on the first day of ICU admission.27
Predictors of mortality and disease severity are already employed in clinical medicine, such as the Acute Physiology and Chronic Health Evaluation (APACHE-IV) and the Simplified Acute Physiology Score (SAPS). An interesting study used ML not only to predict overall ICU mortality with high accuracy but also to sort through the model parameters and explain which factors were most predictive of mortality.23 These investigators were able to predict hospital mortality with an AUC of 0.87–0.91, compared with the 0.88 achieved by APACHE-IV. Machine learning identified, for example, that a significant increase in creatinine level was likely to precede mortality, even though this had been considered an unimportant factor. The model was also able to detect time-dependent patterns, such as an increase in glucose level commonly seen 16 hours prior to death. These signals are too complex for humans to monitor continuously, but such models excel at continuous monitoring and at alerting physicians when a signal is more than a simple upward or downward trend.28 One study suggested that models may become more predictive with greater length of stay,29 possibly due to more data or selection for patients with more severe illness. In contrast, another study found ideal mortality prediction at approximately 2 days into ICU admission,30 although this may reflect the frequency of data sampling provided to the model or fewer patients with longer ICU stays.29,30 Predictors of mortality from specific pathologies, such as sepsis,31 heart failure,32 and acute kidney injury,33 have also been studied.
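For readers less familiar with the AUC values quoted throughout this section, the metric reflects how well a model’s risk scores rank patients who experienced an outcome above those who did not (0.5 is no better than chance; 1.0 is perfect ranking). A brief sketch with invented labels and predicted risks follows; the numbers are hypothetical.

```python
# Computing an AUC (area under the ROC curve) from predicted risks.
# Labels and risk scores below are invented for illustration.
from sklearn.metrics import roc_auc_score

# 1 = died during the ICU stay, 0 = survived (hypothetical)
outcomes = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
# Model-predicted probability of death for the same patients
predicted_risk = [0.10, 0.25, 0.80, 0.15, 0.60, 0.90, 0.30, 0.55, 0.05, 0.70]

# 0.5 = no better than chance; 1.0 = perfect ranking of risk
print("AUC =", roc_auc_score(outcomes, predicted_risk))
```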
A CNN model predicted ARDS using chest x-rays with 83% sensitivity and 88% specificity on a group of 413 images reviewed by 6 physicians.34 One study found that accuracy in predicting ARDS severity was improved when using the patient’s condition on ICU day two rather than day one (as described in the Berlin criteria) and that a predictive model using the PaO2/FiO2 × PEEP index was superior to those using PaO2/FiO2 alone.35 Another model identified more cases of ARDS than clinicians (73.5% vs. 33.2%), but in cases in which clinicians did diagnose ARDS, the diagnosis was made earlier.36
Researchers were able to use a decision tree model to determine which factors were most predictive of ARDS due to COVID-19 in a sample of 600 patients. They found that age and BMI over 25 were the most predictive, but also identified creatine kinase and the neutrophil-to-lymphocyte ratio as important novel predictors. The model performed with an overall AUC of 0.99 when tested on an external dataset.37
Lehman and colleagues used algorithms called hierarchical Dirichlet processes (HDP), designed to identify clusters of topics in unstructured progress notes, to stratify ICU patients from the MIMIC-II database according to the risk of mortality. They found that including topics identified in nursing documentation from the first 24 hours of the ICU stay greatly improved the accuracy of the SAPS-1 model, from an AUC of 0.72 to 0.82.38
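To give a rough sense of how free text can be converted into numeric features for a risk model, the sketch below uses scikit-learn’s latent Dirichlet allocation as a simpler stand-in for the hierarchical Dirichlet process used by those authors, applied to a handful of invented note fragments; the resulting topic proportions could then be appended to structured features such as a severity score.

```python
# Turning free-text notes into topic features that could supplement a
# severity score. LDA here stands in for the HDP used in the cited study;
# the note fragments are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

notes = [
    "patient intubated overnight sedated on propofol",
    "family meeting held comfort care discussed",
    "weaning ventilator support tolerating pressure support",
    "new fever blood cultures drawn broad spectrum antibiotics started",
]

counts = CountVectorizer().fit_transform(notes)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each note becomes a vector of topic proportions, which could then be
# combined with structured data (e.g., a SAPS score) in a second model.
topic_features = lda.fit_transform(counts)
print(topic_features.round(2))
```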
Length of stay has also been predicted, both in the context of traumatic brain injury (TBI)39 and using patient vital signs,40 as well as with a gradient-boosted decision tree algorithm trained on data available in the eICU and MIMIC-III datasets.41 A model was able to predict readmission from information in the MIMIC-III dataset significantly better than both the Stability and Workload Index for Transfer score (AUC = 0.65) and the Modified Early Warning Score (AUC = 0.58), achieving an AUC of 0.76.42 While more work is clearly needed, these studies nonetheless provide a proof of concept for how future versions of these models could benefit intensivists.
The physician, though typically not trained in computer science, has a crucial role as both producer and interpreter of clinical data. Clinical datasets containing imaging and documentation at the individual patient level are in high demand for ML. Machine learning projects typically require thousands of images or subjects, with algorithms generally performing better when given more subjects and more granular detail about those subjects. For example, the CheXNet model was trained on a publicly released dataset of over 100,000 chest x-rays.6 Publicly available databases like this one are a key way in which physicians can provide researchers with de-identified, high-quality labeled data. The Medical Information Mart for Intensive Care (MIMIC-IV),42 operated by the Massachusetts Institute of Technology (MIT), is a database of de-identified data from over 40,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston, MA. This and previous MIMIC databases have already proven an invaluable resource for studying care in the ICU.43 Future databases like these will need to be collected under the guidance of physicians, who are uniquely positioned to ensure both their accuracy and their clinical relevance.
In addition to data collection efforts, physicians are also needed to guide ML projects themselves. Physicians and engineers are each experts in their own fields but typically lack sufficient understanding of the other’s domain to develop ML for medicine effectively on their own. Physicians are needed both to identify relevant problems and to integrate ML tools into healthcare effectively. Only by working together will both groups be able to improve technology in medicine, and nowhere is this more apparent than with ML. Only with good, labeled data and guidance from physicians will research teams and companies be able to build the tools that medicine most needs. As the primary end-users of many of these models, physicians will also be needed to monitor the performance and accuracy of these models with prospective studies44 and to give necessary feedback.
For those interested in a more thorough introduction, we recommend starting with Deep Medicine by Eric Topol,8 Intelligence-Based Medicine by Anthony Chang,45 and reviews by Sanchez-Pinto et al.46 For more formal training in AI (or ML), the American Board of Artificial Intelligence in Medicine (ABAIM) offers introductory courses in medicine-based machine learning as well as a network of other physicians interested in ML.
One of the most often expressed concerns about AI, ML, and technology in general is how it will affect (or replace) various occupations. History shows that technology drastically altered agricultural and manufacturing jobs in the US. Indeed, many cite the oncoming shift to AI as a fourth industrial revolution,47 and some fear a similar outcome in medicine. However, the full replacement of physicians by AI is unlikely for several reasons.
Medicine has experienced technological shifts before and has often leaned toward sensationalism when predicting their end effect. When MRI was first developed, some predicted that it would make radiologists obsolete; clearly, this has not been the case.48 Other predictions, from a cure for the common cold49 to a rapidly developed HIV vaccine,50 have been made and proven inaccurate. As far back as the 1960s, AI researchers were predicting that within their lifetimes all scientific problems would be handed over to AI and that human labor would be obsolete.51 While the future is not always clear, it is often less spectacular than predicted, especially in a complex field like medicine, in which fundamental interactions among humans will likely prove irreplaceable in the healing process.
Despite its complexity, AI is ultimately a tool and, like other tools, requires human input. While AI can identify complex patterns in data, it requires humans to determine whether those patterns are useful or clinically relevant. Furthermore, AI is still error-prone and will likely always require supervision from a trained user who can interpret its output within the broader landscape of clinical knowledge. For example, by changing only a small number of pixels, MIT students were able to trick a Google image-classification algorithm into labeling a picture of a cat as “guacamole.”52 Machine learning algorithms are also highly susceptible to bias, and constant evaluation is necessary to keep biases in the dataset from being trained into the model. For example, a model used to detect skin cancer was more likely to interpret an image as malignant if a ruler was present in the image.53 Rulers proved to be present more often in cases in which the dermatologist had concern for cancer, and thus the physicians’ interpretations were perpetuated through a bias in the dataset.53 Similar confounders have been observed in other models for melanoma,54 pneumonia,55 and hip fractures.56
Care must also be taken, especially before deploying models, to make sure that they will generalize well to the population at large. Physicians will ultimately be the only ones able to identify some of these biases. External validation and scrutiny of models become increasingly important to ensure that confounders and biases are mitigated, especially considering that much medical data depends on physicians who may or may not follow best practices.44 Patients will look to physicians to understand what to do with the information the models put out. Artificial intelligence is not well placed to address ethical issues, build rapport, or empathize with patients; these tasks are, and will likely always be, reserved for people and not machines. Though models may aid in decision-making, the final decision and responsibility will, as always, rest with the physician.
Just as the invention of MRI did not solve all of radiology’s challenges, AI discoveries may simultaneously solve some problems and raise other, deeper, more complicated questions. Artificial intelligence may liberate physicians from the burden of monotonous tasks like reading chest x-rays and EKGs, but the fact remains that the compendium of medical knowledge, though prodigious, is incomplete. Discoveries driven by AI will likely reveal a veritable hydra of medical conundrums that still await proper attention; for example, one AI study identified three new potential multiple sclerosis subtypes.57 Addressing these discoveries may change the way even field experts practice medicine. For example, Lee Sedol, one of the world’s premier players of the traditional board game Go, was beaten by Google’s milestone AlphaGo program, which employed ML algorithms. Although he lost 4 of 5 games to the computer, Sedol was able to win one game with a novel strategy catalyzed by his struggle with his AI opponent, and this changed the way the game is played by professionals today.58 Similarly, human chess players assisted by AI, so-called “centaur” players, have proved more effective than either humans or AI programs alone.59 The same has been and will likely continue to be true for medicine as physicians learn to use new tools.
Machine learning algorithms are still very much dependent on a human operator to carefully design the algorithm’s architecture. While it is true that ML involves an algorithm improving its performance on a specific task with minimal human input, human input is still crucial. An algorithm can only improve at the task it has been set, and, while programs have been able to outperform humans at certain tasks, the program must still be told what to do by an operator. Just as one would not expect a statistics program to select the appropriate statistical test, ask the right research question, or collect the research data automatically, an ML algorithm cannot frame its own problem. A data scientist is required to select the right algorithm to investigate the specific question and then, through a process called data cleaning, ensure that the data will run through the algorithm correctly.57 Even once established, models require continual fine-tuning and updating, much as Microsoft or Apple products receive updates. These adjustments do not happen automatically and require work from developers and new data from those using the models. Finally, for those concerned about a fully autonomous and self-aware AI, given the extremely limited understanding of human cognition, it is unreasonable to expect humans to design a machine with capabilities that exceed our current understanding. As physicist and philosopher Ragnar Fjelland observed, “The overestimation of technology is closely connected with the underestimation of humans.”60
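As a small example of what “data cleaning” can involve, the sketch below (Python with the pandas library, on an invented miniature dataset) harmonizes units, fills a missing value, and removes a physiologically implausible entry before any model would ever see the data; real pipelines are far more involved, and the column names and thresholds here are hypothetical.

```python
# A tiny, invented example of data cleaning with pandas:
# harmonize units, fill a missing value, drop an implausible record.
import pandas as pd

raw = pd.DataFrame({
    "temp": [37.2, 98.6, None, 310.0],   # mixed Celsius/Fahrenheit, one typo
    "temp_unit": ["C", "F", "C", "C"],
    "heart_rate": [82, 110, 95, 70],
})

# Convert Fahrenheit readings to Celsius so all rows share one unit
is_f = raw["temp_unit"] == "F"
raw.loc[is_f, "temp"] = (raw.loc[is_f, "temp"] - 32) * 5 / 9
raw["temp_unit"] = "C"

# Fill the missing temperature with the column median
raw["temp"] = raw["temp"].fillna(raw["temp"].median())

# Drop physiologically implausible values before training any model
clean = raw[raw["temp"].between(30, 45)]
print(clean)
```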
Machine learning has the potential to augment clinical medicine. Though complex, it is based on principles familiar to and practiced by physicians. Furthermore, its effectiveness depends on the supervision and contribution of physicians, who can bridge the gap between data engineers and the patient. Like any tool, ML will require training and adaptation but can confer a significant advantage on those who use it. Its effective integration into medicine will depend not only on technological advancements but also on the physicians who develop and use it in daily practice. Though still unfamiliar to many, ML’s capacity to improve medical practice and patient care is growing, and physicians should be encouraged to integrate this capability into medicine.
We would like to thank Ranadip Pal, Professor and Associate Chair for Graduate Studies at Texas Tech University, for his editorial assistance and expert review.
Article citation: Lee B, Peterson CJ. Machine learning and medicine-A brief introduction. The Southwest Respiratory and Critical Care Chronicles 2022;10(45):28–36
From: School of Medicine (BL, CJP), Texas Tech University Health Sciences Center, Lubbock, Texas; College of Engineering (BL), Texas Tech University, Lubbock Texas
Submitted: 4/11/2022
Accepted: 10/5/2022
Conflicts of interest: none
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.