5/26/2019

How IBM Watson Overpromised and Underdelivered on AI Health Care

Eliza Strickland, How IBM Watson Overpromised and Underdelivered on AI Health Care, IEEE Spectrum, 2 Apr 2019. 
In many attempted applications, Watson’s NLP struggled to make sense of medical text—as have many other AI systems. “We’re doing incredibly better with NLP than we were five years ago, yet we’re still incredibly worse than humans,” says Yoshua Bengio, a professor of computer science at the University of Montreal and a leading AI researcher. In medical text documents, Bengio says, AI systems can’t understand ambiguity and don’t pick up on subtle clues that a human doctor would notice. Bengio says current NLP technology can help the health care system: “It doesn’t have to have full understanding to do something incredibly useful,” he says. But no AI built so far can match a human doctor’s comprehension and insight. “No, we’re not there,” he says....
IBM’s work on cancer serves as the prime example of the challenges the company encountered. “I don’t think anybody had any idea it would take this long or be this complicated,” says Mark Kris, a lung cancer specialist at Memorial Sloan Kettering Cancer Center, in New York City, who has led his institution’s collaboration with IBM Watson since 2012. 
The effort to improve cancer care had two main tracks. Kris and other preeminent physicians at Sloan Kettering trained an AI system that became the product Watson for Oncology in 2015. Across the country, preeminent physicians at the University of Texas MD Anderson Cancer Center, in Houston, collaborated with IBM to create a different tool called Oncology Expert Advisor. MD Anderson got as far as testing the tool in the leukemia department, but it never became a commercial product. 
Both efforts have received strong criticism. One excoriating article about Watson for Oncology alleged that it provided useless and sometimes dangerous recommendations (IBM contests these allegations). More broadly, Kris says he has often heard the critique that the product isn’t “real AI.” And the MD Anderson project failed dramatically: A 2016 audit by the University of Texas found that the cancer center spent $62 million on the project before canceling it. A deeper look at these two projects reveals a fundamental mismatch between the promise of machine learning and the reality of medical care—between “real AI” and the requirements of a functional product for today’s doctors. 
Watson for Oncology was supposed to learn by ingesting the vast medical literature on cancer and the health records of real cancer patients. The hope was that Watson, with its mighty computing power, would examine hundreds of variables in these records—including demographics, tumor characteristics, treatments, and outcomes—and discover patterns invisible to humans. It would also keep up to date with the bevy of journal articles about cancer treatments being published every day. To Sloan Kettering’s oncologists, it sounded like a potential breakthrough in cancer care. To IBM, it sounded like a great product. “I don’t think anybody knew what we were in for,” says Kris. 
Watson learned fairly quickly how to scan articles about clinical studies and determine the basic outcomes. But it proved impossible to teach Watson to read the articles the way a doctor would. “The information that physicians extract from an article, that they use to change their care, may not be the major point of the study,” Kris says. Watson’s thinking is based on statistics, so all it can do is gather statistics about main outcomes, explains Kris. “But doctors don’t work that way.” 
In 2018, for example, the FDA approved a new “tissue agnostic” cancer drug that is effective against all tumors that exhibit a specific genetic mutation. The drug was fast-tracked based on dramatic results in just 55 patients, of whom four had lung cancer. “We’re now saying that every patient with lung cancer should be tested for this gene,” Kris says. “All the prior guidelines have been thrown out, based on four patients.” But Watson won’t change its conclusions based on just four patients. To solve this problem, the Sloan Kettering experts created “synthetic cases” that Watson could learn from, essentially make-believe patients with certain demographic profiles and cancer characteristics. “I believe in analytics; I believe it can uncover things,” says Kris. “But when it comes to cancer, it really doesn’t work.” 
The realization that Watson couldn’t independently extract insights from breaking news in the medical literature was just the first strike. Researchers also found that it couldn’t mine information from patients’ electronic health records as they’d expected. 
At MD Anderson, researchers put Watson to work on leukemia patients’ health records—and quickly discovered how tough those records were to work with. Yes, Watson had phenomenal NLP skills. But in these records, data might be missing, written down in an ambiguous way, or out of chronological order. In a 2018 paper published in The Oncologist, the team reported that its Watson-powered Oncology Expert Advisor had variable success in extracting information from text documents in medical records. It had accuracy scores ranging from 90 to 96 percent when dealing with clear concepts like diagnosis, but scores of only 63 to 65 percent for time-dependent information like therapy timelines. 
In a final blow to the dream of an AI superdoctor, researchers realized that Watson can’t compare a new patient with the universe of cancer patients who have come before to discover hidden patterns. Both Sloan Kettering and MD Anderson hoped that the AI would mimic the abilities of their expert oncologists, who draw on their experience of patients, treatments, and outcomes when they devise a strategy for a new patient. A machine that could do the same type of population analysis—more rigorously, and using thousands more patients—would be hugely powerful. 
But the health care system’s current standards don’t encourage such real-world learning. MD Anderson’s Oncology Expert Advisor issued only “evidence based” recommendations linked to official medical guidelines and the outcomes of studies published in the medical literature. If an AI system were to base its advice on patterns it discovered in medical records—for example, that a certain type of patient does better on a certain drug—its recommendations wouldn’t be considered evidence based, the gold standard in medicine. Without the strict controls of a scientific study, such a finding would be considered only correlation, not causation.... 
These studies aimed to determine whether Watson for Oncology’s technology performs as expected. But no study has yet shown that it benefits patients. Robert Wachter of UCSF says that’s a growing problem for the company: “IBM knew that the win on Jeopardy! and the partnership with Memorial Sloan Kettering would get them in the door. But they needed to show, fairly quickly, an impact on hard outcomes.” Wachter says IBM must convince hospitals that the system is worth the financial investment. “It’s really important that they come out with successes,” he says. “Success is an article in the New England Journal of Medicine showing that when we used Watson, patients did better or we saved money.” Wachter is still waiting to see such articles appear.... 
Some success stories are emerging from Watson Health—in certain narrow and controlled applications, Watson seems to be adding value. Take, for example, the Watson for Genomics product, which was developed in partnership with the University of North Carolina, Yale University, and other institutions. The tool is used by genetics labs that generate reports for practicing oncologists: Watson takes in the file that lists a patient’s genetic mutations, and in just a few minutes it can generate a report that describes all the relevant drugs and clinical trials. “We enable the labs to scale,” says Vanessa Michelini, an IBM Distinguished Engineer who led the development and 2016 launch of the product.