The Crisis of Reproducibility in Artificial Intelligence Research

Tech giants dominate AI research, and the line between genuine breakthroughs and commercial product advertising is blurring. Some scientists think it is time to push back.







Last month, the journal Nature published a scathing response signed by 31 scientists. They took issue with a Google Health study that had appeared in the same journal earlier, in which Google described the successful results of an artificial intelligence (AI) system that looked for signs of breast cancer in medical images. The critics argue that the Google team provided so little information about its code and about how the tests were run that the study read less like science and more like a promotional description of closed, proprietary technology.



"We couldn't take it any longer," says Benjamin Haibe-Kains, the lead author of the response, who studies computational genomics at the University of Toronto. "It's not about this particular study; it's a trend we've been seeing for years now, and it really bothers us."



Haibe-Kains and his colleagues are among a growing number of scientists pushing back against a perceived lack of transparency in AI research. "When we saw that paper from Google, we realized it was yet another example of a highly respected journal publishing a very exciting study that has nothing to do with science," he says. "It's more an advertisement for cool technology. There's nothing we can do with it."



Science is built on trust, which includes disclosing how research was carried out in enough detail for others to replicate and validate the findings. This is how science corrects itself and weeds out results that don't hold up. Reproducibility also lets others build on those results, which helps move the field forward. Science that cannot be reproduced ends up on the sidelines of history.



At least in theory. In practice, few studies are fully replicated, because most researchers are more interested in producing new results than in reproducing old ones. But in fields such as biology, physics, and computer science, researchers are expected to provide the information needed to rerun their experiments, even if such reruns are rare.



An ambitious newcomer



AI gets singled out for several reasons. For a start, it is a newcomer: it has only really become an experimental science in the past decade, says Joelle Pineau, a computer scientist at Facebook AI Research and McGill University and a co-author of the complaint. "It used to be a purely theoretical field, but more and more we are running experiments," she says. "And our dedication to rigorous methodology is lagging behind the ambition of our experiments."



This is not just an academic problem. A lack of transparency prevents new AI models and techniques from being properly tested for reliability, bias, and safety. AI is moving quickly from research labs into the real world, where it directly affects people's lives. But machine-learning (ML) models that work well in the lab can fail in the wild, with potentially dangerous consequences. Having different researchers replicate results under different conditions exposes problems sooner, which makes AI more reliable for everyone.



AI already suffers from the "black box" problem: sometimes it is impossible to say how or why an ML model produces the result it does. A lack of transparency in research only makes things worse. Large models need as many eyes on them as possible, so that more people can probe and understand how they work. That is how AI can be made safer in healthcare, fairer in policing, and more civil in chatbots.



Reproducibility in AI is hampered by a lack of three things: code, data, and hardware. The 2020 State of AI Report, an annual analysis by investors Nathan Benaich and Ian Hogarth, found that only 15% of AI studies share their code. Industry researchers are bigger offenders than university scientists; in particular, the report singles out OpenAI and DeepMind as the companies least likely to share their code.



The shortage of what is needed for reproducibility is felt even more acutely with the other two pillars of AI: data and hardware. Data is often held privately, such as the information Facebook collects about its users, or is sensitive, as with medical records. And tech giants carry out more and more of their research on enormous, extremely expensive computer clusters that few universities or smaller companies can access.



For example, training the GPT-3 language generator is estimated to have cost OpenAI $10 to $12 million, and that covers only the final model, not the cost of developing and training its prototypes. "You could probably multiply that figure by at least an order of magnitude or two," says Benaich, founder of Air Street Capital, a venture capital firm that invests in AI startups. Only a tiny handful of big tech firms can afford that kind of work, he says: "Nobody else can throw budgets that size at these experiments."





A hypothetical question: some people have access to GPT-3 and some do not. What happens when we start seeing new work in which people outside OpenAI use GPT-3 to get state-of-the-art results?

And the bigger problem: is OpenAI now choosing which researchers win and which lose?




The pace of progress is dizzying, with thousands of papers published every year. But unless researchers know which ones to trust, it is hard for the field to move forward. Replication lets other researchers check that authors have not cherry-picked their best results and that new techniques really do work as described. "It's getting harder and harder to tell reliable results from the rest," says Pineau.



What can be done? Like many AI researchers, Pineau splits her time between university and corporate labs. For the past few years she has been pushing to change how AI research is published. For example, last year she helped introduce a checklist of things that researchers must provide when submitting a paper to NeurIPS, one of the largest AI conferences, including code and a detailed description of their experiments.



Reproducibility is valuable in itself



Pineau has also helped launch several reproducibility challenges, in which researchers try to replicate the results of published studies. Participants pick papers that have been accepted at conferences and compete to rerun the experiments using the information provided. The only prize, though, is recognition.



This lack of incentive holds such practices back across all fields of research, not just AI: reproduction is essential, but it isn't rewarded. One solution is to get students to do the work. For the past couple of years, Rosemary Ke, a PhD student at Mila, a research institute in Montreal founded by Yoshua Bengio, has organized a reproducibility challenge in which students try to replicate studies submitted to NeurIPS as part of their coursework. Some successful attempts are peer-reviewed and published in the journal ReScience.



"Reproducing someone else's work from scratch takes a lot of effort," says Ke. “The Reproducibility Competition rewards this effort and honors people who do a good job.” Ke and others talk about these attempts at AI conferences, organizing workshops to encourage researchers to add transparency to their work. This year, Pinho and Ke have expanded their competition to include the seven largest AI conferences, including ICML and ICLR.



Another project promoting transparency is Papers with Code, set up by AI researcher Robert Stojnic while he was at the University of Cambridge. (Stojnic and Pineau are now colleagues at Facebook.) It began as a standalone website where researchers could link their papers to the accompanying code. This year it partnered with arXiv, the popular preprint server: since October, every machine-learning paper on arXiv has carried a Papers with Code section linking to any code the authors are willing to release. The goal is to make sharing code the norm.



Are these efforts making a difference? Pineau found that last year, after the checklist was introduced, the share of papers submitted to NeurIPS that included code rose from about 50% to 75%. Thousands of reviewers say they used that code when assessing submissions, and participation in the reproducibility challenges is growing.



The devil is in the details



But this is just a start. Haibe-Kains points out that code alone is often not enough to rerun an experiment. Building AI models involves making many small changes, adding a parameter here, tweaking a value there, and any one of them can be the difference between a model that works and one that doesn't. Without metadata describing how the models are trained and tuned, the code can be useless. "The devil really is in the details," he says.
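To make the point concrete, here is a minimal sketch in Python, with entirely hypothetical names and values (it is not drawn from any of the studies discussed), of the kind of metadata record that lets someone else repeat a training run with the same settings:

```python
import json
import platform
import random

# Hypothetical hyperparameters: every value that affects training
# (learning rate, batch size, epochs, random seed, data split) is
# written down, because changing any one of them can change the result.
config = {
    "learning_rate": 1e-4,
    "batch_size": 32,
    "epochs": 20,
    "random_seed": 42,
    "train_split": 0.8,
    "python_version": platform.python_version(),
}

# Fix the seed so the run is repeatable. Frameworks such as PyTorch or
# TensorFlow have their own seeding functions that would also be called here.
random.seed(config["random_seed"])

# Save the configuration next to the model artifacts so that reviewers
# can rerun the experiment with exactly the same settings.
with open("run_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Even a record this simple captures the kind of metadata Haibe-Kains is asking for: the exact settings that turned the released code into the published result.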



It is also not always clear which code should be shared. Many labs use special in-house software to run their models, and sometimes that software is proprietary. Knowing which part of the code to release can be difficult, says Haibe-Kains.



Pineau is not too worried about such obstacles. "We should expect a lot when it comes to sharing code," she says. Sharing data is trickier, but there are solutions here too. If researchers cannot release their data, they can provide instructions so that others can assemble a comparable dataset themselves. Or a small number of vetted reviewers could be given access to the data and validate the results on everyone else's behalf, says Haibe-Kains.



The biggest problem is hardware. DeepMind argues that big projects like AlphaGo or GPT-3, on which large labs spend so much money, ultimately benefit everyone. AI that is out of reach for other researchers in its early stages, because it demands enormous computing power, often becomes more efficient, and thus more accessible, as it is developed. "AlphaGo Zero surpassed its predecessor, AlphaGo, using far less computing power," says Koray Kavukcuoglu, vice president of research at DeepMind.



In theory, that means that even if reproduction comes late, it will at least be possible. Kavukcuoglu notes that Gian-Carlo Pascutto, a Belgian programmer at Mozilla who writes chess and Go engines in his spare time, was able to replicate a version of AlphaGo Zero, called Leela Zero, using the algorithms described in DeepMind's papers. Pineau also points out that flagship projects like AlphaGo and GPT-3 are rare; most AI research runs on computers available to an average lab. Nor is the problem unique to AI: Pineau and Benaich both point to particle physics, where some experiments can only be run on expensive equipment such as the Large Hadron Collider.



Physics experiments at the LHC, however, are run jointly by many laboratories, whereas big AI experiments are typically carried out on hardware owned and controlled by private companies. But Pineau says even that is changing. Compute Canada, for example, is assembling computing clusters so that universities can run large-scale AI experiments, and some companies, including Facebook, give universities limited access to their hardware. "It's not completely resolved," she says. "But some doors are starting to open."









Haibe-Kains is less convinced. When he asked the Google Health team to share the code for its cancer-screening AI, he was told that it needed further testing. The team repeats this justification in a formal reply to the criticism, also published in Nature: "We are going to put our software through rigorous testing before it is used in a clinical setting, working with patients, providers and regulators to ensure efficacy and safety." The researchers also said they did not have permission to share all the medical data they were using.



That won't do, says Haibe-Kains. "If they want to build a commercial product out of it, I completely understand why they won't disclose all the information." But if you publish in a scientific journal or at a conference, he believes, you have a duty to release code that others can run. Sometimes that might mean sharing a version trained on less data or using cheaper hardware. The results may be worse, but people will be able to tinker with them. "The line between building a product and doing research is blurring all the time," says Haibe-Kains. "I think as a field we are going to lose."



Research habits are hard to give up



If companies are going to be criticized for publishing, why do it at all? Partly, of course, for public relations. But mostly because the best corporate labs are filled with researchers who came from universities. To some extent, the culture at places like Facebook AI Research, DeepMind, and OpenAI is shaped by traditional academic habits. Tech companies also benefit from participating in the wider research community: every big AI project in a private lab builds on layers of published results, and few AI researchers haven't made use of open-source ML tools like Facebook's PyTorch or Google's TensorFlow.



As more research is done in-house at tech giants, more trade-offs between business demands and research norms will become necessary. The question is how researchers navigate them. Haibe-Kains would like to see journals like Nature split what they publish into separate streams: reproducible research on the one hand, and showcases of technological advances on the other.



Pineau is more optimistic. "I wouldn't be working at Facebook if it didn't have an open approach to research," she says.



Other corporate labs also stress their commitment to openness. "Scientific work requires scrutiny and reproducibility by other researchers," says Kavukcuoglu. "This is a critical part of our approach to research at DeepMind."



"OpenAI has grown into something quite different from a traditional laboratory," says Kayla Wood, a spokesperson for the company. "Naturally that raises some questions." She notes that OpenAI works with more than 80 industry and academic organizations in the Partnership on AI to think about long-term publication norms for research.



Pineau thinks there is something to that. She believes AI companies are demonstrating a third way of doing research, somewhere between Haibe-Kains's two streams. She contrasts their output with that of pharmaceutical companies, which invest billions in drug development and keep most of the results to themselves.



The long-term impact of the practices that Pineau and others are pushing remains to be seen. Will habits change for good? How will that affect the use of AI outside research? Much depends on the direction AI takes. The trend toward ever larger models and datasets, favored by OpenAI for example, would keep the cutting edge of AI out of reach for most researchers. On the other hand, new techniques such as model compression and few-shot learning could buck that trend and allow more researchers to work with smaller, more efficient AI.



Either way, AI research will continue to be dominated by large companies. Done right, that isn't necessarily a bad thing, says Pineau: "AI is changing how research labs work." The key is to make sure the wider community gets a chance to participate. Because trust in AI, on which so much depends, begins at the cutting edge.


