Seldom has the intersection of artificial intelligence and scientific discovery attracted such intense scrutiny. Two new AI systems, Robin and Co-Scientist, were recently described in the journal Nature. Both employ multi-agent architectures to assist researchers in generating hypotheses and analysing data. According to Karin Verspoor, Dean of Computing Technologies at RMIT University, scientists must combine deep analysis with broad reasoning strategies. These systems represent a significant advance, yet their inherent constraints warrant careful examination.
Robin, developed by the non-profit organisation FutureHouse, is the first system to automate key intellectual steps in experimental biology. It proposed thirty drug candidates for dry age-related macular degeneration, a leading cause of blindness worldwide. The top five candidates were selected for laboratory testing by human researchers. Through iterative rounds of brainstorming and analysis, two drugs were ultimately identified as promising. Co-Scientist, built by Google DeepMind, similarly generates hypotheses through elaborate reasoning agents.
Both systems, however, fall short of validating their hypotheses directly through physical experiments. They rely heavily on human input to define research questions and to scrutinise predictions. Co-Scientist employs a reflection agent that mimics a critical peer reviewer assessing hypothesis quality. Ranking agents then debate hypotheses in simulated tournaments using multiple interacting language models. Notwithstanding these sophisticated mechanisms, neither system can independently confirm its own findings.
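To make the tournament idea concrete, the following is a minimal sketch of how pairwise debates between hypotheses could be turned into a ranking. It is not Co-Scientist's implementation: the Elo-style rating update, the function names `expected_score` and `run_tournament`, the parameters `k` and `start`, and the toy `length_judge` are all assumptions introduced for illustration. In a real system the judge would be a debate between language models rather than a simple rule.

```python
import itertools


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(hypotheses, judge, k=32.0, start=1200.0):
    """Play every pair of hypotheses once and return (ranking, ratings).

    `judge(a, b)` is a hypothetical callable that returns the winning
    hypothesis; it stands in for a simulated debate between models.
    """
    ratings = {h: start for h in hypotheses}
    for a, b in itertools.combinations(hypotheses, 2):
        winner = judge(a, b)
        score_a = 1.0 if winner == a else 0.0
        exp_a = expected_score(ratings[a], ratings[b])
        # Zero-sum update: whatever A gains, B loses.
        delta = k * (score_a - exp_a)
        ratings[a] += delta
        ratings[b] -= delta
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    return ranking, ratings


def length_judge(a, b):
    """Toy judge (assumption): prefers the longer statement."""
    return a if len(a) >= len(b) else b


if __name__ == "__main__":
    ranking, ratings = run_tournament(["aa", "a", "aaa"], length_judge)
    for hypothesis in ranking:
        print(hypothesis, round(ratings[hypothesis], 1))
```

The sketch also illustrates the limitation described above: the ranking reflects only which hypothesis the judge prefers, not whether any hypothesis is true, so the result is only as reliable as the judge.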
Broader concerns about AI in science have also emerged from recent research. The Agents4Science conference at Stanford showcased AI-generated papers spanning mechanical engineering and protein design. One system, called BadScientist, deliberately produced research that appeared convincing but was fundamentally unsound. Recent work has found that AI assistance increases the quantity of papers and peer reviews while diminishing their quality. Fabricated references and misleading images in published works further underscore these pervasive risks.
What distinguishes these developments is their implicit acknowledgement that AI cannot yet replicate human scientific reasoning. The imprecision of language-based reasoning remains a fundamental constraint for these systems. Organisations such as Sakana AI continue pursuing full automation of the scientific process. Nevertheless, the evidence suggests that human oversight remains indispensable for maintaining research integrity. The trajectory of AI in science thus hinges on collaboration rather than substitution.






