Google DeepMind just dropped its blueprint for measuring progress toward artificial general intelligence. The AI research lab unveiled a cognitive framework designed to evaluate AGI capabilities—and it's inviting the developer community to help build the benchmarks through a new Kaggle hackathon. The move comes as the industry races toward increasingly powerful AI systems without clear standards for what AGI actually means or how to measure it.
Google DeepMind is taking a swing at one of AI's biggest open questions: how do we actually know when we're getting close to artificial general intelligence? The answer, according to research scientist Ryan Burnell and the DeepMind team, starts with a cognitive framework that breaks AGI evaluation into measurable components.
The framework represents a significant shift in how the industry thinks about AGI progress. Instead of vague predictions about when machines will match human intelligence, DeepMind's approach focuses on specific cognitive capabilities that can be tested and tracked over time. It's less about the sci-fi singularity and more about rigorous, testable benchmarks.
But Google isn't building this measurement system alone. The company announced it's launching a hackathon on Kaggle—the data science competition platform Google acquired back in 2017—to crowdsource benchmark development. Developers and researchers can now submit their own tests for evaluating AI capabilities across the cognitive framework's various dimensions.
The timing is apt. As OpenAI pushes toward AGI with its next-generation models, Microsoft pours billions into AI infrastructure, and Meta open-sources increasingly capable systems, the industry still lacks consensus on what these milestones actually mean. Everyone's racing toward AGI, but there's no agreed-upon finish line.
DeepMind's framework attempts to solve that by establishing objective criteria. Rather than a single test or threshold, the cognitive approach examines multiple dimensions of intelligence—reasoning, learning efficiency, generalization across domains, and real-world application. It's reminiscent of how psychologists assess human cognition, but adapted for artificial systems.
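To make the multi-dimensional idea concrete, here's a minimal sketch of what a per-capability scorecard could look like in code. This is purely illustrative: DeepMind hasn't published an API or a scoring scheme, and the dimension names below are assumptions lifted from the capabilities listed above, not the framework's actual taxonomy.

```python
from dataclasses import dataclass, field

# Hypothetical dimensions drawn from the capabilities named above.
# These labels are illustrative, not DeepMind's official taxonomy.
DIMENSIONS = (
    "reasoning",
    "learning_efficiency",
    "generalization",
    "real_world_application",
)

@dataclass
class CapabilityProfile:
    """Per-dimension scores in [0.0, 1.0] for a single AI system."""
    scores: dict[str, float] = field(default_factory=dict)

    def record(self, dimension: str, score: float) -> None:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        # Clamp scores so every dimension shares one comparable scale.
        self.scores[dimension] = max(0.0, min(1.0, score))

    def weakest_dimension(self) -> str:
        # A profile surfaces the gaps that a single aggregate
        # number (or a binary "is it AGI?" test) would hide.
        return min(DIMENSIONS, key=lambda d: self.scores.get(d, 0.0))

# Example: track one model across hypothetical benchmark runs.
profile = CapabilityProfile()
profile.record("reasoning", 0.71)
profile.record("generalization", 0.43)
print(profile.weakest_dimension())  # "learning_efficiency" (unscored, so 0.0)
```

The point of the sketch is the shape, not the numbers: tracking capabilities as a profile rather than a single threshold is what lets progress be measured dimension by dimension over time.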