The Wikimedia Foundation just launched a new AI-friendly database that transforms its 19 million Wikidata entries into vectors, making it dramatically easier for smaller AI developers to access structured information. The project aims to level the playing field against Big Tech companies that already have the resources to vectorize this data themselves.
The Wikimedia Foundation just handed smaller AI developers a powerful new weapon in their fight against Big Tech dominance. Through a year-long project, the organization has transformed all 19 million entries in Wikidata into AI-friendly vectors that capture context and meaning, not just raw information.
The Wikidata Embedding Project, led by Wikimedia Deutschland in Berlin, represents a significant shift in how structured knowledge gets distributed to AI developers. Instead of the raw structured-data format that previously required extensive processing, the new vector database represents information as an interconnected graph in which Douglas Adams links to "human" and to his book titles simultaneously.
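The idea behind a vector database like this can be illustrated with a minimal sketch: entities are stored as embedding vectors, and a query retrieves the entries whose vectors point in the most similar direction. The 3-dimensional vectors below are made-up toy values (real embeddings are model-generated and have hundreds of dimensions), and the code is an illustration of the general technique, not the project's actual API.

```python
import numpy as np

# Toy in-memory "vector database": each entity maps to a made-up
# embedding vector. Entities with related meaning get nearby vectors,
# which is what lets "Douglas Adams" sit close to his book title.
ENTITIES = {
    "Douglas Adams": np.array([0.9, 0.8, 0.1]),
    "The Hitchhiker's Guide to the Galaxy": np.array([0.85, 0.75, 0.2]),
    "human": np.array([0.7, 0.3, 0.4]),
    "asteroid belt": np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # Standard similarity measure for embeddings: 1.0 means the
    # vectors point in exactly the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query_vec, k=2):
    # Rank all stored entities by similarity to the query vector
    # and return the top k names.
    ranked = sorted(ENTITIES.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# A query vector near "Douglas Adams" retrieves him and his book first.
query = np.array([0.88, 0.79, 0.15])
print(nearest(query))
```

Production systems use approximate nearest-neighbor indexes rather than a full sort, but the retrieval principle is the same: similarity in vector space stands in for relatedness in meaning.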
"Really, for me, it's about giving them that edge up and to at least give them a chance, right?" Wikimedia portfolio lead Lydia Pintscher told The Verge. Her team spent months using a large language model to convert Wikidata's traditionally structured format into vectors that AI systems can immediately understand and use.
The timing couldn't be more strategic. While companies like OpenAI and Anthropic have the engineering resources and capital to vectorize Wikidata themselves, smaller developers have been locked out of this crucial data transformation process. The new database essentially democratizes access to one of the web's largest repositories of structured, human-curated information.
Pintscher points to Govdirectory as an example of what becomes possible when developers can easily tap into Wikidata's volunteer-curated information. The platform helps users find social media handles and contact information for public officials worldwide by leveraging the structured relationships within Wikidata.
The project addresses a fundamental problem in current AI training: most language models prioritize popular topics that flood the internet, leaving niche subjects underrepresented. "This could be a better way to get information into ChatGPT, for instance, than generating a ton of content and then waiting for the next time for ChatGPT to retrain, and maybe, or maybe not, taking into account what you contributed," Pintscher explained.
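The alternative Pintscher describes is a retrieval pattern: instead of waiting for a model to retrain on new content, relevant facts are fetched at query time and placed into the prompt. The sketch below shows that pattern in miniature; the fact store, keyword matching, and prompt format are all hypothetical stand-ins for what a real retrieval system over the vector database would do.

```python
# Hypothetical fact store standing in for retrieved Wikidata entries.
FACTS = {
    "Douglas Adams": ("Douglas Adams (Q42) was a human; author of "
                      "The Hitchhiker's Guide to the Galaxy."),
}

def retrieve(question):
    # Toy keyword match; a real system would rank stored vectors by
    # similarity to an embedding of the question.
    return [fact for name, fact in FACTS.items() if name in question]

def build_prompt(question):
    # Inject the retrieved facts into the prompt so the model can
    # answer from them without having been retrained on the data.
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What did Douglas Adams write?"))
```

The payoff of this design is immediacy: a fact added to the store today is available to the model today, with no retraining cycle in between.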