In a significant push to make artificial intelligence accessible beyond English-speaking users, Google has deeply integrated 29 Indic languages and dialects into its foundational AI model, Gemini. This three-year effort aims to drive AI adoption across India's rural and semi-urban regions, with a sharp focus on critical sectors like agriculture and healthcare.
Bridging the Language Divide for Last-Mile Impact
Manish Gupta, senior director at Google DeepMind, emphasized that proficiency in local languages is non-negotiable for AI to be effective at the last mile. He explained that applications in farming and medicine require a nuanced, contextual understanding of regional tongues to function with high accuracy and relevance.
"For instance, Google's Gemini powers an agricultural analysis model currently deployed at the last mile. It is imperative for our AI models to grasp the context of local languages to perform efficiently," Gupta stated. This agriculture application processes information in local languages to provide farmers with soil analysis, crop predictions, and other insights at subsidized costs, facilitated through government and private partnerships.
The Global Race for Linguistic AI Proficiency
The move comes as Big Tech companies globally recognize that developing AI expertise across a wider range of languages is the next major frontier. Data highlights the urgency: while top Indic languages like Hindi, Bangla, and Urdu are spoken by over 13% of the world's population, they account for less than 1% of content on the open internet. This scarcity risks embedding biases from English-dominated data into AI services.
In India, where only about 10% of the population speaks English, regional language proficiency in AI models is critical for meaningful adoption. Jibu Elias, responsible computing lead for India at Mozilla Foundation, noted that training AI in Indic languages caters to a vast global population. "Indic languages provide an ideal template to add nuances of various languages, and given their scale, make for ideal training grounds for AI models that can later be exported," he said.
The competitive landscape is heating up. As of late November, Anthropic's Claude Sonnet 4.5 led in Indic language performance with a score of 60.7 out of 100. Google's Gemini 3 Pro followed closely at 59.9, ahead of OpenAI's GPT-5.2 and xAI's Grok 4.
Ethical Data and Multimodal Expansion
Industry experts stress that linguistic scale must be paired with ethical data practices. Elias pointed out that India's linguistic diversity extends far beyond the 22 official languages, encompassing many languages with millions of speakers but no script, and several nearly extinct dialects. The 2011 Census lists 121 languages and more than 19,500 dialects recorded as mother tongues.
"While it is imperative for Big Tech to develop AI abilities for India's non-English majority, taking an equitable approach is key. Tech firms must work directly with community stakeholders, not just data annotation startups, to ethically scale AI," Elias added.
Gupta mentioned that Google collaborates with grassroots-level data providers to collect Indic datasets. Beyond text, Google is expanding into multimodal AI in healthcare. A peer-reviewed paper on using Gemini 1.5 to detect diabetes through retina scans will be published next month in Ophthalmology Science.
"The development of such multimodal AI models is imperative, and India is playing a big role. Native understanding of local languages is a key part of this process," Gupta concluded.
Background and Collaborations
Google's Indic-language journey began with Project Vaani in December 2022, an open-source initiative to gather public datasets. In July last year, it introduced IndicGenBench, a benchmark for evaluating model performance across 29 Indian languages. The company has partnered with the Indian government's Bhashini initiative and the Indian Institute of Science (IISc), Bengaluru, to fine-tune these models.