AI Learns Gujarati: Bhashini Project's National Language Translation Mission Advances
At Dhirubhai Ambani University (DAU) in Gandhinagar, doctoral student Bhargav Dave is engaged in a meticulous task: matching words between Gujarati and English. Medical terms are fed one by one into an artificial intelligence system that is gradually learning to speak the language. Hriday, the Gujarati word for heart, maps cleanly to its English equivalent. Pancreas, however, becomes swadupind — a translation that reflects careful deliberation.
Ground Rules for Translation
"There are ground rules," Dave explains. "The word 'doctor,' for instance, will not be rendered as 'tabib' or 'daktar'. Similarly, medical terms like bronchial carcinoma, a variant of lung cancer, will not be given convoluted translations." So far, nearly 2 lakh such translations have been completed for Gujarati alone. Dave is part of a large team dedicated to the Gujarati component of Bhashini, a nationwide project that brings together software developers, computer scientists, linguists, and language experts.
Their collective mission is to teach AI not only to recognize Gujarati word patterns but also to translate fluidly between Gujarati and other Indian languages. This project gains significance amid growing conversations about AI's impact on linguistic diversity.
Bhashini: Bridging Language Gaps
Bhashini — an acronym for BHASha INterface for India — operates under the National Language Translation Mission (NLTM), spearheaded by the Ministry of Electronics and Information Technology (MeitY). Its ambition is to make artificial intelligence work for every Indian in their own tongue. Recently, Bhashini's core team gathered at DAU to review four years of work as the initiative reached a significant milestone, with both an app and a website now live.
Krishna Kishore, deputy director at MeitY for the NLTM, stated that the central purpose is to bring all 22 scheduled languages of India into the digital ecosystem. "The primary idea was to make all Centre-run websites and platforms accessible in every language. Many implementations are already visible across websites. But beyond those, certain key areas such as law, agriculture, and e-governance require special focus. Multiple teams across India are working with that domain-specific lens," he said.
Beyond Literal Translation
Translation is only part of the story. "The focus is equally on pragmatics — it should not only convey the literal meaning but should also make the context clear. The use of Bhashini has been successfully demonstrated at events like the Maha Kumbh," Kishore added. Experts associated with the project highlighted that one of its focus areas is data sovereignty — keeping user data within Indian systems rather than routing it through foreign servers.
The platform is also designed to be voice-native, allowing users to interact directly through speech without typing. Recently, at the India AI Impact Summit 2026, MeitY unveiled VoicERA — an open-source, end-to-end Voice AI stack. Government departments can now quickly deploy voice-enabled citizen services in areas such as agriculture advisories, livelihood services, grievance redressal, citizen feedback, and scheme discovery.
Scale and Complexity of the Work
Prof Dipti Misra Sharma, a computational linguistics expert from IIIT Hyderabad and chief investigator for the Indian Language to Indian Language Machine Translation (IL-IL MT) initiative, provided insights into the project's scale and complexity. "About 12 institutions have been associated with this project for four years now. Some of the basic models already existed, but the quality of translation was not satisfactory. Bhashini is unique due to its scope — unlike general-purpose large language models, here each language is given richer, more focused data. Moreover, languages with limited online presence such as Dogri, Bodo, and Maithili are also included," she said.
The diversity of India's languages presents both a challenge and an opportunity. "We increasingly encounter mixed-language usage such as 'Jab We Met'. How to train a model to translate that? These were the questions we wrestled with when working on the project. To give a sense of the computational scale involved, a supercomputer, Param Siddhi AI at C-DAC, was deployed for the task," she added.
Need for Green AI
In his presentation to the DAU gathering, Prof. Majumder advocated for edge-native, low-energy, human-centred AI as a global public good. "The dominant model of AI today is energy-intensive, dependent on massive computational infrastructure and large proprietary datasets. For a country like India, what we need is a small, optimized automatic speech recognition model for indigenous languages — one that can run directly from a phone," he said.
"It should work offline or with minimal connectivity, consume low energy, and protect user privacy. That's what will genuinely enable access to education, government services, and information in a speaker's own language."
A New Lease of Life for Dogri
Dr Preeti Dubey, assistant professor of Computer Applications at Government College for Women, Udhampur, represented the Dogri language cohort at the DAU gathering. Compared to major Indian languages, she explained, Dogri enters the AI age with a thinner training dataset.
"Like elsewhere in India, the Jammu and Himachal Pradesh region — home to the bulk of Dogri speakers — is seeing more children enrol in English-medium schools. Many learn Dogri at home, from parents or grandparents. For the project, we reached out to local libraries, converting physical pages to PDFs and then using optical character recognition to glean Dogri words and sentences," she said.
With a speaker base estimated between 2 and 5 million, Dogri's inclusion in Bhashini represents a meaningful step toward preserving the language in an increasingly AI-dominated digital world — a model, experts say, of how technology can be turned towards linguistic conservation rather than homogenization.
How AI Learns a New Language
The process by which AI acquires a language is fundamentally different from how a child does. Where a baby begins by mimicking sounds and slowly learns to connect spoken words to visual symbols, AI learns by detecting statistical patterns across massive datasets. At DAU, for instance, the team fed nearly two lakh Gujarati sentences from a wide range of sources into the system to build its foundational language model.
Early efforts focused on legal, scientific, technological, and administrative vocabulary. Rather than translating directly between two Indian languages, the system relies on what linguists call a pivot language. "If we give an input for translating a Gujarati word into Bengali, the system first compares the Gujarati word to English or Hindi, which have far larger training corpora, and then maps that to Bengali, before generating the final translation. But for Bhashini, the larger goal is to improve direct Indian language-to-Indian language translation, and that demands much stronger parallel sentence data, rigorous evaluation, and terminology control," said one team member.
Linguists' Indispensable Role
A major part of generative AI involves predicting the next word in a sentence — an ability that only becomes reliable when the AI deeply understands a language's syntax. This is where linguists become indispensable. They set terminology and style guidelines, review outputs for fluency and adequacy, and help the system navigate the natural variation in real-world usage.
Gujarati, like most Indian languages, follows a subject-object-verb sentence structure — quite different from English's subject-verb-object order — and teams must define a standard register while carefully documenting acceptable dialectal variations, from Surti to Kathiawadi, rather than imposing a single "correct" form. Finally, all data must be labelled and structured so that the AI can use it for interactive purposes, allowing native Gujarati speakers to query government repositories or converse with digital systems entirely in their own language.
The Building Blocks of AI Language Learning
- Data collection: Models are trained on large monolingual corpora drawn from books, news websites, government documents, online articles, and voice samples, supplemented by parallel sentence pairs across language pairs.
- Cleaning and standardization: Scientists remove noise and duplicates, and normalize spelling, script, punctuation, and formatting for consistency.
- Finding and teaching comparable data: The system first learns to match exact equivalent words across languages, then moves to structure and grammar. Source sentences are matched with target sentences, and quality checks are applied throughout.
- Tokenization: Words are broken down into smaller units called tokens, allowing the model to process each element individually and handle rare words and morphologically complex forms.
- Training the neural machine translation (NMT) model: A transformer-based system is trained to transfer meaning at the sentence level, generating translations one token at a time.
- Employing neural networks: The neural network begins predicting the next word and completing sentences, improving in accuracy through repeated exposure to data.
- Beta testing and limited deployment: Once initial benchmarks are met, the model is connected to major databases and the broader internet, where it continues learning autonomously within defined parameters, developing a progressively deeper understanding of the language.
- Transfer learning and execution: The model learns from existing data through knowledge transfer (for example, from Hindi to Gujarati) on a multilingual platform at regular intervals. It is assessed through metrics and human review, fine-tuned for specific domains such as medical or legal, and then deployed. Ongoing updates enable it to incorporate feedback and support real-time training.
