New AI could stop fake news in Urdu

A deep learning model trained on more than 14,000 Pakistani news articles can spot misinformation with 96% accuracy, according to a new report in academic journal Science Advances.
It’s the most comprehensive artificial intelligence system yet for detecting fake news in Urdu, the world’s 10th most spoken language, with more than 170 million speakers.
The system can identify fake news, misleading content and even partially true stories, and it tackles the major shortcomings of previous attempts at an Urdu model.
Dr Muhammad Zeeshan Babar, from Heriot-Watt’s School of Engineering and Physical Sciences, said: “Most automated fake news detection systems are trained on English language datasets.
“Urdu is the 10th most spoken language in the world and the national language of Pakistan. But it lacks large datasets to train AI systems. It can be described as a low-resource language.”
Existing Urdu datasets didn’t cover politics or religion
Zeeshan Babar and his colleagues began by assessing the existing Urdu datasets.
“We found real weaknesses in the available Urdu datasets. Many of them didn’t include news about politics, religion and other societal issues because they are delicate subjects. That’s a critical gap.
“Misinformation in Pakistani news, which is read by the diaspora around the world, touches on all of those subjects.
“Viral falsehoods can have a huge impact on public health, elections and public trust in police and government.
“A robust fact-checking infrastructure for Urdu is vital, which is why we built our Urdu Fake News Detection dataset.”

Open access to scale up efforts
The team compiled a dataset of 14,178 Urdu language news articles collected between 2017 and 2023. The articles cover 15 subject areas including politics, health, business, education, sports, science, crime, technology and social issues.
According to the paper, 8,283 articles were labelled as real and 5,895 as fake.
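To put those label counts in context, the short Python sketch below works out each class's share of the dataset and the accuracy a trivial classifier would reach by always guessing the larger class. The function name `class_balance` and the idea of a majority-class baseline are illustrative assumptions, not something described in the paper; only the two counts come from the article.

```python
# Label counts as reported in the article.
REAL_ARTICLES = 8_283
FAKE_ARTICLES = 5_895


def class_balance(real: int, fake: int) -> dict:
    """Return the dataset size, class shares and a majority-class baseline.

    The baseline is the accuracy of a hypothetical classifier that always
    predicts the more common label. It is an illustrative yardstick only.
    """
    total = real + fake
    return {
        "total": total,
        "real_share": real / total,
        "fake_share": fake / total,
        "majority_baseline": max(real, fake) / total,
    }


summary = class_balance(REAL_ARTICLES, FAKE_ARTICLES)
print(f"Total articles: {summary['total']}")
print(f"Real share: {summary['real_share']:.1%}")
print(f"Fake share: {summary['fake_share']:.1%}")
print(f"Majority-class baseline: {summary['majority_baseline']:.1%}")
```

On these figures, roughly 58% of the articles are labelled real, so a model that always answered "real" would be right a little under six times in ten. That is the yardstick against which the reported 96% accuracy can be read.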
The system learned to detect patterns in vocabulary, phrasing, sentiment and linguistic structure that distinguish fabricated stories from legitimate reporting.
Dr Waseem Abbasi, head of computer science at the University of Lahore in Pakistan, said: “We’ve made the dataset open access so that we can continually improve its performance.
“Reaching 96% accuracy is excellent, but we know that’s still a significant margin of error that could influence content moderation, advertising or even legal enforcement.
“We are also aware that algorithms trained on past data may struggle with emerging narratives; they could misclassify satire or political dissent.
“But for millions of Urdu news consumers trying to navigate a polluted information ecosystem, this could be significant.”
The team’s next focus is on extending the research to other language datasets.
The research was funded by Heriot-Watt University.