
Last week, while the world watched President Trump and Xi Jinping in Beijing, a much quieter piece of research dropped in Nature. It should have been on every front page.

A team of seven researchers from the University of Oregon, Purdue University, the University of California San Diego, New York University and Princeton University published the first peer-reviewed evidence that China’s state-controlled media has worked its way into the training data of the AI chatbots the world increasingly relies on.

Their research shows that the scripted articles, official slogans, and party-line phrasings churned out daily by Xinhua News Agency, People’s Daily, and the Communist Party’s Xuexi Qiangguo study app are now, demonstrably, inside ChatGPT and the other top chatbots.

When I read the paper, I tried a small experiment. I typed the first half of one of Xi’s signature loyalty slogans into ChatGPT: “不忘初心,” (Never forget the original aspiration). The bot finished it without hesitation: “牢记使命” (Keep the mission firmly in mind). That isn’t a folk saying. It’s a piece of working Party doctrine—Xi unveiled it in 2017 and made it the centerpiece of an indoctrination campaign every cadre had to recite. ChatGPT then offered, helpfully, to explain the phrase’s political significance.

That’s a parlor trick. The serious finding sits underneath it.

The researchers ran six case studies. The first two are the ones to remember. In the first, they combed through CulturaX, one of the largest open-source Chinese-language data sets that AI labs use to train models: about 189 million documents scraped from the Chinese-language internet. Overall, 1.64% of the documents overlap with Chinese state media. That sounds modest. But filter the data set for documents that mention Xi, the Party Congress, or the Central Committee Plenum, and the share climbs to roughly one in four. State-media content turns out to be 41 times more abundant in the corpus than Chinese-language Wikipedia.
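To make the arithmetic concrete, here is a minimal Python sketch of the filter-and-count step described above. It is not the researchers' code. The `Doc` class, the `overlaps_state_media` flag, the English keyword list and the ten-document toy corpus are all assumptions made for illustration, and this column does not describe how the paper actually detected overlap.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    # Hypothetical flag: True if the document overlaps with state-media text.
    # How the paper computes overlap is not described here.
    overlaps_state_media: bool


# Hypothetical English stand-ins for the political terms named in the column.
KEYWORDS = ("Xi Jinping", "Party Congress", "Central Committee Plenum")


def mentions_politics(doc: Doc) -> bool:
    """Return True if the document contains any of the keywords."""
    return any(keyword in doc.text for keyword in KEYWORDS)


def overlap_share(docs: list[Doc]) -> float:
    """Fraction of documents flagged as overlapping with state media."""
    if not docs:
        return 0.0
    return sum(doc.overlaps_state_media for doc in docs) / len(docs)


# Toy corpus: 10 documents, 4 of them political, 2 of those flagged.
corpus = [
    Doc("Xi Jinping addressed the Party Congress.", True),
    Doc("The Central Committee Plenum concluded today.", True),
    Doc("A blog post about Xi Jinping's travel schedule.", False),
    Doc("Forum thread on the Party Congress agenda.", False),
    Doc("Recipe for braised pork.", False),
    Doc("Review of a new smartphone.", False),
    Doc("Football match report.", False),
    Doc("Weather forecast for the weekend.", False),
    Doc("Guide to growing tomatoes.", False),
    Doc("Notes on a hiking trail.", False),
]

overall_share = overlap_share(corpus)
political_docs = [doc for doc in corpus if mentions_politics(doc)]
political_share = overlap_share(political_docs)

# Scale of the reported figure: 1.64% of about 189 million documents.
estimated_overlap_docs = round(189_000_000 * 0.0164)

print(f"overall share: {overall_share:.2%}")
print(f"share among political documents: {political_share:.2%}")
print(f"1.64% of 189 million is about {estimated_overlap_docs:,} documents")
```

The toy numbers are invented, but the shape matches the finding: a small share across the whole corpus can become a large one once the corpus is narrowed to political topics.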

“Censorship and propaganda have always shaped what people read,” Molly Roberts, one of the researchers and co-director of the China Data Lab at the University of California San Diego, told me. “What is new here is now they are shaping the systems people increasingly ask to summarize, explain, and interpret the world for them. And in this case, governments can shape not just what people in their own country consume, but also those in other countries.”

In the second study, the team posed politically sensitive questions—Is China a democracy? Is Xi Jinping a good leader? Is the National People’s Congress of the People’s Republic of China a rubber stamp?—to every major commercial chatbot, once in English and once in Chinese. Overwhelmingly, the Chinese-language answers came back more favorable to Beijing. Nine human annotators, working blind, judged the Chinese replies to be more pro-China in 75.3% of paired comparisons.

According to the study and accompanying website, the English answers from OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini and Elon Musk’s Grok were less favorable to China than their Chinese-language counterparts. The revealing exception was China’s own DeepSeek: Its V4 Pro model was uniformly pro-Beijing regardless of whether the input was in English or Chinese, reflecting state regulation of Chinese models and their training data.

A global phenomenon

And it wasn’t only about China—the same pattern showed up for questions about Russia and North Korea.

The most striking part is that no one had to do anything sinister to make this happen. The propaganda is simply there on the open web, in plain HTML, free for any AI lab’s web crawler to scoop up.

“We don’t have any evidence that China purposefully has shaped training data already,” Roberts said. “However, the fact that LLMs are using open source text from the internet to train models means that there might be even more incentives now for governments to try to shape what is on the internet.”

An uncomfortable asymmetry is buried inside the whole story. The Wall Street Journal, like most serious publications, sits behind a paywall—which is what lets us pay reporters to do the work this column rests on. Xinhua does not. People’s Daily does not. As Roberts put it: “While independent media in democracies is paywalling articles in order to sustain itself, state media in authoritarian regimes is often freely available online and easy for companies to scrape and train on.”

A separate audit in the paper widened the lens to 37 countries, each home to a majority of the speakers of its language. The pattern that the research team found in Chinese repeated wherever they looked: the lower a country’s press freedom, the more regime-friendly the AI’s local-language answer. China is the case study; the phenomenon is global.

Roberts put the stakes plainly. “Political institutions with specific objectives shape training data,” she said. “LLM responses do not cite their sources, and therefore we can’t decipher the origins of the information presented to us.”

The summit last week generated a couple of days of headlines worldwide. The Nature paper, if anyone in Washington and elsewhere reads it carefully, should generate a policy conversation that lasts years. The question of whether Beijing is shaping what your chatbot says about China has now been answered. The question of what to do about it has not.

WSJ