57% of the internet may already be AI sludge

It’s not just you — search results really are getting worse. Amazon Web Services (AWS) researchers have conducted a study that suggests 57% of content on the internet today is either AI-generated or translated using an AI algorithm.

The study, titled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism,” argues that low-cost machine translation (MT), which takes a given piece of content and regurgitates it in multiple languages, is the primary culprit. “Machine generated, multi-way parallel translations not only dominate the total amount of translated content on the web in lower resource languages where MT is available; it also constitutes a large fraction of the total web content in those languages,” the researchers wrote in the study.

They also found evidence of selection bias in what content is machine translated into multiple languages compared to content published in a single language. “This content is shorter, more predictable, and has a different topic distribution compared to content translated into a single language,” the researchers wrote.

What’s more, the increasing amount of AI-generated content on the internet combined with increasing reliance on AI tools to edit and manipulate that content could lead to a phenomenon known as model collapse, and is already reducing the quality of search results across the web. Given that frontier AI models like ChatGPT, Gemini, and Claude rely on massive amounts of training data that can only be acquired by scraping the public web (whether that violates copyright or not), having the public web stuffed full of AI-generated, and often inaccurate, content could severely degrade their performance.

“It is surprising how fast model collapse kicks in and how elusive it can be,” Dr. Ilia Shumailov from the University of Oxford told Windows Central. “At first, it affects minority data—data that is badly represented. It then affects diversity of the outputs and the variance reduces. Sometimes, you observe small improvement for the majority data, that hides away the degradation in performance on minority data. Model collapse can have serious consequences.”

The researchers demonstrated those consequences by having professional linguists classify 10,000 randomly selected English sentences into one of 20 categories. The researchers observed “a dramatic shift in the distribution of topics when comparing 2-way to 8+ way parallel data (i.e. the number of language translations), with ‘conversation and opinion’ topics increasing from 22.5% to 40.1%” of those published.

This points to a selection bias in the type of data that is translated into multiple languages, which is “substantially more likely” to be from the “conversation and opinion” topic.
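
To put the reported shift in perspective, the short Python sketch below works out the change between the two figures quoted above. The variable names are illustrative only; the two percentages come from the study as quoted, and nothing else here is drawn from it.

```python
# Shares of "conversation and opinion" sentences, as quoted from the study.
share_2_way = 22.5   # percent of 2-way parallel data
share_8_plus = 40.1  # percent of 8+ way parallel data

# Absolute change, in percentage points.
point_change = share_8_plus - share_2_way

# Change relative to the 2-way share, in percent.
relative_change = point_change / share_2_way * 100

print(f"Increase: {point_change:.1f} percentage points")
print(f"Relative increase: {relative_change:.1f}%")
```

In other words, the topic's share rises by about 17.6 percentage points, or roughly 78% relative to its share in 2-way parallel data.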

Additionally, the researchers found that “highly multi-way parallel translations are significantly lower quality (6.2 Comet Quality Estimation points worse) than 2-way parallel translations.” When the researchers audited 100 of the highly multi-way parallel sentences (those translated into more than eight languages), they found that “a vast majority” came from content farms with articles “that we characterized as low quality, requiring little or no expertise, or advance effort to create.”

That certainly helps explain why OpenAI CEO Sam Altman keeps going on about how it’s “impossible” to make tools like ChatGPT without free access to copyrighted works.

Andrew Tarantola
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…