Security researchers have discovered that some datasets used by companies that develop large language models (LLMs) include API keys, passwords, and other forms of credentials.
It’s no secret that large language models are taking over the online world. Companies boast powerful AI solutions that seem to be the answer to everything.
However, for an AI agent or solution to be effective, it must be trained on as much data as possible. Some of that data is taken directly from the Internet, and some companies and organizations specialize in this type of data gathering.
Common Crawl is one such organization. It offers datasets to companies that need to train their AI, and everything in them is gathered from the publicly available Internet. This means that some sensitive information might be collected as well.
Security researchers from Truffle Security found that all kinds of credentials, API keys, and passwords are caught in the net. The biggest problem is that some web developers hardcode sensitive information into their websites, and it eventually lands in LLM training data.
The researchers found 11,908 live secrets (API keys, passwords, and other credentials that successfully authenticate with their respective services) across 2.76 million websites.
“Leaked keys in Common Crawl’s dataset should not reflect poorly on their organization; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. And Common Crawl should not be tasked with redacting secrets; their goal is to provide a free, public dataset based on the public Internet for organizations like Truffle Security to conduct this type of research,” explained the researchers.
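To illustrate what a hardcoded key in front-end HTML or JavaScript looks like to a scanner, here is a minimal Python sketch. It is not Truffle Security's method: the two regular expressions, the `find_candidate_secrets` helper, and the sample page are assumptions made up for this example, and the sketch only flags candidates without checking whether a key is live. The AWS value is the placeholder key ID that Amazon uses in its own documentation.

```python
import re

# Illustrative patterns only (an assumption for this sketch);
# real scanners ship many more detectors and verify each hit.
PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Something like apiKey = "..." or password: '...' with a quoted value of 8+ characters.
    "generic_assignment": re.compile(
        r"""(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*["']([^"']{8,})["']"""
    ),
}


def find_candidate_secrets(text):
    """Return (pattern_name, matched_text) pairs found in the given page source."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    return findings


# Hypothetical page source with two hardcoded placeholder values.
page = """
<script>
  const apiKey = "EXAMPLE-not-a-real-key";
  const awsId = "AKIAIOSFODNN7EXAMPLE";
</script>
"""

for name, snippet in find_candidate_secrets(page):
    print(name, snippet)
```

A match here means only that a string looks like a credential. The researchers counted a secret as live only when it successfully authenticated with its service, which is a separate step this sketch does not attempt.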
In fact, companies that develop LLMs have warned about this particular problem. The recommendation is simple: don’t hardcode any kind of sensitive information in websites. This matters all the more because people who use AI tools might copy the generated code into their own work, unknowingly spreading the problem even further.
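One common way to follow that recommendation is to keep the secret on the server and read it from the environment at startup, so it never appears in the HTML or JavaScript sent to the browser. The Python sketch below shows the idea; the `load_api_key` helper and the `PAYMENT_API_KEY` variable name are hypothetical and chosen only for illustration.

```python
import os


def load_api_key(name="PAYMENT_API_KEY"):
    """Read a secret from the environment instead of hardcoding it in source.

    The variable name is a hypothetical example, not a real service's setting.
    """
    key = os.environ.get(name)
    if not key:
        # Fail loudly at startup rather than falling back to a baked-in value.
        raise RuntimeError(f"Set the {name} environment variable before starting the server")
    return key
```

The browser then talks to the site's own back end, and only the back end calls the third-party service with the key.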
Silviu is a seasoned writer who has followed the technology world for almost two decades, covering topics ranging from software to hardware and everything in between.