Unveiling the infiltration of CSAM in widely used AI training data
A recent study has found that a popular AI training dataset contains more than a thousand instances of child sexual abuse material. The dataset was used to train popular AI image generators, including Stable Diffusion.
Massive public dataset found to contain child sexual abuse material
An extensive public dataset used to train AI image generators has been found to contain more than a thousand instances of child sexual abuse material (CSAM), according to a study published by the Stanford Internet Observatory (SIO). The LAION-5B dataset, which consists of metadata and links to images rather than the images themselves, was found to reference 1,008 CSAM images, and the study cautions that additional instances likely went undetected.
The CSAM images linked in LAION-5B were hosted on various websites, including Reddit, Twitter, Blogspot, WordPress, XHamster, and XVideos. The dataset was used to train Stable Diffusion version 1.5, a popular AI image generator known for its ability to create explicit images.
Identification and removal of CSAM
The SIO focused on images tagged as “unsafe” by LAION’s safety classifier and used PhotoDNA, a hash-matching tool developed by Microsoft, to detect CSAM. Matches were then sent to the Canadian Centre for Child Protection (C3P) for verification. The identified source material is currently being removed, with the image URLs reported to the National Center for Missing and Exploited Children (NCMEC) in the US and to C3P in Canada.
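PhotoDNA itself is proprietary and not publicly available, but the workflow the SIO describes (narrow the metadata to records flagged as unsafe, fetch the linked images, compute a perceptual hash of each, and compare it against hashes of known CSAM held by child-safety organizations) can be sketched in outline. The snippet below is an illustrative approximation only, with the open-source imagehash library standing in for PhotoDNA; the column names ("url" and "punsafe") and the known-hash list are assumptions for the example, not details from the study.

```python
# Illustrative sketch only: approximates the hash-matching workflow with the
# open-source `imagehash` library standing in for Microsoft's proprietary PhotoDNA.
# The column names ("url", "punsafe") and KNOWN_HASHES are assumptions.
import io

import imagehash
import pandas as pd
import requests
from PIL import Image

# Hypothetical set of perceptual hashes supplied by a child-safety organization.
KNOWN_HASHES = {imagehash.hex_to_hash("0f0f0f0f0f0f0f0f")}


def scan_metadata(parquet_path: str, unsafe_threshold: float = 0.9) -> list[str]:
    """Return URLs whose linked image matches a known hash."""
    df = pd.read_parquet(parquet_path)
    # Narrow the scan to records the safety classifier scored as likely unsafe.
    suspects = df[df["punsafe"] >= unsafe_threshold]

    matches = []
    for url in suspects["url"]:
        try:
            resp = requests.get(url, timeout=10)
            img = Image.open(io.BytesIO(resp.content))
        except Exception:
            continue  # dead link or non-image content
        # Perceptual hash: visually similar images map to nearby hash values.
        h = imagehash.phash(img)
        if any(h - known <= 4 for known in KNOWN_HASHES):  # small Hamming distance
            matches.append(url)
    return matches
```

In the actual pipeline, candidate matches are not examined locally; as the study describes, they are forwarded to organizations such as C3P for verification and reporting.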
Concerns and responses
Stability AI, the company behind Stable Diffusion, did not respond to questions regarding the presence of CSAM in LAION-5B or whether any of that material made its way into its models. While the company released Stable Diffusion 2.0 with filters to prevent unsafe images, version 1.5, which was trained on LAION-5B, was released by another startup, RunwayML, in collaboration with Stability AI.
Previous controversies
This is not the first time LAION’s training data has been embroiled in controversy. Google, which used a predecessor of LAION-5B to train its Imagen AI generator, decided not to release the generator publicly, citing concerns that the resulting model would be biased and problematic. An audit of the predecessor dataset uncovered inappropriate content, including pornographic imagery and racist slurs.
Response from LAION
LAION has announced plans for regular maintenance procedures to remove links in its datasets that point to suspicious or potentially unlawful content on the public internet. The organization emphasized its zero-tolerance policy for illegal content and plans to return its datasets to public availability once its filtering has been updated.
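LAION has not published the details of that maintenance process. In broad terms, though, pruning flagged links from a URL-and-caption dataset amounts to dropping metadata rows whose URLs (or, more commonly, hashes of those URLs) appear on a blocklist supplied by safety organizations. The sketch below is a hypothetical illustration rather than LAION's actual tooling; the "url" column name and the blocklist file format are assumptions.

```python
# Hypothetical illustration of pruning flagged links from a URL/caption metadata
# shard; this is not LAION's actual maintenance tooling. The "url" column name
# and the blocklist file format are assumptions for the example.
import hashlib

import pandas as pd


def load_blocklist(path: str) -> set[str]:
    """Read one SHA-256 URL digest per line; a hash-based blocklist avoids
    redistributing the offending URLs themselves."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def prune_shard(shard_path: str, blocklist: set[str], out_path: str) -> int:
    """Drop rows whose URL digest is blocklisted; return the number removed."""
    df = pd.read_parquet(shard_path)
    digests = df["url"].map(lambda u: hashlib.sha256(u.encode()).hexdigest())
    keep = ~digests.isin(blocklist)
    df[keep].to_parquet(out_path, index=False)
    return int((~keep).sum())
```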
Updated statement from Stability AI
A spokesperson for Stability AI clarified that the company’s models were trained on a filtered subset of the dataset and emphasized its commitment to preventing the misuse of AI for unlawful activity. The spokesperson also stressed that the SIO studied version 1.5 of Stable Diffusion, which Stability AI did not release, and expressed disagreement with the decision to release that version of the LAION-5B-trained model.