The Untold Risks of Online Personal Data: A Dive into Identity Scraping
In today’s digital age, most of our lives are conducted online, leaving a vast trail of personal data accessible to anyone with the right tools. As William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University, warns, “anything you put online can [be] and probably has been scraped.” This statement hits home, particularly in light of recent findings related to sensitive personal information circulating on the web.
A Shocking Discovery of Exposed Identity Documents
Researchers have uncovered thousands of validated identity documents, including images of credit cards, driver’s licenses, passports, and birth certificates. They also found more than 800 validated job application documents—résumés and cover letters among them—that could be tied to real individuals via platforms like LinkedIn. Although some documents could not be verified, often because of poor image quality, the findings still spotlight a troubling reality: private information is perilously accessible.
The Scope of Personal Information at Risk
When examining the discovered résumés, researchers found that many contained highly sensitive information: not just typical data points like names and contact details, but also disability status, results of background checks, and demographic information such as birth dates and racial identifiers. Because the résumés could be linked to their owners’ online presences, additional data could be retrieved as well, exposing home addresses, government identifiers, and even the contact details of personal references.
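To give a sense of how scraped text can be screened for payment-card numbers, the sketch below uses the standard Luhn checksum to filter digit runs found in captions. This is an illustrative example only, not the study’s actual validation pipeline; the function names are hypothetical.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from doubles over 9, and check the sum mod 10."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_candidate_pans(text: str) -> list[str]:
    """Return 13-16 digit runs (spaces/dashes allowed) that pass Luhn."""
    results = []
    for match in re.findall(r"(?:\d[ -]?){13,16}", text):
        digits = re.sub(r"[ -]", "", match)
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            results.append(digits)
    return results
```

A checksum pass does not prove a string is a real card number, which is why the researchers’ manual verification step matters; automated filters like this only narrow the candidate pool.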
[Figure: Examples of identity-related documents found in CommonPool’s small-scale dataset, showing a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site appears at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, text has been paraphrased to avoid direct quotation, and faces have been redacted.]
The DataComp CommonPool Controversy
Released in 2023, DataComp CommonPool has emerged as the largest existing dataset of publicly available image-text pairs, compiling a staggering 12.8 billion samples. While its creators claimed the dataset was meant for academic research, the absence of restrictions on commercial use raises ethical red flags. This opens the door for enterprises to access potentially sensitive information without appropriate consent or consideration.
CommonPool is a follow-up to the LAION-5B dataset, which was used to train models such as Stable Diffusion and Midjourney. Both datasets draw on web data scraped by the nonprofit Common Crawl between 2014 and 2022. Although commercial developers often remain tight-lipped about their training data, the overlap between DataComp CommonPool and LAION-5B suggests that sensitive personal data is not only prevalent in CommonPool but likely endemic to the many downstream models trained on the same sources.
Rachel Hong, a Ph.D. student at the University of Washington and the lead author of the study, notes that DataComp CommonPool has been downloaded more than 2 million times over the past two years, meaning the models trained on it likely replicate the same privacy risks.
Ethical Dilemmas in Data Collection
The ethical implications are severe. Abeba Birhane, a cognitive scientist and tech ethicist at Trinity College Dublin’s AI Accountability Lab, emphasizes that any large-scale web-scraped dataset almost certainly contains content that ought not to be there. This means unwanted exposure to personally identifiable information (PII), child sexual abuse imagery, and even hate speech.
Birhane’s own research into LAION-5B has already confirmed these troubling realities. If powerful AI models are built on datasets that include unwanted and harmful data, what does this mean for privacy, security, and ethical AI development?
The picture of online data privacy grows more complicated with each revelation about identity scraping and data misuse. As individuals and researchers dig deeper into this digital landscape, the need for stricter regulation and ethical standards around data scraping becomes ever clearer. The conversation about online privacy is far from over, and it is vital that all of us remain vigilant and informed about the footprint we leave in the digital realm.

