Data-Constrained Synthesis of Training Data for De-Identification, by Thomas Vakili and two other authors.
Abstract: Many sensitive domains — such as the clinical domain — lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study — using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
Submission History
From: Thomas Vakili
[v1] Thu, 20 Feb 2025 16:09:27 UTC (787 KB)
[v2] Fri, 21 Feb 2025 16:58:44 UTC (787 KB)
[v3] Sat, 31 May 2025 10:43:20 UTC (950 KB)
### Understanding the Need for Synthetic Data in Sensitive Domains
In an era where data privacy is paramount, especially in sensitive areas like healthcare, access to diverse, annotated datasets is severely constrained. Traditional datasets often carry privacy risks, making it difficult for researchers to collect, share, and use the data necessary for training machine learning models. This limitation creates an urgent need for alternative approaches, one of which is synthetic data generated by advanced models such as large language models (LLMs).
### The Role of Large Language Models (LLMs)
Large language models have garnered attention due to their remarkable capabilities in generating human-like text. As these models become increasingly sophisticated, they present an opportunity to create synthetic datasets tailored to specific domains. For instance, in the clinical domain, LLMs can produce clinical narratives that mimic real patient records, enabling researchers to bypass some of the ethical considerations associated with using actual patient data.
### De-Identification Using Machine Annotation
The generated synthetic clinical texts are not just standalone artifacts; they are equipped with machine-generated annotations for personally identifiable information (PII). This is where Named Entity Recognition (NER) models come into play. By using encoder-based NER models to tag sensitive information within the synthetic texts, researchers can ensure that the data remains compliant with privacy standards while retaining its utility for training machine learning applications.
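The machine-annotation step can be illustrated with a minimal sketch: converting character-level PII spans, as an encoder-based NER model might predict them, into BIO-tagged tokens that can serve as training data. The tag set, the example text, and the whitespace tokenizer below are illustrative assumptions, not details taken from the paper.

```python
# Sketch: turn character-level PII span predictions into BIO-tagged tokens.
# The tag set ("NAME", "DATE") and the whitespace tokenizer are assumptions
# for illustration; the paper's spans come from encoder-based NER models.

def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of a span gets B-, continuations get I-.
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((token, tag))
    return tagged

example = spans_to_bio(
    "Patient Anna Larsson was admitted on 2023-05-01 .",
    [(8, 20, "NAME"), (37, 47, "DATE")],
)
# "Anna" -> B-NAME, "Larsson" -> I-NAME, "2023-05-01" -> B-DATE, rest -> O
```

In this scheme, privacy-relevant tokens are explicitly marked, so a downstream NER model trained on the synthetic corpus learns to detect the same PII categories.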
### Training Efficacy of Synthetic NER Models
One of the compelling findings of the work by Thomas Vakili and collaborators is that synthetic datasets can effectively contribute to the training of NER models. The study reveals that when synthetic corpora are used to train these models, there is only a minor drop in predictive performance compared to models trained directly on the original data. This discovery highlights a potential pathway for leveraging synthetic data without compromising the quality of machine learning outcomes.
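The "minor drop in predictive performance" is typically quantified with entity-level metrics. As a hedged sketch (the function name and example entities are assumptions, not from the paper), exact-match entity F1 can be computed like this:

```python
# Sketch: exact-match entity-level F1, a standard metric for comparing an
# NER model trained on synthetic data against one trained on original data.
# Entities are (start, end, label) tuples; the example values are invented.

def entity_f1(gold, predicted):
    gold_set = set(gold)
    pred_set = set(predicted)
    if not gold_set or not pred_set:
        return 0.0
    tp = len(gold_set & pred_set)          # exact span-and-label matches
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(8, 20, "NAME"), (37, 47, "DATE")]
pred = [(8, 20, "NAME")]                   # model missed the date entity
f1 = entity_f1(gold, pred)                 # precision 1.0, recall 0.5
```

Comparing this score for a synthetic-trained model against an original-trained baseline gives a concrete measure of how much utility the synthesis pipeline preserves.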
### Systematic Investigation and Ablation Studies
To bolster their claims, the researchers conducted systematic ablation studies using both Swedish and Spanish data. This rigorous approach allows for an in-depth exploration of the factors governing the efficacy of the data synthesis process. Their findings suggest that a relatively small quantity of original data is often adequate for domain-adapting LLMs to generate domain-specific text, challenging the assumption that larger datasets are always necessary for high-quality model training.
### The Critical Role of Machine-Annotating NER Models
An intriguing aspect of the research is its emphasis on the performance of the machine-annotating NER models trained on original datasets. The study indicates that the success of synthetic data generation is significantly dependent on the accuracy of these models. As such, investing in high-performing NER models becomes a crucial step in the entire process, underlining the interconnectedness of data generation and annotation quality.
With the advent of synthetic data methodologies, researchers can explore new possibilities in data-scarce fields while ensuring compliance with privacy regulations. This innovative approach holds promise for various applications, particularly in the clinical domain, where data availability is critical for advancement. By leveraging synthetic data, the research community can continue to push the boundaries of machine learning capabilities while safeguarding individual privacy and promoting ethical standards in data usage.

