Predicting where proteins are located within a cell is a fundamental question in biology and drug discovery, often referred to as subcellular localization. Understanding the location of a protein is crucial because its function is closely tied to where it resides—be it the nucleus, cytoplasm, or cell membrane. By mapping these protein locations, researchers can gain valuable insights into cellular processes and identify new therapeutic targets that could revolutionize medicine.
This article explores how researchers can collaboratively train AI models to forecast protein properties like subcellular location, all while safeguarding sensitive data from being shared between institutions. Thanks to NVIDIA FLARE and the NVIDIA BioNeMo Framework, this advanced training process becomes more accessible and secure.
How to Fine-Tune a Model for Subcellular Localization
A hands-on NVIDIA FLARE tutorial illustrates how to fine-tune the ESM-2nv model so it can classify proteins by their subcellular location. ESM-2nv produces protein sequence embeddings, which are fine-tuned on datasets such as the one detailed in the study “Light Attention Predicts Protein Location from the Language of Life”.
In this tutorial, we focus on predicting subcellular localization from data formatted as FASTA files following the biotrainer standard. Each entry includes the protein sequence, a training/validation split assignment, and a location class label, such as Nucleus or Cell_membrane.
A sample from the FASTA format looks like this:
```
>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False
MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSSIVSRFLNGRFEDQYTPTIEDFHRKVYNIHGDMYQLDILDTSGNHPFPAMRRLSILT
GDVFILVFSLDSRESFDEVKRLQKQILEVKSCLKNKTKEAAELPMVICGNKNDHSELCRQVPAMEAELLVSGDENCAYFEVSAKKNTNVNE
MFYVLFSMAKLPHEMSPALHHKISVQYGDAFHPRPFCMRRTKVAGAYGMVSPFARRPSVNSDLKYIKAKVLREGQARERDKCSIQ
```
In this snippet:
- TARGET indicates the subcellular location class.
- SET differentiates between training and testing datasets.
- VALIDATION marks sequences meant for validation.
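The annotations above are simple key-value pairs in the FASTA header, so they are straightforward to parse. The following minimal sketch is a hypothetical helper, not part of biotrainer or BioNeMo, that splits a header into its attributes:

```python
# Parse a biotrainer-style FASTA header into its key-value attributes.
# The attribute names (TARGET, SET, VALIDATION) follow the sample above;
# this helper is illustrative only.
def parse_fasta_header(header: str) -> dict:
    """Turn '>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False'
    into {'id': 'Sequence1', 'TARGET': ..., 'SET': ..., 'VALIDATION': ...}."""
    parts = header.lstrip(">").split()
    attrs = {"id": parts[0]}
    for token in parts[1:]:
        key, _, value = token.partition("=")
        attrs[key] = value
    return attrs

header = ">Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False"
print(parse_fasta_header(header))
```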
The dataset encompasses ten location classes, presenting an exciting challenge for real-world classification.
How to Use Federated Learning with BioNeMo Protein Language Models
Getting started is incredibly straightforward. Using BioNeMo Framework v2.5 within Docker, you can launch a Jupyter Lab environment, making it easy to run the Federated Protein Property Prediction tutorial in your browser.
NVIDIA FLARE facilitates federated training, allowing participants to train models locally and only contribute model updates rather than sharing entire datasets. These updates are aggregated to create a centralized global model using FedAvg, ensuring data privacy while enabling collaboration.
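The core of FedAvg is a weighted average of client model updates, with each client weighted by its local sample count. Real NVIDIA FLARE jobs use the framework's built-in FedAvg controller; the standalone sketch below only illustrates the aggregation math, with hypothetical weight dictionaries standing in for actual model state:

```python
import numpy as np

# FedAvg aggregation sketch: combine per-client weights into a global
# model, weighting each client by its number of local training samples.
def fedavg(client_weights, client_sizes):
    total = sum(client_sizes)
    avg = {}
    for name in client_weights[0]:
        avg[name] = sum(
            w[name] * (n / total)
            for w, n in zip(client_weights, client_sizes)
        )
    return avg

clients = [
    {"layer": np.array([1.0, 2.0])},  # smaller site
    {"layer": np.array([3.0, 4.0])},  # larger site
]
global_model = fedavg(clients, client_sizes=[100, 300])
print(global_model["layer"])  # pulled toward the larger client: [2.5 3.5]
```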
Training and Visualization
For this demonstration, researchers fine-tuned a 650-million-parameter ESM-2nv model pre-trained with the BioNeMo Framework. This model size balances predictive accuracy with computational efficiency, making it well suited to federated training scenarios.
Key workflow steps include:
- Data Splitting: Heterogeneous sampling reflects the variability expected across institutions, enhancing the realism of the federated training setup.
- Federated Averaging (FedAvg): Local client updates are pooled into a shared global model, protecting sensitive data while allowing collaborative learning.
- Visualization with TensorBoard: Researchers can monitor both local and federated training runs in real time, gaining insights into the evolution of the global model over successive communication rounds.
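The heterogeneous data splitting in the first step can be simulated with a Dirichlet distribution over clients, a common way to produce non-IID label skew across sites. The sketch below is an illustrative assumption (the tutorial's actual splitter, client count, and alpha may differ):

```python
import numpy as np

# Non-IID split sketch: for each class, draw per-client proportions from
# a Dirichlet distribution so different sites see different label mixes.
def dirichlet_split(labels, n_clients=3, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, chunk in zip(client_indices, np.split(idx, cuts)):
            client.extend(chunk.tolist())
    return client_indices

labels = np.repeat(np.arange(10), 100)  # 10 location classes, 100 samples each
splits = dirichlet_split(labels, n_clients=3, alpha=0.5)
print([len(s) for s in splits])  # uneven sizes across the 3 clients
```

Smaller alpha values produce more extreme skew; alpha near infinity approaches an even split.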

Results
The comparative study examined local training versus federated training (FedAvg) under conditions of heterogeneous data.
| Client | # Samples | Local Accuracy (%) | FedAvg Accuracy (%) |
|---|---|---|---|
| Site-1 | 1,844 | 78.2 | 81.8 |
| Site-2 | 2,921 | 78.9 | 81.3 |
| Site-3 | 2,151 | 79.2 | 82.1 |
| Average | — | 78.8 | 81.7 |
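The averages in the table's last row follow directly from the per-site numbers, as a quick check confirms:

```python
# Verify the table's "Average" row from the per-site accuracies.
local = [78.2, 78.9, 79.2]
fedavg_acc = [81.8, 81.3, 82.1]

local_avg = round(sum(local) / len(local), 1)
fedavg_avg = round(sum(fedavg_acc) / len(fedavg_acc), 1)
print(local_avg, fedavg_avg)  # 78.8 81.7 -> ~2.9-point gain from FedAvg
```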
The results demonstrate that federated learning can harness collective intelligence from various institutions to create a more robust predictive model than what any single site could achieve.

Benefits of Using BioNeMo and FLARE for Protein Prediction
The advantages of utilizing BioNeMo and FLARE for protein prediction extend beyond merely identifying cellular locations. This approach unites the scientific community, fostering collaborative AI development for advancing biological research:
- Strengthened Prediction: Federated learning allows the pooling of collective intelligence without the need to share raw protein data.
- Collaborative Advantage: Each institution contributes to constructing a more powerful predictive model while keeping sensitive data within local confines.
- Accelerated Discovery: The BioNeMo Framework provides researchers with advanced tools for biological sequence analysis, expediting breakthroughs in the field.
Get Started with Federated Protein Prediction
Federated protein property prediction using the NVIDIA BioNeMo Framework and NVIDIA FLARE represents a transformational approach in life sciences. By aligning the nuanced language of life (protein sequences) with federated AI workflows, this methodology accelerates discoveries in drug development, healthcare, and biotechnology while ensuring data privacy.
The future of AI in life sciences is not isolated; it’s collaborative. With FLARE and BioNeMo, this future is already unfolding. To begin exploring federated protein property prediction, visit the NVIDIA/NVFlare GitHub repository for initial steps and more advanced, practical examples.

