The third New England RLHF Hackathon recently took place, showcasing a range of projects in machine learning and reinforcement learning. To hear about future events and take part in discussions, join the Discord community.
Three of the projects presented are summarized below, each taking a different approach to reinforcement learning from human feedback:
- Pink Elephants Pt 3 (Authors: Sid Verma, Louis Castricato): This project worked toward training a “pink elephant” model with Implicit Language Q-Learning (ILQL). The authors used the standard trlX implementation but had trouble tuning hyperparameters. They suggest that future work could benefit from more sophisticated reward shaping, possibly combined with methods such as DPO (Direct Preference Optimization) and ReST (Reinforced Self-Training).
- Exploring Iterated RLHF (Authors: Arjun Prakash, Jacob Makar-Limanov): This project aimed to deepen the understanding of iterated RLHF and the co-evolution of large language models (LLMs) alongside reward models. By substituting human evaluators with an “idealized” model, UltraRM-13b, the team focused on aligning the LLM with this gold standard rather than human preferences. Their future work plans include refining their approach using advanced techniques like ReST.
- Visualizing the Reward Model via QDAIF (Authors: Will Beddow, Matthew Bernstein, Chase Blagden): This project sought to visualize and interpret the reward model in RLHF. The team adapted the Quality-Diversity through AI Feedback (QDAIF) technique, using a DeBERTa model fine-tuned on human preference data as the fitness function. They used llama-70b to generate and mutate poetry, uncovering patterns in the rewards associated with different poem types and tones.
The next hackathon is set to take place at NeurIPS. Anyone interested in participating or learning more can find further information in the Discord community.
Pink Elephants Pt 3
This project extends previous work on the pink elephant problem, aiming to build infrastructure for training a pink elephant model using ILQL. The authors, Sid Verma and Louis Castricato, tried a wide range of hyperparameters but had difficulty getting training to converge to satisfactory results.
To improve future results, they suggest more nuanced reward shaping. Their current approach uses a simple binary reward: +1 for accepted answers and -1 for rejected ones. They are considering training reward models to replace these binary signals with richer feedback. They also note that ReST may converge better than traditional methods such as PPO (Proximal Policy Optimization), and that combining ReST with DPO for fine-tuning is a promising direction for their ongoing research.
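The binary reward scheme described above is simple enough to sketch directly. The following is a minimal illustration, not the authors' code: the `Sample` class and the `to_offline_dataset` helper are hypothetical names introduced here, and the output is just a pair of parallel lists of the kind an offline RL trainer might consume.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One labelled example (hypothetical structure, for illustration only)."""
    prompt: str
    completion: str
    accepted: bool


def binary_reward(sample: Sample) -> float:
    # +1 for an accepted answer, -1 for a rejected one, as described in the text.
    return 1.0 if sample.accepted else -1.0


def to_offline_dataset(samples: List[Sample]) -> Tuple[List[List[str]], List[float]]:
    # Build parallel lists: [prompt, completion] pairs and their scalar rewards.
    texts = [[s.prompt, s.completion] for s in samples]
    rewards = [binary_reward(s) for s in samples]
    return texts, rewards


if __name__ == "__main__":
    data = [
        Sample("Suggest a game.", "Try this one.", accepted=True),
        Sample("Suggest a game.", "Try that one.", accepted=False),
    ]
    texts, rewards = to_offline_dataset(data)
    print(rewards)
```

Replacing `binary_reward` with a call to a trained reward model would be one way to realize the richer feedback the authors are considering; the rest of the pipeline would stay the same under this sketch.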
Exploring Iterated RLHF
Authors Arjun Prakash and Jacob Makar-Limanov introduced a project that examines iterated Reinforcement Learning from Human Feedback (RLHF). Their primary goal is to understand how LLMs and their reward models can evolve together over time.
In their experimental setup, the authors replaced human participants with a “gold standard” reward model, enabling them to align their LLM with this idealized model rather than solely learning from human preferences. They are currently implementing their algorithm using UltraRM-13b as the gold standard, laying the groundwork for future evaluations of various iteration methods, including ReST.
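To make the setup concrete, here is a toy sketch of one possible iterated loop. It is an assumption about the general shape of such an algorithm, not the authors' implementation: responses are plain numbers, `gold_reward` is a hypothetical stand-in for a gold-standard model such as UltraRM-13b, the proxy reward model is a crude fit to gold-labelled preference pairs, and the policy update simply averages the candidates the proxy ranks highest.

```python
import random

TARGET = 3.0  # hypothetical optimum, known only to the gold reward model


def gold_reward(x: float) -> float:
    # Stand-in for the gold-standard reward model (replaces human raters).
    return -(x - TARGET) ** 2


def collect_preferences(candidates, rng):
    # Label random pairs with the gold model; each pair is (winner, loser).
    pairs = []
    for _ in range(len(candidates)):
        a, b = rng.sample(candidates, 2)
        pairs.append((a, b) if gold_reward(a) >= gold_reward(b) else (b, a))
    return pairs


def fit_proxy(pairs):
    # Crude proxy reward model: prefer responses near the average winner.
    center = sum(winner for winner, _ in pairs) / len(pairs)
    return lambda x: -(x - center) ** 2


def update_policy(candidates, proxy, keep=4):
    # Move the policy mean to the average of the proxy's top-ranked candidates.
    best = sorted(candidates, key=proxy, reverse=True)[:keep]
    return sum(best) / len(best)


def iterate_rlhf(rounds=20, n=16, seed=0):
    rng = random.Random(seed)
    mean = 0.0  # the toy "policy" is a Gaussian centred on this value
    history = [mean]
    for _ in range(rounds):
        candidates = [rng.gauss(mean, 1.0) for _ in range(n)]
        pairs = collect_preferences(candidates, rng)  # gold model labels data
        proxy = fit_proxy(pairs)                      # reward model is refit
        mean = update_policy(candidates, proxy)       # policy follows the proxy
        history.append(mean)
    return history


if __name__ == "__main__":
    print(iterate_rlhf())
```

The point of the toy is the alternation: the policy generates data, the gold model labels it, the proxy reward model is refit, and the policy is updated against the proxy. Different iteration methods would change `update_policy` while leaving the outer loop intact.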
Visualizing the Reward Model via QDAIF
In their project, Will Beddow, Matthew Bernstein, and Chase Blagden set out to improve interpretability within RLHF by visualizing the reward model. Because reward models are intricate and high-dimensional, this is a challenging goal. The team approached it by adapting a recent technique for creative-writing generation so that it is driven by a reward model.
Their approach involves adapting QDAIF, which generates creative solutions along two axes (such as tone and genre for poetry) and mutates them based on quality. Rather than relying on a basic fitness function, they utilized a reward model trained on human preference data to visualize a lower bound of the reward function. This innovation allows them to discern which traits correlate with higher or lower rewards.
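The archive-and-mutate loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the team's implementation: `mutate` and `reward_model` are hypothetical stubs standing in for the language model and the preference-trained reward model, and the poem's cell is chosen by the stub itself rather than by AI feedback on the generated text. Because each cell keeps only the best-scoring poem found so far, the stored score is a lower bound on the best reward reachable in that cell.

```python
import random

GENRES = ["haiku", "sonnet", "ballad"]    # one axis of the archive
TONES = ["happy", "dark", "reflective"]   # the other axis


def mutate(parent, rng):
    # Stub for a language-model call that rewrites a poem (hypothetical).
    # Here the stub also picks the cell; a real system would classify the text.
    genre, tone = rng.choice(GENRES), rng.choice(TONES)
    text = f"{tone} {genre} #{rng.randrange(10**6)}"
    return {"genre": genre, "tone": tone, "text": text}


def reward_model(poem):
    # Stub for a preference-trained reward model: a deterministic score in [0, 1).
    return (sum(map(ord, poem["text"])) % 100) / 100.0


def run_qd(iterations=500, seed=0):
    rng = random.Random(seed)
    seed_poem = {"genre": "haiku", "tone": "happy", "text": "seed"}
    # archive maps (genre, tone) -> (best score so far, best poem so far)
    archive = {("haiku", "happy"): (reward_model(seed_poem), seed_poem)}
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))  # pick an existing elite
        child = mutate(parent, rng)
        cell = (child["genre"], child["tone"])
        score = reward_model(child)
        # Keep the child only if its cell is empty or it beats the incumbent.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


if __name__ == "__main__":
    for cell, (score, _) in sorted(run_qd().items()):
        print(cell, score)
```

Plotting the per-cell scores as a grid gives the kind of map the project describes, with one axis per attribute and color indicating reward.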
Implementation and Results
The team modified the existing QDAIF implementation by replacing the fitness function with a reward model used to evaluate poetry. For poem generation and mutation they used llama-70b, while the reward model was a DeBERTa model fine-tuned on human preference data.
After running their implementation for 2500 iterations, they produced a map illustrating the relationship between various poem types and their associated rewards. Interestingly, their findings revealed that sonnets tended to receive the lowest overall rewards, with reflective tones also scoring poorly.
For readers who want to dig into the technical details of these projects, the source code is available in the OpenELM GitHub repository.
As the field of Reinforcement Learning continues to evolve, events like the New England RLHF Hackathon foster collaboration and innovation. Participants leave with valuable insights and the opportunity to contribute to the growing community of researchers and developers passionate about machine learning and AI.

