I am an Independent Alignment Researcher, contracting with Anthropic, focusing on scalable oversight and adversarial robustness. I enjoy fast empirical research with LLMs and have been supervised by Ethan Perez since Summer 2023, when I took part in the MATS Program. I am particularly proud of our paper on debate, which was awarded Best Paper at ICML 2024. I am also providing technical support and coaching for Ethan's MATS 7 cohort.
Concurrently, I advise Speechmatics and help them advance their speech recognition products, including Flow, their low-latency voice assistant. Before starting AI safety research, I worked as a machine learning engineer and manager at Speechmatics, helping to deliver their latest speech-to-text system, Ursa.
The aim of this website is for you to learn more about my projects, publications and hobbies (and maybe to enjoy some AI-generated art!). All accompanying code can be found on my GitHub. Thanks for visiting!
Publications
Best-of-N Jailbreaking
December 4th 2024 | Targeting ICML 2025 | John Hughes*, Sara Price*, Aengus Lynch*, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez†, Mrinank Sharma†
We introduce Best-of-N (BoN) Jailbreaking, a straightforward algorithm that effectively jailbreaks AI systems across modalities by repeatedly sampling prompt variations with simple augmentations. BoN achieves high attack success rates on models like GPT-4o and Claude 3.5 Sonnet, bypasses advanced defenses, and extends to other modalities (vision and audio). Its effectiveness increases with more samples, and the scaling behaviour follows a power law, highlighting how vulnerable AI systems are to subtle input changes. A minimal sketch of the sampling loop follows the links below.
Read the paper | Visit the repo | Website & Examples | Twitter Thread
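For readers curious what the core loop looks like, here is a minimal Python sketch of the text-modality version. The `query_model` and `is_harmful` callables are placeholders for the target LLM API and a harmfulness judge (they are not from the released code), and the augmentations shown are illustrative rather than the paper's exact set.

```python
import random
import string


def augment(prompt: str) -> str:
    """Apply simple random augmentations: capitalisation flips and occasional
    character-level noise (illustrative, not the paper's exact augmentations)."""
    out = []
    for ch in prompt:
        if ch.isalpha() and random.random() < 0.1:
            ch = ch.swapcase()  # randomly flip capitalisation
        out.append(ch)
        if random.random() < 0.02:
            out.append(random.choice(string.ascii_letters))  # inject ASCII noise
    return "".join(out)


def best_of_n_jailbreak(prompt: str, query_model, is_harmful, n: int = 1000):
    """Sample up to n augmented prompts against the target model and return the
    first (prompt, response) pair judged harmful, or None if none succeed.

    `query_model(prompt) -> str` and `is_harmful(response) -> bool` are supplied
    by the caller (e.g. an API client and a classifier)."""
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None
```

The power-law behaviour mentioned above describes how the attack success rate grows as the sampling budget n increases; since each sample is independent, the loop also parallelises trivially.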
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
December 3rd 2024 | NeurIPS 2024 AdvML Frontiers (Oral) & SoLaR | Tony T. Wang*, John Hughes*, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez†
We explored the challenge of jailbreak defense when it is narrowed to prohibiting just one specific behavior. Using the prevention of bomb-making assistance from an LLM as a case study, we found that common defenses like safety training, adversarial training, and input/output classifiers do not completely address the issue. We developed a transcript-classifier defense that surpasses these traditional methods, yet it still occasionally fails, underscoring the difficulty of jailbreak defense even within a narrow scope.
Debating with More Persuasive LLMs Leads to More Truthful Answers
February 9th 2024 | ICML 2024 Oral & Best Paper | Akbir Khan*, John Hughes*, Dan Valentine*, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel†, Ethan Perez†
We investigated the potential of weaker language models (non-experts) to assess the correctness of stronger models (experts) via LLM debate, demonstrating a significant improvement in accuracy for both non-expert models and humans on the QuALITY comprehension task. We also pioneered the optimisation of expert debaters for persuasiveness in an unsupervised manner, which further improved non-experts' ability to identify accurate answers during debates.
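As a rough illustration of the protocol (not our actual prompts or judging setup), here is a minimal Python sketch in which two expert debaters argue for opposing answers over several rounds and a weaker judge picks the answer it finds more convincing. The `generate(model, prompt) -> str` callable is an assumed stand-in for whatever LLM API is used.

```python
def run_debate(question, answer_a, answer_b, debater, judge, generate, rounds=3):
    """Two expert debaters argue for opposing answers; a weaker judge reads
    the full transcript and returns 'A' or 'B'."""
    transcript = []
    for r in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            prompt = (
                f"Question: {question}\n"
                f"You are debater {side}. Argue that the answer is: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript) + "\n"
                f"Give your argument for round {r + 1}."
            )
            transcript.append(f"Debater {side}: {generate(debater, prompt)}")

    judge_prompt = (
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        "Debate transcript:\n" + "\n".join(transcript) + "\n"
        "Which answer is correct? Reply with exactly 'A' or 'B'."
    )
    return generate(judge, judge_prompt).strip()
```

Optimising the debaters for persuasiveness, the second contribution above, amounts to choosing the `debater` policy for its ability to win debates, without any supervision on answer correctness.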
Other Supported Work
October 17th 2024 | ICLR 2025 (in review) | Felix J Binder*, James Chua*, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
We study introspection in LLMs, defining it as acquiring knowledge from internal states rather than training data, and finetune models to predict their own behavior in hypothetical scenarios. Experiments show evidence of introspection in simple tasks but highlight challenges in more complex or out-of-distribution cases.
July 21st 2024 | ICLR 2025 (in review); NeurIPS 2024 AdvML Frontiers & SoLaR & Red Teaming GenAI & SATA (4 Orals) | Rylan Schaeffer*, Dan Valentine*, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
We study the transferability of gradient-based image jailbreaks in vision-language models (VLMs) and find that such attacks are highly specific, with little-to-no transfer between models, even with shared architectures or training methods. However, transfer improves when attacking ensembles of highly similar VLMs, highlighting their relative robustness compared to language models and image classifiers.
April 1st 2024 | COLM 2024 | Matthias Gerstgrasser*, Rylan Schaeffer*, Apratim Dey*, Rafael Rafailov*, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
We study the effects of training generative models on their own outputs and show that replacing real data with synthetic data leads to model collapse, while accumulating synthetic data alongside real data avoids this issue.
February 19th 2020 | NeurIPS 2020 | Will Williams*, Sam Ringer*, Tom Ash, John Hughes, David MacLeod, Jamie Dougherty
We propose a hierarchical VQ-VAE approach for lossy image compression, introducing a novel training objective that leverages stochastic quantization and hierarchical latent structures. Our method achieves high perceptual quality and semantic feature retention at very low bitrates, demonstrated on the CelebA and MNIST datasets. Visit the repo.
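As a small illustration of the stochastic quantization idea (a sketch only; the codebook size, latent dimension, and temperature below are arbitrary, and the hierarchical structure and full training objective are omitted), an encoder output can be quantized by sampling a codebook entry rather than always taking the nearest one:

```python
import torch


def stochastic_quantize(z, codebook, temperature=1.0):
    """Quantize latents z (batch, dim) by sampling codebook entries from a
    softmax over negative squared distances, instead of a hard argmin."""
    dists = torch.cdist(z, codebook) ** 2                # (batch, num_codes)
    probs = torch.softmax(-dists / temperature, dim=-1)  # nearer codes are likelier
    indices = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return codebook[indices], indices


# Illustrative usage with random data.
codebook = torch.randn(512, 64)   # 512 codes, 64-dimensional latents
z = torch.randn(8, 64)            # a batch of 8 encoder outputs
quantized, idx = stochastic_quantize(z, codebook)
```

The sampling step is what makes the quantization stochastic; gradient handling and the hierarchical latents used in the paper are not shown here.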