I'm currently doing a PhD at the Gatsby Computational Neuroscience Unit in London, where I'm lucky to be co-supervised by Andrew Saxe and Felix Hill. I'm interested in building and better understanding safe AIs that can push the frontiers of human reasoning and help us do science as safe and helpful assistants. During my PhD, I've had the pleasure of interning at Meta FAIR (with Ari Morcos and then on the Llama 3/3.1 team) and Google DeepMind (on the Grounded Language Agents team).

Before my PhD, I did my undergrad and master's at MIT, where I double majored in Computer Science and Neuroscience, and was supervised by Boris Katz and Ila Fiete. I also did a smattering of internships in applied math, computational physics, software engineering, computer vision, and quantitative research on my path to figuring out what I wanted to do.

Email  |  Twitter  |  CV  |  Scholar  |  GitHub


In-Context Learning
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C.Y. Chan*, Andrew M. Saxe*
International Conference on Machine Learning (ICML), 2024 (Spotlight)
arxiv | github | tweet

Induction heads are thought to be responsible for much of in-context learning. We take inspiration from optogenetics in neuroscience to introduce a framework for measuring and manipulating activations in a deep network throughout training. We introduce the method of clamping to better understand what gives rise to the phase change characteristic of induction circuit formation, finding that the interaction between three smoothly evolving sub-circuits yields this sudden drop in the loss.

The transient nature of emergent in-context learning in transformers
Aaditya K. Singh*, Stephanie C.Y. Chan*, Ted Moskovitz, Erin Grant, Andrew M. Saxe**, Felix Hill**
Neural Information Processing Systems (NeurIPS), 2023
arxiv | neurips | github | tweet

We train transformers on synthetic data designed so that both in-context learning (ICL) and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL is often transient, meaning it first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.

Reproduced by this recent paper in naturalistic settings, and this other paper in a different synthetic setting.

Data distributional properties drive emergent in-context learning in transformers
Stephanie C.Y. Chan, Adam Santoro, Andrew Kyle Lampinen, Jane X. Wang, Aaditya K. Singh, Pierre H. Richemond, Jay McClelland, Felix Hill
Neural Information Processing Systems (NeurIPS), 2022 (Oral)
arxiv | neurips | github | tweet

Large transformer-based models are able to perform in-context few-shot learning, without being explicitly trained for it. We find that in-context learning (ICL) emerges when the training data exhibits particular distributional properties such as burstiness (items appear in clusters rather than being uniformly distributed over time) and having large numbers of rarely occurring classes. We also find that in-context learning typically trades off against more conventional weight-based learning, but that the two modes of learning can co-exist in a single model when it is trained on data following a skewed Zipfian distribution (a common property of naturalistic data, including language).
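As a rough illustration of what a skewed Zipfian class distribution looks like, here is a minimal sketch that samples class labels with probability proportional to 1 / rank^alpha. The function names, the exponent `alpha`, and the class counts are assumptions chosen for illustration, not the paper's actual data pipeline.

```python
import random
from collections import Counter


def zipf_weights(num_classes: int, alpha: float = 1.0) -> list[float]:
    """Unnormalized Zipfian weights: the class at rank r gets weight 1 / r**alpha."""
    return [1.0 / (rank ** alpha) for rank in range(1, num_classes + 1)]


def sample_classes(num_classes: int, num_samples: int,
                   alpha: float = 1.0, seed: int = 0) -> list[int]:
    """Draw class indices (0 = most frequent) from the Zipfian distribution."""
    rng = random.Random(seed)
    weights = zipf_weights(num_classes, alpha)
    return rng.choices(range(num_classes), weights=weights, k=num_samples)


# Hypothetical sizes: 1000 classes, 10000 draws.
labels = sample_classes(num_classes=1000, num_samples=10_000)
counts = Counter(labels)
# A few head classes dominate while most classes appear rarely or never.
print(counts.most_common(3))
```

Under these assumptions, a handful of common classes coexist with a long tail of rare ones, which is the kind of skew the abstract describes.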

Our insights were used by this recent paper to elicit ICL in RL settings.

Large Language Models
The Llama 3 herd of models
Llama team, AI@Meta, Contributors: Aaditya K. Singh, ...
arxiv | github | tweet

Llama 3 is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens.

My contributions: Math pretraining data (S3.1.1), Scaling laws (S3.2.1), and Evals (S5).

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya K. Singh, DJ Strouse
In submission, 2024
arxiv | github | tweet

Through a series of carefully controlled, inference-time experiments, we find evidence of strong (scale-dependent) number tokenization-induced inductive biases in numerical reasoning in frontier LLMs. Specifically, we demonstrate that GPT-3.5 and GPT-4 models show largely improved performance when using right-to-left (as opposed to default left-to-right) number tokenization. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. These effects are weaker in larger models (GPT-4) yet stronger in newer, smaller models (GPT-4 Turbo).
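To make the left-to-right versus right-to-left distinction concrete, here is a minimal sketch that splits a digit string into chunks of up to three digits in each direction. It is an illustration only: the chunk size of three and the function names `chunk_l2r` and `chunk_r2l` are assumptions, and no real tokenizer is called.

```python
def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    """Split left to right: any short chunk ends up at the right end."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    """Split right to left: any short chunk ends up at the left end,
    so chunk boundaries line up with thousands separators."""
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


print(chunk_l2r("1234567"))  # ['123', '456', '7']
print(chunk_r2l("1234567"))  # ['1', '234', '567']
```

With right-to-left chunking, each chunk always covers the same place values (ones to hundreds, thousands to hundred-thousands, and so on) regardless of the number's length, whereas left-to-right chunking shifts that alignment depending on the digit count.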

Reproduced by this recent blog post in newer models (e.g., Llama 3). Claude 3, released after our work and with SOTA math capabilities, also notably uses R2L tokenization.

Confronting reward model overoptimization with constrained RLHF
Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen McAleer
International Conference on Learning Representations (ICLR), 2024 (Spotlight)
arxiv | openreview | tweet

Optimizing large language models to align with human preferences via reinforcement learning from human feedback can often suffer from overoptimization. Furthermore, human preferences are often multi-faceted, requiring many sub-components. In this work, we study overoptimization in composite reward models (RMs), showing that correlation between component RMs has a significant effect. We then introduce an approach to circumvent this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance.

Quantifying Variance in Evaluation Benchmarks
Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes
In submission, 2024
arxiv | tweet

We quantify variance in evaluation benchmarks through a range of metrics, including the difference in performance across ten 7B-parameter training runs of 210B tokens each, which differ only in their initialization. We find that continuous metrics often show less variance (higher signal-to-noise), suggesting they may be more useful when doing pretraining ablations, especially at smaller compute scales. Furthermore, we find that methods from human testing (e.g., item analysis or item response theory) are not effective at reducing variance.

Brevity is the soul of wit: pruning long files for code generation
Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos
Data-centric Machine Learning Research Workshop @ ICML, 2024
arxiv | tweet

Longer files are often conflated with "higher quality" data. This breaks down for code! The longest Python files in the public Stack dataset are often nonsensical, yet make up a disproportionate amount of tokens (2% of files make up 20% of tokens). We provide qualitative and quantitative evidence for this, ending with the causal experiment: pruning these files leads to modest improvements in efficiency and/or performance at small compute scales. As compute is scaled up, benefits diminish, as seen in related work.

Decoding data quality via synthetic corruptions: embedding-guided pruning of code data
Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos, Newsha Ardalani
Efficient Natural Language and Speech Processing Workshop @ NeurIPS, 2023 (Oral)
arxiv | tweet

Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods.

Know your audience: specializing grounded language models with listener subtraction
Aaditya K. Singh, David Ding, Andrew M. Saxe, Felix Hill, Andrew Kyle Lampinen
European Chapter of the Association for Computational Linguistics (EACL), 2023
arxiv | ACL Anthology | tweet

Our Perceiver IO-inspired cross-attention adapter was used by concurrent work and shown to be generally useful for image captioning.

