Mayug Maniparambil

I am a Research Scientist at Fin AI (previously Intercom), where my research focuses on RL post-training for LLMs and VLMs, with a particular interest in LLM reasoning and exploration in reinforcement learning.

I completed my PhD at the SFI Centre for Research Training in Machine Learning (ML Labs), where I focused on multimodal learning, efficient vision-language model alignment, and low-data training strategies. My research explored representational similarity, "platonic representations", and universal embeddings in vision and language encoders, and was supervised by Prof. Noel O'Connor and the late Dr. Kevin McGuinness.

Before this, I worked at Qure.ai as a computer vision researcher, developing weakly supervised and active learning-based models for cranial bleed detection and segmentation in CT imaging. During my undergraduate studies, I also collaborated with the Computational Imaging Lab at IIT Madras, under the guidance of Prof. Kaushik Mitra, on deep generative methods for phase retrieval and medical image denoising.

I hold a dual degree (B.Tech + M.Tech) in Electrical Engineering with a specialization in Signal Processing from IIT Madras. My research has been published at CVPR, ICCV, and BMVC, and spans vision-language models, LLMs, few-shot learning, and cross-modal pretraining. I recently interned at Amazon Robotics, Berlin, focusing on vision-language models and domain adaptation in robotic defect detection systems.

Email / GitHub / Google Scholar / LinkedIn / CV


Research

My research interests include LLM reasoning, RL post-training and exploration for LLMs and VLMs, multimodal learning and representation alignment, and developing efficient machine learning models with limited supervision.


When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer


Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid
arXiv, 2026
arxiv

We study cross-domain transfer of reinforcement learning with verifiable rewards (RLVR) in a 7B model whose SFT and RL stages use only constraint-satisfaction puzzles, with no math in the post-training data. Using a framework that analyzes reasoning at the level of primitives, we show that puzzle SFT induces a reasoning vocabulary and RL composes it into longer compute-verify chains, but also suppresses exploratory primitives such as hypothesize and backtrack. A novelty bonus that rewards diverse correct rollouts restores these recovery primitives and raises the hard-math ceiling from 16% to 36% pass@32, without ever training on mathematics.
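
The pass@32 figure above measures the chance that at least one of 32 sampled rollouts is correct. A minimal sketch of the commonly used unbiased pass@k estimator follows. Whether the paper computes pass@32 with this exact estimator is an assumption, and the function name `pass_at_k` and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled rollouts, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n rollouts contains no correct rollout.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect rollouts: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 4 rollouts for one problem, 1 of them correct.
print(pass_at_k(n=4, c=1, k=2))  # 0.5
```

A benchmark-level score would average this per-problem estimate over all problems.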


TopoBench: Benchmarking LLMs on Hard Topological Reasoning


Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel E. O'Connor, Fergal Reid
ICLR Workshop, 2026
arxiv / website

Topological grid puzzles require reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry, which remains challenging for even the most powerful LLMs. TopoBench comprises six puzzle families across three difficulty levels, and even frontier models solve fewer than a quarter of hard instances. Annotating 750 chain-of-thought traces with an error taxonomy, we find premature commitment and constraint forgetting are the dominant failure modes, and that the bottleneck lies in extracting constraints from spatial representations rather than in reasoning over them.


Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters


Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy
CVPR Findings, 2026
arxiv

Most few-shot CLIP adaptation methods select the adapter blending ratio on the test set or via an extra validation set, and so are not strictly few-shot. We introduce Hold-One-Shot-Out (HOSO), which learns the blending ratio from a single hold-out shot while the adapter trains on the remaining support examples. Under a strict validation-free protocol, HOSO-Adapter outperforms the CLIP-Adapter baseline by more than 4 percentage points on average across 11 standard few-shot datasets.


Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?


Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier, Kathleen M. Curran, Noel E. O'Connor
CVPR Workshop, 2026
arxiv

We ask whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, comparing them against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that embeds pretrained, frozen visual encoders (CLIP, DINOv2, and BiomedCLIP) within each cascade to guide reconstruction. While task-specific models such as E2E-VarNet lead in standard in-distribution settings, foundation-model-based approaches remain competitive.


Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment


Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Noel E. O'Connor
CVPR (accepted), 2025
arxiv / code

We propose a framework for aligning vision and language modalities using frozen unimodal encoders. Our analysis reveals that semantically aligned encoder pairs can be effectively connected through lightweight projection layers. By training simple MLP projectors within this framework, we achieve 76% accuracy on ImageNet, while reducing data requirements by 20× and compute by 65× compared to traditional multimodal alignment approaches. By building on powerful pretrained unimodal vision and language models, this method makes multimodal model development more accessible and enables flexible adaptation to tasks such as zero-shot segmentation, multilingual retrieval, and classification.


Pinpoint Counterfactuals: Reducing Social Bias in Foundation Models via Localized Counterfactual Generation


Kirill Sirotkin, Marcos Escudero-Viñolo, Pablo Carballeira, Mayug Maniparambil, Catarina Barata, Noel E. O'Connor
arXiv, 2024
arxiv

We introduce a localized counterfactual generation method that addresses societal biases in foundation models by constraining modifications to specific attribute-relevant regions through automated masking and guided inpainting. Applied to the Conceptual Captions dataset for creating gender counterfactuals, our approach achieves higher visual and semantic fidelity compared to existing methods, while preserving model performance on non-human-centric tasks. Fine-tuning models with our counterfactuals demonstrates measurable bias reduction across multiple metrics, establishing a framework for creating balanced datasets that enable both accurate bias profiling and effective mitigation.


Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero-shot Medical Image Segmentation


Sidra Aleem, Fangyijie Wang, Mayug Maniparambil, Eric Arazo, Julia Dietlmeier, Kathleen Curran, Noel E. O'Connor, Suzanne Little
CVPR Workshops (Oral), 2024
arxiv / code

We introduce SaLIP, a training-free framework that combines SAM and CLIP for zero-shot medical image segmentation. Our method uses CLIP to select relevant regions and SAM to segment them accurately, achieving significant improvements over baseline SAM across multiple medical imaging tasks.


The STOIC2021 COVID-19 AI Challenge: Applying Reusable Training Pipelines to Medical Imaging


Dominik Müller, Mayug Maniparambil, et al.
Medical Image Analysis, 2024
arxiv

This study presents the outcomes of the STOIC2021 challenge, highlighting the effectiveness of reusable training pipelines in medical imaging tasks related to COVID-19 diagnosis.


Do Vision and Language Encoders Represent the World Similarly?


Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor
CVPR, 2024
arxiv / code

We investigate whether independently trained vision and language encoders learn similar representations of the world. Using Centered Kernel Alignment (CKA), we find that unaligned vision and language encoders exhibit semantic similarities in their representation spaces. We propose two methods, a Fast Quadratic Assignment Problem (QAP) optimization and a novel localized CKA metric-based matching, to align these representations without additional training. We demonstrate the effectiveness of these methods on downstream tasks such as cross-lingual and cross-domain caption matching and image classification.
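
As a rough illustration of the similarity measure named above, here is a minimal sketch of linear-kernel CKA between two sets of embeddings for the same items. Restricting it to the linear kernel is an assumption made for brevity (CKA can also be computed with other kernels), and the function names and toy inputs are hypothetical rather than taken from the paper's code.

```python
import math


def center_columns(rows):
    """Subtract each feature's mean so every column sums to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - mu for v, mu in zip(row, means)] for row in rows]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-aligned a (n x p), b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F).

    x and y hold one row per item, in the same item order. Their feature
    dimensions may differ. Assumes neither input is constant across items
    (otherwise the denominator is zero).
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )
    return numerator / denominator


# Toy stand-ins for encoder outputs: 4 items, 2-d and 3-d features.
image_feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 0.0]]
text_feats = [[0.5, 1.0, 0.0], [1.0, 2.5, 0.1], [2.0, 3.0, 0.3], [1.5, 0.2, 0.0]]
print(linear_cka(image_feats, text_feats))
```

Because linear CKA is invariant to rotations and uniform rescaling of either feature space, it can compare encoders whose embedding spaces were never trained to share coordinates.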


Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts


Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O'Connor
ICCV, 2023
arxiv / code

We demonstrate how GPT-4 can generate visually descriptive prompts to enhance CLIP’s zero-shot performance on fine-grained datasets. Our approach significantly improves accuracy and introduces a novel few-shot adapter that outperforms existing methods such as CoCoOp.


BaseTransformers: Attention over Base Data-Points for One Shot Learning


Mayug Maniparambil, Kevin McGuinness, Noel E. O'Connor
BMVC, 2022
arxiv

We propose BaseTransformers, a novel approach that leverages attention mechanisms over base data-points to enhance one-shot learning performance. Our method achieves state-of-the-art results on multiple benchmarks.


Phase Retrieval for Fourier Ptychography under Varying Amount of Measurements


Lokesh Boominathan, Mayug Maniparambil, Honey Gupta, Rahul Baburajan, Kaushik Mitra
BMVC, 2018
arxiv

We explore phase retrieval techniques for Fourier Ptychography, focusing on scenarios with varying measurement quantities. Our findings contribute to improved imaging quality in computational photography.


Design and source code from Jon Barron's website