Mayug Maniparambil

I am a Research Scientist at Fin AI (previously Intercom), where my research focuses on RL post-training for LLMs and VLMs, with a particular interest in LLM reasoning and exploration in reinforcement learning.

I completed my PhD at the SFI Centre for Research Training in Machine Learning (ML Labs), where I focused on multimodal learning, efficient vision-language model alignment, and low-data training strategies. My research explored representational similarity, "platonic representations", and universal embeddings in vision and language encoders, and was supervised by Prof. Noel O'Connor and the late Dr. Kevin McGuinness.

Before this, I worked at Qure.ai as a computer vision researcher, developing weakly supervised and active learning-based models for cranial bleed detection and segmentation in CT imaging. During my undergraduate studies, I also collaborated with the Computational Imaging Lab at IIT Madras, under the guidance of Prof. Kaushik Mitra, on deep generative methods for phase retrieval and medical image denoising.

I hold a dual degree (B.Tech + M.Tech) in Electrical Engineering with a specialization in Signal Processing from IIT Madras. My research has been published at CVPR, ICCV, and BMVC, and spans topics in Vision Language Models, LLMs, few-shot learning, and cross-modal pretraining. I recently interned at Amazon Robotics, Berlin, focusing on vision-language models and domain adaptation in robotic defect detection systems.

Email / GitHub / Google Scholar / LinkedIn / CV

Research

My research interests include LLM reasoning, RL post-training and exploration for LLMs and VLMs, multimodal learning and representation alignment, and developing efficient machine learning models with limited supervision.

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer
Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid
arXiv, 2026
arxiv
We study cross-domain transfer of reinforcement learning with verifiable rewards (RLVR) in a 7B model whose SFT and RL stages use only constraint-satisfaction puzzles, with no math in the post-training data. Using a framework that operates at the level of reasoning primitives, we show that puzzle SFT induces a reasoning vocabulary and RL composes it into longer compute-verify chains, but also suppresses exploratory primitives such as hypothesize and backtrack. A novelty bonus that rewards diverse correct rollouts restores these recovery primitives and raises the hard-math ceiling from 16% to 36% pass@32, without ever training on mathematics.

TopoBench: Benchmarking LLMs on Hard Topological Reasoning
Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel E. O'Connor, Fergal Reid
ICLR Workshop, 2026
arxiv / website
Topological grid puzzles require reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry, which remains challenging for even the most powerful LLMs. TopoBench comprises six puzzle families across three difficulty levels, and even frontier models solve fewer than a quarter of hard instances. Annotating 750 chain-of-thought traces with an error taxonomy, we find that premature commitment and constraint forgetting are the dominant failure modes, and that the bottleneck lies in extracting constraints from spatial representations rather than in reasoning over them.

Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters
Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy
CVPR Findings, 2026
arxiv
Most few-shot CLIP adaptation methods select the adapter blending ratio on the test set or via an extra validation set, and so are not strictly few-shot. We introduce Hold-One-Shot-Out (HOSO), which learns the blending ratio from a single hold-out shot while the adapter trains on the remaining support examples. Under a strict validation-free protocol, HOSO-Adapter outperforms the CLIP-Adapter baseline by more than 4 percentage points on average across 11 standard few-shot datasets.

Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier, Kathleen M. Curran, Noel E. O'Connor
CVPR Workshop, 2026
arxiv
We ask whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, comparing them against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that embeds pretrained, frozen visual encoders (CLIP, DINOv2, and BiomedCLIP) within each cascade to guide reconstruction. While task-specific models such as E2E-VarNet lead in standard in-distribution settings, foundation-model-based approaches remain competitive.

Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Noel E. O'Connor
CVPR (accepted), 2025
arxiv / code
We propose a novel framework for aligning vision and language modalities using frozen unimodal encoders. Our analysis reveals that semantically aligned encoder pairs can be effectively connected through lightweight projection layers. By training simple MLP projectors within this framework, we achieve 76% accuracy on ImageNet, while reducing data requirements by 20× and compute by 65× compared to traditional multimodal alignment approaches. By leveraging powerful, pretrained unimodal vision and language models, this method makes multimodal model development more accessible and enables flexible adaptation to tasks such as zero-shot segmentation, multilingual retrieval, and classification.

Pinpoint Counterfactuals: Reducing Social Bias in Foundation Models via Localized Counterfactual Generation
Kirill Sirotkin, Marcos Escudero-Viñolo, Pablo Carballeira, Mayug Maniparambil, Catarina Barata, Noel E. O'Connor
arXiv, 2024
arxiv
We introduce a localized counterfactual generation method that addresses societal biases in foundation models by constraining modifications to specific attribute-relevant regions through automated masking and guided inpainting. Applied to the Conceptual Captions dataset for creating gender counterfactuals, our approach achieves higher visual and semantic fidelity than existing methods, while preserving model performance on non-human-centric tasks. Fine-tuning models with our counterfactuals demonstrates measurable bias reduction across multiple metrics, establishing a framework for creating balanced datasets that enable both accurate bias profiling and effective mitigation.

Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero-shot Medical Image Segmentation
Sidra Aleem, Fangyijie Wang, Mayug Maniparambil, Eric Arazo, Julia Dietlmeier, Kathleen Curran, Noel E. O'Connor, Suzanne Little
CVPR Workshops (Oral), 2024
arxiv / code
We introduce SaLIP, a training-free framework that combines SAM and CLIP for zero-shot medical image segmentation. Our method uses CLIP to select relevant regions and SAM to segment them accurately, achieving significant improvements over baseline SAM across multiple medical imaging tasks.

The STOIC2021 COVID-19 AI Challenge: Applying Reusable Training Pipelines to Medical Imaging
Dominik Müller, Mayug Maniparambil, et al.
Medical Image Analysis, 2024
arxiv
This study presents the outcomes of the STOIC2021 challenge, highlighting the effectiveness of reusable training pipelines in medical imaging tasks related to COVID-19 diagnosis.

Do Vision and Language Encoders Represent the World Similarly?
Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor
CVPR, 2024
arxiv / code
This paper investigates whether independently trained vision and language encoders learn similar representations of the world. Using Centered Kernel Alignment (CKA), the study finds that unaligned vision and language encoders exhibit semantic similarities in their representation spaces. The authors propose two methods, a Fast Quadratic Assignment Problem (QAP) optimization and a novel localized CKA metric-based matching, to align these representations without additional training. The effectiveness of these methods is demonstrated on downstream tasks such as cross-lingual and cross-domain caption matching and image classification.

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts
Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O'Connor
ICCV, 2023
arxiv / code
We demonstrate how GPT-4 can generate visually descriptive prompts to enhance CLIP’s zero-shot performance on fine-grained datasets. Our approach significantly improves accuracy and introduces a novel few-shot adapter that outperforms existing methods such as CoCoOp.

BaseTransformers: Attention over Base Data-Points for One Shot Learning
Mayug Maniparambil, Kevin McGuinness, Noel E. O'Connor
BMVC, 2022
arxiv
We propose BaseTransformers, a novel approach that leverages attention mechanisms over base data-points to enhance one-shot learning performance. Our method achieves state-of-the-art results on multiple benchmarks.

Phase Retrieval for Fourier Ptychography under Varying Amount of Measurements
Lokesh Boominathan, Mayug Maniparambil, Honey Gupta, Rahul Baburajan, Kaushik Mitra
BMVC, 2018
arxiv
We explore phase retrieval techniques for Fourier Ptychography, focusing on scenarios with varying measurement quantities. Our findings contribute to improved imaging quality in computational photography.

Design and source code from Jon Barron's website
