About

4th-year Ph.D. Student

Department of Electrical & Computer Engineering

Worcester Polytechnic Institute, MA, USA

Welcome to my website! I am pursuing my Ph.D. under the supervision of Dr. Bashima Islam in the Department of Electrical and Computer Engineering at WPI.

My research centers on spatial acoustic reasoning for Audio Large Language Models, focusing on how AI systems can infer geometry, location, and physical context from sound using stepwise, verifiable reward for transparent and interpretable reasoning. I also study multi-modal large language models (MLLMs) that integrate audio, speech, sensor, and vision inputs, with an emphasis on efficient modality switching for egocentric perception in dynamic real-world environments.

I worked as a Research Scientist Intern at Meta Reality Labs with the Audio Research Group in Summer 2024, and as a Part-Time Student Researcher with the same group in Fall 2024. Previously, I worked as a Software Engineer, AI & IoT at Advanced Chemical Industries Limited (ACI).

Prior to joining WPI, I completed my Bachelor's in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET).

Please take some time to explore my website to learn more about me, my research, and my professional experience. Whether you are a fellow engineer, researcher, potential collaborator, or simply interested in talking with me about something, please feel free to contact me via email.


Seeking Opportunities

I'm actively looking for internship, full-time industrial research scientist, and post-doc positions focused on Reasoning Models through Verifiable Rewards, Multimodal Learning, Generative Modeling, and related fields. Reach out to me at sbiswas@wpi.edu if you think I would be a good fit!


News

01-2026

OWL got accepted at ICLR, 2026

01-2026

My internship work at Meta Reality Labs on hair-noise suppression for Ray-Ban Meta Glasses was accepted at ICASSP, 2026

10-2025

Received Peter B. Myers Graduate Fellowship from Dept. of ECE, WPI.

08-2025

RAVEN got accepted at EMNLP Main Conference, 2025.

07-2025

LOCUS got accepted at EWSN, 2025.

06-2025

EgoAdapt got accepted at ICCV, 2025.

05-2025

Our paper QUADS got accepted at INTERSPEECH, 2025.

12-2024

Received Master of Science in Electrical and Computer Engineering from WPI.

08-2024

I will be joining Meta Reality Labs as a Part-Time Student Researcher.

06-2024

Our paper on multimodal disfluency detection got accepted at INTERSPEECH, 2024.

05-2024

I will be joining Meta Reality Labs as a Research Scientist Intern.

05-2024

Our paper FreeML got accepted at EWSN, 2024.

11-2023

Passed Ph.D. diagnostic exam.

05-2023

Started working as a Graduate Research Assistant at BASH Lab, ECE, WPI.

08-2022

Started working as a Graduate Teaching Assistant at Dept. of ECE, WPI.

08-2022

I will be starting my Ph.D. at BASH Lab, ECE, WPI.

02-2021

Received my Bachelor of Science in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET).

02-2021

Defended my undergrad thesis titled "A Deep Learning Based Energy Efficient Downlink Power Control Mechanism for Cellular Networks".

01-2021

I will be starting as a Software Engineer, AI & IoT at ACI Limited.


Education

Ph.D. in Electrical and Computer Engineering
Worcester Polytechnic Institute, Worcester, MA, USA
August 2022 - May 2027 (Expected)
Tentative Thesis Title: Toward Robust and Efficient Reasoning in Perceptually Grounded Multi-modal Large Language Models
MSc. in Electrical and Computer Engineering
Worcester Polytechnic Institute, Worcester, MA, USA
August 2022 - December 2024
BSc. in Electrical and Electronic Engineering
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
February 2016 - February 2021
Thesis: A Deep Learning Based Energy Efficient Downlink Power Control Mechanism for Cellular Networks

Publications

OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
International Conference on Learning Representations (ICLR'26)
DOI PDF Video Code
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial-Acoustic Geometry Encoder (SAGE), a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and simulated room-impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain-of-thought to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, OWL supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11° through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT.
Hair Noise Analysis and Mitigation for Smart Glasses Audio Captures
Subrata Biswas, Daniel Wong, Bashima Islam, Sanjeel Parekh, Vladimir Tourbabin
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'26)
DOI PDF Video Code
Head-worn devices such as augmented-reality (AR) and smart glasses introduce a previously overlooked form of audio degradation: hair noise, caused by the wearer's hair brushing against device frames and embedded microphones. To the best of our knowledge, this phenomenon has not been systematically studied. This paper addresses this gap through three contributions. First, we conduct a user study quantifying the perceptual annoyance of hair noise. Second, we introduce the Hair Noise Mitigation (HNM) dataset, the first multi-channel corpus of hair noise collected across diverse real-world conditions. We further characterize its spectral and spatial properties, revealing a non-stationary and directionally dependent nature. Finally, we propose online and offline semi-supervised non-negative matrix factorization (NMF) methods as benchmark mitigation approaches, showing perceptual gains that motivate further research. Together, these contributions establish hair noise as a distinct challenge for wearable audio systems and lay the groundwork for tailored enhancement techniques.
RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Empirical Methods in Natural Language Processing (EMNLP'25) (main)
DOI PDF Video Code
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch.
EGOADAPT: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, Ruohan Gao
International Conference on Computer Vision (ICCV'25)
DOI PDF Video Code
Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EGOADAPT, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6×, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.
QUADS: QUAntized Distillation Framework for Efficient Speech Language Understanding
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
INTERSPEECH'25
DOI PDF Video Code
Spoken Language Understanding (SLU) systems must balance performance and efficiency, particularly in resource-constrained environments. Existing methods apply distillation and quantization separately, leading to suboptimal compression as distillation ignores quantization constraints. We propose QUADS, a unified framework that optimizes both through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy.
LOCUS – LOcalization with Channel Uncertainty and Sporadic Energy
Subrata Biswas, Mohammad Nur Hossain Khan, Alex Colwell, Jack Adiletta, Bashima Islam
International Conference On Embedded Wireless Systems and Networks (EWSN'25)
DOI PDF Video Code
Accurate sound source localization (SSL), such as direction-of-arrival (DoA) estimation, relies on consistent multichannel data. However, batteryless systems often suffer from missing data due to the stochastic nature of energy harvesting, degrading localization performance. We propose LOCUS, a deep learning framework that recovers corrupted features in such settings. LOCUS integrates three modules: (1) Information-Weighted Focus (InFo) to identify corrupted regions, (2) Latent Feature Synthesizer (LaFS) to reconstruct missing features, and (3) Guided Replacement (GRep) to restore data without altering valid inputs.
Missingness-resilient Video-enhanced Multimodal Disfluency Detection
Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu
INTERSPEECH'24
DOI PDF Video Code
Most existing speech disfluency detection techniques only rely upon acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audio-visual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context.
Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems
Subrata Biswas, Pietro Farina, Eren Yildiz, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kasim Sinan Yildirim
International Conference On Embedded Wireless Systems and Networks (EWSN'24)
DOI PDF Video Code
Batteryless systems frequently face power failures, requiring extra runtime buffers to maintain inference progress and leaving only a memory space for storing ultra-tiny deep neural networks (DNNs). We combat these issues by proposing FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems.

Work Experience

Graduate Research Assistant
BASH LAB, Worcester, MA, USA
May 2023 – Present
Part-Time Student Researcher
Meta Reality Labs, Redmond, WA, USA
August 2024 – Present
Research Scientist Intern
Meta Reality Labs, Redmond, WA, USA
May 2024 – August 2024
Graduate Teaching Assistant
Dept. of ECE, WPI, Worcester, MA, USA
August 2022 – May 2023
Software Engineer, AI & IoT
Advanced Chemical Industries Limited, Dhaka, Bangladesh
February 2021 – August 2022

Awards

10-2025

Received Peter B. Myers Graduate Fellowship from Dept. of ECE, WPI.

08-2022

1st Runner-up at Robi Datathon 2.0.

10-2020

5th at IEEE Video and Image Processing Cup.

04-2020

4th at IEEE Signal Processing Cup.

06-2019

Winner of Bangladesh Section, IEEE YESIST12 Innovation Challenge 2019.