About

4th year Ph.D. Student

Department of Electrical & Computer Engineering

Worcester Polytechnic Institute, MA, USA

Welcome to my website! I am pursuing my Ph.D. under the supervision of Dr. Bashima Islam at the Department of Electrical and Computer Engineering, WPI.

My research centers on spatial acoustic reasoning for Audio Large Language Models, focusing on how AI systems can infer geometry, location, and physical context from sound using stepwise, verifiable rewards for transparent and interpretable reasoning. I also study multi-modal large language models (MLLMs) that integrate audio, speech, sensor, and vision inputs, with an emphasis on efficient modality switching for egocentric perception in dynamic real-world environments.

I worked as a Research Scientist Intern with the Audio Research Group at Meta Reality Labs in the summer of 2024 and as a Part-Time Student Researcher with the same group in Fall 2024. Previously, I worked as a Software Engineer, AI & IoT at Advanced Chemical Industries Limited (ACI).

Prior to joining WPI, I completed my Bachelor's in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET).

Please take some time to explore my website to learn more about me, my research, and my professional experience. Whether you are a fellow engineer, researcher, potential collaborator, or simply interested in talking with me about something, please feel free to contact me via email.


Seeking Opportunities

I'm actively looking for internship, full-time industrial research scientist, and post-doc positions focused on Reasoning Models with Verifiable Rewards, Multimodal Learning, Generative Modeling, and related fields. Reach out to me at sbiswas@wpi.edu if you think I would be a good fit!


News

01-2026

My internship work at Meta Reality Labs on hair-noise suppression for Ray-Ban Meta Glasses was accepted at ICASSP, 2026.

10-2025

Received Peter B. Myers Graduate Fellowship from Dept. of ECE, WPI.

08-2025

RAVEN got accepted at EMNLP Main Conference, 2025.

07-2025

LOCUS got accepted at EWSN, 2025.

06-2025

EgoAdapt got accepted at ICCV, 2025.

05-2025

Our paper QUADS got accepted at INTERSPEECH, 2025.

12-2024

Received Master of Science in Electrical and Computer Engineering from WPI.

08-2024

I will be joining Meta Reality Labs as a Part-Time Student Researcher.

06-2024

Our paper on multimodal disfluency detection got accepted at INTERSPEECH, 2024.

05-2024

I will be joining Meta Reality Labs as a Research Scientist Intern.

05-2024

Our paper FreeML got accepted at EWSN, 2024.

11-2023

Passed the Ph.D. diagnostic exam.

05-2023

Started working as a Graduate Research Assistant at BASH Lab, ECE, WPI.

08-2022

Started working as a Graduate Teaching Assistant at the Dept. of ECE, WPI.

08-2022

I will be starting my Ph.D. at BASH Lab, ECE, WPI.

02-2021

Received my Bachelor of Science in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET).

02-2021

Defended my undergrad thesis titled "A Deep Learning Based Energy Efficient Downlink Power Control Mechanism for Cellular Networks".

01-2021

I will be starting as a Software Engineer, AI & IoT at ACI Limited.


Education

Ph.D. in Electrical and Computer Engineering
Worcester Polytechnic Institute, Worcester, MA, USA
August 2022 - May 2027 (Expected)
Tentative Thesis Title: Toward Robust and Efficient Reasoning in Perceptually Grounded Multi-modal Large Language Models
MSc. in Electrical and Computer Engineering
Worcester Polytechnic Institute, Worcester, MA, USA
August 2022 - December 2024
BSc. in Electrical and Electronic Engineering
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
February 2016 - February 2021
Thesis: A Deep Learning Based Energy Efficient Downlink Power Control Mechanism for Cellular Networks

Publications

Hair Noise Analysis and Mitigation for Smart Glasses Audio Captures
Subrata Biswas, Daniel Wong, Bashima Islam, Sanjeel Parekh, Vladimir Tourbabin
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'26)
DOI PDF Video Code
Head-worn devices such as augmented-reality (AR) and smart glasses introduce a previously overlooked form of audio degradation: hair noise, caused by the wearer’s hair brushing against device frames and embedded microphones. To the best of our knowledge, this phenomenon has not been systematically studied. This paper addresses this gap through three contributions. First, we conduct a user study quantifying the perceptual annoyance of hair noise. Second, we introduce the Hair Noise Mitigation (HNM) dataset, the first multi-channel corpus of hair noise collected across diverse real-world conditions. We further characterize its spectral and spatial properties, revealing a non-stationary and directionally dependent nature. Finally, we propose online and offline semi-supervised non-negative matrix factorization (NMF) methods as benchmark mitigation approaches, showing perceptual gains that motivate further research. Together, these contributions establish hair noise as a distinct challenge for wearable audio systems and lay the groundwork for tailored enhancement techniques.
RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Empirical Methods in Natural Language Processing (EMNLP'25) (main)
DOI PDF Video Code
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning - each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch.
LOCUS – LOcalization with Channel Uncertainty and Sporadic Energy
Subrata Biswas, Mohammad Nur Hossain Khan, Alex Colwell, Jack Adiletta, Bashima Islam
International Conference On Embedded Wireless Systems and Networks (EWSN'25)
DOI PDF Video Code
Accurate sound source localization (SSL), such as direction-of-arrival (DoA) estimation, relies on consistent multichannel data. However, batteryless systems often suffer from missing data due to the stochastic nature of energy harvesting, degrading localization performance. We propose LOCUS, a deep learning framework that recovers corrupted features in such settings. LOCUS integrates three modules: (1) Information-Weighted Focus (InFo) to identify corrupted regions, (2) Latent Feature Synthesizer (LaFS) to reconstruct missing features, and (3) Guided Replacement (GRep) to restore data without altering valid inputs.
QUADS: QUAntized Distillation Framework for Efficient Speech Language Understanding
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
INTERSPEECH'25
DOI PDF Video Code
Spoken Language Understanding (SLU) systems must balance performance and efficiency, particularly in resource-constrained environments. Existing methods apply distillation and quantization separately, leading to suboptimal compression as distillation ignores quantization constraints. We propose QUADS, a unified framework that optimizes both through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy.
Missingness-resilient Video-enhanced Multimodal Disfluency Detection
Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu
INTERSPEECH'24
DOI PDF Video Code
Most existing speech disfluency detection techniques only rely upon acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audio-visual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context.
Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems
Subrata Biswas, Pietro Farina, Eren Yildiz, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kasim Sinan Yildirim
International Conference On Embedded Wireless Systems and Networks (EWSN'24)
DOI PDF Video Code
Batteryless systems frequently face power failures, requiring extra runtime buffers to maintain inference progress and leaving only a memory space for storing ultra-tiny deep neural networks (DNNs). We combat these issues by proposing FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems.

Work Experience

Graduate Research Assistant
BASH Lab, Worcester, MA, USA
May 2022 – Present
Part-Time Student Researcher
Meta Reality Labs, Redmond, WA, USA
August 2024 – Present
Research Scientist Intern
Meta Reality Labs, Redmond, WA, USA
May 2024 – August 2024
Graduate Teaching Assistant
Dept. of ECE, WPI, Worcester, MA, USA
August 2022 – May 2023
Software Engineer, AI & IoT
Advanced Chemical Industries Limited, Dhaka, Bangladesh
February 2021 – August 2022

Awards

10-2025

Received Peter B. Myers Graduate Fellowship from Dept. of ECE, WPI.

08-2022

1st Runner up at Robi Datathon 2.0.

10-2020

5th at IEEE Video and Image Processing Cup.

04-2020

4th at IEEE Signal Processing Cup.

06-2019

Winner of Bangladesh Section, IEEE YESIST12 Innovation Challenge 2019.