⌁ Hello, I'm —

Mohammad Nur Hossain Khan

Late-stage PhD candidate at Worcester Polytechnic Institute, advised by Dr. Bashima Islam. I build audio large language models and self-supervised models for physiological time-series.

My research sits at the intersection of multimodal LLMs, audio understanding, and AI for health. I work on audio moment retrieval over long-form recordings, multi-speaker diarization and non-speech vocalization recognition for infant-centric soundscapes, and graph-based self-supervised learning for irregular physiological signals like ECG inter-beat intervals.

I have first-author publications at EMNLP and Interspeech (RAVEN, QUADS) and am a co-author on Owl (ICLR 2026), a geometry-aware spatial audio LLM. I am comfortable training and deploying large multimodal models on multi-GPU A100 clusters, and equally happy compressing them onto microcontrollers (UniT, EfficientMic).

Audio LLMs | Audio Moment Retrieval | Graph SSL for ECG | Multi-speaker Diarization | TinyML / On-device

12+ · Peer-reviewed papers
4 · First-author venues
ICLR'26 · Latest top-tier acceptance
2026 · Defense expected

What's new

Recent News

10 · 2025
EfficientMic (single-microphone acoustic sensing for smart infrastructure) accepted at ACM BuildSys 2025.
Co-author
08 · 2025
RAVEN accepted at EMNLP 2025 (Main).
First author
07 · 2025
LLaSA accepted at UbiComp 2025.
07 · 2025
LOCUS accepted at EWSN 2025.
07 · 2025
Mindfulness Meditation & Respiration accepted at ACM IMWUT 2025.
First author
05 · 2025
QUADS accepted at INTERSPEECH 2025.
First author
05 · 2025
Received M.S. in Electrical & Computer Engineering from WPI.
11 · 2024
Passed Ph.D. diagnostic exam.
08 · 2024
InfantMotion2vec accepted at IEEE BSN 2024.
First author
04 · 2024
Paper accepted at IEEE/ACM CHASE 2024.
08 · 2022
Started Ph.D. at BASH Lab, ECE, WPI.

In progress

Current Research

In submission · 2026

Confidence-Guided Retrieval Refinement for Audio Moment Retrieval

A retrieve–rerank framework that localizes natural-language queries inside long-form (5-minute) YouTube audio. A cross-modal retriever returns top-K temporal candidates; a second-stage reranker — trained with Direct Alignment Preference Optimization (DAPO) — selectively refines them only when reranker confidence exceeds a learned threshold, avoiding destructive corrections on locally ambiguous candidates.
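The gating step can be illustrated with a minimal sketch. It assumes a single confidence value per query and a fixed threshold passed in by the caller; the `Candidate` fields, `gated_rerank`, and all numbers are hypothetical and are not taken from the actual system, where the threshold is learned.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start: float            # segment start, in seconds
    end: float              # segment end, in seconds
    retriever_score: float  # first-stage cross-modal similarity (assumed)
    reranker_score: float   # second-stage score (assumed)


def gated_rerank(candidates, reranker_confidence, threshold):
    """Reorder by reranker score only when the reranker is confident enough.

    Below the threshold the first-stage ordering is kept, so an unsure
    reranker cannot demote a candidate the retriever already ranked well.
    """
    key = "reranker_score" if reranker_confidence > threshold else "retriever_score"
    return sorted(candidates, key=lambda c: getattr(c, key), reverse=True)


candidates = [
    Candidate(12.0, 20.0, retriever_score=0.81, reranker_score=0.30),
    Candidate(95.0, 104.0, retriever_score=0.78, reranker_score=0.92),
]
confident = gated_rerank(candidates, reranker_confidence=0.9, threshold=0.7)
hesitant = gated_rerank(candidates, reranker_confidence=0.4, threshold=0.7)
print(confident[0].start, hesitant[0].start)  # 95.0 12.0
```

With high confidence the reranker's preferred segment moves to the top; with low confidence the retriever's order is left untouched.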

Boundaries are further sharpened by retrieval-grounded span refinement with an IoU overlap constraint, preventing hallucinated spans. On CASTELLA (1,347 test queries), the full system beats published UVCOM by +2.82 R1@0.5, +1.97 R1@0.7, and +1.75 mAP.
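A rough sketch of how a temporal IoU constraint and the R1@θ metric fit together, under stated assumptions: spans are `(start, end)` tuples in seconds, each query has exactly one ground-truth span, and the `min_iou` value is illustrative. The function names are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) spans in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def constrained_refine(retrieved, proposed, min_iou=0.5):
    """Accept a refined span only if it still overlaps the retrieved span.

    A proposal that drifts away from the retrieved evidence is rejected
    and the retrieved span is returned unchanged.
    """
    return proposed if temporal_iou(retrieved, proposed) >= min_iou else retrieved


def recall_at_1(predictions, ground_truths, iou_threshold):
    """Fraction of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


print(constrained_refine((10.0, 20.0), (11.0, 21.0)))  # (11.0, 21.0)
print(constrained_refine((10.0, 20.0), (40.0, 50.0)))  # (10.0, 20.0)
```

R1@0.5 and R1@0.7 correspond to `recall_at_1` with `iou_threshold` set to 0.5 and 0.7.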

Audio LLM | Retrieve–Rerank | DAPO | CASTELLA | Lead author

In preparation · 2026

Graph-Based Self-Supervised Foundation Model for ECG & IBI

A graph neural network that treats each heartbeat as a node carrying 10 physiologically meaningful HRV features (including pNN50, an RMSSD proxy, MAD-normalized IBI, local trend, and a quality flag) and connects beats through K-NN similarity edges, temporal sequential edges, and self-loops. The graph is rebuilt dynamically from learned embeddings on every forward pass.
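The three edge types can be sketched in a few lines. This is an illustration only: it assumes Euclidean distance for the K-NN edges and plain Python lists as embeddings, and `build_beat_graph` is a hypothetical name, not the model's actual code.

```python
import math


def build_beat_graph(embeddings, k=2):
    """Return directed edges (i, j) over beats, given one embedding per beat."""
    n = len(embeddings)
    edges = set()
    for i in range(n):
        edges.add((i, i))  # self-loop
        if i + 1 < n:      # temporal edges between consecutive beats
            edges.add((i, i + 1))
            edges.add((i + 1, i))
        # K-NN similarity edges: the k closest other beats in embedding space
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(embeddings[i], embeddings[j]),
        )
        for j in neighbors[:k]:
            edges.add((i, j))
    return sorted(edges)


# Beats 0 and 2 are similar, as are beats 1 and 3, although neither pair is adjacent in time.
edges = build_beat_graph([[0.0], [5.0], [0.1], [5.1]], k=1)
print(len(edges), (0, 2) in edges)  # 14 True
```

Because the edges depend on the embeddings, calling the function again after the embeddings change yields a different graph, which is the sense in which the graph is dynamic.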

Pretrained with a hybrid SSL objective: masked-node reconstruction for local rhythmic context, plus a BYOL-style bootstrap loss between two augmented views (BYOL needs no negative pairs, so it is not contrastive in the strict sense) under physiologically realistic augmentations (jitter, beat dropout, scale drift). Designed for ECG's irregular beat sequences, exactly where fixed-rate methods like wav2vec 2.0 and HuBERT struggle.
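A hedged sketch of the augmentations and of the standard BYOL objective (2 minus twice the cosine similarity between the online prediction and the target projection). The augmentation parameters, the linear form of the drift, and the way a dropped beat merges two intervals are assumptions made for illustration.

```python
import math
import random


def augment_ibi(ibis_ms, rng, jitter_ms=10.0, dropout_prob=0.1, max_drift=0.05):
    """Apply jitter, beat dropout, and scale drift to inter-beat intervals (ms)."""
    n = len(ibis_ms)
    augmented, carry = [], 0.0
    for i, ibi in enumerate(ibis_ms):
        scale = 1.0 + max_drift * (i / max(n - 1, 1))  # slow linear gain drift
        value = ibi * scale + rng.gauss(0.0, jitter_ms) + carry
        carry = 0.0
        if rng.random() < dropout_prob and i + 1 < n:
            carry = value  # missed beat: its interval merges into the next one
            continue
        augmented.append(value)
    return augmented


def byol_loss(online_prediction, target_projection):
    """BYOL regression loss: 2 - 2 * cosine similarity, in [0, 4]."""
    dot = sum(p * z for p, z in zip(online_prediction, target_projection))
    norm_p = math.sqrt(sum(p * p for p in online_prediction))
    norm_z = math.sqrt(sum(z * z for z in target_projection))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)


rng = random.Random(0)
view_a = augment_ibi([800.0] * 20, rng)
view_b = augment_ibi([800.0] * 20, rng)
print(len(view_a), len(view_b))
print(byol_loss([1.0, 0.0], [0.0, 1.0]))  # 2.0
```

Dropout changes the sequence length, which is one reason a graph over beats is a more natural fit here than a fixed-rate frame grid.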

Graph Neural Net | BYOL + Masked SSL | ECG / HRV | Physiology-grounded | Lead author

In submission · 2026

Multi-task Audio LLM for Long-Form Home Recordings

An audio large language model that jointly performs speaker counting, fine-grained speaker-role diarization (infant / parent / sibling / non-family), and non-speech vocalization labeling at 0.1-second resolution — a setting where standard audio LLMs trained on web caption data degrade sharply.

Combines an Audio Spectrogram Transformer trained on Littlebeats wearable data and AudioSet with a stage-wise reasoning corpus for caregiver–infant interaction. Post-trained with GRPO via the VERL framework on 4×A100 (80 GB).
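For readers unfamiliar with GRPO, its core step is normalizing rewards within a group of responses sampled for the same prompt. The sketch below shows only that step, with made-up rewards; it uses the population standard deviation (some implementations use the sample standard deviation), and it is not the VERL API.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt; reward 1.0 if the labels were correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.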

Audio LLM | Diarization | GRPO / VERL | LoRA · DDP · Slurm | Lead author

Accepted · ICLR 2026

Owl — Geometry-Aware Spatial Reasoning for Audio LLMs

Co-developed SAGE, a geometry-aware binaural audio encoder trained with auxiliary room-impulse-response prediction so it learns direct-to-reverberation ratio, RT60, and room layout cues from binaural audio alone.
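As a point of reference for one of these cues, the direct-to-reverberation ratio can be computed from a room impulse response when one is available. The sketch below assumes a single-channel impulse response as a list of samples and a 2.5 ms window around the strongest peak as the "direct" part; both the window and the function name are illustrative choices, and SAGE itself has to infer such cues without access to the impulse response.

```python
import math


def direct_to_reverberant_ratio_db(rir, sample_rate, direct_window_s=0.0025):
    """Energy in a short window around the direct-path peak vs. everything after it, in dB."""
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    half = int(direct_window_s * sample_rate)
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[lo:hi])
    reverberant = sum(x * x for x in rir[hi:])
    if reverberant == 0.0:
        raise ValueError("impulse response has no reverberant tail")
    return 10.0 * math.log10(direct / reverberant)


# Toy impulse response: silence, one direct-path impulse, then a flat tail.
rir = [0.0] * 10 + [1.0] + [0.1] * 100
print(direct_to_reverberant_ratio_db(rir, sample_rate=1000))
```

A larger value means the direct sound dominates, which typically corresponds to a closer source or a drier room.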

Co-built Owl, an audio LLM integrating SAGE with a Q-Former projector and LLaMA-2-7B backbone, trained with a three-stage curriculum: perception → relational geometry → chain-of-thought. Released BiDepth, a 1.1M-QA-pair dataset; Owl improves spatial QA accuracy by up to 25% over BAT.

Spatial Audio LLM | Q-Former · LLaMA-2-7B | ICLR 2026 | Co-author

Peer-reviewed

Selected Publications

[Figure: RAVEN architecture]

RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
EMNLP 2025 · Main Conference

We present RAVEN, a unified multimodal QA architecture whose core is QuART — a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across audio, video, and sensor modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline: unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning. We release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams with auto-generated QA pairs. RAVEN achieves up to 14.5% and 8.0% accuracy gains over SOTA multimodal LLMs on seven benchmarks, and remains robust under modality corruption, outperforming baselines by 50.23%.
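The idea of query-conditioned scalar gating can be sketched as follows. This is a simplified stand-in, not QuART itself: it assumes the relevance score is a sigmoid of a scaled dot product between a query vector and each token, whereas the real module is learned; `gate_tokens` is a hypothetical name.

```python
import math


def gate_tokens(query, tokens):
    """Score each token against the query and scale the token by its score."""
    dim = len(query)
    scores, gated = [], []
    for token in tokens:
        logit = sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
        score = 1.0 / (1.0 + math.exp(-logit))  # scalar relevance in (0, 1)
        scores.append(score)
        gated.append([score * t for t in token])
    return scores, gated


query = [1.0, 0.0]
tokens = [[4.0, 0.0], [-4.0, 0.0]]  # one aligned token, one distractor
scores, gated = gate_tokens(query, tokens)
print([round(s, 3) for s in scores])  # [0.944, 0.056]
```

Tokens aligned with the query pass through almost unchanged, while distractors are scaled toward zero before fusion.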

[Figure: Mindfulness Meditation overview]

Mindfulness Meditation and Respiration: Accelerometer-based Respiration Rate and Mindfulness Progress Estimation

Mohammad Nur Hossain Khan, David Creswell, Jordan Albert, Patrick O'Connell, Shawn Fallon, Mathew Polowitz, Xuhai "Orson" Xu, Bashima Islam
ACM IMWUT 2025

A smartphone-accelerometer–based respiration tracking algorithm that accurately captures slow breathing patterns typical of mindfulness meditation, eliminating the need for additional wearables. We introduce the first quantitative framework to estimate mindfulness skills — concentration, sensory clarity, and equanimity — from accelerometer-derived respiration data. Tested on 261 mindfulness sessions in both controlled and real-world settings; respiration tracking achieves MAE of 1.6 BPM, and mindfulness skill estimation reaches F1 of 80–84%.

[Figure: QUADS diagram]

QUADS: QUAntized Distillation Framework for Efficient Speech Language Understanding

Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
INTERSPEECH 2025

QUADS unifies knowledge distillation and quantization through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy. It achieves 71.13% accuracy on SLURP and 99.20% on FSC, with only minor degradations (≤5.56%) compared to SOTA. It reduces computational complexity by 60–73× (GMACs) and model size by 83–700×, demonstrating strong robustness under extreme quantization.
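The two ingredients being unified can each be shown in a few lines. The sketch below assumes symmetric uniform quantization and a temperature-scaled KL distillation loss, which are common textbook choices; it does not reproduce the QUADS training recipe, and the function names are hypothetical.

```python
import math


def quantize(weights, bits):
    """Symmetric uniform quantization to 2**(bits-1) - 1 levels per sign (bits >= 2)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def softmax(logits, temperature):
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


print(quantize([0.4, -1.0, 0.26, 0.0], bits=2))  # [0.0, -1.0, 0.0, 0.0]
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

At 2 bits only three values survive per tensor, which illustrates why a student trained with distillation in the loop tends to cope better than one quantized after the fact.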

[Figure: LLaSA overview]

LLaSA: A Sensor-Aware LLM for Natural Language Reasoning of Human Activity from IMU Data

Sheikh Asif Imran Shouborno, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
UbiComp 2025

We introduce SensorCap (35,960 IMU–caption pairs) and OpenSQA (199,701 QA pairs for causal/explanatory reasoning), and develop LLaSA — a family of compact sensor-aware LLMs (7B, 13B) that generate interpretable, context-rich responses grounded in raw IMU data. LLaSA outperforms commercial LLMs including GPT-3.5 and GPT-4o-mini on benchmark and real-world tasks.

[Figure: LOCUS architecture]

LOCUS — Localization with Channel Uncertainty and Sporadic Energy

Subrata Biswas, Mohammad Nur Hossain Khan, Alex Colwell, Jack Adiletta, Bashima Islam
EWSN 2025

A deep learning framework recovering corrupted features for sound source localization on batteryless systems with sporadic energy. LOCUS integrates Information-Weighted Focus (InFo), a Latent Feature Synthesizer (LaFS), and Guided Replacement (GRep) — yielding up to 36.91% DoA error reduction on DCASE/LargeSet and 25.87–59.46% real-world gains. We release a 50-hour multichannel dataset.

See full list on Google Scholar →

Background

Education

Ph.D. in Electrical & Computer Engineering

Worcester Polytechnic Institute · Worcester, MA, USA
Aug 2022 — Expected 2026
Advisor: Dr. Bashima Islam · BASH Lab

M.S. in Electrical & Computer Engineering

Worcester Polytechnic Institute · Worcester, MA, USA
Conferred May 2025

B.Sc. in Electrical & Electronic Engineering

Bangladesh University of Engineering and Technology (BUET) · Dhaka, Bangladesh
Feb 2011 — Mar 2016
Advisor: Dr. Shaikh Anowarul Fattah
Thesis: Surface EMG-Based Hand Gesture Recognition using Discrete Wavelet Transformation

Experience & service

Experience

Graduate Research Assistant — BASH Lab

Worcester Polytechnic Institute · Audio LLMs, multimodal QA, SSL for wearables · Reviewer for IEEE BSN, ACM IMWUT, IEEE TAC · Co-organizer, IEEE ASRU 2025 satellite workshop on AI for Children's Speech & Language.
May 2022 — Present

Graduate Teaching Assistant — Dept. of ECE, WPI

On-Device Deep Learning · Discrete-Time Signal & System Analysis · Communications & Networks.
Aug 2023 — May 2024

Assistant Engineer — Operations

Electricity Generation Company of Bangladesh · Dhaka, Bangladesh · Pre-PhD industry experience.
Sep 2017 — Aug 2022

Let's chat

Get in Touch

Whether you're a fellow researcher, a potential collaborator, or just want to chat about audio LLMs, wearable sensing, or graph SSL — I'd love to hear from you.