Ph.D. Student

Machine Learning Department
Carnegie Mellon University
Advisor: Ameet Talwalkar

Email | Google Scholar | GitHub | Twitter

Bio

I’m a 5th-year Ph.D. student in the Machine Learning Department at CMU, advised by Ameet Talwalkar. My work centers on improving how LLMs interact with real-world applications, in particular building multi-modal models and agent systems that operate in real-world environments such as browsers, command lines, and IDEs. I’m also interested in enhancing LLMs’ abilities to model diverse data types and applying them to long-tail, low-resource domains such as science and business.

I obtained my B.S. in Mathematics of Computation at UCLA, where I was fortunate to work with Lin Yang on sample-efficient reinforcement learning. I have also worked on multi-agent RL and Theory of Mind, advised by Song-Chun Zhu and Ying Nian Wu. My Ph.D. is supported by the JP Morgan AI PhD Fellowship.

I'm on the industry job market, seeking research scientist positions starting Oct 2025! Feel free to email me if there's a fit! Here's my CV.

 

Selected Publications

For a full list of publications, see Research.
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
NeurIPS 2025
The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout.
CAT: Content-Adaptive Image Tokenization
NeurIPS 2025
Most existing image tokenizers encode images into a fixed number of tokens or patches. We introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages LLMs to predict content complexity and determine the optimal compression ratio for a given image.
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
ICLR Scalable Optimization for Efficient and Adaptive Foundation Models Workshop, 2025 (Oral, top 8/96).
We propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers, we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency.
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
ICLR Foundation Models in the Wild Workshop, 2025.
Most LLM-based web agents rely on prompting general-purpose, proprietary models like GPT-4. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data. This simple yet effective approach achieves SOTA direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena.
UPS: Efficiently Building Foundation Models for PDE Solving via Cross-Modal Adaptation
TMLR 2024 & ICML AI4Science Workshop, 2024 (Spotlight).
UPS is developed for solving diverse spatiotemporal PDEs defined over various domains, dimensions, and resolutions. It unifies different PDEs into a consistent representation space and processes diverse collections of PDE data using a unified network architecture that combines LLMs with domain-specific neural operators.
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains
ICML, 2024.
LLMs demonstrate proficiency in understanding natural language. However, their capabilities wane in highly specialized domains underrepresented in the pretraining corpus, such as physical and biomedical sciences. This work explores how to repurpose general LLMs into specialized task solvers through learning custom input tags to condition the LLM.
Cross-Modal Fine-Tuning: Align then Refine
ICML, 2023 (Oral).
ORCA is a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. It adapts to a target task via an align-then-refine workflow. Given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities.
Efficient Architecture Search for Diverse Tasks
NeurIPS, 2022.
DASH is designed to efficiently solve diverse ML problems outside well-studied domains such as vision and natural language processing. Fast, simple, and broadly applicable, DASH fixes a standard CNN topology and searches only for the kernel sizes and dilation rates of its operations. This expands the network's capacity to extract features at multiple resolutions for different types of data while restricting the search to the operation space.
Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation
AAAI, 2021.
We propose a theoretically principled nearest neighbor (NN) function approximator that can replace the value networks in deep RL methods. Inspired by human similarity judgments, the NN approximator estimates the action values using rollouts on past observations and can provably obtain a small regret bound that depends only on the intrinsic complexity of the environment.