Junhong Shen

Ph.D. Student

Machine Learning Department
Carnegie Mellon University
Advisor: Ameet Talwalkar

Email: junhongs@andrew.cmu.edu
Office: Gates Hillman Centers 8226

Google Scholar GitHub Twitter

Bio

I’m a rising 5th-year Ph.D. student in the Machine Learning Department at CMU, advised by Ameet Talwalkar. My work centers on enhancing LLM’s interaction with real-world applications, in particular building multi-modal models and agent systems that operate in real-world environments, such as browsers, command lines, and IDEs. I’m also interested in enhancing LLMs’ abilities to model diverse data types and applying them to long-tail, low-resource domains such as science and business.

I obtained my B.S. in Mathematics of Computation at UCLA, where I was fortunate to work with Lin Yang on sample-efficient reinforcement learning. I have also worked on multi-agent RL and Theory of Mind, advised by Song-Chun Zhu and Ying Nian Wu. My PhD is supported by JP Morgan AI PhD Fellowship.

News

Jun 2025: Check out our TTI work on new axis for agent test-time scaling!
May 2025: Started my intership at DeepMind in Mountain View!
Apr 2025: We are organizing the CMU Agent Workshop again this year. Check out and participate!
Dec 2024: My internship work at FAIR is released! Check out Content-Adaptive Tokenizer (CAT) and Multi-Modal Mixture-of-Mamba!
Nov 2024: Check out our newest work on web agents built on top of open-source LLMs! ScribeAgent paper, code, and blog post.
Sep 2024: Grateful to be awarded the J.P. Morgan AI PhD Fellowship (accepted) and the Bloomberg PhD Fellowship (declined)!
May 2024: My internship work at Microsoft Research, Tag-LLM, is accepted by ICML 2024!
Apr 2023: Our work on cross-modal fine-tuning is accepted by ICML 2023 as oral presentation!
Oct 2022: We are organizing the 2022 AutoML Decathlon. Check out and participate!
Sep 2022: Our work on NAS for diverse tasks is accepted by NeurIPS 2022.

Selected Publications

For a full list of publications, see Research.

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction paper illustration

Junhong Shen*, Hao Bai*, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Preprint

The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout.

papercodewebsiteblogpost

Mixture‑of‑Mamba: Enhancing Multi‑Modal State‑Space Models with Modality‑Aware Sparsity paper illustration

Weixin Liang*, Junhong Shen*, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
Mixture‑of‑Mamba: Enhancing Multi‑Modal State‑Space Models with Modality‑Aware Sparsity

In ICLR Scalable Optimization for Efficient and Adaptive Foundation Models Workshop, 2025 (Oral, top 8/96).

We propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers, we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency.

papercode

CAT: Content-Adaptive Image Tokenization paper illustration

Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
CAT: Content-Adaptive Image Tokenization

Preprint

Most existing image tokenizers encode images into a fixed number of tokens or patches. We introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages LLMs to predict content complexity and determine the optimal compression ratio for a given image.

paper

Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

In ICLR Foundation Models in the Wild Workshop, 2025.

Most LLM-based web agents rely on prompting general-purpose, proprietary models like GPT-4. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data. This simple yet effective approach achieves SOTA direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena.

papercode

Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains paper illustration

Junhong Shen, Neil Tenenholtz, James Brian Hall, David Alvarez-Melis, Nicolo Fusi
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains

In ICML, 2024.

LLMs demonstrate proficiency in understanding natural language. However, their capabilities wane in highly specialized domains underrepresented in the pretraining corpus, such as physical and biomedical sciences. This work explores how to repurpose general LLMs into specialized task solvers through learning custom input tags to condition the LLM.

papercode

Efficient Architecture Search for Diverse Tasks paper illustration

Junhong Shen*, Mikhail Khodak*, Ameet Talwalkar
Efficient Architecture Search for Diverse Tasks

In NeurIPS, 2022.

DASH is developed for efficiently solving diverse ML problems outside of the well-researched domains such as vision and natural language processing. Being fast, simple, and broadly applicable, DASH fixes a standard CNN topology and searches for the right kernel sizes and dilation rates that its operations should take on. It expands the network capacity to extract features at multiple resolutions for different types of data while only requiring searching over the operation space.

papercodeblogpost

Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation paper illustration

Junhong Shen, Lin F. Yang
Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation

In AAAI, 2021.

We propose a theoretically principled nearest neighbor (NN) function approximator that can replace the value networks in deep RL methods. Inspired by human similarity judgments, the NN approximator estimates the action values using rollouts on past observations and can provably obtain a small regret bound that depends only on the intrinsic complexity of the environment.

papercodeslides

Cite Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

@misc{shenbai2025tti,
 title={Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction}, 
 author={Junhong Shen and Hao Bai and Lunjun Zhang and Yifei Zhou and Amrith Setlur and Shengbang Tong and Diego Caples and Nan Jiang and Tong Zhang and Ameet Talwalkar and Aviral Kumar},
 year={2025},
 eprint={2506.07976},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2506.07976},
 }

Cite Mixture‑of‑Mamba: Enhancing Multi‑Modal State‑Space Models with Modality‑Aware Sparsity

@misc{liangshen2025mixtureofmamba,
 title={Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity},
 author={Weixin Liang and Junhong Shen and Genghan Zhang and Ning Dong and Luke Zettlemoyer and Lili Yu},
 year={2025},
 eprint={2501.16295},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2501.16295},
 }

Cite CAT: Content-Adaptive Image Tokenization

@misc{shen2024adaptivetokenizer,
 title={CAT: Content-Adaptive Image Tokenization},
 author={Junhong Shen and Kushal Tirumala and Michihiro Yasunaga and Ishan Misra and Luke Zettlemoyer and Lili Yu and Chunting Zhou},
 year={2025},
 eprint={2501.03120},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
 }

Cite ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

@misc{shen2024scribeagent,
 title={ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data},
 author={Junhong Shen and Atishay Jain and Zedian Xiao and Ishan Amlekar and Mouad Hadji and Aaron Podolny and Ameet Talwalkar},
 year={2024},
 eprint={2411.15004},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 }

Cite UPS: Efficiently Building Foundation Models for PDE Solving via Cross-Modal Adaptation

@misc{shen2024ups, title={UPS: Efficiently Building Foundation Models for PDE Solving via Cross-Modal Adaptation},
 author={Junhong Shen and Tanya Marwah and Ameet Talwalkar},
 year={2024},
 eprint={2403.07187},
 archivePrefix={arXiv},
 primaryClass={cs.LG}
 }

Cite Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains

@misc{shen2024tagllm,
 title={Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains}, 
 author={Junhong Shen and Neil Tenenholtz and James Brian Hall and David Alvarez-Melis and Nicolo Fusi},
 year={2024},
 eprint={2402.05140},
 archivePrefix={arXiv},
 primaryClass={cs.LG}
 }

Cite Cross-Modal Fine-Tuning: Align then Refine

@misc{shen2023orca,
 author = {Shen, Junhong and Li, Liam and Dery, Lucio M. and Staten, Corey and Khodak, Mikhail and Neubig, Graham and Talwalkar, Ameet},
 title = {Cross-Modal Fine-Tuning: Align then Refine},
 publisher = {ICML},
 year = {2023},
 url = {https://arxiv.org/abs/2302.05738}
 }

Cite Efficient Architecture Search for Diverse Tasks

@inproceedings{shen2022efficient,
 title={Efficient Architecture Search for Diverse Tasks},
 author={Shen, Junhong and Khodak, Mikhail and Talwalkar, Ameet},
 booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
 year={2022}
 }

Cite Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation

@inproceedings{Shen2021TheoreticallyPD,
 title={Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation},
 author={Junhong Shen and Lin F. Yang},
 booktitle={AAAI},
 year={2021}
 }