Research

I’m interested in machine learning and deep learning in general. My current research focuses mainly on the practical side of ML, i.e., developing effective ML tools and pipelines for diverse real-world applications. I’m particularly interested in enhancing LLMs’ interaction with real-world applications by developing efficient, unified multi-modal models and building LLM agents capable of interacting with environments (e.g., browsers, command lines, IDEs) and users. Besides, I also study:

 

Talks

 

Publications

CAT: Content-Adaptive Image Tokenization
Preprint.
Most existing image tokenizers encode images into a fixed number of tokens or patches. We introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages LLMs to predict content complexity and determine the optimal compression ratio for a given image.
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
Preprint.
Most LLM-based web agents rely on prompting general-purpose, proprietary models like GPT-4. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data. This simple yet effective approach achieves SOTA direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena.
Specialized Foundation Models Struggle to Beat Supervised Baselines
In NeurIPS FM4Science Workshop, 2024.
We look at three modalities (genomics, satellite imaging, and time series) with multiple recent foundation models (FMs) and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. We find that it is consistently possible to train simple supervised models that match or even outperform the latest foundation models.
UPS: Efficiently Building Foundation Models for PDE Solving via Cross-Modal Adaptation
In TMLR 2024 & ICML AI4Science Workshop, 2024 (Spotlight).
UPS is developed for solving diverse spatiotemporal PDEs defined over various domains, dimensions, and resolutions. It unifies different PDEs into a consistent representation space and processes diverse collections of PDE data using a unified network architecture that combines LLMs with domain-specific neural operators.
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains
In ICML, 2024.
LLMs demonstrate proficiency in understanding natural language. However, their capabilities wane in highly specialized domains underrepresented in the pretraining corpus, such as physical and biomedical sciences. This work explores how to repurpose general LLMs into specialized task solvers through learning custom input tags to condition the LLM.
Cross-Modal Fine-Tuning: Align then Refine
In ICML, 2023 ([Oral](https://icml.cc/virtual/2023/oral/25514)).
ORCA is a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. It adapts to a target task via an align-then-refine workflow. Given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities.
Efficient Architecture Search for Diverse Tasks
In NeurIPS, 2022.
DASH is developed for efficiently solving diverse ML problems outside well-researched domains such as vision and natural language processing. Fast, simple, and broadly applicable, DASH fixes a standard CNN topology and searches for the right kernel sizes and dilation rates for its operations. It expands the network capacity to extract features at multiple resolutions for different types of data while only requiring a search over the operation space.
NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks
In NeurIPS Datasets and Benchmarks Track, 2022.
Neural architecture search (NAS) benchmarks and methods prioritize performance on well-studied tasks, e.g., image classification on CIFAR and ImageNet. To mitigate this bias, we introduce NAS-Bench-360, a benchmark suite for evaluating state-of-the-art NAS methods on a diverse set of tasks. The selection spans different application domains, dataset sizes, problem dimensionalities, and learning objectives.
Iterative Teacher-Aware Learning
In NeurIPS, 2021.
In this paper, we propose a gradient-optimization-based teacher-aware learner that incorporates the teacher’s cooperative intention into the likelihood function and learns provably faster than the naive learning algorithms used in previous machine teaching work.
Theoretically Principled Deep RL Acceleration via Nearest Neighbor Function Approximation
In AAAI, 2021.
We propose a theoretically principled nearest neighbor (NN) function approximator that can replace the value networks in deep RL methods. Inspired by human similarity judgments, the NN approximator estimates the action values using rollouts on past observations and can provably obtain a small regret bound that depends only on the intrinsic complexity of the environment.
Mathematical Reconstruction of Patient-Specific Vascular Networks Based on Clinical Images and Global Optimization
In IEEE Access, 2021.
We developed a computational framework that takes 3D medical images as input and reconstructs complete, patient-specific vascular network models using a mathematical optimization procedure. Our framework extracts major vessels from the images and uses the organ geometry to select vessel termination points. Then, it generates the remainder of the network based on physiological optimality principles.
Emergence of Pragmatics from Referential Game between Theory of Mind Agents
In Emergent Communication Workshop, NeurIPS, 2019.
We integrate theory of mind (ToM) into a cooperative multi-agent pedagogical situation and propose an adaptive reinforcement learning (RL) algorithm to develop a communication protocol.