|
Hao Zou
Reliable generation, masked diffusion language models, and long-horizon LLM agents.
I am a Columbia CS researcher working with Prof. Kathleen McKeown and Prof. Zhou Yu, and closely collaborating with Zachary Horvitz and Xiao Yu. My research asks how language models can revise, reason, and act reliably when evidence changes or tasks unfold over many steps. My recent work connects two threads: (1) masked diffusion language models for controllable editing, faithful summarization, and inference-time reasoning; and (2) LLM agents for realistic computer-use workflows, environment scaling, and harness-based training.
Starting Summer 2026, I am continuing in the Columbia CS Department as a Research Staff Associate on an IARPA-funded project on faithful summarization. Previously, I worked with Prof. Jordan Boyd-Graber at the UMD CLIP Lab, and with Prof. Dongyeop Kang at the University of Minnesota NLP group. I received my B.S. in Computer Science from the University of Minnesota, Twin Cities, and my M.S. in Computer Science from Columbia University in May 2026. |
|
Research
My long-term goal is to build AI systems that remain faithful and useful under uncertainty: systems that can detect unsupported claims, localize what should be changed, allocate computation to uncertain parts of a reasoning trace, and act safely in long-horizon environments. I approach this by connecting diffusion-based generation and editing with agent evaluation and training.
In diffusion language modeling, I study how non-autoregressive models can use their ability to infill and revise arbitrary spans for faithful summarization, reasoning templates, and efficient inference. In agents, I study how realistic environments, verifiers, and harnesses can make training and evaluation match the way agents are actually deployed.
News
- Jun 2026: Detect, Remask, Repair is on arXiv and under review at EMNLP 2026.
- Jun 2026: Continuing at Columbia CS as a Research Staff Associate in Prof. McKeown's group on faithful summarization.
- 2026: OSWorld 2.0, a benchmark for long-horizon real-world computer-use agents, is under review at NeurIPS 2026.
- 2025: No Compute Left Behind studies reasoning, sampling, and inference-time compute allocation for masked diffusion language models.
- 2024: Our question rewriting work was accepted to EMNLP 2024.
Selected Papers & Projects
|
Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts
Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown
Under review, EMNLP 2026
arXiv / bibtex
A diffusion-based framework for localized faithfulness repair: detect unsupported spans in an existing summary, remask them, and repair only the unreliable content instead of fully regenerating the summary. |
|
OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
Large collaborative project
Under review, NeurIPS 2026
project page / arXiv coming soon
A benchmark of 102 realistic computer-use tasks and 30 self-hosted mock websites for evaluating long-horizon GUI agents, including dynamic workflows, tutorial following, simulated-user interaction, multimodal editing, and safety-oriented checks. |
|
|
Scalable Environments and Training Infrastructure for Harness-Based Agents
Ongoing collaboration with Xiao Yu
In preparation, 2026
Building agent-curated task environments and training infrastructure that connects real inference harnesses to SFT/RL pipelines, reducing the mismatch between how agents are trained and how they are deployed. |
|
No Compute Left Behind: Rethinking Reasoning and Sampling with Masked Diffusion Models
Zachary Horvitz, Raghav Singhal, Hao Zou, Carles Domingo-Enrich, Zhou Yu, Rajesh Ranganath, Kathleen McKeown
Under review, COLM 2026
arXiv / bibtex
Studies how masked diffusion language models use inference-time compute, including reasoning-as-infilling, answer-conditioned posterior sampling, uncertainty-aware early exit, and reasoning validation. |
|
You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions
Tasnim Kabir, Yoo Yeon Sung, Saptarashmi Bandyopadhyay, Hao Zou, Abhranil Chandra, Jordan Boyd-Graber
EMNLP 2024
paper / bibtex
Transforms trivia-style questions into more natural information-seeking questions to improve retrieval alignment and cross-domain question answering. |
|
A Survey of Diffusion Models in Natural Language Processing
Hao Zou, Zae Myung Kim, Dongyeop Kang
arXiv, 2023
arXiv / bibtex
Surveys diffusion formulations for NLP, including non-autoregressive generation, text editing, token-level control, robustness, and scaling directions for diffusion language models. |
|
Debiasing Language Models for In-Context Learning Using a Causal Inference-Inspired Method
Hao Zou, Karin de Langis, Dongyeop Kang, Yohan Jo
Manuscript, 2023
bibtex
Uses intervention-inspired estimates of the causal effect of input text to reduce spurious prompt-label bias in in-context learning. |
Experience
| 2025 - Present | NLP Lab, Columbia University. Graduate researcher / Staff Associate I. Working on faithful summarization, masked diffusion language models, inference-time algorithms, and LLM agents with Kathleen McKeown, Zhou Yu, Zachary Horvitz, and Xiao Yu. |
| 2024 - 2025 | IBM. AI Algorithms Intern. Worked on LLM inference efficiency, KV-cache quantization, and attention benchmarking. |
| 2024 - 2025 | Duke University. Graduate research with Enmao Diao on diffusion/flow-matching training objectives and benchmarking. |
| 2021 - 2023 | University of Minnesota NLP Group. Undergraduate research with Dongyeop Kang on diffusion models for NLP, causal analysis, and controllable generation. |
| 2021 - 2022 | UMD CLIP Lab. Undergraduate research with Jordan Boyd-Graber on question rewriting and retrieval for open-domain QA. |
| 2022 - 2023 | Sony Research. Research intern on controllable generation and reinforcement learning from human preferences. |
