I am a Research Scientist at Google DeepMind, where I work on safe coding agents and formal-verification-based safety mechanisms for large language model agents. Previously, I was an Applied Scientist at Amazon AWS AI Lab, where I shipped a production-grade agent verification module that is now integrated into a flagship AWS product used by enterprise customers.
I completed my Ph.D. in Computer Science at the University of Pennsylvania in 2025, advised by Insup Lee and Osbert Bastani. My research focuses on agentic AI safety, alignment, uncertainty quantification, and LLM evaluation, with the long-term goal of building reliable, secure, and verifiable LLM agent systems. My work has been published at NeurIPS, ICLR, ICML, NAACL, EMNLP, and KDD, including a Spotlight at NeurIPS 2024 and a Best Paper Award at the 2023 TEACH Workshop at ICML.
I'm always open to collaboration on agentic safety and trustworthy AI — feel free to reach out.
Research Scientist, Google DeepMind
Applied Scientist, Amazon AWS AI Lab
04.1 — Featured
TL;DR: A one-shot safety alignment algorithm that navigates the helpfulness/safety trade-off via convex dualization, reducing computational cost by ~90%.
Abstract: The growing safety concerns surrounding LLMs create an urgent need to align them with diverse human preferences. We present a dualization perspective that reduces constrained alignment to an equivalent unconstrained problem by pre-optimizing a smooth, convex dual function that admits a closed form. This eliminates cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy yields two practical algorithms, MoCAN and PeCAN, for the model-based and preference-based settings, respectively.
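To make the dualization idea concrete, here is a minimal sketch for the case of a single safety constraint. The notation is my own illustrative choice, not necessarily the paper's: r is the helpfulness reward, g the safety reward, b the safety threshold, and β the KL-regularization strength.

```latex
% Constrained alignment: maximize helpfulness under a safety constraint.
\max_{\pi}\ \mathbb{E}_{x,\,y\sim\pi}[r(x,y)]
  - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
\quad \text{s.t.} \quad
\mathbb{E}_{x,\,y\sim\pi}[g(x,y)] \ge b.

% For a fixed dual variable \lambda \ge 0, the KL-regularized inner
% maximization has the Gibbs closed form
% \pi_{\lambda}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,
%   e^{(r(x,y) + \lambda g(x,y))/\beta},
% so the dual function itself is available in closed form:
D(\lambda) =
  \beta\,\mathbb{E}_{x}\!\Big[\log \mathbb{E}_{y\sim\pi_{\mathrm{ref}}}
  \big[e^{(r(x,y)+\lambda g(x,y))/\beta}\big]\Big] - \lambda b.
```

Since D(λ) is a log-sum-exp of terms affine in λ minus a linear term, it is smooth and convex, so λ* can be found by a cheap one-dimensional minimization before any policy training; a single unconstrained alignment run with reward r + λ*g then solves the constrained problem, which is the "one-shot" property.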
TL;DR: A rank-based framework for assessing LLM uncertainty measures, robust to the differing value ranges of existing metrics and free of ad-hoc thresholds.
Abstract: Many uncertainty measures (semantic entropy, affinity-graph-based measures, etc.) have been proposed for language models, but they take values over different ranges, making them hard to compare. We develop Rank-Calibration, a principled framework whose key tenet is that higher uncertainty should imply lower generation quality on average. Rank-Calibration quantifies deviations from this ideal without requiring ad-hoc thresholding of correctness scores, and we demonstrate its broad applicability and interpretability empirically.
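As a rough illustration of the rank-based idea, here is a simplified sketch, not the paper's exact estimator; the binning scheme, tie handling, and function name are my own assumptions.

```python
import numpy as np

def rank_calibration_error(uncertainty, quality, n_bins=20):
    """Rough empirical rank-calibration error (illustrative only).

    Ideal behavior: examples with higher uncertainty rank should have
    lower quality rank on average. We bin by uncertainty rank and measure
    how far the mean quality rank in each bin is from that anti-monotone
    ideal.
    """
    u = np.asarray(uncertainty, dtype=float)
    q = np.asarray(quality, dtype=float)
    n = len(u)
    # Normalized ranks in [0, 1]; double argsort yields ranks 0..n-1.
    u_rank = np.argsort(np.argsort(u)) / (n - 1)
    q_rank = np.argsort(np.argsort(q)) / (n - 1)
    bins = np.minimum((u_rank * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        ideal = 1.0 - u_rank[mask].mean()  # perfectly reversed rank
        err += mask.mean() * abs(q_rank[mask].mean() - ideal)
    return err  # 0 means perfectly rank-calibrated under this proxy
```

Because everything is computed on ranks, measures with arbitrary value ranges (entropies, graph-based scores, verbalized confidences) can be compared on the same footing, which is what makes the framework threshold-free.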
TL;DR: The first end-to-end statistical correctness guarantee for retrieval-augmented generation, combining conformal prediction with Bayesian optimization.
Abstract: When applied to open-domain question answering, LLMs frequently generate incorrect responses based on hallucinated facts. RAG mitigates hallucinations but provides no correctness guarantees. We propose TRAQ, which uses conformal prediction to construct prediction sets guaranteed to contain the semantically correct response with high probability, and Bayesian optimization to minimize the size of those sets. Empirically, TRAQ provides the desired correctness guarantee while reducing average prediction set size by 16.2% compared to ablations. Best Paper Award, 2023 TEACH Workshop @ ICML.
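For readers unfamiliar with the underlying machinery, here is a generic split-conformal sketch, not TRAQ itself: TRAQ conformalizes both the retriever and the generator to get the end-to-end guarantee, and the score function and names below are illustrative assumptions.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: choose tau so that, on exchangeable
    data, P(score of the true answer <= tau) >= 1 - alpha.

    cal_scores: nonconformity scores of the *correct* answers on a
    held-out calibration set, e.g. -log p(answer | question, passage).
    """
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample correction
    if k > n:  # too few calibration points for this alpha
        return float("inf")
    return float(np.sort(cal_scores)[k - 1])

def prediction_set(candidates, scores, tau):
    """Keep every candidate answer whose nonconformity score is <= tau."""
    return [c for c, s in zip(candidates, scores) if s <= tau]
```

Smaller sets are more useful, which is where the Bayesian-optimization step comes in: per the abstract it tunes the pipeline to minimize expected set size while preserving the coverage guarantee, plausibly by searching over how the total miscoverage budget is allocated across the retrieval and generation stages.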
04.2 — 2026 / Under Review
04.3 — 2025
04.4 — 2024
04.5 — 2023
04.6 — Earlier