Materials

Technical blogs

LLM Powered Autonomous Agents by Lilian Weng

Papers

LLM Reasoning

LLM agents: brief history

Agentic AI Frameworks & AutoGen

Enterprise trends for generative AI, and key components of building successful agents/applications

Google Cloud expands grounding capabilities on Vertex AI
The Needle In a Haystack Test: Evaluating the performance of RAG systems
The AI detective: The Needle in a Haystack test and how Gemini 1.5 Pro solves it

Compound AI Systems & the DSPy Framework

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together

Agents for Software Development

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
OpenHands: An Open Platform for AI Software Developers as Generalist Agents

AI Agents for Enterprise Workflows

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
TapeAgents: a Holistic Framework for Agent Development and Optimization

Towards a unified framework of Neural and Symbolic Decision Making

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets
SurCo: Learning Linear Surrogates For Combinatorial Nonlinear Optimization Problems

Project GR00T: A Blueprint for Generalist Robotics

Voyager: An Open-Ended Embodied Agent with Large Language Models
Eureka: Human-Level Reward Design via Coding Large Language Models
DrEureka: Language Model Guided Sim-To-Real Transfer

Open-Source and Science in the Era of Foundation Models

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Measuring Agent capabilities and Anthropic’s RSP

Announcing our updated Responsible Scaling Policy
Developing a computer use model

Towards Building Safe & Trustworthy AI Agents and A Path for Science‑ and Evidence‑based AI Policy

A Path for Science‑ and Evidence‑based AI Policy
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Representation Engineering: A Top-Down Approach to AI Transparency
Extracting Training Data from Large Language Models
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

Test-time Scaling

s1: Simple test-time scaling

General Agent (using LLM)