Technical blogs

Papers

LLM Reasoning

LLM agents: brief history

Agentic AI Frameworks & AutoGen

  • Google Cloud expands grounding capabilities on Vertex AI
  • The Needle In a Haystack Test: Evaluating the performance of RAG systems
  • The AI detective: The Needle in a Haystack test and how Gemini 1.5 Pro solves it

Compound AI Systems & the DSPy Framework

  • Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
  • Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together

Agents for Software Development

  • SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
  • OpenHands: An Open Platform for AI Software Developers as Generalist Agents

AI Agents for Enterprise Workflows

  • WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
  • WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
  • TapeAgents: a Holistic Framework for Agent Development and Optimization

Towards a unified framework of Neural and Symbolic Decision Making

  • Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
  • Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
  • Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets
  • SurCo: Learning Linear Surrogates For Combinatorial Nonlinear Optimization Problems

Project GR00T: A Blueprint for Generalist Robotics

  • Voyager: An Open-Ended Embodied Agent with Large Language Models
  • Eureka: Human-Level Reward Design via Coding Large Language Models
  • DrEureka: Language Model Guided Sim-To-Real Transfer

Open-Source and Science in the Era of Foundation Models

  • Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Measuring Agent Capabilities and Anthropic’s RSP

  • Announcing our updated Responsible Scaling Policy
  • Developing a computer use model

Towards Building Safe & Trustworthy AI Agents and A Path for Science‑ and Evidence‑based AI Policy

  • A Path for Science‑ and Evidence‑based AI Policy
  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
  • Representation Engineering: A Top-Down Approach to AI Transparency
  • Extracting Training Data from Large Language Models
  • The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

Test-time Scaling

General Agent (using LLM)