Technical blogs
Papers
LLM Reasoning
LLM agents: brief history
Agentic AI Frameworks & AutoGen
Enterprise trends for generative AI, and key components of building successful agents/applications
- Google Cloud expands grounding capabilities on Vertex AI
- The Needle In a Haystack Test: Evaluating the performance of RAG systems
- The AI detective: The Needle in a Haystack test and how Gemini 1.5 Pro solves it
Compound AI Systems & the DSPy Framework
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
- Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together
Agents for Software Development
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents
AI Agents for Enterprise Workflows
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
- TapeAgents: a Holistic Framework for Agent Development and Optimization
Towards a unified framework of Neural and Symbolic Decision Making
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
- Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
- Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets
- SurCo: Learning Linear Surrogates For Combinatorial Nonlinear Optimization Problems
Project GR00T: A Blueprint for Generalist Robotics
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- Eureka: Human-Level Reward Design via Coding Large Language Models
- DrEureka: Language Model Guided Sim-To-Real Transfer
Open-Source and Science in the Era of Foundation Models
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Measuring Agent capabilities and Anthropic’s RSP
- Announcing our updated Responsible Scaling Policy
- Developing a computer use model
Towards Building Safe & Trustworthy AI Agents and A Path for Science‑ and Evidence‑based AI Policy
- A Path for Science‑ and Evidence‑based AI Policy
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
- Representation Engineering: A Top-Down Approach to AI Transparency
- Extracting Training Data from Large Language Models
- The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
Test-time Scaling
General Agent (using LLM)