Lectures
You can download the lectures here. We will try to upload lectures prior to their corresponding classes.
-
Introduction to Agent and Inference Time Scaling
tl;dr: We discuss basics about LLM agent; and a brief introduction to DeepSeek-R1.
[codes] [Basics] [Research]
Suggested Readings:
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters
- Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement
- AgentTuning: Enabling Generalized Agent Abilities for LLMs
- Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
- 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient.
-
LLM Agents: Why, What, How
tl;dr: We discuss why LLM agents are designed, what their key components are, and how to design LLM agents for specific domains.
[Example] [Basics]
Suggested Readings:
- ReAct: Synergizing Reasoning and Acting in Language Models
- Executable Code Actions Elicit Better LLM Agents
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
- SWE-Bench Leaderboard
- Agentless: Demystifying LLM-based Software Engineering Agents
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- OpenAI Tool Calling
- Micro-agent
- Introducing Gemini 2.0: our new AI model for the agentic era
- Llama 3.3
-
Multi-Agent, Multimodal Frameworks
tl;dr: We discuss design choices for multi-agent, multimodal frameworks.
[Basics]
Suggested Readings:
- Multi-Agent AI Enables Emergent Cognition and Real-Time Knowledge Synthesis in Science and Engineering
- Building effective agents
- StateFlow - Build State-Driven Workflows with Customized Speaker Selection in GroupChat
- Self-Reflective RAG with LangGraph
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Multi-Modal RAG
- Evaluate RAG with LlamaIndex
- Evaluating Multi-Modal Retrieval-Augmented Generation
- SWE-Bench
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
- ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
- ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
- ToolBench
- Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
- WebArena
-
LLM agents and reasoning benchmarks
tl;dr: We propose a framework over current evaluation benchmarks, and discuss potential research topics.
[Basics]
Suggested Readings:
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning
- AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
- Measuring Mathematical Problem Solving With the MATH Dataset
- AIME: AI System Optimization via Multiple LLM Evaluators
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
-
Safe and Trustworthy Agents
tl;dr: We talk about trust taxonomy, attacks, and defenses.
[Basics]
Suggested Readings:
- Towards Trustworthy AI: A Review of Ethical and Robust Large Language Models
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
- Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks
- A Survey on Trustworthy LLM Agents: Threats and Countermeasures
- DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models
- BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
- Unveiling Privacy Risks in LLM Agent Memory
- Evil Geniuses: Delving into the Safety of LLM-based Agents
- Extracting Training Data from Large Language Models
- OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning
- StruQ: Defending Against Prompt Injection with Structured Queries
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action
- A Path for Science and Evidence-based AI Policy