Moving AI agents from prototype to production requires addressing reliability, scalability, and operational concerns. This guide covers the essential practices for production-grade agent development.
## Project Structure
Organize your agent codebase for maintainability:
```
ai_agent/
├── agents/
│   ├── __init__.py
│   ├── base.py            # Base agent class
│   ├── support_agent.py   # Specific agent implementations
│   └── research_agent.py
├── tools/
│   ├── __init__.py
│   ├── base.py            # Tool interface
│   ├── database.py        # Database tools
│   └── api.py             # External API tools
├── memory/
│   ├── __init__.py
│   ├── working.py
│   └── long_term.py
├── config/
│   ├── __init__.py
│   ├── settings.py        # Configuration management
│   └── prompts.py         # Prompt templates
├── api/
│   ├── __init__.py
│   └── routes.py          # API endpoints
├── tests/
│   ├── unit/
│   └── integration/
└── main.py
```
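The `tools/base.py` interface referenced above could look like the following. This is a minimal sketch: the `BaseTool` class, its attributes, and the `EchoTool` example are illustrative assumptions, not a prescribed API.

```python
# A possible shape for tools/base.py. Names here are illustrative
# assumptions — adapt them to your codebase.
from abc import ABC, abstractmethod
from typing import Any


class BaseTool(ABC):
    """Common interface so agents can invoke any tool uniformly."""

    name: str
    description: str

    @abstractmethod
    def execute(self, params: dict[str, Any]) -> dict[str, Any]:
        """Run the tool and return a result envelope."""


class EchoTool(BaseTool):
    """Trivial example tool: returns its input unchanged."""

    name = "echo"
    description = "Returns its input unchanged (example only)."

    def execute(self, params: dict[str, Any]) -> dict[str, Any]:
        return {"success": True, "data": params}
```

Keeping every tool behind one `execute` signature makes it straightforward to register tools with an agent and to mock them in tests.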
## Configuration Management
Use environment-based configuration:
```python
# config/settings.py
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    anthropic_api_key: str
    database_url: str
    redis_url: str

    llm_model: str = "gpt-4-turbo"
    llm_temperature: float = 0.1
    max_iterations: int = 10
    timeout_seconds: int = 300

settings = Settings()
```
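The `config/prompts.py` module from the project structure can be as simple as named templates with explicit placeholders. A minimal sketch using only the standard library; the template name and wording are assumptions:

```python
# config/prompts.py — one way to keep prompt templates out of agent code.
from string import Template

# Illustrative template; adjust wording and placeholders to your agents.
ORDER_STATUS_PROMPT = Template(
    "You are a support agent. Look up order $order_id and "
    "summarize its current status for the customer."
)

prompt = ORDER_STATUS_PROMPT.substitute(order_id="12345")
```

Centralizing templates this way lets you review and version prompts independently of the agent logic that renders them.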
## Error Handling and Retries
Implement robust error handling:
```python
from openai import APIError, RateLimitError
from tenacity import retry, stop_after_attempt, wait_exponential

class Agent:
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
        reraise=True,
    )
    async def call_llm(self, messages):
        try:
            response = await self.client.chat.completions.create(
                model=settings.llm_model,
                messages=messages,
                tools=self.tools,
            )
            return response
        except RateLimitError:
            logger.warning("Rate limit hit, retrying...")
            raise
        except APIError as e:
            logger.error(f"API error: {e}")
            raise
```
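If you prefer not to depend on a library, the same pattern is easy to hand-roll. This sketch shows what the decorator does under the hood: retry with exponential backoff, re-raising once attempts are exhausted.

```python
# Minimal hand-rolled equivalent of retry-with-exponential-backoff.
import asyncio


async def with_retries(coro_fn, attempts=3, base=1.0, cap=60.0):
    """Call coro_fn, retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_fn()
        except Exception:
            if attempt == attempts:
                raise  # exhausted: surface the final error
            # 1x, 2x, 4x ... the base delay, capped at `cap` seconds
            await asyncio.sleep(min(cap, base * 2 ** (attempt - 1)))


# Demo: a flaky call that fails twice, then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base=0.01))
```

In production you would typically retry only on transient errors (rate limits, timeouts) rather than on every exception, as the tenacity example above does via its exception handling.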
## Observability
Implement comprehensive logging and tracing:
```python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)

class Agent:
    async def run(self, task):
        with tracer.start_as_current_span("agent_run") as span:
            span.set_attribute("task", task)
            logger.info("agent_started", task=task, agent_id=self.id)
            try:
                result = await self._execute(task)
                span.set_attribute("success", True)
                logger.info("agent_completed", result_summary=str(result)[:100])
                return result
            except Exception as e:
                span.set_attribute("error", str(e))
                logger.error("agent_failed", error=str(e))
                raise
```
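Alongside traces and logs, you will want to record token usage so cost dashboards have data. A minimal in-process sketch; the per-1K prices in the demo are illustrative placeholders, not real rates:

```python
# Accumulate token counts per model for cost reporting.
from collections import defaultdict


class UsageTracker:
    """Tracks prompt/completion tokens per model."""

    def __init__(self):
        self.tokens = defaultdict(lambda: {"prompt": 0, "completion": 0})

    def record(self, model, prompt_tokens, completion_tokens):
        self.tokens[model]["prompt"] += prompt_tokens
        self.tokens[model]["completion"] += completion_tokens

    def cost(self, model, prompt_price, completion_price):
        """Cost given per-1K-token prices (placeholder values in the demo)."""
        t = self.tokens[model]
        return (t["prompt"] * prompt_price
                + t["completion"] * completion_price) / 1000


tracker = UsageTracker()
# Usage figures would normally come from the LLM response's usage field.
tracker.record("gpt-4-turbo", 1000, 500)
run_cost = tracker.cost("gpt-4-turbo", prompt_price=0.01, completion_price=0.03)
```

In a real deployment you would export these counters to your metrics backend rather than keep them in process memory.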
## Testing Strategy
### Unit Tests
```python
def test_tool_execution():
    tool = DatabaseQueryTool(mock_db)
    result = tool.execute({"query": "SELECT * FROM users LIMIT 1"})
    assert result["success"] is True
    assert len(result["data"]) == 1
```
### Integration Tests
```python
@pytest.mark.integration
async def test_agent_workflow():
    agent = SupportAgent()
    result = await agent.run("What is the status of order #12345?")
    assert "order" in result.lower()
    assert any(status in result.lower() for status in ["pending", "shipped", "delivered"])
```
### LLM Response Mocking
```python
@pytest.fixture
def mock_llm_response():
    return {
        "choices": [{
            "message": {
                "content": None,
                "tool_calls": [{
                    "function": {
                        "name": "get_order_status",
                        "arguments": '{"order_id": "12345"}'
                    }
                }]
            }
        }]
    }
```
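A fixture like this pairs naturally with a test of the agent's tool-call parsing. The helper below is a sketch of what such a parser might look like for a chat-completion-style response; `extract_tool_calls` is an assumed name, not part of any library.

```python
# Parse (name, arguments) pairs out of a chat-completion-style response dict.
import json


def extract_tool_calls(response: dict) -> list:
    """Return [(tool_name, parsed_arguments), ...] from a response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Same shape as the mock_llm_response fixture above.
mock = {
    "choices": [{
        "message": {
            "content": None,
            "tool_calls": [{
                "function": {
                    "name": "get_order_status",
                    "arguments": '{"order_id": "12345"}',
                }
            }]
        }
    }]
}

calls = extract_tool_calls(mock)
```

Testing the parser against the fixture verifies your tool-dispatch logic without spending tokens on a live LLM call.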
## Deployment Considerations
### Containerization
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Health Checks
```python
@app.get("/health")
async def health_check():
    checks = {
        "llm_api": await check_llm_connection(),
        "database": await check_db_connection(),
        "memory_store": await check_redis_connection(),
    }
    healthy = all(checks.values())
    return {"healthy": healthy, "checks": checks}
```
## Monitoring Dashboards
Track key metrics:
- Request latency (p50, p95, p99)
- LLM token usage and costs
- Tool execution success rates
- Error rates by type
- Active sessions and queue depth
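The latency percentiles above are computed from recorded samples. In production you would use your metrics backend, but a minimal nearest-rank sketch shows the idea; the sample latencies are made up for the demo:

```python
# Nearest-rank percentile over recorded latency samples (milliseconds).
import math


def percentile(samples, p):
    """Return the nearest-rank p-th percentile of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]


# Illustrative samples: one slow outlier dominates the tail.
latencies = [120, 95, 310, 88, 102, 2400, 150, 99, 101, 97]
p50 = percentile(latencies, 50)  # typical request
p95 = percentile(latencies, 95)  # tail latency
```

The gap between p50 and p95 is exactly why agents need tail-latency monitoring: a handful of slow LLM calls or tool timeouts can dwarf the median.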
Production agents require the same rigor as any mission-critical system. Build for failure, monitor everything, and iterate based on real-world performance.