
Introduction

LiteLLM is an open-source library that provides a unified interface for calling many LLM APIs.

Integration Steps

1. Create a Helicone account and generate an API key.

2. Add your Helicone API key to a .env file:

HELICONE_API_KEY=sk-helicone-...

3. Install the required packages:

pip install litellm python-dotenv

4. Use LiteLLM with Helicone

Add the helicone/ prefix to any model name to log requests to Helicone:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Route through Helicone by adding "helicone/" prefix
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

print(response.choices[0].message.content)
5. While you’re here, why not give us a star on GitHub? It helps us a lot!

Complete Working Examples

Basic Completion

import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Simple completion
response = completion(
    model="helicone/gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a fun fact about space"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

print(response.choices[0].message.content)
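
LiteLLM returns responses in the OpenAI format, so token counts are available directly on the response object. A quick sketch, continuing from the response above (the usage fields follow the OpenAI-compatible schema):

# Token usage from the OpenAI-compatible response object
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")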

Streaming Responses

import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Streaming example
response = completion(
    model="helicone/claude-4.5-sonnet",
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint"}],
    stream=True,
    api_key=os.getenv("HELICONE_API_KEY")
)

print("🤖 Assistant (streaming):")
for chunk in response:
    if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

Custom Properties and Session Tracking

Add metadata to track and filter your requests:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

response = completion(
    model="helicone/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather like?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Session-Id": "session-abc-123",
        "Helicone-Session-Name": "Weather Assistant",
        "Helicone-User-Id": "user-789",
        "Helicone-Property-Environment": "production",
        "Helicone-Property-App-Version": "2.1.0",
        "Helicone-Property-Feature": "weather-query"
    }
)

print(response.choices[0].message.content)
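
Reusing the same Helicone-Session-Id across calls groups them into a single session in the Helicone dashboard. A minimal multi-turn sketch (the session id and prompts are placeholders):

import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

session_metadata = {
    "Helicone-Session-Id": "session-abc-123",
    "Helicone-Session-Name": "Weather Assistant"
}

messages = [{"role": "user", "content": "What's the weather like in Paris?"}]

# Turn 1
first = completion(
    model="helicone/gpt-4o-mini",
    messages=messages,
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata=session_metadata
)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Turn 2: same session id, so both requests appear under one session
messages.append({"role": "user", "content": "And tomorrow?"})
second = completion(
    model="helicone/gpt-4o-mini",
    messages=messages,
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata=session_metadata
)

print(second.choices[0].message.content)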

Provider Selection and Fallback

Helicone’s AI Gateway supports automatic failover between providers:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Automatic routing (cheapest provider)
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

# Manual provider selection
response = completion(
    model="helicone/claude-4.5-sonnet/anthropic",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

# Multiple provider fallback chain
# Try OpenAI first, then Anthropic if it fails
response = completion(
    model="helicone/gpt-4o/openai,claude-4.5-sonnet/anthropic",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
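
If every provider in a fallback chain fails, the call raises an exception like any other LiteLLM request. A defensive sketch (the broad except is deliberate, since the exact exception class depends on the failure mode):

import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

try:
    response = completion(
        model="helicone/gpt-4o/openai,claude-4.5-sonnet/anthropic",
        messages=[{"role": "user", "content": "Hello!"}],
        api_key=os.getenv("HELICONE_API_KEY")
    )
    print(response.choices[0].message.content)
except Exception as e:
    # All providers failed (or another error occurred); degrade gracefully
    print(f"LLM call failed: {e}")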

Advanced Features

Caching

Enable caching to reduce costs and latency for repeated requests:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Enable caching for this request
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true"
    }
)

print(response.choices[0].message.content)

# Subsequent identical requests will be served from cache
response2 = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true"
    }
)

print(response2.choices[0].message.content)
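
One way to verify that the second call was served from cache is to time both requests; cache hits typically return much faster. A small sketch using time.perf_counter:

import os
import time
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

def ask(prompt):
    start = time.perf_counter()
    response = completion(
        model="helicone/gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        api_key=os.getenv("HELICONE_API_KEY"),
        metadata={"Helicone-Cache-Enabled": "true"}
    )
    return response.choices[0].message.content, time.perf_counter() - start

answer1, t1 = ask("What is 2+2?")  # first call hits the upstream provider
answer2, t2 = ask("What is 2+2?")  # identical request should be a cache hit
print(f"First call: {t1:.2f}s, second call: {t2:.2f}s")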

Rate Limiting

Apply rate limiting policies to control request rates:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Rate-Limit-Policy": "basic-100"
    }
)

print(response.choices[0].message.content)
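
When a request exceeds the policy, the gateway rejects it with a rate-limit error. A hedged retry sketch with exponential backoff (catching a broad Exception, since the exact class LiteLLM surfaces can vary by provider and version):

import os
import time
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

def complete_with_backoff(messages, retries=3):
    for attempt in range(retries):
        try:
            return completion(
                model="helicone/gpt-4o",
                messages=messages,
                api_key=os.getenv("HELICONE_API_KEY"),
                metadata={"Helicone-Rate-Limit-Policy": "basic-100"}
            )
        except Exception as e:
            # Assumed: a rate-limited request surfaces as an exception here
            if attempt == retries - 1:
                raise
            wait = 2 ** attempt
            print(f"Request failed ({e}); retrying in {wait}s")
            time.sleep(wait)

response = complete_with_backoff([{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)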

Related Resources

- AI Gateway Overview: Learn about Helicone’s AI Gateway features and capabilities
- Provider Routing: Configure intelligent routing and automatic failover
- Model Registry: Browse all available models and providers
- Custom Properties: Add metadata to track and filter your requests
- Sessions: Track multi-turn conversations and user sessions
- Rate Limiting: Configure rate limits for your applications
- Caching: Reduce costs and latency with intelligent caching
- LiteLLM Documentation: Official LiteLLM documentation