Router Replay: Debugging and Observability for vLLM Semantic Router
Expanding vLLM Semantic Router’s Observability
Today, we’re excited to introduce Router Replay — a powerful observability plugin for vLLM Semantic Router that captures routing decisions, matched signals, selected models, and request/response payloads for comprehensive debugging and analysis.
Debugging routing decisions in production LLM systems can be challenging. When a request is routed to the wrong model, or when you need to understand why a specific routing decision was made, having visibility into the decision-making process is crucial. Router Replay solves this by automatically capturing all routing metadata in a single, queryable record.
Router Replay enables you to:
- Debug routing issues by inspecting exactly which signals matched and why a specific model was selected
- Audit routing decisions by maintaining a record of all routing choices
- Analyze routing patterns to understand how your router behaves in production
- Replay scenarios by examining captured request/response pairs to reproduce issues
Why Router Replay?
In a typical vLLM Semantic Router deployment, requests flow through multiple decision points:
- Signal Matching: Keywords, embeddings, domains, and other signals are evaluated
- Decision Selection: The router selects which decision (route) to use based on matched signals
- Model Selection: A specific model is chosen from the decision’s model references
- Request Processing: The request is forwarded to the selected model
- Response Handling: The response is processed and returned
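To make the flow concrete, here is a deliberately simplified, hypothetical Python sketch of these five steps. The route names and openai/gpt-oss-120b come from the example configuration later in this post; the function, the keyword list, and the "general-model" placeholder are invented for illustration and do not come from the vLLM Semantic Router codebase. The point is only to show how much of the path a single replay record can capture.
# Hypothetical, heavily simplified illustration of the five steps above.
# The helper names and "general-model" are invented for illustration; only
# the route names and openai/gpt-oss-120b appear in this post's examples.
from dataclasses import dataclass, field

@dataclass
class ReplayRecord:
    decision: str
    selected_model: str
    signals: dict = field(default_factory=dict)
    response_status: int = 0

def route(request_text: str) -> ReplayRecord:
    # 1. Signal matching (keyword matching only, for brevity)
    signals = {"keyword": [w for w in ("calculate", "equation") if w in request_text.lower()]}
    # 2. Decision selection based on the matched signals
    decision = "math_route" if signals["keyword"] else "general_route"
    # 3. Model selection from the decision's model references
    model = {"math_route": "openai/gpt-oss-120b", "general_route": "general-model"}[decision]
    # 4./5. Request processing and response handling would happen here;
    #       Router Replay records the metadata gathered along the way.
    return ReplayRecord(decision=decision, selected_model=model,
                        signals=signals, response_status=200)

print(route("Calculate 2+2 and solve the equation"))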
Without Router Replay, understanding what happened at each step requires extensive logging, correlation of multiple log files, and manual reconstruction of the routing flow. Router Replay captures all of this information in a single, queryable record.
The Problem: Invisible Routing Decisions
Consider a scenario where you’re running a production router with multiple decisions:
- math_route: Routes math queries to a reasoning-capable model
- code_route: Routes programming questions to a code-specialized model
- general_route: Handles general-purpose queries
When a user reports that their math question was routed incorrectly, you need to answer:
- Which signals matched?
- Why was this decision selected?
- What was the original request?
- Which model was actually used?
- Was the response cached?
Without Router Replay, answering these questions requires piecing together information from multiple sources — router logs, Envoy logs, and application logs — which is time-consuming and error-prone.
How Router Replay Works
Router Replay is implemented as a plugin that integrates seamlessly into your routing decisions. When enabled, it automatically captures routing metadata for every request that matches a decision with the plugin configured.
Basic Setup
Getting started with Router Replay is straightforward. First, initialize your configuration:
# Install vLLM Semantic Router CLI
pip install vllm-sr
# Initialize configuration
vllm-sr init
The initialization creates a config.yaml file with example configurations. You can add Router Replay to any decision by including it in the plugins section:
decisions:
  - name: "math_route"
    description: "Route math queries with reasoning"
    priority: 100
    rules:
      operator: "OR"
      conditions:
        - type: "keyword"
          name: "math_keywords"
        - type: "domain"
          name: "math"
    modelRefs:
      - model: "openai/gpt-oss-120b"
        use_reasoning: true
    plugins:
      - type: "router_replay"
        configuration:
          enabled: true
          max_records: 200              # Maximum records in memory
          capture_request_body: true    # Capture request payloads
          capture_response_body: false  # Capture response payloads
          max_body_bytes: 4096          # Max bytes per body (truncates if exceeded)
Configuration Options
Router Replay offers flexible configuration options:
- enabled: Enable or disable the plugin (default: true)
- max_records: Maximum number of records to keep in memory (default: 200, must be > 0)
- capture_request_body: Whether to capture request payloads (default: false for privacy)
- capture_response_body: Whether to capture response payloads (default: false)
- max_body_bytes: Maximum bytes to capture per body; truncates if exceeded (default: 4096)
Validating Your Configuration
Before deploying, validate your configuration:
vllm-sr validate --config config.yaml
This command checks that:
- Plugin configurations are valid
- Field constraints are satisfied (e.g., max_records > 0)
- All required fields are present
- Plugin types are correctly specified
Example validation output:
✓ Configuration is valid
Summary:
Decisions: 3
Plugins: 5 total (3 decisions)
Types: router_replay: 1, semantic-cache: 2, system_prompt: 2
Models: 2
Signals: 4
Starting the Router
Once configured, start the router:
vllm-sr serve
The router will start with Router Replay enabled for all configured decisions. Each routing decision that matches a request will automatically create a replay record.
Accessing Replay Records
Router Replay exposes a RESTful API endpoint to query captured records. The endpoint is available at /v1/router_replay through Envoy (default port: 8888), which routes requests through the ext_proc pipeline to the router.
Note: Router Replay is currently accessible via the API endpoint. Dashboard UI integration for visualizing replay records is planned for future releases. For now, you can query records with tools like curl and jq, or programmatically, as sketched after the example response below.
Listing All Records
To retrieve all captured records:
curl http://localhost:8888/v1/router_replay
Response:
{
"object": "router_replay.list",
"count": 5,
"data": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"timestamp": "2025-01-15T10:30:45Z",
"request_id": "req_abc123",
"decision": "math_route",
"category": "mathematics",
"original_model": "openai/gpt-oss-120b",
"selected_model": "openai/gpt-oss-120b",
"reasoning_mode": "high",
"signals": {
"keyword": ["calculate", "equation"],
"embedding": [],
"domain": ["mathematics"]
},
"request_body": "{\"messages\":[{\"role\":\"user\",\"content\":\"Calculate 2+2\"}]}",
"response_status": 200,
"from_cache": false,
"streaming": false
}
]
}
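If you would rather script this than use curl, here is a minimal Python sketch for listing records. It assumes only the /v1/router_replay endpoint and the response shape shown above, plus the third-party requests package.
# Minimal sketch: list replay records through Envoy (default port 8888).
# Assumes the /v1/router_replay endpoint and response shape shown above,
# plus the third-party "requests" package (pip install requests).
import requests

resp = requests.get("http://localhost:8888/v1/router_replay", timeout=10)
resp.raise_for_status()
payload = resp.json()

print(f"captured records: {payload['count']}")
for record in payload["data"]:
    print(record["id"], record["decision"], record["selected_model"])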
Retrieving a Specific Record
To get a specific record by ID:
curl http://localhost:8888/v1/router_replay/550e8400-e29b-41d4-a716-446655440000
Understanding the Record Structure
Each replay record contains:
- Routing Metadata:
  - decision: The decision (route) that was matched
  - category: The semantic category selected
  - original_model: The model requested in the original request
  - selected_model: The model actually used
  - reasoning_mode: Whether reasoning was enabled and at what effort level
- Signal Information:
  - signals.keyword: Matched keyword signals
  - signals.embedding: Matched embedding signals
  - signals.domain: Matched domain signals
  - signals.fact_check: Matched fact-check signals
  - signals.user_feedback: Matched user feedback signals
  - signals.preference: Matched preference signals
- Request/Response Data (if enabled):
  - request_body: The original request payload (truncated if it exceeds max_body_bytes)
  - response_body: The response payload (if capture_response_body is enabled)
  - request_body_truncated: Boolean indicating whether the request body was truncated
  - response_body_truncated: Boolean indicating whether the response body was truncated
- Status Information:
  - response_status: HTTP status code
  - from_cache: Whether the response came from cache
  - streaming: Whether the response was streamed
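If you consume these records from Python, a TypedDict along the following lines keeps the field names in one place. This is an illustrative sketch derived from the example record and the field list above, not an official schema.
# Illustrative sketch of a replay record as a TypedDict, derived from the
# example JSON and field descriptions in this post; not an official schema.
from typing import TypedDict

class SignalInfo(TypedDict, total=False):
    keyword: list[str]
    embedding: list[str]
    domain: list[str]
    fact_check: list[str]
    user_feedback: list[str]
    preference: list[str]

class ReplayRecord(TypedDict, total=False):
    id: str
    timestamp: str
    request_id: str
    decision: str
    category: str
    original_model: str
    selected_model: str
    reasoning_mode: str
    signals: SignalInfo
    request_body: str
    response_body: str
    request_body_truncated: bool
    response_body_truncated: bool
    response_status: int
    from_cache: bool
    streaming: bool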
Advanced Configuration
Storage Backends
By default, Router Replay uses in-memory storage, which is perfect for development and debugging. The in-memory backend is production-ready and provides fast access to recent routing decisions.
For production deployments requiring persistence across restarts or distributed storage, Router Replay supports additional storage backends including Redis, PostgreSQL, and Milvus. These backends are available in the codebase and can be configured via the store_backend option in your configuration. For detailed configuration examples and setup instructions, refer to the vLLM Semantic Router documentation.
Per-Decision Isolation
Router Replay automatically creates isolated storage per decision. This means:
- Each decision maintains its own record collection
- Records are prefixed with the decision name
- You can configure different storage backends per decision
- Records are automatically namespaced to prevent collisions
Privacy and Security Considerations
Router Replay respects privacy by default:
- Request/response bodies are not captured by default: you must explicitly enable capture_request_body or capture_response_body
- Body truncation: even when enabled, bodies are truncated to max_body_bytes to prevent excessive storage
- TTL support: persistent backends support automatic expiration via ttl_seconds
- Isolated storage: each decision's records are stored separately
For production deployments handling sensitive data:
plugins:
  - type: "router_replay"
    configuration:
      enabled: true
      capture_request_body: false   # Don't capture request bodies
      capture_response_body: false  # Don't capture response bodies
      # Only capture routing metadata
Real-World Use Cases
Use Case 1: Debugging Incorrect Routing
Scenario: A user reports that their math question was routed to the general model instead of the math-specialized model.
Solution: Query Router Replay records to see what happened:
# Get the most recent records
curl http://localhost:8888/v1/router_replay | jq '.data[0]'
Inspect the record to see:
- Which signals matched (or didn’t match)
- Why the decision was selected
- What the original request looked like
This helps you understand if:
- The keyword signal wasn’t configured correctly
- The embedding similarity score fell below the configured threshold
- The domain classifier missed the math domain
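A few lines of Python make this inspection easy to script. The sketch below reuses the /v1/router_replay endpoint and record fields shown earlier, and treats data[0] as the most recent record, mirroring the jq example above.
# Sketch: print the matched signals for the most recent replay record.
# Reuses the /v1/router_replay endpoint and record fields shown earlier;
# data[0] is treated as the most recent record, as in the jq example above.
import requests

records = requests.get("http://localhost:8888/v1/router_replay", timeout=10).json()["data"]
latest = records[0]

print("decision:      ", latest["decision"])
print("selected model:", latest["selected_model"])
for signal_type, matches in latest["signals"].items():
    print(f"{signal_type:>13}: {matches or 'no match'}")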
Use Case 2: Analyzing Routing Patterns
Scenario: You want to understand which models are being used most frequently and for what types of queries.
Solution: Query Router Replay records and analyze the data:
# Get all records and analyze
curl http://localhost:8888/v1/router_replay | jq '.data[] | {decision, selected_model, category}'
This gives you insights into:
- Model usage distribution
- Category distribution
- Decision selection patterns
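If you prefer Python over jq for this kind of analysis, a sketch like the following aggregates records by model, category, and decision. It assumes the same endpoint and record fields as before.
# Sketch: aggregate replay records by selected model, category, and decision.
# Assumes the /v1/router_replay endpoint and record fields shown earlier.
from collections import Counter
import requests

records = requests.get("http://localhost:8888/v1/router_replay", timeout=10).json()["data"]

print("models:    ", Counter(r["selected_model"] for r in records).most_common())
print("categories:", Counter(r["category"] for r in records).most_common())
print("decisions: ", Counter(r["decision"] for r in records).most_common())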
Use Case 3: Reproducing Production Issues
Scenario: A user reports a specific error, and you need to reproduce it locally.
Solution: Use Router Replay to capture the exact request:
# Get the specific record
curl http://localhost:8888/v1/router_replay/{record_id} | jq '.request_body'
Then replay the request locally to debug the issue.
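One way to do this is sketched below: pull the captured request_body and POST it to a locally running OpenAI-compatible endpoint. The record ID is the example from earlier in this post, the local URL is a placeholder you would adapt to your environment, and request_body is only present if capture_request_body was enabled.
# Sketch: replay a captured request body against a local OpenAI-compatible
# endpoint. The record ID is the example from this post; the local URL is a
# placeholder. request_body is only present if capture_request_body is true.
import json
import requests

RECORD_ID = "550e8400-e29b-41d4-a716-446655440000"
record = requests.get(
    f"http://localhost:8888/v1/router_replay/{RECORD_ID}", timeout=10
).json()

body = json.loads(record["request_body"])
replayed = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder local endpoint
    json=body,
    timeout=60,
)
print(replayed.status_code, replayed.json())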
Use Case 4: Auditing and Compliance
Scenario: You need to maintain an audit log of routing decisions for compliance.
Solution: Configure Router Replay with a persistent storage backend and a long TTL, or use the in-memory backend with a max_records value large enough to retain recent audit history:
plugins:
  - type: "router_replay"
    configuration:
      enabled: true
      max_records: 10000            # Increase for longer audit history
      capture_request_body: false   # Privacy: don't capture bodies
      capture_response_body: false
      # Only capture routing metadata for compliance
Performance Considerations
Router Replay is designed to be lightweight and non-intrusive:
- Minimal overhead: Recording happens asynchronously and doesn’t block request processing
- Configurable storage: Use in-memory for development, persistent backends for production
- Body truncation: Automatically truncates large bodies to prevent storage bloat
- TTL support: Automatic expiration prevents unbounded growth
- Async writes: Optional async writes for persistent backends improve performance
For high-throughput scenarios:
plugins:
  - type: "router_replay"
    configuration:
      enabled: true
      capture_request_body: false   # Reduce storage
      capture_response_body: false  # Reduce storage
      async_writes: true            # Non-blocking writes
      max_records: 100              # Smaller memory footprint
Get Started Today
Router Replay is available now in vLLM Semantic Router. To get started:
- Install the CLI: pip install vllm-sr
- Initialize configuration: vllm-sr init
- Enable Router Replay in your decisions
- Validate: vllm-sr validate --config config.yaml
- Start the router: vllm-sr serve
- Query records: curl http://localhost:8888/v1/router_replay
For more information, see the vLLM Semantic Router documentation and the Router Replay API reference.
Acknowledgments
Special thanks to the vLLM Semantic Router community for feedback and contributions. Router Replay was designed based on real-world debugging needs reported by users deploying production routing systems.
We’re incredibly excited about this new observability capability, and we can’t wait to see how you use Router Replay to debug and optimize your routing systems. Let’s get to work!