Documentation
The Cosavu API provides programmatic access to our state-of-the-art prompt optimization engine. Build cost-efficient LLM agents by reducing noise and enforcing structural integrity at the source.
Authentication
The Cosavu API uses Bearer Authentication. All requests must include an API key in the request header. You can manage your API keys in the developer dashboard.
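In a client, the header can be assembled like this (a minimal Python sketch; the key value is a placeholder, not a real credential):

```python
def auth_headers(api_key: str) -> dict:
    """Build the headers Cosavu expects: Bearer auth plus JSON content type."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = auth_headers("cosavu_live_key_PLACEHOLDER")
```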
Authorization: Bearer cosavu_live_key****************

Base URL
All API requests should be made to the following base endpoint:

https://api.cosavu.com/v1
Rate Limits
Rate limits vary by plan tier. Limits are applied per API key on a per-minute basis.
| Plan | Limit (RPM) | Window |
|---|---|---|
| Free | 10 | 60 seconds |
| Pro | 100 | 60 seconds |
| Enterprise | Custom | Continuous |
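Clients should pace requests to their plan's RPM and back off when throttled. The sketch below assumes throttling surfaces as HTTP 429 (this page does not specify the error status, so treat that as an assumption) and shows a generic exponential backoff:

```python
import time

def call_with_backoff(send, max_retries=4, base_delay=1.0):
    """Retry `send()` with exponential backoff when it reports throttling.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute (e.g. a requests.Response). Assumes the API
    signals rate limiting with HTTP 429, which this page does not confirm.
    """
    for attempt in range(max_retries):
        resp = send()
        if resp.status_code != 429:
            return resp
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return send()  # final attempt, surfaced to the caller
```

On the Free tier (10 RPM), spacing requests at least six seconds apart avoids the limiter entirely.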
POST /optimize
Submits raw prompt text to the optimization engine. The engine decomposes the input into structural blocks, refines instructions, and strips redundant tokens.
Request Body

The request body is a JSON object containing the prompt text to optimize and a target_model identifier, as shown in the example below.
cURL Example
curl -X POST https://api.cosavu.com/v1/optimize \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a summary of the quarterly report focus on finance.",
"target_model": "gpt-4"
}'

Response: Prompt IR Object
The engine returns a Prompt Intermediate Representation (IR) consisting of identified blocks and token metadata.
{
"original_text": "...",
"blocks": [
{
"block_type": "INSTRUCTION",
"content": "Summarize Q3 financial report...",
"original_tokens": 42,
"optimized_tokens": 12,
"is_compressed": true
}
],
"total_original_tokens": 42,
"total_optimized_tokens": 12,
"latency_ms": 284.5
}

Block Types

Each block in the IR carries a block_type label; the example above shows an INSTRUCTION block.
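Client code can aggregate the IR by block type, for example to report where compression happened. A minimal sketch, assuming only the JSON shape shown above:

```python
from collections import defaultdict

def savings_by_type(ir: dict) -> dict:
    """Sum token savings (original - optimized) per block_type in a Prompt IR."""
    savings = defaultdict(int)
    for block in ir["blocks"]:
        savings[block["block_type"]] += (
            block["original_tokens"] - block["optimized_tokens"]
        )
    return dict(savings)

# The IR example from above, abbreviated:
ir = {
    "blocks": [
        {"block_type": "INSTRUCTION",
         "content": "Summarize Q3 financial report...",
         "original_tokens": 42, "optimized_tokens": 12,
         "is_compressed": True}
    ],
    "total_original_tokens": 42,
    "total_optimized_tokens": 12,
}
```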
Error Handling
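This page does not document Cosavu's error payloads, so treat the following as a defensive client pattern rather than the API's contract: check the HTTP status, and fall back to the raw body if the error response is not JSON (the "error" key below is hypothetical):

```python
import json

class CosavuError(Exception):
    """Raised for non-2xx responses; the error body format is an assumption."""

def parse_response(status_code: int, body: str) -> dict:
    """Return decoded JSON for 2xx responses, raise CosavuError otherwise."""
    if 200 <= status_code < 300:
        return json.loads(body)
    try:
        detail = json.loads(body).get("error", body)  # "error" key is hypothetical
    except (json.JSONDecodeError, AttributeError):
        detail = body  # body was not a JSON object; surface it raw
    raise CosavuError(f"HTTP {status_code}: {detail}")
```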
/health
Returns the current operational status of the optimization clusters.
{
"status": "ok",
"engine": "Cosavu-Cluster-Alpha-7",
"version": "1.2.0"
}

Best Practices
Explicit Structure Separation
The optimizer works best when background data is clearly distinct from instructions. Use clear headers like DATA: or INFO: in your prompts.
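For example, a prompt can be assembled with an explicit DATA: section so the optimizer can tell background context from instruction (the helper below is illustrative, not part of the API):

```python
def build_prompt(instruction: str, data: str) -> str:
    """Join an instruction and background data with an explicit DATA: header."""
    return f"{instruction}\n\nDATA:\n{data}"

prompt = build_prompt(
    "Summarize the quarterly report, focusing on finance.",
    "Q3 revenue grew 12% quarter-over-quarter...",
)
```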
Target Specific Models
Setting the target_model parameter allows the engine to strip tokens known to be redundant for that specific architecture (e.g., removing excessive formatting for GPT-4).
Latency Management
Optimization takes 200-500ms on average. For real-time chat applications, we recommend optimizing system prompts asynchronously or during agent initialization, rather than on every user turn.
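One way to follow this advice is to optimize the system prompt once when the agent starts and cache the result, so per-turn latency is unaffected. A sketch, where the `optimize` callable stands in for a real call to /optimize and is hypothetical:

```python
class Agent:
    """Caches an optimized system prompt at construction time.

    `optimize` is any callable mapping prompt text to optimized text,
    e.g. a wrapper around POST /optimize; no network call is made here.
    """
    def __init__(self, system_prompt: str, optimize):
        # Pay the 200-500 ms optimization cost once, at initialization.
        self.system_prompt = optimize(system_prompt)

    def turn(self, user_message: str) -> list:
        # The per-turn path reuses the cached prompt; no optimizer round-trip.
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_message},
        ]
```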