Feature Bundle Request: Explicit Prompt Caching, Multi-Model Fusion, and Granular API

I am a developer actively building on the Perplexity API platform. First of all, thank you for providing such a robust platform combining flagship models with your exceptional live-search synthesis infrastructure.

As our autonomous agents and production workflows become more sophisticated, we have identified three critical feature gaps that prevent us from scaling our applications optimally on Perplexity compared to other developer platforms. I would like to formally request the following infrastructure enhancements:

1. Feature Request: Support for Explicit Prompt Caching (cache_control)

  • Core Problem: When deploying complex, multi-turn AI agents or specialized development workflows, we frequently pass massive context strings—such as extensive codebases, enterprise guidelines, or static legal docs—repeatedly across conversation turns. Because the API currently standardizes and abstracts provider-level caching properties, we are forced to pay full flat rates for identical input prefixes on every single multi-turn transaction. This lack of exposed ephemeral token tracking heavily inflates development costs for dense context processing.

  • Suggested Implementation:

    • Allow the request body to pass parameters like "cache_control": {"type": "ephemeral"} down to underlying partner models (like Claude) that natively support explicit cache breakpoints.

    • Expose cached_tokens and cache_write_tokens inside the returned usage metadata block so developers can audit cache efficiency dynamically.

    • Maintain connection stickiness to the specific background model host instance when a continuous cache snapshot is active.

2. Feature Request: Multi-Model “Model Council / Fusion Plugin” Endpoint

  • Core Problem: When querying high-stakes engineering logic, complex mathematical proofs, or strategic industry analysis, a single model—even when utilizing internal search agents—is prone to localized blind spots, premature rounding errors, or semantic drift. While the sonar family is exceptional at combining live search with synthesis, it limits the user to a singular model’s logical lane for the final generation block. Managing multiple raw instances manually via client-side code introduces massive network latency, dual API key management overhead, and synchronization bottlenecks.

  • Suggested Implementation:

    • Allow developers to define an array of background models to be invoked simultaneously for a single query (e.g., ["anthropic/claude-3-5-sonnet", "openai/gpt-4o"]).

    • Allow specification of a top-tier agentic model tasked purely with compiling, cross-referencing, and generating the ultimate unified response string.

    • Provide clear markers in the streaming chunks showing consensus points, explicit logical contradictions found between models, and unique isolated arguments.

3. Feature Request: Enterprise-Grade API Key Controls (Granular Budgeting and Model Locking)

  • Core Problem: Currently, the developer dashboard provides a master API Key framework. If an application key is inadvertently exposed, or if an autonomous agent runs into an asynchronous infinite loop (runaway code cycle), the global account credit pool can be entirely drained within minutes. Furthermore, when sharing API access with staging environments, automated testing suites, or clients, it is impossible to prevent them from calling highly expensive reasoning/deep-research endpoints, which heavily strains budget predictability.

  • Suggested Implementation:

    • Implement the option to assign hard spending limits (e.g., Max $10 per day/week/month) to specific API keys that automatically cut off execution when breached.

    • Enable a togglable menu on each generated key to specify exactly which models it is authorized to call (e.g., lock a key to lightweight models and completely ban the invocation of premium reasoning/agentic models).

    • Allow developers to partition balances cleanly between separate test project scopes without forcing the creation of multiple distinct Perplexity corporate accounts.

Implementing these features would dramatically improve budget predictability, lower latency for dense contexts, and cement Perplexity API as the premier tool for building advanced, production-grade AI agents.