Feature Request
Please provide an embedding/vectorization API, similar to OpenAI’s embedding endpoints, in the Perplexity developer ecosystem. The goal is for developers to generate and retrieve vector embeddings for their own text/document chunks, enabling storage in local (on-premise) vector databases such as FAISS and supporting retrieval-augmented generation (RAG) use cases.
Problem Statement
Currently, there is no direct way to send text to Perplexity and receive vector embeddings suitable for use with self-hosted RAG pipelines. This limits the ability to index private or regulated data on-premise, hinders enterprise adoption, and restricts full control and data sovereignty for developers building secure applications.
Proposed Solution
- Introduce a dedicated API endpoint for text-to-vector embedding using Perplexity models (e.g., Sonar models).
- Support batch processing for efficient vectorization of large document sets.
- The embedding endpoint should return standardized dense vector representations (compatible with FAISS, Qdrant, Milvus, etc.).
- No remote storage: returned vectors are stored, indexed, and searched entirely client-side, enabling on-prem RAG development.
- Example workflow (a hedged sketch follows this list):
  - Developer sends proprietary text/document “chunks” to the API.
  - API responds with embeddings (dense vector arrays).
  - Developer stores and manages these vectors in their own FAISS index to support secure RAG queries and search.
- Clear documentation with examples for Python and other common programming languages.
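A minimal Python sketch of that workflow, assuming a hypothetical POST /embeddings endpoint; the URL, model name, and JSON fields below are placeholders, since no such endpoint exists today:

```python
import requests

API_URL = "https://api.perplexity.ai/embeddings"  # hypothetical endpoint
API_KEY = "pplx-..."  # your Perplexity API key

# Batch of proprietary document chunks to vectorize in a single call.
chunks = [
    "First proprietary document chunk.",
    "Second proprietary document chunk.",
]

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "sonar-embed", "input": chunks},  # hypothetical payload shape
    timeout=30,
)
resp.raise_for_status()

# Assumed response shape: {"data": [{"index": 0, "embedding": [...]}, ...]}
embeddings = [item["embedding"] for item in resp.json()["data"]]
```

The returned vectors would then be written to a local FAISS (or Qdrant/Milvus) index, entirely under the developer’s control.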
API Impact
- Which API component is affected?
  A new `embeddings` or `vectorize` endpoint, separate from chat completions or document search.
- Is this related to a specific model?
  Ideally available for all current Perplexity language models (Sonar, Sonar Pro, etc.).
- Would this require new API parameters or changes to existing ones?
  - New endpoint: likely `/embeddings` or similar; one possible request/response shape is sketched below this list.
  - Input: text (string or list of strings), optional model selection.
  - Output: vector(s).
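For concreteness, one possible request/response shape, modeled loosely on existing embedding APIs; every field and model name below is an assumption for illustration, not a committed design:

```python
# Hypothetical request body for POST /embeddings -- all field names are assumptions.
request_body = {
    "model": "sonar-embed",               # optional; hypothetical model identifier
    "input": ["chunk one", "chunk two"],  # a single string or a list of strings
}

# Hypothetical response body: one dense float vector per input string.
response_body = {
    "model": "sonar-embed",
    "data": [
        {"index": 0, "embedding": [0.013, -0.027, ...]},  # truncated for brevity
        {"index": 1, "embedding": [0.004, 0.051, ...]},
    ],
}
```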
Alternatives Considered
- Workarounds attempted (see the sketch after this list):
  - Using third-party public embedding models (e.g., OpenAI, Hugging Face): introduces a dependency on external providers and may not align with Perplexity’s retrieval/generation pipeline.
  - Manual chunking and storage: cannot leverage Perplexity’s own vector space, limiting quality and compatibility.
- Why these are insufficient:
  - No seamless integration between Perplexity’s generation models and custom RAG pipelines.
  - Data sovereignty, compliance, and latency requirements rule out cloud-based or external API-only solutions for some use cases.
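For reference, the third-party workaround above typically looks like the following, using the open-source sentence-transformers library (the model name is just an example); it works, but the resulting vectors live in a vector space unrelated to Perplexity’s models:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Workaround: embeddings come from an external open-source model, not Perplexity.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example third-party model
vectors = model.encode(
    ["First proprietary chunk.", "Second proprietary chunk."],
    normalize_embeddings=True,  # unit-length vectors for cosine similarity
)
print(vectors.shape)  # (2, 384) for this particular model
```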
Additional Context
- On-prem embedding APIs are highly requested in regulated industries (finance, legal, healthcare) for confidential, low-latency RAG and search solutions.
- Existing offerings from OpenAI (text-embedding-ada-002), Cohere, and others demonstrate clear demand and value.
- Adding this would broaden Perplexity’s adoption by allowing secure enterprise workflows and hybrid retrieval.
- Batch/bulk endpoints and compatibility with open-source vector databases (FAISS, Qdrant, Milvus) are highly desirable; a client-side FAISS sketch follows below.
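As a sketch of the client-side half, here is a minimal FAISS indexing and search loop; the embedding dimension and the random vectors are stand-ins for output from the proposed endpoint:

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 1024  # assumed embedding dimension; the real value would be set by the API

# Stand-ins for vectors returned by the proposed batch embeddings endpoint.
doc_vectors = np.random.rand(100, DIM).astype("float32")
query_vector = np.random.rand(1, DIM).astype("float32")

# Normalize so inner-product search is equivalent to cosine similarity.
faiss.normalize_L2(doc_vectors)
faiss.normalize_L2(query_vector)

index = faiss.IndexFlatIP(DIM)  # exact inner-product index, stored entirely on-prem
index.add(doc_vectors)

scores, ids = index.search(query_vector, 5)  # top-5 most similar chunks
print(ids[0], scores[0])
```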
Thank you for considering this request to make Perplexity models even more accessible and useful for advanced, privacy-focused RAG applications!