Hello, for my use case I need very low latency with llama-3.1-sonar-huge-128k-online, but currently the model is very slow to respond in streaming requests. The 70B would already be fast enough, but it is not comparable to state-of-the-art models like GPT-4o.
Hey, we know that sonar-huge is rather slow. I recommend using sonar-large instead, especially if latency is a priority for you; hopefully answer quality does not degrade substantially.
You can expect sonar-huge to remain around the current speed for the foreseeable future.
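If it helps, here is a minimal sketch of what the switch looks like in a streaming request. It assumes you are calling the OpenAI-compatible chat completions endpoint at https://api.perplexity.ai through the openai Python client; the PPLX_API_KEY variable and the example prompt are placeholders, not anything confirmed in this thread.

```python
# Minimal sketch: swap the model to llama-3.1-sonar-large-128k-online in a
# streaming chat completion. Assumes the OpenAI-compatible endpoint at
# https://api.perplexity.ai and an API key in PPLX_API_KEY (both assumptions).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PPLX_API_KEY"],
    base_url="https://api.perplexity.ai",
)

stream = client.chat.completions.create(
    model="llama-3.1-sonar-large-128k-online",  # was: llama-3.1-sonar-huge-128k-online
    messages=[{"role": "user", "content": "Summarize today's top AI news."}],
    stream=True,
)

# Print tokens as they arrive so you can compare time-to-first-token directly.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Only the model string changes between the two models, so it should be easy to benchmark both side by side and decide whether the latency gain is worth any difference in answer quality.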