RLHF Training Artifacts: Systematic Confabulation and Coherency Degradation in Complex Multi-Step Reasoning Tasks

:bug: Describe the Bug

When using the Perplexity API (specifically the sonar-pro model via the chat completions endpoint) for extended multi-step reasoning tasks involving cross-domain knowledge synthesis, the model exhibits systematic confabulation patterns and coherency degradation. This appears to be an artifact of RLHF (Reinforcement Learning from Human Feedback) training prioritizing response fluency over factual precision.

Specific Manifestations:

  1. Plausible but Incorrect Citations: Model generates citations that “sound right” (proper DOI format, credible journal names, appropriate year ranges) but don’t correspond to actual publications

  2. Coherency Drift in Long Contexts: After ~15-20 reasoning steps, the model begins contradicting its earlier statements while maintaining a confident tone

  3. Statistical Confabulation: When asked for specific quantitative data (e.g., “what percentage of…”), model provides precise-sounding numbers (“23.7%”, “4.2x improvement”) without grounding in actual sources

  4. Retrieval Hallucination: In search-augmented contexts, model sometimes claims to have “found” information that doesn’t appear in the provided search results
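
Manifestation 4 in particular can be detected mechanically when the search results are supplied by the caller: any span the model presents as a quotation should appear somewhere in the supplied snippets. The sketch below is a crude illustration; exact substring matching and the 20-character cutoff are simplifying assumptions, and the snippet/answer strings are made up purely for demonstration.

```python
import re

def unsupported_quotes(answer: str, snippets: list[str]) -> list[str]:
    """Return quoted spans from the answer that appear in none of the snippets."""
    quotes = re.findall(r'"([^"]{20,})"', answer)  # quoted spans of 20+ characters
    haystack = " ".join(snippets).lower()
    return [q for q in quotes if q.lower() not in haystack]

# Example: the second "quote" is not grounded in the provided snippet.
snippets = ["CRISPR screens identified 42 candidate genes in the 2021 cohort."]
answer = ('The source states "CRISPR screens identified 42 candidate genes" and also '
          '"a 4.2x improvement in survival was observed across all arms".')
print(unsupported_quotes(answer, snippets))
```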

Technical Context:

This is likely caused by the RLHF reward model optimizing for:

  • Response confidence (penalizing “I don’t know”)
  • Specificity (rewarding precise answers over hedged ones)
  • Completeness (penalizing partial answers)

These training objectives create adversarial incentives for confabulation when the model encounters knowledge boundaries.

Reproduction:

Occurs reliably when:

  • Task requires >10 sequential reasoning steps
  • Query spans multiple specialized domains
  • Specific quantitative claims are requested
  • Extended context window (>8K tokens) is used
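
For reference, a minimal request that hits these conditions looks roughly like the sketch below. The endpoint URL, payload shape, and "sonar-pro" model name are assumptions based on the OpenAI-compatible chat completions API Perplexity documents; the prompt itself is only illustrative.

```python
import os
import requests

# Assumed OpenAI-compatible chat completions endpoint; adjust per current docs.
API_URL = "https://api.perplexity.ai/chat/completions"

payload = {
    "model": "sonar-pro",  # assumed model name; substitute whatever you are testing
    "messages": [{
        "role": "user",
        "content": (
            "Develop a computational framework integrating quantum biology and "
            "neuroscience. Cite specific papers with DOIs and give quantitative "
            "estimates (percentages, fold-changes) for each claim."
        ),
    }],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Continuing the conversation for another 15-20 turns that build on the first answer (steps 2-5 of the reproduction steps below) is what surfaces the coherency drift and lets you cross-check the citations.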

Impact:

For complex research tasks (like the computational biology frameworks I developed using Perplexity), this behavior forces extensive manual fact-checking and cross-validation. I spent approximately a week of prompt engineering implementing verification loops to catch these confabulations.

Example:
When asked to synthesize cancer treatment pathways, the model confidently cited “DOI: 10.1038/nature.2023.12345”, which does not exist, although the format and journal prefix match real Nature papers.
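
Fabricated DOIs of this kind can be caught cheaply, since doi.org answers registered DOIs with a redirect and unknown ones with a 404. The sketch below assumes that behavior; the DOI regex and the trailing-punctuation handling are deliberate simplifications.

```python
import re
import requests

DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def doi_resolves(doi: str) -> bool:
    """True if doi.org recognizes the DOI (redirects), False if it returns 404."""
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
    return 300 <= resp.status_code < 400

def unresolved_dois(model_output: str) -> list[str]:
    """Extract DOI-looking strings from model output and return those that don't resolve."""
    dois = {m.rstrip(".,;)\"'") for m in DOI_PATTERN.findall(model_output)}
    return [d for d in sorted(dois) if not doi_resolves(d)]

# The citation from the example above fails this check.
print(unresolved_dois("... as reported in DOI: 10.1038/nature.2023.12345 ..."))
```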

:white_check_mark: Expected Behavior

The model should either:

  1. Explicitly acknowledge uncertainty (“I don’t have access to specific data on…”)
  2. Provide only verifiable, source-grounded claims
  3. Distinguish between reasoning/inference and factual retrieval
  4. Maintain logical coherency across extended reasoning chains

When citations are provided, they should be algorithmically verified against actual publication databases or clearly marked as “similar to” rather than presented as exact matches.

:cross_mark: Actual Behavior

Instead, the model produces fluent, confident output containing fabricated citations, ungrounded statistics, claims that do not appear in the supplied search results, and contradictions of its own earlier statements, as detailed in the bug description above.

:counterclockwise_arrows_button: Steps to Reproduce

  1. Request a complex multi-domain synthesis task (e.g., “Develop a computational framework integrating quantum biology and neuroscience with specific citations”)
  2. Continue the conversation for 15-20 reasoning steps, building on previous responses
  3. Ask for specific quantitative claims or citations
  4. Cross-reference the provided citations against actual databases (PubMed, DOI resolution services)
  5. Check internal coherency by asking the model to summarize its earlier claims (a rough automation sketch follows this list)
  6. Observe the unexpected behavior described above
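
Step 5 can be roughly automated. The sketch below stores the numbered claims collected during the conversation, then compares them against the model's own restatement using bag-of-words overlap, a deliberately crude stand-in for a proper entailment check; the `ask_model` call in the usage comment is a hypothetical wrapper around whatever chat-completion client you use.

```python
def word_overlap(a: str, b: str) -> float:
    """Crude similarity: Jaccard overlap of the word sets of two claims."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def flag_drift(earlier_claims: list[str], restated_claims: list[str],
               threshold: float = 0.3) -> list[tuple[int, str, str]]:
    """Pair each earlier claim with its restatement; flag low-overlap pairs for review."""
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(earlier_claims, restated_claims), start=1)
        if word_overlap(old, new) < threshold
    ]

# Usage sketch (ask_model is a hypothetical helper that calls the chat API):
#   earlier = claims_logged_during_steps_1_to_15
#   restated = ask_model("Restate, one per line, each numbered claim you made earlier.")
#   for idx, old, new in flag_drift(earlier, restated.splitlines()):
#       print(f"Possible contradiction in claim {idx}:\n  earlier: {old}\n  now: {new}")
```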

:pushpin: API Request & Response (if applicable)

:globe_showing_europe_africa: Environment

  • API Version: [e.g., sonar-3.1]
  • SDK (if applicable): [e.g., Python SDK v0.5]
  • Operating System: [e.g., MacOS, Linux, Windows]

:paperclip: Logs or Screenshots (if applicable)

Add any logs or screenshots that can help debug the issue.

:memo: Additional Context

Add any other context about the problem here.