The standard user experience for AI today is a blinking cursor and a loading spinner.
When you ping a single LLM for a recipe, a 3-second wait is acceptable. But in a business environment, a true Multi-Agent System (like ByteTect's OMAS) doesn't just generate text: it plans, writes SQL, searches internal vector databases, critiques its own drafts, and refines them. This orchestration can take anywhere from 15 to 45 seconds.
If you make a business user stare at a loading spinner for 45 seconds, they will assume your software is broken and refresh the page.
To solve this, we had to rethink the UX paradigm of AI. We realized that transparency is the ultimate UX. Users don't mind waiting if they can watch the machine work. So, we engineered our architecture to stream the agents' "internal monologues" in real-time.
Here is how we built the observability pipeline for the Nexus dashboard.
The Problem with Streaming JSON
In a multi-agent state machine, agents communicate with each other using structured JSON, not plain text. For example, our Solver agent outputs a payload like this:
```json
{
  "thought": "The user needs the Q4 margins. I need to ask the CFO agent for the exact ledger data before I draft the report.",
  "action": "FINANCIAL_ANALYSIS",
  "content": "Get Q4 2025 margins for Client X"
}
```

Standard Server-Sent Events (SSE) break when streaming JSON because the JSON is technically invalid until the final closing bracket `}` arrives. If you wait for the complete JSON to parse it, you lose the real-time typing effect. The user is stuck waiting.
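You can see the problem in two lines: a prefix of a perfectly valid payload is, on its own, unparseable.

```python
import json

# A chunk of the Solver payload cut off mid-stream: it only becomes
# valid JSON once the closing brace arrives.
partial = '{"thought": "The user needs the Q4 margins. I need to'

try:
    json.loads(partial)
    parseable = True
except json.JSONDecodeError:
    parseable = False

print(parseable)  # False
```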
The Solution: The PartialThoughtExtractor
To give users a real-time window into the AI’s brain, we built a custom Python utility called the PartialThoughtExtractor.
Instead of waiting for the LLM to finish its turn, the extractor buffers the raw character stream as it arrives from the model. It uses regex heuristics to detect when the LLM begins typing the "thought" key, and immediately starts yielding those specific characters over a WebSocket to the frontend.
```python
def update(self, chunk: str) -> str:
    self.buffer += chunk
    new_content = ""
    # Look for the "thought" key in the raw stream
    if not self.found_thought_key:
        match = re.search(r'"thought"\s*:\s*"', self.buffer)
        if match:
            self.found_thought_key = True
            self.in_thought_value = True
            self.buffer = self.buffer[match.end():]
    # Extract and yield characters until the closing quote
    if self.in_thought_value:
        # Find the first quote not preceded by a backslash (simplified)
        end = re.search(r'(?<!\\)"', self.buffer)
        if end:
            new_content = self.buffer[:end.start()]
            self.buffer = self.buffer[end.end():]
            self.in_thought_value = False
        elif self.buffer.endswith("\\"):
            # Hold a trailing backslash back; its escape may complete
            # in the next chunk
            new_content, self.buffer = self.buffer[:-1], "\\"
        else:
            new_content, self.buffer = self.buffer, ""
    return new_content
```

This allows us to split a single LLM output into two distinct WebSocket channels: the Thought Stream and the Content Stream.
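The heuristic can be exercised end to end with a toy stream. This is a simplified, standalone sketch (it ignores escaped quotes, which the real extractor must handle):

```python
import re

# Simulated raw chunks as they arrive from the model.
chunks = ['{"tho', 'ught": "Plan', 'ning the repo', 'rt...", "action": "REPORT"}']

buffer, streamed = "", ""
found_key, in_value = False, False
for chunk in chunks:
    buffer += chunk
    # Wait until the "thought" key has fully appeared in the buffer
    if not found_key:
        match = re.search(r'"thought"\s*:\s*"', buffer)
        if match:
            found_key = in_value = True
            buffer = buffer[match.end():]
    # Stream value characters out as soon as they arrive
    if in_value:
        end = buffer.find('"')  # naive: ignores escaped quotes
        if end == -1:
            streamed, buffer = streamed + buffer, ""
        else:
            streamed += buffer[:end]
            in_value = False

print(streamed)  # Planning the report...
```

Note that nothing is emitted for the first chunk: the key itself can be split across chunk boundaries, so the extractor yields no characters until the full `"thought": "` prefix has arrived.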
The Event Bus: ws_manager.py & React Zustand
On the backend, our ws_manager.py and Event Bus wrap these tokens into strict payloads (NODE_START, TOKEN, STATUS, NODE_COMPLETE). We broadcast these events to the specific thread's WebSocket room.
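The article does not show the envelope shape, but a minimal sketch might look like this. The event type names come from the article; the field names (`agent`, `thread_id`, `payload`) and the `AgentEvent` class are illustrative assumptions:

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum

class EventType(str, Enum):
    NODE_START = "NODE_START"
    TOKEN = "TOKEN"
    STATUS = "STATUS"
    NODE_COMPLETE = "NODE_COMPLETE"

# Hypothetical envelope: one strict, self-describing payload per event,
# broadcast to the WebSocket room for a given thread.
@dataclass
class AgentEvent:
    type: EventType
    agent: str
    thread_id: str
    payload: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self))

# A single token from the Orchestrator's thought stream.
event = AgentEvent(EventType.TOKEN, "Orchestrator", "thread-42", "The user needs")
print(event.to_json())
```

Because every event carries its type, the frontend can route `TOKEN` events into the live streams while `NODE_START`/`NODE_COMPLETE` drive the activity feed.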
On the frontend, our React application intercepts these events using a custom useAgentSocket hook. We bypass standard React state (which would trigger a re-render on every token and kill performance) and pump the tokens directly into a high-performance Zustand store (useObservabilityStore).
The result is a beautifully choreographed UI:
- The Activity Feed: The user sees status updates like "Librarian is searching company archives..." or "Visual Critic gave a score of 7, requesting revision."
- The Thought Stream: In a dedicated UI panel, the user watches the Orchestrator literally "think" out loud, character by character, explaining why it is making its routing decisions.
- The Draft: Finally, the polished content streams seamlessly into the main chat window.
Trust Requires Transparency
In B2B SaaS, the "black box" approach to AI is dead. If an executive is going to base a strategic decision on a report generated by your software, they need an audit trail. They need to know exactly which agents touched the data, what internal knowledge was queried, and how the final conclusion was reached.
By streaming the internal monologue, we didn't just fix a UX problem; we built a system that inherently proves its own work.