So…. We made a talking bot

Navigating LLM-Driven Voice Interaction
Posted by Matias Grinberg Portnoy

on September 11, 2024 · 5 mins read

👉 Why did we make a talking bot?

👉 Challenges

👉 How it works

👉 How we overcame the difficulties

👉 Did we succeed?

Introduction - Why did we make a talking bot?

LLMs are all the rage right now, with varied applications ranging from coding to playing Minecraft. Probably the most obvious and direct use of this tech is still chatbots.

Yes, chatbots may not be as alluring as other applications, but hey, they are really useful. Sometimes you need to combine NLP with menial tasks, and LLM chatbots are well suited for this: think simple customer service or appointment scheduling. More complicated interactions are still out of reach for current LLMs.

There’s a catch though: ignoring new and shiny multimodality, LLMs communicate through text. They take text in and produce text out. This is not necessarily well suited for all tasks. Even though the underlying interaction remains the same, people sometimes just want to talk to someone! In fact, some research indicates that user preference between text and voice communication depends on the task and context.

Even more interesting is that after repeated failures in solving a problem, people usually prefer to interact by voice rather than by chat.

Phone calls are typically less efficient than chats. You can only have one at a time and they typically last longer (provided that people respond to chats on the spot, which rarely happens). Automating phone calls can optimize costs. Ideally, automated responses to frequently asked questions and common issues allow businesses to allocate human resources more strategically, focusing on complex problem-solving and high-value tasks.

That’s where we come in. We decided to tackle this problem by making a bot answer phone calls. We recently developed a PoC combining Symbl’s Nebula LLM with off-the-shelf modules to create a talking bot. Our goal was to create a passable conversation with a bot. You would clearly notice that it’s a bot but we weren't aiming for it to seem human.

Challenges: What makes the experience feel right

Latency, latency, latency!

The demo had multiple issues, but latency was a priority. When you are talking to someone, you expect them to answer pretty fast. Keeping latency low is crucial for this to be viable; otherwise it ruins the illusion of having a real conversation. If you need to wait 10 seconds for an answer, the experience won’t feel satisfactory or efficient.

Maintaining low latency not only requires making the bot more efficient, but it also limits the solutions we can use to address other issues. In other words, we can’t use time-consuming tricks to solve problems.

Transcription quality

For real-time conversations we need to continuously transcribe the user's audio, which requires a fast and dependable method for handling short audio clips. If the transcription does not match what the user said, the bot's response won't make sense. If the transcription method is slow, it hurts our latency and, with it, the user experience.

Follow guidelines

LLMs can be unpredictable, so they need guardrails when interacting with users/customers. They must provide accurate information and stay consistent with their intended behavior.

If the bot starts talking about things that are unrelated to its objective, then clearly it is not accomplishing its task. Furthermore, a user might intentionally try to force the bot into a blunder, so we also need to take that into account. This is not easy to do, especially considering that we won’t be using a custom LLM and any elaborate examination of the prompts costs time.
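As a rough, hypothetical sketch (not the actual prompt from our PoC), guardrails can boil down to explicit instructions prepended to every request plus a cheap check on the reply before it is spoken:

```python
# Illustrative only: the prompt wording, banned topics, and check below are
# assumptions, not the exact guardrails used in the PoC.
SYSTEM_PROMPT = (
    "You are a phone assistant whose only job is to schedule a call to "
    "discuss Mutt Data's services. Never discuss unrelated topics; politely "
    "steer the conversation back. Never invent information: only offer the "
    "time slots you are given. If you cannot help, say so."
)

BANNED_TOPICS = ("politics", "medical advice", "legal advice")

def violates_guardrails(reply: str) -> bool:
    """Cheap post-check before the reply is sent to text-to-speech."""
    lowered = reply.lower()
    return any(topic in lowered for topic in BANNED_TOPICS)
```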

Leverage information

Besides simply talking to the user, there’s still something important: What should the bot accomplish? There are multiple directions depending on what’s required of it. The key point is leveraging relevant information during the conversation. For example, if we want the bot to schedule appointments, it must know the available time slots and avoid selecting invalid ones. If the bot is meant to help customers with their issues, it should understand how to assist and adapt to their needs, or recognize when it can't help rather than providing irrelevant information. All these tasks require the bot to interact with data.
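For the scheduling case, a minimal and purely illustrative way to do this is to render the data the bot is allowed to use directly into its context. The slot list and helper below are assumptions, not our actual data source:

```python
from datetime import datetime

# Hypothetical data source: in a real deployment this would come from a calendar API.
AVAILABLE_SLOTS = [
    datetime(2024, 9, 16, 10, 0),
    datetime(2024, 9, 16, 15, 30),
    datetime(2024, 9, 17, 11, 0),
]

def build_scheduling_context() -> str:
    """Turn the available slots into text the LLM can ground its answers on."""
    lines = "\n".join(f"- {slot:%A %B %d, %H:%M}" for slot in AVAILABLE_SLOTS)
    return f"Available time slots (offer only these, never invent others):\n{lines}"
```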

A short demo

How it works

Without further ado, here is the workflow of our demo. The bot is set up to schedule a fictional call to discuss Mutt Data's services. Its workings are very simple (although it did take a few iterations). We interact with services provided by both AWS and Symbl. There's an event diagram below:

Diagram

It processes chunks of audio from the user until we hear a prolonged silence. That’s our cue for “I’m done talking, answer me”. The audio chunks are then transcribed locally in the backend. The transcribed text (along with the whole previous conversation) is sent to Symbl’s Nebula LLM API and the response is streamed back. The UI is updated as each word is streamed back from Nebula’s API. Once a full sentence is received, it is sent to AWS Polly’s text-to-speech service to get the audio that the user will hear. Finally, the audio is forwarded to the user’s side, which queues it and plays it.
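A minimal sketch of one conversational turn could look like the following. `transcribe`, `stream_nebula_reply`, and `send_to_client` are hypothetical helpers (the actual Nebula API call and socket plumbing are omitted), and the Polly voice and output format are assumptions:

```python
import boto3

polly = boto3.client("polly")

def handle_user_turn(audio_chunks, history, send_to_client):
    """One turn: transcribe the user's audio, stream the LLM reply, speak it back."""
    user_text = transcribe(audio_chunks)                 # local Whisper, sketched further below
    history.append({"role": "user", "text": user_text})

    sentence = ""
    for word in stream_nebula_reply(history):            # hypothetical wrapper around Nebula's API
        send_to_client({"type": "partial_text", "word": word})   # UI updates word by word
        sentence += word + " "
        if word.endswith((".", "?", "!")):                # a full sentence has arrived
            speech = polly.synthesize_speech(
                Text=sentence, OutputFormat="mp3", VoiceId="Joanna"  # voice/format are assumptions
            )
            send_to_client({"type": "audio", "data": speech["AudioStream"].read()})
            history.append({"role": "assistant", "text": sentence})
            sentence = ""
```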

The frontend and backend communicate through web sockets, enabling an event-driven system where both sides can send data whenever needed.
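On the backend, this can be a very small event loop. Here is a sketch using the Python `websockets` library; the actual stack, event names, and helper functions are assumptions:

```python
import asyncio
import json
import websockets

async def handler(ws):
    """React to frontend events as they arrive over the socket."""
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "audio_chunk":
            buffer_chunk(event["data"])        # hypothetical: store chunk for transcription
        elif event["type"] == "end_of_speech":
            await run_turn(ws)                 # hypothetical: runs the turn pipeline sketched above

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()                 # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```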

How we overcame the difficulties

Measuring times

Clearly, if we want to make sure we have low latency, the first step is measuring things: how long does this take? What happens if we do this or that? And so on. It doesn’t need to be overly precise; a rough experimental measure will do. Since we have a frontend and a backend, we have two sources of timestamped events. To keep measurements consistent, we measure everything in the backend; frontend events are sent there and we record their time of arrival (a sketch of this instrumentation appears after the table). So, how do we measure up? Here’s a simplified view of the timings, averaged over a couple of runs of the final version:

| Phase | Average duration (seconds) | Std |
|---|---|---|
| Transcription | 0.504 | 0.211 |
| Waiting for the first word | 1.297 | 0.188 |
| Waiting between first word and sentence | 0.316 | 0.234 |
| First audio generation | 0.073 | 0.095 |
| First audio arrived at user | 0.196 | 0.064 |
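Here is a sketch of how such timings can be collected on the backend; the helper and phase names are illustrative, not our exact code:

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean, stdev

phase_times = defaultdict(list)

@contextmanager
def timed(phase: str):
    """Record how long a named phase takes; rough numbers are enough for a PoC."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_times[phase].append(time.perf_counter() - start)

def report() -> None:
    """Print average and spread per phase, as in the table above."""
    for phase, durations in phase_times.items():
        spread = stdev(durations) if len(durations) > 1 else 0.0
        print(f"{phase}: {mean(durations):.3f}s (std {spread:.3f})")
```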

Our greatest pain point seems to be waiting for Nebula to give us a full sentence. We can’t do much about it (unless we switch to some “lighter” LLM). The combined time spent waiting for the LLM averages around 1.6 seconds, by far the longest. Transcription comes in a distant second. The final latency is acceptable for the system to be usable, but it is slightly higher than the table indicates: we must wait for the user to remain silent for a set period, which may be longer than typical pauses in conversation. This could be improved, but for now we’ve hardcoded the waiting time due to the limited scope of the PoC.
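The end-of-utterance rule itself is just a hardcoded silence window. A sketch, assuming 16-bit PCM chunks and made-up threshold values:

```python
import array

SILENCE_RMS_THRESHOLD = 500   # assumed value; depends on mic, codec, and gain
SILENCE_WINDOW_S = 1.5        # the hardcoded "stay quiet this long and I'll answer"
CHUNK_DURATION_S = 0.1        # assumed duration of each audio chunk from the frontend

def rms(chunk: bytes) -> float:
    """Root-mean-square energy of a 16-bit PCM mono chunk."""
    samples = array.array("h", chunk)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def turn_is_over(recent_chunks: list[bytes]) -> bool:
    """The user is 'done talking' once enough consecutive chunks are quiet."""
    needed = int(SILENCE_WINDOW_S / CHUNK_DURATION_S)
    tail = recent_chunks[-needed:]
    return len(tail) == needed and all(rms(c) < SILENCE_RMS_THRESHOLD for c in tail)
```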

Transcription quality is a trade-off between precision and speed

We transcribe audio locally for speed. Symbl's transcription is better suited for ongoing conversations, while we used a lighter version of OpenAI's Whisper for quick, one-shot transcriptions. There are other services available, and newer, faster Whisper variants can further improve speed and quality.
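As a concrete sketch with the open-source `whisper` package (the model size and decoding options are assumptions; the "lighter version" we used may differ):

```python
import whisper

# "base" is an assumption; any of the smaller Whisper checkpoints trades accuracy for speed.
model = whisper.load_model("base")

def transcribe_clip(audio_path: str) -> str:
    """One-shot transcription of a short clip the backend has just saved."""
    result = model.transcribe(audio_path, language="en", fp16=False)
    return result["text"].strip()
```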

Did we succeed? And where do we go from here?

Yes! We successfully developed a proof of concept for a talking bot using Symbl's Nebula LLM and AWS services. While the bot met our initial goal of handling simple conversations like scheduling calls, it remains a work in progress. The overall system demonstrates that this kind of technology can be viable for automating basic voice interactions.

Moving forward, we plan to focus on improving the bot's efficiency and functionality. Key areas include reducing response latency, enhancing transcription accuracy, and refining the bot's information retrieval to better assist users during conversations. Exploring additional features like automated follow-up actions after calls will also be crucial to expanding the bot's capabilities for real-world applications.

Follow us on LinkedIn to keep up with what is coming in the future!