3. Building a Chat System with Dendron: Learning How to Say Goodbye
In Part 2 we saw how to manage the state of a chat inside of a behavior tree, but we lost the ability to determine when the user wants to end the conversation. Moreover, if you played with the models from Parts 0 or 1, you may have noticed that a simple search for the string "Goodbye" isn't a reliable indicator that the conversation is over: even if you say goodbye to your agent, it may not reply in kind, or it may not say the exact word you're searching for. We could add complexity to our exact-match search, but what we really want is to look at whatever the human has said most recently and answer the question "Is the human trying to end the conversation?" This is precisely the sort of thing that language models are supposed to be good at, and Dendron provides a specific class (CompletionCondition) that uses a language model to score a list of possible answers to a question, returning the most likely answer given the model and some programmer-specified context. In this part we'll show how to add a CompletionCondition to our tree to end the conversation with a bit more intelligence than our previous trees. We'll also add a rule-based AI node for intelligently splitting long strings into shorter ones to improve TTS quality.
If you find this tutorial too verbose and you just want to get the code, you can find the notebook for this part here.
Imports and Initial Setup
We start as always with our imports:
Most of this should be familiar from the previous parts, but a few points of interest stand out. First, we import both CompletionConditionConfig and CompletionCondition; the import of a config class is a hint that CompletionCondition uses a model from Transformers. Second, we are now importing spaCy, which you may need to pip install. If you haven't used it before, spaCy is a great library for handling certain tasks related to natural language processing. We'll be using it to perform some simple rule-based splitting of strings into sentences.
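As a taste of what spaCy gives us, here is a minimal, standalone example of rule-based sentence segmentation using the lightweight sentencizer component (which needs no trained-model download). The tutorial's actual pipeline setup may differ; this is just a sketch of the capability:

```python
from spacy.lang.en import English

# Build a blank English pipeline with only the rule-based
# sentencizer component; no trained model is required for this.
nlp = English()
nlp.add_pipe("sentencizer")

text = "Hello there! It is nice to meet you. Goodbye for now."
doc = nlp(text)

# Strip each sentence of surrounding whitespace, mirroring what a
# sentence-splitting node would do before handing text to TTS.
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```

The sentencizer splits on terminal punctuation, so the three sentences above come back as separate strings ready to be spoken one at a time.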
Next up we repeat the definitions of the action and condition nodes from Part 2:
All of this is similar to the previous part, except that in TimeToThink we now initialize self.last_human_input = None. This is entirely artificial, but below it will give us a chance to show how to pre-register key-value pairs with our blackboard.
TTSAction and Sentence Splitting
Next we define our TTSAction and play_speech function:

The class includes a few added bells and whistles, but should be mostly familiar by now. The changes in TTSAction and play_speech all relate to the way in which we are now processing a series of utterances to speak. To understand why we do this, let's talk about sentence splitting.
Sentence Splitting
In Part 1, we mentioned that most current neural TTS models often struggle with longer utterances. This is something that seems to affect all models, but is particularly pronounced for smaller models. It turns out you can somewhat mitigate the problem by splitting large utterances into shorter ones. You could do this solely based on string length, but then you're likely to break coherent statements into fragments, which will be spoken in weird ways. It would be best if we could split long strings at natural pause points, like sentence boundaries.
It turns out that Piper supports a version of this already in its piper-phonemize library, but in case you want to try different models we show a simple strategy for achieving the same result using rule-based AI. You could certainly do better with a learning-based approach, but what we'll do here is quick, easy, and often good enough. The capability we're after is provided by spaCy, which we'll wrap in an ActionNode:
In the constructor, we initialize a spaCy pipeline to perform tokenization at the sentence level. In the tick function we use that pipeline to split longer strings into sentences, strip the sentences of whitespace, and add them back to the appropriate slot in the blackboard.
Defining a Single Turn in the Conversation
As in Part 2, we will define a turn in the chat as consisting of speaking, thinking, and listening. Since most of the details are the same as in previous tutorials, we show all the code while our commentary focuses primarily on differences from previous parts of the tutorial.
The Speech Sequence
With the TTSAction defined as above, we can define a speech_node and speech_seq as in Part 2:
The Thought Sequence
To define our thought sequence, we just need to repeat our code from Part 2:
Aside from a bit of blackboard gymnastics, this should mostly be familiar from the previous parts of the tutorial. Notice that we are adding a SentenceSplitter node to the end of our thought sequence, so that if the chat_node generates a string that is too long to speak, we can split it immediately before our TTSAction has a chance to run.
Once we have defined thought_seq, we have enough to define a single turn in a conversation:

This should look familiar as the root node from Part 2 of the tutorial. We still haven't implemented a way to break out of the chat loop, so let's build a "farewell classifier" using a CompletionCondition node.
CompletionCondition for Classifying Strings
A CompletionCondition node is, as the name suggests, a condition node that returns SUCCESS or FAILURE based on the output of an autoregressive language model. Although we mostly use language models to generate text these days, it's important to remember that language models are probability models: in addition to generating text by sampling, we can also evaluate the conditional probability of any string given any other string (as long as the combination fits inside the model's context window). This opens the door to the following strategy for evaluating a logical condition using a language model:
- Write down a prompt that includes a statement and a question with a closed set of possible answers. Include a placeholder where a possible answer can go.
- Write down the possible answers in a list.
- Generate a batch for the model consisting of the prompt paired with each possible answer.
- Run the model on the batch to get a probability for each question-answer pair.
- Select the answer with the highest probability relative to the other answers.
- Determine whether this answer corresponds to SUCCESS or FAILURE, depending on the nature of the problem.
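The steps above can be sketched with a toy stand-in for the model. Here a hand-written probability table replaces the real language-model scoring (a real implementation would sum token log-probs from a Transformers model over the batch), but the score-then-argmax shape is the same; all names and numbers are illustrative:

```python
# Toy stand-in for a language model's scoring: maps an answer to a
# probability given the prompt. Purely illustrative numbers.
def toy_score(prompt: str, answer: str) -> float:
    table = {
        "yes": 0.85 if "goodbye" in prompt.lower() else 0.10,
        "no":  0.15 if "goodbye" in prompt.lower() else 0.90,
    }
    return table[answer]

def classify(prompt: str, completions: list[str]) -> str:
    # Score every possible completion against the prompt, then
    # select the highest-probability answer.
    scores = {c: toy_score(prompt, c) for c in completions}
    return max(scores, key=scores.get)

prompt = 'The human said: "Goodbye for now!" Is the human ending the conversation? Answer: '
best = classify(prompt, ["yes", "no"])
print(best)  # "yes" for this prompt
```

Mapping the winning answer to SUCCESS or FAILURE is then a one-line decision, which is exactly the role the success function plays below.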
The CompletionCondition node implements precisely this strategy (with some nuance due to how the Transformers library implements probability calculations). Let's look at the code and then we can talk about its parts:
This is a big block of code, but if you've followed the tutorial up to this point you have everything you need to understand what's going on. First we create a config object that specifies the model to use, the input key for the blackboard, and some optimization flags. Then we create our farewell_classification_node using that configuration. Next we define a farewell_success_fn, a farewell_pretick function, and a farewell_posttick function. The pre- and post-tick functions are responsible for handling state management and format conversion. The model we're using requires formatting similar to openchat_3.5, so we put that processing in farewell_pretick. (We could do this in an input processor, but using a pre-tick function works just as well.) The post-tick function is solely responsible for deciding whether the blackboard should be updated to set the "all_done" flag to True.
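To see the shape of those pieces in isolation, here is a runnable sketch that substitutes a plain dict for Dendron's blackboard and a minimal NodeStatus enum for its status type. The function bodies, key names beyond those mentioned in the text, and the prompt format are all guesses for illustration, not the library's actual implementation:

```python
from enum import Enum

# Minimal stand-in for Dendron's node status type, for illustration only.
class NodeStatus(Enum):
    SUCCESS = 1
    FAILURE = 2

# Plain dict standing in for the tree's blackboard.
blackboard = {"latest_human_input": "Goodbye!", "all_done": False}

def farewell_success_fn(completion: str) -> NodeStatus:
    # Receives the highest-scoring completion; "yes" means the
    # human is saying goodbye.
    if completion.strip().lower() == "yes":
        return NodeStatus.SUCCESS
    return NodeStatus.FAILURE

def farewell_pretick(bb: dict) -> None:
    # Wrap the latest human input in a yes/no question for the model.
    # (Key name "farewell_in" and the format string are hypothetical.)
    bb["farewell_in"] = (
        f'The human said: "{bb["latest_human_input"]}" '
        "Is the human saying goodbye? Answer yes or no: "
    )

def farewell_posttick(bb: dict, status: NodeStatus) -> None:
    # Sole job: flip the all_done flag when the classifier fires.
    if status == NodeStatus.SUCCESS:
        bb["all_done"] = True

# Simulate one tick, assuming "yes" won the completion scoring.
farewell_pretick(blackboard)
status = farewell_success_fn("yes")
farewell_posttick(blackboard, status)
print(blackboard["all_done"])  # True
```

The real node performs the model scoring between the pre-tick and the success function, but the data flow through the blackboard is the same shape.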
Even without saying exactly what farewell_success_fn does, you can probably read these three functions and guess how they fit together. In the pre-tick we set up a yes-no question asking if the user is saying goodbye. The farewell_success_fn takes a completion (which will turn out to be the highest-scoring completion) and returns SUCCESS if it is "yes". This status is returned as the node's status, and the post-tick function checks that status and updates the blackboard if it is SUCCESS. In this way, the CompletionCondition node implements a kind of classifier over text strings.
You might be wondering how the completions are passed to the node. If you guessed "via the blackboard," then you're right! You can scroll down to see how the blackboard is set up once we create a BehaviorTree instance, or you can look at the documentation for CompletionCondition.
Lastly, we implement an action node responsible for saying goodbye. We could just end the conversation as we have before, but since we're already in the process of adding intelligence to our system, we may as well have the node speak goodbye:

All this node does is check if the "all_done" flag is set; if so, we add a "Goodbye!" to the "speech_in" slot in the blackboard.
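The core of that node's tick boils down to a few lines. Here it is as a standalone function over a dict blackboard; the key names come from the tutorial, while representing the speech queue as a list of utterances is an assumption:

```python
def farewell_tick(bb: dict) -> None:
    # If the classifier has decided the conversation is over,
    # queue one final utterance for the TTS node to speak.
    if bb.get("all_done"):
        bb["speech_in"] = ["Goodbye!"]

bb = {"all_done": True}
farewell_tick(bb)
print(bb["speech_in"])  # ['Goodbye!']
```

When the flag is not set, the function leaves the blackboard untouched, so the node is a no-op on ordinary turns.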
At this point we can build our farewell classifier using a Sequence:
In addition to the farewell_classification_node we defined above, the other noteworthy thing about this snippet is that we are reusing our speech_node. This is perfectly valid as long as your nodes control their state and side effects. Language models (and neural networks in general) are pure functions, except possibly for the randomness introduced by sampling; even with sampling, this kind of model reuse turns out to be viable for our application, and it saves a good amount of GPU VRAM.
Building the Tree
Now that we have defined all of the components we need, we can build our tree:
This results in a tree that looks like the following:
Once our tree is set up, we just need to initialize our blackboard and start our chat loop:
Almost all of this should be clear by now, except for how we set up the key "latest_human_input". For that key, we register the entry with our blackboard before setting a value. Registering an entry with a blackboard allows us to specify a description for the entry and, more importantly, lets us specify a type for the value. Here, if we were to just assign None without registering the entry first, we would eventually run into type errors when we tried to use the value.
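To see why registering first matters, here is a toy typed blackboard that infers an entry's type from the first assignment unless the type was declared up front. Dendron's blackboard is more sophisticated and its API differs; this stand-in (all names hypothetical) only reproduces the failure mode:

```python
class ToyBlackboard:
    """Toy stand-in: a blackboard that type-checks its entries."""
    def __init__(self):
        self._types = {}
        self._data = {}

    def register_entry(self, key, entry_type, description=""):
        # Declare the type up front, so that a later None assignment
        # doesn't pin the entry's type to NoneType.
        self._types[key] = entry_type

    def __setitem__(self, key, value):
        if key not in self._types:
            self._types[key] = type(value)  # inferred -- maybe wrongly!
        if value is not None and not isinstance(value, self._types[key]):
            raise TypeError(f"{key} expects {self._types[key].__name__}")
        self._data[key] = value

bb = ToyBlackboard()
bb.register_entry("latest_human_input", str, "most recent user utterance")
bb["latest_human_input"] = None      # fine: type already declared as str
bb["latest_human_input"] = "Hello!"  # fine: matches the declared type
```

Without the register_entry call, the first None assignment would fix the entry's type to NoneType, and the later string assignment would raise a TypeError.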
Also of note, we specify "completions_in" and "success_fn" for the CompletionCondition we defined above. Because we pass these and the input prefix to the node via the blackboard, we can change the question, possible answers, and success criterion for a CompletionCondition dynamically at runtime.
The Chat Loop
We are now in a position to run our chat loop:
Instead of looping forever, we loop until the blackboard slot for "all_done" returns True. We tick as fast as we can, which works just fine for a chat application. If we were running this code on a robot, though, we would probably run our chat loop at some fixed frequency, such as 20 or 50 Hz.
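The control flow of that loop can be seen with a stub standing in for the real BehaviorTree; the tick method name and blackboard access below are assumptions, and the stub simply flips the flag after a few ticks:

```python
class StubTree:
    """Stand-in for the Dendron tree: sets all_done after 3 ticks."""
    def __init__(self, bb):
        self.bb = bb
        self.ticks = 0

    def tick_once(self):
        self.ticks += 1
        if self.ticks >= 3:
            self.bb["all_done"] = True

blackboard = {"all_done": False}
tree = StubTree(blackboard)

# Loop until the farewell classifier sets the all_done flag.
while not blackboard["all_done"]:
    tree.tick_once()

print(tree.ticks)  # 3
```

On a robot, the loop body would additionally sleep to hold a fixed tick frequency rather than spinning as fast as possible.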
Conclusion
You now have an agent that can detect when you are trying to say goodbye and respond appropriately. The agent uses three language models and a rule-based AI system to perform its work, and all of this runs in about 14GB of VRAM on a single RTX 3090! In the last part of this tutorial, we'll extend our system a little bit more to add speech recognition on top of the tree we've built in this part. Then you'll have a local chat agent that can literally speak and listen to you.