Schema Grounded CUI

Since the goal of a chatbot is to deliver services to cooperative users, it is acceptable to handle only the conversations related to those services; we do not have to worry about conversations irrelevant to the underlying services. Of course, it is possible for users to step outside this scope: a customer might ask your barista bot for a haircut, for example. Luckily, the bot can simply respond with "I am only trained to help you with your coffee needs. What would you like to drink today?" No cooperative user will fault your service bot here. So instead of focusing on conversations, we can ground the CUI to the API schema that implements the services we want to expose.

Building CUI in this schema-grounded fashion means we always first convert user utterances into data structures (known as semantic frames), then consult the builder-defined interaction logic and application logic to figure out what semantic frame we need to communicate back to the user, and finally convert that frame into natural text in a given language and style (a minimal sketch of this pipeline appears after the list below). The schema-grounded way of building CUI has many advantages over conversation-driven approaches:

  1. The API schema provides a natural boundary for both design and implementation. Given the set of APIs, it should be immediately clear whether a given conversation is relevant or not.
  2. An API schema is typically the result of careful deliberation between the product owner and the software architect, so it is usually normalized to be minimal and orthogonal. This means similar functionalities are generally serviced by the same APIs, so there is no need to establish equivalence between user intentions at the language level; all we have to do is map language onto the APIs.
  3. We can naturally separate the different concerns, so different people can work on different aspects: the backend team can implement the actual service, the conversational user interface builder can take care of the interaction logic, and the CUI designer can provide the script for a better user experience.
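
To make the understand-decide-generate pipeline concrete, here is a minimal sketch in TypeScript. All names in it (SemanticFrame, parse, decide, render) are hypothetical, chosen only to illustrate the three stages; they are not Framely APIs.

```typescript
// A semantic frame: a typed data structure extracted from an utterance.
interface SemanticFrame {
  type: string;                          // e.g. "OrderCoffee"
  slots: Record<string, string | null>;  // e.g. { drink: "latte", size: null }
}

// Stage 1: understand — convert the utterance into a semantic frame.
// In a real system this is a trained understanding model; hardcoded here.
function parse(utterance: string): SemanticFrame {
  return { type: "OrderCoffee", slots: { drink: "latte", size: null } };
}

// Stage 2: decide — interaction and application logic pick the frame to say back.
function decide(frame: SemanticFrame): SemanticFrame {
  const missing = Object.keys(frame.slots).find((k) => frame.slots[k] === null);
  return missing
    ? { type: "PromptSlot", slots: { slot: missing } }  // ask for what is missing
    : { type: "ConfirmOrder", slots: frame.slots };     // or confirm the order
}

// Stage 3: generate — convert the response frame back into natural text.
function render(frame: SemanticFrame): string {
  return frame.type === "PromptSlot"
    ? `What ${frame.slots.slot} would you like?`
    : `Confirming your order.`;
}

console.log(render(decide(parse("I'd like a latte"))));  // "What size would you like?"
```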

Aside from these soft influences, schema-grounded CUI also places concrete requirements on how the conversational experience should be built differently.

CUI Type System

In programming languages, a type system is a logical system comprising a set of rules that assigns a property called a type to the various constructs of a computer program, such as variables, expressions, functions, or modules. It is the starting point for any programming language, and everything else is built on top of it. The standard way of describing APIs nowadays is OpenAPI, a programming-language-agnostic way to define functions, and it too has a type system at its core.

The core of a type system is to specify what types the programming language supports. Types are broadly divided into built-in types, represented by int and float, and abstract types, represented by class and function. Since types serve as the basis of problem modeling, it is important to support modern concepts such as containers, inheritance, and polymorphism, in addition to basic user-defined data types, if one wants to make it easier to model complex problems. This is exactly what OpenAPI did.
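
As a quick illustration, here is a sketch of those modeling features in TypeScript; the types are invented for this example, and the comments note the rough OpenAPI counterpart of each feature.

```typescript
// User-defined data type (OpenAPI: an object schema).
interface Address { city: string; street: string; }

// Inheritance (OpenAPI: allOf composition).
interface Contact extends Address { phone: string; }

// Polymorphism (OpenAPI: oneOf with a discriminator).
type Payment =
  | { kind: "card"; number: string }
  | { kind: "cash" };

// Container (OpenAPI: type: array).
interface Order { items: string[]; payTo: Contact; payment: Payment; }
```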

While there has recently been some effort to support composite data types out of necessity, conversation-driven CUI approaches in general do not have explicit support for user-defined types, containers, or polymorphism. While it is possible to simulate the conversational behavior of these type system features, doing so puts a burden on the CUI builder and greatly increases the cost of building a good conversational experience. Framely supports every type system feature defined by OpenAPI 3.x at the CUI level, for both input and output, with the only exception being the dictionary, which has output support only. This way, Framely builders can enjoy modeling the problem at the level of abstraction permitted by a modern type system.

The types in OpenAPI are supported in Framely as follows: functions are mapped to intents, classes are mapped to frames, and primitive types and enums are mapped to entities. This means there is a one-to-one mapping from the semantics of a user utterance to a data structure at the programming language level. The conversational behavior for these types can be defined through declarative annotations at both the interaction level and the language level. If defined in a library, these behaviors can be reused by other builders after importing.
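
For example, here is a sketch of that mapping for a coffee-ordering service; the names are hypothetical and this is TypeScript rather than Framely syntax, but it shows which schema construct each CUI concept is grounded in.

```typescript
// Primitive/enum -> entity: the set of values the bot can recognize.
enum Size { Small, Medium, Large }

// Class -> frame: a structured object the bot can fill via conversation.
interface Coffee { drink: string; size: Size; }

// Function -> intent: a service call whose parameters are filled from dialog.
function orderCoffee(order: Coffee): string {
  return `Ordered a ${Size[order.size]} ${order.drink}.`;
}
```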

Structured Conversations

Since conversational behavior is defined on frames, and frames are potentially nested, conversational behavior is necessarily nested as well. To make sure we can fill a nested structure through conversation, we make the bot follow a simple strategy like depth-first search. A conversation that follows such a systematic algorithm is called a structured conversation.
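
Below is a minimal sketch of that strategy, assuming a hypothetical askUser function that stands in for one full ask-and-understand turn with the user; it illustrates the idea, not Framely's actual filling engine.

```typescript
// A nested slot: either a composite with children, or an atomic value.
type Slot = { name: string; value?: string; children?: Slot[] };

// Hypothetical stand-in for one ask-and-understand turn with the user.
declare function askUser(slotName: string): string;

function fill(slot: Slot): void {
  if (slot.children) {
    // Composite slot: descend into nested frames first (depth-first).
    for (const child of slot.children) fill(child);
  } else if (slot.value === undefined) {
    // Atomic slot: ask the user only if the value is still missing.
    slot.value = askUser(slot.name);
  }
}
```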

Structured conversations are simply composite conversation sequences, in both the sequential and the nested sense. Each sequence has a concrete goal and a clear start and finish. The finish can come in different flavors, including abort by the user or early exit per business logic. Furthermore, each sequence has a clear owner: the owner starts the sequence, sets the goal, and communicates it to the other party, and the other party cooperates to help achieve the owner's goal.

The current conversation owner can yield ownership to the other party. For example, the bot can say "What can I do for you?", which can potentially start a nested conversation sequence owned by the user. After that sequence concludes, ownership automatically returns to the bot. Of course, the user can also choose not to take ownership by simply replying "nothing, thanks", which brings closure to the bot-owned sequence; the bot is expected to simply close the sequence in the next turn.
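
The ownership discipline can be pictured as a stack of sequences, as in the following sketch; the Sequence type and the turn-by-turn comments are illustrative assumptions, not Framely internals.

```typescript
// Nested conversation sequences modeled as a stack, innermost last.
interface Sequence { owner: "bot" | "user"; goal: string; }

const sequences: Sequence[] = [];

// The bot opens a sequence and yields ownership: "What can I do for you?"
sequences.push({ owner: "bot", goal: "offer help" });

// The user takes ownership by starting a nested sequence of their own...
sequences.push({ owner: "user", goal: "order coffee" });

// ...and when that sequence concludes, ownership returns to the bot's sequence.
sequences.pop();

// The bot then closes its own sequence in the next turn.
sequences.pop();
```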

Implicit Context Management

One of the hallmarks of natural language is that the same word can mean different things in different contexts. While context-dependent behavior has to be explicitly modeled by conversation-driven approaches, schema-grounded CUI naturally has a context: the stack of frames that represents the current filling. This forest representation of the dialog state in the schema space makes it easy to find the coreferent for a pronoun and produces contextual behavior almost as a by-product.
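
As an illustration, here is a sketch of how a pronoun like "it" might be resolved against that stack of frames; the types and the innermost-first matching rule are assumptions made for this example, not Framely internals.

```typescript
interface Frame { type: string; slots: Record<string, string>; }

// The frames currently being filled, innermost last.
const filling: Frame[] = [
  { type: "OrderCoffee", slots: { drink: "latte" } },
  { type: "PickSize",    slots: {} },
];

// For "make it large", look for the innermost frame that already holds a
// candidate referent of the kind the pronoun must have (here, a drink).
function resolvePronoun(slotName: string): string | undefined {
  for (let i = filling.length - 1; i >= 0; i--) {
    const value = filling[i].slots[slotName];
    if (value !== undefined) return value;  // innermost match wins
  }
  return undefined;
}

console.log(resolvePronoun("drink"));  // "latte" is the antecedent of "it"
```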
