10 Best Practices for Designing NLU Training Data | The Rasa Blog

This command updates your config.yml and domain.yml, making backups of your existing files with the .bak suffix. Once the assistant is re-trained with the above configuration, you should also tune the fallback confidence thresholds. The training_states_and_actions method of the TrackerFeaturizer, FullDialogueTrackerFeaturizer and
MaxHistoryTrackerFeaturizer classes is deprecated and will be removed in Rasa 3.0.

You can use regular expressions to improve intent classification by including the RegexFeaturizer component in your pipeline. With the RegexFeaturizer, a regex does not act as a rule for classifying an intent; it only provides a feature that the intent classifier uses
to learn patterns for intent classification.
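As a sketch, a RegexFeaturizer setup involves two pieces: the component in config.yml and a regex in the NLU training data. The component names below follow Rasa's standard pipeline; the account_number pattern is an illustrative assumption.

```yaml
# config.yml: place RegexFeaturizer before the intent classifier
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
```

```yaml
# nlu.yml: the regex contributes a feature, not a hard classification rule
nlu:
  - regex: account_number
    examples: |
      - \d{10,12}
```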

Natural Language Processing

In order to gather real data, you're going to need real user messages. A bot developer
can only come up with a limited range of examples, and users will always surprise you
with what they say. This means you should share your bot with test users outside the
development team as early as possible.

Each folder should contain multiple intents; consider whether the set of training data you're contributing could fit within an existing folder before creating a new one. You can use regular expressions for rule-based entity extraction with the RegexEntityExtractor component in your NLU pipeline. In config.yml, the first section defines the NLU pipeline and the second defines the policies for Rasa Core.
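The two-section layout of config.yml can be sketched as follows; the specific components and policies listed here are illustrative choices, not requirements.

```yaml
# config.yml has two top-level sections:
pipeline:                        # NLU processing components
  - name: WhitespaceTokenizer
  - name: RegexEntityExtractor   # rule-based entity extraction via regexes
  - name: DIETClassifier
policies:                        # dialogue management (Rasa Core)
  - name: RulePolicy
  - name: TEDPolicy
```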

This release breaks backward compatibility of machine learning models. It is not possible to load models trained with previous versions of Rasa. To extract custom slots that are not defined in any form's required_slots, you should now use a global custom slot mapping
and extend the ValidationAction class. Each slot in the slots section of the domain needs a new mappings key: a list of mappings moved out of the forms, while the forms' required_slots field collapses to a list of slot names.
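The migrated domain shape can be sketched like this; the cuisine slot and restaurant_form names are illustrative assumptions.

```yaml
# Rasa 3.0 domain.yml: mappings move from the form to the slot itself
slots:
  cuisine:
    type: text
    mappings:
      - type: from_entity
        entity: cuisine
forms:
  restaurant_form:
    required_slots:      # now just a list of slot names
      - cuisine
```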

Entities, Roles and Groups

This section provides best practices around generating test sets and evaluating NLU accuracy at a dataset and intent level. A single NLU developer thinking of different ways to phrase various utterances can be thought of as a "data collection of one person". However, data collected from many people is preferred, since this provides a wider variety of utterances and gives the model a better chance of performing well in production.

Instead, all NLU components have to override the create method of the
GraphComponent interface. The configuration passed in is your NLU component's default
configuration, including any updates from your model configuration file. It is also
required to return the message objects at the end of the process method.
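The create/process pattern can be sketched with a self-contained stand-in. The GraphComponent class below only mimics the shape of Rasa's interface (it is not the real class), and LowercaseNormalizer is a hypothetical component invented for illustration.

```python
from typing import Any, Dict, List


class GraphComponent:
    """Stub that mimics the shape of Rasa 3.x's GraphComponent interface."""

    @classmethod
    def create(cls, config: Dict[str, Any], *args: Any) -> "GraphComponent":
        raise NotImplementedError


class LowercaseNormalizer(GraphComponent):
    """Hypothetical NLU component that lowercases message text."""

    @staticmethod
    def get_default_config() -> Dict[str, Any]:
        return {"strip": True}

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    @classmethod
    def create(cls, config: Dict[str, Any], *args: Any) -> "LowercaseNormalizer":
        # The passed-in config is the component's default configuration
        # merged with any overrides from the model configuration file.
        merged = {**cls.get_default_config(), **config}
        return cls(merged)

    def process(self, messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
        for message in messages:
            text = message["text"].lower()
            message["text"] = text.strip() if self.config["strip"] else text
        # The (possibly modified) message objects must be returned.
        return messages


component = LowercaseNormalizer.create({"strip": False})
out = component.process([{"text": "  Hello WORLD "}])
```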

How To Build Your Own Custom ChatGPT Bot

Rasa Open Source is a Python library for powering conversational software, built on the latest machine learning research. It is licensed under the Apache 2.0 license, and the full code for the project is hosted on GitHub. Rasa Open Source is actively maintained by a team of Rasa engineers and machine learning researchers, as well as open source contributors from around the world.

  • Looking at the domain concept, the related knowledge database is queried for each of the defined entity types, with the goal of extracting all available values and storing them in a list.
  • Make sure you read through this guide thoroughly, so that all parts of your bot are updated.
  • Rasa Open Source is a robust platform that includes natural language understanding and open source natural language processing.
  • Before the RL problem can be solved, however, it must first be ensured that the correct intents and entities are found.

Synonyms don’t have any effect on how well the NLU model extracts the entities in the first place. If that’s your goal, the best option is to provide training examples that include commonly used word variations. Based on the results at hand, we recommend applying the domain approach following the EX 1 construction when training an intent classifier that only needs to perform well in a certain domain. When aiming for a more robust, open-domain intent classifier, we recommend using PH type 1 values to construct the training dataset. Although the performance in some domains might be lower compared to using domain-specific values for training, the performance across all domains will be higher.
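A synonym definition only normalizes a value after it has been extracted; the extraction itself still relies on the variations appearing in training examples. A minimal sketch, with an assumed "credit" synonym:

```yaml
# nlu.yml: extracted variants are normalized to the value "credit"
nlu:
  - synonym: credit
    examples: |
      - credit account
      - credit card account
```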

Entity spans

Table 1 shows how both concepts work and further depicts an example for each of them. The example shows how one of the entity values of type lecture is used to fill the empty slot of matching type in the example utterance. This utterance, together with the appropriate labels, can then be used to train the NLU component.

The good news is that once you start sharing your assistant with testers and users, you can start collecting these conversations and converting them to training data. Rasa X is the tool we built for this purpose, and it also includes other features that support NLU data best practices, like version control and testing. The term for this method of growing your data set and improving your assistant based on real data is Conversation Driven Development.
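The placeholder-filling idea described above (an entity value of type lecture filling a matching empty slot in an utterance template) can be sketched as follows; the template syntax and the example values are illustrative assumptions, not the paper's actual data.

```python
# Fill an utterance template's slot with each known entity value of
# the matching type to generate labeled training utterances.
lecture_values = ["Linear Algebra", "Machine Learning"]
template = "When does {lecture} start?"

training_utterances = [template.format(lecture=value) for value in lecture_values]
```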


Any alternate casing of these phrases (e.g. CREDIT, credit ACCOUNT) will also be mapped to the synonym. The / symbol is reserved as a delimiter to separate retrieval intents from response text identifiers. This command is most commonly used to import old conversations into Rasa X/Enterprise to annotate
them. The Rasa CLI now includes a new argument, --logging-config-file, which accepts a YAML file as its value. Head to "Talk to Your Bot" in the menu on the left and start conversing with your bot. You can install the latest version, but this post is based on v2.6.2, so any v2.x should work perfectly with what's covered here.

Building our Bot

Rasa end-to-end training is fully integrated with the standard Rasa approach. This means you can have mixed stories, with some steps defined by actions or intents
and other steps defined directly by user messages or bot responses. If you have been dynamically filling slots not present in the form's required_slots defined in the domain.yml
file, note that this behaviour is no longer supported in 3.x. Any dynamic slots with custom mappings that are set in
the last user turn will be filled only if they are returned by the required_slots method of the custom action
inheriting from FormValidationAction.
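The dynamic required_slots override can be sketched with a self-contained stand-in. The FormValidationAction class below is only a stub mimicking rasa_sdk's base class, and the form and slot names are hypothetical.

```python
from typing import Any, List


class FormValidationAction:
    """Stub standing in for rasa_sdk.forms.FormValidationAction."""

    def required_slots(self, domain_slots: List[str], *args: Any) -> List[str]:
        return domain_slots


class ValidateRestaurantForm(FormValidationAction):
    """Hypothetical validation action that appends a dynamic slot."""

    def name(self) -> str:
        return "validate_restaurant_form"

    def required_slots(self, domain_slots: List[str], *args: Any) -> List[str]:
        # In 3.x, a dynamically-set slot is only filled if this method
        # returns it alongside the slots defined in the domain.
        return domain_slots + ["special_request"]


action = ValidateRestaurantForm()
slots = action.required_slots(["cuisine", "num_people"])
```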


If you are just starting to build your model and don't have any training data yet, 15–20 examples for each intent is a good starting point. Training data should have a rich set of examples to cover the variety in which real users may phrase their queries. Looking at the top right of the diagram on the left, the total loss is calculated as the sum of the entity loss, mask loss, and intent loss.
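An intent with a varied starting set of examples might look like the sketch below; the check_balance intent and its phrasings are invented for illustration.

```yaml
# nlu.yml: aim for 15-20 varied examples per intent when starting out
nlu:
  - intent: check_balance
    examples: |
      - what's my balance
      - how much money do I have
      - show me my account balance
      - can you tell me how much is in my checking account
      - balance please
```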

Training data files#

Some data management is helpful here to segregate the test data from the training and validation data, and from the model development process in general. Ideally, the person handling the splitting of the data into train/validate/test and the testing of the final model should be someone outside the team developing the model. Note that it is fine, and indeed expected, that different instances of the same utterance will sometimes fall into different partitions. Otherwise, if the new NLU model is for a new application for which no usage data exists, then artificial data will need to be generated to train the initial model. It is a good idea to use a consistent convention for the names of intents and entities in your ontology. This is particularly helpful if there are multiple developers working on your project.
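A train/test partition over labeled utterances can be sketched generically as below (Rasa also ships a `rasa data split nlu` command for this); the data and the 80/20 ratio are illustrative assumptions.

```python
import random

# Toy labeled dataset: (utterance, intent) pairs invented for illustration.
examples = [
    (f"utterance {i}", "intent_a" if i % 2 else "intent_b") for i in range(100)
]

rng = random.Random(42)  # fixed seed so the split is reproducible
shuffled = examples[:]
rng.shuffle(shuffled)

# Hold out 20% of the shuffled examples as the test partition.
split = int(0.8 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]
```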
