Prakash Dale · August 9, 2020

Rasa Open Source is a machine learning framework for automated text and voice-based conversations. NLU interprets the user’s message based on the training data it has been given:

  • Intent classification: interpreting the purpose (intent) of the user message.
  • Entity extraction: recognizing structured data in the message.

Core decides what happens next in the conversation. Its machine learning-based dialogue management predicts the next best action based on the input from NLU, the conversation history, and your training data.

Best Practices for designing NLU data

  • Use real data

    Avoid using tools to auto-generate training data. Doing so can cause the model to lose its ability to generalize, so that it recognizes only phrases it has seen before; in other words, it overfits the training data.

  • Keep training examples distinct across intents

    When training examples for different intents are too similar, intent confusion results.

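  • Merge on intents, split on entities

    Messages that differ only in a detail should share one intent, with an entity capturing the detail. In the stories below, both branches use the same inform intent, and the status slot decides which path the conversation takes.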

      ## returning user
      * greet
          - utter_ask_if_new
      * inform
          - slot{"status": "returning"}
          - utter_welcome_back
      ## new user onboarding
      * greet
          - utter_ask_if_new
      * inform
          - slot{"status": "new"}
          - utter_create_account

  • Use synonyms wisely

    A common misconception is that synonyms are a method of improving entity extraction. In fact, synonyms are more closely related to data normalization, or entity mapping. Synonyms convert the entity value provided by the user to another value—usually a format needed by backend code.
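
    To make the normalization idea concrete, here is a minimal sketch in plain Python. It is not Rasa's implementation: the SYNONYMS table, the normalize_entities helper, and the sample entities are all hypothetical. The point it illustrates is that the mapping runs on values that have already been extracted, so it cannot help an entity get found in the first place.

```python
# Hypothetical synonym table: surface forms a user might type -> canonical value.
SYNONYMS = {
    "nyc": "New York City",
    "new york": "New York City",
    "the big apple": "New York City",
}


def normalize_entities(entities):
    """Map each already-extracted entity value to its canonical form.

    Values with no entry in the table are passed through unchanged.
    """
    normalized = []
    for entity in entities:
        value = entity["value"]
        canonical = SYNONYMS.get(value.lower(), value)
        normalized.append({**entity, "value": canonical})
    return normalized


# Pretend these came out of the entity extractor (illustrative data only).
extracted = [
    {"entity": "city", "value": "NYC"},
    {"entity": "city", "value": "Boston"},
]
result = normalize_entities(extracted)
print(result)
```

    Backend code then only has to handle the canonical value, whichever variant the user typed.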

  • Understand lookup tables and regex

    Lookup tables and regexes are featurized: they become input features used to train the NLU model itself, rather than acting as exact-match rules. This is why an entity value can appear in a lookup table and still not get extracted. It is not common, but it is possible.

    For best results, include a few of the entity values from your lookup tables and regexes in your training examples.
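
    A rough way to picture featurization is as per-token flags that are handed to the model alongside its other features. The sketch below is an assumption-laden simplification in plain Python, not Rasa's RegexFeaturizer: ZIP_PATTERN, CITY_LOOKUP, and token_features are hypothetical names chosen for illustration.

```python
import re

ZIP_PATTERN = re.compile(r"\d{5}")          # hypothetical regex feature
CITY_LOOKUP = {"berlin", "boston", "pune"}  # hypothetical lookup table


def token_features(message):
    """Return one feature dict per whitespace-separated token.

    Each flag is only a hint for the model; nothing here extracts an entity.
    """
    features = []
    for token in message.split():
        word = token.strip(".,!?").lower()
        features.append({
            "token": token,
            "matches_zip_regex": int(ZIP_PATTERN.fullmatch(word) is not None),
            "in_city_lookup": int(word in CITY_LOOKUP),
        })
    return features


features = token_features("Ship it to Boston 02134 please.")
for row in features:
    print(row)
```

    Because the flags are inputs rather than rules, the model still needs annotated training examples in which those flags coincide with labelled entities before it learns to rely on them.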

  • Leverage pre-trained entity extractors

    Entity types such as names, dates, places, and email addresses can be extracted without a lot of training data by using the pre-trained entity extractors available in Rasa, such as SpacyEntityExtractor and DucklingEntityExtractor.

  • Always include an out-of-scope intent

    An out-of-scope intent is a catch-all intent for anything the user might say that’s outside the assistant’s domain.

  • Handle misspelled words

    It’s a given that the messages users send to your assistant will contain spelling errors.

    Before turning to a custom spellchecker component, try including common misspellings in your training data, along with the NLU pipeline configuration below. This pipeline uses character n-grams in addition to word n-grams, which allows the model to take parts of words into account, rather than just looking at the whole word. By doing so, it can better recover from misspellings.

      language: "en"
      pipeline:
        - name: ConveRTTokenizer
        - name: ConveRTFeaturizer
        - name: RegexFeaturizer
        - name: LexicalSyntacticFeaturizer
        # word-level features (default analyzer)
        - name: CountVectorsFeaturizer
        # character n-grams within word boundaries, 1 to 4 characters long
        - name: CountVectorsFeaturizer
          analyzer: "char_wb"
          min_ngram: 1
          max_ngram: 4
        - name: DIETClassifier
          epochs: 100
        - name: EntitySynonymMapper
        - name: ResponseSelector
          epochs: 100
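
    To see why character n-grams help with misspellings, the sketch below compares a word with a misspelled variant. It is a simplified plain-Python approximation of a "char_wb" analyzer (each word is padded with a space on both sides and n-grams never cross word boundaries), not the featurizer Rasa actually runs. The helper names and the example words are assumptions made for illustration.

```python
def char_wb_ngrams(text, min_n=1, max_n=4):
    """Collect character n-grams inside each word, padding words with spaces."""
    ngrams = set()
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                ngrams.add(padded[i:i + n])
    return ngrams


def jaccard(a, b):
    """Share of items the two sets have in common (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)


correct = char_wb_ngrams("restaurant")
typo = char_wb_ngrams("restaraunt")

# At the word level the two spellings share nothing at all...
word_overlap = jaccard({"restaurant"}, {"restaraunt"})
# ...but many of their character n-grams are identical.
char_overlap = jaccard(correct, typo)

print(f"word-level similarity: {word_overlap:.2f}")
print(f"char n-gram similarity: {char_overlap:.2f}")
```

    A whole-word featurizer treats the misspelling as an unseen token, while the character-level featurizer still produces many of the same features as the correct spelling.
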
  • Treat your data like code
  • Test your updates
