Combine training text and rules (hybrid models)

The text parameter, available in the category editing view, is the one used for statistical classification.

  • Text: this field holds all the training texts for the category. The layout you use (one text per line, everything on a single line, etc.) does not affect the classification, but the texts must be UTF-8 encoded. We recommend adding between 1,000 and 10,000 words of representative text to each category, although the right amount may vary depending on the scenario.
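Before pasting training text into the field, it can help to check the two constraints mentioned above. The following is a minimal sketch, not part of the product: the function name `check_training_text` and its return format are hypothetical, and the default limits simply mirror the recommended 1,000 to 10,000 words.

```python
def check_training_text(raw: bytes, min_words: int = 1000, max_words: int = 10000) -> dict:
    """Report whether raw bytes are valid UTF-8 and how many words they contain.

    Hypothetical helper: the word limits are the recommended range, not a hard rule.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so the text should be re-encoded before use.
        return {"utf8": False, "words": 0, "in_recommended_range": False}

    # Whitespace splitting ignores line layout, matching the note that
    # the layout of the texts does not matter.
    words = len(text.split())
    return {
        "utf8": True,
        "words": words,
        "in_recommended_range": min_words <= words <= max_words,
    }
```

Counting words by splitting on whitespace is only an approximation; the classifier may tokenize differently.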

As with the terms lists, if you add any text to the training text field of any category in the model, even a single word, the model stops being a rule-based one.

When we create the categories of a model, we are also setting the model type:

Model type          Terms lists   Training text
Rules model         Yes           No
Statistical model   No            Yes
Hybrid model        Yes           Yes
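The way the filled-in fields determine the model type can be sketched as follows. This is an illustration under assumptions: categories are represented as plain dictionaries with hypothetical `terms` and `text` keys, and the function `model_type` does not correspond to any actual API.

```python
def model_type(categories):
    """Infer the model type from which fields are filled in across all categories.

    Assumed representation: each category is a dict with an optional
    "terms" list and an optional "text" string (hypothetical keys).
    """
    # A single non-empty field in any category is enough to change the type.
    has_terms = any(c.get("terms") for c in categories)
    has_text = any(c.get("text", "").strip() for c in categories)

    if has_terms and has_text:
        return "hybrid"
    if has_text:
        return "statistical"
    if has_terms:
        return "rules"
    return None  # assumption: a model with no terms and no text has no type yet
```

Note that the check runs over the whole model: one word of training text in one category is enough to turn a rules model into a hybrid one.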

Important

The final relevance value in a hybrid model is not the sum of the statistical classification and the rule-based classification.

A hybrid model first runs the statistical classification and then applies the defined rules to refine the relevance values. For this reason:

  • if a category is not included in the initial statistical classification, it will never be returned after the rule-based classification, even if the text contains terms defined in the category rules.
  • if a category doesn't have any training text, the statistical classification will never return it, so it will never appear in the final classification.
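The two consequences above can be illustrated with a toy pipeline. Everything here is an assumption made for the example: the word-overlap scoring in `statistical_step`, the `RULE_BOOST` factor, and the dictionary layout of the categories are hypothetical and do not describe how the real engine computes relevance. The only behaviour taken from the text is the order of the two steps and the fact that rules act solely on categories the statistical step returned.

```python
# Hypothetical illustration only: neither function reflects the real engine.
RULE_BOOST = 1.5  # assumed factor; the actual rule adjustment is not specified


def statistical_step(text, categories):
    """Toy statistical classifier: share of input words seen in the training text."""
    tokens = set(text.lower().split())
    scores = {}
    for cat in categories:
        vocabulary = set(cat.get("text", "").lower().split())
        if not vocabulary:
            continue  # no training text: the category can never be returned
        overlap = len(tokens & vocabulary) / len(tokens) if tokens else 0.0
        if overlap > 0:
            scores[cat["name"]] = overlap
    return scores


def classify_hybrid(text, categories):
    """Run the statistical step first, then let rules adjust those results only."""
    tokens = set(text.lower().split())
    scores = statistical_step(text, categories)
    final = {}
    for cat in categories:
        name = cat["name"]
        if name not in scores:
            continue  # rules cannot bring back a category the first step dropped
        relevance = scores[name]
        terms = {t.lower() for t in cat.get("terms", [])}
        if tokens & terms:
            # Adjust the statistical value; this is not a sum of two scores.
            relevance = min(1.0, relevance * RULE_BOOST)
        final[name] = relevance
    return final


categories = [
    {"name": "sports", "text": "football match goal team", "terms": ["goal"]},
    {"name": "finance", "text": "", "terms": ["bank"]},
    {"name": "politics", "text": "election vote parliament", "terms": ["vote"]},
]
result = classify_hybrid("the team scored a goal near the bank", categories)
print(sorted(result))  # only "sports": "finance" matches a term but has no training text
```

In this sketch "finance" is never returned even though the input contains "bank", because the category has no training text and so never passes the statistical step.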