The training text parameter, available in the category editing view, is the one used for statistical classification.
In this field you can include all the training texts for the category. The layout you use (one text per line, everything on a single line, etc.) doesn't affect the classification, but the texts must be UTF-8 encoded. We recommend adding between 1,000 and 10,000 words of representative text to each category, although the right amount may vary depending on the scenario.
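As a rough pre-check before pasting text into the field, the two requirements above (UTF-8 encoding and the recommended word count) can be verified locally. The sketch below is only illustrative: the function name `check_training_text` and the whitespace-based word count are assumptions, not part of the product, and the classifier may count words differently.

```python
def check_training_text(raw: bytes, min_words: int = 1000, max_words: int = 10000) -> dict:
    """Check that raw bytes are valid UTF-8 and count whitespace-separated words.

    The 1000-10000 range mirrors the recommendation in the text; splitting on
    whitespace is an assumed approximation of a "word".
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: the text would need to be re-encoded first.
        return {"valid_utf8": False, "words": 0, "in_recommended_range": False}

    words = len(text.split())
    return {
        "valid_utf8": True,
        "words": words,
        "in_recommended_range": min_words <= words <= max_words,
    }
```

Because layout does not affect the classification, the check ignores line breaks and only looks at the total number of words.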
As with the terms lists, if you include any text in the training text field of any category in the model, even a single word, the model stops being a rule-based one.
When we create the categories of a model, we are also setting the model type:
Model type | Terms lists | Training text |
---|---|---|
Rules model | Yes | No |
Statistical model | No | Yes |
Hybrid model | Yes | Yes |
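The way the model type follows from the category contents can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Category` class and its field names (`terms`, `training_text`) are hypothetical, and the product determines the type internally rather than through any such function.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Category:
    # Hypothetical representation of a category; field names are assumptions.
    code: str
    terms: str = ""          # contents of the terms lists
    training_text: str = ""  # contents of the training text field


def model_type(categories: Iterable[Category]) -> str:
    """Infer the model type from which fields are filled in any category."""
    cats = list(categories)
    has_terms = any(c.terms.strip() for c in cats)
    has_training = any(c.training_text.strip() for c in cats)

    if has_terms and has_training:
        return "hybrid"
    if has_training:
        return "statistical"
    if has_terms:
        return "rules"
    return "undefined"  # no category has any content yet
```

Note that a single word of training text in a single category is enough to move a model from "rules" to "hybrid", which matches the behavior described above.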
The final relevance value in a hybrid model is the weighted sum of the rule-based classification and the statistical classification, where the weight is the Rule vs Statistical value defined in the settings section.
By default, this value is set to 0.7; that is, 70% of the final relevance comes from the rules and the remaining 30% comes from the statistical classification.
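The weighting can be written out as a small function. This is a sketch rather than the actual implementation: the name `hybrid_relevance` and its parameters are hypothetical, and it assumes both relevance values are expressed on the same scale and that the statistical part receives the complement of the rule weight, as the 70%/30% split suggests.

```python
def hybrid_relevance(rule_relevance: float,
                     statistical_relevance: float,
                     rule_weight: float = 0.7) -> float:
    """Combine rule-based and statistical relevance as a weighted sum.

    rule_weight plays the role of the Rule vs Statistical setting
    (0.7 by default); the statistical part is weighted by 1 - rule_weight.
    """
    if not 0.0 <= rule_weight <= 1.0:
        raise ValueError("rule_weight must be between 0 and 1")
    return rule_weight * rule_relevance + (1.0 - rule_weight) * statistical_relevance
```

For example, with the default setting, a category with a rule-based relevance of 80 and a statistical relevance of 40 would end up with 0.7 × 80 + 0.3 × 40 = 68. Setting the weight to 1 makes the result depend only on the rules, and setting it to 0 makes it depend only on the statistical classification.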