Automatic text classification assigns one or more categories to a text, chosen from a predefined set of categories.
If you are already familiar with MeaningCloud, you'll probably know there's another resource that is used to do this: deep categorization models. The difference between these two resources is the scenario in which we apply each one of them.
Deep Categorization is designed for classification scenarios where there are not that many categories, but the casuistry for each one is complex enough that advanced rules using morphosyntactic and semantic elements are needed to classify with an acceptable degree of precision/recall.
Text Classification is designed for large models where the rules do not need to be very detailed to classify a text successfully and where good performance is key. It's also especially useful when you already have a training text collection for the categories, as it allows you to reuse it and set up a working model very quickly.
We set up the text classification system through a classification model. This is the workflow of the process:
In the upper row, you can see how the classification model is generated from training texts and classification rules (see below for further information on when each one is used). The classification system uses this model to classify a text and outputs one or more categories in which it can be classified.
The management dashboard lets you manage the creation process of the classification models.
In our text classification environment there are several types of classification models. They employ training texts, classification rules or both.
A statistical model exclusively compares the input text with the training texts included in the model. In other words, when you train a model you associate each category with sample texts, so that when you enter a text to classify, the system compares it with the examples and determines which one is the nearest.
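The idea of comparing an input text with training texts can be sketched as a nearest-example search. This is a minimal illustration, not MeaningCloud's actual algorithm: the bag-of-words representation, the cosine similarity measure, the function names and the sample data are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_statistical(text, training_texts):
    """Return (category, score) for the training text nearest to `text`.

    `training_texts` maps a category name to a list of sample texts
    (a hypothetical format chosen for this sketch).
    Returns (None, 0.0) when no sample shares any word with the text.
    """
    query = bag_of_words(text)
    best_category, best_score = None, 0.0
    for category, samples in training_texts.items():
        for sample in samples:
            score = cosine(query, bag_of_words(sample))
            if score > best_score:
                best_category, best_score = category, score
    return best_category, best_score


# Hypothetical training collection: each category is associated with sample texts.
training_texts = {
    "Sports": ["The team won the football match", "The tennis player lost the final"],
    "Finance": ["The bank raised interest rates", "Stock markets fell sharply today"],
}

category, score = classify_statistical("Interest rates at the central bank", training_texts)
print(category)
```

In this sketch the input shares "bank", "interest" and "rates" with one of the Finance samples, so that sample is the nearest and its category is returned.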
A rule-based model exclusively uses classification rules. This implies that the classification is done considering only the terms defined in the model.
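Classification that considers only the terms defined in the model could look roughly like the sketch below. The rule format (a set of single-word terms per category), the scoring by number of matched terms and the function name are assumptions for illustration; they do not reproduce MeaningCloud's rule syntax.

```python
import re


def classify_by_rules(text, rules):
    """Return the matching categories, ordered by number of matched terms.

    `rules` maps a category name to a set of single-word terms
    (a hypothetical format chosen for this sketch).
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {}
    for category, terms in rules.items():
        hits = tokens & {term.lower() for term in terms}
        if hits:
            scores[category] = len(hits)
    # Categories with more matched terms come first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rules: only these terms are considered when classifying.
rules = {
    "Sports": {"football", "tennis", "match"},
    "Finance": {"bank", "stock", "interest"},
}

categories = classify_by_rules("The bank cut its interest rate before the match", rules)
print(categories)
```

A text that contains none of the defined terms gets no category at all, which is the main practical difference from a model built on training texts.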
As its name suggests, a hybrid model combines the two models described before and uses both training texts and classification rules. You can use the training texts to cover the greater part of the classification, with the benefits of a purely statistical model, and apply classification rules to improve the specific cases in which the statistical model is inaccurate.
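One possible way to combine training texts and rules is to let a rule decide whenever it fires and fall back to the comparison with training texts otherwise. The sketch below assumes that strategy, along with a simple Jaccard word overlap as the statistical stand-in; the function names and data are hypothetical, and MeaningCloud may combine the two sources differently.

```python
import re


def tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def statistical_guess(text, training_texts):
    """Pick the category of the training text with the highest Jaccard overlap."""
    query = tokens(text)
    best_category, best_score = None, 0.0
    for category, samples in training_texts.items():
        for sample in samples:
            sample_tokens = tokens(sample)
            union = query | sample_tokens
            score = len(query & sample_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_category, best_score = category, score
    return best_category


def classify_hybrid(text, training_texts, rules):
    """Use a rule where one fires; otherwise fall back to the statistical guess.

    `rules` maps a category name to a set of lowercase single-word terms.
    """
    query = tokens(text)
    for category, terms in rules.items():
        if query & terms:
            return category
    return statistical_guess(text, training_texts)


# Hypothetical training collection and one corrective rule.
training_texts = {
    "Sports": ["The team won the match", "A great goal in the final"],
    "Finance": ["The bank raised interest rates", "Stock markets fell today"],
}
rules = {"Sports": {"striker"}}

text = "The bank financed the striker"
print(statistical_guess(text, training_texts))
print(classify_hybrid(text, training_texts, rules))
```

Here the training texts alone pull the example toward Finance because of the word "bank", while the rule on "striker" corrects that specific case. Texts that trigger no rule are still classified from the training texts.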
In the management dashboard, the type of the model will be determined by how categories are defined.
We always recommend starting with a statistical model, because it requires less effort and its evaluation gives a general idea of the problems that may arise from the model definition.
When the problems are identified, the best thing to do is to implement a hybrid model and correct those problems by defining classification rules.