How does text classification work?

Automatic text classification is a system developed to classify texts according to a set of categories after a training stage.

If you are already familiar with MeaningCloud, you probably know there is another resource that focuses on text classification: deep categorization models. The difference between the two is the scenario each one is meant for. Deep Categorization is designed for scenarios with relatively few categories, where each category covers cases complex enough that advanced morphosyntactic and semantic rules are needed to classify with acceptable precision and recall. Text Classification is designed for large classification models, where the rules do not need to be very detailed to classify a text successfully and good performance is key.

We set up the text classification system through a classification model. This is the workflow of the process:

Text classification workflow

In the upper row, you can see how the classification model is generated from training texts and classification rules (see below for further information on when each one is used). The classification system uses this model to classify a text and outputs one or more categories in which it can be classified.

The management dashboard lets you manage the creation of classification models.

Types

In our text classification environment there are several types of classification models. They employ training texts, classification rules or both.

A statistical model exclusively compares the input text with the training texts included in the model. In other words, when you train a model you associate each category with sample texts, so that when you enter a text to classify, the system compares it with those examples and determines which one is the nearest.

Pros
  • If your texts are properly tagged, the initial training of the model is fast and returns good results.
  • It is recommended for long texts.
Cons
  • If the number of sample texts associated with each category is not equal, the classification might favor the categories with more sample texts.
  • The statistical model gives worse results with short texts because they might be similar to the sample texts of more than one category.
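The comparison against training texts can be pictured with a small sketch. It is only an illustration and not the algorithm MeaningCloud uses: it assumes a bag-of-words representation and cosine similarity, and the `TRAINING` samples and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical training data: (category, sample text) pairs.
TRAINING = [
    ("sports", "The team won the match after scoring two goals in the second half"),
    ("sports", "The coach praised the players after the championship game"),
    ("finance", "The bank raised interest rates and the stock market fell"),
    ("finance", "Investors bought shares after the company reported strong profits"),
]


def classify_statistical(text):
    """Return (category, similarity) of the training text nearest to the input."""
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(sample)), category)
              for category, sample in TRAINING]
    score, category = max(scored)
    return category, score
```

For example, `classify_statistical("The players scored three goals in the game")` picks a sports sample as the nearest one. A very short input shares few words with any sample, which is why short texts tend to give less reliable results in a model of this kind.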

A rule-based model exclusively uses classification rules. This means that the classification is done considering only the terms defined in the model.

Pros
  • Because only terms are used, false positives are limited to ambiguous terms and terms included in more than one category.
  • It is useful when categories cover a well-defined and limited set of cases, or when you want to reduce the number of false positives returned by the system.
Cons
  • If an expression is not included in any rule, it will never be detected and the number of false negatives increases.
  • Rules must be defined exhaustively in order to obtain high accuracy.
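The behavior of a model built only from terms can be sketched as follows. This is a simplified illustration, not the rule syntax of the actual classification models: it assumes each category is defined by a plain set of terms, and `RULES` and `classify_by_rules` are hypothetical names.

```python
import re

# Hypothetical rules: each category lists the terms that trigger it.
RULES = {
    "sports": {"goal", "match", "championship"},
    "finance": {"bank", "shares", "interest rate"},
}


def classify_by_rules(text):
    """Return the sorted list of categories whose terms appear in the text."""
    # Normalize to lowercase words separated by single spaces so that
    # multiword terms such as "interest rate" can be matched.
    normalized = " ".join(re.findall(r"[a-z']+", text.lower()))
    matched = []
    for category, terms in RULES.items():
        # A category matches if any of its terms occurs as a whole word or phrase.
        if any(re.search(rf"\b{re.escape(term)}\b", normalized) for term in terms):
            matched.append(category)
    return sorted(matched)
```

With these rules, "The bank cut its interest rate" is classified as finance, while "They scored twice" returns no category at all because none of its words is covered by a rule. That second case is the false negative described in the list above.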

A hybrid model, as its name suggests, combines the two models described above and uses both training texts and classification rules. You can use the training texts to cover most of the classification, with the benefits of a purely statistical model, and apply classification rules to improve the specific cases in which the statistical model is inaccurate.

In the management dashboard, the type of the model will be determined by how categories are defined.

Pros
  • The rules don't need to be very elaborate. They only need to cover the cases in which the statistical model gives inaccurate results.
  • It has the advantages of both model types.
Cons
  • Tuning is harder: the statistical part of the model has to be tuned first, so that its mistakes can be identified and then corrected with rules.
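One way to picture the combination of training texts and rules is the sketch below. It rests on assumptions that the text above does not confirm: rules are treated as overrides that take priority over the statistical result, and the statistical part is reduced to word-overlap (Jaccard) similarity. All names and data are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Hypothetical training data: (category, sample text) pairs.
TRAINING = [
    ("sports", "The team won the match after scoring two goals in the second half"),
    ("sports", "The coach praised the players after the championship game"),
    ("finance", "The bank raised interest rates and the stock market fell"),
    ("finance", "Investors bought shares after the company reported strong profits"),
]

# Hypothetical rule added after evaluation showed that texts mentioning
# an IPO were being assigned to the wrong category.
RULES = {"finance": {"ipo"}}


def nearest_category(text):
    """Statistical part: category of the sample sharing the most words."""
    query = tokens(text)

    def jaccard(sample):
        words = tokens(sample)
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return max(TRAINING, key=lambda pair: jaccard(pair[1]))[0]


def classify_hybrid(text):
    """Apply the rules first (assumed priority), then fall back to the samples."""
    words = tokens(text)
    for category, terms in RULES.items():
        if words & terms:
            return category
    return nearest_category(text)
```

For "The club announced an IPO after the championship game", `nearest_category` returns sports because of the shared vocabulary, while `classify_hybrid` returns finance thanks to the single rule. Texts that trigger no rule are still classified by the samples alone.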

Recommendation

We always recommend starting with a statistical model, because it requires less effort and its evaluation gives a general idea of the problems that may arise from the model definition.

When the problems are identified, the best thing to do is to implement a hybrid model and correct those problems by defining classification rules.