Term definition

Term definitions for rule-based and hybrid models are not just lists of words: a number of operators let you define more complex terms.

There are four elements to consider: simple terms, multiwords, the context operator and the OR operator.

The most basic element is the simple term. It consists of a single word (so it contains no blank spaces) and no operators. This word will be matched literally in the processed text.

Because the match is carried out over the processed text, all the steps described in the text tokenization section have already been applied and stopwords have been removed. This means that a simple term cannot be a stopword or any other word removed during tokenization (for example, a word made up only of numbers).

In its simplest form, a list of terms can consist only of simple terms.

Important

A dot is considered a valid character in a simple term, so terms like "U.K" or "google.com" would be correct.

The next element is the multiword, which specifies a combination of words (an n-gram) that has to appear in the text in exactly the order and form given. The operator used for multiwords is the plus sign, '+', which joins all the words that form the multiword.

As we have seen in the text tokenization section, multiwords are grouped before removing stopwords, which means that stopwords can be included in multiwords.

    Example: 'too+many', 'machine+learning', 'based+on+a+true+story'
ID Text Term 1 Term 2 Result 1 Result 2
1 The machine is learning to think by itself machine - -
2 It uses a machine learning algorithm machine - -
3 The machine is learning to think by itself machine+learning - -
4 It uses a machine learning algorithm machine+learning - -
5 The machine is learning to think by itself machine machine+learning
6 It uses a machine learning algorithm machine machine+learning

This table shows several examples of when a term is detected and when it is not. Examples 5 and 6 show the importance of text tokenization and how it varies depending on the multiwords defined.
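
The effect of grouping multiwords before removing stopwords can be sketched in a few lines of Python. This is only an illustration, not the actual tokenizer: the stopword list, the `tokenize` function and the assumption that matching is case-insensitive are all hypothetical.

```python
import re

# Hypothetical stopword list for illustration; the real one is not shown here.
STOPWORDS = {"the", "is", "to", "by", "it", "a", "i"}


def tokenize(text, multiwords=()):
    """Group multiwords first, then drop stopwords and number-only words."""
    # Assumption: matching is case-insensitive; dots inside words are kept.
    words = [w.strip(".") for w in re.findall(r"[\w.]+", text.lower())]
    words = [w for w in words if w]
    # Longer multiwords are tried first so shorter ones do not shadow them.
    parts = sorted((mw.split("+") for mw in multiwords), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(words):
        for p in parts:
            if words[i:i + len(p)] == p:
                tokens.append("+".join(p))  # the whole n-gram becomes one token
                i += len(p)
                break
        else:
            # No multiword starts here: keep the word unless it is filtered out.
            if words[i] not in STOPWORDS and not words[i].isdigit():
                tokens.append(words[i])
            i += 1
    return tokens


text_5 = "The machine is learning to think by itself"
text_6 = "It uses a machine learning algorithm"
tokens_5 = tokenize(text_5, ["machine+learning"])
tokens_6 = tokenize(text_6, ["machine+learning"])

print("machine" in tokens_5, "machine+learning" in tokens_5)
print("machine" in tokens_6, "machine+learning" in tokens_6)
```

In this sketch the first line prints True False and the second prints False True: once 'machine+learning' is defined, the words "machine learning" in the second text become a single token, so the simple term 'machine' is no longer found there.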

The simple terms and the multiwords are the basic elements that will be combined through the use of operators to define more complex terms.

The context operator lets you consider that a term appears in a text only when two or more simple terms or multiwords appear in it. In other words, it evaluates co-occurrences, which limits the context in which a term is considered to appear.

The context operator is represented by an underscore, '_', and the order of the terms it joins does not affect the result of the evaluation.

The following table shows several examples of sentences and terms defined using the context operator. The last two columns show the result of the evaluation of those terms:

ID Text Term 1 Term 2 Result 1 Result 2
1 The machine is learning to think by itself machine_think - -
2 It uses a machine learning algorithm machine_think - -
3 I think the machine is working now machine_think - -
4 The machine is learning to think by itself machine_think machine+learning
5 I think it uses a machine learning algorithm machine_think machine+learning
6 I think it uses a machine learning algorithm machine+learning_think machine+learning

Examples 1 and 3 show that the order in which the terms appear in the text does not change the result of the evaluation. Examples 4, 5 and 6 show how the definition of multiwords affects the text tokenization and therefore the detection of the terms.
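
A minimal sketch of how a context term could be evaluated, assuming the text has already been tokenized. The function name `context_match` and the hand-written token lists are hypothetical; the lists only approximate what the tokenization would produce.

```python
def context_match(term, tokens):
    """True when every part joined by '_' is present among the tokens."""
    # Sets ignore position, so the order of the parts does not matter.
    return set(term.split("_")) <= set(tokens)


# Hand-written token lists, assuming stopwords have already been removed.
tokens_1 = ["machine", "learning", "think", "itself"]
tokens_3 = ["think", "machine", "working", "now"]
# Same text as example 5, assuming the multiword machine+learning is defined.
tokens_5 = ["think", "uses", "machine+learning", "algorithm"]

print(context_match("machine_think", tokens_1))
print(context_match("machine_think", tokens_3))
print(context_match("machine_think", tokens_5))
print(context_match("machine+learning_think", tokens_5))
```

Under these assumptions, 'machine_think' is found in the first two token lists regardless of word order, but not in the third, where "machine" only exists as part of the grouped token 'machine+learning'.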

Did you notice...?

As the system is accent-insensitive, the context operator is very useful for limiting the context in cases where a word means one thing or another depending on the accent mark.

This happens very often in languages such as Spanish or French. For instance: "ingles" vs "inglés" could be disambiguated using "clases_ingles".

Important

Accent marks are not allowed (except for the 'ñ' letter). If you add a term with an accent mark, it will be automatically removed before saving.

Finally, the OR operator lets you use a logical OR in conjunction with the context operator. It is represented by a tilde, '~' (Alt Gr + 4), and makes it possible to write terms such as "A or B in the same context as C".

It does not have to be used with the context operator, but that is where its value is most apparent:

ml~machine+learning_think = ml_think
                            machine+learning_think

As it's a logical OR, when the expression does not contain a context operator, the result is the same as listing the terms on separate lines:

ml~machine+learning = ml
                      machine+learning

email~e+mail~electronic+mail = email
                               e+mail
                               electronic+mail
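
The equivalences above can be reproduced with a short sketch that expands a term containing OR operators into the list of terms it stands for. The `expand` function is a hypothetical helper written for this illustration, not part of the product.

```python
from itertools import product


def expand(term):
    """Expand the '~' alternatives of a term into one plain term per line."""
    # Split on the context operator first, then on OR inside each part.
    groups = [group.split("~") for group in term.split("_")]
    # Every combination of one alternative per part is an equivalent term.
    return ["_".join(combo) for combo in product(*groups)]


print(expand("ml~machine+learning_think"))
print(expand("email~e+mail~electronic+mail"))
```

The first call returns the two context terms 'ml_think' and 'machine+learning_think'; the second, having no context operator, simply returns its three alternatives.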

The following table shows several examples of sentences and terms defined using all the different operators.

ID Text Term 1 Term 2 Result 1 Result 2
1 The machine is learning to think by itself machine~system_think - -
2 The system is learning to think by itself machine~system_think - -
3 He invented the machine machine~system_think - -
4 The machine is learning to think by itself machine~system_think machine+learning
5 I think it uses a machine learning algorithm machine~system_think machine+learning
6 I think the machine uses ML algorithms machine~system_think machine+learning~ml
7 I think the system uses ML algorithms machine~system_think machine+learning~ml
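
As a rough sketch, a term that combines all the operators can be evaluated directly against a token list: every part joined by '_' must be satisfied, and a part is satisfied when any of its '~' alternatives is present. The `term_matches` function and the token lists are hypothetical; the lists are written by hand, assuming lowercased tokens, stopwords removed and the multiword machine+learning defined.

```python
def term_matches(term, tokens):
    """Evaluate a term with '_' (context) and '~' (OR) against tokens."""
    present = set(tokens)
    return all(
        any(alt in present for alt in group.split("~"))
        for group in term.split("_")
    )


# Hand-written token lists approximating examples 3, 6 and 7.
tokens_3 = ["invented", "machine"]
tokens_6 = ["think", "machine", "uses", "ml", "algorithms"]
tokens_7 = ["think", "system", "uses", "ml", "algorithms"]

for tokens in (tokens_3, tokens_6, tokens_7):
    print(
        term_matches("machine~system_think", tokens),
        term_matches("machine+learning~ml", tokens),
    )
```

With these token lists, neither term is found in the first case (there is no "think" and no "ml"), while both are found in the other two.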

Important

Every term defined is searched for literally in the text, which means that the surface form is matched, not the lemma. If you want to include a verb in your terms, you will have to manually add all the forms you want to detect.

Operator precedence

The following image shows the processing order of all the operators we have seen.

[Image: Tokenization example]

The first to be grouped will always be the multiwords, which means that:

list+of+words~terms ≠ list+of+words
                      list+of+terms

list+of+words~terms = list+of+words
                      terms
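
One way to picture this grouping order is a small parser that splits on the loosest operator first. This is a sketch under an assumption: that '~' binds more tightly than '_', which is inferred from the earlier equivalence for ml~machine+learning_think. The `parse` function is hypothetical.

```python
def parse(term):
    """Return context parts -> OR alternatives -> words of each multiword."""
    return [
        # '+' is split last, so it binds tightest: a multiword stays whole
        # as a single alternative.
        [tuple(alt.split("+")) for alt in group.split("~")]
        for group in term.split("_")
    ]


print(parse("list+of+words~terms"))
print(parse("ml~machine+learning_think"))
```

For 'list+of+words~terms' the result is a single context part with two alternatives, the multiword ('list', 'of', 'words') and the simple term ('terms',), which matches the second equivalence above.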