The term definition used by rule-based and hybrid models is more than a list of words: a number of operators make it possible to define more complex terms.
Term definitions are built from four elements: simple terms, multiwords, the context operator and the OR operator.
The most basic element is the simple term. It consists of a single word (so it contains no blank spaces) and no operators. This word is matched literally in the processed text.
Matching against the processed text means that all the steps described in the text tokenization section have already been carried out, so stopwords have already been removed. As a result, a simple term cannot be a stopword or any other word that is removed during tokenization (for example, a word made up only of numbers).
In its simplest version, a list of terms can be formed just by simple terms.
A dot is considered a valid character in a simple term, so terms like "U.K" or "google.com" would be correct.
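The rules above for simple terms can be sketched as a small validity check. This is only an illustration: the function name `is_valid_simple_term` and the stopword list are hypothetical, and the real stopword list depends on the tokenization settings.

```python
# Operator characters described in this section:
# '+' (multiword), '_' (context) and '~' (OR).
OPERATORS = set("+_~")

# Hypothetical stopword list, used only for illustration.
STOPWORDS = {"the", "is", "to", "by", "it", "a"}


def is_valid_simple_term(term: str) -> bool:
    """Return True if `term` could work as a simple term under these assumptions."""
    if not term or any(ch.isspace() for ch in term):
        return False  # a simple term is a single word, with no blank spaces
    if any(ch in OPERATORS for ch in term):
        return False  # operators turn it into a more complex term
    if term.lower() in STOPWORDS:
        return False  # stopwords are removed before matching
    if term.isdigit():
        return False  # number-only words are removed during tokenization
    return True


print(is_valid_simple_term("google.com"))  # True: dots are valid characters
print(is_valid_simple_term("2019"))        # False: only numbers
```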
The next element is the multiword, which specifies a combination of words (an n-gram) that has to appear in the text in exactly the order and form given. The multiword operator is the plus sign, '+', and it joins all the words that form the multiword.
As we have seen in the text tokenization section, multiwords are grouped before removing stopwords, which means that stopwords can be included in multiwords.
ID | Text | Term 1 | Term 2 | Result 1 | Result 2 |
---|---|---|---|---|---|
1 | The machine is learning to think by itself | machine | - | - | |
2 | It uses a machine learning algorithm | machine | - | - | |
3 | The machine is learning to think by itself | machine+learning | - | - | |
4 | It uses a machine learning algorithm | machine+learning | - | - | |
5 | The machine is learning to think by itself | machine | machine+learning | | |
6 | It uses a machine learning algorithm | machine | machine+learning | | |
This table shows several examples of when a term is detected and when it is not. Examples 5 and 6 show the importance of text tokenization, and how it varies depending on the multiwords defined.
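The interaction between multiwords and tokenization can be sketched as follows. This is a simplified illustration, not the actual tokenizer: the `tokenize` function and the stopword list are hypothetical, text is split on whitespace and lowercased, and it assumes that words grouped into a multiword are no longer available as individual tokens.

```python
# Hypothetical stopword list, used only for illustration.
STOPWORDS = {"the", "is", "to", "by", "it", "a", "of"}


def tokenize(text, multiwords=()):
    """Group the given multiwords first, then drop stopwords."""
    words = text.lower().split()
    # Try longer multiwords first so they win over shorter ones.
    patterns = sorted((mw.split("+") for mw in multiwords), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(words):
        for parts in patterns:
            if words[i:i + len(parts)] == parts:
                tokens.append("+".join(parts))  # grouped before stopword removal
                i += len(parts)
                break
        else:
            # No multiword starts here: keep the word unless it is a stopword.
            if words[i] not in STOPWORDS:
                tokens.append(words[i])
            i += 1
    return tokens


text = "It uses a machine learning algorithm"
print(tokenize(text))                        # ['uses', 'machine', 'learning', 'algorithm']
print(tokenize(text, ["machine+learning"]))  # ['uses', 'machine+learning', 'algorithm']
```

Because grouping happens before stopword removal in this sketch, a multiword such as `state+of+the+art` keeps its stopwords.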
The simple terms and the multiwords are the basic elements that will be combined through the use of operators to define more complex terms.
The context operator makes a term count as appearing in a text only when two or more simple terms or multiwords appear in it. In other words, it evaluates co-occurrences, which limits the context in which a term is considered to appear.
The context operator is represented by an underscore, '_', and the order of the terms it joins does not affect the result of the evaluation.
The following table shows several examples of sentences and terms defined using the context operator. The last two columns show the result of the evaluation of those terms:
ID | Text | Term 1 | Term 2 | Result 1 | Result 2 |
---|---|---|---|---|---|
1 | The machine is learning to think by itself | machine_think | - | - | |
2 | It uses a machine learning algorithm | machine_think | - | - | |
3 | I think the machine is working now | machine_think | - | - | |
4 | The machine is learning to think by itself | machine_think | machine+learning | | |
5 | I think it uses a machine learning algorithm | machine_think | machine+learning | | |
6 | I think it uses a machine learning algorithm | machine+learning_think | machine+learning | | |
Examples 1 and 3 show that the order in which the terms appear in the text does not change the result of the evaluation. Examples 4, 5 and 6 show how the definition of multiwords affects the text tokenization and, therefore, the detection of the terms.
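A minimal sketch of how the context operator could be evaluated is shown below. It assumes the text has already been tokenized (the token lists are written out by hand, with stopwords removed and multiwords grouped), and the function name `matches_context` is hypothetical.

```python
def matches_context(term, tokens):
    """True when every operand joined by '_' appears among the tokens."""
    token_set = set(tokens)  # a set ignores order, as the context operator does
    return all(operand in token_set for operand in term.split("_"))


# "The machine is learning to think by itself"
print(matches_context("machine_think", ["machine", "learning", "think", "itself"]))  # True
# "I think the machine is working now": same result with the order reversed
print(matches_context("machine_think", ["think", "machine", "working", "now"]))      # True
# "It uses a machine learning algorithm": "think" is missing
print(matches_context("machine_think", ["uses", "machine", "learning", "algorithm"]))  # False
```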
Because the system is accent-insensitive, the context operator is very useful for limiting the context in cases where a word means one thing or another depending on the accent mark.
This happens very often in languages such as Spanish or French. For instance, "ingles" vs "inglés" could be disambiguated using "clases_ingles".
Accent marks are not allowed (except for the 'ñ' letter). If you add a term with an accent mark, it will be automatically removed before saving.
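Accent-insensitive matching of this kind is usually achieved by normalizing both the terms and the text. The sketch below shows one way to do it with Python's standard `unicodedata` module. It is an assumption about the general technique, not the system's actual implementation, and the function name `strip_accents` is hypothetical. Note that it removes every combining mark, not only accent marks, except on 'ñ'.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics from every character except 'ñ' / 'Ñ'."""
    # Compose first so that a decomposed 'n' + tilde is seen as a single 'ñ'.
    text = unicodedata.normalize("NFC", text)
    result = []
    for ch in text:
        if ch in "ñÑ":
            result.append(ch)  # 'ñ' is a letter of its own, so it is kept
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop the combining marks and keep the base character.
        result.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(result)


print(strip_accents("inglés"))   # ingles
print(strip_accents("español"))  # español
```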
Finally, the OR operator makes it possible to use a logical OR in conjunction with the context operator. It is represented by a tilde, '~' (Alt Gr + 4), and it lets you write terms such as "A or B in the same context as C".
Using it with the context operator is not mandatory, but that is where its value is most apparent:
ml~machine+learning_think | = | ml_think |
 | | machine+learning_think |
As it's a logical OR, when the expression does not contain a context operator, the result would be the same as including the terms in different lines:
ml~machine+learning | = | ml |
 | | machine+learning |
email~e+mail~electronic+mail | = | email |
 | | e+mail |
 | | electronic+mail |
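The equivalences above can be reproduced with a short expansion routine. This is a sketch under the assumption, taken from the examples, that '~' binds more tightly than '_'. The function name `expand_or` is hypothetical.

```python
from itertools import product


def expand_or(term):
    """Expand a term containing '~' into the equivalent list of OR-free terms."""
    # Each operand of the context operator becomes a list of alternatives.
    groups = [operand.split("~") for operand in term.split("_")]
    # One expanded term per combination of alternatives.
    return ["_".join(combo) for combo in product(*groups)]


print(expand_or("ml~machine+learning_think"))
# ['ml_think', 'machine+learning_think']
print(expand_or("email~e+mail~electronic+mail"))
# ['email', 'e+mail', 'electronic+mail']
```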
The following table shows several examples of sentences and terms defined using all the different operators.
ID | Text | Term 1 | Term 2 | Result 1 | Result 2 |
---|---|---|---|---|---|
1 | The machine is learning to think by itself | machine~system_think | - | - | |
2 | The system is learning to think by itself | machine~system_think | - | - | |
3 | He invented the machine | machine~system_think | - | - | |
4 | The machine is learning to think by itself | machine~system_think | machine+learning | | |
5 | I think it uses a machine learning algorithm | machine~system_think | machine+learning | | |
6 | I think the machine uses ML algorithms | machine~system_think | machine+learning~ml | | |
7 | I think the system uses ML algorithms | machine~system_think | machine+learning~ml | | |
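Putting the operators together, a term can be evaluated as "every context operand has at least one alternative present in the text". The sketch below assumes the text has already been tokenized (lowercased, stopwords removed, multiwords grouped). The token lists are written out by hand and the function name `term_matches` is hypothetical.

```python
def term_matches(term, tokens):
    """Evaluate a term that may combine '+', '~' and '_' against a token list."""
    token_set = set(tokens)
    return all(
        # '~' : at least one alternative of this operand must be present
        any(alternative in token_set for alternative in operand.split("~"))
        # '_' : every operand must be satisfied
        for operand in term.split("_")
    )


# "The system is learning to think by itself"
print(term_matches("machine~system_think", ["system", "learning", "think", "itself"]))  # True
# "He invented the machine": "think" does not appear
print(term_matches("machine~system_think", ["invented", "machine"]))  # False
```

Multiwords need no special handling here because they already arrive as single tokens such as `machine+learning`.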
Every term defined is searched for literally in the text, which means that what is matched is the form you write and not the lemma. This implies that if you want to include a verb in your terms, you have to add manually every form you want to detect.
The following image shows the processing order of all the operators we have seen.
The first to be grouped will always be the multiwords, which means that:
list+of+words~terms | ≠ | list+of+words |
 | | list+of+terms |
list+of+words~terms | = | list+of+words |
 | | terms |