The term definition used in rule-based and hybrid models is not just a list of words: a number of operators make it possible to define more complex terms.
There are two basic elements, simple terms and multiwords (or literal terms), which can be combined using the different operators available.
The most basic element is the simple term. It is formed by a single word (and thus contains no blank spaces) and uses no operators. This word will be matched literally in the processed text or, if the `lemmatization` setting is enabled and the term is a lemma, it will match any of its lexical forms.
The fact that this match is carried out over the processed text means that the pre-processing and tokenization steps described in the text tokenization section have already been carried out, so the terms are matched against the resulting text.
In its simplest version, a list of terms can be formed just by simple terms.
The next element is the multiword (or literal term), which is used to specify a combination of words, or n-gram, that has to appear in the text in the exact same order and form as specified. Multiwords are defined by using double quotes around the words that form them. Similarly to simple terms, when the `lemmatization` setting is enabled and the words that compose the multiword are lemmas, their lexical forms will also match.

| ID | Text | Term | Lemmatization | Result |
|---|---|---|---|---|
| 1 | The machine is learning to think by itself | machine | n | ✓ |
| 2 | Machines will take over shortly | machine | n | ✗ |
| 3 | Machines will take over shortly | machine | y | ✓ |
| 4 | The machine is learning to think by itself | "machine learning" | n | ✗ |
| 5 | It uses a machine learning algorithm | "machine learning" | n | ✓ |
| 6 | It uses a machine learning algorithm | "machine learn" | n | ✗ |
| 7 | It uses a machine learning algorithm | "machine learn" | y | ✓ |
In this table, we can see several examples of when a term is detected and when it is not, depending both on the term definition used and on the value of the `lemmatization` setting.
Different lexical forms can correspond to the same lemma: for instance, "fly" is the lemma both for the verb "fly", as in to take flight, and for "fly", the animal. It is important to take those ambiguous contexts into account in the term definition.
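
As a rough illustration of the matching rules above, here is a minimal Python sketch of how a simple term or a quoted multiword could be checked against an already tokenized text. The `matches` and `lemmatize` helpers (and the toy lemma dictionary) are hypothetical stand-ins written for this example; the actual engine may implement matching differently.

```python
def lemmatize(token: str) -> str:
    # Toy lemmatizer used only for this sketch; a real model would rely on
    # a morphological dictionary for the language.
    toy_lemmas = {"machines": "machine", "learning": "learn"}
    return toy_lemmas.get(token.lower(), token.lower())

def matches(text_tokens: list[str], term: str, lemmatization: bool) -> bool:
    """True if `term` (a simple term or a quoted multiword) appears in the text."""
    term_words = term.strip('"').lower().split()
    tokens = [t.lower() for t in text_tokens]
    if lemmatization:
        tokens = [lemmatize(t) for t in tokens]
    # A simple term matches a single token; a multiword must match a
    # contiguous sequence of tokens, in the same order.
    n = len(term_words)
    return any(tokens[i:i + n] == term_words for i in range(len(tokens) - n + 1))

# Examples 6 and 7 from the table above:
tokens = "It uses a machine learning algorithm".split()
print(matches(tokens, '"machine learn"', lemmatization=False))  # False
print(matches(tokens, '"machine learn"', lemmatization=True))   # True
```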
The simple terms and the multiwords are the basic elements that will be combined through the use of operators to define more complex terms.
These are the operators available:
`AND`
: the classical logical operator. It allows you to give context to the term: the two terms joined by the operator need to appear in the text for the condition to be satisfied. The order of the terms associated with the operator does not affect the result of the evaluation.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | That passenger is afraid to fly | passenger AND fly | ✓ |
| 2 | The fruit fly is very common | passenger AND fly | ✗ |
| 3 | That fly is buzzing around the passenger | passenger AND fly | ✓ |
| 4 | The fruit fly is buzzing around the passenger | "fruit fly" AND passenger | ✓ |
| 5 | The fly is buzzing around the passenger | "fruit fly" AND passenger | ✗ |
`WITH`
: similar to `AND`, but only the first term will have any impact on the weight. It is used mainly to disambiguate.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | That passenger is afraid to fly | passenger WITH fly | ✓ |
| 2 | The fruit fly is very common | passenger WITH fly | ✗ |
| 3 | The fruit fly is buzzing around the passenger | "fruit fly" WITH passenger | ✓ |
| 4 | The fly is buzzing around the passenger | "fruit fly" WITH passenger | ✗ |
`OR` or `|`
: the classical logical operator. The condition will be satisfied if any of the terms joined by the operator appears. The difference between the two versions is the priority given: `|` takes precedence over `OR`.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | I think the machine is working now | machine OR think | ✓ |
| 2 | I don't think that's correct | think OR machine | ✓ |
| 3 | The machines are taking over | think OR machine | ✗ |
`NEAR`
: proximity operator. It is similar to the `AND` operator, but the terms must appear within a specific distance of each other. There are two variants: `-`, which implies a strict order in the appearances of the terms, and `~`, where the order does not matter.
Distance is counted as jumps starting from the first word in the `NEAR` operator. The distance is computed taking into account every word within the `NEAR` operator; that is, multiwords or literal terms do not count as a single "jump" but as the number of words they contain.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | The fruit fly is buzzing around the passenger | [ fruit fly]~3 | ✓ |
| 2 | The fly is buzzing around the fruit | [ fruit fly]~3 | ✗ |
| 3 | The fruit fly is buzzing around the passenger | [ fruit fly]-5 | ✓ |
| 4 | The fly is buzzing around the fruit | [ fruit fly]-5 | ✗ |
| 5 | The fruit fly is buzzing around the passenger | [ "fruit fly" passenger]-5 | ✗ |
| 6 | The fruit fly is buzzing around the passenger | [ "fruit fly" passenger]-6 | ✓ |
Other operators or parentheses are not allowed within the `NEAR` operator, with the exception of `|`.
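
Because the jump counting is the least intuitive part of `NEAR`, here is a minimal Python sketch of one possible reading of it, reproducing examples 5 and 6 from the table above. The `near` helper, its expression representation and the greedy placement of each element are assumptions made for illustration (it also matches words literally, ignoring lemmatization); it is not the product's implementation.

```python
from itertools import permutations

def near(tokens, elements, max_jumps, ordered):
    """Check a NEAR expression such as [ "fruit fly" passenger]-6.

    `elements` is a list of terms, each already split into words
    (a multiword has more than one), e.g. [["fruit", "fly"], ["passenger"]].
    `ordered` selects the '-' (strict order) vs '~' (any order) variant.
    """
    toks = [t.lower() for t in tokens]

    def starts(words):
        n = len(words)
        return [i for i in range(len(toks) - n + 1) if toks[i:i + n] == words]

    orders = [elements] if ordered else list(permutations(elements))
    for order in orders:
        placed, cursor = [], 0
        for words in order:
            pos = [p for p in starts(words) if p >= cursor]
            if not pos:
                break
            placed.append(pos[0])          # greedy: earliest valid position
            cursor = pos[0] + len(words)   # a multiword occupies one slot per word
        else:
            # Jumps are counted from the first word of the first element placed.
            if placed[-1] - placed[0] <= max_jumps:
                return True
    return False

# Examples 5 and 6 from the NEAR table:
tokens = "The fruit fly is buzzing around the passenger".split()
print(near(tokens, [["fruit", "fly"], ["passenger"]], 5, ordered=True))  # False (6 jumps)
print(near(tokens, [["fruit", "fly"], ["passenger"]], 6, ordered=True))  # True
```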
`NOT`
: the negator. It indicates that a term (or terms) must not appear for the condition to be considered satisfied.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | She enjoys going to the beach | NOT fly | ✓ |
| 2 | He does not like to fly | NOT fly | ✗ |
The `NOT` operator can only be applied to terms (that is, simple terms or multiwords) or to terms separated by the `|` operator.
As the system is accent-insensitive, the `AND` and `WITH` operators are very useful to limit the context in those cases where, depending on the accent mark, a word means one thing or another.
This happens very often in languages such as Spanish or French. For instance, "ingles" ("groins") vs "inglés" ("English") could be disambiguated using "ingles AND clases" or "ingles WITH clases".
You can use `|` as an alternative to `OR` to combine simple terms in an easy way without having to specify priorities using parentheses:

| Term using OR | Equivalent term using \| |
|---|---|
| ml OR "machine learning" | ml \| "machine learning" |
| (ml OR "machine learning") AND (think OR thinks) | ml \| "machine learning" AND think \| thinks |
The following table shows several examples of sentences and terms defined using all the different operators.

| ID | Text | Term | Lemmatization | Result |
|---|---|---|---|---|
| 1 | He invented the machine | machine \| system AND think | y | ✗ |
| 2 | The systems are learning to think by themselves | machine \| system AND think | y | ✓ |
| 3 | The machines are thinking about rebelling | machine \| system AND think | n | ✗ |
| 4 | The systems are learning to think by themselves | (machine OR system) AND think | y | ✓ |
| 5 | The machines are learning to think by themselves | "machine learning" AND think | y | ✗ |
| 6 | I think it uses a machine learning algorithm | "machine learning" AND think | n | ✓ |
| 7 | I think machine learning is very cool | [ system\| machine learn]-3 AND think | y | ✓ |
| 8 | I think machine learning is very cool | [ machine learn]-3 AND think | n | ✗ |
| 9 | If the machine does not do that, I will have to learn to do it | [ machine learn]-10 AND think | y | ✗ |
| 10 | If the machine does not do that, I will have to learn to do it | [ machine learn]-10 AND NOT think | y | ✓ |
| 11 | If the machine does not do that, I will have to learn to do it | [ learn machine]-10 AND NOT think | y | ✗ |
| 12 | If the machine does not do that, I will have to learn to do it | [ learn machine]~10 AND NOT think | y | ✓ |
The precedence of the operators we have seen is the following, from highest to lowest:

1. `|`
2. `NOT`
3. `WITH`
4. `NEAR`
5. `AND`
6. `OR`

Parentheses can also be used to indicate precedence. For instance, the following would be true:

`house AND (dog OR cat)` = `house AND dog | cat`
The weight of a term will only be added to the relevance of the category when the condition described by the term is satisfied.
The starting value for the computations will always be the number of times a simple term or a multiword appears in the text, that is, its frequency. By default, both have the same relevance, but it is possible to configure the settings of the model so that multiwords add more weight than simple terms. The parameter that allows this is `relevance_boost`, and, as we will see in the next section, it is disabled by default.
The following table contains the weight impact for the different elements we can use:

| relevance_boost | Element | Weight added |
|---|---|---|
| disabled | Simple term | Frequency of the term in the text |
| disabled | Multiword or literal | Frequency of the multiword in the text |
| disabled | OR or \| | Sum of the values of the terms joined by the operator |
| disabled | AND | Minimum value of the terms joined by the operator |
| disabled | WITH | Frequency of the first term |
| disabled | NEAR | Minimum value of the terms that satisfy the distance restriction |
| disabled | NOT | Does not add any weight |
| enabled | Simple term | Frequency of the term in the text |
| enabled | Multiword or literal | Frequency of the multiword in the text times the number of words in the multiword |
| enabled | OR or \| | Sum of the values of the terms joined by the operator |
| enabled | AND | Minimum value of the terms joined by the operator times the number of terms involved |
| enabled | WITH | Frequency of the first term |
| enabled | NEAR | Minimum value of the terms that satisfy the distance restriction times the number of terms involved |
| enabled | NOT | Does not add any weight |
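
One way to read these rules is as a recursive computation over the term expression. The following Python sketch is an illustration under that assumption: the tuple-based expression tree, the `weight` helper and the treatment of `NOT` operands inside `AND`/`NEAR` as pure filters are inferred from the example tables below, not taken from the product's code.

```python
def weight(expr, freq, relevance_boost=False):
    """Weight of a term expression given component frequencies.

    expr is a small tuple tree, e.g.:
      ("term", "machine"), ("literal", "machine learning"),
      ("or", e1, e2, ...), ("and", e1, e2, ...), ("with", e1, e2),
      ("near", e1, e2, ...)  # assumes the distance restriction is satisfied
      ("not", e1)
    """
    kind, *args = expr
    if kind == "term":                      # simple term: its frequency
        return freq.get(args[0], 0)
    if kind == "literal":                   # multiword: frequency (times word count if boosted)
        f = freq.get(args[0], 0)
        return f * len(args[0].split()) if relevance_boost else f
    if kind == "not":                       # a negated term never adds weight
        return 0
    if kind == "or":                        # sum of the joined terms
        return sum(weight(a, freq, relevance_boost) for a in args)
    if kind == "with":                      # only the first term adds weight
        return weight(args[0], freq, relevance_boost)
    if kind in ("and", "near"):             # minimum of the non-negated terms
        vals = [weight(a, freq, relevance_boost) for a in args if a[0] != "not"]
        return min(vals) * (len(vals) if relevance_boost else 1)
    raise ValueError(f"unknown operator: {kind}")

# Example 7 from the table below: "machine learning" AND think
freq = {"machine learning": 1, "think": 1}
expr = ("and", ("literal", "machine learning"), ("term", "think"))
print(weight(expr, freq))                        # min(1, 1) = 1
print(weight(expr, freq, relevance_boost=True))  # min(1*2, 1)*2 = 2
```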
The following table contains some examples of terms and how the relevance they would add or subtract would be obtained if their condition were satisfied in a text. We will represent the number of appearances of a term in the text, that is, its frequency, with f(term).

| ID | Term | Weight (relevance_boost disabled) | Weight (relevance_boost enabled) |
|---|---|---|---|
| 1 | machine | f(machine) | f(machine) |
| 2 | "machine learning" | f(machine learning) | f(machine learning)*2 |
| 3 | machine \| system | f(machine) + f(system) | f(machine) + f(system) |
| 4 | machine AND think | min(f(machine), f(think)) | min(f(machine), f(think))*2 |
| 5 | machine \| system AND think | min(f(machine) + f(system), f(think)) | min(f(machine) + f(system), f(think))*2 |
| 6 | machine \| system AND think OR ponder | min(f(machine) + f(system), f(think)) + f(ponder) | min(f(machine) + f(system), f(think))*2 + f(ponder) |
| 7 | "machine learning" AND think | min(f(machine learning), f(think)) | min(f(machine learning)*2, f(think))*2 |
| 8 | machine AND learning AND think | min(f(machine), f(learning), f(think)) | min(f(machine), f(learning), f(think))*3 |
| 9 | machine WITH think | f(machine) | f(machine) |
| 10 | "machine learning" WITH think | f(machine learning) | f(machine learning)*2 |
| 11 | machine WITH learn AND think | min(f(machine), f(think)) | min(f(machine), f(think))*2 |
| 12 | machine WITH (learn AND think) | f(machine) | f(machine) |
| 13 | [ machine learn]-3 AND think | min(min(f(machine), f(learn)), f(think)) | min(min(f(machine), f(learn))*2, f(think))*2 |
| 14 | [ machine learn think]~5 | min(f(machine), f(learn), f(think)) | min(f(machine), f(learn), f(think))*3 |
| 15 | [ "machine learn" think]~5 | min(f(machine learn), f(think)) | min(f(machine learn)*2, f(think))*2 |
| 16 | "machine learn" AND NOT think | f(machine learn) | f(machine learn)*2 |
Let's see the actual values we would obtain for a given text (assuming that `lemmatization` is enabled):
I'm trying to think of some machine learning algorithms

| ID | Term | Weight (relevance_boost disabled) | Weight (relevance_boost enabled) |
|---|---|---|---|
| 1 | machine | 1 | 1 |
| 2 | "machine learning" | 1 | 1*2 = 2 |
| 3 | machine \| system | 1 + 0 = 1 | 1 + 0 = 1 |
| 4 | machine AND think | min(1, 1) = 1 | min(1, 1)*2 = 1*2 = 2 |
| 5 | machine \| system AND think | min(1 + 0, 1) = min(1, 1) = 1 | min(1 + 0, 1)*2 = min(1, 1)*2 = 2 |
| 6 | machine \| system AND think OR ponder | min(1 + 0, 1) + 0 = 1 | min(1 + 0, 1)*2 + 0 = 2 + 0 = 2 |
| 7 | "machine learning" AND think | min(1, 1) = 1 | min(1*2, 1)*2 = 2 |
| 8 | machine AND learning AND think | min(1, 1, 1) = 1 | min(1, 1, 1)*3 = 3 |
| 9 | machine WITH think | 1 | 1 |
| 10 | "machine learning" WITH think | 1 | 1*2 |
| 11 | machine WITH learn AND think | min(1, 1) = 1 | min(1, 1)*2 = 2 |
| 12 | machine WITH (learn AND think) | 1 | 1 |
| 13 | [ machine learn]-3 AND think | min(min(1, 1), 1) = min(1, 1) = 1 | min(min(1, 1)*2, 1)*2 = min(2, 1)*2 = 2 |
| 14 | [ machine learn think]~5 | min(1, 1, 1) = 1 | min(1, 1, 1)*3 = 3 |
| 15 | [ "machine learn" think]~5 | min(1, 1) = 1 | min(1*2, 1)*2 = min(2, 1)*2 = 2 |
| 16 | "machine learn" AND NOT think | 1 | 1*2 = 2 |
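
As a quick cross-check, the `weight` sketch shown earlier in this section reproduces these values; for instance, row 6, with the frequencies counted by hand for the sentence above (lemmatization enabled):

```python
# Frequencies of the components in
# "I'm trying to think of some machine learning algorithms" (hand-counted).
freq = {"machine": 1, "system": 0, "think": 1, "ponder": 0}

# Row 6: machine | system AND think OR ponder
row6 = ("or",
        ("and", ("or", ("term", "machine"), ("term", "system")), ("term", "think")),
        ("term", "ponder"))
print(weight(row6, freq))                        # min(1 + 0, 1) + 0 = 1
print(weight(row6, freq, relevance_boost=True))  # min(1 + 0, 1)*2 + 0 = 2
```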