The term definition used in rule-based and hybrid models is not just a list of words: a number of operators make it possible to define more complex terms.
There are two basic elements, simple terms and multiwords (or literal terms), which can be combined using the different operators available.
The most basic element is the simple term. It is formed by a single word (and thus contains no blank spaces) and uses no operators. This word will be matched literally in the processed text or, if the `lemmatization` setting is enabled and the term is a lemma, it will match any of its lexical forms.
The fact that this match is carried out over the processed text means that the pre-processing and tokenization steps described in the text tokenization section have already been carried out, so the terms are matched against the resulting text.
In its simplest version, a list of terms can be formed just by simple terms.
The next element is the multiword (or literal term), which is used to specify a combination of words, or n-gram, that has to appear in the text in the exact same order and form as specified. Multiwords are defined by using double quotes around the words that form them. Similarly to simple terms, when the `lemmatization` setting is enabled and the words that compose the multiword are lemmas, their lexical forms will also match.

| ID | Text | Term | Lemmatization | Result |
|---|---|---|---|---|
| 1 | The machine is learning to think by itself | machine | n | ✓ |
| 2 | Machines will take over shortly | machine | n | ✗ |
| 3 | Machines will take over shortly | machine | y | ✓ |
| 4 | The machine is learning to think by itself | "machine learning" | n | ✗ |
| 5 | It uses a machine learning algorithm | "machine learning" | n | ✓ |
| 6 | It uses a machine learning algorithm | "machine learn" | n | ✗ |
| 7 | It uses a machine learning algorithm | "machine learn" | y | ✓ |
In this table, we can see several examples of when a term is detected and when it is not, depending both on the term definition used and on the value of the `lemmatization` setting.
Different lexical forms can correspond to the same lemma: for instance, "fly" is the lemma both for the verb "fly", as in to take flight, and for "fly", the animal. It is important to take those ambiguous contexts into account in the term definition.
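
As a rough illustration of the matching rules above, here is a minimal Python sketch of how a simple term or a quoted multiword could be checked against an already tokenized text. The `matches` and `lemmatize` helpers (and the toy lemma dictionary) are hypothetical stand-ins written for this example; the actual engine may implement matching differently.

```python
def lemmatize(token: str) -> str:
    # Toy lemmatizer used only for this sketch; a real model would rely on
    # a morphological dictionary for the language.
    toy_lemmas = {"machines": "machine", "learning": "learn"}
    return toy_lemmas.get(token.lower(), token.lower())

def matches(text_tokens: list[str], term: str, lemmatization: bool) -> bool:
    """True if `term` (a simple term or a quoted multiword) appears in the text."""
    term_words = term.strip('"').lower().split()
    tokens = [t.lower() for t in text_tokens]
    if lemmatization:
        tokens = [lemmatize(t) for t in tokens]
    # A simple term matches a single token; a multiword must match a
    # contiguous sequence of tokens, in the same order.
    n = len(term_words)
    return any(tokens[i:i + n] == term_words for i in range(len(tokens) - n + 1))

# Examples 6 and 7 from the table above:
tokens = "It uses a machine learning algorithm".split()
print(matches(tokens, '"machine learn"', lemmatization=False))  # False
print(matches(tokens, '"machine learn"', lemmatization=True))   # True
```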
The simple terms and the multiwords are the basic elements that will be combined through the use of operators to define more complex terms.
These are the operators available:
`AND`
: the classical logical operator. It allows you to give context to the term: the two terms joined by the operator need to appear in the text for the condition to be satisfied. The order of the terms associated with the operator does not affect the result of the evaluation.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | That passenger is afraid to fly | passenger AND fly | ✓ |
| 2 | The fruit fly is very common | passenger AND fly | ✗ |
| 3 | That fly is buzzing around the passenger | passenger AND fly | ✓ |
| 4 | The fruit fly is buzzing around the passenger | "fruit fly" AND passenger | ✓ |
| 5 | The fly is buzzing around the passenger | "fruit fly" AND passenger | ✗ |
`WITH`
: similar to `AND`, but only the first term will have any impact on the weight. It is used mainly to disambiguate.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | That passenger is afraid to fly | passenger WITH fly | ✓ |
| 2 | The fruit fly is very common | passenger WITH fly | ✗ |
| 3 | The fruit fly is buzzing around the passenger | "fruit fly" WITH passenger | ✓ |
| 4 | The fly is buzzing around the passenger | "fruit fly" WITH passenger | ✗ |
`OR` or `|`
: the classical logical operator. The condition will be satisfied if any of the terms joined by the operator appears. The difference between the two versions is the priority given: `|` takes precedence over `OR`.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | I think the machine is working now | machine OR think | ✓ |
| 2 | I don't think that's correct | think OR machine | ✓ |
| 3 | The machines are taking over | think OR machine | ✗ |
`NEAR`
: proximity operator. It is similar to the `AND` operator, but the terms must appear within a specific distance of each other. There are two variants: `-`, which implies a strict order in the appearances of the terms, and `~`, where the order does not matter.
Distance is counted as jumps starting from the first word in the `NEAR` operator. The distance is computed taking into account every word within the `NEAR` operator; that is, multiwords or literal terms do not count as a single "jump" but as the number of words they contain.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | The fruit fly is buzzing around the passenger | [ fruit fly]~3 | ✓ |
| 2 | The fly is buzzing around the fruit | [ fruit fly]~3 | ✗ |
| 3 | The fruit fly is buzzing around the passenger | [ fruit fly]-5 | ✓ |
| 4 | The fly is buzzing around the fruit | [ fruit fly]-5 | ✗ |
| 5 | The fruit fly is buzzing around the passenger | [ "fruit fly" passenger]-5 | ✗ |
| 6 | The fruit fly is buzzing around the passenger | [ "fruit fly" passenger]-6 | ✓ |
Other operators or parentheses are not allowed within the `NEAR` operator, with the exception of `|`.
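
Because the jump counting is the least intuitive part of `NEAR`, here is a minimal Python sketch of one possible reading of it, reproducing examples 5 and 6 from the table above. The `near` helper, its expression representation and the greedy placement of each element are assumptions made for illustration (it also matches words literally, ignoring lemmatization); it is not the product's implementation.

```python
from itertools import permutations

def near(tokens, elements, max_jumps, ordered):
    """Check a NEAR expression such as [ "fruit fly" passenger]-6.

    `elements` is a list of terms, each already split into words
    (a multiword has more than one), e.g. [["fruit", "fly"], ["passenger"]].
    `ordered` selects the '-' (strict order) vs '~' (any order) variant.
    """
    toks = [t.lower() for t in tokens]

    def starts(words):
        n = len(words)
        return [i for i in range(len(toks) - n + 1) if toks[i:i + n] == words]

    orders = [elements] if ordered else list(permutations(elements))
    for order in orders:
        placed, cursor = [], 0
        for words in order:
            pos = [p for p in starts(words) if p >= cursor]
            if not pos:
                break
            placed.append(pos[0])          # greedy: earliest valid position
            cursor = pos[0] + len(words)   # a multiword occupies one slot per word
        else:
            # Jumps are counted from the first word of the first element placed.
            if placed[-1] - placed[0] <= max_jumps:
                return True
    return False

# Examples 5 and 6 from the NEAR table:
tokens = "The fruit fly is buzzing around the passenger".split()
print(near(tokens, [["fruit", "fly"], ["passenger"]], 5, ordered=True))  # False (6 jumps)
print(near(tokens, [["fruit", "fly"], ["passenger"]], 6, ordered=True))  # True
```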
`NOT`
: the negator. It indicates that a term (or terms) must not appear for the condition to be considered satisfied.

| ID | Text | Term | Result |
|---|---|---|---|
| 1 | She enjoys going to the beach | NOT fly | ✓ |
| 2 | He does not like to fly | NOT fly | ✗ |
The `NOT` operator can only be applied to terms (that is, simple terms or multiwords) or to terms separated by the `|` operator.
As the system is accent-insensitive, the `AND` and `WITH` operators are very useful to limit the context in those cases where, depending on the accent mark, a word means one thing or another.
This happens very often in languages such as Spanish or French. For instance, "ingles" ("groins") vs "inglés" ("English") could be disambiguated using "ingles AND clases" or "ingles WITH clases".
You can use `|` as an alternative to `OR` to combine simple terms in an easy way without having to specify priorities using parentheses:

| Term using OR | Equivalent term using \| |
|---|---|
| ml OR "machine learning" | ml \| "machine learning" |
| (ml OR "machine learning") AND (think OR thinks) | ml \| "machine learning" AND think \| thinks |
The following table shows several examples of sentences and terms defined using all the different operators.

| ID | Text | Term | Lemmatization | Result |
|---|---|---|---|---|
| 1 | He invented the machine | machine \| system AND think | y | ✗ |
| 2 | The systems are learning to think by themselves | machine \| system AND think | y | ✓ |
| 3 | The machines are thinking about rebelling | machine \| system AND think | n | ✗ |
| 4 | The systems are learning to think by themselves | (machine OR system) AND think | y | ✓ |
| 5 | The machines are learning to think by themselves | "machine learning" AND think | y | ✗ |
| 6 | I think it uses a machine learning algorithm | "machine learning" AND think | n | ✓ |
| 7 | I think machine learning is very cool | [ system\| machine learn]-3 AND think | y | ✓ |
| 8 | I think machine learning is very cool | [ machine learn]-3 AND think | n | ✗ |
| 9 | If the machine does not do that, I will have to learn to do it | [ machine learn]-10 AND think | y | ✗ |
| 10 | If the machine does not do that, I will have to learn to do it | [ machine learn]-10 AND NOT think | y | ✓ |
| 11 | If the machine does not do that, I will have to learn to do it | [ learn machine]-10 AND NOT think | y | ✗ |
| 12 | If the machine does not do that, I will have to learn to do it | [ learn machine]~10 AND NOT think | y | ✓ |
The precedence of the operators we have seen is the following, from highest to lowest:

1. `|`
2. `NOT`
3. `WITH`
4. `NEAR`
5. `AND`
6. `OR`

Parentheses can also be used to indicate precedence. For instance, the following would be true:

`house AND (dog OR cat)` = `house AND dog | cat`
The weight of a term will only be added to the relevance of the category when the condition described by the term is satisfied.
The starting value for the computations will always be the number of times a simple term or a multiword appears in the text, that is, its frequency. By default, both have the same relevance, but it is possible to configure the settings of the model so that multiwords add more weight than simple terms. The parameter that allows this is `relevance_boost`, and, as we will see in the next section, it is disabled by default.
The following table contains the weight impact for the different elements we can use:

| relevance_boost | Element | Weight added |
|---|---|---|
| disabled | Simple term | Frequency of the term in the text |
| disabled | Multiword or literal | Frequency of the multiword in the text |
| disabled | OR or \| | Sum of the values of the terms joined by the operator |
| disabled | AND | Minimum value of the terms joined by the operator |
| disabled | WITH | Frequency of the first term |
| disabled | NEAR | Minimum value of the terms that satisfy the distance restriction |
| disabled | NOT | Does not add any weight |
| enabled | Simple term | Frequency of the term in the text |
| enabled | Multiword or literal | Frequency of the multiword in the text times the number of words in the multiword |
| enabled | OR or \| | Sum of the values of the terms joined by the operator |
| enabled | AND | Minimum value of the terms joined by the operator times the number of terms involved |
| enabled | WITH | Frequency of the first term |
| enabled | NEAR | Minimum value of the terms that satisfy the distance restriction times the number of terms involved |
| enabled | NOT | Does not add any weight |
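
One way to read these rules is as a recursive computation over the term expression. The following Python sketch is an illustration under that assumption: the tuple-based expression tree, the `weight` helper and the treatment of `NOT` operands inside `AND`/`NEAR` as pure filters are inferred from the example tables below, not taken from the product's code.

```python
def weight(expr, freq, relevance_boost=False):
    """Weight of a term expression given component frequencies.

    expr is a small tuple tree, e.g.:
      ("term", "machine"), ("literal", "machine learning"),
      ("or", e1, e2, ...), ("and", e1, e2, ...), ("with", e1, e2),
      ("near", e1, e2, ...)  # assumes the distance restriction is satisfied
      ("not", e1)
    """
    kind, *args = expr
    if kind == "term":                      # simple term: its frequency
        return freq.get(args[0], 0)
    if kind == "literal":                   # multiword: frequency (times word count if boosted)
        f = freq.get(args[0], 0)
        return f * len(args[0].split()) if relevance_boost else f
    if kind == "not":                       # a negated term never adds weight
        return 0
    if kind == "or":                        # sum of the joined terms
        return sum(weight(a, freq, relevance_boost) for a in args)
    if kind == "with":                      # only the first term adds weight
        return weight(args[0], freq, relevance_boost)
    if kind in ("and", "near"):             # minimum of the non-negated terms
        vals = [weight(a, freq, relevance_boost) for a in args if a[0] != "not"]
        return min(vals) * (len(vals) if relevance_boost else 1)
    raise ValueError(f"unknown operator: {kind}")

# Example 7 from the table below: "machine learning" AND think
freq = {"machine learning": 1, "think": 1}
expr = ("and", ("literal", "machine learning"), ("term", "think"))
print(weight(expr, freq))                        # min(1, 1) = 1
print(weight(expr, freq, relevance_boost=True))  # min(1*2, 1)*2 = 2
```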
The following table contains some examples of terms and how the relevance they would add or subtract would be obtained if their condition were satisfied in a text. We will represent the number of appearances of a term in the text, that is, its frequency, with f(term).

| ID | Term | Weight (relevance_boost disabled) | Weight (relevance_boost enabled) |
|---|---|---|---|
| 1 | machine | f(machine) | f(machine) |
| 2 | "machine learning" | f(machine learning) | f(machine learning)*2 |
| 3 | machine \| system | f(machine) + f(system) | f(machine) + f(system) |
| 4 | machine AND think | min(f(machine), f(think)) | min(f(machine), f(think))*2 |
| 5 | machine \| system AND think | min(f(machine) + f(system), f(think)) | min(f(machine) + f(system), f(think))*2 |
| 6 | machine \| system AND think OR ponder | min(f(machine) + f(system), f(think)) + f(ponder) | min(f(machine) + f(system), f(think))*2 + f(ponder) |
| 7 | "machine learning" AND think | min(f(machine learning), f(think)) | min(f(machine learning)*2, f(think))*2 |
| 8 | machine AND learning AND think | min(f(machine), f(learning), f(think)) | min(f(machine), f(learning), f(think))*3 |
| 9 | machine WITH think | f(machine) | f(machine) |
| 10 | "machine learning" WITH think | f(machine learning) | f(machine learning)*2 |
| 11 | machine WITH learn AND think | min(f(machine), f(think)) | min(f(machine), f(think))*2 |
| 12 | machine WITH (learn AND think) | f(machine) | f(machine) |
| 13 | [ machine learn]-3 AND think | min(min(f(machine), f(learn)), f(think)) | min(min(f(machine), f(learn))*2, f(think))*2 |
| 14 | [ machine learn think]~5 | min(f(machine), f(learn), f(think)) | min(f(machine), f(learn), f(think))*3 |
| 15 | [ "machine learn" think]~5 | min(f(machine learn), f(think)) | min(f(machine learn)*2, f(think))*2 |
| 16 | "machine learn" AND NOT think | f(machine learn) | f(machine learn)*2 |
Let's see the actual values we would obtain for a given text (assuming that `lemmatization` is enabled):
I'm trying to think of some machine learning algorithms

| ID | Term | Weight (relevance_boost disabled) | Weight (relevance_boost enabled) |
|---|---|---|---|
| 1 | machine | 1 | 1 |
| 2 | "machine learning" | 1 | 1*2 = 2 |
| 3 | machine \| system | 1 + 0 = 1 | 1 + 0 = 1 |
| 4 | machine AND think | min(1, 1) = 1 | min(1, 1)*2 = 1*2 = 2 |
| 5 | machine \| system AND think | min(1 + 0, 1) = min(1, 1) = 1 | min(1 + 0, 1)*2 = min(1, 1)*2 = 2 |
| 6 | machine \| system AND think OR ponder | min(1 + 0, 1) + 0 = 1 | min(1 + 0, 1)*2 + 0 = 2 + 0 = 2 |
| 7 | "machine learning" AND think | min(1, 1) = 1 | min(1*2, 1)*2 = 2 |
| 8 | machine AND learning AND think | min(1, 1, 1) = 1 | min(1, 1, 1)*3 = 3 |
| 9 | machine WITH think | 1 | 1 |
| 10 | "machine learning" WITH think | 1 | 1*2 |
| 11 | machine WITH learn AND think | min(1, 1) = 1 | min(1, 1)*2 = 2 |
| 12 | machine WITH (learn AND think) | 1 | 1 |
| 13 | [ machine learn]-3 AND think | min(min(1, 1), 1) = min(1, 1) = 1 | min(min(1, 1)*2, 1)*2 = min(2, 1)*2 = 2 |
| 14 | [ machine learn think]~5 | min(1, 1, 1) = 1 | min(1, 1, 1)*3 = 3 |
| 15 | [ "machine learn" think]~5 | min(1, 1) = 1 | min(1*2, 1)*2 = min(2, 1)*2 = 2 |
| 16 | "machine learn" AND NOT think | 1 | 1*2 = 2 |
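
As a quick cross-check, the `weight` sketch shown earlier in this section reproduces these values; for instance, row 6, with the frequencies counted by hand for the sentence above (lemmatization enabled):

```python
# Frequencies of the components in
# "I'm trying to think of some machine learning algorithms" (hand-counted).
freq = {"machine": 1, "system": 0, "think": 1, "ponder": 0}

# Row 6: machine | system AND think OR ponder
row6 = ("or",
        ("and", ("or", ("term", "machine"), ("term", "system")), ("term", "think")),
        ("term", "ponder"))
print(weight(row6, freq))                        # min(1 + 0, 1) + 0 = 1
print(weight(row6, freq, relevance_boost=True))  # min(1 + 0, 1)*2 + 0 = 2
```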