num_words is basically the vocabulary size you want your model to have, based on the data you have. The simple example below explains it in detail.
Without num_words:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'
')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 4, 2, 1, 6, 7, 2, 8]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
Here tokenizer.fit_on_texts(fit_text) creates the word_index from the words present in fit_text, starting with oov_token at index 1, followed by the remaining words in descending order of frequency (taken from word_counts). Note that 'test' does not occur in fit_text, which is why it is encoded as 1 (the OOV index) in the sequences above.
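For instance (a tiny sketch with made-up strings, reusing the Tokenizer import from above), a word that occurs more often gets a lower index:
tok = Tokenizer(oov_token='<OOV>')
tok.fit_on_texts(["b a a", "a b c"])
print(tok.word_index)
# {'<OOV>': 1, 'a': 2, 'b': 3, 'c': 4} -- 'a' occurs three times, so it comes right after <OOV>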
If you don't specify num_words, then all the unique words of fit_text are kept in word_index and used to represent the sequences.
If num_words is present, it restricts the sequences: only the top num_words - 1 entries of word_index (indices 1 through num_words - 1) are used by tokenizer.texts_to_sequences(); any word whose index falls at num_words or beyond is encoded as the oov_token.
The example below demonstrates this.
With num_words:
tokenizer = Tokenizer(num_words=4,oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'
')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 1, 2, 1, 1, 1, 2, 1]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
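To see exactly which entries survive the cut, you can filter word_index against the tokenizer's num_words attribute (a small sketch):
kept = {w: i for w, i in tokenizer.word_index.items() if i < tokenizer.num_words}
print(kept)
# {'<OOV>': 1, 'the': 2, 'example': 3}
That is why 'with' (index 4) and everything beyond it is encoded as 1 in the sequences above. Note that word_index and word_counts themselves are unchanged: num_words only filters texts_to_sequences(), not the fitted vocabulary.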
Regarding model accuracy, it's always better for the sequences to carry the correct representation of the words in your data rather than collapsing them into oov_token. In the case of large data, it's also better to provide the num_words parameter instead of loading the model with a huge vocabulary.
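One way to choose it (a sketch; the MIN_COUNT threshold and the texts variable are placeholders, not part of the API) is to fit once without a limit, inspect word_counts, and refit with a budget:
# First pass: fit without a limit just to collect word_counts
counter = Tokenizer(oov_token='<OOV>')
counter.fit_on_texts(texts)  # texts = your full training corpus
MIN_COUNT = 5  # hypothetical cutoff; words seen fewer times become OOV
frequent = [w for w, c in counter.word_counts.items() if c >= MIN_COUNT]
# +1 for oov_token at index 1, +1 because only indices below num_words are kept
tokenizer = Tokenizer(num_words=len(frequent) + 2, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)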
It's good practice to do preprocessing like stopword removal and lemmatization/stemming to remove all the unnecessary words, and then fit the Tokenizer on the preprocessed data; that makes it easier to choose the num_words parameter well.
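A minimal sketch of that pipeline, assuming NLTK for the stopword list and stemming (any equivalent library works):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # drop stopwords, then reduce each remaining word to its stem
    return ' '.join(stemmer.stem(w) for w in text.lower().split() if w not in stop_words)

fit_text = ["Example with the first sentence of the tokenizer"]
cleaned = [preprocess(t) for t in fit_text]
print(cleaned)  # the stopwords are gone and the remaining words are stemmed
tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
tokenizer.fit_on_texts(cleaned)
print(tokenizer.word_index)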