As far as I understand BERT's logic, it changes 50% of the sentences it takes as input and leaves the rest untouched.
1-) Is the changed part the operation performed with tokenizer.encoder? And is this equal to input_ids?
Then padding is applied: a matrix is created according to the specified Max_len, and the empty positions are filled with 0.
After that, a CLS token is placed at the start of each sentence, and a SEP token is placed at the end of the sentence.
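To make the steps above concrete, here is a minimal sketch of what I think happens, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the question does not fix a specific library or model, so these are just illustrative choices):

```python
from transformers import BertTokenizer

# hypothetical setup, assuming Hugging Face transformers
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode one sentence, padded up to a chosen max_len (here 10)
enc = tokenizer("hello world", padding="max_length", max_length=10)

print(enc["input_ids"])
# e.g. [101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0]
# 101 = [CLS], 102 = [SEP], trailing 0s = [PAD]
```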
2-) Is the input_mask created during this process?
3-) Also, where do we use input_segment?
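For questions 2 and 3, a sketch of what I believe the mask and segment ids look like for a sentence pair, again assuming the Hugging Face tokenizer (where input_mask is called attention_mask and input_segment is called token_type_ids):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode a sentence pair, padded to length 12
enc = tokenizer("how are you", "i am fine",
                padding="max_length", max_length=12)

print(enc["input_ids"])       # [CLS] how are you [SEP] i am fine [SEP] [PAD] ...
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding positions
print(enc["token_type_ids"])  # 0 for sentence A (incl. [CLS] and first [SEP]), 1 for sentence B
```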
question from:
https://stackoverflow.com/questions/65901473/about-bert-embedding-input-ids-input-mask