
python - About Bert embedding (input_ids, input_mask)

As far as I understand it, in BERT's operating logic, it changes 50% of the sentences it takes as input and doesn't touch the rest.

1) Is the changed part the operation performed with tokenizer.encode? And is the result equal to input_ids?

Then padding is applied: a matrix is created according to the specified max_len, and the empty positions are filled with 0.

After that, a [CLS] token is placed at the start of each sentence and a [SEP] token at the end.

2) Is input_mask created during this process?

3) Additionally, where do we use input_segment?

question from: https://stackoverflow.com/questions/65901473/about-bert-embedding-input-ids-input-mask


1 Answer

1. The input_mask obtained by encoding the sentences does not indicate the presence of [MASK] tokens. Rather, when a batch of sentences is tokenized, prepended with [CLS], and appended with [SEP] tokens, each sequence ends up with an arbitrary length.

To make all the sentences in the batch have a fixed number of tokens, zero padding is performed. The input_mask variable shows whether a given token position contains an actual token or is a zero-padded position.
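Here is a minimal sketch of this, assuming the Hugging Face transformers library and its bert-base-uncased checkpoint (both illustrative choices; the question does not name a library). In transformers, input_mask goes by the name attention_mask:

```python
from transformers import BertTokenizer

# illustrative assumption: Hugging Face transformers + bert-base-uncased
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = ["Hello world", "A somewhat longer second sentence"]
enc = tokenizer(batch, padding="max_length", max_length=10)

# [CLS] (id 101) and [SEP] (id 102) are added automatically;
# shorter sequences are zero-padded up to max_length
print(enc["input_ids"][0])       # e.g. [101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0]
print(enc["attention_mask"][0])  # e.g. [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

The mask carries a 1 wherever a real token (including [CLS] and [SEP]) sits and a 0 over padding, so the model never attends to padded positions.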

2. The [MASK] token is used only if you want to train on the Masked Language Model (MLM) objective.
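For intuition, here is a sketch of [MASK] in action at inference time, under the same transformers / bert-base-uncased assumptions:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# one token is hidden behind [MASK]; the MLM head must reconstruct it
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and take the highest-scoring vocabulary id
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # likely "paris"
```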

3. BERT is trained on two objectives: MLM and Next Sentence Prediction (NSP). In NSP, you pass two sentences and try to predict whether the second sentence actually follows the first. segment_id holds the information about which sentence a particular token belongs to.
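You can see the segment ids (named token_type_ids in transformers) by encoding a sentence pair; again a sketch under the same assumptions:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# tokens of the first sentence (plus [CLS] and its [SEP]) get segment 0,
# tokens of the second sentence (plus its [SEP]) get segment 1
enc = tokenizer("How old are you?", "I am six years old.")
print(enc["token_type_ids"])  # e.g. [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```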

