As far as I understand BERT's logic, it changes 50% of the sentences it takes as input and leaves the rest untouched.
1-) Is the changed part the operation performed with tokenizer.encoder? And is this equal to input_ids?
Then padding is applied: a matrix is created according to the specified Max_len, and the empty positions are filled with 0.
After that, a CLS token is placed at the start of each sentence, and a SEP token is placed at the end of the sentence.
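To make the steps above concrete, here is a minimal sketch of what I think happens, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the question does not fix a specific library or model, so these are just illustrative choices):

```python
from transformers import BertTokenizer

# hypothetical setup, assuming Hugging Face transformers
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode one sentence, padded up to a chosen max_len (here 10)
enc = tokenizer("hello world", padding="max_length", max_length=10)

print(enc["input_ids"])
# e.g. [101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0]
# 101 = [CLS], 102 = [SEP], trailing 0s = [PAD]
```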
2-) Is the input_mask created during this process?
3-) Also, where do we use input_segment?
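For questions 2 and 3, a sketch of what I believe the mask and segment ids look like for a sentence pair, again assuming the Hugging Face tokenizer (where input_mask is called attention_mask and input_segment is called token_type_ids):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode a sentence pair, padded to length 12
enc = tokenizer("how are you", "i am fine",
                padding="max_length", max_length=12)

print(enc["input_ids"])       # [CLS] how are you [SEP] i am fine [SEP] [PAD] ...
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding positions
print(enc["token_type_ids"])  # 0 for sentence A (incl. [CLS] and first [SEP]), 1 for sentence B
```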
question from:
https://stackoverflow.com/questions/65901473/about-bert-embedding-input-ids-input-mask