According to the paper (section 2), the S x S x (B * 5 + C) shaped output represents the S x S grid cells that YOLOv1 splits the image into. The last layer can be implemented as a fully connected layer with output length S * S * (B * 5 + C); you can then simply reshape that flat vector into the 3D shape.
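As a minimal sketch of that reshape step (using NumPy and the paper's default values S=7, B=2, C=20; the random vector just stands in for the real network output):

```python
import numpy as np

# YOLOv1 hyperparameters from the paper: S=7 grid, B=2 boxes per cell, C=20 classes
S, B, C = 7, 2, 20
depth = B * 5 + C  # 5 values per box (x, y, w, h, confidence) plus C class scores

# Stand-in for the output of the final fully connected layer (batch of 1)
flat = np.random.rand(1, S * S * depth)

# Reshape the flat vector into the S x S x (B*5 + C) tensor described in the paper
grid = flat.reshape(-1, S, S, depth)
print(grid.shape)  # (1, 7, 7, 30)
```

The network itself never needs a 3D output; the grid structure only comes from how you interpret (reshape) the flat prediction vector.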
The paper states that:
"Our system divides the input image into an S × S grid.
If the center of an object falls into a grid cell, that grid cell
is responsible for detecting that object."
This means you have to assign each ground-truth label to its corresponding grid cell before computing the loss and backpropagating. For reference, a Keras/TensorFlow implementation of the loss calculation can be found here (by the GitHub user FMsunyh).
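The cell assignment quoted above can be sketched with a small hypothetical helper (`responsible_cell` is not from the paper; it assumes box centers are normalized to [0, 1) relative to the image):

```python
# Hypothetical helper: map a normalized box center (cx, cy in [0, 1))
# to the grid cell that is "responsible" for that object in YOLOv1.
def responsible_cell(cx, cy, S=7):
    col = int(cx * S)  # grid column the center falls into
    row = int(cy * S)  # grid row the center falls into
    return row, col

# A box centered in the middle of the image lands in the center cell of a 7x7 grid
print(responsible_cell(0.5, 0.5))  # (3, 3)
```

During training, the target tensor for each image is filled cell by cell this way, so the loss only penalizes box predictions in the cell containing each object's center.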