TL;DR: 256x256x32 refers to the layer's output shape rather than the layer itself.
There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.
Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional (not counting the batch, which would correspond to a 4th axis). The shape of a convolution layer's input will therefore be (c, h, w) (or (h, w, c), depending on the framework), where c is the number of channels, h is the height of the input, and w is the width. You can see it as a c-channel hxw image.
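To make the two layouts concrete, here is a minimal sketch in plain Python, using nested lists instead of any framework's tensor type. The helper names shape_of and channels_first_to_last are made up for this illustration, as is the 3-channel, 2x4 input.

```python
def shape_of(nested):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


def channels_first_to_last(x):
    """Convert a (c, h, w) nested list into the equivalent (h, w, c) layout."""
    c, h, w = shape_of(x)
    return [[[x[k][i][j] for k in range(c)] for j in range(w)] for i in range(h)]


# A hypothetical 3-channel input of height 2 and width 4, filled with zeros.
chw = [[[0 for _ in range(4)] for _ in range(2)] for _ in range(3)]
hwc = channels_first_to_last(chw)

print(shape_of(chw))  # (3, 2, 4)
print(shape_of(hwc))  # (2, 4, 3)
```

Both variables hold the same values; only the order of the axes differs.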
The most intuitive example of such an input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels, for example c=1 for greyscale or c=3 for RGB.
What's important is that, for every pixel of that input, the values on each channel give additional information about that pixel. Having three channels gives each pixel ('pixel' as in position in the 2D input space) richer content than having a single one, since each pixel is encoded with three values (three channels) instead of one (one channel). This intuition about what channels represent can be extrapolated to a higher number of channels: as we said, an input can have c channels.
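As a small illustration of "one position, c values", the sketch below reads every channel at a single pixel position. It assumes the (c, h, w) layout; the pixel helper and all the numbers are made up for the example.

```python
# Hypothetical 2x2 RGB image in (c, h, w) layout: one 2x2 grid per channel.
rgb = [
    [[255, 0], [0, 10]],   # red channel
    [[0, 255], [0, 20]],   # green channel
    [[0, 0], [255, 30]],   # blue channel
]
# The same positions with a single channel (values made up for illustration).
grey = [[[76, 150], [29, 19]]]


def pixel(x, row, col):
    """Collect the value of every channel at one (row, col) position."""
    return [channel[row][col] for channel in x]


print(pixel(rgb, 1, 1))   # [10, 20, 30]
print(pixel(grey, 1, 1))  # [19]
```

The RGB pixel is described by three numbers and the greyscale one by a single number; an input with more channels simply yields a longer list per position.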
Now going back to convolution layers, here is a good visualization. Imagine having a 5x5 1-channel input and a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3):
| | input | filter | convolution | output |
|---|---|---|---|---|
| shape | (1, 5, 5) | (3, 3) | | (3, 3) |
| representation | | | | |
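The shapes in the 5x5 input, 3x3 filter example above can be checked with a rough sketch in plain Python. It assumes a stride of 1 and no padding, which is what the (3, 3) output implies; conv2d_single is a hypothetical helper rather than any framework's API, and the input and filter values are made up.

```python
def conv2d_single(image, kernel):
    """Slide one 2D kernel over one 2D channel (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the products.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Made-up values: a 5x5 single-channel input and a 3x3 filter of ones.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

result = conv2d_single(image, kernel)
print(len(result), len(result[0]))  # 3 3
print(result[0][0])  # 54
```

Under these assumptions the filter fits in 5 - 3 + 1 = 3 positions along each axis, which is where the 3x3 output comes from.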