machine learning - Understanding convolutional layers shapes

Question

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:29:51+0000

TLDR; `256x256x32` refers to the layer's output shape rather than the layer itself.

There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.

Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional (ignoring the batch dimension, which would add a 4th axis). The shape of a convolution layer's input is therefore (c, h, w) (or (h, w, c), depending on the framework), where c is the number of channels, h is the height of the input, and w is its width. You can see it as a c-channel hxw image. The most intuitive example of such an input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels, for example c=1 for greyscale or c=3 for RGB.
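To make the two layout conventions concrete, here is a minimal sketch in plain Python, using nested lists in place of framework tensors. The helpers `shape_of`, `zeros` and `to_channels_last` are hypothetical names made up for this illustration; they are not part of any library.

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def zeros(c, h, w):
    """Build a channels-first (c, h, w) input filled with zeros."""
    return [[[0.0 for _ in range(w)] for _ in range(h)] for _ in range(c)]


def to_channels_last(image):
    """Rearrange a (c, h, w) nested list into the (h, w, c) layout."""
    c, h, w = shape_of(image)
    return [[[image[k][i][j] for k in range(c)] for j in range(w)]
            for i in range(h)]


grey = zeros(1, 4, 6)  # greyscale: c=1, h=4, w=6
rgb = zeros(3, 4, 6)   # RGB: c=3, h=4, w=6

print(shape_of(grey))                   # (1, 4, 6)
print(shape_of(rgb))                    # (3, 4, 6)
print(shape_of(to_channels_last(rgb)))  # (4, 6, 3)
```

The data is the same in both layouts; only the order of the axes differs, which is why it matters to check which convention your framework expects.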

What's important is that, for every pixel of that input, the values on each channel give additional information about that pixel. Having three channels gives each pixel ('pixel' as in a position in the 2D input space) richer content than having a single channel, since each pixel is encoded with three values (three channels) instead of one (one channel). This intuition about what channels represent can be extrapolated to a higher number of channels. As we said, an input can have c channels.
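As a small illustration of "one value per channel at each position", the sketch below reads every channel value at a single pixel of a channels-first (c, h, w) input. The function `pixel_values` and the sample numbers are invented for this example.

```python
def pixel_values(image, row, col):
    """Collect the value of every channel at one (row, col) position
    of a channels-first (c, h, w) input."""
    return [channel[row][col] for channel in image]


# Greyscale input, shape (1, 2, 2): one value per pixel.
grey = [
    [[0.5, 0.1],
     [0.9, 0.3]],
]

# RGB input, shape (3, 2, 2): three values per pixel.
rgb = [
    [[255, 0], [0, 10]],    # red channel
    [[128, 0], [255, 20]],  # green channel
    [[0, 255], [0, 30]],    # blue channel
]

print(pixel_values(grey, 0, 0))  # [0.5]
print(pixel_values(rgb, 0, 0))   # [255, 128, 0]
```

The same function works unchanged for an input with any number of channels, for example the 32-channel output of a convolution layer.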

Now, going back to convolution layers, here is a good visualization. Imagine a 5x5 1-channel input and a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3):

| | input | filter | convolution | output |
|---|---|---|---|---|
| shape | `(1, 5, 5)` | `(3, 3)` | | `(3, 3)` |
| representation | | | | |
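The shapes in the table can be checked with a minimal pure-Python sketch. It assumes a stride of 1 and no padding, which is what a `(3, 3)` output from a 5x5 input and a 3x3 filter implies. The channel axis of the `(1, 5, 5)` input is dropped for simplicity, and the all-ones filter and the function name `conv2d_single` are chosen only for illustration.

```python
def conv2d_single(image, kernel):
    """Slide one square kernel over a 2D input (stride 1, no padding).

    As in most deep learning frameworks, the kernel is not flipped,
    so this is technically a cross-correlation.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    out_h, out_w = h - k + 1, w - k + 1  # 5 - 3 + 1 = 3
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum over the k x k patch whose top-left corner is (i, j).
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# 5x5 single-channel input holding the values 0..24, row by row.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
# One 3x3 filter of ones: each output value is the sum of a 3x3 patch.
kernel = [[1, 1, 1] for _ in range(3)]

output = conv2d_single(image, kernel)
print(len(output), len(output[0]))  # 3 3
print(output[0][0])                 # 54
```

Each filter produces one such 2D map. A layer with several filters stacks their maps along the channel axis, so the number of filters sets the number of output channels.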
