The confusion here is long-standing (as evidenced by the verbose warning message), and it all starts with stat_bin.
But users don't typically realize that their confusion revolves around stat_bin, since they usually run into problems while using either geom_bar or geom_histogram. Note the documentation for each: both originally used stat = "bin" by default. In current ggplot2 versions this stat has been split: geom_histogram still uses stat_bin (for continuous data), while geom_bar now uses stat_count (for discrete data).
But let's back up. The geom_* functions control the actual rendering of data into some sort of geometric form. The stat_* functions simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar, and so it can seem indistinguishable from geom_bar when you're learning.
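To make the geom/stat pairing concrete, here is a minimal sketch. The data frame df, the seed, and bins = 10 are made up for illustration. It builds the same layer once from the geom side and once from the stat side, then uses layer_data() to compare what the stat computed.

```r
library(ggplot2)

# Hypothetical example data: 100 draws from a standard normal
set.seed(1)
df <- data.frame(x = rnorm(100))

# The same layer written from the geom side and from the stat side
p_geom <- ggplot(df, aes(x)) + geom_histogram(bins = 10)
p_stat <- ggplot(df, aes(x)) + stat_bin(bins = 10)

# layer_data() returns the data frame after the stat has been applied
d_geom <- layer_data(p_geom)
d_stat <- layer_data(p_stat)

# Both layers should report the same bin counts
same_counts <- isTRUE(all.equal(d_geom$count, d_stat$count))

# stat_bin() falls back to the bar geom when no geom is given
uses_bar_geom <- inherits(p_stat$layers[[1]]$geom, "GeomBar")
```

If both same_counts and uses_bar_geom come back TRUE, the two spellings are drawing the same thing, which is why they are so easy to mix up.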
In any case, consider the "bar"-like geoms: histograms and bar charts. Both clearly involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:
x
a
a
a
b
b
b
or equivalently from
x y
a 3
b 3
The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms) or stat_count (for bar charts, in current versions) on your x values.
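As a rough illustration of that default, the sketch below feeds the un-binned toy data from above to geom_bar with only x mapped and inspects the counts the stat produces. The object names raw, p_raw and counts are mine, not part of ggplot2.

```r
library(ggplot2)

# The un-binned toy data from above: one row per observation
raw <- data.frame(x = c("a", "a", "a", "b", "b", "b"))

# Only x is mapped; geom_bar() does the counting itself
p_raw <- ggplot(raw, aes(x)) + geom_bar()

# The counts the stat computed, one per distinct x value
counts <- layer_data(p_raw)$count
```

Here counts should match the y column of the pre-binned version of the data (3 for a, 3 for b).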
As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable, you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc., written as after_stat(count) in ggplot2 3.3.0 and later) should be ok and should not throw that warning (it doesn't for me using @mnel's example above).
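A small sketch of the "safe" kind of y mapping, assuming ggplot2 3.3.0 or later so that after_stat() is available (on older versions the equivalent spelling would be ..count..). The data and object names are made up for illustration.

```r
library(ggplot2)

# Un-binned toy data: three a's and three b's
raw <- data.frame(x = c("a", "a", "a", "b", "b", "b"))

# y is a function of a variable the stat itself computes (count),
# so it does not conflict with the stat's own counting
p_prop <- ggplot(raw, aes(x)) +
  geom_bar(aes(y = after_stat(count / sum(count))))

# Bar heights as proportions of all rows
heights <- as.numeric(layer_data(p_prop)$y)
```

With this data each bar should come out at a height of 0.5, since each category holds half of the rows.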
The take-away here is that for geom_bar, if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col, which uses stat = "identity" by default. For geom_histogram, it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.
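For the pre-computed case, a minimal sketch using the pre-binned toy data from earlier (the object names are mine). It writes the layer both ways and checks that the bar heights are taken straight from y.

```r
library(ggplot2)

# The pre-binned toy data from above: heights already computed
binned <- data.frame(x = c("a", "b"), y = c(3, 3))

# Option 1: tell geom_bar() not to count anything
p_identity <- ggplot(binned, aes(x, y)) + geom_bar(stat = "identity")

# Option 2: geom_col(), which uses stat = "identity" by default
p_col <- ggplot(binned, aes(x, y)) + geom_col()

# In both cases the heights are passed through unchanged
h_identity <- as.numeric(layer_data(p_identity)$y)
h_col <- as.numeric(layer_data(p_col)$y)
```

Both h_identity and h_col should equal the y column of binned, with no counting applied.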
geom_dotplot uses its own binning stat, stat_bindot, and this discussion applies there as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex), since there hasn't been as much flexibility available in the z variable, which is the 2d analogue of the binned y variable in the 1d case. If future updates start allowing fancier manipulations of the 2d binning cases, this could, I suppose, become something you have to watch out for there.