You had the right idea: use rdd.count() to count the number of rows. There is no faster way.
I think the question you should have asked is: why is rdd.count() so slow?
The answer is that rdd.count() is an "action": an eager operation, because it has to return an actual number. The RDD operations you performed before count() were "transformations": they lazily transformed one RDD into another. In effect, the transformations were not actually performed, just queued up. When you call count(), you force all the previous lazy operations to run. The input files have to be loaded now, the map() and filter() calls executed, shuffles performed, and so on, until finally the data exists and Spark can say how many rows it has.
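To make the queued-up behaviour concrete, here is a minimal toy model in plain Python. It is not Spark's API or implementation: LazyDataset, load and load_log are hypothetical names invented for this sketch. It only assumes that a transformation records a step and an action replays every recorded step.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that "loads" the input
        self._steps = tuple(steps)   # queued transformations, not yet applied

    def map(self, fn):
        # Transformation: only extends the queue, touches no data.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: likewise just queued.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def count(self):
        # Action: load the input and run every queued step now.
        rows = self._source()
        for kind, fn in self._steps:
            # Inside a method, map/filter refer to the Python builtins.
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return sum(1 for _ in rows)


load_log = []  # one entry per time the "input file" is read


def load():
    load_log.append("load")
    return range(10)


ds = LazyDataset(load).map(lambda x: x * 2).filter(lambda x: x % 4 == 0)
loads_before_count = len(load_log)  # nothing has been loaded yet

first = ds.count()   # loads the input and runs map + filter
second = ds.count()  # does all of that work again

print(loads_before_count, first, second, len(load_log))  # prints: 0 5 5 2
```

Building ds costs nothing in this model. Each count() pays for the load and for every queued step, which is why the count itself looks like the slow part.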
Note that if you call count() twice, all of this happens twice. After the count is returned, all the data is discarded! To avoid this, call cache() on the RDD. Then the second call to count() will be fast, and RDDs derived from it will also be faster to compute. The cost is that the RDD then has to be stored in memory (or on disk).
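The effect of cache() can be sketched the same way. This is again a plain-Python toy, not Spark itself: CachingDataset, make_pipeline and the two logs are hypothetical names. It assumes, as in Spark, that cache() only marks the dataset, and that the rows are actually stored the first time an action runs.

```python
class CachingDataset:
    """Toy model of an RDD with an optional cache (hypothetical, not Spark code)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable that runs the whole pipeline
        self._cache_requested = False
        self._cached_rows = None

    def cache(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._cache_requested = True
        return self

    def count(self):
        if not self._cache_requested:
            # No cache: recompute everything, then throw the rows away.
            return sum(1 for _ in self._compute())
        if self._cached_rows is None:
            # First action after cache(): compute once and keep the rows in memory.
            self._cached_rows = list(self._compute())
        return len(self._cached_rows)


def make_pipeline(log):
    def compute():
        log.append("run")  # record every full recomputation
        return (x for x in range(1000) if x % 3 == 0)
    return compute


uncached_log = []
uncached = CachingDataset(make_pipeline(uncached_log))
uncached_counts = (uncached.count(), uncached.count())

cached_log = []
cached = CachingDataset(make_pipeline(cached_log)).cache()
cached_counts = (cached.count(), cached.count())

print(uncached_counts, len(uncached_log))  # prints: (334, 334) 2
print(cached_counts, len(cached_log))      # prints: (334, 334) 1
```

The cached variant runs the pipeline once, but it keeps all 334 rows alive in a list afterwards. That is the memory trade-off mentioned above.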