A DStream, or "discretized stream", is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. There's one and only one RDD produced for each DStream at each batch interval.
An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.
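To make the "small chunks" idea concrete, here is a minimal plain-Scala model of microbatching. It does not use Spark at all; Arrival, MicroBatching, and discretize are hypothetical names invented for this sketch, and timestamps are assumed to be non-negative milliseconds.

```scala
// Plain-Scala model of microbatching; no Spark involved.
// `Arrival` and `discretize` are hypothetical names for illustration only.
final case class Arrival(user: String, timestampMs: Long)

object MicroBatching {
  // Groups events into consecutive windows of `batchIntervalMs` and returns
  // them in time order as (batchStartMs, eventsInThatBatch).
  // Assumes non-negative timestamps. Unlike Spark Streaming, which still
  // produces an (empty) RDD for an interval with no data, this model simply
  // omits empty intervals.
  def discretize(events: Seq[Arrival], batchIntervalMs: Long): Seq[(Long, Seq[Arrival])] =
    events
      .groupBy(e => (e.timestampMs / batchIntervalMs) * batchIntervalMs)
      .toSeq
      .sortBy(_._1)
}
```

With a batch interval of 1000 ms, arrivals at 100 ms and 900 ms land in the batch starting at 0, and an arrival at 1200 ms lands in the batch starting at 1000. In Spark Streaming, each of those batches would be handed over as one RDD.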
DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, using foreachRDD you could write data to a database.
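As a rough sketch of the database case, one common shape is to open a connection per partition inside foreachRDD, so that connections are created on the executors rather than on the driver. RecordSink, openSink, and SaveToDatabase below are hypothetical stand-ins for whatever database client you actually use; only foreachRDD and foreachPartition are Spark calls.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for a real database client; not part of Spark.
trait RecordSink extends AutoCloseable {
  def write(record: String): Unit
}

object SaveToDatabase {
  // Writes one partition's records through a single sink, then closes it.
  def writePartition(records: Iterator[String], openSink: () => RecordSink): Unit = {
    val sink = openSink()
    try records.foreach(r => sink.write(r))
    finally sink.close()
  }

  // `openSink` is a hypothetical factory. It is invoked on the executors,
  // so whatever it captures must be serializable.
  def save(lines: DStream[String], openSink: () => RecordSink): Unit =
    lines.foreachRDD { rdd =>
      rdd.foreachPartition(records => writePartition(records, openSink))
    }
}
```

Opening the sink once per partition, instead of once per record, keeps the connection overhead proportional to the number of partitions in each batch.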
The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection: Take a list of users and apply a foreach to it:
val userList: List[User] = ???
userList.foreach{user => doSomeSideEffect(user)}
This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.
Now, let's say that we don't know all the users up front, so we cannot build a list of them. Instead, we have a stream of users, like people arriving at a coffee shop during the morning rush:
val userDStream: DStream[User] = ???
userDStream.foreachRDD{usersRDD =>
  usersRDD.foreach{user => serveCoffee(user)}
}
Note that:
- DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that is the collection of users that arrived during some interval of time.
- to access single elements of the collection, we need to further operate on the RDD. In this case, I'm using rdd.foreach to serve coffee to each user.
To think about execution: We might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders) and Spark will distribute the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.
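To watch the coffee shop run end to end on a single machine, here is a minimal sketch, assuming Spark Streaming is on the classpath. User, coffeeFor, serveCoffee, the customer names, the one-second batch interval, and the five-second timeout are all illustrative choices, not part of any real application. queueStream feeds one pre-built RDD per batch interval, and local[2] stands in for the cluster of baristas.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative types and helpers; none of these come from Spark.
final case class User(name: String)

object CoffeeShop {
  def coffeeFor(user: User): String = s"coffee for ${user.name}"

  // The side effect applied to every user in every microbatch.
  def serveCoffee(user: User): Unit = println(coffeeFor(user))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CoffeeShop")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

    // Each queued RDD plays the role of the users who arrived in one interval.
    val arrivals = mutable.Queue[RDD[User]](
      ssc.sparkContext.parallelize(Seq(User("ann"), User("bob"))),
      ssc.sparkContext.parallelize(Seq(User("cat")))
    )
    val userDStream = ssc.queueStream(arrivals) // one RDD consumed per batch

    userDStream.foreachRDD { usersRDD =>
      usersRDD.foreach(user => serveCoffee(user)) // runs on the executors
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

In local mode the executors share the driver's JVM, so the println output shows up in the same console. On a real cluster it would end up in the executor logs instead.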