Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
551 views
in Technique[技术] by (71.8m points)

scala - Is there a reason not to use SparkContext.getOrCreate when writing a spark job?

I'm writing Spark Jobs that talk to Cassandra in Datastax.

Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.

You can do this by calling the SparkContext [getOrCreate][1] method.

Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network.

In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.

One day my tech lead came to me and said

Don't use SparkContext getOrCreate you can and should use joins instead

But he didn't give a reason.

My question is: Is there a reason not to use SparkContext.getOrCreate when writing a spark job?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

TL;DR There are many legitimate applications of the getOrCreate methods but attempt to find a loophole to perform map-side joins is not one of them.

In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, and although there some caveats, most notably:

  • In its simplest form it doesn't allow you to set job specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping SparkContext / SparkSession in the scope.
  • It can lead to opaque code with "magic" dependency. It affects testing strategies and overall code readability.

However your question, specifically:

Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network

and

Don't use SparkContext getOrCreate you can and should use joins instead

suggests you're actually using the method in a way that it was never intended to be used. By using SparkContext on an executor node.

val rdd: RDD[_] = ???

rdd.map(_ => {
  val sc = SparkContext.getOrCreate()
  ...
})

This is definitely something that you shouldn't do.

Each Spark application should have one, and only one SparkContext initialized on the driver, and Apache Spark developers made at a lot prevent users from any attempts of using SparkContex outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is fundamental feature of the Spark's computing model.

As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:

  • Describes processing pipeline in a way that can be translated into actual task.
  • Enables graceful recovery in case of task failures.
  • Allows proper resource allocation and ensures lack of cyclic dependencies.

Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext cyclic dependencies are not an issue - RDDs and Datasets exist only in a scope of its parent context so you won't be able to objects belonging to the application driver.

Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time cluster manager won't have any indication that application or somehow interconnected. This is likely to cause deadlock-like conditions.

It is technically possible to go around it, with careful resource allocation and usage of the manager-level scheduling pools, or even a separate cluster manager with its own set or resources, but it is not something that Spark is designed for, it not supported, and overall would lead to brittle and convoluted design, where correctness depends on a configuration details, specific cluster manager choice and overall cluster utilization.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...