Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pyspark - Using spark-submit with python main

From what I have read, it should be possible to have a Python file executed by spark-submit, but I couldn't get it to work.

My setup is a bit complicated. I require several different jars to be submitted together with my python files in order for everything to function. My pyspark command which works is the following:

IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar
from sys import path
path.append('my-module')  # directory that holds the module
from my_module import myfn  # placeholder name; a hyphen is not valid in an import statement
myfn(myargs)

I have packaged my Python files inside an egg. The egg contains the main file, so it can be run directly with python myegg.egg.

I am trying now to form my spark-submit command and I can't seem to get it right. Here's where I am:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg
Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output

Instead of executing my .egg file, spark-submit takes the first argument meant for the egg, treats it as a JAR file, and tries to load a class from it. What am I doing wrong?

1 Answer

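One way is to make the main driver program of your Spark application a Python file (.py) and pass that file to spark-submit. spark-submit treats the first argument after its options as the application to run (a JAR or a .py file), and --py-files only ships dependencies, which is why arg1 was taken for a JAR in your command. The driver script contains the main block that serves as the entry point; it sets the configuration properties and initializes the SparkContext.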

The modules bundled in the egg are dependencies: they are shipped to the executor nodes and can be imported inside the driver program.
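
For what it's worth, --py-files places the listed .zip, .egg, or .py files on the PYTHONPATH, and Python can import straight from a zip archive found on its path. The sketch below imitates that without Spark: it builds a throwaway archive and imports from it. The names egg.egg, my_module and myfn are placeholders, and the body of myfn is made up for the demo.

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive standing in for the real egg.
# my_module and myfn are placeholder names; this myfn body is invented for the demo.
workdir = tempfile.mkdtemp()
egg_path = os.path.join(workdir, "egg.egg")
with zipfile.ZipFile(egg_path, "w") as archive:
    archive.writestr(
        "my_module.py",
        "def myfn(args):\n    return [a.upper() for a in args]\n",
    )

# Roughly what --py-files arranges for you: the archive ends up on the import path.
sys.path.insert(0, egg_path)

# The module inside the archive is now importable like any other module.
from my_module import myfn

result = myfn(["arg1", "arg"])
print(result)
```

In a real job you would not touch sys.path yourself; listing the egg under --py-files should be enough for the import in the driver to succeed.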

You can write a small file as the main driver and run:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg
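
If it helps to see why the extra driver.py changes the outcome, here is a toy model of how the command line gets split. It is not Spark's actual parser, just a sketch assuming the usual shape "options, then the application file, then application arguments"; split_command_line and resource_kind are hypothetical helper names, and only the options used in this thread are modelled.

```python
# Options from the commands in this thread that consume the next token as a value.
# Toy model for illustration only; this is not Spark's real argument parser.
VALUE_OPTIONS = {"--jars", "--py-files", "--driver-class-path"}


def split_command_line(tokens):
    """Return (options, primary_resource, app_args) for a spark-submit-style token list."""
    options = {}
    i = 0
    # Consume "--option value" pairs until the first token that is not a known option.
    while i + 1 < len(tokens) and tokens[i] in VALUE_OPTIONS:
        options[tokens[i]] = tokens[i + 1]
        i += 2
    # The first remaining token is the application itself; the rest are its arguments.
    primary = tokens[i] if i < len(tokens) else None
    return options, primary, tokens[i + 1:]


def resource_kind(primary):
    """A primary resource ending in .py is run as Python; this model treats anything else as a JAR."""
    if primary is not None and primary.endswith(".py"):
        return "python"
    return "jar"


failing = "--jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg".split()
working = "--jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg".split()

for tokens in (failing, working):
    _, primary, app_args = split_command_line(tokens)
    print(primary, resource_kind(primary), app_args)
```

In the failing command the egg is swallowed as the value of --py-files, so arg1 lands in the application slot and is handled as a JAR. With driver.py in that slot, arg1 and arg are passed through to the script.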

The driver program would look something like this:

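import sys

from pyspark import SparkContext, SparkConf
from my_module import myfn  # placeholder module from the egg; "my-module" is not importable as written

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)  # created once, in the driver
    myargs = sys.argv[1:]  # assumed here to be arg1, arg, ... from the spark-submit command line
    myfn(myargs, sc)
    sc.stop()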

Pass the SparkContext object as an argument wherever it is needed.

