Specific to your question, you need to use --py-files to include the Python files that should be made available on the PYTHONPATH.
I just ran into a similar problem: I wanted to run a module's main function, where the module lives inside an egg file.
The wrapper code below can be used to run main for any module via spark-submit. For this to work, save it as a Python file named after the fully qualified package and module name. The wrapper then uses its own filename to work out which module to run. This gives a more natural way to execute packaged modules, without needing extra arguments (which can get messy).
Here's the script:
"""
Wrapper script to use when running Python packages via egg file through spark-submit.
Rename this script to the fully qualified package and module name you want to run.
The module should provide a ``main`` function.
Pass any additional arguments to the script.
Usage:
spark-submit --py-files <LIST-OF-EGGS> <PACKAGE>.<MODULE>.py <MODULE_ARGS>
"""
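import importlib
import os


def main():
    # The wrapper's own filename encodes the target, e.g. "mypackage.mymodule.py".
    filename = os.path.basename(__file__)
    # Strip the ".py" extension to get the dotted module name.
    module_name = os.path.splitext(filename)[0]
    # Import the module (found via the eggs on PYTHONPATH) and run its main().
    module = importlib.import_module(module_name)
    module.main()


if __name__ == '__main__':
    main()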
You won't need to modify any of this code. It's all dynamic and driven from the filename.
As an example, if you drop this into mypackage.mymodule.py and use spark-submit to run it, the wrapper will import mypackage.mymodule and run main() on that module. All command-line arguments are left intact and are picked up naturally by the module being executed.
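To see the filename-to-module mapping in isolation, here is a minimal sketch of just that step. The helper name module_name_from_script and the /tmp/jobs path are made up for illustration; the wrapper above does the same thing inline using __file__.

```python
import os


def module_name_from_script(path):
    """Return the dotted module name encoded in a wrapper script's filename."""
    # Drop any directory components, keeping e.g. "mypackage.mymodule.py".
    filename = os.path.basename(path)
    # splitext only removes the final extension, so the inner dots survive.
    return os.path.splitext(filename)[0]


print(module_name_from_script("/tmp/jobs/mypackage.mymodule.py"))
# prints: mypackage.mymodule
```

Because only the last extension is stripped, the remaining dots are exactly what importlib.import_module expects for a package-qualified name.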
You will need to include any egg files and other supporting files in the command. Here's an example:
spark-submit --py-files mypackage.egg mypackage.mymodule.py --module-arg1 value1
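For completeness, here is a rough sketch of what the target module (mypackage/mymodule.py inside the egg) might look like so that it picks up the arguments from the command above. This is an assumption about your module, not something the wrapper requires: it uses argparse, and everything except the --module-arg1 flag from the example command is hypothetical. The only real requirement is a main function that can be called with no arguments.

```python
import argparse


def parse_args(argv=None):
    # With argv=None, argparse reads sys.argv[1:], which is what the
    # wrapper leaves untouched when it calls main() with no arguments.
    parser = argparse.ArgumentParser(description="Hypothetical Spark job entry point")
    parser.add_argument("--module-arg1", required=True)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # A real job would create its SparkContext/SparkSession here.
    print("module-arg1 = {}".format(args.module_arg1))
    return args


if __name__ == "__main__":
    main()
```

Accepting an optional argv also makes the module easy to test without going through spark-submit, for example by calling main(["--module-arg1", "value1"]) directly.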