According to this question, "--files option in pyspark not working", sc.addFile should make a file accessible on both the driver and the executors. But I cannot get it to work on the executors.
test.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("File access test")
sc = SparkContext(conf=conf)
sc.addFile("file:///home/hadoop/uploads/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)  # this works on the driver
print('********************')

lines = sc.textFile(SparkFiles.get('readme.txt'))  # read on the executors; this errors
print(lines.collect())
command
spark-submit --master yarn --deploy-mode client test.py
readme.txt is under /home/hadoop/uploads on the master node.
I see the following in the logs:
21/01/27 15:03:30 INFO SparkContext: Added file file:///home/hadoop/uploads/readme.txt at spark://ip-10-133-70-121.sysco.net:44401/files/readme.txt with timestamp 1611759810247
21/01/27 15:03:30 INFO Utils: Copying /home/hadoop/uploads/readme.txt to /mnt/tmp/spark-f929a1e2-e7e8-401e-8e2e-dcd1def3ee7b/userFiles-fed4d5bf-3e31-4e1e-b2ae-3d4782ca265c/readme.txt
So it is copying the file to a Spark scratch directory under /mnt/tmp (I am still relatively new to the Spark world). If I use the --files flag and pass the file, it is also copied to an hdfs:// path that the executors can read.
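As a side note, on the driver I can inspect where the copy landed (a minimal sketch; my understanding is that SparkFiles.getRootDirectory() returns the node-local userFiles-* directory from the log above):

from pyspark import SparkFiles

# Assumption: this prints the node-local directory where added files are placed,
# i.e. the userFiles-* path that appears in the log above (driver side here)
print(SparkFiles.getRootDirectory())
print(SparkFiles.get('readme.txt'))  # absolute path of the copy on this node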
Is this because addFile requires the file to also be present locally on the executors? Currently readme.txt is only on the master node. If so, is there a way to propagate it to the executors from the master?
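To illustrate the executor-side access I am after, here is a sketch of the pattern I would expect to work (an assumption on my part: that SparkFiles.get, when called inside a task, resolves the executor-local copy that addFile ships with the job):

def read_added_file(_):
    # Resolve the path on whichever executor runs this task, not on the driver
    from pyspark import SparkFiles
    with open(SparkFiles.get('readme.txt')) as f:
        return [line.strip() for line in f]

# A single one-element partition, so the file is read exactly once, on an executor
lines = sc.parallelize([0], 1).flatMap(read_added_file)
print(lines.collect())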
I am trying to find one uniform way of accessing the file. I can push the file from my local machine to the master node. In the Spark code, however, I would like a single way of reading the file's contents, whether the code runs on the driver or on an executor.
Currently, for the executor part of the code to work, I also have to pass the file with the --files flag (spark-submit --master yarn --deploy-mode client --files uploads/readme.txt test.py) and use something like the following:
path = f'hdfs://{sc.getConf().get("spark.driver.host")}:8020/user/hadoop/.sparkStaging/{sc.getConf().get("spark.app.id")}/readme.txt'
lines = sc.textFile(path)
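This hard-coded path feels fragile (it assumes the namenode runs on spark.driver.host at port 8020). A sketch of one alternative, assuming the file is small enough to read on the driver, would be to read it once where addFile already placed a local copy and parallelize the lines, so the executors never open the file themselves:

# Assumes the file fits in driver memory; sketched as an alternative, not a preferred fix
with open(SparkFiles.get('readme.txt')) as f:
    driver_lines = [line.strip() for line in f]

lines = sc.parallelize(driver_lines)  # executors receive the lines as an RDD
print(lines.collect())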
question from:
https://stackoverflow.com/questions/65922476/spark-execution-a-single-way-to-access-file-contents-in-both-the-driver-and-ex