Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to pass files to the master node?

I have written Python code that performs binary classification, and I want to use Apache Spark to parallelize the classification across different data files on my local computer. So far I have done the following:

  1. The project contains four Python files: "run_classifer.py" (runs the classification application), "classifer.py" (performs the binary classification), "load_params.py" (loads the learned parameters for classification) and "preprocessing.py" (pre-processes the data). The project also depends on two files: "tokenizer.perl" and "nonbreaking_prefixes/nonbreaking_prefix.en", both used in the preprocessing step.

  2. The main part of my script "run_classifer.py" is as follows:

    ### Initialize Spark
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf = conf,
        pyFiles=['''All python files in my project as
                 well as "nonbreaking_prefix.en" and "tokenizer.perl"'''])

    ### Read the data directory from S3 storage and create an RDD
    datafile = sc.wholeTextFiles("s3n://bucket/data_dir")

    ### Run the classifier on each (path, content) pair on the worker nodes
    datafile.foreach(lambda kv: classifier(*kv))

However, when I run my script "run_classifier.py", it cannot seem to find the file "nonbreaking_prefix.en". This is the error I get:

ERROR: No abbreviations files found in /tmp/spark-f035270e-e267-4d71-9bf1-8c42ca2097ee/userFiles-88093e1a-6096-4592-8a71-be5548a4f8ae/nonbreaking_prefixes

But I did pass the file "nonbreaking_prefix.en" to the master node, and I have no idea what causes the error. I would really appreciate any help fixing the problem.


1 Answer


You can upload your files with sc.addFile and look up their local paths on a worker with SparkFiles.get:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(conf=conf,
        pyFiles=["All", "Python", "Files", "in", "your", "project"])

    # Assuming both files are in your working directory
    sc.addFile("nonbreaking_prefix.en")
    sc.addFile("tokenizer.perl")

    def classifier(path, content):
        # Get the local path of an uploaded file on the worker
        print(SparkFiles.get("tokenizer.perl"))

        with open(SparkFiles.get("nonbreaking_prefix.en")) as fr:
            lines = [line for line in fr]
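
Note that files uploaded one by one with sc.addFile end up side by side in the same directory, while the error in the question shows the tokenizer looking inside a "nonbreaking_prefixes" subdirectory. If that is what your copy of tokenizer.perl expects (an assumption based only on the error path, not something I have checked in the script), one option is to rebuild that layout on the worker before calling the tokenizer. The sketch below uses only the standard library; stage_tokenizer_files is a hypothetical helper name, not part of Spark.

```python
import os
import shutil
import tempfile


def stage_tokenizer_files(tokenizer_path, prefix_path, workdir=None):
    """Copy both files into <workdir>/ and <workdir>/nonbreaking_prefixes/.

    Hypothetical helper: assumes the tokenizer looks for its prefix files
    in a "nonbreaking_prefixes" directory next to the script itself.
    Returns the path of the staged tokenizer script.
    """
    if workdir is None:
        # A fresh directory per call avoids clashes between concurrent tasks
        workdir = tempfile.mkdtemp(prefix="tokenizer-")

    prefix_dir = os.path.join(workdir, "nonbreaking_prefixes")
    if not os.path.isdir(prefix_dir):
        os.makedirs(prefix_dir)

    staged_tokenizer = os.path.join(workdir, os.path.basename(tokenizer_path))
    shutil.copy(tokenizer_path, staged_tokenizer)
    shutil.copy(prefix_path,
                os.path.join(prefix_dir, os.path.basename(prefix_path)))
    return staged_tokenizer
```

Inside classifier you would call it as stage_tokenizer_files(SparkFiles.get("tokenizer.perl"), SparkFiles.get("nonbreaking_prefix.en")) and run the returned script path instead of the flat copy.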
