I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:
I started an SSH session with the master node of my cluster, then I input:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then it launched a pyspark shell in which I input:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()
And it worked.
My next step is to launch this job from my main machine using the command:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py
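For reference, my_job.py is essentially the same code I ran interactively, just wrapped in a script (reconstructed here as an illustration, so the real file may differ slightly):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Same logic as the interactive test above, with an explicit SparkContext.
sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()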
But here it does not work and I get an error. I think it is because I did not pass --packages com.databricks:spark-csv_2.11:1.2.0
as an argument, but I tried about ten different ways of passing it and did not manage to make it work.
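For example, one of the variants I tried looked roughly like this, passing the package as a Spark property (reconstructed from memory, and I may well have the flag or the property name wrong, which is exactly what I am asking):

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py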
My questions are:
- Was the Databricks CSV library installed after I typed
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
- Can I write a line in my
my_job.py
in order to import it? (See the sketch after this list for the kind of thing I mean.)
- Or what parameters should I give to my gcloud command to import it or install it?
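To illustrate the second question, this is the kind of line I was hoping to write in my_job.py, setting the package programmatically before the SparkContext is created. This is only a sketch: spark.jars.packages is a real Spark property, but I do not know whether setting it from inside the script has any effect on Dataproc.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Guess: ask Spark to pull the Databricks CSV package via a property
# instead of the --packages command-line flag.
conf = SparkConf().set('spark.jars.packages', 'com.databricks:spark-csv_2.11:1.2.0')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# ... then the same read.format('com.databricks.spark.csv') code as above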