windows - Spark 2.0: Relative path in absolute URI (spark-warehouse)

Question


asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do:

Spark 1.6

df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

In the latest release I think it should look like this:

Spark 2.0

from pyspark.sql import SparkSession

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .getOrCreate()

df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

But I am getting this error no matter how many different ways I try to adjust the path:

IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: file:/C:/path//to/my/file/spark-warehouse'

Not sure if this is just an issue with Windows or there is something I am missing. I was excited that the spark-csv package is now a part of Spark right out of the box, but I can't seem to get it to read any of my local files anymore. Any ideas?

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:58:52+0000

I did some digging in the latest Spark documentation and found a new configuration setting that I hadn't noticed before:

spark.sql.warehouse.dir

So I went ahead and added this setting when I set up my SparkSession:

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
           .getOrCreate()

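That seems to act as the working directory in my setup (strictly speaking, the setting is the default location for Spark SQL's managed databases and tables), and then I can just feed my filename directly into the csv reader: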

df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file.csv', schema=mySchema)
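
For anyone who would rather not type the file:/// URIs by hand, here is a minimal sketch of building them from ordinary Windows paths and passing absolute URIs for both the warehouse directory and the data file. The helper names windows_path_to_file_uri and read_local_csv are my own and not part of Spark; the sketch only handles drive-letter paths (no UNC shares) and assumes PySpark 2.0 or later is installed wherever read_local_csv is called.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Turn an absolute drive-letter Windows path into a file:/// URI."""
    p = PureWindowsPath(path)
    # Only drive-letter paths are handled in this sketch (no UNC shares).
    if not (p.is_absolute() and p.drive.endswith(':')):
        raise ValueError(f'expected an absolute drive-letter path, got {path!r}')
    # as_posix() swaps backslashes for forward slashes; quote() escapes
    # spaces and other characters that are not allowed in a URI.
    return 'file:///' + quote(p.as_posix(), safe='/:')


def read_local_csv(base_dir, filename, schema=None):
    """Hypothetical helper: start a local session and read one CSV file."""
    # Imported here so the URI helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master('local[*]')
        .appName('My App')
        # Same setting as in the answer, built from a plain Windows path.
        .config('spark.sql.warehouse.dir', windows_path_to_file_uri(base_dir))
        .getOrCreate()
    )
    # Absolute URI for the data file, so nothing depends on how a bare
    # filename happens to be resolved.
    data_uri = windows_path_to_file_uri(PureWindowsPath(base_dir) / filename)
    return spark.read.csv(data_uri, header=True, schema=schema)
```

With this, windows_path_to_file_uri(r'C:\path\to\my') gives 'file:///C:/path/to/my', and a call such as read_local_csv(r'C:\path\to\my', 'file.csv', schema=mySchema) would stand in for the two snippets above.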

Once I set the spark warehouse, Spark was able to locate all of my files and my app finishes successfully now. The amazing thing is that it runs about 20 times faster than it did in Spark 1.6. So they really have done some very impressive work optimizing their SQL engine. Spark it up!
