I want to read an S3 file from my (local) machine, through Spark (pyspark, really), but I keep getting authentication errors like:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web and tried many things, but apparently the S3 support has been changing over the past year or so, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n scheme; s3 did not work). Now, I don't want to put the user and password in the URL because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
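(For reference, the ~/.aws/credentials file itself is plain INI, so a minimal sketch like the following, assuming the default profile and Python 3's configparser, would at least get the keys into Python; the part I am missing is where Spark wants them.)

import configparser
import os

# ~/.aws/credentials is an INI file with a [default] section that holds
# aws_access_key_id and aws_secret_access_key (profile name is an assumption).
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config["default"]["aws_access_key_id"]
secret_key = config["default"]["aws_secret_access_key"]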
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now-standard ~/.aws/credentials file (ideally, without copying the credentials from there into yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; it did not work.
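Roughly, that attempt looked like the following sketch (key values are placeholders, and the bucket/key names are made up):

import os
from pyspark import SparkContext

# Set the keys in the environment before creating the SparkContext
# (placeholder values, not real credentials).
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."

sc = SparkContext("local[*]", "s3-test")
# Still fails with the same IllegalArgumentException as above:
print(sc.textFile("s3n://bucket/key").count())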
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
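For instance, the SparkConf variant was along these lines (again with placeholder key values):

from pyspark import SparkConf, SparkContext

# Set the fs.s3n.* properties on the SparkConf before building the context
# (placeholder values).
conf = SparkConf()
conf.set("fs.s3n.awsAccessKeyId", "AKIA...")
conf.set("fs.s3n.awsSecretAccessKey", "...")

sc = SparkContext(conf=conf)
# Still raises the same IllegalArgumentException:
print(sc.textFile("s3n://bucket/key").count())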