Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
533 views
in Technique[技术] by (71.8m points)

Python Pandas read csv from DataLake

I'm trying to read a csv file that is stored on a Azure Data Lake Gen 2, Python runs in Databricks. Here are 2 lines of code, the first one works, the seconds one fails. Do I really have to mount the Adls to have Pandas being able to access it.

data1 = spark.read.option("header",False).format("csv").load("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")
data2 = pd.read_csv("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")

Any suggestions ?

question from:https://stackoverflow.com/questions/65845802/python-pandas-read-csv-from-datalake

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Pandas doesn't know about cloud storage, and works with local files only. On Databricks you should be able to copy the file locally, so you can open it with Pandas. This could be done either with %fs cp abfss://.... file:///your-location or with dbutils.fs.cp("abfss://....", "file:///your-location") (see docs).

Another possibility is instead of Pandas, use the Koalas library that provides Pandas-compatible API on top of the Spark. Besides ability to access data in the cloud, you'll also get a possibility to run your code in the distributed fashion.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...