Python Pandas read csv from DataLake

asked Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to read a CSV file that is stored on Azure Data Lake Storage Gen2 (ADLS); the Python code runs in Databricks. Here are two lines of code: the first one works, the second one fails. Do I really have to mount the ADLS container for pandas to be able to access it?

data1 = spark.read.option("header",False).format("csv").load("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")
data2 = pd.read_csv("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")

Any suggestions?

Question from: https://stackoverflow.com/questions/65845802/python-pandas-read-csv-from-datalake

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:30:44+0000

Out of the box, pandas does not understand abfss:// URLs the way Spark does on Databricks; pd.read_csv expects a local file path (or a URL scheme it has a handler for). On Databricks you should be able to copy the file to the driver's local disk and then open it with pandas. The copy can be done either with %fs cp abfss://.... file:///your-location or with dbutils.fs.cp("abfss://....", "file:///your-location") (see docs).
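The copy-then-read approach could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name read_csv_via_local_copy and the use of a temporary directory are assumptions made for illustration, and the copy function is passed in so that dbutils.fs.cp (which only exists inside a Databricks notebook) can be supplied there. header=None mirrors the header=False option from the Spark line in the question.

```python
import os
import tempfile

import pandas as pd


def read_csv_via_local_copy(fs_cp, remote_uri, **read_csv_kwargs):
    """Copy a remote file to local disk with fs_cp, then read it with pandas.

    fs_cp is any callable taking (source, destination), for example
    dbutils.fs.cp on Databricks. The helper name is hypothetical.
    """
    # Create a scratch directory on the local (driver) file system.
    local_dir = tempfile.mkdtemp()
    local_path = os.path.join(local_dir, os.path.basename(remote_uri))

    # dbutils.fs.cp needs the file:// scheme to target the local disk.
    fs_cp(remote_uri, "file://" + local_path)

    # pandas now reads an ordinary local file.
    return pd.read_csv(local_path, **read_csv_kwargs)


# On Databricks, with the real (elided) abfss path from the question:
# data2 = read_csv_via_local_copy(dbutils.fs.cp, "abfss://....", header=None)
```

Because the file ends up on the driver node, this only suits files that fit comfortably in the driver's disk and memory.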

Another possibility is to use the Koalas library instead of pandas; it provides a pandas-compatible API on top of Spark. Besides being able to access data in cloud storage, you also get the possibility of running your code in a distributed fashion.
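A rough sketch of the Koalas route, assuming the koalas package is installed on the cluster and that the cluster is already configured to access the storage account (as the working Spark read in the question suggests). The wrapper name read_csv_with_koalas is hypothetical; in Spark 3.2 and later the same API ships as pyspark.pandas.

```python
def read_csv_with_koalas(path):
    """Read a CSV through Spark while keeping a pandas-style API.

    Hypothetical wrapper; the import is done lazily because Koalas
    is only available where Spark is.
    """
    import databricks.koalas as ks

    # header=None: the file has no header row, matching the
    # header=False option used in the Spark read from the question.
    return ks.read_csv(path, header=None)


# On Databricks, with the real (elided) abfss path from the question:
# kdf = read_csv_with_koalas("abfss://....")
# pdf = kdf.to_pandas()  # collect to a plain pandas DataFrame, if it fits in memory
```

The result is a Koalas DataFrame backed by Spark, so most pandas-style operations run on the cluster; to_pandas() is only needed when a genuine pandas object is required.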
