apache spark - How to skip lines while reading a CSV file as a DataFrame using PySpark?

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems reading this file:

  1. I want to ignore the header and the blank row.
  2. The commas within the values are not separators.

Here is what I tried:

df = (sc.textFile("myFile.csv")
        .map(lambda line: line.split(","))    # split by comma
        .filter(lambda line: len(line) == 2)  # intended to skip the first two rows
        .collect())

However, this did not work: the commas within the values were read as separators, so len(line) returned 4 instead of 2.
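
For example, splitting the first data line on every comma yields four tokens rather than two:

>>> '"1,200","1,456"'.split(',')
['"1', '200"', '"1', '456"']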

I tried an alternate approach:

data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped

The idea was to then filter out those rows. But when I tried to print the headers, I got encoded values:

[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]

What is the correct way to read a CSV file and skip the first two rows?

1 Answer


Try using csv.reader with the quotechar parameter; it will split the lines correctly. After that you can add whatever filters you like:

import csv

df = (sc.textFile("test2.csv")
          .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"'))  # parse quoted fields per partition
          .filter(lambda fields: len(fields) >= 2 and fields[0] != 'Col1')               # drop the header, blank row, and column-name row
          .toDF(['Col1', 'Col2']))
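
The resulting DataFrame keeps the quoted values intact:

df.show()
+-----+-----+
| Col1| Col2|
+-----+-----+
|1,200|1,456|
|2,000|3,450|
+-----+-----+

As an alternative sketch (my own addition, not part of the original answer): if you would rather skip the junk lines by position than by matching their content, zipWithIndex can drop a fixed number of leading lines before parsing:

import csv

data = (sc.textFile("test2.csv")
          .zipWithIndex()                    # pair each line with its position in the file
          .filter(lambda pair: pair[1] > 2)  # skip Header, the blank row, and the "Col1","Col2" row
          .map(lambda pair: pair[0]))

df = (data.mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"'))
          .toDF(['Col1', 'Col2']))

This avoids relying on the literal string 'Col1' appearing in the file, at the cost of hard-coding how many leading lines to drop.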
