python - PySpark: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm new to PySpark and am facing a strange problem. I'm trying to set some columns to non-nullable while loading a CSV dataset. I can reproduce my case with a very small dataset (test.csv):

col1,col2,col3
11,12,13
21,22,23
31,32,33
41,42,43
51,,53

There is a null value at row 5, column 2, and I don't want that row in my DataFrame. I set all fields as non-nullable (nullable=false), but I get a schema in which all three columns have nullable=true. This happens even when every column is declared non-nullable. I'm running the latest available version of Spark, 2.0.1.

Here's the code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Parentheses let the builder chain span several lines without backslashes.
spark = (
    SparkSession
    .builder
    .appName("Python Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

# The third argument to StructField is `nullable`.
struct = StructType([
    StructField("col1", StringType(), False),
    StructField("col2", StringType(), False),
    StructField("col3", StringType(), False),
])

df = spark.read.load("test.csv", schema=struct, format="csv", header="true")

df.printSchema() returns:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

and df.show() returns:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  11|  12|  13|
|  21|  22|  23|
|  31|  32|  33|
|  41|  42|  43|
|  51|null|  53|
+----+----+----+

while I expect this:

root
 |-- col1: string (nullable = false)
 |-- col2: string (nullable = false)
 |-- col3: string (nullable = false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  11|  12|  13|
|  21|  22|  23|
|  31|  32|  33|
|  41|  42|  43|
+----+----+----+

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:16:33+0000

While Spark's behavior (switching from False to True here) is confusing, there is nothing fundamentally wrong going on. The nullable argument is not a constraint but a reflection of the source and type semantics, which enables certain types of optimization.

You state that you want to avoid null values in your data. For this you should use the na.drop method:

df.na.drop()

For other ways of handling nulls, take a look at the DataFrameNaFunctions documentation (exposed through the DataFrame.na property).

The CSV format provides no way to declare data constraints, so by definition the reader cannot assume that the input is free of nulls, and your data does indeed contain one.
