Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


apache spark - Why does createDataFrame reorder the columns?

Suppose I am creating a data frame from a list without a schema:

from pyspark.sql import Row

data = [Row(c=0, b=1, a=2), Row(c=10, b=11, a=12)]
df = spark.createDataFrame(data)
df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  2|  1|  0|
| 12| 11| 10|
+---+---+---+

Why are the columns reordered alphabetically?
Can I preserve the original column order without adding a schema?



1 Answer


Why are the columns reordered alphabetically?

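Because a Row created with **kwargs sorts its fields by name. This is the behavior in Spark versions before 3.0; from Spark 3.0 on, with Python 3.6 or later, Row keeps named arguments in the order they were passed.

The sorting was needed because of the problem described in PEP 468: before Python 3.6 the order of **kwargs was not preserved, so sorting by name was the only way to get a deterministic field order. See SPARK-12467 for a discussion.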
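
The sorting by name can be illustrated without Spark. The following is only a sketch: sorted_fields is a hypothetical stand-in written for this answer, not PySpark's actual Row code, and it assumes nothing beyond plain Python.

```python
def sorted_fields(**kwargs):
    """Hypothetical stand-in: order keyword arguments by name, as described above."""
    names = sorted(kwargs)                     # field names in alphabetical order
    values = tuple(kwargs[n] for n in names)   # values rearranged to match
    return names, values


names, values = sorted_fields(c=0, b=1, a=2)
print(names)   # ['a', 'b', 'c']
print(values)  # (2, 1, 0)
```

Whatever order the arguments are written in, the result is the same, which is why the data frame in the question comes out as a, b, c.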

Can I preserve the original column order without adding a schema?

Not with **kwargs. You can use plain tuples:

df = spark.createDataFrame([(0, 1, 2), (10, 11, 12)], ["c", "b", "a"])

or a namedtuple:

from collections import namedtuple

CBA = namedtuple("CBA", ["c", "b", "a"])
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)])
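
If the data starts out as dictionaries, the same idea can be wrapped in a small helper. This is a sketch under assumptions: to_ordered_rows and the Record type name are hypothetical names chosen for illustration, and the final createDataFrame call is left commented out because it assumes an active SparkSession named spark.

```python
from collections import namedtuple


def to_ordered_rows(records, columns):
    """Turn dicts into namedtuples whose fields follow the given column order."""
    Record = namedtuple("Record", columns)  # hypothetical type name
    return [Record(*(rec[c] for c in columns)) for rec in records]


records = [{"c": 0, "b": 1, "a": 2}, {"c": 10, "b": 11, "a": 12}]
rows = to_ordered_rows(records, ["c", "b", "a"])

print(rows[0]._fields)  # ('c', 'b', 'a')
print(rows[0])          # Record(c=0, b=1, a=2)

# df = spark.createDataFrame(rows)  # assumes an active SparkSession named `spark`
```

The column order is then fixed by the explicit list of names rather than by how the keyword arguments happen to be sorted.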
