Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

scala - How to get column names with all values null?

I can't figure out how to get the names of the columns whose values are all null.

For example,

case class A(name: String, id: String, email: String, company: String)

val e1 = A("n1", null, "[email protected]", null)
val e2 = A("n2", null, "[email protected]", null)
val e3 = A("n3", null, "[email protected]", null)
val e4 = A("n4", null, "[email protected]", null)
val e5 = A("n5", null, "[email protected]", null)
val e6 = A("n6", null, "[email protected]", null)
val e7 = A("n7", null, "[email protected]", null)
val e8 = A("n8", null, "[email protected]", null)
val As = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(As).toDF

This code produces a DataFrame like this:

+----+----+---------+-------+
|name|  id|    email|company|
+----+----+---------+-------+
|  n1|null|[email protected]|   null|
|  n2|null|[email protected]|   null|
|  n3|null|[email protected]|   null|
|  n4|null|[email protected]|   null|
|  n5|null|[email protected]|   null|
|  n6|null|[email protected]|   null|
|  n7|null|[email protected]|   null|
|  n8|null|[email protected]|   null|
+----+----+---------+-------+

I want to get the names of the columns in which every row is null: id, company

I don't care about the output type: Array, String, RDD, whatever.

1 Answer

You can run count on every column. count(col) counts only non-null values, so a column whose values are all null returns 0. Then use the indices of the columns with a count of 0 to subset df.columns:

import org.apache.spark.sql.functions.{col, count}

// count(col) ignores nulls, so an all-null column has a count of 0.
// Get the indices of those columns.
val col_inds = df.select(df.columns.map(c => count(col(c)).alias(c)): _*)
                 .collect()(0)
                 .toSeq.zipWithIndex
                 .filter(_._1 == 0).map(_._2)
// Subset the column names using the indices
col_inds.map(i => df.columns.apply(i))
// Seq[String] = ArrayBuffer(id, company)
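
The same idea can be wrapped in a small helper, which is handy if you need it for more than one DataFrame. This is only a sketch: allNullColumns is a hypothetical name, not part of Spark, and it assumes column names without dots or backticks (those would need escaping before being passed to col). Note that an empty DataFrame would report every column, since all counts are 0.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

// Hypothetical helper (not a Spark API): names of the columns whose values are all null.
// Assumes plain column names; names containing dots or backticks would need escaping.
def allNullColumns(df: DataFrame): Seq[String] = {
  val names = df.columns
  if (names.isEmpty) Seq.empty
  else {
    // count(column) ignores nulls, so an all-null column yields 0.
    // The aggregation returns a single row with one Long per column.
    val counts = df.select(names.map(c => count(col(c)).alias(c)): _*).head()
    names.indices.filter(i => counts.getLong(i) == 0L).map(i => names(i))
  }
}
```

Pasted into spark-shell after the code from the question, allNullColumns(df) should return the names id and company. Like the snippet above, it scans the data once with a single aggregation.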
