I wanted to take something like this
https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
and turn it into a Hive UDAF, i.e. an aggregate function that returns a guess of a column's data type.
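To make the idea concrete, here is a rough sketch of the kind of per-column guess I have in mind, written with plain Spark DataFrame aggregations instead of a UDAF. It is only illustrative: the sampling fraction and the "categorical" cutoff are arbitrary choices of mine.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: guess a type for each column by sampling rows and counting
// how many non-null values survive a cast to each candidate type.
def guessTypes(df: DataFrame, sampleFraction: Double = 0.1): Map[String, String] = {
  val sample = df.sample(withReplacement = false, sampleFraction).cache()

  df.columns.map { name =>
    val c = col(name)
    val row = sample.agg(
      count(when(c.isNotNull, true)).as("nonNull"),
      count(when(c.cast("long").isNotNull, true)).as("asLong"),
      count(when(c.cast("double").isNotNull, true)).as("asDouble"),
      countDistinct(c).as("distinct")
    ).head()

    val nonNull = row.getAs[Long]("nonNull")
    val guess =
      if (nonNull == 0L) "unknown"
      else if (row.getAs[Long]("asLong") == nonNull) "long"
      else if (row.getAs[Long]("asDouble") == nonNull) "double"
      else if (row.getAs[Long]("distinct") < 20L) "categorical"  // arbitrary cutoff
      else "string"
    name -> guess
  }.toMap
}
```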
Does Spark have something like this already built-in?
This would be very useful for exploring new, wide datasets. It would also help with ML, e.g. for deciding whether a variable should be treated as categorical or numerical.
How do you normally determine data types in Spark?
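The closest built-in feature I am aware of is schema inference at read time, but that only applies when reading files such as CSV/JSON, not to a DataFrame that already holds everything as strings (the path below is just an example):

```scala
// Schema inference at read time; path and options are illustrative only.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // Spark scans the data and guesses column types
  .csv("/data/some_new_dataset.csv")

df.printSchema()  // e.g. IntegerType / DoubleType / StringType per column
```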
P.S. Frameworks like h2o automatically determine data types by scanning a sample of the data (or the whole dataset), so one can then decide, e.g., whether a variable should be treated as categorical or numerical.
P.P.S. Another use case is receiving an arbitrary dataset (we get them quite often) and wanting to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably more query-time performant, e.g. better Parquet bloom filters than just storing everything as string/varchar).
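For that Parquet case, this is roughly the manual step I end up doing once the types are known. The column names, the output path, and `rawDf` (the all-string input DataFrame) are invented for illustration:

```scala
import org.apache.spark.sql.functions.col

// Cast the all-string columns to proper types before writing, so Parquet can
// store real numeric columns instead of strings. rawDf and all names are made up.
val typed = rawDf
  .withColumn("sensor_id", col("sensor_id").cast("long"))
  .withColumn("reading",   col("reading").cast("double"))

typed.write.mode("overwrite").parquet("/warehouse/air_quality_typed")
```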