I'm having some issues with the way Spark is interpreting columns when writing Parquet.
I have an Oracle source with a confirmed schema (via the df.printSchema() method):
root
|-- LM_PERSON_ID: decimal(15,0) (nullable = true)
|-- LM_BIRTHDATE: timestamp (nullable = true)
|-- LM_COMM_METHOD: string (nullable = true)
|-- LM_SOURCE_IND: string (nullable = true)
|-- DATASET_ID: decimal(38,0) (nullable = true)
|-- RECORD_ID: decimal(38,0) (nullable = true)
This is then saved as Parquet (via the df.write().parquet() method), and Spark determines the corresponding Parquet message type:
message spark_schema {
  optional int64 LM_PERSON_ID (DECIMAL(15,0));
  optional int96 LM_BIRTHDATE;
  optional binary LM_COMM_METHOD (UTF8);
  optional binary LM_SOURCE_IND (UTF8);
  optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
  optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
}
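For context, the read and the write are plain Spark (Java) calls along these lines; the connection details and the output path below are placeholders, not the real values:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// 'spark' is an existing SparkSession; connection details and paths are illustrative
Dataset<Row> df = spark.read()
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
    .option("dbtable", "ELM.PS_LM_PERSON")
    .option("user", "elm_user")
    .option("password", "*****")
    .load();

df.printSchema();   // prints the schema tree shown above

df.write()
    .mode(SaveMode.Overwrite)
    .parquet("hdfs:///dev/ELM/ELM_PS_LM_PERSON");   // Spark chooses the Parquet message type here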
My application then generates the table DDL, using a HashMap for the type conversion (a rough sketch of that mapping follows the DDL below). For example:
CREATE EXTERNAL TABLE IF NOT EXISTS
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'
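The type-conversion HashMap is nothing more sophisticated than a lookup from the Spark SQL type string to a Hive/Impala DDL type. A rough sketch (names and entries are illustrative, not my actual code):

import java.util.HashMap;
import java.util.Map;

// Illustrative only: map Spark SQL simple type strings to Hive/Impala DDL types
Map<String, String> typeMap = new HashMap<>();
typeMap.put("string", "STRING");
typeMap.put("timestamp", "TIMESTAMP");
typeMap.put("decimal(15,0)", "DECIMAL(15,0)");   // the column Impala rejects
typeMap.put("decimal(38,0)", "DECIMAL(38,0)");

// Each DDL column line is then built from the Spark schema, roughly:
// for (StructField f : df.schema().fields()) {
//     ddl.append(f.name()).append(" ").append(typeMap.get(f.dataType().simpleString()));
// }

In other words, the DDL types are taken directly from the Spark schema.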
My issue is that Impala fails to read the table because it will not accept LM_PERSON_ID as a DECIMAL column; it will only read the Parquet file if this column is defined as BIGINT.
Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'.
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]
How do I know when to substitute BIGINT for a DECIMAL field in the DDL?
The Parquet message type is logged, but is it not accessible programmatically?
Two decimal fields are converted to fixed_len_byte_array(16), while LM_PERSON_ID is converted to int64.
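If I understand Spark's (non-legacy) Parquet writer correctly, the physical type is chosen purely from the decimal's precision, which would explain the split above. A small helper that mirrors my understanding of that rule (this is my reading of the behaviour, not something I can point to in the docs):

// My understanding of Spark's non-legacy decimal mapping:
//   precision <= 9  -> int32
//   precision <= 18 -> int64                   (hence DECIMAL(15,0) -> int64)
//   precision  > 18 -> fixed_len_byte_array    (hence DECIMAL(38,0) -> fixed_len_byte_array(16))
static String parquetPhysicalTypeForDecimal(int precision) {
    if (precision <= 9)  return "int32";
    if (precision <= 18) return "int64";
    return "fixed_len_byte_array";
}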
The only resolution I can think of is to create the table, test whether it can be read, and if it cannot, drop it and change the DECIMAL fields to BIGINT one by one, testing each time.
What am I missing here? Can I enforce a schema for decimal columns when writing the Parquet file?
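One thing I am considering, but have not tested, is Spark's spark.sql.parquet.writeLegacyFormat option, which as I understand it makes Spark write decimals in the legacy fixed_len_byte_array layout that Hive/Impala expect, so the DDL could keep DECIMAL(15,0). A sketch (output path is a placeholder):

// Sketch: force the legacy (Hive/Impala-friendly) decimal representation before writing
spark.conf().set("spark.sql.parquet.writeLegacyFormat", "true");

df.write()
    .mode(SaveMode.Overwrite)
    .parquet("hdfs:///dev/ELM/ELM_PS_LM_PERSON");

Is that the right knob, or is there a better way?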