We are migrating our data platform from Redshift to Snowflake. While converting the COPY/UNLOAD commands, we ran into some issues because our Redshift UNLOAD commands create partitioned datasets.
Snowflake's COPY INTO <location> command does offer a PARTITION BY option to specify the partitioning columns, but we are seeing several differences in the output datasets compared to Redshift:
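For reference, here is a minimal sketch of the kind of unload we are running (the stage, table, and column names below are placeholders, not our real objects):

```sql
-- Placeholder names, for illustration only
COPY INTO @my_stage/exports/
FROM (SELECT id, name, event_date FROM my_table)
PARTITION BY ('event_date=' || TO_VARCHAR(event_date))
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;
```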
Snowflake produces the header in upper case. Though this is not a show-stopper, our Python post-processing matches column names case-sensitively, so it cannot read the Parquet datasets unloaded from Snowflake. Is there a way/option in Snowflake to produce lower-case headers in the unloaded files?
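One workaround we are considering, though we have not confirmed it works for Parquet output: since Snowflake preserves the case of quoted identifiers, aliasing each column with a quoted lower-case name in the inner SELECT might carry lower-case names through to the unloaded files (names are placeholders):

```sql
-- Placeholder names; quoted aliases should keep their lower case
COPY INTO @my_stage/exports/
FROM (SELECT id AS "id", name AS "name", event_date AS "event_date"
      FROM my_table)
PARTITION BY ('event_date=' || TO_VARCHAR(event_date))
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;
```

This adds boilerplate for wide tables, so a server-side option would still be preferable.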
Snowflake includes the partitioning columns in the unloaded files. Redshift behaves like Hive and by default excludes the partition columns from the files, since their values are already encoded in the directory names. Is there a way/option to exclude the partition columns so that we don't have to modify the post-processing scripts that consume these datasets?
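For comparison, the Redshift UNLOAD we are converting looks roughly like this (names and the IAM role ARN are placeholders); in Redshift, PARTITION BY drops the partition column from the file contents unless the INCLUDE keyword is added:

```sql
-- Placeholder names/ARN; event_date is excluded from the files by default
UNLOAD ('SELECT id, name, event_date FROM my_table')
TO 's3://my-bucket/exports/'
IAM_ROLE 'arn:aws:iam::000000000000:role/my-unload-role'
FORMAT PARQUET
PARTITION BY (event_date);
```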
Snowflake doesn't allow the OVERWRITE copy option together with PARTITION BY, so re-running a job creates duplicate datasets/unloaded files. We were planning to add a pre-step that cleans up the partition folders before re-running the job, but is there a way this could be handled at the Snowflake level?
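The cleanup pre-step we have in mind is simply a REMOVE on the target stage path before re-running the unload (the stage path is a placeholder):

```sql
-- Placeholder stage path; delete prior partition output before re-unloading
REMOVE @my_stage/exports/;
```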
Since these issues affect the post-processing Python scripts that read the partitioned data, we would like to know whether any of them can be handled at the Snowflake level rather than by changing the scripts. Any inputs/suggestions would be much appreciated.
Thanks in advance.
Regards,
Gagandeep
question from:
https://stackoverflow.com/questions/65848848/snowflake-partition-by-copy-option-including-partition-columns-in-output-datas