Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Snowflake "PARTITION BY" COPY Option including Partition Columns in output Dataset

We are migrating our data platform from Redshift to Snowflake and while converting the COPY/UNLOAD commands from Redshift to Snowflake, we came across an issue where the Redshift UNLOAD command was creating partitioned data sets.

Snowflake does have an option in the COPY INTO command to specify the partitioning columns, but we are seeing some differences in the output dataset compared to Redshift:

  1. Snowflake produces the header in upper case. Though this is not a show-stopper, our Python code matches column names case-sensitively, so it is not able to read the Parquet datasets produced by Snowflake. Is there a way/option in Snowflake to produce the header in unloaded files in lower case?

  2. Snowflake includes the partitioning columns in the output dataset/unloaded files. Redshift works just like Hive and by default excludes the partition columns from the output dataset/unloaded files. Is there a way/option to exclude those partition columns so that we don't have to modify the post-processing scripts that consume these datasets?

  3. Snowflake doesn't allow OVERWRITE mode with the PARTITION BY option, so it creates duplicate datasets/unloaded files when the job runs multiple times. We were planning to add a pre-step to clean up the partition folders manually before re-running the job, but is there a way this could be handled at the Snowflake level?

As these issues are impacting some of our post-processing Python scripts that read the partitioned data, we wanted to understand whether any of them can be handled at the Snowflake level rather than by changing the scripts. Would really appreciate any inputs/suggestions on this.

Thanks in advance.

Regards, Gagandeep

question from:https://stackoverflow.com/questions/65848848/snowflake-partition-by-copy-option-including-partition-columns-in-output-datas


1 Answer


My thoughts/answers on your 3 questions are as follows:

  1. Header case: if you alias the column names in your query with double-quoted identifiers, the output should adhere to those aliases, e.g. SELECT NAME AS "Name" FROM Table1 should output a column header of "Name", not "NAME". For lower-case headers, use a lower-case alias such as "name".
  2. There is no way to exclude the partitioning columns that I am aware of, and this is explicitly stated in the documentation under Copy Options: "There is no option to omit the columns in the partition expression from the unloaded data files."
  3. No way that I am aware of. You might be able to write an external function and then include it and the COPY INTO statement in a stored procedure or a sequence of Tasks, but I doubt that is any simpler than adding the pre-step you mention.
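
To make points 1 and 3 concrete, here is a minimal Python sketch that builds the two SQL statements a re-runnable job might issue: a REMOVE to clear the target path, then a COPY INTO with quoted lower-case aliases. The function name, table, columns and stage path are all hypothetical, and I have assumed that the PARTITION BY expression has to reference the quoted alias rather than the original column name, so check that against your own account before relying on it.

```python
def quote_ident(name: str) -> str:
    """Double-quote an identifier so Snowflake preserves its case."""
    return '"' + name.replace('"', '""') + '"'


def build_unload_statements(table, columns, partition_col, stage_path):
    """Return [REMOVE, COPY INTO] statements for a partitioned Parquet unload.

    Every argument is a hypothetical placeholder; adapt to your own schema.
    """
    stage_path = stage_path.rstrip("/")

    # Point 1: alias each column to a quoted lower-case name so the
    # Parquet header is written in lower case.
    select_list = ", ".join(
        f"{c.upper()} AS {quote_ident(c.lower())}" for c in columns
    )

    # Assumption: the partition expression refers to the quoted alias.
    part = quote_ident(partition_col.lower())

    # Point 3: the pre-step. REMOVE deletes every file under the path,
    # so make sure the path is scoped to this job's output only.
    remove_sql = f"REMOVE {stage_path}/"

    copy_sql = (
        f"COPY INTO {stage_path}/ "
        f"FROM (SELECT {select_list} FROM {table}) "
        f"PARTITION BY ('{partition_col.lower()}=' || TO_VARCHAR({part})) "
        "FILE_FORMAT = (TYPE = PARQUET) "
        "HEADER = TRUE"
    )
    return [remove_sql, copy_sql]


if __name__ == "__main__":
    for stmt in build_unload_statements(
        "SALES", ["id", "region"], "region", "@my_stage/sales/"
    ):
        print(stmt + ";")
```

The statements can then be run in order through whatever client you already use. Note that this does nothing for point 2: the partition column still ends up in the files, so the consuming scripts would have to drop it themselves.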
