Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

0 votes
386 views
in Technique by (71.8m points)

google cloud platform - Is it possible to export a BigQuery query into equal 5000-line CSV files in GCS bucket?

I'm aware of the ability to export a query to CSV files in a GCS bucket; however, when exporting to multiple files there doesn't seem to be a way to limit the row count of each file. I was wondering if anyone has figured out a workaround for this. My current use case: I need to export a query of a table (152 columns) into multiple CSV files saved to a GCS bucket, where each file has no more than 5000 records. I was hoping to find some sort of statement I could pop into BigQuery to avoid having to write a solution in Python (since this is a short-term requirement and won't be needed after a certain point).

I came up with the following pseudo-SQL-code:

offset = 0
total = SELECT COUNT(*) FROM WEB_SCRAPING.scrapy_products WHERE spider = "{0}"
while offset < total + 5000:
    EXPORT DATA OPTIONS(
        uri='gs://webscraping/{0}_{offset}-{offset + 5000}.csv',
        format='CSV',
        overwrite=true,
        header=true
    ) AS (
        SELECT * EXCEPT (id)
        FROM WEB_SCRAPING.scrapy_products
        WHERE spider = "{0}"
        LIMIT 5000 OFFSET {offset}
    )
    offset += 5000

I was thinking that I could somehow wrap the EXPORT DATA statement in a loop, incrementing the offset until all results are exported. But, from what I've seen, the EXPORT DATA statement requires a wildcard (`*`) in the URI when it writes multiple files, which will likely lead to the 5000-record chunks being split into even smaller files.

Any idea on how I can make this work from a BigQuery SQL standpoint? Is it even possible?


Update: Here's what I have so far as a BigQuery statement:

DECLARE spider_name STRING DEFAULT 'albertsons.albertsons';
DECLARE increment INT64 DEFAULT 5000;

DECLARE tt_name STRING;
DECLARE `offset` INT64;
DECLARE total INT64;

SET `offset` = 0;
SET total = (SELECT COUNT(*) FROM `WEB_SCRAPING.SCRAPY_PRODUCTS` WHERE spider = spider_name);

WHILE `offset` < total DO
  SET tt_name = FORMAT('WEB_SCRAPING.%s_%dto%d', spider_name, `offset`, `offset` + increment);
  -- CREATE TABLE only accepts a literal name, so the dynamic table name has to
  -- go through EXECUTE IMMEDIATE. (In scripts, variables are referenced by
  -- name directly; the @parameter syntax only works inside EXECUTE IMMEDIATE
  -- with a USING clause.)
  EXECUTE IMMEDIATE FORMAT("""
    CREATE TABLE `%s` AS
    SELECT * EXCEPT (id)
    FROM `WEB_SCRAPING.SCRAPY_PRODUCTS`
    WHERE spider = @spider
    -- note: LIMIT/OFFSET without an ORDER BY is not guaranteed deterministic
    LIMIT %d OFFSET %d
  """, tt_name, increment, `offset`) USING spider_name AS spider;

  -- Now, to just handle the export portion

  SET `offset` = `offset` + increment;
END WHILE;
question from:https://stackoverflow.com/questions/66048044/is-it-possible-to-export-a-bigquery-query-into-equal-5000-line-csv-files-in-gcs


1 Answer

0 votes
by (71.8m points)

You can't do this with SQL alone. You need to create your temp tables as you did, and then iterate over them (in Python or another language) to extract each table to a file with the extract API.
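That iteration could look like the sketch below, assuming the chunk tables were created with the `<spider>_<start>to<end>` naming from the script above. The project, dataset, and bucket names are placeholders, and the export loop requires the `google-cloud-bigquery` client library and GCP credentials:

```python
def chunk_ranges(total, size=5000):
    """One (offset, offset + size) pair per chunk, matching the
    <spider>_<start>to<end> table names created by the SQL script."""
    return [(o, o + size) for o in range(0, total, size)]

def export_chunks(project, dataset, spider, total, bucket, size=5000):
    """Extract each chunk table to its own CSV object in the GCS bucket."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client(project=project)
    for start, end in chunk_ranges(total, size):
        table_id = f"{project}.{dataset}.{spider}_{start}to{end}"
        destination_uri = f"gs://{bucket}/{spider}_{start}-{end}.csv"
        # extract_table defaults to CSV output; .result() blocks until
        # the extract job finishes (or raises on failure)
        client.extract_table(table_id, destination_uri).result()
```

Since each chunk table holds at most 5000 rows, each extract job stays well under the single-file export limit and no wildcard URI is needed.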

