I have a BigQuery table whose rows are not randomly ordered, and the IDs are not random either. I would like to partition the data into chunks based on a random number, so that I can use those chunks for various parts of the project.
The solution I have in mind is to add two columns to my table: a randomly generated number, and a partition number. I am following this code snippet on AI Platform Notebooks.
The only substantive difference is that I've changed the query_job section to:
traintestsplit = """
-- RAND() returns a FLOAT64, so randn is declared FLOAT64 rather than NUMERIC.
DECLARE randn FLOAT64;
DECLARE split INT64 DEFAULT 0;

SET randn = RAND();

-- Assign split 1, 2, or 3 with roughly equal probability.
IF randn < (1/3) THEN
  SET split = 1;
ELSEIF randn < (2/3) THEN
  SET split = 2;
ELSE
  SET split = 3;
END IF;
"""
query_job = client.query(
    traintestsplit,
    job_config=job_config,
)  # Make an API request.
query_job.result()  # Wait for the job to complete.
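For context, job_config points the query at a destination table, roughly like this (the project, dataset, and table names are placeholders for my setup):

from google.cloud import bigquery

client = bigquery.Client()

# Write query results to an explicit destination table (placeholder name).
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydataset.mytable_split"
)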
I get the same error that someone else got:

BadRequest: 400 configuration.query.destinationTable cannot be set for scripts
(job ID: 676675d7-9151-4626-8a7e-96263232f7b2)

I have read through Cannot set destination table with BigQuery Python API, but I need partition assignments that stay constant if I am going to use these partitions.
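If I understand the answers there correctly, one workaround is to drop the destination table from the job config and have the SQL write its own output table, along these lines (a sketch; table and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# The SQL creates its own output table, so no destination is set on the job.
traintestsplit = """
CREATE OR REPLACE TABLE mydataset.mytable_split AS
SELECT
  *,
  CASE
    WHEN randn < 1/3 THEN 1
    WHEN randn < 2/3 THEN 2
    ELSE 3
  END AS split
FROM (
  SELECT *, RAND() AS randn FROM mydataset.mytable
);
"""
client.query(traintestsplit).result()

But because RAND() draws fresh values on every run, rerunning this reshuffles the rows into different partitions, which is exactly the constancy problem I'm worried about.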
Should I approach this problem in another way? A very naive way would be to pull the IDs from the BigQuery table, generate a random number, save the random number as a CSV, and then do a join every time I pull the data but that seems terribly inefficient.
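To make that naive idea concrete, it would look roughly like this (a sketch; mydataset.mytable and the id column are placeholders):

import numpy as np
from google.cloud import bigquery

client = bigquery.Client()

# Pull only the IDs and assign each one a random partition.
ids = client.query("SELECT id FROM mydataset.mytable").to_dataframe()
ids["randn"] = np.random.rand(len(ids))
ids["split"] = np.digitize(ids["randn"], [1/3, 2/3]) + 1  # 1, 2, or 3

# Persist the assignment so it stays constant across runs...
ids.to_csv("split_assignment.csv", index=False)
# ...but every later pull of the data would need a join against this file.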