TL;DR
No.
Explanation:
As it says in Amazon Redshift UNLOAD document, if you do not want it to be split into several parts, you can use PARALLEL FALSE
, but it is strongly recommended to leave it enabled. Even then, the file will always include the 000.[EXT]
suffix (when the [EXT]
exists only when the compression is enabled), because there is a limit to a file size that Redshift can output, as says in the documentation:
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
s3://mybucket/key000 6.2 GB
s3://mybucket/key001 6.2 GB
s3://mybucket/key002 1.0 GB
Therefore, it will alway add at least the prefix 000
, because Redshift doesn't know what size of the file he is going to output in the first place, so he's adding this suffix in case the output will reach the size of 6.2 GB.
If you ask why the use of PARALLEL FALSE
is not recommended, I'll try to explain it in several points:
- The most important reason is because of the way a Redshift cluster designed. Each cluster includes at least 2 servers, when one of them is a leader node and the rest are data nodes. The purpose of leader node, is to control the data nodes, it hold the necessary information to work with all data in Redshift, either read or write.
When you unload data from Redshift while the flag PARALLEL
is TRUE
, it will create at least X files, when X is the number of nodes you choose to construct the Redshift cluster of, in the first place. It means, that the data is written directly from the data nodes themselves, which is much faster because it's doing it in parallel and skips the leader node.
When you decide to turn this flag to off, all data is gathered from all of the data nodes into a single node, the leader node, because it needs to reorganize the sorting of the rows to output and also compress it if needed as a single stream. This action causes you data to be written much slower.
- Also, this is significantly decreases Redshift cluster performance in a matter of reading and writing data, because everything (read and write queries) goes through the leader node, and as it says above, when the leader node is overloaded, there will be a performance issue.
The queries COPY
and UNLOAD
work directly with the data nodes, therefore, they behave almost the same way as if you would use PARALLEL TRUE
. In the contrary, queries like SELECT
, UPDATE
, DELETE
and INSERT
, are processed by the leader node, that's why they suffer from the leader node loads.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…