Compute table statistics every time you changed data.
ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS FOR COLUMNS;
Enable CBO and statistics auto gathering:
set hive.cbo.enable=true;
set hive.stats.autogather=true;
Use these settings to enable CBO using statistics:
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
If nothing helps I'd recommend to apply this approach for finding last partition fast:
Parse max partition key using shell script from the table location.
The command below will print all table folder paths, sort, take latest sorted, take last subfolder name, parse partition folder name and extract value. All you need is to initialize TABLE_DIR
variable and put the number of partition subfolder in the path
:
last_partition=$(hadoop fs -ls $TABLE_DIR/* | awk '{ print $8 }' | sort -r | head -n1 | cut -d / -f [number of partition subfolder in the path here] | cut -d = -f 2
Then use $last_partition
variable to pass to your script as
hive -hiveconf last_partition="$last_partition" -f your_script.hql
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…