Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate the data.
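For example, a minimal sketch (the table and column names here are hypothetical, not taken from your query):

    -- Hypothetical sketch: de-duplicate keys in subqueries before the join
    SELECT a.imei, a.event_time, b.device_model
    FROM (SELECT DISTINCT imei, event_time FROM usage_events) a
    JOIN (SELECT DISTINCT imei, device_model FROM device_dim) b
      ON a.imei = b.imei;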
I would recommend partitioning your datasets by a to_date(timestamp) field (yyyy-MM-dd) so that partition pruning works with your WHERE clause (verify that it actually does). Also partition by the event field if the datasets are very large and contain a lot of rows where event <> 'Use'.
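A minimal sketch of such a layout, assuming hypothetical table and column names:

    -- Hypothetical sketch: table partitioned by event date (and optionally by event)
    CREATE TABLE events_partitioned (
      imei    STRING,
      payload STRING
    )
    PARTITIONED BY (dt STRING, event STRING)
    STORED AS ORC;

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT OVERWRITE TABLE events_partitioned PARTITION (dt, event)
    SELECT imei, payload, to_date(`timestamp`) AS dt, event
    FROM raw_events;

    -- A filter such as WHERE dt = '2023-01-01' AND event = 'Use'
    -- can then prune partitions instead of scanning the whole table.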
It's important to know at which stage it fails; study the exception as well. If it fails on the mappers, optimize your subqueries (add the partitioning mentioned above). If it fails on the reducers (the join), try to improve the join, for example by reducing the bytes per reducer:

    set hive.exec.reducers.bytes.per.reducer=67108864;

or even less. If it fails on the writer (OrcWriter), try partitioning the output table by a substring of imei and adding `distribute by substr(imei, ...)` at the end of the query to reduce the pressure on the reducers.
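A sketch of such an output table, assuming a hypothetical two-character prefix:

    -- Hypothetical output table partitioned by a short prefix of imei
    CREATE TABLE output_table (
      imei   STRING,
      metric BIGINT
    )
    PARTITIONED BY (imei_prefix STRING)
    STORED AS ORC;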
Or add one more column with low cardinality and even distribution to spread the data across more reducers evenly:

    distribute by substr(imei...), col2
Make sure the partition column is included in the distribute by; this reduces the number of files written by each reducer and helps to get rid of OOM errors.
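Putting this together, a hedged sketch of the end of the query (output_table, joined_result and col2 are hypothetical; col2 stands for the extra low-cardinality, evenly distributed column):

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Hypothetical sketch: the partition expression comes first in DISTRIBUTE BY,
    -- so each reducer writes only a few partitions and therefore few files
    INSERT OVERWRITE TABLE output_table PARTITION (imei_prefix)
    SELECT imei, metric, substr(imei, 1, 2) AS imei_prefix
    FROM joined_result
    DISTRIBUTE BY substr(imei, 1, 2), col2;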