I am working with financial data and cannot assume a Gaussian distribution. So I normalize my data by subtracting the median and dividing by the interquartile range. This puts 95% of the data into the range [-2, 2]. The rest are extreme outliers that can reach values such as -8, 28, 47, etc.
But I still don't want to throw the outliers away, so I apply tanh(x) to the entire normalized time series. The majority of the data, which lie in the range [-2, 2], are now mapped to roughly [-0.95, 0.95]; the crazy outliers are saturated close to -1 and 1, and the really extreme ones are mapped to exactly -1 and 1 in floating point. Order is preserved throughout, because tanh(x) is a monotonic function, and the machine learning algorithm doesn't have to waste time and energy on numbers with much larger absolute values than the rest. The extreme outliers all end up in two groups, near -1 and 1.
By the way, the tanh compression doesn't destroy too many unique values. That is, close values are not collapsed to the same value by tanh; I get almost exactly the same number of unique values in my time series before the tanh as after.
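Roughly what I am doing, as a minimal sketch (the toy data and the variable names here are just placeholders for my real indicators):

```python
import numpy as np
import pandas as pd

# Toy series standing in for one of my indicators ("x" is a placeholder).
rng = np.random.default_rng(0)
x = pd.Series(np.concatenate([
    rng.standard_t(df=3, size=10_000),   # heavy-tailed bulk of the data
    [28.0, 47.0, -8.0],                  # a few extreme outliers
]))

# Robust normalization: subtract the median, divide by the IQR.
median = x.median()
iqr = x.quantile(0.75) - x.quantile(0.25)
z = (x - median) / iqr

# Squash with tanh: the bulk stays roughly in [-0.95, 0.95],
# extreme outliers saturate towards -1 and 1 but keep their ordering.
squashed = np.tanh(z)

# Check that the compression does not collapse many distinct values.
print(z.nunique(), squashed.nunique())
```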
The data will be fed into a neural network, a random forest, and gradient boosted decision trees. (Even though decision trees don't care much about outliers, I still want to force all indicators into the same range [-1, 1].)
What are the bad consequences of my approach, compared to just throwing the outliers away? What am I missing?
question from:
https://stackoverflow.com/questions/66052101/can-i-deal-with-outliers-in-my-data-by-applying-tanhx