You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".
In other words, Impala doesn't even use Hadoop at all. It simply has daemons running on all your nodes which cache some of the data that is in HDFS, so that these daemons can return data quickly without having to go through a whole Map/Reduce job.
The reason for this is that there is a certain overhead involved in running a Map/Reduce job, so by short-circuiting Map/Reduce altogether you can get some pretty big gain in runtime.
That being said, Impala does not replace Hive, it is good for very different use cases. Impala doesn't provide fault-tolerance compared to Hive, so if there is a problem during your query then it's gone. Definitely for ETL type of jobs where failure of one job would be costly I would recommend Hive, but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look and analyze some data without building robust jobs. Also from my personal experience, Impala is still not very mature, and I've seen some crashes sometimes when the amount of data is larger than available memory.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…