Is it possible to tell HDFS where to store particular files?
Use case
I've just loaded batch #1 of files into HDFS and want to run a job/application on this data. However, I also have batch #2 that is still to be loaded. It would be nice if I could run the job/application on the first batch on, say, nodes 1 to 10, and load the new data onto, say, nodes 11 to 20, completely in parallel.
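To make the desired behaviour concrete, here is a toy Python model of the placement I have in mind. It is not HDFS code and uses no HDFS API: `place_blocks`, the node names, and the block IDs are all made up for illustration, and the round-robin assignment is just an assumption to keep the sketch short.

```python
def place_blocks(block_ids, allowed_nodes, replication=3):
    """Assign each block to `replication` distinct nodes, using only allowed_nodes."""
    n = len(allowed_nodes)
    if replication > n:
        raise ValueError("replication factor exceeds the number of allowed nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Round-robin over the allowed group; consecutive offsets are distinct
        # because replication <= n.
        placement[block] = [allowed_nodes[(i + r) % n] for r in range(replication)]
    return placement


# Hypothetical 20-node cluster split into two groups.
batch1_nodes = [f"node{i:02d}" for i in range(1, 11)]   # nodes 1 to 10
batch2_nodes = [f"node{i:02d}" for i in range(11, 21)]  # nodes 11 to 20

batch1 = place_blocks([f"b1-blk{i}" for i in range(6)], batch1_nodes)
batch2 = place_blocks([f"b2-blk{i}" for i in range(6)], batch2_nodes)

used1 = {node for nodes in batch1.values() for node in nodes}
used2 = {node for nodes in batch2.values() for node in nodes}

# The two batches never share a node, so reading batch #1 and
# writing batch #2 would not compete for the same disks.
print(used1.isdisjoint(used2))
```

In other words, I would like every replica of a batch's blocks to stay inside the node group I choose for that batch.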
Initially I thought that NameNode federation (Hadoop 2.x) did exactly that, but it looks like federation only splits the namespace, while DataNodes still store blocks for all connected NameNodes.
So, is there a way to control the distribution of data in HDFS? And does it make sense at all?