I am trying to run a high-memory job on a Hadoop cluster (0.20.203). I modified mapred-site.xml to enforce some memory limits:
<property>
  <name>mapred.cluster.max.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapred.cluster.max.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapred.cluster.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapred.cluster.reduce.memory.mb</name>
  <value>2048</value>
</property>
In my job, I specify how much memory I will need. Unfortunately, even though I run the process with -Xmx2g (and the job runs just fine as a console application with that much heap), I need to request much more memory for my mapper (as a subquestion: why is this?) or it gets killed.
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Child JVM options (heap capped at 2 GB) and the per-task memory I request
conf.set("mapred.child.java.opts", "-Xms256m -Xmx2g -XX:+UseSerialGC")
conf.set("mapred.job.map.memory.mb", "4096")
conf.set("mapred.job.reduce.memory.mb", "1024")
The reducer should need hardly any memory, since I am using an identity reducer:
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConversions._ // lets the for-comprehension walk a java.lang.Iterable

class IdentityReducer[K, V] extends Reducer[K, V, K, V] {
  override def reduce(key: K,
                      values: java.lang.Iterable[V],
                      context: Reducer[K, V, K, V]#Context) {
    // Pass every value through unchanged
    for (v <- values) {
      context.write(key, v)
    }
  }
}
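For completeness, here is roughly how I wire this up and submit the job. This is only a sketch: the paths, the SequenceFile input format, and the Text key/value types are placeholders standing in for my real pipeline, and the default (identity) Mapper takes the place of my actual high-memory mapper.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, SequenceFileInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Driver sketch: `conf` is the Configuration built above with the memory
// settings applied; paths and the input format are placeholders.
val job = new Job(conf, "identity job")
job.setJarByClass(classOf[IdentityReducer[Text, Text]])
job.setInputFormatClass(classOf[SequenceFileInputFormat[Text, Text]])
job.setReducerClass(classOf[IdentityReducer[Text, Text]])
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[Text])
FileInputFormat.addInputPath(job, new Path("/placeholder/input"))
FileOutputFormat.setOutputPath(job, new Path("/placeholder/output"))
job.waitForCompletion(true)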
However, the reducer still uses a lot of memory. Is it possible to give the reducer different JVM arguments than the mapper? Hadoop kills the reducer, claiming it is using 3960 MB of memory, and the failed reducers end up failing the job. How is this possible?
TaskTree [pid=10282,tipID=attempt_201111041418_0005_r_000000_0] is running beyond memory-limits.
Current usage : 4152717312bytes.
Limit : 1073741824bytes.
Killing task.
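For what it's worth, the only route I can see to giving the reducer different JVM arguments than the mapper is the pair of split properties added in later releases (I believe via MAPREDUCE-478). I have not confirmed that 0.20.203 honors them, so treat the property names below as an assumption rather than something I know works here.

// Assumption: per-phase properties from a later Hadoop release; 0.20.203 may
// simply ignore them and fall back to mapred.child.java.opts.
conf.set("mapred.map.child.java.opts", "-Xms256m -Xmx2g -XX:+UseSerialGC")
conf.set("mapred.reduce.child.java.opts", "-Xms256m -Xmx512m -XX:+UseSerialGC")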
UPDATE: even when I specify a streaming job with cat as the mapper, uniq as the reducer, and -Xms512M -Xmx1g -XX:+UseSerialGC as the child JVM options, my tasks take over 2 GB of virtual memory! This seems extravagant at double the 1 GB max heap (and four times the 512 MB initial heap).
TaskTree [pid=3101,tipID=attempt_201111041418_0112_m_000000_0] is running beyond memory-limits.
Current usage : 2186784768bytes.
Limit : 2147483648bytes.
Killing task.
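To make the numbers in those kill messages concrete, this is the conversion from the reported byte counts to MiB that I am relying on above:

// Convert the byte counts from the two kill messages to MiB (integer division).
val mib = 1024L * 1024L
println(4152717312L / mib) // 3960 MiB -- reducer usage in the first log
println(1073741824L / mib) // 1024 MiB -- reduce limit (mapred.job.reduce.memory.mb)
println(2186784768L / mib) // 2085 MiB -- streaming map task usage
println(2147483648L / mib) // 2048 MiB -- map limit (the cluster default of 2048)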
UPDATE: the original JIRA for changing the configuration format for memory usage specifically mentions that Java users are mostly interested in physical memory, to prevent thrashing. I think this is exactly what I want: I don't want a node to spin up a mapper if there is inadequate physical memory available. However, these options all seem to have been implemented as virtual memory constraints, which are difficult to manage.