Wednesday, May 7, 2014

MapR commands - 7 MapReduce

  • mapred-site.xml

<property>
  <name>mapred.fairscheduler.smalljob.schedule.enable</name>
  <value>true</value>
  <description>Enable small-job fast scheduling inside the fair scheduler.
  TaskTrackers reserve a slot, called the ephemeral slot, which
  is used for small jobs when the cluster is busy.
  </description>
</property>


<!-- Small job definition. A job that exceeds any of the following limits
 is not considered a small job and will be moved out of the small-job pool.
-->
<property>
  <name>mapred.fairscheduler.smalljob.max.maps</name>
  <value>10</value>
  <description>Small job definition. Max number of maps allowed in small job. </description>
</property>


<property>
  <name>mapred.fairscheduler.smalljob.max.reducers</name>
  <value>10</value>
  <description>Small job definition. Max number of reducers allowed in small job. </description>
</property>


<property>
  <name>mapred.fairscheduler.smalljob.max.inputsize</name>
  <value>10737418240</value>
  <description>Small job definition. Max input size in bytes allowed for a small job.
  Default is 10GB.
  </description>
</property>


<property>
  <name>mapred.fairscheduler.smalljob.max.reducer.inputsize</name>
  <value>1073741824</value>
  <description>Small job definition.
  Max estimated input size for a reducer allowed in small job.
  Default is 1GB per reducer.
  </description>
</property>


<property>
  <name>mapred.cluster.ephemeral.tasks.memory.limit.mb</name>
  <value>200</value>
  <description>Small job definition. Max memory in MB reserved for an ephemeral slot.
  Default is 200 MB. This value must be the same on the JobTracker and TaskTracker nodes.
  </description>
</property>
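Putting the four limits together, a job is "small" only if it stays within all of them. The check the scheduler performs can be sketched in shell with hypothetical job stats (the job values below are made up for illustration; the limits are the defaults from the properties above):

```shell
# Hypothetical job stats -- replace with the real values for your job.
MAPS=8
REDUCERS=2
INPUT_BYTES=5368709120          # 5 GB total input
REDUCER_INPUT_BYTES=536870912   # 512 MB estimated per-reducer input

# Default small-job limits from mapred-site.xml above.
MAX_MAPS=10
MAX_REDUCERS=10
MAX_INPUT=10737418240           # 10 GB
MAX_REDUCER_INPUT=1073741824    # 1 GB

if [ "$MAPS" -le "$MAX_MAPS" ] && [ "$REDUCERS" -le "$MAX_REDUCERS" ] \
   && [ "$INPUT_BYTES" -le "$MAX_INPUT" ] \
   && [ "$REDUCER_INPUT_BYTES" -le "$MAX_REDUCER_INPUT" ]; then
  VERDICT="small job: eligible for ephemeral slots"
else
  VERDICT="regular job: moved out of the small-job pool"
fi
echo "$VERDICT"
```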
  • Secured TaskTracker

<property>
  <name>mapred.tasktracker.task-controller.config.overwrite</name>
  <value>true</value>
  <description>LinuxTaskController needs a config file at HADOOP_HOME/conf/taskcontroller.cfg.
  It has the following parameters:
  mapred.local.dir = local dir used by the TaskTracker, taken from mapred-site.xml.
  hadoop.log.dir = hadoop log dir, taken from the system properties of the TaskTracker process.
  mapreduce.tasktracker.group = groups allowed to run the TaskTracker; see 'mapreduce.tasktracker.group'.
  min.user.id = don't allow any user below this uid to launch a task.
  banned.users = users who are not allowed to launch any tasks.
  If set to true, the TaskTracker will always overwrite the config file with the default values:
    min.user.id = -1 (check disabled), banned.users = bin, mapreduce.tasktracker.group = root.
  Set to false when using a customized config, then restart the TaskTracker.
  </description>
</property>
To disallow root:
  1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
  2. Edit taskcontroller.cfg and set min.user.id=0 on all TaskTracker nodes.
  3. Restart all TaskTrackers.
To disallow all superusers:
  1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
  2. Edit taskcontroller.cfg and set min.user.id=1000 on all TaskTracker nodes.
  3. Restart all TaskTrackers.
To disallow specific users:
  1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
  2. Edit taskcontroller.cfg and add the parameter banned.users on all TaskTracker nodes, setting it to a comma-separated list of usernames.
     Example: banned.users=foo,bar
  3. Restart all TaskTrackers.
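Putting the steps above together, a customized taskcontroller.cfg might look like the sketch below. The parameter values are illustrative, not recommendations; on a real node the file lives at HADOOP_HOME/conf/taskcontroller.cfg, but the example writes to a temp file so it is self-contained:

```shell
# Sketch of a customized taskcontroller.cfg (illustrative values).
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
mapred.local.dir=/tmp/mapr-hadoop/mapred/local
hadoop.log.dir=/opt/mapr/hadoop/hadoop-0.20.2/logs
mapreduce.tasktracker.group=mapr
min.user.id=1000
banned.users=foo,bar
EOF
# Verify the restriction parameters were written.
grep -E 'min.user.id|banned.users' "$CFG"
```

Remember that mapred.tasktracker.task-controller.config.overwrite must be false, or the TaskTracker will overwrite this file with the defaults on restart.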
To remove all user restrictions and run all jobs as root:
  1. Edit mapred-site.xml and set mapred.task.tracker.task-controller = org.apache.hadoop.mapred.DefaultTaskController on all TaskTracker nodes.
  2. Restart all TaskTrackers.
  • Standalone Operation

Input=local, output=mfs
hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input /output 'dfs[a-z.]+'

Input=local, output=local
(mapred.job.tracker=local can also be set in mapred-site.xml instead of being passed with -D.)
hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input  file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Input=mfs, output=mfs
hadoop jar hadoop-0.20.2-dev-examples.jar grep /input /output 'dfs[a-z.]+'
  • Memory for Services

/opt/mapr/conf/warden.conf
service.command.tt.heapsize.percent=2   #The percentage of heap space reserved for the TaskTracker.
service.command.tt.heapsize.max=325     #The maximum heap space that can be used by the TaskTracker. 
service.command.tt.heapsize.min=64      #The minimum heap space for use by the TaskTracker.
$ cat /opt/mapr/conf/warden.conf | grep heapsize.percent
service.command.jt.heapsize.percent=10
service.command.tt.heapsize.percent=2
service.command.hbmaster.heapsize.percent=4
service.command.hbregion.heapsize.percent=25
service.command.cldb.heapsize.percent=8
service.command.mfs.heapsize.percent=20
service.command.webserver.heapsize.percent=3
service.command.os.heapsize.percent=3 
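The three TaskTracker knobs above interact: my reading of the warden defaults is that the heap is the percentage of physical memory, clamped between the min and max values (this clamping behavior is an assumption, not official documentation). A worked example for a hypothetical 16 GB node:

```shell
# Sketch: how the TaskTracker heap settles given the three knobs above.
# Assumes warden clamps (total * percent / 100) between min and max.
TOTAL_MB=16384   # hypothetical 16 GB node
PCT=2            # service.command.tt.heapsize.percent
MIN_MB=64        # service.command.tt.heapsize.min
MAX_MB=325       # service.command.tt.heapsize.max

HEAP=$(( TOTAL_MB * PCT / 100 ))           # 327 MB from the percentage
[ "$HEAP" -lt "$MIN_MB" ] && HEAP=$MIN_MB  # raise to the floor if needed
[ "$HEAP" -gt "$MAX_MB" ] && HEAP=$MAX_MB  # cap at the ceiling (325 here)
echo "TaskTracker heap: ${HEAP} MB"
```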
  • MapReduce Memory

/opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml:
 
<property>
  <name>mapreduce.tasktracker.reserved.physicalmemory.mb</name>
  <value></value>
  <description>Maximum physical memory the TaskTracker should reserve for MapReduce tasks.
  If tasks use more than the limit, the task using the most memory will be killed.
  Expert only: set this value only if the TaskTracker should use a fixed amount of memory
  for MapReduce tasks. In the MapR distribution, warden calculates this number based
  on the services configured on the node.
  Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 disables
  physical memory accounting and task management.
  </description>
</property>
  • OOM killer

/opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml
mapred.child.oom_adj
Adjusts the OOM-killer score for child tasks (Linux-specific). The kernel range is -17 to +15, but only increasing the adj value is allowed, so valid values are 0 to 15.
  • Map tasks Memory

Map tasks use memory mainly in two ways: the application consumes memory to run the map function, and the MapReduce framework uses an intermediate buffer (io.sort.mb) to hold serialized (key, value) pairs.
/opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml:
io.sort.mb
Buffer used to hold map outputs in memory before writing the final map outputs.
Setting this value too low may cause spills. If left empty, it defaults to 50% of the map task's heap size.
If the average input to a map is MapIn bytes, io.sort.mb should typically be about 1.25 times MapIn bytes.
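Applying that rule of thumb, for a hypothetical 200 MB average map input (the input size is made up for illustration):

```shell
# Rule of thumb above: io.sort.mb should be about 1.25 x average map input.
MAPIN_MB=200                             # hypothetical average map input, in MB
IO_SORT_MB=$(( MAPIN_MB * 125 / 100 ))   # 1.25 x 200 = 250
echo "io.sort.mb=${IO_SORT_MB}"
```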
  • Reduce tasks Memory

mapred.reduce.child.java.opts
Java opts for the reduce tasks. The default heap size (-Xmx) is determined by the memory reserved for MapReduce at the TaskTracker.
A reduce task is given more memory than a map task:
Default memory for a reduce task = (Total Memory reserved for mapreduce) * (2*#reduceslots / (#mapslots + 2*#reduceslots))
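The formula above, evaluated for a hypothetical node (the memory and slot counts are made up for illustration):

```shell
# Worked example of the default reduce-task memory formula above:
#   MEM * (2 * reduce_slots) / (map_slots + 2 * reduce_slots)
MEM_MB=6000       # hypothetical memory reserved for MapReduce on the node
MAP_SLOTS=6
REDUCE_SLOTS=3
REDUCE_MEM=$(( MEM_MB * 2 * REDUCE_SLOTS / (MAP_SLOTS + 2 * REDUCE_SLOTS) ))
echo "default reduce task memory: ${REDUCE_MEM} MB"   # 6000 * 6/12 = 3000
```

With equal weight given to six map slots and twice-weighted three reduce slots, each reduce task gets half the reserved memory in this example.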
  • Tasks number

Map slots should be based on how many map tasks can fit in memory, and reduce slots should be based on the number of CPUs.
mapred.tasktracker.map.tasks.maximum
(CPUS > 2) ? (CPUS * 0.75) : 1
(At least one Map slot, up to 0.75 times the number of CPUs)
mapred.tasktracker.reduce.tasks.maximum
(CPUS > 2) ? (CPUS * 0.50) : 1
(At least one Reduce slot, up to 0.50 times the number of CPUs)

variables in formula:
CPUS - number of CPUs present on the node
DISKS - number of disks present on the node
MEM - memory reserved for MapReduce tasks
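The two slot formulas above, evaluated for a hypothetical 8-CPU node:

```shell
# (CPUS > 2) ? (CPUS * 0.75) : 1  for map slots,
# (CPUS > 2) ? (CPUS * 0.50) : 1  for reduce slots.
CPUS=8
if [ "$CPUS" -gt 2 ]; then
  MAP_SLOTS=$(( CPUS * 75 / 100 ))     # 0.75 * 8 = 6
  REDUCE_SLOTS=$(( CPUS * 50 / 100 ))  # 0.50 * 8 = 4
else
  MAP_SLOTS=1
  REDUCE_SLOTS=1
fi
echo "map slots: $MAP_SLOTS, reduce slots: $REDUCE_SLOTS"
```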
mapreduce.tasktracker.prefetch.maptasks
How many map tasks should be scheduled in advance on a TaskTracker,
given as a percentage of map slots. The default is 1.0, which means the number of tasks overscheduled equals the total map slots on the TT.
