
Hadoop Parameter Configuration Explained

 

JobTracker Configuration

Changing any parameters in this section requires a JobTracker restart.

Parameter Value Description
mapred.job.tracker maprfs:/// The JobTracker address, given as ip:port, or a URI: maprfs:/// for the default cluster, or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster.
Replace localhost with one or more IP addresses for the JobTracker.
mapred.jobtracker.port 9001 Port on which JobTracker listens. Read by JobTracker to start RPC Server.
mapreduce.tasktracker.outofband.heartbeat True The task tracker sends an out-of-band heartbeat on task completion to improve latency. Set this value to false to disable this behavior.
webinterface.private.actions False By default, jobs cannot be killed from the job tracker's web interface. Set this value to True to enable this behavior.
MapR recommends properly securing your interfaces before enabling this behavior.
maprfs.openfid2.prefetch.bytes 0 Expert: The number of shuffle bytes to be prefetched by the reduce task.
mapr.localoutput.dir output The path for map output files on shuffle volume.
mapr.localspill.dir spill The path for local spill files on shuffle volume.
mapreduce.jobtracker.node.labels.file The file that specifies the labels to apply to the nodes in the cluster.
mapreduce.jobtracker.node.labels.monitor.interval 120000 Specifies a time interval in milliseconds. The node labels file is polled for changes every time this interval passes.
mapred.queue.<queue-name>.label Specifies a label for the queue named in the <queue-name> placeholder.
mapred.queue.<queue-name>.label.policy Specifies a policy for the label applied to the queue named in the <queue-name> placeholder. The policy controls the interaction between the queue label and the job label:
  • PREFER_QUEUE - always use label set on queue
  • PREFER_JOB - always use label set on job
  • AND (default) - job label AND node label
  • OR - job label OR node label

JobTracker Directories

When changing any parameters in this section, a JobTracker restart is required.

Volume path = mapred.system.dir/../

Parameter Value Description
mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce stores control files.
mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where job status information is persisted in a file system so that it remains available after it drops off the memory queue and between JobTracker restarts.
mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job files. In practice, this should be the directory where users' home directories are located (usually /user).
mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limits if set to -1.
mapreduce.maprfs.use.compression True Set this property's value to False to disable the use of MapR-FS compression for shuffle data by MapReduce.
mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job status to keep in the cache.
mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job history files are stored at this single well known location. If nothing is specified, the files are stored at ${hadoop.job.history.location}/done in local filesystem.
hadoop.job.history.location If the JobTracker is static, the history files are stored in this single well-known place on the local filesystem. If no value is set here, the default location is ${hadoop.log.dir}/history in the local filesystem. History files are moved to mapred.job.tracker.history.completed.location, which is on the MapR-FS JobTracker volume.
mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

JobTracker Recovery

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description
mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery Recovery Directory. Stores list of known TaskTrackers.
mapreduce.jobtracker.recovery.maxtime 120 Maximum time in seconds JobTracker should stay in recovery mode.
mapreduce.jobtracker.split.metainfo.maxsize 10000000 This property's value sets the maximum permissible size of the split metainfo file. The JobTracker does not attempt to read split metainfo files larger than this value.
mapred.jobtracker.restart.recover true Set to "true" to enable job recovery upon restart, or "false" to start afresh.
mapreduce.jobtracker.recovery.job.initialization.maxtime 480 This property's value specifies the maximum time in seconds that the JobTracker waits to initialize jobs before starting recovery. This property's default value is equal to the value of the mapreduce.jobtracker.recovery.maxtime property.

Enable Fair Scheduler

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description
mapred.fairscheduler.allocation.file conf/pools.xml
mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler The class responsible for task scheduling.
mapred.fairscheduler.assignmultiple true
mapred.fairscheduler.eventlog.enabled false Enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/
mapred.fairscheduler.smalljob.schedule.enable True Set this property's value to False to disable fast scheduling for small jobs in FairScheduler. TaskTrackers can reserve an ephemeral slot for small jobs when the cluster is under load.
mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in small job.
mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in small job.
mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10GB.
mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in small job. Default is 1GB per reducer.
mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Max memory in megabytes reserved for an ephemeral slot. Default is 200 MB. This value must be the same on JobTracker and TaskTracker nodes.
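The four "small job definition" limits above can be read as a single predicate. The following is a minimal Python sketch of that reading, not the FairScheduler's actual code: the JobEstimate type and its field names are hypothetical, and treating each limit as inclusive is an assumption.

```python
from dataclasses import dataclass

# Default limits copied from the table above.
SMALLJOB_MAX_MAPS = 10                        # mapred.fairscheduler.smalljob.max.maps
SMALLJOB_MAX_REDUCERS = 10                    # mapred.fairscheduler.smalljob.max.reducers
SMALLJOB_MAX_INPUTSIZE = 10737418240          # ...smalljob.max.inputsize (10 GB)
SMALLJOB_MAX_REDUCER_INPUTSIZE = 1073741824   # ...smalljob.max.reducer.inputsize (1 GB)


@dataclass
class JobEstimate:
    """Hypothetical summary of a job, used only for this illustration."""
    maps: int
    reducers: int
    input_bytes: int
    reducer_input_bytes: int  # estimated input size for one reducer


def is_small_job(job: JobEstimate) -> bool:
    """Return True when every small-job limit is satisfied (limits assumed inclusive)."""
    return (
        job.maps <= SMALLJOB_MAX_MAPS
        and job.reducers <= SMALLJOB_MAX_REDUCERS
        and job.input_bytes <= SMALLJOB_MAX_INPUTSIZE
        and job.reducer_input_bytes <= SMALLJOB_MAX_REDUCER_INPUTSIZE
    )


if __name__ == "__main__":
    job = JobEstimate(maps=4, reducers=1, input_bytes=2 * 1024**3,
                      reducer_input_bytes=512 * 1024**2)
    print(is_small_job(job))
```

A job that exceeds any one of the limits, for example 11 maps, would not qualify for an ephemeral slot under this reading.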

TaskTracker Configuration

When changing any parameters in this section, a TaskTracker restart is required.

When mapreduce.tasktracker.prefetch.maptasks is greater than 0, you must disable Fair Scheduler with preemption and label-based job placement.
Parameter Value Description
mapred.tasktracker.map.tasks.maximum -1 The maximum number of map task slots to run simultaneously. The default value of -1 specifies that the number of map task slots is based on the total amount of memory reserved for MapReduce by the Warden. Of the memory available for MapReduce (not counting the memory reserved for ephemeral slots), 40% is allocated to map tasks. That total amount of memory is divided by the value of the mapred.maptask.memory.default parameter to determine the total number of map task slots on this node. You can also specify a formula using the following variables:
  • CPUS - The number of CPUs on the node.
  • DISKS - The number of disks on the node.
  • MEM - The amount of memory reserved for MapReduce tasks by the Warden.
    You can assemble these variables with the syntax CONDITIONAL ? TRUE : FALSE. For example, the expression 2*CPUS < DISKS ? 2*CPUS : DISKS results in 2*CPUS slots when there are more disks on the node than twice the number of cores, and DISKS slots otherwise.
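To make the example formula concrete, here is a small Python sketch that evaluates 2*CPUS < DISKS ? 2*CPUS : DISKS for given node values. It is only an illustration of the ternary semantics described above; the function name and the sample node sizes are hypothetical, and it does not parse arbitrary formulas.

```python
def example_slot_formula(cpus: int, disks: int) -> int:
    """Evaluate the example expression: 2*CPUS < DISKS ? 2*CPUS : DISKS."""
    # CONDITIONAL ? TRUE : FALSE maps onto Python's "TRUE if CONDITIONAL else FALSE".
    return 2 * cpus if 2 * cpus < disks else disks


if __name__ == "__main__":
    # Hypothetical node with 4 CPUs and 12 disks: 8 < 12, so 2*CPUS slots.
    print(example_slot_formula(cpus=4, disks=12))
    # Hypothetical node with 8 CPUs and 6 disks: 16 >= 6, so DISKS slots.
    print(example_slot_formula(cpus=8, disks=6))
```

For this particular expression the result is simply the smaller of 2*CPUS and DISKS.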
mapreduce.tasktracker.prefetch.maptasks 1.0 The number of map tasks to schedule in advance on a TaskTracker, given as a fraction of the map slots. The default of 1.0 means that the number of overscheduled tasks equals the total number of map slots on the TaskTracker.
mapreduce.tasktracker.reserved.physicalmemory.mb.low 0.8 This property's value sets the target memory usage level when the TaskTracker kills tasks to reduce total memory usage. This property's value represents a percentage of the amount in the mapreduce.tasktracker.reserved.physicalmemory.mb value.
mapreduce.tasktracker.task.slowlaunch False Set this property's value to True to wait after each task launch for nodes running critical services like CLDB, JobTracker, and ZooKeeper.
mapreduce.tasktracker.volume.healthcheck.interval 60000 This property's value defines the frequency in milliseconds that the TaskTracker checks the MapReduce volume defined in the ${mapr.localvolumes.path}/mapred/ property.
mapreduce.use.maprfs True Use MapR-FS for shuffle and sort/merge.
mapred.userlog.retain.hours 24 This property's value specifies the maximum time, in hours, to retain the user-logs after job completion.
mapred.user.jobconf.limit 5242880 The maximum allowed size of the user jobconf. The default is set to 5 MB.
mapred.userlog.limit.kb 0 Deprecated: The maximum size of user-logs of each task in KB. 0 disables the cap.
mapreduce.use.fastreduce False Expert: Merge map outputs without copying.
mapred.tasktracker.reduce.tasks.maximum -1 The maximum number of reduce task slots to run simultaneously. The default value of -1 specifies that the number of reduce task slots is based on the total amount of memory reserved for MapReduce by the Warden. Of the memory available for MapReduce (not counting the memory reserved for ephemeral slots), 60% is allocated to reduce tasks. That total amount of memory is divided by the value of the mapred.reducetask.memory.default parameter to determine the total number of reduce task slots on this node. You can also specify a formula using the following variables:
  • CPUS - The number of CPUs on the node.
  • DISKS - The number of disks on the node.
  • MEM - The amount of memory reserved for MapReduce tasks by the Warden.
    You can assemble these variables with the syntax CONDITIONAL ? TRUE : FALSE. For example, the expression 2*CPUS < DISKS ? 2*CPUS : DISKS results in 2*CPUS slots when there are more disks on the node than twice the number of cores, and DISKS slots otherwise.
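The default (-1) behavior for map and reduce slots described in this section can be worked through as arithmetic. The Python sketch below assumes the result is rounded down to a whole number of slots, which the text does not state; the function name and all memory figures in the example are hypothetical.

```python
def default_slots(mapreduce_mem_mb: int, ephemeral_mem_mb: int,
                  map_task_mem_mb: int, reduce_task_mem_mb: int) -> tuple:
    """Estimate (map_slots, reduce_slots) when both *.tasks.maximum values are -1.

    mapreduce_mem_mb   -- memory reserved for MapReduce by the Warden
    ephemeral_mem_mb   -- memory reserved for ephemeral slots (not counted)
    map_task_mem_mb    -- value of mapred.maptask.memory.default
    reduce_task_mem_mb -- value of mapred.reducetask.memory.default
    """
    usable = mapreduce_mem_mb - ephemeral_mem_mb
    # 40% of the usable memory goes to map tasks, 60% to reduce tasks.
    # Integer arithmetic avoids floating-point rounding; flooring is an assumption.
    map_slots = usable * 40 // (100 * map_task_mem_mb)
    reduce_slots = usable * 60 // (100 * reduce_task_mem_mb)
    return map_slots, reduce_slots


if __name__ == "__main__":
    # Hypothetical node: 10200 MB for MapReduce, 200 MB of it for ephemeral slots,
    # 1000 MB per map task and 1500 MB per reduce task.
    print(default_slots(10200, 200, 1000, 1500))
```

With these illustrative numbers, 4000 MB is shared among map tasks and 6000 MB among reduce tasks, giving four slots of each kind.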
mapred.tasktracker.ephemeral.tasks.maximum 1 Reserved slot for small job scheduling
mapred.tasktracker.ephemeral.tasks.timeout 10000 Maximum time in milliseconds a task is allowed to occupy ephemeral slot
mapred.tasktracker.ephemeral.tasks.ulimit 4294967296 Ulimit (bytes) on all tasks scheduled on an ephemeral slot
mapreduce.tasktracker.reserved.physicalmemory.mb Maximum physical memory that the TaskTracker should reserve for MapReduce tasks. If tasks use more than the limit, the task using the most memory is killed. Expert only: Set this value only if the TaskTracker should use a certain amount of memory for MapReduce tasks. In the MapR distribution, the Warden computes this number based on the services configured on a node. Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 disables physical memory accounting and task management.
mapred.tasktracker.expiry.interval 600000 Expert: This property's value specifies a time interval in milliseconds. After this interval expires without any heartbeats sent, a TaskTracker is marked lost.
mapreduce.tasktracker.heapbased.memory.management false Expert only: Use this option to prevent swapping by not launching too many tasks. A task's memory usage is based on its maximum Java heap size (-Xmx). By default, -Xmx is computed by the TaskTracker based on the slots and the memory reserved for MapReduce tasks. See mapred.map.child.java.opts/mapred.reduce.child.java.opts.
mapreduce.tasktracker.jvm.idle.time 10000 If a JVM is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds), the TaskTracker kills it.
mapred.max.tracker.failures 4 The number of task-failures on a tasktracker of a given job after which new tasks of that job aren't assigned to it.
mapred.max.tracker.blacklists 4 The number of times a TaskTracker is blacklisted by various jobs, after which the TaskTracker can be blacklisted across all jobs. The tracker is given tasks again later (after a day). The tracker becomes a healthy tracker after a restart.
mapred.task.tracker.http.address 0.0.0.0:50060 This property's value specifies the HTTP server address and port for the TaskTracker. Specify 0 as the port to make the server start on a free port.
mapred.task.tracker.report.address 127.0.0.1:0 The IP address and port that the TaskTracker server listens on. Since it is only connected to by the tasks, it uses the local interface. EXPERT ONLY. Only change this value if your host does not have a loopback interface.
mapreduce.tasktracker.group mapr Expert: Group to which the TaskTracker belongs. If LinuxTaskController is configured via the mapreduce.tasktracker.taskcontroller value, the group owner of the task-controller binary $HADOOP_HOME/bin/platform/bin/task-controller must be the same as this group.
mapred.tasktracker.task-controller.config.overwrite True The LinuxTaskController needs a configuration file set at $HADOOP_HOME/conf/taskcontroller.cfg. The configuration file takes the following parameters:
  • mapred.local.dir = Local dir used by the TaskTracker, taken from mapred-site.xml.
  • hadoop.log.dir = Hadoop log dir, taken from the system properties of the TaskTracker process.
  • mapreduce.tasktracker.group = Groups allowed to run the TaskTracker; see 'mapreduce.tasktracker.group'.
  • min.user.id = Don't allow any user below this uid to launch a task.
  • banned.users = Users who are not allowed to launch any tasks.
    If set to True, the TaskTracker always overwrites the config file with the default values min.user.id = -1 (check disabled), banned.users = bin, mapreduce.tasktracker.group = root. To disable this configuration and use a custom configuration, set this property's value to False and restart the TaskTracker.
mapred.tasktracker.indexcache.mb 10 This property's value specifies the maximum amount of memory allocated by the TaskTracker for the index cache. The index cache is used when the TaskTracker serves map outputs to reducers.
mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst Expert: The instrumentation class to associate with each TaskTracker.
mapred.task.tracker.task-controller org.apache.hadoop.mapred.LinuxTaskController This property's value specifies the TaskController that launches and manages task execution.
mapred.tasktracker.taskmemorymanager.killtask.maxRSS False Set this property's value to True to kill tasks that are using maximum memory when the total number of MapReduce tasks exceeds the limit specified in the TaskTracker's mapreduce.tasktracker.reserved.physicalmemory.mb property. Tasks are killed in most-recently-launched order.
mapred.tasktracker.taskmemorymanager.monitoring-interval 3000 This property's value specifies an interval in milliseconds that TaskTracker waits between monitoring the memory usage of tasks. This property is only used when tasks memory management is enabled by setting the property mapred.tasktracker.tasks.maxmemory to True.
mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 This property's value sets the time in milliseconds that the TaskTracker waits before sending a SIGKILL to a process after it has been sent a SIGTERM.
mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp A shared directory for temporary files.
mapreduce.cluster.map.userlog.retain-size -1 This property's value specifies the number of bytes to retain from map task logs. The default value of -1 disables this feature.
mapreduce.cluster.reduce.userlog.retain-size -1 This property's value specifies the number of bytes to retain from reduce task logs. The default value of -1 disables this feature.
mapreduce.heartbeat.10000 100000 This property's value specifies a heartbeat time in milliseconds for a cluster of 1001 to 10000 nodes. Scales linearly between 10s - 100s.
mapreduce.heartbeat.1000 10000 This property's value specifies a heartbeat time in milliseconds for a cluster of 101 to 1000 nodes. Scales linearly between 1s - 10s.
mapreduce.heartbeat.100 1000 This property's value specifies a heartbeat time in milliseconds for a cluster of 11 to 100 nodes. Scales linearly between 300ms - 1s.
mapreduce.heartbeat.10 300 This property's value specifies a heartbeat time in milliseconds for a cluster of 1 to 10 nodes.
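The four mapreduce.heartbeat.* values describe a piecewise-linear scale. The Python sketch below shows one way to read it, assuming straight-line interpolation on the node count between the listed breakpoints and clamping above 10000 nodes. Neither detail is specified in the table, so treat this as an approximation rather than the JobTracker's exact computation; the function name is hypothetical.

```python
# (node count, heartbeat in ms) breakpoints taken from the default values above.
BREAKPOINTS = [(10, 300), (100, 1000), (1000, 10000), (10000, 100000)]


def heartbeat_ms(nodes: int) -> int:
    """Approximate heartbeat interval for a cluster of the given size."""
    if nodes <= BREAKPOINTS[0][0]:
        return BREAKPOINTS[0][1]  # 1 to 10 nodes: fixed 300 ms
    for (lo_nodes, lo_ms), (hi_nodes, hi_ms) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if nodes <= hi_nodes:
            # Linear interpolation between the two surrounding breakpoints.
            return lo_ms + (hi_ms - lo_ms) * (nodes - lo_nodes) // (hi_nodes - lo_nodes)
    return BREAKPOINTS[-1][1]  # assumption: clamp beyond the largest bracket


if __name__ == "__main__":
    for n in (5, 55, 100, 550, 10000):
        print(n, heartbeat_ms(n))
```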
mapreduce.job.complete.cancel.delegation.tokens True Set this property's value to False to avoid unregistering or canceling delegation tokens from renewal when a job completes.
mapreduce.jobtracker.inline.setup.cleanup False Set this property's value to True to make the JobTracker attempt to set up and clean up the job by itself instead of doing so in a setup/cleanup task.

Job Configuration

Set these values on the node from which you plan to submit jobs, before submitting the jobs. If you are using Hadoop examples, you can set these parameters from the command line. Example:

hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m"

When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order:

  1. mapred-default.xml
  2. The local mapred-site.xml - overrides identical parameters in mapred-default.xml
  3. Any settings in the job code itself - overrides identical parameters in mapred-site.xml
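The override order above amounts to layering three sets of key/value pairs, with later layers winning. Here is a minimal Python sketch of that precedence using plain dictionaries. It does not parse the XML files, the function name is hypothetical, and the site and job-code values are made up for illustration.

```python
def build_job_conf(defaults: dict, site: dict, job_code: dict) -> dict:
    """Merge configuration layers; later layers override identical parameters."""
    conf = dict(defaults)   # 1. mapred-default.xml
    conf.update(site)       # 2. local mapred-site.xml
    conf.update(job_code)   # 3. settings made in the job code itself
    return conf


if __name__ == "__main__":
    defaults = {"mapred.reduce.tasks": "1", "io.sort.mb": "380"}
    site = {"mapred.reduce.tasks": "8"}        # hypothetical site override
    job_code = {"mapred.reduce.tasks": "16"}   # hypothetical per-job override
    conf = build_job_conf(defaults, site, job_code)
    print(conf["mapred.reduce.tasks"], conf["io.sort.mb"])
```

Here the job code's value for mapred.reduce.tasks wins, while io.sort.mb keeps its default because no later layer sets it.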
Parameter Value Description
keep.failed.task.files false Whether to keep the files for failed tasks. Use this only on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.
mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per jvm. If set to -1, there is no limit.
mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel.
mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel.
mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when the value of the mapred.job.tracker property is local.
mapred.job.map.memory.physical.mb Maximum physical memory limit for map task of this job. If limit is exceeded task attempt will be FAILED.
mapred.job.reduce.memory.physical.mb Maximum physical memory limit for reduce task of this job. If limit is exceeded task attempt will be FAILED.
mapreduce.task.classpath.user.precedence false Set to true if user wants to set different classpath.
mapred.max.maps.per.node -1 Per-node limit on running map tasks for the job. A value of -1 signifies no limit.
mapred.max.reduces.per.node -1 Per-node limit on running reduce tasks for the job. A value of -1 signifies no limit.
mapred.running.map.limit -1 Cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit.
mapred.running.reduce.limit -1 Cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit.
mapreduce.tasktracker.cache.local.numberdirectories 10000 This property's value sets the maximum number of subdirectories to create in a given distributed cache store. Cache items in excess of this limit are expunged whether or not the total size threshold is exceeded.
mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log Java opts for the reduce tasks. MapR Default heapsize(-Xmx) is determined by memory reserved for mapreduce at tasktracker. Reduce task is given more memory than map task. Default memory for a reduce task = (Total Memory reserved for mapreduce) * (2*#reduceslots / (#mapslots + 2*#reduceslots))
mapred.reduce.child.ulimit
io.sort.factor 256 The number of streams to merge simultaneously during file sorting. The value of this property determines the number of open file handles.
io.sort.mb 380 This value sets the size, in megabytes, of the memory buffer that holds map outputs before writing the final map outputs. Lower values for this property increase the chance of spills. Recommended practice is to set this value to 1.5 times the average size of a map output.
io.sort.record.percent 0.17 The percentage of the memory buffer specified by the io.sort.mb property that is dedicated to tracking record boundaries. The maximum number of records that the collection thread can collect before blocking is one-fourth of the product of io.sort.mb and io.sort.record.percent.
io.sort.spill.percent 0.99 This property's value sets the soft limit for either the buffer or record collection buffers. Threads that reach the soft limit begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. Do not reduce this value below 0.5.
mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.
mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means that there is no limit set.
mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy (shuffle) phase.
jobclient.completion.poll.interval 5000 This property's value specifies the JobClient's polling frequency in milliseconds to the JobTracker for updates about job status. Reduce this value for faster tests on single node systems. Adjusting this value on production clusters may result in undesired client-server traffic.
jobclient.output.filter FAILED This property's value specifies the filter that controls the output of the task's userlogs that are sent to the JobClient's console. Legal values are:
  • NONE
  • KILLED
  • FAILED
  • SUCCEEDED
  • ALL
jobclient.progress.monitor.poll.interval 1000 This property's value specifies the JobClient's status reporting frequency in milliseconds to the console and checking for job completion.
job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jobId&jobStatus=$jobStatus This property's value specifies the URL to call at job completion to report the job's end status. Only two variables are legal in the URL, $jobId and $jobStatus. When present, these variables are replaced by their respective values.
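The variable replacement described for job.end.notification.url can be pictured as plain string substitution. The Python sketch below illustrates that idea only: the function name and the sample job ID are hypothetical, and whether Hadoop URL-encodes the substituted values is not covered here.

```python
def expand_notification_url(template: str, job_id: str, job_status: str) -> str:
    """Replace the two legal variables, $jobId and $jobStatus, with their values."""
    return template.replace("$jobId", job_id).replace("$jobStatus", job_status)


if __name__ == "__main__":
    template = "http://localhost:8080/jobstatus.php?jobId=$jobId&jobStatus=$jobStatus"
    # The job ID below is a made-up example value.
    print(expand_notification_url(template, "job_201301011200_0001", "SUCCEEDED"))
```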
job.end.retry.attempts 0 This property's value specifies the maximum number of times that Hadoop attempts to contact the notification URL.
job.end.retry.interval 30000 This property's value specifies the interval in milliseconds between attempts to contact the notification URL.
keep.failed.task.files False Set this property's value to True to keep files for failed tasks. Because this storage is not automatically reclaimed by the system, keep files only for jobs that are failing. Setting this property's value to True also keeps map outputs in the reduce directory as the map outputs are consumed instead of deleting the map outputs on consumption.
local.cache.size 10737418240 This property's value specifies the number of bytes allocated to each local TaskTracker directory to store Distributed Cache data.
mapr.centrallog.dir logs This property's value specifies the relative path under a local volume path that points to the central log location, ${mapr.localvolumes.path}/<hostname>/${mapr.centrallog.dir}.
mapr.localvolumes.path /var/mapr/local The path for local volumes.
map.sort.class org.apache.hadoop.util.QuickSort The default sort class for sorting keys.
tasktracker.http.threads 2 The number of worker threads for the HTTP server.
topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping The default implementation of the DNSToSwitchMapping. It invokes a script specified in the topology.script.file.name property to resolve node names. If no value is set for the topology.script.file.name property, the default value of DEFAULT_RACK is returned for all node names.
topology.script.number.args 100 The max number of arguments that the script configured with the topology.script.file.name property runs with. Each argument is an IP address.
mapr.task.diagnostics.enabled False Set this property's value to True to run the MapR diagnostics script before killing an unresponsive task attempt.
mapred.acls.enabled False This property's value specifies whether or not to check ACLs for user authorization during various queue and job level operations. Set this property's value to True to enable access control checks made by the JobTracker and TaskTracker when users request queue and job operations using Map/Reduce APIs, RPCs, the console, or the web user interfaces.
mapred.child.oom_adj 10 This property's value specifies the adjustment to the out-of-memory value for the Linux-specific out-of-memory killer. Legal values are 0-15.
mapred.child.renice 10 This property's value specifies an integer from 0 to 19 for use by the Linux nice utility.
mapred.child.taskset True Set this property's value to False to prevent running the job in a taskset. See the manual page for taskset(1) for more information.
mapred.child.tmp ./tmp This property's value sets the location of the temporary directory for map and reduce tasks. Set this value to an absolute path to directly assign the directory. Relative paths are located under the task's working directory. Java tasks execute with the option -Djava.io.tmpdir=absolute path of the tmp dir. Pipes and streaming are set with the environment variable TMPDIR=absolute path of the tmp dir.
mapred.cluster.ephemeral.tasks.memory.limit.mb 200 This property's value specifies the maximum size in megabytes for small jobs. This value is reserved in memory for an ephemeral slot. JobTracker and TaskTracker nodes must set this property to the same value.
mapred.cluster.map.memory.mb -1 This property's value sets the virtual memory size of a single map slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.
mapred.cluster.max.map.memory.mb -1 This property's value sets the virtual memory size of a single map task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.
mapred.cluster.max.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.
mapred.cluster.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.
mapred.compress.map.output False Set this property's value to True to compress map outputs with SequenceFile compression before sending the outputs over the network.
mapred.fairscheduler.assignmultiple True Set this property's value to False to prevent the FairScheduler from assigning multiple tasks.
mapred.fairscheduler.eventlog.enabled False Set this property's value to True to enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/.
mapred.fairscheduler.smalljob.max.inputsize 10737418240 This property's value specifies the maximum size, in bytes, that defines a small job.
mapred.fairscheduler.smalljob.max.maps 10 This property's value specifies the maximum number of maps allowed in a small job.
mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 This property's value specifies the maximum estimated input size, in bytes, for a reducer in a small job.
mapred.fairscheduler.smalljob.max.reducers 10 This property's value specifies the maximum number of reducers allowed in a small job.
mapred.healthChecker.interval 60000 This property's value sets the frequency, in milliseconds, that the node health script runs.
mapred.healthChecker.script.timeout 600000 This property's value sets the time, in milliseconds, after which the node health script is killed for being unresponsive and reported as failed.
mapred.inmem.merge.threshold 1000 When a number of files equal to this property's value accumulate, the in-memory merge triggers and spills to disk. Set this property's value to zero or less to force merges and spills to trigger solely on RAMFS memory consumption.
mapred.job.map.memory.mb -1 This property's value sets the virtual memory size of a single map task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.cluster.map.memory.mb, to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature if the value of the mapred.cluster.map.memory.mb property is also -1. Set this value to a useful memory size to enable the feature.
mapred.job.queue.name default This property's value specifies the queue a job is submitted to. This property's value must match the name of a queue defined in mapred.queue.names for the system. The ACL setup for the queue must allow the current user to submit a job to the queue.
mapred.job.reduce.input.buffer.percent 0 This property's value specifies the percentage of memory relative to the maximum heap size. After the shuffle, remaining map outputs in memory must occupy less memory than this threshold value before reduce begins.
mapred.job.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.cluster.reduce.memory.mb, to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature if the value of the mapred.cluster.reduce.memory.mb property is also -1. Set this value to a useful memory size to enable the feature.
mapred.job.reuse.jvm.num.tasks -1 This property's value sets the number of tasks to run on each JVM. The default of -1 sets no limit.
mapred.job.shuffle.input.buffer.percent 0.7 This property's value sets the percentage of memory allocated from the maximum heap size to storing map outputs during the shuffle.
mapred.job.shuffle.merge.percent 0.66 This property's value sets a percentage of the total memory allocated to storing map outputs in mapred.job.shuffle.input.buffer.percent. When memory storage for map outputs reaches this percentage, an in-memory merge triggers.
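The two shuffle percentages above compound: the first is applied to the maximum heap size, the second to the resulting buffer. The short Python sketch below works through that arithmetic. The function name and the 1000 MB heap are hypothetical illustration values, and any additional caps the framework applies are not modeled.

```python
def shuffle_thresholds_mb(max_heap_mb: float,
                          input_buffer_percent: float = 0.7,
                          merge_percent: float = 0.66) -> tuple:
    """Return (shuffle buffer size, in-memory merge trigger), both in MB."""
    # mapred.job.shuffle.input.buffer.percent: share of the heap for map outputs.
    buffer_mb = max_heap_mb * input_buffer_percent
    # mapred.job.shuffle.merge.percent: share of that buffer that triggers a merge.
    merge_trigger_mb = buffer_mb * merge_percent
    return buffer_mb, merge_trigger_mb


if __name__ == "__main__":
    buffer_mb, trigger_mb = shuffle_thresholds_mb(1000)  # hypothetical 1000 MB heap
    print(round(buffer_mb), round(trigger_mb))
```

With the default percentages and this example heap, roughly 700 MB is available for map outputs and an in-memory merge starts at about 462 MB.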
mapred.job.tracker.handler.count 10 This property's value sets the number of server threads for the JobTracker. As a best practice, set this value to approximately 4% of the number of TaskTracker nodes.
mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done This property's value sets a location to store completed job history files. When this property has no value specified, completed job files are stored at ${hadoop.job.history.location}/done in the local filesystem.
mapred.job.tracker.http.address 0.0.0.0:50030 This property's value specifies the HTTP server address and port for the JobTracker. Specify 0 as the port to make the server start on a free port.
mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst Expert: The instrumentation class to associate with each JobTracker.
mapred.jobtracker.job.history.block.size 3145728 This property's value sets the block size of the job history file. Dumping job history to disk is important because job recovery uses the job history.
mapred.jobtracker.jobhistory.lru.cache.size 5 This property's value specifies the number of job history files to load in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.
mapred.job.tracker maprfs:/// The JobTracker address as ip:port, or the URI maprfs:/// for the default cluster, or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster. Use "local" for standalone mode.
mapred.jobtracker.maxtasks.per.job -1 Set this property's value to any positive integer to set the maximum number of tasks for a single job. The default value of -1 indicates that there is no maximum.
mapred.job.tracker.persist.jobstatus.active False Set this property's value to True to enable persistence of job status information.
mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo This property's value specifies the directory where job status information persists after it drops out of the memory queue and between JobTracker restarts.
mapred.job.tracker.persist.jobstatus.hours 0 This property's value specifies job status information persistence time in hours. Persistent job status information is available after the information drops out of the memory queue and between JobTracker restarts. The default value of zero disables job status information persistence.
mapred.jobtracker.port 9001 The IPC port on which the JobTracker listens.
mapred.jobtracker.restart.recover True Set this property's value to False to disable job recovery on restart.
mapred.jobtracker.retiredjobs.cache.size 1000 This property's value specifies the number of retired job statuses kept in the cache.
mapred.jobtracker.retirejob.check 30000 This property's value specifies the frequency interval used by the retire job thread to check for completed jobs.
mapred.line.input.format.linespermap 1 Number of lines per split in NLineInputFormat.
mapred.local.dir.minspacekill 0 This property's value specifies a threshold of free space in the directory specified by the mapred.local.dir property. When free space drops below this threshold, no more tasks are requested until all current tasks finish and clean up. When free space is below this threshold, running tasks are killed in the following order until free space is above the threshold:
  • Reduce tasks
  • All other tasks in reverse percent-completed order.
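The kill order described above can be sketched as a simple sort. The `RunningTask` type and `kill_order` function are hypothetical, and the sketch assumes that "reverse percent-completed order" means the least-complete tasks are killed first; the source does not spell this out.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    task_id: str
    is_reduce: bool
    percent_complete: float  # 0.0 to 100.0


def kill_order(tasks):
    """Return tasks in the order they would be killed to free local disk space."""
    # Reduce tasks are killed first.
    reduces = [t for t in tasks if t.is_reduce]
    # ASSUMPTION: remaining tasks go least-complete first.
    others = sorted(
        (t for t in tasks if not t.is_reduce),
        key=lambda t: t.percent_complete,
    )
    return reduces + others


tasks = [
    RunningTask("map_1", False, 80.0),
    RunningTask("reduce_1", True, 10.0),
    RunningTask("map_2", False, 20.0),
]
print([t.task_id for t in kill_order(tasks)])
```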
mapred.local.dir.minspacestart 0 This property's value specifies a free space threshold for the directory specified by mapred.local.dir. No tasks are requested while free space is below this threshold.
mapred.local.dir /tmp/mapr-hadoop/mapred/local This property's value specifies the directory where MapReduce stores localized job files. Localized job files are the job-related files downloaded by the TaskTracker and include the job configuration, job JAR file, and files added to the DistributedCache. Each task attempt has a dedicated subdirectory under the mapred.local.dir directory. Shared files are symbolically linked to those subdirectories.
mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log This property stores Java options for map tasks. When present, the @taskid@ symbol is replaced with the current TaskID. As an example, to enable verbose garbage collection logging to a file named for the taskid in /tmp and to set the heap maximum to 1GB, set this property to the value -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc.
The configuration variable mapred.{map/reduce}.child.ulimit controls the maximum virtual memory of the child processes.
In the MapR distribution for Hadoop, the default -Xmx is determined by the memory reserved for MapReduce by the TaskTracker. Reduce tasks use more memory than map tasks. The default memory for a map task follows the formula (Total Memory reserved for mapreduce) * (#mapslots / (#mapslots + 1.3 * #reduceslots)).
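To make the formula above concrete, the sketch below evaluates it exactly as quoted. The function name and the sample figures (8000 MB reserved, 4 map slots, 2 reduce slots) are assumptions for illustration only.

```python
def default_map_memory_mb(total_reserved_mb, map_slots, reduce_slots, reduce_weight=1.3):
    """Evaluate: total * (map_slots / (map_slots + 1.3 * reduce_slots))."""
    denominator = map_slots + reduce_weight * reduce_slots
    if denominator <= 0:
        raise ValueError("at least one map or reduce slot is required")
    return total_reserved_mb * (map_slots / denominator)


# Hypothetical node: 8000 MB reserved for MapReduce, 4 map slots, 2 reduce slots.
print(round(default_map_memory_mb(8000, 4, 2), 1), "MB")
```

The 1.3 weight on reduce slots is what encodes the statement that reduce tasks use more memory than map tasks.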
mapred.map.child.log.level INFO This property's value sets the logging level for the map task. The allowed levels are:
  • OFF
  • FATAL
  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE
  • ALL
mapred.map.max.attempts 4 Expert: This property's value sets the maximum number of attempts per map task.
mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec Specifies the compression codec to use to compress map outputs if compression of map outputs is enabled.
mapred.maptask.memory.default 800 When the value of the mapred.tasktracker.map.tasks.maximum parameter is -1, this parameter specifies a size in MB that is used to determine the default total number of map task slots on this node.
mapred.map.tasks 2 The default number of map tasks per job. Ignored when the value of the mapred.job.tracker property is local.
mapred.maxthreads.generate.mapoutput 1 Expert: Number of intra-map-task threads to sort and write the map output partitions.
mapred.maxthreads.partition.closer 1 Expert: Number of threads that asynchronously close or flush map output partitions.
mapred.merge.recordsBeforeProgress 10000 The number of records to process during a merge before sending a progress notification to the TaskTracker.
mapred.min.split.size 0 The minimum size chunk that map input should be split into. File formats with minimum split sizes take priority over this setting.
mapred.output.compress False Set this property's value to True to compress job outputs.
mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec When job output compression is enabled, this property's value specifies the compression codec.
mapred.output.compression.type RECORD When job outputs are compressed as SequenceFiles, this property's value specifies how to compress the job outputs. Legal values are:
  • NONE
  • RECORD
  • BLOCK
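Parameters such as the three output compression settings above are written to mapred-site.xml as name/value pairs inside property elements. The sketch below generates that XML with Python's standard library; the helper name `render_site_xml` and the choice of BLOCK compression are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of parameters as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example values only: enable job output compression with BLOCK compression.
xml_text = render_site_xml({
    "mapred.output.compress": "true",
    "mapred.output.compression.codec": "org.apache.hadoop.io.compress.DefaultCodec",
    "mapred.output.compression.type": "BLOCK",
})
print(xml_text)
```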
mapred.queue.default.state RUNNING This property's value defines the state of the default queue, which can be either STOPPED or RUNNING. This value can be changed at runtime.
mapred.queue.names default This property's value specifies a comma-separated list of the queues configured for this JobTracker. Jobs are added to queues and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention mapred.queue.$QUEUE-NAME.$PROPERTY-NAME.
The number of queues configured in this parameter can depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Verify that the scheduler supports multiple queues before adding queues.
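The naming convention above can be illustrated with a small helper that builds a per-queue property name and checks that the queue is listed in mapred.queue.names. The function name is hypothetical; the `state` property is the one shown for the default queue (mapred.queue.default.state).

```python
def queue_property(queue_names_value, queue_name, property_name):
    """Build mapred.queue.<queue>.<property>, checking the queue is configured.

    queue_names_value is the comma-separated value of mapred.queue.names.
    """
    configured = [q.strip() for q in queue_names_value.split(",") if q.strip()]
    if queue_name not in configured:
        raise ValueError(f"queue {queue_name!r} is not listed in mapred.queue.names")
    return f"mapred.queue.{queue_name}.{property_name}"


print(queue_property("default", "default", "state"))
```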
mapred.reduce.child.log.level INFO The logging level for the reduce task. The allowed levels are:
  • OFF
  • FATAL
  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE
  • ALL
mapred.reduce.copy.backoff 300 This property's value specifies the maximum amount of time in seconds a reducer spends on fetching one map output before declaring the fetch failed.
mapred.reduce.max.attempts 4 Expert: The maximum number of attempts per reduce task.
mapred.reducetask.memory.default 1500 When the value of the mapred.tasktracker.reduce.tasks.maximum parameter is -1, this parameter specifies a size in MB that is used to determine the default total number of reduce task slots on this node.
mapred.skip.attempts.to.start.skipping 2 This property's value specifies a number of task attempts. After that many task attempts, skip mode is active. While skip mode is active, the task reports the range of records which it will process next to the TaskTracker. With this record range, the TaskTracker is aware of which records are dubious and skips dubious records on further executions.
mapred.skip.map.auto.incr.proc.count True SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS increments after MapRunner invokes the map function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly.
mapred.skip.map.max.skip.records 0 The number of acceptable skip records around the bad record, per bad record in the mapper. The number includes the bad record. The default value of 0 disables detection and skipping of bad records. The framework tries to narrow down the skipped range by retrying until this threshold is met or all attempts are exhausted for this task. Set the value to Long.MAX_VALUE to prevent the framework from narrowing down the skipped range.
mapred.skip.reduce.auto.incr.proc.count True SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS increments after the framework invokes the reduce function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly.
mapred.skip.reduce.max.skip.groups 0 The number of acceptable skip groups around the bad group, per bad group in the reducer. The number includes the bad group. The default value of 0 disables detection and skipping of bad groups. The framework tries to narrow down the skipped range by retrying until this threshold is met or all attempts are exhausted for this task. Set the value to Long.MAX_VALUE to prevent the framework from narrowing down the skipped range.
mapred.submit.replication 10 This property's value specifies the replication level for submitted job files. As a best practice, set this value to approximately the square root of the number of nodes.
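The square-root guideline above is easy to compute for a given cluster size. The sketch below is an illustration only; the function name, the rounding to the nearest integer, and the floor of 1 are assumptions.

```python
import math


def suggested_submit_replication(node_count):
    """Suggest mapred.submit.replication as roughly sqrt(number of nodes)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Round to the nearest whole replica, never below 1 (assumption).
    return max(1, round(math.sqrt(node_count)))


for nodes in (4, 25, 100, 400):
    print(nodes, "nodes ->", suggested_submit_replication(nodes))
```

By this rule, the default value of 10 matches a cluster of about 100 nodes.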
mapred.task.cache.levels 2 This property's value specifies the maximum level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level.
mapred.task.calculate.resource.usage True Set this property's value to False to prevent the use of the ${mapreduce.tasktracker.resourcecalculatorplugin} parameter.
mapred.task.profile False Set this property's value to True to enable task profiling and the collection of profiler information by the system.
mapred.task.profile.maps 0-2 This property's value sets the ranges of map tasks to profile. This property is ignored when the value of the mapred.task.profile property is set to False.
mapred.task.profile.reduces 0-2 This property's value sets the ranges of reduce tasks to profile. This property is ignored when the value of the mapred.task.profile property is set to False.
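A value such as 0-2 selects tasks by index. The sketch below shows one way such a range string could be interpreted, assuming comma-separated, inclusive ranges and single indexes; the function names are hypothetical, and open-ended ranges are not handled.

```python
def parse_ranges(spec):
    """Parse a string such as "0-2" or "0-2,5" into inclusive (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            ranges.append((int(low), int(high)))
        else:
            index = int(part)
            ranges.append((index, index))
    return ranges


def is_profiled(task_index, spec, profiling_enabled):
    """Return True if the task falls in a profiled range and profiling is on."""
    # The ranges are ignored when mapred.task.profile is False.
    if not profiling_enabled:
        return False
    return any(low <= task_index <= high for low, high in parse_ranges(spec))


print([i for i in range(5) if is_profiled(i, "0-2", True)])
```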
mapred.task.timeout 600000 This property's value specifies a time in milliseconds after which a task terminates if the task does not perform any of the following:
  • reads an input
  • writes an output
  • updates its status string
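The timeout rule above amounts to tracking the time of the task's last activity (a read, a write, or a status update) and comparing the idle period against mapred.task.timeout. A minimal sketch, with a hypothetical `TaskActivity` class and caller-supplied timestamps in milliseconds:

```python
class TaskActivity:
    """Track a task's last activity time to decide whether it has timed out."""

    def __init__(self, start_ms, timeout_ms=600000):
        self.last_activity_ms = start_ms
        self.timeout_ms = timeout_ms  # mirrors the mapred.task.timeout default

    def record_activity(self, now_ms):
        # Call on each input read, output write, or status string update.
        self.last_activity_ms = now_ms

    def is_timed_out(self, now_ms):
        return now_ms - self.last_activity_ms > self.timeout_ms


task = TaskActivity(start_ms=0)
task.record_activity(now_ms=300000)      # task reported progress at 5 minutes
print(task.is_timed_out(now_ms=800000))  # idle for 500 s, under the 600 s limit
print(task.is_timed_out(now_ms=900001))  # idle for just over 600 s
```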
mapred.tasktracker.dns.interface default This property's value specifies the name of the network interface that the TaskTracker reports its IP address from.
mapred.tasktracker.dns.nameserver default This property's value specifies the host name or IP address of the name server (DNS) that the TaskTracker uses to determine the JobTracker's hostname.

Oozie

Parameter Value Description
hadoop.proxyuser.root.hosts * Specifies the hosts that the superuser must connect from in order to act as another user. Specify the hosts as a comma-separated list of IP addresses or hostnames that are running Oozie servers.
hadoop.proxyuser.mapr.groups mapr,staff
hadoop.proxyuser.root.groups root The superuser can act as any member of the listed groups.




http://www.mapr.com/doc/display/MapR/mapred-site.xml
