hadoop 源码分析(五)hadoop 任务调度TaskScheduler

黎明lm

浏览: 300772 次
性别:
来自: 北京

最近访客更多访客>>

baby孔祥超

jiazhigang

slipper-jay

woshiliukun

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

hadoop

hadoop mapreduce taskScheduler

hadoop mapreduce 之所有能够实现job的运行,以及将job分配到不同datanode 上的map和reduce task 是由TaskSchduler 完成的.

TaskScheduler mapreduce的任务调度器类,当jobClient 提交一个job 给JobTracker 的时候.JobTracker 接受taskTracker 的心跳.心跳信息含有空闲的slot信息等.JobTracker 则通过调用TaskScheduler 的assignTasks()方法类给报告心跳信息中含有空闲的slots信息的taskTracker 分布任务、

TaskScheduler 类为hadoop的调度器的抽象类。默认继承它作为hadoop调度器的方式为FIFO,当然也有Capacity 和Fair等其他调度器,也可以自己编写符合特定场景所需要的调度器.通过继承TaskScheduler 类即可完成该功能、
下面就 FIFO 调度器进行简单的说明：

JobQueueTaskScheduler 类为FIFO 调度器的实现类.
1. 首先JobQueueTaskSchduler 注册两个监听器类：
JobQueueJobInProgressListener jobQueueJobInProgressListener;
EagerTaskInitializationListener eagerTaskInitializationListener;

JobQueueJobInProgressListener 维护一个job的queue ,其中JobSchedulingInfo 中包含job调度的信息:priority,startTime,id.以及 jobAdd update 等操作jobqueue的方法
EagerTaskInitializationListener 初始化job的listener ,这里所谓的初始化不是初始化job的属性信息，而是针对已经存在jobqueue中即将被执行job的初始化,

class JobInitManager implements Runnable {
   
    public void run() {
      JobInProgress job = null;
      while (true) {
        try {
          synchronized (jobInitQueue) {
            while (jobInitQueue.isEmpty()) {
              jobInitQueue.wait();
            }
            job = jobInitQueue.remove(0);
          }
          threadPool.execute(new InitJob(job));
        } catch (InterruptedException t) {
          LOG.info("JobInitManagerThread interrupted.");
          break;
        } 
      }
      LOG.info("Shutting down thread pool");
      threadPool.shutdownNow();
    }
  }

resortInitQueue 按照priority 和starttime 来排序
jobRemoved()
jobUpdated()
jobStateChanged()当priority或是starttime被改变的时候则重新调用resortInitQueue()重新排序

public EagerTaskInitializationListener(Configuration conf) {
    numThreads = conf.getInt("mapred.jobinit.threads", DEFAULT_NUM_THREADS);
    threadPool = Executors.newFixedThreadPool(numThreads);
  }

在JobTracker 启动的时候创建 mapred.jobinit.threads 改数量的线程去监控jobqueue.当jobqueue 中含有job的时候则initjob

class InitJob implements Runnable {
  
    private JobInProgress job;
    
    public InitJob(JobInProgress job) {
      this.job = job;
    }
    
//调用run方法 回调TaskTrackerManager
    public void run() {
      ttm.initJob(job);
    }
  }

调度其中核心逻辑在assignTasks()方法中
下面分析分析 FIFO模式下的 assignTasks()

@Override
  public synchronized List<Task> assignTasks(TaskTracker taskTracker)
      throws IOException {
    TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus(); 
ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
//获取集群中TaskTracker 总数
final int numTaskTrackers = clusterStatus.getTaskTrackers();
//集群中map slot总数
final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
 //集群中reduce slot 总数
    final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();

    Collection<JobInProgress> jobQueue =
      jobQueueJobInProgressListener.getJobQueue();

    //
    // Get map + reduce counts for the current tracker.
//
//当前的taskTracker 上map slot 总数
final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots();
//当前的taskTracker 上reduce slot 总数
final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots();
//当前的taskTracker上正在运行的 map数目
final int trackerRunningMaps = taskTrackerStatus.countMapTasks();
//当前的taskTracker上正在运行的 reduce数目
    final int trackerRunningReduces = taskTrackerStatus.countReduceTasks();

    // Assigned tasks
    List<Task> assignedTasks = new ArrayList<Task>();

    //
    // Compute (running + pending) map and reduce task numbers across pool
//
//该taskTracker上剩余的reduce数
int remainingReduceLoad = 0;
//该taskTracker 剩余的map数
    int remainingMapLoad = 0;
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() == JobStatus.RUNNING) {
          remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
          if (job.scheduleReduces()) {
            remainingReduceLoad += 
              (job.desiredReduces() - job.finishedReduces());
          }
        }
      }
    }

// Compute the 'load factor' for maps and reduces
//map因子
    double mapLoadFactor = 0.0;
    if (clusterMapCapacity > 0) {
      mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
    }
    double reduceLoadFactor = 0.0;
    if (clusterReduceCapacity > 0) {
      reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
    }
        
    //
    // In the below steps, we allocate first map tasks (if appropriate),
    // and then reduce tasks if appropriate.  We go through all jobs
    // in order of job arrival; jobs only get serviced if their 
    // predecessors are serviced, too.
    //

    //
    // We assign tasks to the current taskTracker if the given machine 
    // has a workload that's less than the maximum load of that kind of
    // task.
    // However, if the cluster is close to getting loaded i.e. we don't
    // have enough _padding_ for speculative executions etc., we only 
    // schedule the "highest priority" task i.e. the task from the job 
    // with the highest priority.
    //
    
    final int trackerCurrentMapCapacity = 
      Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity), 
                              trackerMapCapacity);
    int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
    boolean exceededMapPadding = false;
    if (availableMapSlots > 0) {
      exceededMapPadding = 
        exceededPadding(true, clusterStatus, trackerMapCapacity);
    }
    
    int numLocalMaps = 0;
    int numNonLocalMaps = 0;
    scheduleMaps:
    for (int i=0; i < availableMapSlots; ++i) {
      synchronized (jobQueue) {
        for (JobInProgress job : jobQueue) {
          if (job.getStatus().getRunState() != JobStatus.RUNNING) {
            continue;
          }

          Task t = null;
          
          // Try to schedule a node-local or rack-local Map task
          t = 
            job.obtainNewNodeOrRackLocalMapTask(taskTrackerStatus, 
                numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
          if (t != null) {
            assignedTasks.add(t);
            ++numLocalMaps;
            
            // Don't assign map tasks to the hilt!
            // Leave some free slots in the cluster for future task-failures,
            // speculative tasks etc. beyond the highest priority job
            if (exceededMapPadding) {
              break scheduleMaps;
            }
           
            // Try all jobs again for the next Map task 
            break;
          }
          
          // Try to schedule a node-local or rack-local Map task
          t = 
            job.obtainNewNonLocalMapTask(taskTrackerStatus, numTaskTrackers,
                                   taskTrackerManager.getNumberOfUniqueHosts());
          
          if (t != null) {
            assignedTasks.add(t);
            ++numNonLocalMaps;
            
            // We assign at most 1 off-switch or speculative task
            // This is to prevent TaskTrackers from stealing local-tasks
            // from other TaskTrackers.
            break scheduleMaps;
          }
        }
      }
    }
    int assignedMaps = assignedTasks.size();

    //
    // Same thing, but for reduce tasks
    // However we _never_ assign more than 1 reduce task per heartbeat
    //
    final int trackerCurrentReduceCapacity = 
      Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity), 
               trackerReduceCapacity);
    final int availableReduceSlots = 
      Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
    boolean exceededReducePadding = false;
    if (availableReduceSlots > 0) {
      exceededReducePadding = exceededPadding(false, clusterStatus, 
                                              trackerReduceCapacity);
      synchronized (jobQueue) {
        for (JobInProgress job : jobQueue) {
          if (job.getStatus().getRunState() != JobStatus.RUNNING ||
              job.numReduceTasks == 0) {
            continue;
          }

          Task t = 
            job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers, 
                                    taskTrackerManager.getNumberOfUniqueHosts()
                                    );
          if (t != null) {
            assignedTasks.add(t);
            break;
          }
          
          // Don't assign reduce tasks to the hilt!
          // Leave some free slots in the cluster for future task-failures,
          // speculative tasks etc. beyond the highest priority job
          if (exceededReducePadding) {
            break;
          }
        }
      }
    }
    
    if (LOG.isDebugEnabled()) {
      LOG.debug("Task assignments for " + taskTrackerStatus.getTrackerName() + " --> " +
                "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " + 
                trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" + 
                (trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
                assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps + 
                ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " + 
                trackerCurrentReduceCapacity + "," + trackerRunningReduces + 
                "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) + 
                ", " + (assignedTasks.size()-assignedMaps) + "]");
    }

    return assignedTasks;
  }

上面方法中真正执行task的方法为：
obtainNewNodeOrRackLocalMapTask 和obtainNewNonLocalMapTask
下一张详细的分析这两个方法

查看图片附件

7
顶

2
踩

分享到：

hadoop 源码分析(六)hadoop taskTracker ... | hadoop 源码分析(四)JobTracker 添加job ...

2013-04-01 11:07
浏览 3913
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论