Lecture title: Hierarchical MapReduce: Towards Simplified Cross-domain Data Processing
Speaker: Dr. Yuan Luo (骆远)
MapReduce is a programming model well suited to processing large datasets with high-throughput parallelism across a large number of compute resources. While it has proven useful for data-intensive, high-throughput applications, the conventional MapReduce model is limited to scheduling jobs within a single cluster. As job sizes grow, single-cluster solutions become increasingly inadequate. In addition, the input dataset may be very large and widely distributed across multiple clusters, so repeatedly feeding large datasets to remote computing resources becomes a bottleneck. When mapping such data-intensive tasks to compute resources, a scheduling algorithm must decide whether to bring the data to the computation or the computation to the data. We present a Hierarchical MapReduce framework that gathers computation resources from different clusters and runs MapReduce jobs across them. Applications implemented in this framework adopt the Map-Reduce-GlobalReduce model, in which computations are expressed as three functions: Map, Reduce, and GlobalReduce. Two scheduling algorithms are introduced: Compute Capacity Aware Scheduling for compute-intensive jobs and Data Location Aware Scheduling for data-intensive jobs. Experimental evaluations using AutoDock, a molecule binding prediction tool, and grep demonstrate promising results for our framework.
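The Map-Reduce-GlobalReduce model described above can be illustrated with a minimal single-process sketch in Python, using a grep-style match count as the job. This is not the framework's API: every function name here is hypothetical, the "clusters" are plain in-memory lists, and the capacity-proportional split is only a simple stand-in for the idea behind Compute Capacity Aware Scheduling, assuming each cluster's capacity can be summarized as one number (for example, free cores).

```python
import re
from collections import defaultdict


def map_fn(line, pattern):
    """Map: emit (pattern, 1) for each input line that matches the pattern."""
    if re.search(pattern, line):
        yield pattern, 1


def reduce_fn(key, values):
    """Reduce: runs within one cluster and combines that cluster's Map output."""
    return key, sum(values)


def global_reduce_fn(key, values):
    """GlobalReduce: combines the Reduce outputs collected from all clusters."""
    return key, sum(values)


def run_local_job(lines, pattern):
    """Simulate an ordinary MapReduce job inside a single cluster."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line, pattern):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in grouped.items()]


def partition_by_capacity(items, capacities):
    """Split items across clusters in proportion to an assumed capacity score.

    `capacities` maps a hypothetical cluster name to a positive integer.
    """
    total = sum(capacities.values())
    names = list(capacities)
    parts, start, cumulative = {}, 0, 0
    for i, name in enumerate(names):
        cumulative += capacities[name]
        # The last cluster takes whatever remains so that no item is dropped.
        end = len(items) if i == len(names) - 1 else len(items) * cumulative // total
        parts[name] = items[start:end]
        start = end
    return parts


def hierarchical_grep(lines, pattern, capacities):
    """Run one local job per cluster, then merge the results with GlobalReduce."""
    grouped = defaultdict(list)
    for chunk in partition_by_capacity(lines, capacities).values():
        for key, value in run_local_job(chunk, pattern):
            grouped[key].append(value)
    return dict(global_reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["error a", "ok", "error b", "ok", "error c", "ok"]
    print(hierarchical_grep(sample, "error", {"cluster_a": 2, "cluster_b": 1}))
```

In this sketch, only the small per-cluster Reduce outputs cross cluster boundaries, which is the motivation for adding a GlobalReduce stage rather than shipping raw Map output between sites. A data-location-aware variant would presumably assign each input to the cluster that already stores it instead of splitting by capacity.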