International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print):2394-5443    ISSN (Online):2394-7454
Volume-8 Issue-82 September-2021
Paper Title : Factors affecting cloud data-center efficiency: a scheduling algorithm-based analysis
Author Name : Arif Ahmad Shehloo, Muheet Ahmed Butt and Majid Zaman
Abstract :

Cloud computing encompasses two massively scalable services, computing capability and data storage space, which are provided by a massive number of machines and clusters. The increased use of big data has led to the adoption of a wide range of analytics engines, and Hadoop in particular has gained widespread acceptance as a data analytics platform. Over the past decade, Hadoop's ability to schedule tasks has become a critical aspect of system performance. Numerous researchers have proposed scheduling methods to address the complex issue of performance degradation. However, few studies to date have evaluated the effectiveness of these methods. Using the PRISMA approach to search for and select papers, we examine the design choices behind Hadoop scheduling techniques proposed between 2008 and 2021. We present a taxonomy for succinctly categorising these scheduling techniques and evaluate them against a variety of performance metrics. Our search identified 82 studies relevant to this domain, all of which came from high-quality conferences, journals, symposiums, and workshops. This systematic review discusses dynamic, constrained, and adaptive scheduling methods and their primary motivations, including makespan, data control, deadline, resource utilisation, load balancing, fairness, energy efficiency, and failure recovery. It identifies the most critical factors affecting Hadoop scheduler performance, discusses unresolved issues and potential directions for extending existing studies, and provides a roadmap for researchers working in this field. Finally, we intend to expand on the qualitative analysis conducted thus far and offer researchers additional recommendations for future cloud scheduling research.

Keywords : Big data, Cloud computing, Apache Hadoop, MapReduce, Task scheduling.
Cite this article : Shehloo AA, Butt MA, Zaman M. Factors affecting cloud data-center efficiency: a scheduling algorithm-based analysis. International Journal of Advanced Technology and Engineering Exploration. 2021; 8(82):1136-1167. DOI:10.19101/IJATEE.2021.874313.