(Publisher of Peer Reviewed Open Access Journals)

International Journal of Advanced Computer Research (IJACR)

ISSN (Print):2249-7277    ISSN (Online):2277-7970
Volume-10 Issue-48 May-2020
Full-Text PDF
Paper Title : Implementation and performance analysis of dynamic partitioning of graphs in Apache Spark
Author Name : Geetha J, Jayalakshmi D S and Harshit N G
Abstract :

In recent years data is growing continuously at an exponential rate. Even in its processed form, large data is difficult to understand and analyze. One of the best approaches to handle large data is to represent it in the graphic format. These graphs can be used for data analysis, processing and decision making. Because of the volume of data is huge, the graphs generated by them are also huge making it difficult to process. Thus, the modular approach of studying them by partitioning them is much more effective. There are several inbuilt partitioning techniques available in Apache Spark which can be extended to graph partitioning according to the user needs. Graph partitioning is an active area of research with considerable activity with an aim towards increasing the accuracy and speed of the algorithms. This research work aims to build a high- performance graph partitioning technique for Apache Spark which makes it faster, scalable and efficient. In this paper, a custom algorithm that can divide the graph into an optimal number of partitions dynamically in Apache Spark is proposed. A novel method to calculate the distance between the nodes and similarity indexes to partition the data is introduced. The optimal number of partitions is decided to use similarity indexes, which are calculated using the concept of Laplacian matrix and Eigenvalues. The proposed algorithm is implemented and its performance is compared to the existing algorithms in Apache Spark. The results indicate that the algorithm partitions graphs which have a huge number of vertices with considerable efficiency and computing cost.

Keywords : Partition, Graphs, RDD, Clustering, Spark.
Cite this article : Geetha J, Jayalakshmi DS, Harshit NG. Implementation and performance analysis of dynamic partitioning of graphs in Apache Spark. International Journal of Advanced Computer Research. 2020; 10(48):116-127. DOI:10.19101/IJACR.2020.1048023.
References :
[1]Mokashi VS, Kulkarni DB. A review: scalable parallel graph partitioning for complex networks. In second international conference on intelligent computing and control systems 2018 (pp. 1869-71). IEEE.
[Crossref] [Google Scholar]
[2]Hassan M, Bansal SK. Data partitioning scheme for efficient distributed RDF querying using apache spark. In international conference on semantic computing 2019 (pp. 24-31). IEEE.
[Crossref] [Google Scholar]
[3]Gounaris A, Kougka G, Tous R, Montes CT, Torres J. Dynamic configuration of partitioning in spark applications. IEEE Transactions on Parallel and Distributed Systems. 2017; 28(7):1891-904.
[Crossref] [Google Scholar]
[4]Bertolucci M, Carlini E, Dazzi P, Lulli A, Ricci L. Static and dynamic big data partitioning on apache spark. In PARCO 2015 (pp. 489-98).
[Google Scholar]
[5]Abughofa T, Zulkernine F. Towards online graph processing with spark streaming. In international conference on big data 2017 (pp. 2787-94). IEEE.
[Crossref] [Google Scholar]
[6]Taloba AI, Riad MR, Soliman TH. Developing an efficient spectral clustering algorithm on large scale graphs in spark. In international conference on intelligent computing and information systems 2017 (pp. 292-8). IEEE.
[Crossref] [Google Scholar]
[7]Atashkar AH, Ghadiri N, Joodaki M. Linked data partitioning for RDF processing on Apache Spark. In international conference on web research 2017 (pp. 73-7). IEEE.
[Crossref] [Google Scholar]
[8]Tian X, Guo Y, Zhan J, Wang L. Towards memory and computation efficient graph processing on spark. In international conference on big data 2017 (pp. 375-82). IEEE.
[Crossref] [Google Scholar]
[9]Rajan AK, Bhaiya D. Accelerated kerninghan lin algorithm for graph partitioning. In international conference on advances in computing, communications and informatics 2017 (pp. 174-8). IEEE.
[Crossref] [Google Scholar]
[10]Wang M, Yang W, Li H, Lin Y, Chen J. Metis-CIC: a new mesh partitioning heuristic for parallel preconditioned iterative methods in CFD. In international conference on high performance computing & simulation 2016 (pp. 188-95). IEEE.
[Crossref] [Google Scholar]
[11]Kyong J, Jeon J, Lim SS. Improving scalability of apache spark-based scale-up server through docker container-based partitioning. In proceedings of the international conference on software and computer applications 2017 (pp. 176-80).
[Crossref] [Google Scholar]
[12]Olukanmi PO, Twala B. K-means-sharp: modified centroid update for outlier-robust k-means clustering. In pattern recognition association of south Africa and robotics and mechatronics 2017 (pp. 14-9). IEEE.
[Crossref] [Google Scholar]