Study on Spark performance tuning strategy based on Skewed Partitioning and Locality-aware Partitioning
Guikun Cao
Haiyuan Yu
Liujia Chang
Heng Zhao
DOI: https://doi.org/10.59429/esta.v10i4.1590
Keywords: Skewed Partitioning; Locality-aware Partitioning; Performance tuning; Spark; Data skew
Abstract
Apache Spark is a large-scale data processing engine widely used in a variety of big data analysis tasks. However, data skew and poor data locality can degrade the performance of Spark applications. This paper investigates Spark performance tuning strategies based on Skewed Partitioning and Locality-aware Partitioning. First, the influence of data skew and data locality problems on Spark performance is analyzed; then a performance tuning method combining Skewed Partitioning and Locality-aware Partitioning is proposed. Experimental results show that, compared with the traditional HashPartitioner, this method can significantly improve the efficiency of Spark jobs when processing large data sets.
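As an illustration of the skew-handling side of such a strategy, the sketch below shows a custom Spark Partitioner that reserves a dedicated partition for each key identified as skewed (for example, by sampling key frequencies) and hash-partitions the remaining keys over the leftover partitions, much as HashPartitioner does. The class name SkewAwarePartitioner, the hotKeys parameter, and the reservation scheme are assumptions for illustration only, not the authors' exact algorithm.

import org.apache.spark.Partitioner

// Illustrative sketch (assumed design, not the paper's exact method):
// each skewed key gets its own reserved partition; all other keys fall
// back to hash partitioning over the remaining partition range.
class SkewAwarePartitioner(totalPartitions: Int, hotKeys: Seq[Any]) extends Partitioner {
  require(totalPartitions > hotKeys.size, "need spare partitions for non-skewed keys")

  // Deterministic mapping: skewed key -> reserved partition id in [0, hotKeys.size)
  private val hotKeyIndex: Map[Any, Int] = hotKeys.zipWithIndex.toMap
  private val regularPartitions: Int = totalPartitions - hotKeys.size

  override def numPartitions: Int = totalPartitions

  override def getPartition(key: Any): Int =
    hotKeyIndex.get(key) match {
      case Some(reserved) => reserved // skewed key goes to its dedicated partition
      case None =>
        // Non-skewed keys are hash partitioned over the leftover range,
        // mirroring HashPartitioner's behavior.
        val h = if (key == null) 0 else key.hashCode
        hotKeys.size + math.abs(h % regularPartitions)
    }
}

A pair RDD could then be repartitioned with, for example, rdd.partitionBy(new SkewAwarePartitioner(200, sampledHotKeys)). The locality-aware part of the strategy would additionally take block and executor locations into account when assigning partitions, which is beyond this sketch.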