Study on Spark performance tuning strategy based on Skewed Partitioning and Locality-aware Partitioning
Guikun Cao
Haiyuan Yu
Liujia Chang
Heng Zhao
DOI: https://doi.org/10.59429/esta.v10i4.1590
Keywords: Skewed Partitioning; Locality-aware Partitioning; Performance tuning; Spark; Data skew
Abstract
Apache Spark is a large-scale data processing engine widely used in a variety of big data analysis tasks. However, data skew and poor data locality can degrade the performance of Spark applications. This paper investigates Spark performance tuning strategies based on Skewed Partitioning and Locality-aware Partitioning. First, the influence of data skew and data locality problems on Spark performance is analyzed; then a performance tuning method combining Skewed Partitioning and Locality-aware Partitioning is proposed. Experimental results show that, compared with the traditional HashPartitioner, this method can significantly improve the efficiency of Spark jobs when processing large data sets.
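As an illustration of the skew-handling side of such a strategy, the sketch below shows a custom Spark Partitioner that reserves a dedicated partition for each key identified as skewed (for example, by sampling key frequencies) and hash-partitions the remaining keys over the leftover partitions, much as HashPartitioner does. The class name SkewAwarePartitioner, the hotKeys parameter, and the reservation scheme are assumptions for illustration only, not the authors' exact algorithm.

import org.apache.spark.Partitioner

// Illustrative sketch (assumed design, not the paper's exact method):
// each skewed key gets its own reserved partition; all other keys fall
// back to hash partitioning over the remaining partition range.
class SkewAwarePartitioner(totalPartitions: Int, hotKeys: Seq[Any]) extends Partitioner {
  require(totalPartitions > hotKeys.size, "need spare partitions for non-skewed keys")

  // Deterministic mapping: skewed key -> reserved partition id in [0, hotKeys.size)
  private val hotKeyIndex: Map[Any, Int] = hotKeys.zipWithIndex.toMap
  private val regularPartitions: Int = totalPartitions - hotKeys.size

  override def numPartitions: Int = totalPartitions

  override def getPartition(key: Any): Int =
    hotKeyIndex.get(key) match {
      case Some(reserved) => reserved // skewed key goes to its dedicated partition
      case None =>
        // Non-skewed keys are hash partitioned over the leftover range,
        // mirroring HashPartitioner's behavior.
        val h = if (key == null) 0 else key.hashCode
        hotKeys.size + math.abs(h % regularPartitions)
    }
}

A pair RDD could then be repartitioned with, for example, rdd.partitionBy(new SkewAwarePartitioner(200, sampledHotKeys)). The locality-aware part of the strategy would additionally take block and executor locations into account when assigning partitions, which is beyond this sketch.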