[SPARK-32149] Improve file path name normalisation at block resolution within the external shuffle service - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 3.1.0
Component/s: Shuffle
Labels: None

Description

In the external shuffle service, during block resolution, the file paths (for disk-persisted RDD blocks and for shuffle blocks) are normalized by custom Spark code that uses an OS-dependent regexp. This code duplicates its package-private JDK counterpart.
Because the two implementations are not a perfect match, one method can produce a slightly different (but semantically equal) path than the other.

The reason for this redundant transformation is that the normalized path is interned to save some heap, which only works if both methods produce the same string.

Having checked the JDK code, I believe there is a better solution that matches the JDK code exactly, because it uses that package-private method. Moreover, based on some benchmarking, this new method also seems to be more performant.
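
A minimal sketch of the idea, assuming the package-private normalization is reached indirectly by constructing a `java.io.File` and reading the path back with `getPath()`. The class name, method name, and sample paths below are hypothetical and only illustrate the approach; they are not taken from the Spark code base.

```java
import java.io.File;

final class PathNormalizationSketch {

  // Hypothetical helper: concatenates the path parts without any custom regexp,
  // then lets java.io.File apply the JDK's own OS-specific normalization.
  // Interning the result lets equal paths share a single String on the heap.
  static String normalizeAndIntern(String localDir, String subDir, String filename) {
    String notNormalized = localDir + File.separator + subDir + File.separator + filename;
    return new File(notNormalized).getPath().intern();
  }

  public static void main(String[] args) {
    String sep = File.separator;
    // Two differently written, but semantically equal, local directories.
    String a = normalizeAndIntern("tmp" + sep + sep + "spark", "0c", "shuffle_1_1_0.index");
    String b = normalizeAndIntern("tmp" + sep + "spark" + sep, "0c", "shuffle_1_1_0.index");
    System.out.println(a);
    // Reference comparison on purpose: both calls should yield the same interned instance.
    System.out.println(a == b);
  }
}
```

Because the string comes from `File` itself, a `File` built from it later should not need to normalize it again, so the interned instance and the path held by that `File` stay identical.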

Issue Links

is related to

SPARK-32121 ExternalShuffleBlockResolverSuite failed on Windows (Resolved)

links to

[Github] Pull Request #28967 (attilapiros)

People

Assignee: Attila Zsolt Piros

Reporter: Attila Zsolt Piros

Votes: 0

Watchers: 3

Dates

Created: 01/Jul/20 13:30

Updated: 11/Jul/20 13:55

Resolved: 11/Jul/20 13:55