Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- None
- Environment: macOS, Chrome, Safari
Description
While profiling in-browser decoding of the TPC-H Customer and Part tables (datasets containing many UTF-8 string columns), it turned out that much of the time was being spent in getVariableWidthBytes rather than in TextDecoder itself. Ideally, nearly all of the time would be spent in TextDecoder.
On Chrome, getVariableWidthBytes accounted for up to ~15% of the end-to-end decoding latency, and on Safari close to ~40% (Safari's TextDecoder is much faster than Chrome's, so getVariableWidthBytes takes up a relatively larger share there).
This is likely because the code in this PR is more amenable to V8's and JSC's JITs: x and y are now guaranteed to be SMIs ("small integers") rather than Objects, which allows the JIT to emit efficient machine instructions that deal only in 32-bit integers. Once V8 discovers that x and y can potentially be null (upon reading past the array bounds), it "poisons" that codepath permanently, since the generated code must then handle the null case.
See this V8 post for a more in-depth explanation (in particular see the examples underneath "Performance tips"):
https://v8.dev/blog/elements-kinds
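A minimal sketch of the pattern described above. This is not the actual PR code; the function names and the `valueOffsets`/`index` parameters are hypothetical, chosen only to illustrate why an explicit bounds check keeps the loads monomorphic on 32-bit integers:

```typescript
// Before: reading one element past a position that may be out of range.
// An out-of-range read on a TypedArray yields `undefined` at runtime, so
// the JIT must handle a non-SMI result and deoptimizes the fast path.
function implicitBounds(valueOffsets: Int32Array, index: number): number {
  const x = valueOffsets[index];     // may be out of range
  const y = valueOffsets[index + 1]; // may be out of range
  return y - x;                      // NaN when either read was out of range
}

// After: check the bounds explicitly so every read is provably in range.
// x and y are then always 32-bit integers (SMIs), and the JIT can keep
// emitting the fast integer-only machine code.
function explicitBounds(valueOffsets: Int32Array, index: number): number {
  if (index < 0 || index + 1 >= valueOffsets.length) {
    return 0;
  }
  const x = valueOffsets[index];
  const y = valueOffsets[index + 1];
  return y - x;
}
```

The behavioral difference is invisible for in-range indices; the win is that the explicit check moves the "out of range" case out of the hot loads, so the JIT never observes a non-integer value flowing through x or y.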
Doing the bounds check explicitly instead of implicitly essentially eliminates this function from the profile. Empirically, on my machine, decoding TPC-H Part dropped from 1.9s to 1.7s on Chrome, and Customer dropped from 1.4s to 1.2s.
Issue Links
- links to https://github.com/apache/arrow/pull/12793