  1. Apache Arrow
  2. ARROW-16117

[JS] Improve UTF8 decoding performance


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: JavaScript
    • Environment: MacOS, Chrome, Safari

    Description

      While profiling in-browser decoding of the TPC-H Customer and Part datasets, which contain many UTF-8 strings, it turned out that much of the time was being spent in getVariableWidthBytes rather than in TextDecoder itself. Ideally, nearly all of the time should be spent in TextDecoder.

      On Chrome, getVariableWidthBytes took up to 15% of the end-to-end decoding latency, and on Safari it was close to 40% (Safari's TextDecoder is much faster than Chrome's, so getVariableWidthBytes accounted for a relatively larger share of the time).

      The code in this PR is likely faster because it is more amenable to the V8 and JSC JITs: x and y are now guaranteed to be Smis ("small integers") rather than arbitrary objects, which allows the JIT to emit efficient machine instructions that deal only in 32-bit integers. In the old code, once V8 discovers that x and y can potentially be null (upon iterating past the bounds), it "poisons" the codepath forever, since it has to handle the null case.

      See this V8 post for a more in-depth explanation (in particular see the examples underneath "Performance tips"):
      https://v8.dev/blog/elements-kinds

      Doing the bounds check explicitly rather than implicitly all but eliminates getVariableWidthBytes from the profile. Empirically, on my machine, decoding TPC-H Part dropped from 1.9s to 1.7s on Chrome, and Customer dropped from 1.4s to 1.2s.
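
      The difference between an implicit and an explicit bounds check can be sketched as follows. This is a minimal illustration, not the actual Arrow source: the function names sliceImplicit, sliceExplicit, and decodeAll, and the sample data, are hypothetical. It only shows that both variants return the same slices; it does not measure performance.

```typescript
// Hypothetical helpers for illustration; not Arrow's actual implementation.

// Implicit bounds check: reading past the end of a typed array yields
// undefined, so x and y are not always numbers at runtime.
function sliceImplicit(values: Uint8Array, valueOffsets: Int32Array, index: number): Uint8Array | null {
    const x = valueOffsets[index];
    const y = valueOffsets[index + 1];
    return x != null && y != null ? values.subarray(x, y) : null;
}

// Explicit bounds check: the index is validated first, so the reads below
// are always in range and x and y are always 32-bit integers.
function sliceExplicit(values: Uint8Array, valueOffsets: Int32Array, index: number): Uint8Array | null {
    if (index < 0 || index + 1 >= valueOffsets.length) {
        return null;
    }
    const x = valueOffsets[index];
    const y = valueOffsets[index + 1];
    return values.subarray(x, y);
}

// Sample data (assumed): three strings stored back to back, with an
// offsets array holding one more entry than there are strings.
const values = new TextEncoder().encode("foobarbaz");
const valueOffsets = Int32Array.from([0, 3, 6, 9]);
const decoder = new TextDecoder("utf-8");

// Decode every string using the given slicing function.
function decodeAll(slice: typeof sliceExplicit): string[] {
    const out: string[] = [];
    for (let i = 0; i < valueOffsets.length - 1; i++) {
        const bytes = slice(values, valueOffsets, i);
        if (bytes !== null) {
            out.push(decoder.decode(bytes));
        }
    }
    return out;
}
```

      Both variants return null for an out-of-range index; the explicit version simply decides that before touching the offsets array, which is the property the description above relies on.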

      https://github.com/apache/arrow/pull/12793

       


            People

              domoritz Dominik Moritz
              hzuo Howard Zuo
              Votes: 0
              Watchers: 3


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h