Apache Arrow / ARROW-6131

[C++] Optimize the Arrow UTF-8-string-validation

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).

      Range-based algorithm:
      1. Map each byte of the input string to a range table.
      2. Use the NEON 'tbl' instruction to perform the table lookup.
      3. Find the pattern and set the correct table index for each input byte.
      4. Validate the input string.
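
The range idea behind the steps above can be illustrated without SIMD. The following is a minimal scalar sketch, not the NEON implementation from the linked repository: each byte must fall inside an allowed [lo, hi] range that depends on the lead byte before it. The function name `ValidateUtf8Scalar` is hypothetical and is not part of Arrow; the ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar illustration of range-based UTF-8 validation.
// Returns true if data[0..len) is well-formed UTF-8.
bool ValidateUtf8Scalar(const uint8_t* data, size_t len) {
  size_t i = 0;
  while (i < len) {
    const uint8_t b = data[i];
    if (b <= 0x7F) {  // ASCII: a single byte, no continuation bytes
      ++i;
      continue;
    }
    size_t n = 0;                  // number of continuation bytes expected
    uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the first continuation byte
    if (b >= 0xC2 && b <= 0xDF) {
      n = 1;
    } else if (b >= 0xE0 && b <= 0xEF) {
      n = 2;
      if (b == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
      if (b == 0xED) hi = 0x9F;  // reject UTF-16 surrogates
    } else if (b >= 0xF0 && b <= 0xF4) {
      n = 3;
      if (b == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
      if (b == 0xF4) hi = 0x8F;  // reject code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
    }
    if (len - i <= n) return false;  // sequence is truncated
    if (data[i + 1] < lo || data[i + 1] > hi) return false;
    for (size_t k = 2; k <= n; ++k) {
      if (data[i + k] < 0x80 || data[i + k] > 0xBF) return false;
    }
    i += n + 1;
  }
  return true;
}
```

A SIMD version replaces the per-byte branches with table lookups that produce the same ranges for 16 bytes at once, which is why it helps non-ASCII input most.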

      The algorithm speeds up UTF-8 validation by about 1.6x in the LargeNonAscii and SmallNonAscii benchmarks, but it slows down the all-ASCII cases (where the input data consists entirely of ASCII strings).
      The benchmarked API is ValidateUTF8.

      As far as I know, all-ASCII data is unusual on the internet.
      Could you please tell me what the typical use cases for Apache Arrow are?
      Is the string data that Arrow needs to validate usually all-ASCII?

      If not, I'd like to submit a patch to accelerate non-ASCII validation.

      For all-ASCII validation, I would like to propose a separate SIMD optimization in another JIRA issue.
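
To show why all-ASCII input can be treated as a separate case, here is a minimal portable sketch of an ASCII fast path that tests eight bytes at a time for a set high bit. It is only an illustration under my own assumptions, not the SIMD solution the reporter intends to propose, and `IsAllAscii` is a hypothetical name rather than an Arrow API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true if every byte in data[0..len) is < 0x80.
bool IsAllAscii(const uint8_t* data, size_t len) {
  size_t i = 0;
  // Check 8 bytes per iteration; memcpy avoids unaligned-access issues.
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    if (word & 0x8080808080808080ULL) return false;  // some high bit is set
  }
  // Check the remaining tail bytes one at a time.
  for (; i < len; ++i) {
    if (data[i] & 0x80) return false;
  }
  return true;
}
```

A validator could call such a check first and fall back to full UTF-8 validation only when it returns false.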

            People

              Assignee: Unassigned
              Reporter: Yuqi Gu (yqGu)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h