  1. Apache Arrow
  2. ARROW-9991

[C++] split kernels for strings/binary

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: C++

    Description

      Similar to Python str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes).

      When a separator is given, the algorithm is the same for both types. Python, however, overloads split: when no separator is given, it splits on runs of whitespace (Unicode whitespace for str, ASCII whitespace for bytes).
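      To make the overloading concrete, here is a small sketch using only Python's built-in str.split and bytes.split. The sample strings are made up for illustration; U+2003 (EM SPACE) is chosen because it is Unicode whitespace but not ASCII whitespace.

```python
# With an explicit separator, str and bytes behave the same way:
# every occurrence splits, and empty fields are kept.
str_parts = "a,b,,c".split(",")
bytes_parts = b"a,b,,c".split(b",")

# With no separator, runs of whitespace act as a single separator and
# leading/trailing whitespace is dropped.
collapsed = "  a   b ".split()
# An explicit single-space separator does NOT collapse runs.
not_collapsed = "  a   b ".split(" ")

# str.split() considers Unicode whitespace; bytes.split() only ASCII.
text = "a b\u2003c"  # U+2003 EM SPACE between 'b' and 'c'
unicode_split = text.split()
ascii_split = text.encode("utf-8").split()

print(str_parts, bytes_parts)
print(collapsed, not_collapsed)
print(unicode_split, ascii_split)
```

      The last pair shows why separate utf8 and ascii whitespace variants are needed, while a single separator-based kernel is enough: the str version splits at the EM SPACE, the bytes version leaves its UTF-8 encoding inside the second field.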

      I'd rather not see too many overloaded kernels, e.g.

      binary_split (takes string/binary separator, and maxsplit arg, no special utf8 version needed)

      utf8_split_whitespace (similar to Python's version given no separator)

      ascii_split_whitespace (similar to Python's version given no separator, but considering ascii, although this could work on any binary data)

      There could also be rsplit versions of these, or the split direction could be an argument.
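      The proposed kernels can be modelled in pure Python to show the intended semantics. This is only a sketch, not the Arrow C++ implementation or its API: the kernel names come from the proposal above, while the parameter names (max_splits, reverse), the use of Python lists as stand-ins for arrays, and the pass-through of None for nulls are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Union

Value = Union[str, bytes]


def _split_each(values, separator, max_splits, reverse):
    # Shared helper (hypothetical): split every element, keep None as None.
    out = []
    for v in values:
        if v is None:
            out.append(None)  # assumed null handling
        elif reverse:
            out.append(v.rsplit(separator, max_splits))
        else:
            out.append(v.split(separator, max_splits))
    return out


def binary_split(values: Sequence[Optional[Value]], separator: Value,
                 max_splits: int = -1, reverse: bool = False):
    # Explicit separator: the same algorithm serves str and bytes.
    return _split_each(values, separator, max_splits, reverse)


def utf8_split_whitespace(values: Sequence[Optional[str]],
                          max_splits: int = -1, reverse: bool = False):
    # No separator on str: splits on runs of Unicode whitespace.
    return _split_each(values, None, max_splits, reverse)


def ascii_split_whitespace(values: Sequence[Optional[bytes]],
                           max_splits: int = -1, reverse: bool = False):
    # No separator on bytes: splits on runs of ASCII whitespace only.
    return _split_each(values, None, max_splits, reverse)


forward = binary_split(["a,b,c", None], ",", max_splits=1)
backward = binary_split(["a,b,c", None], ",", max_splits=1, reverse=True)
utf8_ws = utf8_split_whitespace(["a \u2003b"])
ascii_ws = ascii_split_whitespace([b"a \t b"])
print(forward, backward, utf8_ws, ascii_ws)
```

      In this sketch the rsplit behaviour is an argument rather than a separate kernel, which keeps the number of kernels at three; separate rsplit kernels would work equally well.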

       

            People

              Assignee: Maarten Breddels (maartenbreddels)
              Reporter: Maarten Breddels (maartenbreddels)
              Votes: 1
              Watchers: 6

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 7h 50m