Spark / SPARK-46837

String function support (parent)

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      TODO: List of all functions that need to be updated for collation support:

       

      Feature/function      Priority  Type
      Shuffle               0         comparison
      Delta Columns         0         storage
      Partition key         0         storage
      Comparison operators  0         comparison
      IN list               0         comparison
      GROUP BY              0         comparison
      MERGE, HASH joins     0         comparison
      ORDER BY              0         sorting
      Aggregation           0         comparison
      like                  0         comparison
      regexp_*              0         matching
      concat                0         pass through
      substr                0         pass through
      between               0         comparison
      coalesce              0         pass through
      Is distinct           0         comparison
      trim                  0         pass through
      instr                 0         comparison
      lcase                 0         pass through, modify
      lower                 0         pass through, modify
      replace               0         comparison
      ucase                 0         modify, pass through
      upper                 0         modify, pass through
      count(distinct)       0         comparison
      min/max               0         comparison, pass through
      array                 0         pass through
      case                  0         pass through
      decode                0         pass through
      elt                   0         comparison, pass through
      nullif, nvl, nvl2     0         pass through, comparison

      Session variables     1         storage
      SQL UDF               1         storage, pass through
      Python UDF            1         storage
      Array element         1         storage
      Map key               1         storage, comparison
      Map value             1         storage
      Struct field          1         storage
      least/greatest        1         comparison, pass through
      if/iff/ifnull         1         pass through, comparison
      mapExpr [ keyExpr ]   1         comparison, pass through
      concat_ws             1         pass through
      contains              1         comparison
      left                  1         pass through
      *pad                  1         pass through
      repeat                1         pass through
      reverse               1         pass through
      translate             1         comparison, pass through
      array_agg             1         pass through
      first/last/any        1         pass through
      mode                  1         comparison, pass through
      array_*               1         pass through, dedup (array distinct)

      explode               2         pass through
      filter                2         pass through
      flatten               2         pass through
      inline*               2         pass through
      reduce                2         pass through
      reverse               2         pass through
      shuffle               2         pass through
      slice                 2         pass through
      sort_array            2         comparison, pass through
      transform             2         pass through
      zip*                  2         pass through
      map                   2         pass through
      map_*                 2         pass through
      str_to_map            2         comparison, pass through
      transform*            2         pass through
      stack                 2         pass through
      describe              2         display
      ilike                 2         matching
      charindex             2         comparison
      endswith              2         comparison
      startswith            2         comparison
      find_in_set           2         comparison
      initcap               2         pass through, modify
      locate                2         comparison
      mask                  2         pass through
      overlay               2         pass through
      position              2         comparison
      sentences             2         comparison, pass through
      split                 2         comparison, pass through
      split_part            2         comparison, pass through
      collect_list          2         pass through
      collect_set           2         pass through
      min_by/max_by         2         comparison, pass through
      element_at, []        2         pass through
      aggregate             2         pass through
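
The Type column groups each feature by how it interacts with a collation. The following is a minimal sketch of those categories in plain Python, not Spark code. It assumes a collation can be modeled as a function that maps a string to a comparison key, and it approximates a lowercase collation with `str.lower()`, which is simpler than what Spark does. `CollatedString`, `COLLATION_KEYS`, and every function below are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical model: a collation is a function mapping a string to the key
# used when comparing it. These names are illustrative, not Spark APIs.
COLLATION_KEYS = {
    "UTF8_BINARY": lambda s: s,         # compare the characters as-is
    "UTF8_LCASE": lambda s: s.lower(),  # simplified case-insensitive key
}


@dataclass(frozen=True)
class CollatedString:
    value: str
    collation: str = "UTF8_BINARY"

    def key(self) -> str:
        return COLLATION_KEYS[self.collation](self.value)


# "comparison": the result depends on the collation key, not the raw value.
def equals(a: CollatedString, b: CollatedString) -> bool:
    return a.key() == b.key()


def contains(haystack: CollatedString, needle: CollatedString) -> bool:
    return needle.key() in haystack.key()


# "sorting": rows are ordered by their collation key.
def order_by(rows):
    return sorted(rows, key=lambda r: r.key())


# "comparison" as used by GROUP BY: rows with equal keys share one group.
def group_counts(rows):
    counts = defaultdict(int)
    for r in rows:
        counts[r.key()] += 1
    return dict(counts)


# "pass through": nothing is compared; the collation is carried to the result.
def concat(a: CollatedString, b: CollatedString) -> CollatedString:
    return CollatedString(a.value + b.value, a.collation)


# "modify": the value changes, the collation tag is kept.
def upper(a: CollatedString) -> CollatedString:
    return CollatedString(a.value.upper(), a.collation)
```

Under this model, `equals(CollatedString("Spark", "UTF8_LCASE"), CollatedString("SPARK", "UTF8_LCASE"))` is true, while the same comparison with the default `UTF8_BINARY` is false. Functions marked "pass through" never call `key()`, which is why they are expected to need less work than the comparison-type entries.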

        Issue Links

          1. contains, startswith, endswith (binary & lowercase collation only) | Sub-task | Resolved | Uroš Bojanić
          2. contains (all collations) | Sub-task | Resolved | Uroš Bojanić
          3. startswith, endswith (all collations) | Sub-task | Resolved | Stevo Mitric
          4. new test suite for UTF8String | Sub-task | Resolved | Uroš Bojanić
          5. fail all unsupported functions | Sub-task | Resolved | Uroš Bojanić
          6. Resolve AbstractDataType simpleStrings for StringTypeCollated | Sub-task | Resolved | Mihailo Milosevic
          7. refactor UTF8String and CollationFactory | Sub-task | Resolved | Uroš Bojanić
          8. Fix CollationSupport test output | Sub-task | Resolved | Unassigned
          9. endsWith and startsWith don't work correctly for some collations | Sub-task | Resolved | Vladimir Golubev
          10. Add benchmark for stringpredicate expressions | Sub-task | Resolved | Uroš Bojanić
          11. Optimize string predicate expressions for UTF8_BINARY_LCASE collation | Sub-task | Resolved | Uroš Bojanić
          12. Regexp expressions (binary & lowercase collation only) | Sub-task | Resolved | Uroš Bojanić
          13. Add support for ConcatWs & Elt (all collations) | Sub-task | Resolved | Mihailo Milosevic
          14. Add support for Upper, Lower, InitCap (all collations) | Sub-task | Resolved | Mihailo Milosevic
          15. Fix Upper, Lower, InitCap collation awareness | Sub-task | Resolved | Uroš Bojanić
          16. Fix Upper & Lower expressions for UTF8_BINARY_LCASE | Sub-task | Resolved | Uroš Bojanić
          17. Fix InitCap expression | Sub-task | Resolved | Uroš Bojanić
          18. StringRepeat (all collations) | Sub-task | Resolved | Milan Dankovic
          19. StringReplace (all collations) | Sub-task | Resolved | Uroš Bojanić
          20. Overlay, FormatString, Length, BitLength, OctetLength, SoundEx, Luhncheck (all collations) | Sub-task | Resolved | Nikola Mandic
          21. StringTranslate (all collations) | Sub-task | Resolved | Milan Dankovic
          22. StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only) | Sub-task | Resolved | David Milicevic
          23. StringInstr, FindInSet (all collations) | Sub-task | Resolved | Milan Dankovic
          24. StringLPad, StringRPad (all collations) | Sub-task | Resolved | Gideon P
          25. Substring, Right, Left (all collations) | Sub-task | Resolved | Gideon P
          26. Levenshtein (all collations) | Sub-task | Resolved | Uroš Bojanić
          27. When the collationId is invalid, throw `COLLATION_INVALID_ID` | Sub-task | Resolved | Unassigned
          28. Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences (all collations) | Sub-task | Resolved | Nikola Mandic
          29. SplitPart (binary & lowercase collation only) | Sub-task | Resolved | Uroš Bojanić
          30. Add Collation Support for trim/ltrim/rtrim | Sub-task | Resolved | Unassigned
          31. Mode expression for strings (all collations) | Sub-task | Resolved | Gideon P
          32. Mode expression for Arrays and Structs (all collations) | Sub-task | Resolved | Gideon P
          33. Mode expression for MapType (all collations) | Sub-task | Resolved | Uroš Bojanić
          34. PandasMode (all collations) | Sub-task | Open | Unassigned
          35. StringToMap & Mask (all collations) | Sub-task | Resolved | Uroš Bojanić
          36. Fix mathExpressions that use StringType | Sub-task | Resolved | Mihailo Milosevic
          37. Use wildcard imports in CollationTypeCasts | Sub-task | Resolved | Unassigned
          38. Format expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          39. Variant expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          40. Add support for AbstractMapType | Sub-task | Resolved | Uroš Bojanić
          41. Type casting for AbstractMapType | Sub-task | Resolved | Uroš Bojanić
          42. URL expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          43. Miscellaneous expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          44. CurrentLike - Database/Schema, Catalog, User (all collations) | Sub-task | Resolved | Uroš Bojanić
          45. JSON expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          46. CSV expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          47. XML expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          48. XPath expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          49. inputFile expressions (all collations) | Sub-task | Resolved | Uroš Bojanić
          50. DateFormatClass (all collations) | Sub-task | Resolved | Nebojsa Savic
          51. Datetime expressions (all collations) | Sub-task | Resolved | Nebojsa Savic
          52. Alter string search logic for: startsWith, endsWith, contains, locate (UTF8_BINARY_LCASE) | Sub-task | Resolved | Uroš Bojanić
          53. Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE) | Sub-task | Resolved | Uroš Bojanić
          54. Alter string search logic for: replace, find_in_set (UTF8_BINARY_LCASE) | Sub-task | Resolved | Uroš Bojanić
          55. Implement modified Lowercase operation for UTF8_BINARY_LCASE | Sub-task | Resolved | Uroš Bojanić
          56. Alter string search logic for: translate (UTF8_BINARY_LCASE) | Sub-task | Resolved | Uroš Bojanić
          57. Alter string search logic for: trim (UTF8_BINARY_LCASE) | Sub-task | Resolved | Uroš Bojanić
          58. Improve collation testing surface area using expression walking | Sub-task | Resolved | Mihailo Milosevic
          59. Enable reflect expressions with collated strings | Sub-task | Resolved | Mihailo Milosevic
          60. Fix DateSub, DateAdd, WindowTime, TimeWindow and SessionWindow expressions | Sub-task | Resolved | Mihailo Milosevic
          61. Fix FrameLessOffsetWindowFunction expressions implicit casting | Sub-task | Resolved | Mihailo Milosevic
          62. Fix Like simplification in Optimizer (for UTF8_LCASE collation) | Sub-task | Resolved | Uroš Bojanić
          63. Fix StructsToXml expression with collations | Sub-task | Resolved | Mihailo Milosevic
          64. Use ICU in Lower/Upper expressions (UTF8_BINARY collation) | Sub-task | Resolved | Uroš Bojanić
          65. Use ICU in InitCap expression (UTF8_BINARY collation) | Sub-task | Resolved | Uroš Bojanić
          66. Refine collation API | Sub-task | Resolved | Uroš Bojanić
          67. UTF8String to String conversions should use Unicode replacement logic | Sub-task | Resolved | Uroš Bojanić
          68. Use lowerCaseCodePoints in string functions for UTF8_LCASE | Sub-task | Resolved | Uroš Bojanić
          69. Improve collation support testing for various collations | Sub-task | Open | Unassigned
          70. Improve collation support testing for various expressions | Sub-task | Resolved | Uroš Bojanić
          71. Improve collation support testing - add golden files | Sub-task | Resolved | Viktor Lučić
          72. Improve collation support testing - add expression-level unit tests | Sub-task | Resolved | Uroš Bojanić
          73. Improve collation support testing - unit tests for comparison & equality | Sub-task | Resolved | Apache Spark
          74. Improve collation support testing - unit tests for Contains, StartsWith, and EndsWith | Sub-task | Resolved | Uroš Bojanić
          75. Improve collation support testing - unit tests for Upper, Lower, and InitCap | Sub-task | Resolved | Uroš Bojanić
          76. Improve collation support testing - unit tests for FindInSet | Sub-task | Resolved | Uroš Bojanić
          77. Improve collation support testing - unit tests for StringTranslate | Sub-task | Resolved | Uroš Bojanić
          78. Fix SchemaOfJson Expression to work with Collations | Sub-task | Open | Unassigned
          79. Fix SplitPart one-to-many case mapping (UTF8_LCASE) | Sub-task | Resolved | Uroš Bojanić
          80. Fix collation support for the StringToMap expression (binary & lowercase collation only) | Sub-task | Resolved | Uroš Bojanić
          81. Optimize collation support for string search (UTF8_LCASE collation) | Sub-task | Resolved | Uroš Bojanić
          82. Optimize collation support for ASCII strings (all collations) | Sub-task | Resolved | Uroš Bojanić
          83. Rename leftover BinaryLcase to Lcase | Sub-task | Resolved | Mihailo Milosevic
          84. Handle surrogate pairs properly | Sub-task | Resolved | Uroš Bojanić
          85. `str_to_map` should check whether the `collation` values of all parameter types are the same | Sub-task | Resolved | BingKun Pan
          86. `split_part` should check whether the `collation` values of all parameter types are the same | Sub-task | Resolved | BingKun Pan
          87. `levenshtein` should check whether the `collation` values of all parameter types are the same | Sub-task | Resolved | BingKun Pan
          88. Update collation benchmarks | Sub-task | Resolved | Uroš Bojanić
          89. Expand collation benchmark coverage | Sub-task | Open | Unassigned
          90. Support collations with get_json_object and json_tuple | Sub-task | Open | Unassigned
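
Sub-tasks 85 to 87 ask that `str_to_map`, `split_part`, and `levenshtein` verify that all of their string parameters share one collation. The sketch below shows the general shape of such a check in plain Python, assuming collations are identified by name. `check_same_collation` and `CollationMismatchError` are hypothetical names; this does not reflect how Spark performs or reports the check.

```python
class CollationMismatchError(ValueError):
    """Hypothetical error type; not the error Spark raises."""


def check_same_collation(function_name: str, *collations: str) -> str:
    """Return the shared collation name, or raise if the arguments disagree."""
    if not collations:
        raise ValueError(f"{function_name}: no string arguments to check")
    distinct = set(collations)
    if len(distinct) > 1:
        raise CollationMismatchError(
            f"{function_name}: arguments use different collations: {sorted(distinct)}"
        )
    return collations[0]
```

A caller would pass the collation of every string argument, for example `check_same_collation("split_part", "UTF8_LCASE", "UTF8_LCASE")`, and only evaluate the function when the check returns without raising.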

            People

              Assignee: Unassigned
              Reporter: Aleksandar Tomic (dbatomic)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated: