  1. Hive
  2. HIVE-28262

MultiDelimitSerDe parses the column incorrectly for a single-column table

Details

    Description

      ENV:

      Hive: 3.1.3/4.1.0

      HDFS: 3.3.1

      --------------------------

      Create a text file to load into the external table (e.g. /tmp/data):

       

      1|@|
      2|@|
      3|@| 

       

       

      Create external table:

       

      CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES('field.delim'='|@|') STORED AS textfile location '/tmp/test_split_tmp'; 

       

      Put the text file into the external table's location:

       

      hdfs dfs -put /tmp/data /tmp/test_split_tmp 

       

       

      Query the table and cast the id column to long:

       

      select UDFToLong(`id`) from test_split_tmp; 

      Why use UDFToLong? Because it returns NULL in this case, although applying it to the string '1' should return the long value 1.

      +--------+
      | id     |
      +--------+
      | NULL   |
      | NULL   |
      | NULL   |
      +--------+ 

      I therefore suspect that MultiDelimitSerDe splits the fields incorrectly.

      While debugging this issue, I found the following problems:

      • org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes

                 When fields.length = 1, the delimiter index is never found.

       

      private int[] findIndexes(byte[] array, byte[] target) {
        if (fields.length <= 1) {  // bug: a single-column table returns here, so the delimiter is never located
          return new int[0];
        }
        ...
        for (int i = 1; i < indexes.length; i++) {  // bug
          array = Arrays.copyOfRange(array, indexInNewArray + target.length, array.length);
          indexInNewArray = Bytes.indexOf(array, target);
          if (indexInNewArray == -1) {
            break;
          }
          indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
        }
        return indexes;
      }
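
For comparison, here is a minimal standalone sketch of a delimiter search that does not depend on the number of fields. It is not the Hive code and not a proposed patch. The class and method names (`DelimiterSearchSketch`, `findAllIndexes`, `matchesAt`) are hypothetical and only illustrate that the row `1|@|` contains a delimiter at offset 1 even when the table has a single column.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: finds every occurrence of a
// multi-byte delimiter in a row, regardless of how many columns the table has.
class DelimiterSearchSketch {

  // Returns the byte offset of each non-overlapping occurrence of target in array.
  static List<Integer> findAllIndexes(byte[] array, byte[] target) {
    List<Integer> indexes = new ArrayList<>();
    if (target.length == 0) {
      return indexes; // an empty delimiter can never match
    }
    int from = 0;
    while (from + target.length <= array.length) {
      if (matchesAt(array, target, from)) {
        indexes.add(from);
        from += target.length; // skip past the delimiter just found
      } else {
        from++;
      }
    }
    return indexes;
  }

  // True if target occurs in array starting at offset pos.
  private static boolean matchesAt(byte[] array, byte[] target, int pos) {
    for (int j = 0; j < target.length; j++) {
      if (array[pos + j] != target[j]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] row = "1|@|".getBytes(StandardCharsets.UTF_8);
    byte[] delim = "|@|".getBytes(StandardCharsets.UTF_8);
    System.out.println(findAllIndexes(row, delim)); // prints [1]
  }
}
```

The search is driven by the row contents alone, so a trailing delimiter is reported whether the schema has one column or many.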

       

      • org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit

                 When fields.length = 1, the start position of the next column is computed incorrectly.

       

      public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
        ...
        int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
        ...
          if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug: a single-column table never takes this branch
            int start = delimitIndexes[i - 1] + fieldDelimit.length;
            startPosition[i] = start - i * diff;
          } else {
            startPosition[i] = length + 1;
          }
        }
        Arrays.fill(fieldInited, false);
        parsed = true;
      }

       

       

      Multi-delimiter processing:

      Actual:    1|@| -> 1^A; the id column starts at 0, the next column starts at 1

      Expected:  1|@| -> 1^A; the id column starts at 0, the next column starts at 2

       
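
The difference between the two start values can be illustrated with a small standalone sketch. It assumes the convention that a field ends one separator byte before the next start position, so its length is `nextStart - start - 1`. The class and method names (`StartPositionSketch`, `fieldAt`) are hypothetical and this is not the Hive implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Hive code: shows how the "next column start"
// value decides how many bytes the id column receives.
class StartPositionSketch {

  // Assumed convention: a field ends one separator byte before the next start.
  static String fieldAt(byte[] row, int start, int nextStart) {
    int fieldLength = nextStart - start - 1;
    if (fieldLength <= 0) {
      return ""; // nothing left to read for this field
    }
    return new String(row, start, fieldLength, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "1|@|" after the multi-byte delimiter is replaced by the single byte ^A (0x01)
    byte[] row = {'1', 0x01};
    System.out.println("[" + fieldAt(row, 0, 1) + "]"); // actual: prints []
    System.out.println("[" + fieldAt(row, 0, 2) + "]"); // expected: prints [1]
  }
}
```

Under that assumption, a next-column start of 1 leaves the id field empty, which would be consistent with the NULL results from UDFToLong, while a start of 2 yields the string "1".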

      Fix:

      1. When fields.length = 1, findIndexes should still locate the multi-character delimiter.
      2. When fields.length = 1, parseMultiDelimit should compute the column start positions correctly.

       

      Attachments

        1. CleanShot 2024-05-16 at 15.13.29@2x.png
          137 kB
          Liu Weizheng
        2. CleanShot 2024-05-16 at 15.17.15@2x.png
          227 kB
          Liu Weizheng

            People

              laughing_vzr Liu Weizheng