Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-11112

ISO-8859-1 text output has fragments of previous longer rows appended

    XMLWordPrintableJSON

Details

    Description

      If a LazySimpleSerDe table is created using ISO 8859-1 encoding, query results for a string column are incorrect for any row that was preceded by a row containing a longer string.

      Example steps to reproduce:

      1. Create a table using ISO 8859-1 encoding:

      CREATE TABLE person_lat1 (name STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='ISO8859_1');
      

      2. Copy an ISO-8859-1 encoded text file into the appropriate warehouse folder in HDFS. I'll attach an example file containing the following text:

      Müller,Thomas
      Jørgensen,Jørgen
      Peña,Andrés
      Nåm,Fæk
      

      3. Execute SELECT * FROM person_lat1

      Result - The following output appears:

      +-------------------+--+
      | person_lat1.name |
      +-------------------+--+
      | Müller,Thomas |
      | Jørgensen,Jørgen |
      | Peña,Andrésørgen |
      | Nåm,Fækdrésørgen |
      +-------------------+--+
      

      Attachments

        1. HIVE-11112.1.patch
          3 kB
          Yongzhi Chen

        Issue Links

          Activity

            People

              ychena Yongzhi Chen
              ychena Yongzhi Chen
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: