Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-16351

Hive confused by CR/LFs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.2.1
    • None
    • None
    • Hadoop 2.7.3

    • Hide
      Fixed it out. The problem is well described in https://www.phdata.io/hive-corruption-due-to-newlines-and-carriage-returns. If you set hive.query.result.fileformat to SequenceFile the query works as expected since the query output is written in a binary format as opposed to a text format.

      Dup of https://issues.apache.org/jira/browse/HIVE-1608
      Show
      Fixed it out. The problem is well described in https://www.phdata.io/hive-corruption-due-to-newlines-and-carriage-returns . If you set hive.query.result.fileformat to SequenceFile the query works as expected since the query output is written in a binary format as opposed to a text format. Dup of https://issues.apache.org/jira/browse/HIVE-1608

    Description

      Hive is returning broken data that contains CR/LF.

      CREATE DATABASE positron;
      
      CREATE EXTERNAL TABLE positron.articles (
        `_id` struct<oid:string>,
        `channel_id` struct<oid:string>,
        `exclude_google_news` boolean,
        `fair_ids` array<map<string,string>>,
        `hero_section` map<string,string>,
        `partner_ids` array<map<string,string>>,
        `description` string,
        `partner_channel_id` struct<oid:string>,
        `published` boolean,
        `published_at` map<string,string>,
        `slugs` array<string>,
        `sections` array<map<string,string>>,
        `thumbnail_image` string,
        `title` string
      )
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      WITH serdeproperties ('mapping.oid' = '$oid')
      LOCATION '/user/data/positron/articles';
      
      SELECT
        a.slugs[SIZE(a.slugs) - 1],
        a.title,
        a.thumbnail_image
      FROM
        positron.articles as a
      WHERE a.published = true
      AND a.hero_section["type"] = "video"
      AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
      AND a.thumbnail_image IS NOT NULL;
      

      Note the NULLs below. The `title` field has a CR/LF in it.

      astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger   Sneak peek "Realities" by Marc Gumpinger   https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
      artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market   The Art Market, Explained: The Rise of the Art Fair   https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
      nolongercreek  Alexandra Kehayoglou x Artsy: NULL
      No Longer Creek   https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
      kukje-gallery-teaser-trailer-kim-yong-ik  Teaser Trailer | Kim Yong-Ik  https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
      
      $ hive --version
      Hive 1.1.0-cdh5.6.0
      Subversion file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty -r Unknown
      Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
      From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
      

      What's very interesting is that I am not able to reproduce this locally with the same data with any version of hive. It only happens in our Cloudera cluster. Any help appreciated.

      Attachments

        1. Screen Shot 2017-04-03 at 4.33.23 PM.png
          60 kB
          Daniel Doubrovkine

        Activity

          People

            Unassigned Unassigned
            dblock Daniel Doubrovkine
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: