Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
1.2.1
-
None
-
None
-
Hadoop 2.7.3
Description
Hive is returning broken data that contains CR/LF.
CREATE DATABASE positron; CREATE EXTERNAL TABLE positron.articles ( `_id` struct<oid:string>, `channel_id` struct<oid:string>, `exclude_google_news` boolean, `fair_ids` array<map<string,string>>, `hero_section` map<string,string>, `partner_ids` array<map<string,string>>, `description` string, `partner_channel_id` struct<oid:string>, `published` boolean, `published_at` map<string,string>, `slugs` array<string>, `sections` array<map<string,string>>, `thumbnail_image` string, `title` string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH serdeproperties ('mapping.oid' = '$oid') LOCATION '/user/data/positron/articles'; SELECT a.slugs[SIZE(a.slugs) - 1], a.title, a.thumbnail_image FROM positron.articles as a WHERE a.published = true AND a.hero_section["type"] = "video" AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL) AND a.thumbnail_image IS NOT NULL;
Note the NULLs below. The `title` field has a CR/LF in it.
astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger Sneak peek "Realities" by Marc Gumpinger https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market The Art Market, Explained: The Rise of the Art Fair https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg nolongercreek Alexandra Kehayoglou x Artsy: NULL No Longer Creek https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL kukje-gallery-teaser-trailer-kim-yong-ik Teaser Trailer | Kim Yong-Ik https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
$ hive --version
Hive 1.1.0-cdh5.6.0
Subversion file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty -r Unknown
Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
What's very interesting is that I am not able to reproduce this locally with the same data with any version of hive. It only happens in our Cloudera cluster. Any help appreciated.