[HIVE-11095] SerDeUtils another bug, when Text is reused - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.14.0, 1.0.0, 1.2.0
Fix Version/s: 1.3.0, 2.0.0
Component/s: API, CLI
Labels:
None
Environment:

Hadoop 2.3.0-cdh5.0.0
Hive 0.14

External issue URL:
https://issues.apache.org/jira/browse/HIVE-10983
External issue ID:
10983

Description

The method SerDeUtils.transformTextFromUTF8 has a bug: it calls Text.getBytes() and treats the whole returned array as data.
Text.getBytes() returns the raw backing array, but only the bytes up to Text.getLength() are valid. If you need the returned array to be exactly the length of the data, copyBytes() is the better choice.
However, copyBytes() was only added after Hadoop 1.
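
The difference between the raw array and the valid range can be illustrated with a minimal sketch. `ReusedBuffer` below is a hypothetical stand-in that only mimics the reuse behaviour described above; it is not Hadoop's Text, and the helper names `decodeWholeArray` and `decodeValidRange` are made up for illustration. Decoding with an explicit offset and length is one possible way to get correct output without relying on copyBytes().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReusedBufferDemo {

    /** Hypothetical stand-in for a reusable byte holder (assumption: not the real Text class). */
    public static final class ReusedBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Overwrites the content in place; the backing array grows but never shrinks. */
        public void set(byte[] src) {
            if (src.length > bytes.length) {
                bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        /** Raw backing array; bytes past getLength() may be left over from an earlier value. */
        public byte[] getBytes() { return bytes; }

        /** Number of valid bytes. */
        public int getLength() { return length; }

        /** Fresh array trimmed to exactly the valid bytes. */
        public byte[] copyBytes() { return Arrays.copyOf(bytes, length); }
    }

    /** Problematic pattern: decodes the whole backing array, stale tail included. */
    public static String decodeWholeArray(ReusedBuffer b, Charset cs) {
        return new String(b.getBytes(), cs);
    }

    /** Bounded pattern: decodes only the first getLength() bytes. */
    public static String decodeValidRange(ReusedBuffer b, Charset cs) {
        return new String(b.getBytes(), 0, b.getLength(), cs);
    }

    public static void main(String[] args) {
        ReusedBuffer buf = new ReusedBuffer();
        buf.set("previous longer row".getBytes(StandardCharsets.UTF_8));
        buf.set("short".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeWholeArray(buf, StandardCharsets.UTF_8)); // shortous longer row
        System.out.println(decodeValidRange(buf, StandardCharsets.UTF_8)); // short
    }
}
```

The bounded decode avoids an extra array copy, which is why it may be preferable to a copyBytes()-style call when the bytes are only needed to build a String.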

How did I find this bug?
When I queried data from an LZO table, I found that each row in the result was always longer than the previous row, and sometimes the current row contained content from the previous row. For example, I executed this query:

select * from web_searchhub where logdate=2015061003

The result of the query is shown below. Notice that the second row contains content from the first row.

INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003
INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> session=901,thread=223ession=3151,thread=254 2015061003
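
The shape of this output can be reproduced with a short sketch that writes rows into one shared byte array and decodes the whole array each time. This is only an illustration of the suspected mechanism, not the Hive code path: the class and method names are hypothetical, the partition column is left out, and the "true" second row is an assumption obtained by cutting the leftover tail off the second result line.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StaleTailRepro {

    /**
     * Copies each row into one shared buffer and decodes the entire buffer,
     * as a reader would if it ignored the number of valid bytes.
     */
    public static String[] readIgnoringLength(String... rows) {
        byte[] shared = new byte[0];
        String[] out = new String[rows.length];
        for (int i = 0; i < rows.length; i++) {
            byte[] row = rows[i].getBytes(StandardCharsets.UTF_8);
            if (row.length > shared.length) {
                // Buffer too small: replace it with a copy of the new row.
                shared = Arrays.copyOf(row, row.length);
            } else {
                // Buffer large enough: overwrite the front, leave the tail untouched.
                System.arraycopy(row, 0, shared, 0, row.length);
            }
            out[i] = new String(shared, StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        String first = "INFO [03:00:05.589] HttpFrontServer::FrontSH "
                + "msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254";
        // Assumed real content of the second row (inferred, not taken from the file).
        String second = "INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> "
                + "session=901,thread=223";
        for (String line : readIgnoringLength(first, second)) {
            System.out.println(line);
        }
    }
}
```

Under these assumptions the second printed line ends in `thread=223ession=3151,thread=254`, the same leftover tail seen in the query result.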

The content of the original LZO file is shown below; it has just 2 rows.

INFO [03:00:05.635] <b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb> session=3148,thread=285
INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285

I think this error is caused by Text reuse, and I have found a solution.
Additionally, the SQL used to create the table is:

CREATE EXTERNAL TABLE `web_searchhub`(
`line` string)
PARTITIONED BY (
`logdate` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\U0000'
WITH SERDEPROPERTIES (
'serialization.encoding'='GBK')
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION
'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';

Attachments

- HIVE-11095.1.patch.txt, 24/Jun/15 13:40, 0.8 kB, Xiaowei Wang
- HIVE-11095.2.patch.txt, 26/Jun/15 09:25, 0.6 kB, Xiaowei Wang
- HIVE-11095.3.patch.txt, 30/Jun/15 00:20, 6 kB, Xiaowei Wang

Issue Links

is duplicated by
HIVE-10983 SerDeUtils bug, when Text is reused (Resolved)

is related to
HIVE-11112 ISO-8859-1 text output has fragments of previous longer rows appended (Closed)

relates to
HIVE-10983 SerDeUtils bug, when Text is reused (Resolved)

People

Assignee: Unassigned

Reporter: Xiaowei Wang

Votes: 0

Watchers: 5

Dates

Created: 24/Jun/15 13:30

Updated: 19/Apr/18 16:47

Resolved: 30/Jun/15 12:24