Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version: 0.7.1
- Fix Version: None
- Component: None
- Environment: Cloudera 3u1 with https://github.com/kevinweil/hadoop-lzo or https://github.com/kevinweil/elephant-bird
Description
We have a /tables/ directory containing .lzo files with our data, compressed using lzop.
We run CREATE EXTERNAL TABLE on top of this directory, using STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat".
.lzo files require that LzoIndexer be run on them. When this is done, a .lzo.index file is created next to every .lzo file, so we end up with:
/tables/ourdata_2011-08-19.lzo
/tables/ourdata_2011-08-19.lzo.index
/tables/ourdata_2011-08-18.lzo
/tables/ourdata_2011-08-18.lzo.index
...etc.
The issue is that org.apache.hadoop.hive.ql.io.CombineHiveRecordReader attempts to call getRecordReader() for the .lzo.index files as well. This throws a rather confusing exception:
Caused by: java.io.IOException: No LZO codec found, cannot run.
    at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:53)
    at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:128)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
More precisely, it dies on the second invocation of getRecordReader() - here is some System.out.println() output:
DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo:0+616479
DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo.index:0+64
DeprecatedLzoTextInputFormat contains the following code, which throws the fatal exception and kills the query, since there is obviously no codec registered for .lzo.index files:
final CompressionCodec codec = codecFactory.getCodec(file);
if (codec == null) {
  throw new IOException("No LZO codec found, cannot run.");
}
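To illustrate why the lookup above returns null, here is a simplified stand-in for the suffix-based codec resolution that Hadoop's CompressionCodecFactory performs (this sketch is not the actual Hadoop class; CodecLookupSketch and getCodec are names invented for the example). The codec is chosen by file extension, so a file ending in ".lzo.index" matches no registered codec:

```java
import java.util.Map;

// Simplified sketch of suffix-based codec lookup (not Hadoop's real
// CompressionCodecFactory): the codec is selected by the file's final
// extension, so ".lzo" resolves to a codec while ".index" resolves to null.
public class CodecLookupSketch {
    static final Map<String, String> CODEC_BY_SUFFIX = Map.of(".lzo", "LzoCodec");

    static String getCodec(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension at all
        }
        return CODEC_BY_SUFFIX.get(fileName.substring(dot));
    }

    public static void main(String[] args) {
        System.out.println(getCodec("ourdata.lzo"));        // LzoCodec
        System.out.println(getCodec("ourdata.lzo.index"));  // null
    }
}
```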
As I understand it, Hive currently considers all files within a directory to be part of the table. The open patch HIVE-951 would allow a quick workaround for this problem.
Does it make sense to add hooks so that CombineHiveRecordReader or its parents know which files should be considered, instead of blindly trying to read everything?
Any suggestions for a quick workaround to make it skip .index files?
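For what it's worth, the filtering rule such a hook (or a custom subclass of the input format) would need is just a suffix check on the file name. A minimal self-contained sketch of that rule, with LzoIndexFilter and isDataFile being hypothetical names rather than anything in hadoop-lzo or Hive:

```java
// Hypothetical helper illustrating the rule a file-listing hook could apply:
// treat only the *.lzo files as table data and skip the *.lzo.index
// side files that LzoIndexer creates alongside them.
public class LzoIndexFilter {
    static final String INDEX_SUFFIX = ".index";

    /** Returns true if the file should be read as table data. */
    public static boolean isDataFile(String fileName) {
        return !fileName.endsWith(INDEX_SUFFIX);
    }

    public static void main(String[] args) {
        System.out.println(isDataFile("ourdata_2011-08-19.lzo"));        // true
        System.out.println(isDataFile("ourdata_2011-08-19.lzo.index"));  // false
    }
}
```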
Issue Links
- relates to: HIVE-80 Add testcases for concurrent query execution (Open)