Apache Avro / AVRO-534

AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: java
    • Environment: ArchLinux, Java 1.6, Apache Hadoop 0.20.2, Apache Avro (trunk, 1.4.0-SNAPSHOT), using the Avro Generic API (Java)

    • Flags: Reviewed
    • Labels: Avro, MapReduce, AvroRecordReader

    Description

      Consider an Avro file containing a single record type with about 70 fields, in the order (str, str, str, long, str, double, ...) (let's take only the first six into consideration).
      To pass this into a simple MapReduce job I use AvroInputFormat.addInputPath(...), and it works well with an IdentityMapper.

      Now I'd like to read only three fields, say fields 0, 1, and 3, so I supply a special schema with my three fields as (str (0), str (1), long (2)) using AvroJob.setInputGeneric(..., mySchema). This causes the MapReduce job to fail, because AvroRecordReader reads the file with its entire schema (of 70 fields) and tries to convert my 'long' field to the 'str' that sits at index 2 of the actual schema (meaning it uses the schema embedded in the file, not the one I supplied!).
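      Avro's schema resolution matches reader fields to writer fields by name, not by position; the failure above occurs precisely because the reader (projection) schema is ignored and fields are consumed positionally. A minimal self-contained sketch of the difference (plain Java, with hypothetical field names standing in for the real 70-field schema):

```java
import java.util.Arrays;
import java.util.List;

public class ProjectionSketch {
    // Writer schema field names in file order (hypothetical subset of the 70 fields).
    static final List<String> WRITER_FIELDS =
        Arrays.asList("f0_str", "f1_str", "f2_str", "f3_long", "f4_str", "f5_double");

    // Reader (projection) schema: only the three fields the job wants.
    static final List<String> READER_FIELDS =
        Arrays.asList("f0_str", "f1_str", "f3_long");

    // Name-based resolution: map each reader field to its position in the
    // writer schema, as Avro schema resolution does. Returns, for each reader
    // field, the writer index to read from.
    static int[] resolveByName(List<String> writer, List<String> reader) {
        int[] indices = new int[reader.size()];
        for (int i = 0; i < reader.size(); i++) {
            indices[i] = writer.indexOf(reader.get(i)); // -1 would mean: fall back to a default
        }
        return indices;
    }

    public static void main(String[] args) {
        int[] resolved = resolveByName(WRITER_FIELDS, READER_FIELDS);
        // Reader field 2 ("f3_long") correctly resolves to writer index 3 ...
        System.out.println(Arrays.toString(resolved)); // [0, 1, 3]
        // ... whereas positional reading would hand the reader the string at
        // writer index 2 for its long field, which is the failure in this issue.
    }
}
```

      With positional reading, reader slot 2 (long) collides with writer slot 2 (str); with name-based resolution it lands on writer slot 3 as intended.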

      AvroRecordReader must support reading with the schema specified by the user via AvroJob.setInputGeneric.

      I've written a patch that does this, but I'm not sure it's actually the right solution (should MAP_OUTPUT_SCHEMA be used?).

      Attachments

        Activity

          People

            Assignee: Harsh J (qwertymaniac)
            Reporter: Harsh J (qwertymaniac)
            Votes: 0
            Watchers: 2
