[AVRO-2188] SpecificDatumReader Corrupts Bytes Field When Using Next(R) - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.8.2
Fix Version/s: None
Component/s: java
Labels: None

Description

I am loading large sets of data into Avro files and cataloging them for quick access. My code for the load looks a bit like this.

DataFileWriter<MyData> writer;
final long offset = writer.sync();
writer.append(data);

When I was ready to read the data, this is the code that I initially used.

DataFileReader<MyData> reader; // wraps a SpecificDatumReader<MyData>
reader.sync(offset);

while (reader.hasNext()) {
  MyData data = reader.next();
  if (matchesId(data)) return data;
}

This worked for the majority of cases, but a few of them had a problem with the call to sync. In these cases a call to "reader.tell()" indicated that the reader had been positioned PAST the offset. This meant that I would never retrieve the record that I wanted.
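
One possible reading of this behavior, sketched as a toy model rather than Avro's actual code: the Javadoc for DataFileReader.sync(long) describes it as moving to the next synchronization point after a position, while DataFileWriter.sync() is documented as returning a value that may be passed to DataFileReader.seek(long). If the recorded offset already points just past a sync marker, scanning forward from it would find the following marker instead. The class SyncModel, its 4-byte marker, and the file layout below are all hypothetical and use only the JDK.

```java
import java.util.Arrays;

/** Toy model of a container file in which 4-byte sync markers separate data blocks. */
class SyncModel {
    // Hypothetical marker; real Avro sync markers are 16 bytes.
    static final byte[] MARKER = {(byte) 0xAA, (byte) 0xBB, (byte) 0xCC, (byte) 0xDD};

    /**
     * Scans forward from position for the next marker and returns the offset
     * just past it (the start of the following block), or -1 if none is found.
     */
    static long syncAfter(byte[] file, long position) {
        for (int i = (int) position; i + MARKER.length <= file.length; i++) {
            byte[] window = Arrays.copyOfRange(file, i, i + MARKER.length);
            if (Arrays.equals(window, MARKER)) {
                return i + MARKER.length;
            }
        }
        return -1;
    }

    /** Builds a file laid out as: marker, block A, marker, block B, marker. */
    static byte[] sampleFile() {
        byte[] file = new byte[24];
        System.arraycopy(MARKER, 0, file, 0, 4);   // marker at 0..3
        Arrays.fill(file, 4, 10, (byte) 1);        // block A at 4..9
        System.arraycopy(MARKER, 0, file, 10, 4);  // marker at 10..13
        Arrays.fill(file, 14, 20, (byte) 2);       // block B at 14..19
        System.arraycopy(MARKER, 0, file, 20, 4);  // marker at 20..23
        return file;
    }

    public static void main(String[] args) {
        byte[] file = sampleFile();
        long recorded = 14; // what the writer would have reported: the start of block B

        // Scanning forward from the start of block B skips block B entirely.
        System.out.println(syncAfter(file, recorded));

        // Scanning from the marker that precedes block B lands on block B.
        System.out.println(syncAfter(file, recorded - MARKER.length));
    }
}
```

If this model matches what the real reader does, calling reader.seek(offset) with the value recorded from writer.sync() might avoid the problem without any scanning, though that is untested here.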

In order to work around this issue, I implemented a simple reversal algorithm which works roughly like this.

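DataFileReader<MyData> reader; // wraps a SpecificDatumReader<MyData>
reader.sync(offset);

// Back off by a doubling distance until the reader lands before the offset.
long reversal = 10;
while (reader.tell() >= offset) {
  reader.sync(Math.max(0, offset - reversal)); // never ask for a negative position
  reversal *= 2;
}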

while (reader.hasNext()) {
  MyData data = reader.next();
  if (matchesId(data)) return data;
}

This works correctly in what I believe are all cases. Now I am SURE that I am doing something wrong, since this process seems like an extremely convoluted way to retrieve data. However, the real issue is what happened next.

To see if performance could be improved, I changed my last loop to this.

MyData data = null; // reused across iterations; null lets the first call allocate
while (reader.hasNext()) {
  data = reader.next(data);
  if (matchesId(data)) return data;
}

My schema has several string fields and a contents field of type bytes. A record looks something like this.

{
  "title": "the title",
  "contents": "[base 64 file contents]"
}

Once I made the change for performance, I started seeing data returned like this.

{
  "title": "the title",
  "contents": "[file contents][fragment of contents from previous file]"
}

I'm thinking that this is because the byte array is being reused. Any ideas on this?
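
The suspected reuse can be illustrated with a small JDK-only sketch. It assumes, without having confirmed it against the Avro source for 1.8.2, that the decoder reuses the previous record's ByteBuffer when it is large enough and only adjusts the limit, and that the calling code reads the field through array() rather than honoring the limit. The class BufferReuseDemo and its methods are hypothetical stand-ins, not Avro APIs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BufferReuseDemo {
    /**
     * Hypothetical stand-in for a decoder: reuses the old buffer when the new
     * payload fits, overwriting only the first payload.length bytes.
     */
    static ByteBuffer readBytes(ByteBuffer old, byte[] payload) {
        ByteBuffer result;
        if (old != null && payload.length <= old.capacity()) {
            result = old;
            result.clear(); // resets position and limit, leaves old contents in place
        } else {
            result = ByteBuffer.allocate(payload.length);
        }
        System.arraycopy(payload, 0, result.array(), 0, payload.length);
        result.limit(payload.length); // only this many bytes are valid
        return result;
    }

    /** Copies just the valid bytes between position and limit. */
    static byte[] validBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate so the caller's position is untouched
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = readBytes(null, "previous-file".getBytes(StandardCharsets.US_ASCII));
        buf = readBytes(buf, "short".getBytes(StandardCharsets.US_ASCII));

        // Reading the whole backing array exposes stale bytes from the earlier record.
        System.out.println(new String(buf.array(), StandardCharsets.US_ASCII));

        // Respecting the limit returns only the current record's bytes.
        System.out.println(new String(validBytes(buf), StandardCharsets.US_ASCII));
    }
}
```

Under these assumptions the first line printed is "shortous-file" and the second is "short", which resembles the corrupted contents described above. If the real cause is the same, copying the bytes via remaining() and get() instead of array() would be one way to check.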

People

Assignee: Unassigned

Reporter: Dan Grahn

Votes: 0

Watchers: 1

Dates

Created: 12/Jun/18 17:32

Updated: 12/Jun/18 17:32