Hi Alan,
Below is how I have handled these cases:
The XMLLoader treats everything from a beginning tag to its end tag as one record, just as a line record reader treats everything up to the next newline character as one record.
Split start and end locations are provided by the default FileInputFormat.
The steps, described simply:
* The loader collects the begin and end tags and creates a record from them (XMLLoaderBufferedPositionedInputStream.collectTag).
* For the begin tag:
    * Read until the tag is found in this block.
    * If the tag is not found and the split end has been reached, there is no record in this split (return an empty array).
    * If a partial tag has been matched when the split end is reached, continue reading the rest of the file beyond the split end location (handled by the condition in the while loop).
* For the end tag:
    * Read until the end tag is found, even if the split end location is reached.
>>How far will split 1 read? It seems like it has to read to "</a>" or else the map processing split one will not be able to process this as a coherent document.
>>Yet from the setting of maxBytesReadable on line 132 it looks to me like it won't read past the end point.
The other condition in the loop (matchBuf.size() > 0) keeps the read going.
In this case, let's say my tag identifier is <a>. The loader reads up to the split end searching for the beginning tag.
For the end tag, it reads the rest of the file starting from the last read position. Let's say the split end is reached partway through:
it checks whether it has found a match or a partial match. If not, it keeps reading until it finds an end tag.
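
To make the behaviour above concrete, here is a minimal sketch of the same idea. It is not the actual XMLLoader code: the class XmlSplitSketch and the methods findTag and readSplit are hypothetical names, the input is an in-memory String rather than a stream, and tags are assumed to have no attributes. The counter "matched" stands in for matchBuf.size().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the XMLLoaderBufferedPositionedInputStream source.
final class XmlSplitSketch {

    /**
     * Scans data starting at 'from' and returns the index just past 'tag', or -1.
     * When stopAtSplitEnd is true, scanning stops at splitEnd unless a partial
     * match is in progress (matched > 0), mirroring the matchBuf.size() > 0 check.
     * Assumes the tag's first character ('<') does not occur again inside the tag.
     */
    static int findTag(String data, String tag, int from, int splitEnd, boolean stopAtSplitEnd) {
        int matched = 0; // number of tag characters matched so far
        int pos = from;
        while (pos < data.length() && (!stopAtSplitEnd || pos < splitEnd || matched > 0)) {
            char c = data.charAt(pos++);
            if (c == tag.charAt(matched)) {
                matched++;
                if (matched == tag.length()) {
                    return pos; // full tag found
                }
            } else {
                // Mismatch: restart, but this character may itself begin the tag.
                matched = (c == tag.charAt(0)) ? 1 : 0;
            }
        }
        return -1;
    }

    /** Returns every record whose begin tag starts inside [splitStart, splitEnd). */
    static List<String> readSplit(String data, String name, int splitStart, int splitEnd) {
        String begin = "<" + name + ">";
        String end = "</" + name + ">";
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            // Begin tag: bounded by the split end, except for a partial match.
            int afterBegin = findTag(data, begin, pos, splitEnd, true);
            if (afterBegin < 0) {
                break; // no further record starts in this split
            }
            // End tag: read as far as needed, even past the split end.
            int afterEnd = findTag(data, end, afterBegin, splitEnd, false);
            if (afterEnd < 0) {
                break; // unterminated record at end of file
            }
            records.add(data.substring(afterBegin - begin.length(), afterEnd));
            pos = afterEnd;
        }
        return records;
    }
}
```

With the input <a>one</a><a>two</a><a>three</a> and a split boundary at offset 12 (in the middle of the second <a>), the first split picks up the second record because a partial begin tag is already matched at the boundary, and it then reads past the split end to reach </a>. The second split starts inside that tag, never sees a complete <a> for it, and so only returns the third record.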