[HADOOP-2032] distcp split generation does not work correctly - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: util
Labels: None

Description

With the current implementation, distcp always assigns multiple files to a single mapper, no matter how large
the files are. This is because the CopyFiles class uses a sequence file to store the list of files to be copied,
one record per file. CopyFiles correctly generates one split per record in the sequence file. However,
because of the way the sequence file record reader works, the minimum unit for a split is the segment between two
sync marks in the sequence file.
This results in the strange behavior that some mappers get zero records (zero files to copy) even though their
split lengths are non-zero, while other mappers get multiple records (multiple files to copy) from their split (and beyond,
up to the next sync mark).

When CopyFiles creates the sequence file, it does try to place a sync mark between splittable segments by calling the sync() function of the sequence file writer.
Unfortunately, the sync() function is a no-op for files that are not block compressed.
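
The effect of sparse sync marks on record assignment can be illustrated with a small standalone model. This is only a sketch and does not use the Hadoop API: the class SyncSplitModel, the method names, and the byte offsets are all hypothetical, and the model assumes that a record is read by the split whose byte range contains the sync mark that starts the record's segment.

```java
import java.util.Arrays;

// Hypothetical, simplified model of how sync marks decide which split reads a record.
// It is NOT Hadoop code; offsets and names are made up for illustration.
class SyncSplitModel {

    // Index of the largest element <= value in a sorted array whose first element is 0.
    static int indexOfLastAtOrBefore(long[] sorted, long value) {
        int idx = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] <= value) {
                idx = i;
            }
        }
        return idx;
    }

    // Counts how many records each split would read under the assumed model:
    // a record belongs to the segment opened by the last sync mark at or before it,
    // and a segment is read by the split whose range contains that sync mark.
    static int[] recordsPerSplit(long[] recordStarts, long[] syncMarks, long[] splitStarts) {
        int[] counts = new int[splitStarts.length];
        for (long recordStart : recordStarts) {
            long segmentStart = syncMarks[indexOfLastAtOrBefore(syncMarks, recordStart)];
            counts[indexOfLastAtOrBefore(splitStarts, segmentStart)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Six records of 100 bytes each, one split per record.
        long[] records = {0, 100, 200, 300, 400, 500};
        long[] splits = {0, 100, 200, 300, 400, 500};

        // Sparse sync marks: sync() had no effect between records.
        long[] sparse = {0, 300};
        System.out.println(Arrays.toString(recordsPerSplit(records, sparse, splits)));

        // A sync mark before every record: each split reads exactly its own record.
        System.out.println(Arrays.toString(recordsPerSplit(records, records, splits)));
    }
}
```

Under this model, the sparse case gives three records each to the first and fourth splits and none to the others, while a sync mark before every record gives each split exactly one record.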

As expected, after I changed the compression type for the sequence file to block compression,
mappers got the correct records from their splits.
So a simple fix is to change the compression type to CompressionType.BLOCK:

    // create src list
    SequenceFile.Writer writer = SequenceFile.createWriter(
        jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
        LongWritable.class, FilePair.class,
        SequenceFile.CompressionType.BLOCK);

Issue Links

is related to

HADOOP-2033 In SequenceFile sync doesn't work unless the file is compressed (block or record)

Closed

People

Assignee: Unassigned

Reporter: Runping Qi

Dates

Created: 11/Oct/07 13:20

Updated: 24/Oct/07 21:42

Resolved: 17/Oct/07 20:37