[CRUNCH-580] FileTargetImpl#handleOutputs Inefficiency on S3NativeFileSystem - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 0.14.0
Component/s: Core, IO
Labels: None
Environment: Amazon Elastic Map Reduce
Flags: Patch

Description

We have run into a frustrating inefficiency inside org.apache.crunch.io.impl.FileTargetImpl#handleOutputs.

This method loops over all of the partial output files and moves each one to its final destination directory by calling org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path), one file at a time.

This is not a problem when the org.apache.hadoop.fs.FileSystem in question is HDFS, where #rename is a cheap operation. With an implementation such as S3NativeFileSystem, however, it is extremely inefficient: each iteration through the loop makes a single blocking S3 API call, and the loop can take a very long time when there are many thousands of partial output files.
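To illustrate one possible mitigation, here is a minimal sketch that issues the renames concurrently from a fixed-size thread pool instead of serially. It is not the attached patch and makes no claim about how Crunch resolved the issue. The class ParallelRenameSketch, the Renamer interface (a stand-in for FileSystem#rename so the example runs without Hadoop on the classpath), and the threads parameter are all hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRenameSketch {

    /** Hypothetical stand-in for FileSystem#rename(Path, Path); not a Hadoop type. */
    public interface Renamer {
        boolean rename(String src, String dst) throws Exception;
    }

    /**
     * Submits one rename per source path to a fixed-size pool and waits for all of them.
     * Returns the number of renames that reported success.
     */
    static int renameAll(Renamer renamer, List<String> sources, String dstDir, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String src : sources) {
                // Keep the file name and move it under the destination directory.
                String name = src.substring(src.lastIndexOf('/') + 1);
                String dst = dstDir + "/" + name;
                Callable<Boolean> task = () -> renamer.rename(src, dst);
                results.add(pool.submit(task));
            }
            int moved = 0;
            for (Future<Boolean> result : results) {
                // get() blocks until the task finishes and rethrows any failure
                // wrapped in an ExecutionException.
                if (result.get()) {
                    moved++;
                }
            }
            return moved;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a real FileSystem, the Renamer could be supplied as a lambda that wraps the paths and delegates to fs.rename. The blocking calls then overlap rather than queue behind one another, so the wall-clock time depends on the pool size rather than on the file count alone.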

Attachments

CRUNCH-580.patch, 05/Dec/15 01:26, 9 kB, Jeffrey Quinn
CRUNCH-580.patch, 08/Dec/15 18:04, 8 kB, Jeffrey Quinn

People

Assignee: Josh Wills
Reporter: Jeffrey Quinn
Votes: 0
Watchers: 3

Dates

Created: 05/Dec/15 01:20
Updated: 08/May/16 04:14
Resolved: 10/Dec/15 06:30