[MAPREDUCE-6758] TestDFSIO should parallelize its creation of control files on setup - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: test
Labels: None

Description

TestDFSIO currently creates its nrFiles control files in a sequential for-loop. The files are written to controlDir, a subdirectory of the overall test.build.data directory, which may reside on a non-HDFS FileSystem implementation:

private void createControlFile(FileSystem fs,
                                long nrBytes, // in bytes
                                int nrFiles
                              ) throws IOException {
  LOG.info("creating control file: "+nrBytes+" bytes, "+nrFiles+" files");

  Path controlDir = getControlDir(config);
  fs.delete(controlDir, true);

  for(int i=0; i < nrFiles; i++) {
    String name = getFileName(i);
    Path controlFile = new Path(controlDir, "in_file_" + name);
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, config, controlFile,
                                         Text.class, LongWritable.class,
                                         CompressionType.NONE);
      writer.append(new Text(name), new LongWritable(nrBytes));
    } catch(Exception e) {
      throw new IOException(e.getLocalizedMessage());
    } finally {
      if (writer != null)
        writer.close();
      writer = null;
    }
  }
  LOG.info("created control files for: "+nrFiles+" files");
}

When testing against an object-store-based filesystem with higher round-trip latency than HDFS (such as S3 or GCS), job setup that might take only seconds on HDFS can take minutes or even tens of minutes if the test uses thousands of control files. In the same vein as the other JIRAs under https://issues.apache.org/jira/browse/HADOOP-11694, control-file creation should be parallelized (multithreaded) so that large TestDFSIO jobs launch efficiently against FileSystem implementations that have high round-trip latency but can still sustain high overall throughput/QPS.
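
One way the parallelization could look is sketched below. This is a minimal illustration, not the attached patch: it uses only the JDK, writes plain text files to a local directory in place of the SequenceFile.createWriter call, and the names ParallelControlFiles, writeControlFile and nrThreads, as well as the "test_io_" file-name prefix, are assumptions made for the example. The idea is simply to submit one task per control file to a fixed-size thread pool and then wait on every Future so that a failure in any task still surfaces as an IOException.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

public class ParallelControlFiles {

  // Hypothetical stand-in for the per-file SequenceFile write in createControlFile.
  static void writeControlFile(Path controlDir, String name, long nrBytes)
      throws IOException {
    Path controlFile = controlDir.resolve("in_file_" + name);
    byte[] record = (name + "\t" + nrBytes + "\n").getBytes(StandardCharsets.UTF_8);
    Files.write(controlFile, record);
  }

  // Creates nrFiles control files using up to nrThreads concurrent writers.
  static void createControlFiles(Path controlDir, long nrBytes, int nrFiles,
                                 int nrThreads) throws IOException {
    ExecutorService pool = Executors.newFixedThreadPool(nrThreads);
    try {
      List<Future<Void>> pending = new ArrayList<>();
      for (int i = 0; i < nrFiles; i++) {
        final String name = "test_io_" + i; // assumed naming, for illustration only
        // The lambda returns a value, so it is submitted as a Callable and
        // may throw the checked IOException.
        Future<Void> f = pool.submit(() -> {
          writeControlFile(controlDir, name, nrBytes);
          return null;
        });
        pending.add(f);
      }
      // Wait for every task; rethrow the first failure with its cause attached.
      for (Future<Void> f : pending) {
        try {
          f.get();
        } catch (ExecutionException e) {
          throw new IOException("control file creation failed", e.getCause());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("interrupted while creating control files", e);
        }
      }
    } finally {
      pool.shutdownNow(); // also cancels outstanding tasks after a failure
    }
  }

  public static void main(String[] args) throws IOException {
    Path controlDir = Files.createTempDirectory("io_control");
    createControlFiles(controlDir, 1024L * 1024L, 100, 16);
    try (Stream<Path> files = Files.list(controlDir)) {
      System.out.println("control files created: " + files.count());
    }
  }
}
```

With a fixed pool, setup time against a high-latency store should scale roughly with nrFiles / nrThreads round trips rather than nrFiles, provided the store can absorb the concurrent requests. A real change to TestDFSIO would keep the SequenceFile writer and would probably need a configurable thread count.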

Attachments

MAPREDUCE-6758.001.diff (2 kB), 29/Jan/20 16:05, Igor Dvorzhak

Issue Links

relates to: MAPREDUCE-6674 configure parallel tests for mapreduce-client-jobclient (Resolved)

People

Assignee: Unassigned

Reporter: Dennis Huo

Votes: 1

Watchers: 4

Dates

Created: 17/Aug/16 01:56

Updated: 29/Jan/20 19:05