[CARBONDATA-4050] TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.1.1
Component/s: core
Labels: None

Description

Issue:

In the createCarbonDataFileBlockMetaInfoMapping method, we list the carbondata files in the segment, loop over them, and build the map fileNameToMetaInfoMapping<path-string, BlockMetaInfo>.

In that loop, if the file is of type AbstractDFSCarbonFile, we fetch the org.apache.hadoop.fs.FileStatus three times for each file. Each fetch is an RPC call (fileSystem.getFileStatus(path)) that takes ~2 ms in the cluster, so it incurs an overhead of ~6 ms per file. As a result, driver-side query processing time increases significantly when there are many carbon files, which caused the TPC-DS query performance degradation.
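A rough estimate of that overhead, as a minimal sketch. It assumes the three status lookups per file run one after another on the driver; the class name, method name, and the file count of 10,000 are hypothetical and chosen only for illustration.

```java
// Back-of-the-envelope estimate of the driver-side overhead described above.
// All names here are hypothetical; only the figures 3 calls and ~2 ms come from the report.
class StatusOverheadEstimate {

  // Total time in milliseconds spent on repeated file-status lookups,
  // assuming the calls are issued sequentially.
  static long overheadMillis(long fileCount, int callsPerFile, long millisPerCall) {
    return fileCount * callsPerFile * millisPerCall;
  }

  public static void main(String[] args) {
    long files = 10_000; // hypothetical number of carbondata files
    // 10,000 files * 3 calls * 2 ms = 60,000 ms
    System.out.println(overheadMillis(files, 3, 2) + " ms"); // prints "60000 ms"
  }
}
```

With these assumed numbers the redundant lookups alone would add about a minute of driver time, which is consistent with the slowdown growing with the number of carbon files.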

The calls that fetch the file status for each carbon file in the loop are marked below:

public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations(); // RPC call - 1
      long len = file.getSize(); // RPC call - 2
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call - 3, inside file.getPath()
    }
  }
  return fileNameToMetaInfoMapping;
}

Suggestion:

I think we currently make an RPC call to get the file status on each invocation because the file status may change over time. For that reason, we shouldn't cache the file status in AbstractDFSCarbonFile.

In this case, however, just before looping over the carbon files, we already get the file status of every carbon file in the segment with the RPC call shown below. LocatedFileStatus is a subclass of FileStatus; it carries the BlockLocation information along with the file status.

RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);

The purpose of fetching all the file statuses here is to create the BlockMetaInfo instances and populate the fileNameToMetaInfoMapping map.

So it is safe to avoid the unnecessary RPC calls that fetch the file status again in the getLocations(), getSize() and getPath() methods.
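The difference in RPC count can be illustrated with a stdlib-only model. This is a sketch under stated assumptions, not the actual fix: FakeFileSystem, Status, MetaInfo and the build methods are hypothetical stand-ins for the Hadoop FileSystem, LocatedFileStatus, BlockMetaInfo and the mapping method. In real code the same values would presumably be read from each LocatedFileStatus (getPath(), getLen(), getBlockLocations()) returned by listLocatedStatus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stdlib-only model; every class here is a hypothetical stand-in, not CarbonData or Hadoop code.
class RpcCountSketch {

  /** Stand-in for the per-file status returned by a listing call. */
  static final class Status {
    final String path;
    final long len;
    final String[] hosts;
    Status(String path, long len, String[] hosts) {
      this.path = path;
      this.len = len;
      this.hosts = hosts;
    }
  }

  /** Stand-in for the block metadata kept per file. */
  static final class MetaInfo {
    final String[] locations;
    final long size;
    MetaInfo(String[] locations, long size) {
      this.locations = locations;
      this.size = size;
    }
  }

  /** Stand-in for a remote file system that counts round trips. */
  static final class FakeFileSystem {
    private final Map<String, Status> files = new TreeMap<>();
    int rpcCount = 0;

    void add(String path, long len, String... hosts) {
      files.put(path, new Status(path, len, hosts));
    }

    // One round trip returns the status of every file (models listLocatedStatus).
    List<Status> listLocatedStatus() {
      rpcCount++;
      return new ArrayList<>(files.values());
    }

    // One round trip per call (models getFileStatus).
    Status getFileStatus(String path) {
      rpcCount++;
      return files.get(path);
    }
  }

  // Models the reported loop: each accessor fetches the status again.
  static Map<String, MetaInfo> buildWithRefetch(FakeFileSystem fs) {
    Map<String, MetaInfo> mapping = new TreeMap<>();
    for (Status listed : fs.listLocatedStatus()) {
      String[] hosts = fs.getFileStatus(listed.path).hosts; // models getLocations()
      long len = fs.getFileStatus(listed.path).len;          // models getSize()
      String path = fs.getFileStatus(listed.path).path;      // models getPath()
      mapping.put(path, new MetaInfo(hosts, len));
    }
    return mapping;
  }

  // Models the suggestion: reuse what the listing call already returned.
  static Map<String, MetaInfo> buildFromListing(FakeFileSystem fs) {
    Map<String, MetaInfo> mapping = new TreeMap<>();
    for (Status listed : fs.listLocatedStatus()) {
      mapping.put(listed.path, new MetaInfo(listed.hosts, listed.len));
    }
    return mapping;
  }

  // Builds a fake segment with the given number of files.
  static FakeFileSystem sampleFileSystem(int fileCount) {
    FakeFileSystem fs = new FakeFileSystem();
    for (int i = 0; i < fileCount; i++) {
      fs.add("/segment/part-" + i + ".carbondata", 1024L * (i + 1), "host-" + (i % 3));
    }
    return fs;
  }

  public static void main(String[] args) {
    FakeFileSystem refetch = sampleFileSystem(100);
    buildWithRefetch(refetch);
    FakeFileSystem listing = sampleFileSystem(100);
    buildFromListing(listing);
    System.out.println("refetch RPCs: " + refetch.rpcCount); // 1 + 3 * 100 = 301
    System.out.println("listing RPCs: " + listing.rpcCount); // 1
  }
}
```

Both methods produce the same map in this model; only the number of round trips differs, which is the point of the suggestion.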

People

Assignee: Unassigned

Reporter: Venugopal Reddy K

Votes: 0

Watchers: 1

Dates

Created: 18/Nov/20 17:46

Updated: 03/Dec/20 14:26

Resolved: 03/Dec/20 14:26

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m