  1. Hadoop Common
  2. HADOOP-16829 Über-jira: S3A Hadoop 3.3.1 features
  3. HADOOP-14159

Add some Java-8 friendly way to work with RemoteIterable, especially listings


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: 3.3.1
    • Component/s: fs
    • Labels: None

    Description

      There's a fair amount of Hadoop code which uses FileSystem.listStatus(path) just to get a FileStatus[] array which it can then iterate over in a for loop.

      This is inefficient and scales badly: the entire listing is completed before any processing starts, so it cannot handle directories with millions of entries.

      The listLocatedStatus() calls return a RemoteIterator, which can't be used in for-each loops because any hasNext() or next() call has the right to throw an IOException. That doesn't matter, as we now have closures and simple stream operations.

       listLocatedStatus(path).filter((st) -> st.length > 0).apply(st -> fs.delete(st.path))
      

      See? We could do shiny new closure things. It wouldn't necessarily need changes to FileSystem either, just something which took a RemoteIterator and let you chain some closures off it, similar to the Java 8 streams operations.
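
      A minimal sketch of what such a wrapper might look like follows. It is not the design that was eventually adopted, only an illustration of the idea above. To keep it compilable without Hadoop on the classpath, it declares a stand-in RemoteIterator interface with the same shape as org.apache.hadoop.fs.RemoteIterator; the names RemoteStream, IOPredicate, IOConsumer, Entry and fromList are all hypothetical, and Entry stands in for FileStatus.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Sketch only: chain closures off a RemoteIterator-style source. */
final class RemoteIterableSketch {

  /** Stand-in with the same shape as org.apache.hadoop.fs.RemoteIterator. */
  interface RemoteIterator<T> {
    boolean hasNext() throws IOException;
    T next() throws IOException;
  }

  /** Hypothetical: a predicate that is allowed to throw IOException. */
  interface IOPredicate<T> {
    boolean test(T t) throws IOException;
  }

  /** Hypothetical: a consumer that is allowed to throw IOException. */
  interface IOConsumer<T> {
    void accept(T t) throws IOException;
  }

  /** Hypothetical stand-in for FileStatus: just a path and a length. */
  static final class Entry {
    final String path;
    final long length;
    Entry(String path, long length) { this.path = path; this.length = length; }
  }

  /** Hypothetical chainable wrapper; entries are pulled lazily, one at a time. */
  static final class RemoteStream<T> {
    private final RemoteIterator<T> source;

    private RemoteStream(RemoteIterator<T> source) { this.source = source; }

    static <T> RemoteStream<T> of(RemoteIterator<T> source) {
      return new RemoteStream<>(source);
    }

    /** Returns a stream which only passes on entries matching the predicate. */
    RemoteStream<T> filter(IOPredicate<T> predicate) {
      final RemoteIterator<T> in = source;
      return new RemoteStream<>(new RemoteIterator<T>() {
        private T pending;
        private boolean ready;

        @Override
        public boolean hasNext() throws IOException {
          // Advance the underlying iterator until a match is found or it ends.
          while (!ready && in.hasNext()) {
            T candidate = in.next();
            if (predicate.test(candidate)) {
              pending = candidate;
              ready = true;
            }
          }
          return ready;
        }

        @Override
        public T next() throws IOException {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          ready = false;
          T result = pending;
          pending = null;
          return result;
        }
      });
    }

    /** Terminal operation: runs the action on every entry, returns the count. */
    long apply(IOConsumer<T> action) throws IOException {
      long count = 0;
      while (source.hasNext()) {
        action.accept(source.next());
        count++;
      }
      return count;
    }
  }

  /** Test helper: exposes an in-memory list as a RemoteIterator. */
  static <T> RemoteIterator<T> fromList(List<T> items) {
    final Iterator<T> it = items.iterator();
    return new RemoteIterator<T>() {
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public T next() { return it.next(); }
    };
  }

  public static void main(String[] args) throws IOException {
    List<Entry> listing = Arrays.asList(
        new Entry("a", 0), new Entry("b", 10), new Entry("c", 3));
    List<String> deleted = new ArrayList<>();
    // Recording the path stands in for fs.delete(st.path).
    long n = RemoteStream.of(fromList(listing))
        .filter(st -> st.length > 0)
        .apply(st -> deleted.add(st.path));
    System.out.println("deleted " + n + " entries: " + deleted);
    // prints: deleted 2 entries: [b, c]
  }
}
```

      Because filter() wraps the iterator rather than collecting into an array, nothing is listed until apply() starts pulling entries, and any IOException raised by the source or by a closure propagates straight out of apply().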

      Once implemented, we can move to using it in the Hadoop code wherever we use listFiles() today.

            People

              Assignee: Unassigned
              Reporter: Steve Loughran (stevel@apache.org)
              Votes: 0
              Watchers: 8
