Hadoop HDFS / HDFS-13752

fs.Path stores file path in java.net.URI, which causes big memory waste


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.6
    • Fix Version/s: None
    • Component/s: fs
    • Labels: None
    • Environment: Hive 2.1.1 and Hadoop 2.7.6

    Description

      I was looking at HiveServer2 memory usage, and a large share of it came from org.apache.hadoop.fs.Path, which stores file paths in a java.net.URI object. The URI implementation stores the same string in 3 different objects (see the attached image). In Hive, when there are many partitions, this causes high memory usage. In my particular case, 42% of memory was used by java.net.URI; keeping a single copy of each string instead of three could reduce that to about 14%.

      I wonder if the community is open to replacing it with a more memory-efficient implementation, and what else should be considered here. It could be a huge memory improvement for Hadoop and for Hive as well.
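
      To make the idea concrete, here is a minimal sketch of one possible direction, not the approach taken in the attached patches: keep a single String per path and build the java.net.URI only when a caller asks for it. The class name CompactPath, its methods, and the example location are hypothetical and are not part of Hadoop; the only real API used is java.net.URI.

```java
import java.net.URI;

public class CompactPathSketch {
    /** Hypothetical holder (not a Hadoop class): retains one String per path. */
    static final class CompactPath {
        // The only long-lived copy of the path text.
        private final String location;

        CompactPath(String location) {
            this.location = location;
        }

        /** Parses a fresh URI on demand; the result is not cached. */
        URI toUri() {
            return URI.create(location);
        }

        @Override
        public String toString() {
            return location;
        }
    }

    public static void main(String[] args) {
        // Example location is made up for illustration.
        CompactPath p = new CompactPath("hdfs://nn:8020/warehouse/t/part=1");
        URI u = p.toUri();
        System.out.println(u.getScheme());
        System.out.println(u.getPath());
    }
}
```

      The trade-off in this sketch is CPU for memory: every toUri() call re-parses the string, so it would only pay off if most stored paths are rarely converted back to a URI.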

      Attachments

        1. HDFSbenchmark.pdf
          833 kB
          Barnabas Maidics
        2. HDFS-13752.003.patch
          13 kB
          Barnabas Maidics
        3. HDFS-13752.002.patch
          10 kB
          Barnabas Maidics
        4. HDFS-13752.001.patch
          11 kB
          Barnabas Maidics
        5. measurement.pdf
          318 kB
          Barnabas Maidics
        6. heapdump-100000partitions.html
          2.02 MB
          Misha Dmitriev
        7. Screen Shot 2018-07-20 at 11.12.38.png
          150 kB
          Barnabas Maidics

        Activity

          People

            Assignee: Barnabas Maidics
            Reporter: Barnabas Maidics
            Votes: 0
            Watchers: 18

            Dates

              Created:
              Updated: