  1. Hadoop HDFS
  2. HDFS-13752

fs.Path stores file paths in java.net.URI, which wastes a lot of memory

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.6
    • Fix Version/s: None
    • Component/s: fs
    • Labels:
      None
    • Environment:

      Hive 2.1.1 and Hadoop 2.7.6

      Description

      I was looking at HiveServer2 memory usage, and a large share of it came from org.apache.hadoop.fs.Path, which stores each file path in a java.net.URI object. The URI implementation stores the same string in 3 different objects (see the attached image). In Hive, when there are many partitions, this causes high memory usage. In my particular case, 42% of memory was used by java.net.URI; if only one copy of the string were kept instead of three, this could be reduced to about 14%.
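
      The duplication described above can be observed from the outside through the public accessors of java.net.URI. The following is a minimal sketch, not code from Hadoop or from the attached patches: the class name UriDuplicationSketch, the helpers pathOnlyUri and sameTextInThreeViews, and the sample partition path are all hypothetical. It only shows that, for a scheme-less path, the full string, the path, and the scheme-specific part carry the same characters; whether they are held as separate String objects internally depends on the JDK version.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriDuplicationSketch {

    // Hypothetical helper: builds a URI that has only a path component
    // (no scheme, authority, query, or fragment).
    static URI pathOnlyUri(String path) throws URISyntaxException {
        return new URI(null, null, path, null, null);
    }

    // Returns true when the full string, the path, and the raw
    // scheme-specific part of the URI all contain the same text.
    static boolean sameTextInThreeViews(URI uri) {
        String full = uri.toString();
        String path = uri.getPath();
        String ssp = uri.getRawSchemeSpecificPart();
        return full.equals(path) && full.equals(ssp);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Sample partition-style path, chosen only for illustration.
        URI uri = pathOnlyUri("/warehouse/sales/year=2018/month=07");
        System.out.println("full string: " + uri);
        System.out.println("path:        " + uri.getPath());
        System.out.println("ssp:         " + uri.getRawSchemeSpecificPart());
        System.out.println("same text in all three: " + sameTextInThreeViews(uri));
    }
}
```

      For a URI with a scheme and authority (for example an hdfs:// location) the three views differ, but the path text is still contained in each of them.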

      I wonder if the community is open to replacing it with a more memory-efficient implementation, and what else should be considered here. It could be a huge memory improvement for Hadoop, and for Hive as well.

        Attachments

        1. Screen Shot 2018-07-20 at 11.12.38.png
          150 kB
          Barnabas Maidics
        2. heapdump-100000partitions.html
          2.02 MB
          Misha Dmitriev
        3. measurement.pdf
          318 kB
          Barnabas Maidics
        4. HDFS-13752.001.patch
          11 kB
          Barnabas Maidics
        5. HDFS-13752.002.patch
          10 kB
          Barnabas Maidics
        6. HDFS-13752.003.patch
          13 kB
          Barnabas Maidics

            People

            • Assignee:
              Unassigned
            • Reporter:
              Barnabas Maidics (b.maidics)
            • Votes:
              0
            • Watchers:
              16