Hadoop HDFS / HDFS-5952

Create a tool to run data analysis on the PB format fsimage

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: tools
    • Labels: BB2015-05-TBR
    • Release Note:
      Bring back the delimited processor and resolve memory problem about offline image viewer on PB-based fsimage.

      Description

      Delimited processor in OfflineImageViewer is no longer supported after HDFS-5698 was merged.
      The motivation for the delimited processor is to run data analysis on the fsimage; therefore, there may be more value in creating a tool for Hive or Pig that reads the PB format fsimage directly.

        Issue Links

          is blocked by: HDFS-6914
          is related to: HDFS-6673

          Activity

          Transition: Open → Patch Available
          Time In Source Status: 223d 11h 47m
          Execution Times: 1
          Last Executer: Hao Chen
          Last Execution Date: 25/Sep/14 13:14
          Vinayakumar B added a comment -

          Any more updates on this tool?

          Would really like to see the tool with new ideas.

          Allen Wittenauer made changes -
          Labels BB2015-05-TBR
          Hadoop QA added a comment -

          -1 overall

          -1 patch (0m 0s): The patch command could not apply the patch during dryrun.

          Patch URL: http://issues.apache.org/jira/secure/attachment/12663638/HDFS-5952.patch
          Optional Tests: javadoc javac unit findbugs checkstyle
          git revision: trunk / f1a152c
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10700/console

          This message was automatically generated.

          Hadoop QA added a comment -

          -1 overall

          -1 patch (0m 0s): The patch command could not apply the patch during dryrun.

          Patch URL: http://issues.apache.org/jira/secure/attachment/12663638/HDFS-5952.patch
          Optional Tests: javadoc javac unit findbugs checkstyle
          git revision: trunk / f1a152c
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10677/console

          This message was automatically generated.

          Lei (Eddy) Xu made changes -
          Link This issue is related to HDFS-6673 [ HDFS-6673 ]
          Hao Chen added a comment -

          Lei (Eddy) Xu: External storage is certainly one of the right options for resolving the RAM consumption problem. Please feel free to take it and move forward.

          Haohui Mai added a comment -

          I've explored the direction of using leveldb in HDFS-6293: https://issues.apache.org/jira/browse/HDFS-6293?focusedCommentId=13989358&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13989358

          Please feel free to take the patch and drive it forward.
          Lei (Eddy) Xu added a comment -

          Hey, Hao Chen and Haohui Mai

          Firstly, thank you for the work you have done.

          I've been looking at writing a tool that uses an external DB (e.g., leveldb) to process the new-style protobuf-based fsimage. Using leveldb can remove the RAM limitation (i.e., loading all inodes into RAM first). This would be more convenient for people who don't want to lose the information in the new image (such as xattrs) but who do want delimited output. It would be great if I could follow Hao Chen's work, and of course I would love to help get this patch in.

          What do you think about this?
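          As a rough sketch of this external-DB idea (assuming the leveldbjni bindings; the class, method names, key layout, and columns below are hypothetical and not from any attached patch), spilling per-inode rows to leveldb instead of holding them in RAM could look like this:

          import static org.fusesource.leveldbjni.JniDBFactory.bytes;
          import static org.fusesource.leveldbjni.JniDBFactory.factory;

          import java.io.File;
          import java.io.PrintStream;
          import java.nio.charset.StandardCharsets;
          import org.iq80.leveldb.DB;
          import org.iq80.leveldb.DBIterator;
          import org.iq80.leveldb.Options;

          /** Hypothetical store that keeps inode rows on disk instead of in RAM. */
          public class LevelDbInodeStore implements AutoCloseable {
            private final DB db;

            public LevelDbInodeStore(File dir) throws Exception {
              // Open (or create) an on-disk leveldb database for the working set.
              db = factory.open(dir, new Options().createIfMissing(true));
            }

            /** Store one inode's delimited row, keyed by its inode id. */
            public void put(long inodeId, String delimitedRow) {
              db.put(bytes(Long.toString(inodeId)), bytes(delimitedRow));
            }

            /** Look up a previously stored row, e.g. while resolving parent paths. */
            public String get(long inodeId) {
              byte[] value = db.get(bytes(Long.toString(inodeId)));
              return value == null ? null : new String(value, StandardCharsets.UTF_8);
            }

            /** Stream every stored row without loading them all into memory. */
            public void dump(PrintStream out) throws Exception {
              try (DBIterator it = db.iterator()) {
                for (it.seekToFirst(); it.hasNext(); it.next()) {
                  out.println(new String(it.peekNext().getValue(), StandardCharsets.UTF_8));
                }
              }
            }

            @Override
            public void close() throws Exception {
              db.close();
            }
          }

          Keying rows by inode id keeps lookups cheap while the bulk of the data stays on disk.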

          Haohui Mai added a comment - edited

          Users can run the legacy oiv to generate the old delimited format. Please see HDFS-6293.

          Hao Chen added a comment -

          For OIV performance-related issues, please refer to https://issues.apache.org/jira/browse/HDFS-6914.

          Hao Chen added a comment -

          I have tested this processor with a large PB-based fsimage of about 8 GiB, which used to consume about 85 GiB of memory and now takes only about 30 GiB (about 30% or less).

          In fact, we are now running this processor in production on all our clusters. It works fine alongside the NameNode without affecting its performance, and we rely on it heavily for daily Hadoop storage management rather than just temporary troubleshooting. So I am certainly willing to bring it back to trunk if it can help others too.

          Lei (Eddy) Xu added a comment -

          Hi, Hao Chen.

          Thank you very much for your work. Bringing back support for the OIV delimited processor will be very helpful. I just have a few small questions.

          Are you going to get it into trunk? Also, have you tried to process a large fsimage (e.g., 16 GB)? Would you mind sharing the results with us?

          Having this functionality back in trunk would be much appreciated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12663638/HDFS-5952.patch
          against trunk revision dff95f7.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8203//console

          This message is automatically generated.

          Hao Chen made changes -
          Status: Open [ 1 ] → Patch Available [ 10002 ]
          Release Note: Bring back the delimited processor and resolve memory problem about offline image viewer on PB-based fsimage.
          Affects Version/s: 2.6.0 [ 12327181 ]
          Affects Version/s: 3.0.0 [ 12320356 ]
          Hao Chen made changes -
          Attachment: HDFS-5952.patch [ 12663638 ]
          Hao Chen added a comment -

          Implemented an OIV delimited processor replacement for Hadoop versions since 2.4.1 that have not been upgraded to 2.5.

          Hao Chen made changes -
          Link This issue is blocked by HDFS-6914 [ HDFS-6914 ]
          Akira AJISAKA made changes -
          Assignee Akira AJISAKA [ ajisakaa ]
          Akira AJISAKA added a comment -

          Thank you for your comment.
          I'm okay with using the XML-based tool, and I don't want to duplicate the code.

          Haohui Mai added a comment -

          Is it okay to use the XML-based tool for debugging? Otherwise you'll end up duplicating the code in PBImageXmlWriter to parse the fsimage.

          Note that the XML / delimited formats are intended to capture all internal details of the fsimage. I understand that the delimited format is more compact than the XML one, but the delimited format does not include a schema, so it can become problematic when the fsimage format changes. Unfortunately, we change the fsimage format quite often.

          If you really want output in the delimited format, I think it might be easier to take the output of PBImageXmlWriter and use SAX to convert the XML into the delimited format. It should work fairly efficiently.
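          As a rough illustration of this SAX route (not part of any attached patch), a streaming converter could collect the child elements of each inode record in PBImageXmlWriter's XML output and print one tab-delimited line per inode. The element names (inode, id, type, name, mtime, permission) and the chosen columns are assumptions; verify them against the actual XML output:

          import java.util.LinkedHashMap;
          import java.util.Map;
          import javax.xml.parsers.SAXParser;
          import javax.xml.parsers.SAXParserFactory;
          import org.xml.sax.Attributes;
          import org.xml.sax.helpers.DefaultHandler;

          /** Streams OIV XML output and emits one tab-delimited line per inode. */
          public class XmlToDelimited {
            public static void main(String[] args) throws Exception {
              SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
              parser.parse(new java.io.File(args[0]), new DefaultHandler() {
                private final Map<String, String> fields = new LinkedHashMap<>();
                private final StringBuilder text = new StringBuilder();
                private boolean inInode = false;

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                  if ("inode".equals(qName)) {   // assumed element name in the XML output
                    inInode = true;
                    fields.clear();
                  }
                  text.setLength(0);
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                  text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                  if ("inode".equals(qName)) {
                    inInode = false;
                    // Emit the chosen columns; missing fields become empty strings.
                    System.out.println(String.join("\t",
                        fields.getOrDefault("id", ""),
                        fields.getOrDefault("type", ""),
                        fields.getOrDefault("name", ""),
                        fields.getOrDefault("mtime", ""),
                        fields.getOrDefault("permission", "")));
                  } else if (inInode) {
                    fields.put(qName, text.toString().trim());
                  }
                }
              });
            }
          }

          Because SAX never builds the whole document in memory, the conversion stays roughly as cheap as a single pass over the XML file.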

          Akira AJISAKA added a comment -

          Rethinking this idea: it is good for data analysis, but not for troubleshooting. It costs too much to run Hive/Pig jobs when a cluster is in trouble.

          Therefore, a tool to dump the fsimage into text format is still needed.
          The tool will output two text files:

          • files/dirs information
          • snapshot diffs

          and users can then analyze the namespace, or lsr the snapshots, with tools such as SQLite.
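          For the SQLite route, a hypothetical loader could bulk-insert the delimited files/dirs dump and then answer ad-hoc questions with plain SQL. It assumes the org.xerial sqlite-jdbc driver is on the classpath, and the column layout (path, user, filesize) is an assumption about the dump format:

          import java.io.BufferedReader;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.PreparedStatement;
          import java.sql.ResultSet;
          import java.sql.Statement;

          /** Hypothetical loader: delimited fsimage dump -> SQLite table for ad-hoc queries. */
          public class LoadFsimageDump {
            public static void main(String[] args) throws Exception {
              try (Connection conn = DriverManager.getConnection("jdbc:sqlite:fsimage.db")) {
                try (Statement st = conn.createStatement()) {
                  st.execute("CREATE TABLE IF NOT EXISTS inode(path TEXT, user TEXT, filesize INTEGER)");
                }
                conn.setAutoCommit(false);
                try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]));
                     PreparedStatement ins = conn.prepareStatement("INSERT INTO inode VALUES (?, ?, ?)")) {
                  String line;
                  while ((line = in.readLine()) != null) {
                    String[] cols = line.split("\t", -1);   // assumed tab-delimited: path, user, filesize
                    ins.setString(1, cols[0]);
                    ins.setString(2, cols[1]);
                    ins.setLong(3, Long.parseLong(cols[2]));
                    ins.addBatch();
                  }
                  ins.executeBatch();
                }
                conn.commit();

                // Example: total bytes per user across the namespace.
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT user, SUM(filesize) FROM inode GROUP BY user")) {
                  while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                  }
                }
              }
            }
          }

          Batching the inserts inside a single transaction keeps the load fast even for large dumps.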

          Akira AJISAKA made changes -
          Parent: HDFS-5863 [ 12692795 ]
          Issue Type: Sub-task [ 7 ] → Improvement [ 4 ]
          Akira AJISAKA made changes -
          Description (original value): Delimited processor is not supported after HDFS-5698 was merged. The processor is useful for analyzing the output by scripts such as pig.
          Description (new value): Delimited processor in OfflineImageViewer is not supported after HDFS-5698 was merged. The motivation of delimited processor is to run data analysis on the fsimage, therefore, there might be more values to create a tool for Hive or Pig that reads the PB format fsimage directly.
          Akira AJISAKA made changes -
          Summary: Implement delimited processor in OfflineImageViewer → Create a tool to run data analysis on the PB format fsimage
          Akira AJISAKA added a comment -

          +1 for the idea! I'll try it.

          Haohui Mai added a comment -

          It is difficult for the delimited and lsr processors to fully support snapshots, because the tools need to load the full inode information into memory. That can be infeasible for fsimages in production (16 GB fsimages are quite common).

          The motivation for the delimited processor is to run data analysis on the fsimage. The design of the PB-based fsimage strives to flatten the hierarchy so that analysis problems can be mapped onto JOIN queries.

          Therefore, there may be more value in creating a tool that reads the PB format directly and dumps the data straight into Hive. Such a tool avoids converting data between protobuf, text, and the database format, which can significantly boost the efficiency of the analysis pipeline.

          Putting the data there also allows getting stats with very little code. For example, the following query checks the usage of different users in a particular directory:

          select user, sum(filesize) from inode where inode.parentId = 'foo' group by user
          
          Akira AJISAKA created issue -

            People

            • Assignee: Unassigned
            • Reporter: Akira AJISAKA
            • Votes: 0
            • Watchers: 7
