Hadoop Common
HADOOP-2113

Add "-text" command to FsShell to decode SequenceFile to stdout

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: fs
    • Labels:
      None

      Description

      FsShell should provide a command to examine SequenceFiles.

      1. 2113-0.patch
        7 kB
        Chris Douglas
      2. 2113-1.patch
        9 kB
        Chris Douglas

        Issue Links

          Activity

          Chris Douglas created issue -
          Chris Douglas added a comment -

          Doing what SequenceFileAsTextRecordReader does, i.e. calling toString() on keys/values from a SequenceFile to output it as text, seems reasonable as a first pass.

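The approach described in the comment above can be sketched as a toy illustration (the real change is Java code inside FsShell; the `dump_as_text` helper and the in-memory record list below are hypothetical stand-ins, not the actual implementation):

```python
# Toy sketch of the "-text" idea: render each (key, value) record as
# tab-separated text via its string conversion, analogous to calling
# toString() on Writable keys/values from a SequenceFile.

def dump_as_text(records):
    """Yield one line of text per (key, value) record."""
    for key, value in records:
        yield f"{str(key)}\t{str(value)}"

# Hypothetical in-memory stand-in for a SequenceFile's records.
records = [(1, "alpha"), (2, "beta")]
for line in dump_as_text(records):
    print(line)
```

Piping such output through head or grep covers the quick sanity checks the command is aimed at.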
          Chris Douglas made changes -
          Field Original Value New Value
          Attachment 2113-0.patch [ 12368535 ]
          Chris Douglas made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Enis Soztutar added a comment -

          I would rather implement a more general set of tools to operate on sequence files. Because sequence files are at the heart of mapred operations, human-interpretable operations on them should be supported by the framework. The set of operations that I can think of includes:

          1. find value given key
          2. find values with keys matching given regex (dump to text file)
          3. dump sequence file to text file (using keyClass.toString() and valueClass.toString())
          4. find pairs in the given key1-key2 range.
          5. dump metadata and statistics of the sf, such as number of records, key range, etc.
          6. ... suggestions ?
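Operations 1, 2, and 4 in the list above amount to filters over the stream of (key, value) records. A minimal Python sketch, with hypothetical helper names and an in-memory record list standing in for a real sequence file:

```python
import re

def find_value(records, key):
    """Operation 1: return the first value whose key matches, else None."""
    for k, v in records:
        if k == key:
            return v
    return None

def match_keys(records, pattern):
    """Operation 2: keep pairs whose key's text form matches a regex."""
    rx = re.compile(pattern)
    return [(k, v) for k, v in records if rx.search(str(k))]

def key_range(records, key1, key2):
    """Operation 4: keep pairs with key1 <= key <= key2."""
    return [(k, v) for k, v in records if key1 <= k <= key2]

# Hypothetical record stream.
records = [(3, "c"), (10, "j"), (25, "y")]
```

For a real MapFile, operation 1 could instead use the index for a direct seek rather than a scan.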
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12368535/2113-0.patch
          against trunk revision r588778.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1015/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1015/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1015/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1015/console

          This message is automatically generated.

          Chris Douglas added a comment - edited

          (core tests failed HADOOP-2112; I assume the contrib tests are unrelated)

          Each of those seem like valuable operations, but piping the output through one's favorite text-processing utility seems very usable. Unless the keys contain tabs, I would expect 1-4 in your list to be pretty straightforward. I agree that the framework could be far more efficient for most operations- particularly for sorted data, which is almost certainly the most common case- and it could also help express "for keys matching this regexp in their string representation, emit them as their native type" (which this cannot), but isn't mapred the correct tool for that job, anyway? The intent was merely to provide an aid to people hoping to check the first few/some subset of values from a given SequenceFile; it aspires to sanity checks, not processing.

          I could see extending stat to support more info, re: (5), though. By "a more general set of tools", what did you have in mind?

          [edit - unintended text effects ]

          Enis Soztutar added a comment -

          By "a more general set of tools", what did you have in mind?

          I am thinking of introducing a new command rather than using FsShell, such as

          bin/hadoop sf <command> <command_args>
          

          and the set of commands would be: findkey, matchkey, dump, stats, etc.

          For some commands, such as finding a record given a key/value, we may check whether the sf is a map file; for other commands like matchkey we may run a distributed grep.

          Each of those seem like valuable operations, but piping the output through one's favorite text-processing utility seems very usable.

          Yes, indeed the outputs of some of the commands should be dumped to stdout. We can add a filename argument and use stdout if "-" is given.

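The proposed `bin/hadoop sf <command> <command_args>` shape, with the "-" convention for stdout, could be sketched like this (a Python stand-in; the command registry and the `dump` command here are hypothetical illustrations, not actual Hadoop code):

```python
import sys

def open_output(filename):
    """Use stdout when the filename argument is "-", as suggested above."""
    return sys.stdout if filename == "-" else open(filename, "w")

COMMANDS = {}  # hypothetical registry: findkey, matchkey, dump, stats, ...

def register(name):
    def wrap(fn):
        COMMANDS[name] = fn
        return fn
    return wrap

@register("dump")
def dump(records, out):
    """Dump (key, value) records as tab-separated text."""
    for k, v in records:
        print(f"{k}\t{v}", file=out)

def run(command, records, filename="-"):
    """Dispatch a subcommand, writing to a file or stdout."""
    out = open_output(filename)
    try:
        COMMANDS[command](records, out)
    finally:
        if out is not sys.stdout:
            out.close()

run("dump", [(1, "a"), (2, "b")])
```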
          Milind Bhandarkar added a comment -

          I find Enis's suggestion very valuable. Having a separate command to operate on sequence files (or rather any InputFormat) would be great. These commands would allow users to specify the InputFormat. They can make only three assumptions: that the "datasets" belonging to a directory all have the same format, are partitioned, and are locally sorted within a partition (in short, produced by a reduce phase). Dumping them as text is a starting point, and more commands can be added later. Comments?

          Andrzej Bialecki added a comment -

          Please take a look at HADOOP-175 and see if that patch could be useful here.

          Enis Soztutar added a comment -

          Thanks, HADOOP-175 could be a good starting point; I wonder why it did not make it into trunk.

          Chris Douglas added a comment -

          I think I've explained this command poorly. It attempts to render whatever exists at a given path as human-readable text. Right now, it includes SequenceFile and gzip formats; it's not trying to stuff a framework for computation on SequenceFiles into FsShell. I agree that such a toolchain should be independent, but this aspires to something else.

          While we're on the subject though, I'm not sure I fully understand the motivation for this command-line tool. Aren't each of those commands easily implemented in map/reduce? As I see it, there are two ways to generalize the operations Enis suggests, since all of WritableComparable is fair game. Either a) everything is first converted to a string or b) the framework can understand that a user-specified InputFormat creating a RecordReader creating a key type comparable to IntWritable should select a comparator for its keys such that the user-supplied "70" is greater than "9" (unless the user actually intends a lexicographic ordering). Not to reveal my opinion.

          In the latter case, code like this belongs in mapred, since merely working out the types is going to be either a hack or a significant effort. In the former case, for more than a single SequenceFile, such code still seems to belong in mapred; that said, piping the output of "text"- as implemented- through a general text-processing utility is a reasonable hack for some purposes. For my purposes, I only needed to check the first few records for some of the output, and this suffices. I don't know why a comparable utility like HADOOP-175 never got committed (it would be a good base, though 1) it relies on UTF8 keys which are currently deprecated and 2) it solves some problems outside the limited domain of this issue), but that no similar utility has been written for the last year makes me wary of over-complicating this. It's for human-readability, not processing.

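The ordering concern in option (b) above, that the user-supplied "70" should compare greater than "9" for an IntWritable-like key unless a lexicographic ordering is intended, is easy to demonstrate:

```python
# String keys sort lexicographically, character by character:
# "100" < "70" < "9", since '1' < '7' < '9'.
keys = ["9", "70", "100"]
assert sorted(keys) == ["100", "70", "9"]

# Interpreted as native integers (as an IntWritable comparator would),
# the same keys sort numerically: 9 < 70 < 100.
assert sorted(keys, key=int) == ["9", "70", "100"]
```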
          Andrzej Bialecki added a comment -

          Some additional functionality was requested for HADOOP-175, and so far it didn't materialize ...

          UTF8 keys in these utilities are used only when the user wants to retrieve specific records by key - and indeed, we can change this to Text - otherwise the tools use whatever classes are declared for keys/values, so from this point of view they don't depend on UTF8.

          Regarding mapred: I use these utilities often, specifically for casual checking of existing data files, and they come especially handy in cases when only DFS is working but mapred might not be available, or when the overhead of starting a mapred job is too high (e.g. dumping the first record of a big SequenceFile).

          Enis Soztutar added a comment -

          I think I've explained this command poorly. It attempts to render whatever exists at a given path as human-readable text.

          Hmm, I guess I had just assumed that the file would be a SequenceFile, in which case the patch dumps all the contents of the file. What I propose is more general for sequence files, but it lacks other file types.

          aren't each of those commands easily implemented in map/reduce?

          Yes, they can be easily implemented as MR jobs, or local jobs, but the framework should include such jobs.

          Now that I understand the original intention, I am OK with the current patch; I suggest we finalize it and continue with HADOOP-175 for SF handling. We can later change the SF dumping code (TextRecordInputStream) in HADOOP-175.

          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Chris Douglas added a comment -

          Failed HADOOP-2112; trying Hudson again

          Chris Douglas made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12368535/2113-0.patch
          against trunk revision r592551.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1072/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1072/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1072/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1072/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          It would be really good if we can have a test case.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Chris Douglas added a comment -

          Added a test case

          Chris Douglas made changes -
          Attachment 2113-1.patch [ 12369578 ]
          Chris Douglas made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur added a comment -

          +1. Code looks good.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12369578/2113-1.patch
          against trunk revision r595406.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1103/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1103/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1103/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1103/console

          This message is automatically generated.

          dhruba borthakur committed 597211 (3 files)
          Reviews: none

          HADOOP-2113. A new shell command "dfs -text" to view the contents of
          a gziped or SequenceFile. (Chris Douglas via dhruba)

          dhruba borthakur added a comment -

          I just committed this. Thanks Chris!

          dhruba borthakur made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hudson added a comment -

          Integrated in Hadoop-Nightly #311 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/311/ )
          dhruba borthakur made changes -
          Link This issue relates to HADOOP-2501 [ HADOOP-2501 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Chris Douglas
              Reporter:
              Chris Douglas
            • Votes: 0
              Watchers: 1
