Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2
    • Component/s: Utilities
    • Labels:
      None

      Description

      I needed to analyse a log file today and I was looking for a ReversedLinesFileReader: A class that behaves exactly like BufferedReader except that it goes from bottom to top when readLine() is called. I didn't find it in IOUtils and the internet didn't help a lot either, e.g. http://www.java2s.com/Tutorial/Java/0180__File/ReversingaFile.htm is a fairly inefficient - the log files I'm analysing are huge and it is not a good idea to load the whole content in the memory.

      So I ended up writing an implementation myself using little memory and the class RandomAccessFile - see attached file. It's used as follows:

      int blockSize = 4096; // only that much memory is needed, no matter how big the file is
      ReversedLinesFileReader reversedLinesFileReader = new ReversedLinesFileReader (myFile, blockSize, "UTF-8"); // encoding is supported
      String line = null;
      while((line=reversedLinesFileReader.readLine())!=null) {
      ... // use the line
      if(enoughLinesSeen)

      { break; }

      }
      reversedLinesFileReader.close();

      I believe this could be useful for other people as well!

        Activity

        Hide
        Gary Gregory added a comment -

        Version 2.2 has been released and addresses this issue.

        Show
        Gary Gregory added a comment - Version 2.2 has been released and addresses this issue.
        Hide
        Lars Kolb added a comment - - edited

        Basically, it would be required to support:

        Text str= new Text();
        FSDataInputStream is= FileSystem.get(conf);
        is.seek(offset);
        
        ReversedLinesReader reader= new ReversedLinesReader(is);
        int bytesConsumed;
        long bytesConsumedTotal=0L;
        
        while(bytesConsumedTotal<treshold && (bytesConsumed=reader.readLine(str))>0)
        {
           //...
           bytesConsumedTotal+= bytesConsumed;
        }
        
        
        public class ReversedLinesReader
        {
           public ReversedLinesReader(InputStream is)
           {
              //simply start reading from (positioned) is
           }
        
           public ReversedLinesReader(File file)
           {
              //current behaviour seek to end of file
           }
        
           public int readLine(Text text)
           {
              //return bytes read and store line in text
              //alternatively one could return a Pair<String,Integer> to not depend on org.apache.hadoop.io.Text
           }
        
           public String readLine()
           {
              //current behaviour 
           }
        }
        
        Show
        Lars Kolb added a comment - - edited Web SVN Basically, it would be required to support: Text str= new Text(); FSDataInputStream is= FileSystem.get(conf); is.seek(offset); ReversedLinesReader reader= new ReversedLinesReader(is); int bytesConsumed; long bytesConsumedTotal=0L; while (bytesConsumedTotal<treshold && (bytesConsumed=reader.readLine(str))>0) { //... bytesConsumedTotal+= bytesConsumed; } public class ReversedLinesReader { public ReversedLinesReader(InputStream is) { //simply start reading from (positioned) is } public ReversedLinesReader(File file) { //current behaviour seek to end of file } public int readLine(Text text) { // return bytes read and store line in text //alternatively one could return a Pair< String , Integer > to not depend on org.apache.hadoop.io.Text } public String readLine() { //current behaviour } }
        Hide
        Georg Henzler added a comment -

        Great to hear that the class is being used already! Can you provide the exact SVN-Location of org.apache.hadoop.util.LineReader? I will try to create a base class for ReversedLinesFileReader that you could use then.

        Show
        Georg Henzler added a comment - Great to hear that the class is being used already! Can you provide the exact SVN-Location of org.apache.hadoop.util.LineReader? I will try to create a base class for ReversedLinesFileReader that you could use then.
        Hide
        Lars Kolb added a comment - - edited

        Works like a charm, thanks Georg.

        I had a similar requirement. I wanted to "chunk-wise" read a HDFS file backwards to allow file browsing similar to the Hadoop namenode's web interface. By clicking a button a user triggers to fetch the previous N lines starting from a specific offset.

        With a few changes to your ReversedLinesFileReader implementation I was able to implement this functionality. I would suggest to extend your ReversedLinesFileReader to be able to operate on InputStreams and to return the number of consumed bytes (i.e. the number of bytes actually read for "line construction" not the number of buffered bytes). This actually results in a Reverse org.apache.hadoop.util.LineReader.

        Show
        Lars Kolb added a comment - - edited Works like a charm, thanks Georg. I had a similar requirement. I wanted to "chunk-wise" read a HDFS file backwards to allow file browsing similar to the Hadoop namenode's web interface. By clicking a button a user triggers to fetch the previous N lines starting from a specific offset. With a few changes to your ReversedLinesFileReader implementation I was able to implement this functionality. I would suggest to extend your ReversedLinesFileReader to be able to operate on InputStreams and to return the number of consumed bytes (i.e. the number of bytes actually read for "line construction" not the number of buffered bytes). This actually results in a Reverse org.apache.hadoop.util.LineReader.
        Hide
        Niall Pemberton added a comment -

        ReversedLinesFileReader should be in the org.apache.commons.io.input package along with all the other InputStreams & Readers, so I have moved it.

        Also fixed some checkstyle issues (removed tabs, javadocs etc)

        Show
        Niall Pemberton added a comment - ReversedLinesFileReader should be in the org.apache.commons.io.input package along with all the other InputStreams & Readers, so I have moved it. Also fixed some checkstyle issues (removed tabs, javadocs etc)
        Hide
        Sebb added a comment -

        Thanks for your perserverance!

        I had problems with committing the .txt files, as SVN tried to add SVN eol-style:native. This would have broken the tests, so they were renamed as .bin and stored as application/octet-stream.

        Made some minor code changes:

        • more final fields
        • added extra BufferedReader comparison tests
        Show
        Sebb added a comment - Thanks for your perserverance! I had problems with committing the .txt files, as SVN tried to add SVN eol-style:native. This would have broken the tests, so they were renamed as .bin and stored as application/octet-stream. Made some minor code changes: more final fields added extra BufferedReader comparison tests
        Hide
        Georg Henzler added a comment -

        Eventually... there's a new version 0.3 attached.

        • The block-spanning newline issue is fixed
        • Comment for the line-end-bytes-array added
        • The last-line-is-empty-behaviour has been aligned with BufferedReader (and there is a test to check the "BufferedReader Compliancy")
        • The tests have been split up in two parametrized ones and one standard junit test.
        Show
        Georg Henzler added a comment - Eventually... there's a new version 0.3 attached. The block-spanning newline issue is fixed Comment for the line-end-bytes-array added The last-line-is-empty-behaviour has been aligned with BufferedReader (and there is a test to check the "BufferedReader Compliancy") The tests have been split up in two parametrized ones and one standard junit test.
        Hide
        Sebb added a comment -

        Sorry, you're correct about needing to convert CR and LF.
        I was forgetting that BufferedReader.readLine() works on the decoded values, so does not need to encode them for comparison.
        Whereas your code works on bytes, and decodes later.

        AFAICT, the code depends on the line end byte arrays being sorted order of descending length.
        This should be documented. Hopefully it's not possible for an encoding to use different lengths for CR and LF!

        Show
        Sebb added a comment - Sorry, you're correct about needing to convert CR and LF. I was forgetting that BufferedReader.readLine() works on the decoded values, so does not need to encode them for comparison. Whereas your code works on bytes, and decodes later. AFAICT, the code depends on the line end byte arrays being sorted order of descending length. This should be documented. Hopefully it's not possible for an encoding to use different lengths for CR and LF!
        Hide
        Georg Henzler added a comment -

        Sorry for the spurious files, i created the zip with the default utility in OS-X.

        I think the code addresses most of your questions already:

        • There are a few tests already (testXxxxSmallBlockSize()) that test the multi-block behaviour for lines that span a block (you can go down to a block size of 1 and it still works, that shows that the algorithm is solid I think)
        • I think it's clever to encode the newline characters - that way we automatically get the correct byte sequence for multi byte encodings (e.g. UTF-16) and if a one byte-per-char-encoding chose to use different bytes it would also work (performance is no issue for this as it happens only once)
        • I think about getNewLineMatchByteCount() to make it more efficient - although for the standard ISO case it ends up just being four byte comparisons instead of three. Should make almost no difference but on the pro side it makes the implementation nicely generic.
        • It's true, there is an issue with block-spanning newlines to be fixed. If a windows newline (\r\n) happens to span a block a wrong extra empty line will be returned.

        I'll provide a fix for the newline problem and will change totalBlockCount to long.

        Show
        Georg Henzler added a comment - Sorry for the spurious files, i created the zip with the default utility in OS-X. I think the code addresses most of your questions already: There are a few tests already (testXxxxSmallBlockSize()) that test the multi-block behaviour for lines that span a block (you can go down to a block size of 1 and it still works, that shows that the algorithm is solid I think) I think it's clever to encode the newline characters - that way we automatically get the correct byte sequence for multi byte encodings (e.g. UTF-16) and if a one byte-per-char-encoding chose to use different bytes it would also work (performance is no issue for this as it happens only once) I think about getNewLineMatchByteCount() to make it more efficient - although for the standard ISO case it ends up just being four byte comparisons instead of three. Should make almost no difference but on the pro side it makes the implementation nicely generic. It's true, there is an issue with block-spanning newlines to be fixed. If a windows newline (\r\n) happens to span a block a wrong extra empty line will be returned. I'll provide a fix for the newline problem and will change totalBlockCount to long.
        Hide
        Sebb added a comment -

        Good to know that it's easy to unambiguously detect CR and LF.

        There seems to be a lot of spurious files in the zip archive.

        I'm not sure that the getNewLineMatchByteCount() is as efficient as BufferedReader.readLine() - it seems to process characters multiple times. It could probably be improved by just checking current and previous chars. Also, I don't think it's necessary to encode \n or \r - just use the appropriate characters.

        There are no tests for multi-block files where there may be lines spanning blocks.
        Indeed the CRLF pair may span blocks; I'm not convinced that the code handles that correctly.
        In order for getNewLineMatchByteCount() to detect all CRLF pairs, it generally needs at least 2 characters to be present; this does not seem to be guaranteed.

        Note: could use a smaller block size to make the test files smaller; probably sensible to compare the results with a forward line reader. It would then be simple to have a directory of various different test files - read the file forward and store the lines; ensure that the reverse reader matches the reversed lines.

        The field totalBlockCount needs to be a long, not an int.

        Might simplify the code to use empty arrays rather than null.

        Show
        Sebb added a comment - Good to know that it's easy to unambiguously detect CR and LF. There seems to be a lot of spurious files in the zip archive. I'm not sure that the getNewLineMatchByteCount() is as efficient as BufferedReader.readLine() - it seems to process characters multiple times. It could probably be improved by just checking current and previous chars. Also, I don't think it's necessary to encode \n or \r - just use the appropriate characters. There are no tests for multi-block files where there may be lines spanning blocks. Indeed the CRLF pair may span blocks; I'm not convinced that the code handles that correctly. In order for getNewLineMatchByteCount() to detect all CRLF pairs, it generally needs at least 2 characters to be present; this does not seem to be guaranteed. Note: could use a smaller block size to make the test files smaller; probably sensible to compare the results with a forward line reader. It would then be simple to have a directory of various different test files - read the file forward and store the lines; ensure that the reverse reader matches the reversed lines. The field totalBlockCount needs to be a long, not an int. Might simplify the code to use empty arrays rather than null.
        Hide
        Georg Henzler added a comment -

        I didn't really like the "inefficient" idea of using the BufferedReader to get around the encoding issue... so I read up about encodings in general and I think the solution as provided works for all one byte encodings, UTF-8, UTF-16BE/UTF-16LE and Shift-JIS. For other encodings at the moment an exception is thrown to be on the safe side (but this can be easily extended).

        Also, now \r alone is treated as newline as well.

        PS: Added Apache headers and removed author tag.

        Show
        Georg Henzler added a comment - I didn't really like the "inefficient" idea of using the BufferedReader to get around the encoding issue... so I read up about encodings in general and I think the solution as provided works for all one byte encodings, UTF-8, UTF-16BE/UTF-16LE and Shift-JIS. For other encodings at the moment an exception is thrown to be on the safe side (but this can be easily extended). Also, now \r alone is treated as newline as well. PS: Added Apache headers and removed author tag.
        Hide
        Sebb added a comment -

        BufferedReader.readLine() allows for LF, CRLF and CR line termination; perhaps this should too?

        What about multi-byte encodings?
        Can these ever have CR or LF as part of a multi-byte character?
        Won't the processing fail in such cases?
        If so, is that a restriction, or can it be fixed?

        I'm wondering whether it would be possible to use BufferedReader.readLine() to scan forward through the buffer, saving up lines as it goes, and then return the lines in reverse order. This would use a bit more memory, but would re-use the readLine processing which is well-tested. It can also allow for the encoding, though there is still a potential issue where a buffer happens to start in the middle of a multi-byte character. It might be possible to check the first few bytes of the buffer and adjust the start offset if necessary.

        We don't use @author tags in code; contributors are credited on the website instead.
        Also, files should have AL headers please.

        Show
        Sebb added a comment - BufferedReader.readLine() allows for LF, CRLF and CR line termination; perhaps this should too? What about multi-byte encodings? Can these ever have CR or LF as part of a multi-byte character? Won't the processing fail in such cases? If so, is that a restriction, or can it be fixed? I'm wondering whether it would be possible to use BufferedReader.readLine() to scan forward through the buffer, saving up lines as it goes, and then return the lines in reverse order. This would use a bit more memory, but would re-use the readLine processing which is well-tested. It can also allow for the encoding, though there is still a potential issue where a buffer happens to start in the middle of a multi-byte character. It might be possible to check the first few bytes of the buffer and adjust the start offset if necessary. We don't use @author tags in code; contributors are credited on the website instead. Also, files should have AL headers please.
        Hide
        Georg Henzler added a comment -

        Improved javadoc and unit tests. Fixed empty file bug.

        Show
        Georg Henzler added a comment - Improved javadoc and unit tests. Fixed empty file bug.
        Hide
        Georg Henzler added a comment -

        @Gary: Let me know if you need anything more

        Show
        Georg Henzler added a comment - @Gary: Let me know if you need anything more
        Hide
        Georg Henzler added a comment -

        Attached simple maven project with java class and unit tests.

        Show
        Georg Henzler added a comment - Attached simple maven project with java class and unit tests.
        Hide
        Georg Henzler added a comment - - edited

        I will provide you with Unit tests if you're interested - I tested the class with a main class and different input files but it's easy to create Unit tests from the main class.

        Regarding your questions:

        • The Tailer class listens to file changes (as the unix tail does) and notifies a provided Listener passing the added line. The ReverseFileReader starts at the last line of a file and moves towards the start of the file (ignoring added lines after it has been instantiated).
        • We could subclass FileReader but I'm not sure how to implement e.g. read(char[] cbuf, int off, int len)... implementing this going backward would be hard. Mixing going forward and backward is probably not really intuitive. I would suggest that if we implement FileReader, we throw a UnsupportedOperationException for most of the Reader inferface's methods.
        • I'm not sure of the Filename... is BufferedReverseFileReader a better name to emphasize on the fact that it's all about the method readLine()? Any other name suggestions?
        Show
        Georg Henzler added a comment - - edited I will provide you with Unit tests if you're interested - I tested the class with a main class and different input files but it's easy to create Unit tests from the main class. Regarding your questions: The Tailer class listens to file changes (as the unix tail does) and notifies a provided Listener passing the added line. The ReverseFileReader starts at the last line of a file and moves towards the start of the file (ignoring added lines after it has been instantiated). We could subclass FileReader but I'm not sure how to implement e.g. read(char[] cbuf, int off, int len)... implementing this going backward would be hard. Mixing going forward and backward is probably not really intuitive. I would suggest that if we implement FileReader, we throw a UnsupportedOperationException for most of the Reader inferface's methods. I'm not sure of the Filename... is BufferedReverseFileReader a better name to emphasize on the fact that it's all about the method readLine()? Any other name suggestions?
        Hide
        Georg Henzler added a comment -

        Sorry, I didn't notice the radio buttons when I first uploaded...

        Show
        Georg Henzler added a comment - Sorry, I didn't notice the radio buttons when I first uploaded...
        Hide
        Gary Gregory added a comment -

        Hi Georg,

        We cannot consider this for consideration without:

        • Grant of license to Apache which is done when an attachment is uploaded.
        • Unit tests.

        Questions:

        • How does this fit in with the Tailer class. Should one reuse the other?
        • Should this be a real Reader implementation? Subclassing FileReader?
        Show
        Gary Gregory added a comment - Hi Georg, We cannot consider this for consideration without: Grant of license to Apache which is done when an attachment is uploaded. Unit tests. Questions: How does this fit in with the Tailer class. Should one reuse the other? Should this be a real Reader implementation? Subclassing FileReader?

          People

          • Assignee:
            Unassigned
            Reporter:
            Georg Henzler
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development