Issue Details (XML | Word | Printable)

Key: HADOOP-4226
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Yuri Pradkin
Reporter: Yuri Pradkin
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

LineReader::readLine cleanup

Created: 19/Sep/08 08:28 PM   Updated: 08/Jul/09 04:53 PM
Return to search
Component/s: None
Affects Version/s: 0.19.0
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-4226.patch 2008-09-29 08:38 PM Yuri Pradkin 12 kB
Text File Licensed for inclusion in ASF works HADOOP-4226.patch 2008-09-26 06:03 PM Yuri Pradkin 11 kB

Hadoop Flags: Reviewed
Resolution Date: 29/Sep/08 10:38 PM


 Description  « Hide
I've been looking at HADOOP-4010 and realized that readLine is pretty convoluted. I changed the implementation which made it hopefully a little easier to read/validate/understand.

I've had some problems testing it locally, so I'll submit it for Hudson to test.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tsz Wo (Nicholas), SZE added a comment - 19/Sep/08 08:50 PM
Forgot to attach a file?

Yuri Pradkin added a comment - 19/Sep/08 08:50 PM
sorry, I did not realized that readLine moved to LineReader in util, deleted old patch, submitted new.

Hadoop QA added a comment - 20/Sep/08 01:18 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390539/LineReader.java.patch
against trunk revision 697306.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3338/console

This message is automatically generated.


Yuri Pradkin added a comment - 20/Sep/08 05:27 AM
another snafu, sorry, reuploaded the patch

Yuri Pradkin added a comment - 22/Sep/08 01:21 AM
What do I need to do to make this patch available (for Hudson to test)? I've marked it "in Progress", re-attached the patch. Now I don't even have "Submit patch", the only available workflow action is "resolve issue". This web interface is a little srewy.

Hadoop QA added a comment - 22/Sep/08 05:29 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390601/HADOOP-4226.patch
against trunk revision 697306.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3344/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3344/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3344/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3344/console

This message is automatically generated.


Yuri Pradkin added a comment - 22/Sep/08 05:55 PM
I think that failed test wasn't related to my changes at all, probably diffed against failing snapshot. Here it is again with minor adjustments

Hadoop QA added a comment - 22/Sep/08 08:28 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390667/HADOOP-4226.patch
against trunk revision 697306.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3345/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3345/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3345/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3345/console

This message is automatically generated.


Yuri Pradkin added a comment - 24/Sep/08 08:31 PM
OK, let me respond to Hudson:

1. Tests: since this is not new code, I don't feel obligated to provide more tests. There already exist tests for TextInputFormat that are already checking this code; these tests pass.

2. Two failed tests: I've re-run these tests on my box, they pass (except TestLeaseRecovery2 throws exception closing an already closed file, but it still passes) on my system. What's up with that? If you could point me to a "clean" revision, I'll be happy to generate a patch against that. Otherwise I think we're OK here.

I might make one more little change to the code to make it a little cleaner, but otherwise I think the patch is ready. I urge commiters to look at it and let me know what they think. I'd hate my time to be wasted, so please comment.


Chris Douglas added a comment - 24/Sep/08 10:34 PM

I might make one more little change to the code to make it a little cleaner, but otherwise I think the patch is ready. I urge commiters to look at it and let me know what they think. I'd hate my time to be wasted, so please comment.

If you're going to change the code again, a review of the final version is the best use of everyone's time. Also: it's only been two days. This is a heavily used piece of code with a lot of corner cases, so this refactoring is risky, low-priority, and non-trivial to review. Please try to be patient.

Tests: since this is not new code, I don't feel obligated to provide more tests. There already exist tests for TextInputFormat that are already checking this code; these tests pass.

TestTextInputFormat is testing LineRecordReader.LineReader, not o.a.h.util.LineReader, to which this patch applies.

  • Per the guidelines, don't use tabs
  • Why track str.getLength() with a local var?
  • This shouldn't append stray \r characters to the end of the string. This should treat \r, \n, and \r\n as delimiters, as in the original. If there isn't a unit test validating this, writing it would be helpful.
  • This appears to regard \r as a valid character (newLineLength is reset for each char read, and only relevant when followed by \n), but the original does not.

Yuri Pradkin added a comment - 26/Sep/08 01:04 AM
Thanks for your input.

TestTextInputFormat is testing LineRecordReader.LineReader, not o.a.h.util.LineReader, to which this patch applies.

OK. BTW, are you planning to switch LRR to use util.LineReader?

Why track str.getLength() with a local var?

Only as an optimization; for this little piece of code I think it's OK. It is (or going to be) heavily used, as you say.

This appears to regard \r as a valid character (newLineLength is reset for each char read, and only relevant when followed by \n), but the original does not.

Sorry, fixed. It was not obvious from looking at the code.

If there isn't a unit test validating this, writing it would be helpful.

Done. Beating the hell out of it.


Chris Douglas added a comment - 26/Sep/08 04:34 AM

OK. BTW, are you planning to switch LRR to use util.LineReader?

HADOOP-4269 fixed this; trunk runs the unit tests from TextInputFormat over o.a.h.util.LineReader, as intended. Sorry, the current patch no longer applies to trunk.

Only as an optimization; for this little piece of code I think it's OK. It is (or going to be) heavily used, as you say.

nod Fair enough. I think it makes the code slightly harder to read for a nominal performance gain that the JIT compiler should cover, but it's not significant.

Sorry, fixed. It was not obvious from looking at the code.

It's not obvious that it should work that way at all, but backwards compatibility is a big deal for this class.

  • Unless there are new tests in TestLineReader, that class or the redundant tests in TestTextInputFormat should be removed. A test case verifying that streams ending in \r behave like those ending in \n would be a good addition to this patch. Sorry, "this" was ambiguous in my last comment.
  • The default buffer size doesn't need to be public (resolved in trunk)
  • CR and LF don't need to be constants; they should be private at least
  • Instead of
    +        if (bufferLength <= bufferPosn)
    +          break; // EOF
    

    It might make more sense to check bufferLength <= 0, to which bufferPosn is set

  • Instead of incrementing:
    +        if (prevCharCR) { //CR + notLF, we are at notLF
    +          ++newlineLength;
    

    Shouldn't this always set newLineLength = 1?

  • If buffer ends in \r and the following segment starts with \r, it looks like this may not separate those lines.

Yuri Pradkin added a comment - 26/Sep/08 03:57 PM
Patched against current trunk.

If buffer ends in \r and the following segment starts with \r, it looks like this may not separate those lines.

I've modified tests in TestInputFormat to do additional beating on readLine. One of the tests validates the \r\r sequence. Buffer size varies and at least a couple of times we should have a read in between of the two \r's.

I've fixed all other nits that you commented on.

It's not obvious that it should work that way at all, but backwards compatibility is a big deal for this class.

It probably should, as it turns out \r is a newline on older macs and commodore, I think.


Yuri Pradkin added a comment - 26/Sep/08 05:59 PM
forgot to change DEFAULT_BUFFER_SIZE to private

Hadoop QA added a comment - 27/Sep/08 02:47 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12391039/HADOOP-4226.patch
against trunk revision 699517.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3384/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3384/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3384/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3384/console

This message is automatically generated.


Chris Douglas added a comment - 27/Sep/08 03:12 AM
  • A quick note on policy: old patches are not deleted, as they orphan the discussion about them.
  • The improvements to TestTextInputFormat are very good; thanks for these
  • This is clearer and better documented than the existing code. However, in reviewing corner cases, the reasoning was sometimes unclear (e.g. deferring the increment of bytesConsumed for CR (clever, by the way)). If this could include a short summary of the strategy employed and a few more comments, it would be more readily understandable the next time a change is proposed.
  • CR and LF should be final, as in the previous patch

With a little more documentation, I think this is good to commit.


Yuri Pradkin added a comment - 29/Sep/08 03:27 PM

old patches are not deleted, as they orphan the discussion about them.

Sorry about that: started deleting inconsequential patches and got carried away.

If this could include a short summary of the strategy employed and a few more comments, it would be more readily understandable the next time a change is proposed.

I'm not a java programmer, so I could use some advice here on where would such summary go: javadoc comments (/*...*/) or the policy is to keep them short and the longer comments should go somewhere else?

CR and LF should be final, as in the previous patch

If people were to re-use this class with different newline separators, they might find it useful to be able to change CR and LF. Maybe they should not be static and be public, so they can simply be assigned to?


Chris Douglas added a comment - 29/Sep/08 07:04 PM

I'm not a java programmer, so I could use some advice here on where would such summary go

Though we have some standards, these in particular aren't strictly followed. The vast, vast majority of users shouldn't see this, so putting it in the javadoc is probably unnecessary. A sketch of the algorithm could follow the declaration/signature of the method in a few lines of block comments. After that, the only pieces that require comments are those that aren't obvious, e.g. why bytesConsumed can't terminate the loop if we only have \r.

If people were to re-use this class with different newline separators, they might find it useful to be able to change CR and LF

It's pretty specialized code... I'd much rather lock it down until we have a use case. Having non-final, public static fields is generally considered bad practice. It makes the code unverifiable in isolation and admits some very difficult bugs (and races if two classes assign different values to CR and LF).


Yuri Pradkin added a comment - 29/Sep/08 08:42 PM
added comments, CR,LF are private final static.

Chris Douglas added a comment - 29/Sep/08 10:38 PM
+1 Looks good

I just committed this. Thanks, Yuri


Hudson added a comment - 01/Oct/08 01:29 PM
Integrated in Hadoop-trunk #620 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/620/)
. Refactor and document LineReader to make it more readily
understandable. Contributed by Yuri Pradkin.