Hadoop Common / HADOOP-1381

The distance between sync blocks in SequenceFiles should be configurable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 3.0.0-alpha2
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      The default sync interval within new SequenceFile writes is now 100KB, up from the older default of 2000B. The sync interval is now also manually configurable via the SequenceFile.Writer API.

      Description

      Currently, SequenceFiles put in sync blocks every 2,000 bytes. It would be much better if this were configurable, with a much higher default (1 MB or so?).
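The mechanism under discussion can be sketched as follows. This is an illustrative model only: the class name, the 20-byte marker, and the 100 KB fallback are assumptions for the sketch, not SequenceFile's actual internals.

```java
import java.io.ByteArrayOutputStream;

// Illustrative model of interval-based sync markers. All names here
// (SyncSketchWriter, SYNC_SIZE, the 100 KB fallback) are invented for
// this sketch and do not mirror SequenceFile's real implementation.
class SyncSketchWriter {
  static final int SYNC_SIZE = 20;              // assumed marker length
  private final byte[] marker = new byte[SYNC_SIZE];
  private final int syncInterval;               // configurable distance
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private long lastSyncEnd = 0;
  private int markers = 0;

  SyncSketchWriter(int syncInterval) {
    // Guard against nonsensical values: fall back to 100 KB if negative.
    this.syncInterval = syncInterval < 0 ? 100 * 1024 : syncInterval;
  }

  void append(byte[] record) {
    out.write(record, 0, record.length);
    // The sync condition is checked after every record append, so an
    // interval of 0 yields a marker after each record.
    if (out.size() >= lastSyncEnd + syncInterval) {
      out.write(marker, 0, marker.length);
      lastSyncEnd = out.size();
      markers++;
    }
  }

  int markerCount() { return markers; }
}
```

Because the check runs after each append rather than on a byte boundary inside a record, records are never split by a marker; the interval is a lower bound on the distance between markers, which is why the edge cases (0 and negative values) come up later in the thread.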

      Attachments

      1. HADOOP-1381.r5.diff (9 kB, Harsh J)
      2. HADOOP-1381.r5.diff (9 kB, Harsh J)
      3. HADOOP-1381.r4.diff (9 kB, Harsh J)
      4. HADOOP-1381.r3.diff (9 kB, Harsh J)
      5. HADOOP-1381.r2.diff (8 kB, Harsh J)
      6. HADOOP-1381.r1.diff (6 kB, Harsh J)


          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10894 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10894/)
          HADOOP-1381. The distance between sync blocks in SequenceFiles should be (harsh: rev 07825f2b49384dbec92bfae87ea661cef9ffab49)

          • (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileSync.java
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java
          githubbot ASF GitHub Bot added a comment -

          Github user QwertyManiac closed the pull request at:

          https://github.com/apache/hadoop/pull/147

          githubbot ASF GitHub Bot added a comment -

          Github user QwertyManiac commented on the issue:

          https://github.com/apache/hadoop/pull/147

          Fixed via 07825f2b49384dbec92bfae87ea661cef9ffab49

          qwertymaniac Harsh J added a comment -

          Thank you Akira Ajisaka! Pushed to trunk.

          ajisakaa Akira Ajisaka added a comment -

          LGTM, +1 for trunk.

          qwertymaniac Harsh J added a comment -

           Akira Ajisaka / Tsuyoshi Ozawa - Could you help take a look at this one?

          qwertymaniac Harsh J added a comment -

          Fixed all javac and checkstyle issues except this one:

          ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java:1264:    void init(Configuration config, FSDataOutputStream outStream,:10: More than 7 parameters (found 8).
          

          This can only be cleaned up when the deprecated API constructors are removed.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 14s trunk passed
          +1 compile 8m 22s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 56s trunk passed
          +1 mvneclipse 0m 12s trunk passed
          +1 findbugs 1m 20s trunk passed
          +1 javadoc 0m 41s trunk passed
          +1 mvninstall 0m 36s the patch passed
          +1 compile 6m 51s the patch passed
          -1 javac 6m 51s root generated 2 new + 700 unchanged - 3 fixed = 702 total (was 703)
          -0 checkstyle 0m 27s hadoop-common-project/hadoop-common: The patch generated 5 new + 266 unchanged - 22 fixed = 271 total (was 288)
          +1 mvnsite 0m 59s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 31s the patch passed
          +1 javadoc 0m 41s the patch passed
          +1 unit 8m 50s hadoop-common in the patch passed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          42m 24s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HADOOP-1381
          GITHUB PR https://github.com/apache/hadoop/pull/147
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 68fda58a8023 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9f32364
          Default Java 1.8.0_101
          findbugs v3.0.0
          javac https://builds.apache.org/job/PreCommit-HADOOP-Build/10905/artifact/patchprocess/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10905/artifact/patchprocess/diff-checkstyle-hadoop-common-project_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10905/testReport/
          modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10905/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          githubbot ASF GitHub Bot added a comment -

          GitHub user QwertyManiac opened a pull request:

          https://github.com/apache/hadoop/pull/147

          HADOOP-1381. The distance between sync blocks in SequenceFiles should…

          … be configurable rather than hard coded to 2000 bytes.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/QwertyManiac/hadoop HADOOP-1381

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/hadoop/pull/147.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #147


          commit dbfd4090c2d97f6dfd984c3d77ed9b78b7ea1a93
          Author: Harsh J <harsh@cloudera.com>
          Date: 2016-10-26T14:34:33Z

          HADOOP-1381. The distance between sync blocks in SequenceFiles should be configurable rather than hard coded to 2000 bytes.


          owen.omalley Owen O'Malley added a comment -

          The comment is wrong. You've changed the default to 20 * 100k, i.e. 2 MB. Actually, I'd prefer a smaller default of 0.5 to 1 MB. In any case, the comments should match the code.

          Thanks for getting back to this one.

          qwertymaniac Harsh J added a comment -

          Owen/Todd/Others,

          Since I've addressed the comments offered here, I'm going to commit this by Monday EOD unless there are any further comments.

          Otherwise looks good.

          Thanks!

          hsn Radim Kolar added a comment -

          A default 1 MB interval should be fine.

          qwertymaniac Harsh J added a comment -

          I do think this would be useful to be able to control, especially given that the use of sequence files is already prevalent.

          qwertymaniac Harsh J added a comment -

          Any further comments on the addition?

          qwertymaniac Harsh J added a comment -

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          This is HADOOP-8359, not this patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525761/HADOOP-1381.r5.diff
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/944//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/944//console

          This message is automatically generated.

          qwertymaniac Harsh J added a comment -

          Patch fortunately still applies with some auto-offset. Rebased patch to trunk.

          Owen/Todd - final comments?

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12507830/HADOOP-1381.r5.diff
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/769//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/769//console

          This message is automatically generated.

          qwertymaniac Harsh J added a comment -

          Todd/others, are there any other comments you'd like me to address?

          qwertymaniac Harsh J added a comment -

          Added a fallback in case the provided value is < 0.

          qwertymaniac Harsh J added a comment -

          Yes, it would end up writing a marker after each record, as the sync-writing condition is checked after every record append.

          tlipcon Todd Lipcon added a comment -

          hm, if you set it to 0, it will write a sync marker between every record?

          Don't worry about renaming the variable, seems like too much effort for little gain.

          qwertymaniac Harsh J added a comment -

          Todd,

          • The sync interval can be arbitrary, I think; it can even be 0. It should not be negative, so I'll add a check for that instead. Or do you think it's better if we limit the interval to a minimum? Writer tests pass with 0, no problem.
          • SYNC_INTERVAL is being used by MAPREDUCE right now, and I'll have to carry this out as a cross-project JIRA+patch.
          tlipcon Todd Lipcon added a comment -

          Two minor nits:

          • Can you add a check in the Writer constructor that the syncInterval option is valid? I think the minimum value would be SYNC_SIZE?
          • Can you rename SYNC_INTERVAL to DEFAULT_SYNC_INTERVAL or SYNC_INTERVAL_DEFAULT? Even though it's currently public, I don't think this would be considered a public API, so changing it seems alright.

          Otherwise looks good.
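Todd's first nit, a constructor-time check with SYNC_SIZE as the floor, might look something like this. A hedged sketch: the names and the clamp-rather-than-throw policy are assumptions, not the committed code.

```java
// Hypothetical sketch of a constructor-time validity check: a sync
// interval smaller than the marker itself is meaningless, so use the
// marker size as the floor. Names here are illustrative only.
class SyncIntervalCheck {
  static final int SYNC_SIZE = 20; // assumed sync-marker length in bytes

  static int validate(int requested) {
    // One possible policy: clamp rather than throw.
    return Math.max(SYNC_SIZE, requested);
  }
}
```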

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12492885/HADOOP-1381.r4.diff
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/131//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/131//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-auth.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/131//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/131//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-annotations.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/131//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-auth-examples.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/131//console

          This message is automatically generated.

          qwertymaniac Harsh J added a comment -

          Patch available for review and commit (to 0.23+).

          qwertymaniac Harsh J added a comment -

          + Doc fixes

          qwertymaniac Harsh J added a comment -

          + A few doc updates.

          qwertymaniac Harsh J added a comment -
          • Fixed the 100 MB mistake. It was supposed to be 100 KB; it got over-multiplied.
          • Fixed the boxed integer issue. Derived from IntegerOption.
          • Removed the constructors' internal ports to the new API. That was done for unification, yes, and unrelated. This new patch directly adds defaults to the other constructors, which was unnecessary with the unification approach.
          owen.omalley Owen O'Malley added a comment -

          I agree with Todd's points.

          100mb is too big. Please use something between 100kb and 1mb as the default.

          You should derive from IntegerOption for SyncIntervalOption.
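The subclassing Owen suggests follows the option pattern below. This is a minimal self-contained analogue: IntegerOption here is a stand-in for Hadoop's real Options.IntegerOption, and the class shapes are assumptions for illustration.

```java
// Minimal stand-in for the option pattern: a value-carrying base class
// and a typed subclass that acts as a marker for one specific setting.
// These classes only mimic the Options.IntegerOption idiom; they are
// not Hadoop's actual types.
class IntegerOption {
  private final int value;
  protected IntegerOption(int value) { this.value = value; }
  int getValue() { return value; }
}

class SyncIntervalOption extends IntegerOption {
  SyncIntervalOption(int value) { super(value); }
}
```

Deriving from the shared base keeps the value-handling code in one place, while the subclass type lets a varargs createWriter-style method recognize which setting the caller is passing.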

          tlipcon Todd Lipcon added a comment -
          • why use a boxed Integer for the sync interval instead of a normal int?
          • the javadoc says 100KB, but the constant value seems to be set to 100MB
          • do we have to change the other constructors? seems like unrelated cleanup, probably best to leave them alone since they're deprecated anyway
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12486748/HADOOP-1381.r2.diff
          against trunk revision 1147317.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/738//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/738//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/738//console

          This message is automatically generated.

          qwertymaniac Harsh J added a comment -

          Ensured all sequence file cases pass.

              [junit] Running org.apache.hadoop.io.TestSequenceFile
              [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 14.896 sec
              [junit] Running org.apache.hadoop.io.TestSequenceFileSerialization
              [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.568 sec
              [junit] Running org.apache.hadoop.io.TestSequenceFileSync
              [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.681 sec
          

          + Made changes to deprecated constructors to use the valid one instead of calling init() code.

          qwertymaniac Harsh J added a comment -

          Sorry, messed up a bit and uploaded a defunct patch. Attaching clean patch post-comment.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12486743/HADOOP-1381.r1.diff
          against trunk revision 1147317.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.io.TestSequenceFileSync

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/735//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/735//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/735//console

          This message is automatically generated.

          qwertymaniac Harsh J added a comment -

          (Avro data files support a configurable sync interval as well; see AVRO-719 and related issues.)

          qwertymaniac Harsh J added a comment -

          Patch that

          • Adds a configurable sync interval option
          • Updates the default sync interval to 100 KB
          • Fixes the sync test case to work with it
          • Adds a constructor test case that was missed after HADOOP-6856
          • Removes an unnecessary println
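To put a number on the default change above, here is a toy back-of-envelope of how sync-marker frequency translates into file-size overhead. This is a standalone sketch, not SequenceFile code; the 20-byte marker size (a 4-byte escape plus a 16-byte hash) is an assumption for illustration.

```java
// Toy model of sync-marker bookkeeping: one marker is emitted per full
// `syncInterval` bytes of record data. (Simplified; real SequenceFile
// writers track bytes since the last marker rather than dividing up front.)
public class SyncIntervalSketch {
    static final int SYNC_SIZE = 20; // assumed marker size in bytes

    static long markersFor(long dataBytes, int syncInterval) {
        return dataBytes / syncInterval; // one marker per full interval
    }

    static double overheadPercent(long dataBytes, int syncInterval) {
        return 100.0 * markersFor(dataBytes, syncInterval) * SYNC_SIZE / dataBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        System.out.printf("2000 B interval: %.2f%% overhead%n",
                overheadPercent(oneGb, 2000));       // the old default: ~1%
        System.out.printf("100 KB interval: %.3f%% overhead%n",
                overheadPercent(oneGb, 100 * 1024)); // the new default: ~0.02%
    }
}
```

This is the intuition behind the 100 KB choice: the per-marker cost stays fixed, so widening the interval by 50x cuts the space overhead by the same factor.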
          owen.omalley Owen O'Malley added a comment -

          As long as it is configurable, 100k as the default would be fine.

          cutting Doug Cutting added a comment -

          This sounds like a time/space tradeoff. We currently have a 1% space overhead. You've proposed a 1% time overhead (with 1MB syncs and 100MB block size, splits would overlap the end of the block by 1%, forcing 1% non-local reads). Perhaps there's a sweet spot somewhere between? 100k?
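The tradeoff in this comment can be worked through numerically. The sketch below is a back-of-envelope only, with assumed numbers (20-byte marker, 100 MB filesystem block) and the pessimistic assumption that a split reader overshoots its block by one full interval before hitting the next marker:

```java
// Back-of-envelope for the time/space tradeoff: larger intervals cost
// less file space but force more non-local reads past block boundaries.
public class SyncTradeoff {
    static double nonLocalReadPct(long syncInterval, long blockSize) {
        // Worst case: a reader scans one full interval past its block.
        return 100.0 * syncInterval / blockSize;
    }

    static double spaceOverheadPct(long syncInterval, int markerSize) {
        return 100.0 * markerSize / syncInterval;
    }

    public static void main(String[] args) {
        long block = 100L * 1024 * 1024; // assumed 100 MB block size
        for (long interval : new long[]{2000, 100 * 1024, 1024 * 1024}) {
            System.out.printf("interval=%7d B: space %.3f%%, non-local reads up to %.3f%%%n",
                    interval, spaceOverheadPct(interval, 20),
                    nonLocalReadPct(interval, block));
        }
    }
}
```

Under these assumptions, 2000 B gives ~1% space overhead but negligible overshoot, 1 MB gives the reverse, and 100 KB keeps both well under 0.1% — which is the "sweet spot" suggested here.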

          owen.omalley Owen O'Malley added a comment -

          I should also note that this would handle the current special case of the map outputs that turn sync blocks off...

          owen.omalley Owen O'Malley added a comment -

          The current setup forces all non-block-compressed sequence files to accept a 1% overhead for sync bytes. That is a fair amount of overhead on 100 TB... Block-compressed sequence files do just fine with 1 MB between sync blocks.

          cutting Doug Cutting added a comment -

          > reduce the overhead by a factor of 500

          But if the overhead is already considerably less than 1%, is this really important? We want to keep it substantially smaller than 1% of what we ever expect the FileSystem block size to be. 1MB seems to be pushing that boundary.

          owen.omalley Owen O'Malley added a comment -

          If your input splits are roughly 128 MB or so, putting in a sync block every 2 KB is too often. By moving it up to 1 MB, you still get good granularity and reduce the overhead by a factor of 500. That seems pretty significant.

          cutting Doug Cutting added a comment -

          Why would this be better? The current design is to add them as frequently as possible without significantly impacting file size. This minimizes the amount of data that must be scanned when sync'ing. What would making it larger help?


            People

            • Assignee: Harsh J (qwertymaniac)
            • Reporter: Owen O'Malley (owen.omalley)
            • Votes: 0
            • Watchers: 10
