Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Add a new, binary file format TFile.

      Description

      SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs
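The motivation above — a block-compressed format that needs only a single codec — can be sketched in plain Java with `java.util.zip` (no Hadoop dependencies; `BlockWriter` and `BLOCK_SIZE` are illustrative names, not the TFile API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Illustrative sketch: records are buffered into blocks, and each block is
// compressed and decompressed with one codec (DEFLATE). This mirrors the
// idea of a simpler format, not TFile's actual on-disk layout.
class BlockWriter {
    static final int BLOCK_SIZE = 256 * 1024; // flush threshold, in bytes

    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    private final List<byte[]> blocks = new ArrayList<>(); // compressed blocks

    public void append(byte[] record) throws IOException {
        current.write(record);
        if (current.size() >= BLOCK_SIZE) flush();
    }

    public void flush() throws IOException {
        if (current.size() == 0) return;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos =
                 new DeflaterOutputStream(out, new Deflater())) {
            current.writeTo(dos); // single codec on the write path
        }
        blocks.add(out.toByteArray());
        current.reset();
    }

    // Decompress block i with the same single codec on the read path.
    public byte[] readBlock(int i) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(
                 new ByteArrayInputStream(blocks.get(i)))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```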

      Attachments

      1. TFile Specification 20081217.pdf
        519 kB
        Hong Tang
      2. hadoop-trunk-tfile.patch
        352 kB
        Hong Tang
      3. hadoop-trunk-tfile.patch
        352 kB
        Hong Tang
      4. hadoop-3315-0710-1-hadoop-20.patch
        323 kB
        Hong Tang
      5. hadoop-3315-0701-yhadoop-20.patch
        322 kB
        Hong Tang
      6. hadoop-3315-0623-2.patch
        324 kB
        Hong Tang
      7. hadoop-3315-0612.patch
        327 kB
        Hong Tang
      8. hadoop-3315-0605.patch
        327 kB
        Hong Tang
      9. hadoop-3315-0602.patch
        327 kB
        Hong Tang
      10. hadoop-3315-0601.patch
        326 kB
        Hong Tang
      11. hadoop-3315-0514.patch
        320 kB
        Hong Tang
      12. hadoop-3315-0513.patch
        320 kB
        Hong Tang
      13. hadoop-3315-0509-2.patch
        319 kB
        Hong Tang
      14. hadoop-3315-0509.patch
        319 kB
        Hong Tang
      15. hadoop-3315-0507.patch
        319 kB
        Hong Tang
      16. HADOOP-3315_20080915_TFILE.patch
        204 kB
        Amir Youssefi
      17. HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch
        188 kB
        Amir Youssefi

        Activity

        Chris Douglas added a comment -

        Committed to 0.20 branch

        Hong Tang added a comment -

        ant test and test-patch passed (after applying patch to test-patch.sh from HADOOP-6141).

        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 63 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        Hong Tang added a comment -

        Patch to hadoop 0.20. Also included the minor fix in HADOOP-6131.

        Hong Tang added a comment -

        Patch attached for yahoo 0.20 branch.

        Hudson added a comment -

        Integrated in Hadoop-Common-trunk #7 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/7/)
        Add a new, binary file format, TFile. Contributed by Hong Tang.

        Chris Douglas added a comment -

        I committed this. Thanks, Hong!

        Chris Douglas added a comment -
             [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 63 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        

        Unit tests passed with and without native libs installed

        Hong Tang added a comment -

        Reviewed the patch hadoop-3315-0619.patch with Chris Douglas. The newly uploaded patch addressed the following points raised during the review:

        • Replaced some inappropriate usage of IllegalStateException with RuntimeException or IOException.
        • In ChunkEncoder, write out a new chunk if cached data + user buffer >= chunk buffer.
        • Removed constructor ByteArray(String).
        • Fixed logical error "256 & 1024" -> "256 * 1024".
        • Fixed comments in RawComparable javadoc.
        • Renamed various classes and field variables to match with names used in TFile spec.
        • Replaced ResettableByteArrayInputStream with DataInputBuffer.
        • Replaced SimpleByteArrayOutputStream with DataOutputBuffer.
        • Used IOUtils.cleanup in TFile.Writer.close.
        • Replaced AssertionException with RuntimeException.
        • minBlockSize documentation lacks units (in bytes)
        • Document Compression::get(C|D) 0.18 compatibility code
        • Unconventional try/catch in finally usage pattern in BCFile
        • setting hadoop.native.lib in Compression not Kosher
        • Unconventional try/finally pattern in BCFile, converted to try/catch instead of try/finally
        • TFile.Location::hashCode included redundant computation
        • double-semicolon @ TFile:801
        • Use DefaultCodec instead of GzipCodec to make pooling work for non-native zlib
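One of the fixes above replaced `256 & 1024` with `256 * 1024`. The bug is easy to miss: the two operands share no set bits, so the bitwise AND silently evaluates to 0 instead of the intended 256 KB. A quick self-contained check (illustrative, not TFile code):

```java
// 256 = 0b1_0000_0000 and 1024 = 0b100_0000_0000 share no set bits,
// so the bitwise AND yields 0 — a silently wrong buffer size.
class BitVsMul {
    public static void main(String[] args) {
        int wrong = 256 & 1024;   // bitwise AND: 0
        int right = 256 * 1024;   // multiplication: 262144 bytes (256 KB)
        System.out.println(wrong + " " + right);
    }
}
```

Running it prints `0 262144`, which is why the typo meant a zero-sized buffer rather than 256 KB.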
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12410525/hadoop-3315-0612.patch
        against trunk revision 785065.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 43 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/507/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/507/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/507/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/507/console

        This message is automatically generated.

        Hong Tang added a comment -

        This jira is more than a year old, and the latest patch has been available for more than a month (with some intermittent minor updates). Could somebody review and commit the patch? Thanks.

        Hong Tang added a comment -

        The failed test case is due to the missing native library. I believe it is caused by the missing -Dcompile.native=true in the "ant jar" command.

        The audit warning is caused by the changed findBugs exclusion xml file.

        Following is an excerpt of the output of the "ant test-patch" and "ant run-test-core" commands:

        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 43 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        % ant run-test-core "-Dtest.include=TestTFile*"
        run-test-core:
        [delete] Deleting directory /export/crawlspace/tmp/htang/hadoop/build/test/data
        [mkdir] Created dir: /export/crawlspace/tmp/htang/hadoop/build/test/data
        [delete] Deleting directory /export/crawlspace/tmp/htang/hadoop/build/test/logs
        [mkdir] Created dir: /export/crawlspace/tmp/htang/hadoop/build/test/logs
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFile
        [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.321 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileByteArrays
        [junit] Tests run: 25, Failures: 0, Errors: 0, Time elapsed: 3.692 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileComparators
        [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.407 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileJClassComparatorByteArrays
        [junit] Tests run: 25, Failures: 0, Errors: 0, Time elapsed: 3.621 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileLzoCodecsByteArrays
        [junit] Tests run: 25, Failures: 0, Errors: 0, Time elapsed: 0.098 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileLzoCodecsStreams
        [junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 0.09 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileNoneCodecsByteArrays
        [junit] Tests run: 25, Failures: 0, Errors: 0, Time elapsed: 1.704 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileNoneCodecsJClassComparatorByteArrays
        [junit] Tests run: 25, Failures: 0, Errors: 0, Time elapsed: 1.764 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileNoneCodecsStreams
        [junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 2.03 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileSeek
        [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 4.186 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileSeqFileComparison
        [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 6.562 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileSplit
        [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 7.163 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileStreams
        [junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 2.038 sec
        [junit] Running org.apache.hadoop.io.file.tfile.TestTFileUnsortedByteArrays
        [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.591 sec

        checkfailure:

        BUILD SUCCESSFUL
        Total time: 46 seconds

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12409952/hadoop-3315-0605.patch
        against trunk revision 784042.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 43 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        -1 release audit. The applied patch generated 493 release audit warnings (more than the trunk's current 492 warnings).

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/494/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/494/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/494/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/494/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/494/console

        This message is automatically generated.

        Hong Tang added a comment -

        Cancel and resubmit to trigger Hudson.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12409749/hadoop-3315-0602.patch
        against trunk revision 781916.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 43 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/466/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/466/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/466/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/466/console

        This message is automatically generated.

        Hong Tang added a comment -

        Minor change to fix a javadoc warning. Resubmitted to get Hudson to pick it up.

        Hong Tang added a comment -

        Fixed findBugs warnings.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12409587/hadoop-3315-0601.patch
        against trunk revision 781255.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 43 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 4 new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/453/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/453/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/453/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/453/console

        This message is automatically generated.

        Hong Tang added a comment -

        Resubmit the patch to let Hudson pick it up.

        Alan Gates added a comment -

        +1

        Hong Tang added a comment -

        The latest patch addresses the concern raised by Alan.

        Hong Tang added a comment -

        Looking at the latest patch, I have one question: there are a lot of contained classes and interfaces in tfile. Why are these all contained in one tfile class, instead of making tfile a package and having the classes and interfaces contained in there?

        Fair question; the code could be refactored to make it easier to maintain. However, I am a bit hesitant to split it into sub-packages (tfile itself is already a package, and adding more sub-packages would probably be overkill).

        After examining the code, here are a few opportunities where we could split it out:

        • Move out Interface RawComparable.
        • Move out public class ByteArray
        • Move out the exception classes: MetaBlockAlreadyExists, MetaBlockDoesNotExist
        • Move out the code that dumps the meta info of TFile (possibly with a wrapper class called TFileDumper).

        I will start working on the above; feel free to comment on what more could be done.
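The first item above, moving out the RawComparable interface, is the one that shapes the comparator API. A minimal self-contained sketch of the idea — a key exposes its backing buffer plus an offset/length window so keys can be compared memcmp-style without deserialization — under the assumption that this matches the spirit of the TFile code (names here are illustrative, not the actual org.apache.hadoop.io.file.tfile classes):

```java
import java.util.Comparator;

// A raw-bytes comparable: callers see a window (buffer, offset, size)
// over serialized key bytes, so no deserialization is needed to compare.
interface RawComparable {
    byte[] buffer();
    int offset();
    int size();
}

// memcmp-style comparator over two RawComparable windows: compares
// unsigned bytes left to right, then falls back to length.
class MemcmpComparator implements Comparator<RawComparable> {
    @Override
    public int compare(RawComparable a, RawComparable b) {
        byte[] ab = a.buffer(), bb = b.buffer();
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            // mask to 0xff so bytes compare as unsigned, like memcmp
            int d = (ab[a.offset() + i] & 0xff) - (bb[b.offset() + i] & 0xff);
            if (d != 0) return d;
        }
        return a.size() - b.size();
    }
}
```

Pulling this interface out of the enclosing class lets user-supplied comparators be written against the raw-bytes contract alone.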

        Alan Gates added a comment -

        Looking at the latest patch, I have one question: there are a lot of contained classes and interfaces in tfile. Why are these all contained in one tfile class, instead of making tfile a package and having the classes and interfaces contained in there?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12408226/hadoop-3315-0514.patch
        against trunk revision 776148.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 43 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/355/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/355/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/355/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/355/console

        This message is automatically generated.

        Hong Tang added a comment -

        Old patch no longer applies due to change in src/test/findbugsExcludeFile.xml. Updated to fix the problem.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12408058/hadoop-3315-0513.patch
        against trunk revision 774625.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 43 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/338/console

        This message is automatically generated.

        Hong Tang added a comment -

        Minor change: added the Findbugs false-positive conditions to src/test/findbugsExcludeFile.xml.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12407714/hadoop-3315-0509-2.patch
        against trunk revision 772960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 40 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/320/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/320/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/320/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/320/console

        This message is automatically generated.

        Hong Tang added a comment -

        New patch that fixed all but false-positive findbugs warnings.

        Hong Tang added a comment -

        New patch that fixes all but false positive findbugs warnings.

        Hong Tang added a comment -

        Last patch introduces more findbugs warnings

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12407699/hadoop-3315-0509.patch
        against trunk revision 772960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 40 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 6 new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/318/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/318/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/318/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/318/console

        This message is automatically generated.

        Hong Tang added a comment -

        Fixed a few findbugs warnings. Warnings RR on ChunkDecoder and SF on Utils.writeVLong are false positives.

        Hong Tang added a comment -

        Cancel and resubmit to trigger hudson testing

        Hong Tang added a comment -

        New patch that fixes some findbugs warnings.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12407451/hadoop-3315-0507.patch
        against trunk revision 772960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 40 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 6 new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/311/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/311/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/311/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/311/console

        This message is automatically generated.

        Hong Tang added a comment -

        The patch is tested to work with trunk head, 0.18.3 and 0.20.

        For 0.18.3 and 0.20, it works the same as my previous upload and uses the built-in LzoCodec for "lzo" compression.

        For trunk, by default, "lzo" compression is not supported. The user must specify the class name of a third-party lzo compression library through the property "io.compression.codec.lzo.class", and set the classpath and java.library.path properly. The code is tested to work with hadoop-gpl-compression: http://code.google.com/p/hadoop-gpl-compression/.
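        As a sketch of what that configuration looks like, assuming the hadoop-gpl-compression codec (class name taken from that project; verify it against the version you actually install), the core-site.xml entry would be something like:

        ```xml
        <!-- Hypothetical example: point TFile at an external lzo codec. -->
        <property>
          <name>io.compression.codec.lzo.class</name>
          <value>com.hadoop.compression.lzo.LzoCodec</value>
        </property>
        ```

        The native lzo library must additionally be on java.library.path, and the codec jar on the classpath.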

        Hong Tang added a comment -

        Main changes of this patch since last upload:

        • Solved the missing LzoCodec issue.
        • Enhanced the unit tests.
        Hong Tang added a comment -

        I am working on a patch that should be ready in a week.

        Zheng Shao added a comment -

        This issue has just passed its anniversary. What is the timeline? When will the first version get committed?

        ryan rawson added a comment -

        HBase no longer has interest or dependency on this JIRA. We withdraw any concerns or blockers.

        Good luck on your new file format guys!

        Hong Tang added a comment -

        Aah, I wonder how it went undetected for so long.

        I have not thought through the approach yet. If you have suggestions, please let me know. It definitely needs some refactoring in the TFile end.

        stack added a comment -

        Thanks Hong. I was talking about the fact that '256 & 1024' evaluates to 0 (it should be 256 * 1024). Changing this improved my seek times.

        How are you thinking of pulling in lzo? Can I help out?
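        The operator mix-up is easy to demonstrate (a quick Python sketch, purely to illustrate the arithmetic; the code in question is Java, but the operators behave the same way):

        ```python
        # 256 and 1024 have no bits in common, so bitwise AND yields 0,
        # silently turning the intended 256 KB default into a zero-size buffer.
        bitwise = 256 & 1024    # 0b01_0000_0000 & 0b100_0000_0000
        intended = 256 * 1024   # 262144 bytes, i.e. 256 KB

        print(bitwise)   # 0
        print(intended)  # 262144
        ```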

        Hong Tang added a comment -

        Here's one issue in TFile that was causing grief (tests explicitly set this size).

        static int getFSInputBufferSize(Configuration conf) {
            return conf.getInt(FS_INPUT_BUF_SIZE_ATTR, 256 & 1024);
        }

        The configuration parameters tfile.fs.output.buffer.size and tfile.fs.input.buffer.size control the buffering between the compression/decompression codec and the underlying file IO stream. Larger values decrease the number of IO calls to the underlying stream, but pay an overhead in buffer copying and memory. The choice of 256K in the code is a safe one, but not the best one.

        The performance numbers I posted do not tweak these two settings (the test sets them to 1 and 0 respectively); the aim is to rely on buffering in the fs stream and bypass internal buffering. It seems to work well with the version of hadoop (a 0.19-dev build) I used.
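        The tradeoff described above can be sketched with a toy simulation (Python, purely illustrative — this is not TFile code): a counting wrapper stands in for the underlying file IO stream, showing how a larger intermediate buffer cuts the number of calls that reach it.

        ```python
        import io

        # A raw stream that counts how many read calls actually reach it,
        # standing in for the underlying file IO stream.
        class CountingStream(io.RawIOBase):
            def __init__(self, data):
                self._data = io.BytesIO(data)
                self.calls = 0

            def readable(self):
                return True

            def readinto(self, b):
                self.calls += 1
                return self._data.readinto(b)

        def count_underlying_calls(buffer_size, total=1 << 20):
            raw = CountingStream(b"x" * total)  # 1 MiB of payload
            buffered = io.BufferedReader(raw, buffer_size=buffer_size)
            # The "codec side" consumes the stream in small 4 KB chunks.
            while buffered.read(4096):
                pass
            return raw.calls

        small = count_underlying_calls(4 * 1024)    # small intermediate buffer
        large = count_underlying_calls(256 * 1024)  # the 256K-style default
        print(small, large)  # far fewer underlying calls with the large buffer
        ```

        The memory and copying cost grows with the buffer, which is why the "safe but not best" wording above fits: the right value depends on the workload.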

        stack added a comment - edited

        Moving it to 'resume progress' because of Hong's last comment.

        Hong Tang added a comment -

        Was trying to make the minor changes stack suggested and found that the tfile patch no longer applies to trunk, due to the removal of the Lzo codec by HADOOP-4874.

        Given the performance advantage of lzo, it is probably not a good idea to abandon lzo support completely. We need to allow TFile to accept an external Lzo codec implementation.

        stack added a comment -

        Digging, it looks like ChunkDecoder.close is actually a close of the value stream. We need to skip over the value to get to the next key in inBlockAdvance.

        ryan rawson added a comment -

        Hi,

        I'm evaluating the use of tfile for performance in hbase... I ran a simple seek program in a profiler, and I saw something really weird I'd like to call your attention to:

        18.9% - 56,948 ms - 12,640,778 inv. org.apache.hadoop.io.file.tfile.Chunk$ChunkDecoder.close

        • 9.1% - 27,357 ms - 12,642,117 inv. org.apache.hadoop.io.file.tfile.Chunk$ChunkDecoder.skip
          • 3.0% - 9,107 ms - 12,642,117 inv. org.apache.hadoop.io.file.tfile.Chunk$ChunkDecoder.checkEOF
          • 1.1% - 3,252 ms - 12,642,117 inv. java.io.DataInputStream.skip
          • 1.0% - 3,143 ms - 12,642,117 inv. java.lang.Math.min
        • 5.9% - 17,826 ms - 25,282,895 inv. org.apache.hadoop.io.file.tfile.Chunk$ChunkDecoder.checkEOF
          • 2.0% - 5,980 ms - 25,282,895 inv. org.apache.hadoop.io.file.tfile.Chunk$ChunkDecoder.isClosed

        It turns out that nearly 19% of my time is spent closing a block?! The code for the function is:
        public void close() throws IOException {
            if (closed == false) {
                try {
                    while (!checkEOF()) {
                        skip(Integer.MAX_VALUE);
                    }
                } finally {
                    closed = true;
                }
            }
        }

        This seems kind of weird: why do we read the rest of the block on close? Why not just close it?

        Thanks!

        stack added a comment -

        Here's one issue in TFile that was causing grief (tests explicitly set this size).

        +  static int getFSInputBufferSize(Configuration conf) {
        +    return conf.getInt(FS_INPUT_BUF_SIZE_ATTR, 256 & 1024);
        +  }
        Hong Tang added a comment - edited

        The results were obtained with an early version of Hadoop 19. Can you describe your environment so I can retry the tests?

        The hardcoded parameters were meant to serve as a unit test.

        The following is the tcsh script that I use to run my seek benchmark:

        
        set NN=<NN-host>:<NN-port>
        set ROOT="-r hdfs://${NN}/<home-dir>"
        
        foreach compress (none lzo)
            foreach block (128 256 512 1024)
                ./runTFileSeekBench -n 1000 -s 10000 -c $compress -b $block ${ROOT} -f TestTFileSeek.${block}K.$compress -x w
            end
        end
        
        foreach i (1 2 3 4 5)
            foreach compress (none lzo)
                foreach block (128 256 512 1024)
                    ./runTFileSeekBench -n 1000 -s 10000 -c $compress -b $block ${ROOT} -f TestTFileSeek.${block}K.$compress -x r
                end
            end
        end
        
        stack added a comment -

        I should add that I tested using both the local filesystem and hdfs. Both had a similar slow profile with a 1G file.

        stack added a comment -

        Thanks for the above, especially the bit on concurrent access blocked on FSDIS. Thanks too for the explanation of inBlockAdvance (yes, no objects are created, just resets and closes). Pardon me.

        I'm wondering how you did your random access tests above? I see nowhere near your numbers. I'm seeing more like 2 to 6 random reads a second. Profiling, "100%" of the time is being spent in inBlockAdvance: ~70% in TFile$Reader.checkKey and ~20% in TFile$Reader.compareKeys (about half of this time is down in hadoop.fs doing seeking and reading). My test writes 1M rows whose keys are zero-padded sequence numbers. The rows are 1k of random data (the file is about 1G; TFile blocks are 64k). The random access test fetches a random key from the 1-1M range.

        Looking at TestTFileSeek, I see that this test writes a 30M file by default and then does 1000 seeks. It reports seek times of about 2.5ms on average. If I up the file written to 1G, leaving all else the same, the seek time now averages 120ms.

        What do you think is going on here? My test against MapFile does about 417 random reads a second. Thanks Hong.

        Hong Tang added a comment -

        Any advantage to making a scanner around a start and end key for random accessing? Or, if I read things properly, is there none, since we only fetch actual blocks when seekTo is called?

        There is no performance advantage. But there is a semantic difference. If you create a range scanner, and call seekTo with a key outside the scan range, it will return false even if the key exists in the TFile.
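        That semantic difference can be illustrated with a toy model (plain Python, not the TFile API; RangeScanner and seek_to here are invented names for illustration only):

        ```python
        import bisect

        # Toy model of a range-bounded scanner: seeking to a key outside
        # [begin, end) reports failure even when the key exists in the file.
        class RangeScanner:
            def __init__(self, keys, begin, end):
                self.keys = keys              # the file's sorted keys
                self.begin, self.end = begin, end

            def seek_to(self, key):
                if not (self.begin <= key < self.end):
                    return False              # outside the scan range
                i = bisect.bisect_left(self.keys, key)
                return i < len(self.keys) and self.keys[i] == key

        scanner = RangeScanner([b"apple", b"mango", b"zebra"], b"apple", b"pear")
        print(scanner.seek_to(b"mango"))  # True: in range and present
        print(scanner.seek_to(b"zebra"))  # False: in the file, but past the range
        ```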

        And on concurrent access: if we have, say, random accesses concurrent with a couple of whole-file scans, my reading has it that scanners fetch a block just as they need it and then work against this fetched copy. The fetch is 'synchronized', which means lots of seeking around in the file, but otherwise it looks like there is no need for the application to synchronize access to tfile.

        Yes, there is no need to synchronize threads accessing the same TFile, as long as each has its own scanner. However, concurrent access is not as performant as it could be, due to the current design of HDFS. If multiple threads scan different regions of the same TFile, the actual IO calls to FSDataInputStream are synchronized. I tried positioned read (which would avoid synchronizing on reads), but the overhead of frequent connection establishment makes the single-threaded case much worse. Connection caching for positioned reads may help (HADOOP-3672).

        Hmm. Looking at doing random accesses, it seems like a bunch of time is spent in inBlockAdvance advancing sequentially through blocks rather than doing something like a binary search to find the desired block location. Also, as we advance, we create and destroy a bunch of objects such as the stream that holds the value. Can you comment on why this is (compression should be on tfile block boundaries, right, so nothing stops hopping into the midst of a tfile)? Thanks.

        inBlockAdvance() sequentially goes through key-value pairs INSIDE one compressed block. Binary searching for the desired block is done through Reader.getBlockContainsKey(). Additionally, the code takes care of the case where you want to seek to a key in the later part of the current block. When advancing, no objects are created. The closing of the value stream is to force the code to skip the remaining bytes of the value in case the application consumed only part of the value bytes.
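        A minimal sketch of that two-level lookup (Python, conceptual only; the block layout and the seek helper are invented for illustration, not TFile internals):

        ```python
        import bisect

        # blocks: each inner list is the sorted keys of one compressed block.
        # The index of first keys mirrors the block index that a
        # getBlockContainsKey()-style lookup binary-searches.
        def seek(blocks, key):
            index = [b[0] for b in blocks]
            bi = bisect.bisect_right(index, key) - 1  # binary search across blocks
            if bi < 0:
                return None                           # key precedes the first block
            # inBlockAdvance-style: sequential scan inside the chosen block only
            for pos, k in enumerate(blocks[bi]):
                if k >= key:
                    return (bi, pos) if k == key else None
            return None

        blocks = [[b"a", b"c"], [b"f", b"k"], [b"p", b"t"]]
        print(seek(blocks, b"k"))  # (1, 1): second block, second entry
        print(seek(blocks, b"g"))  # None: lands in block 1, but key is absent
        ```

        The point of the split: only one block is ever scanned linearly, so the linear cost is bounded by the block size, not the file size.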

        Are you going to upload another patch? If so, I'll keep my +1 for that.

        Will do shortly.

        stack added a comment -

        Hmm. Looking at doing random accesses, it seems like a bunch of time is spent in inBlockAdvance advancing sequentially through blocks, rather than doing something like a binary search to find the desired block location. Also, as we advance, we create and destroy a bunch of objects, such as the stream to hold the value. Can you comment on why this is (compression should be on tfile block boundaries, right, so nothing stops hopping into the midst of a tfile)? Thanks.

        stack added a comment -

        On an alternate BCFile implementation: we're thinking that since BCFile currently fetches a block as needed, an alternative might keep up a cache of blocks. Before going to the wire, the cache would be checked to save on hdfs'ing (and repeated decompresses if blocks are compressed). If many concurrent accessors are contending over the same part of a TFile, we'd save a bunch. We've had some experience doing this in the past with generally good results.

        But rather than make hypotheticals, let's wait as you suggest and revisit when a patch with a viable alternative BCFile exists. (On key/value caching, we're working on that too, only it can be a tad complex in the hbase context.)

        OK on the too-big-key.

        Is there any advantage to our making a scanner around a start and end key when random accessing? Or, if I read things properly, there is none, since we only fetch actual blocks when seekTo is called.

        And on concurrent access: if we have, say, random accesses concurrent with a couple of whole-file scans, my reading has it that each scanner fetches a block just as it needs it and then works against this fetched copy. The fetch is 'synchronized', which means lots of seeking around in the file, but otherwise it looks like there is no need for the application to synchronize access to tfile.

        Thanks for the sweet patch. It's looking like we'll pull TFile into hbase and start using it tout de suite, since our intent is to replace our current storefile in the 0.20.0 timeframe. Hopefully we'll have some patches for TFile to give back once this patch goes into Hadoop. Are you going to upload another patch? If so, I'll keep my +1 for that.

        Hong Tang added a comment - - edited

        Should probably be explicit about encoding in the below:

         public ByteArray(String str) {
         this(str.getBytes());
        

        Good catch. This seems to be some code to facilitate testing that was not properly cleaned up. I should remove that constructor.
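The hazard stack is pointing at: String.getBytes() with no argument uses the platform-default charset, so the same string can produce different bytes on different machines. A minimal illustration (not TFile code; the fix is simply to pass an explicit charset):

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Explicit charset makes the byte encoding deterministic across JVMs;
    // the no-arg String.getBytes() depends on the platform default and can
    // yield different bytes (and byte counts) for the same key.
    public static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'é' is one char but two bytes in UTF-8, so "café" encodes to 5 bytes;
        // under a single-byte platform charset it would be 4.
        System.out.println(utf8Bytes("café").length); // 5
    }
}
```

Since TFile keys are compared as raw bytes, a non-deterministic encoding would make sort order itself platform-dependent.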

        Would be nice if we could easily pass an alternate implementation of BCFile, say one that cached blocks.

        Possible, but I think it is too early to make the BCFile API public. Can you also elaborate on the example you mentioned? Why do you need block caching instead of key-value caching (given that TFile is based on <key, value> pairs)?

        Do you want to fix the below:

          // TODO: remember the longest key in a TFile, and use it to replace
          // MAX_KEY_SIZE.
          keyBuffer = new byte[MAX_KEY_SIZE];
        

        Default buffers of 64k for keys is a bit on the extravagant side.

        Yes, it is an easy fix. I intend to do that at a later time when we gather more information about what statistics we should collect during file creation time and put them in one meta block. I don't imagine it being an urgent issue though except for applications that open up hundreds of files (or scanners) simultaneously.
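The deferred fix — recording the longest key at write time so readers can size keyBuffer exactly instead of allocating a 64K MAX_KEY_SIZE buffer per scanner — might look roughly like this. The class and method names here are hypothetical, not the eventual TFile API:

```java
public class KeyStats {
    private int maxKeyLen = 0;

    // Called by the writer for every key appended to the file.
    public void observe(byte[] key) {
        if (key.length > maxKeyLen) maxKeyLen = key.length;
    }

    // Persisted into a meta block at close time; a reader would then allocate
    // new byte[maxKeyLength()] instead of a fixed 64K buffer per scanner.
    public int maxKeyLength() {
        return maxKeyLen;
    }

    public static void main(String[] args) {
        KeyStats stats = new KeyStats();
        stats.observe("row-1".getBytes());
        stats.observe("a-much-longer-row-key".getBytes());
        System.out.println(stats.maxKeyLength()); // 21
    }
}
```

For an application opening hundreds of scanners simultaneously, shrinking each buffer from 64K to the true maximum key length adds up quickly.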

        Below should be public so users don't have to define their own:

          
         protected final static String JCLASS = "jclass:";
        

        Certainly. Possibly also true for symbolic names for various compression algorithms.

        API seems to have changed since the last patch. There's no longer a #find method. What's the suggested way of accessing a random single key/value? (Open a scanner using what would you suggest for start and end? Then seekTo? But I find I'm making double ByteArray instances of the same byte array. Should there be a seekTo that takes a RawComparable that is public?)

        Yes, the API has changed so that we do not have to scan through a compressed block twice (first getting a location object, then using it with the scanner). I'd suggest doing random access as follows:

            Scanner scanner = reader.createScanner();
            ...
            if (scanner.seekTo(bytes, offset, length)) {
                Entry entry = scanner.entry();
                // access value through either entry.getValue or entry.writeValue 
            }
        
        stack added a comment -

        I've been playing around w/ the last patch trying to actually use it.

        Here are a couple of (minor) comments:

        Should probably be explicit about encoding in the below:
        public ByteArray(String str) {
        this(str.getBytes());

        Would be nice if we could easily pass an alternate implementation of BCFile, say one that cached blocks.

        Do you want to fix the below:

                // TODO: remember the longest key in a TFile, and use it to replace
                // MAX_KEY_SIZE.
                keyBuffer = new byte[MAX_KEY_SIZE];
        

        Default buffers of 64k for keys is a bit on the extravagant side.

        Below should be public so users don't have to define their own:

        protected final static String JCLASS = "jclass:";

        API seems to have changed since the last patch. There's no longer a #find method. What's the suggested way of accessing a random single key/value? (Open a scanner using what would you suggest for start and end? Then seekTo? But I find I'm making double ByteArray instances of the same byte array. Should there be a seekTo that takes a RawComparable that is public?)

        Thanks Hong.

        Hong Tang added a comment -

        Uploading the patch again so that it can be picked up by Hudson.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12396286/TFile+Specification+20081217.pdf
        against trunk revision 738602.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3769/console

        This message is automatically generated.

        Hong Tang added a comment -

        Some preliminary random seek performance on TFile.

        Settings:

        • Key size = 10-50B (zipf distribution, sigma = 1.2)
        • Value size = 1-2KB (even distribution)
        • TFile size = 10GB
        • Compression: none and lzo (under lzo, the compression ratio is 1:2.1).
        • Block size = 128KB, 256KB, 512KB, 1MB.
        • File System: (1) a single-disk local file system, >80% full; (2) a single-node DFS with 4 disks (JBOD), >70% full; (3) a single-node DFS with 4 disks (RAID0), almost empty.

        Results:
        1. Local File System, single disk, >80% full.

        BlkSize   none (ms)   lzo (ms)
        128K      20.10       27.50
        256K      32.18       29.16
        512K      32.50       44.60
        1M        33.75       52.20

        2. A single-node DFS with 4 disks (JBOD), >70% full.

        BlkSize   none (ms)   lzo (ms)
        128K      32.96       31.72
        256K      33.94       33.75
        512K      43.66       40.42
        1M        54.30       51.05

        3. A single-node DFS with 4 disks (RAID0), empty

        BlkSize   none (ms)   lzo (ms)
        128K      19.70       20.74
        256K      20.55       21.26
        512K      22.21       24.74
        1M        24.51       27.08

        Some observations:

        • The larger the block size, the longer the seek time, because search inside a block is a linear scan.
        • lzo yields longer seek times, because it pays the overhead of decompression and needs to examine more uncompressed bytes.
        • Performance of lzo and none are comparable on DFS, but on the local file system lzo is worse than none. This is because on the local file system, I/O and decompression are done sequentially, while in DFS actual I/O is overlapped with computation, and I/O is the critical path.
        • RAID0 seems to offer significantly better random I/O throughput. However, this is not conclusive; other factors may also play a role, e.g. an empty file system, a faster CPU, or network connectivity.
        Hong Tang added a comment -

        Updated TFile specification that matches the revised TFile code.

        Hong Tang added a comment -

        Just submitted the patch that includes the latest changes.

        The revised TFile spec will be uploaded shortly.

        Hong Tang added a comment -

        Patch for the revised TFile implementation with corresponding junit tests.

        Amir Youssefi added a comment -

        OK, Hong needs to make some changes and will attach a new patch. I will wait for it before attaching the ObjectFile and I/O Format patches.

        Amir Youssefi added a comment -

        I will attach an updated TFile patch. This will remove the hold-up for ObjectFile and its I/O Format patches, which needed the new TFile code.

        stack added a comment -

        Thanks for the update, Hong. If you get a minute, yeah, for sure put up a new patch. (BTW, I notice this is the most-watched issue in the JIRA database according to this: http://community.cloudera.com/reports/2/issues/).

        Hong Tang added a comment -

        @Stack,

        Some work has been done based on comments we received after our initial patch release; major feature additions include:

        • More efficient random access implementation. [Reader interface change]
        • Customized comparator (support java class comparators).
        • Added an ObjectFile layer on top of TFile, with corresponding InputFormat and OutputFormat.

        I think it may be a good idea for us to upload another patch and collect feedback. However, I have very limited time working on TFile now so I may not be able to answer questions and address issues promptly.

        stack added a comment -

        How's it going, lads? Any update? Do you think TFile will make 0.20.0? Thanks.

        Hong Tang added a comment -

        This new Scanner API is still Comparator-free. It's easy to layer lexicographic-by-byte on top of a Comparator-based API, but not vice versa.

        We are working on that as a separate task. The complete task list is as follows:

        • random access (which is what my early long comment is about)
        • customized comparator on TFile.
        • supporting objects and input/output format for MapReduce
        • extensibility.

        Expect more comments addressing these issues soon.

        Hong Tang added a comment -

        Yes, Scanner is stateful: lowerBound() or upperBound() will get to the right point of the compression block, and keep the block stream open.

        Sorry, forgot to mention the change from Location to RowID. One school of thought is that one cannot really do much with the Location object. So at some point, we may need to expose the fact that you can locate by RowID anyway, and at that point the use of Location would be obsolete. So let's just expose it now to keep the API set tight. With RowID, you can easily implement some auxiliary index to remember the # of rows starting with the same "row key" in HBase, and you can do seek(current()+n) to efficiently skip those rows.
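The rowID-skipping idea can be illustrated with a toy stand-in (hypothetical code, not the TFile Scanner itself): an auxiliary index records how many rows share each key, and a single seek(current()+n) jumps past all of them:

```java
import java.util.HashMap;
import java.util.Map;

public class RowIdSkipDemo {
    // Given sorted keys (duplicates adjacent) and a cursor positioned at the
    // first row of the current key, return the rowID of the first row with a
    // different key — the "n" in seek(current()+n) coming from an auxiliary
    // index of per-key duplicate counts.
    public static long skipDuplicates(String[] keys, long current) {
        Map<String, Integer> dupCount = new HashMap<>(); // auxiliary index
        for (String k : keys) dupCount.merge(k, 1, Integer::sum);
        return current + dupCount.get(keys[(int) current]);
    }

    public static void main(String[] args) {
        String[] keys = { "a", "a", "a", "b", "c" };
        // From rowID 0 (the first "a"), one seek skips all three "a" rows.
        System.out.println(skipDuplicates(keys, 0)); // 3
    }
}
```

With an opaque Location object there is no arithmetic like this; exposing RowID makes the skip a single cheap seek instead of n advance() calls.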

        stack added a comment -

        @Hong

        So, to read a random key, I'd now do:

          TFile.Reader r = new TFile.Reader....
          TFile.Reader.Scanner s = r.createScanner();
          ...
          // Below would body of some 'get' method that took a byte array 'key' for an argument
          Writable w = new SomeWritable();
          return s.find(key, w)? w: null;
        

        That'll work.

        This avoids the double-decompress possibility mentioned above because all seek is done in the Scanner now?

        Are rowids a new notion? Am I supposed to be able to go between rowid and a Location?

        Doug Cutting added a comment -

        This new Scanner API is still Comparator-free. It's easy to layer lexicographic-by-byte on top of a Comparator-based API, but not vice versa.

        Hong Tang added a comment - - edited

        @Stack

        We came up with the following proposal to address the random access API issue: we would move all seek()-related functionality into Scanner, and let Reader simply become the starting point for creating Scanners, plus some misc functionality. To allow applications to create scanners from a reader, we also let the reader implement the Scanner interface (internally, it lazily instantiates a Scanner whenever those Scanner functions are called, and delegates the call to the internal scanner). This has the advantage of simplifying the life of applications that want to access a TFile in a single-threaded fashion.

        The following is the detailed API:

        Scanner.java
        /**
         * The TFile Scanner. The Scanner has an implicit cursor, which either points to
         * a valid row in the scan range, or to a special "end" location. There are
         * three kinds of operations supported by a Scanner:
         * <ol>
         * <li>Query about locations. Locations are represented by row IDs.
         * <ul>
         * <li><code>long begin()</code> returns the row ID for the first row.
         * <li><code>long end()</code> returns the row ID of the last row plus one.
         * <li><code>long current()</code> returns the current Row ID.
         * <li><code>boolean atEnd()</code> test whether the cursor is at the end.
         * </ul>
         * <li>Move the cursor.
         * <ul>
         * <li><code>void seek(long rowID)</code>. Go to the specific row as indicated
         * by the input row ID.
         * <li><code>void lowerBound(byte[] key)</code> (and its variant). Go to the
         * first row whose key is greater than or equal to the input key.
         * <li><code>void upperBound(byte[] key)</code> (and its variant). Go to the
         * first row whose key is greater than the input key.
         * <li><code>boolean seek(byte[] key)</code> (and its variant). Similar to
         * <code>lowerBound</code>, except that it returns a boolean to indicate whether
         * we find an exact match of the key.
         * <li><code>boolean advance()</code> advance to next row. An efficient
         * implementation of <code>seek(current()+1)</code> that returns a boolean to
         * indicate whether the cursor actually moves.
         * <li><code>void rewind()</code>. Equivalent to <code>seek(begin())</code>
         * </ul>
         * <li>Retrieve data.
         * <ul>
         * <li><code>DataInputStream getKeyStream()</code>. Return the key stream.
         * <li><code>int getKeyLength()</code>. Return the key length.
         * <li><code>DataInputStream getValueStream()</code>. Return the value stream.
         * <li><code>int getValueLength()</code>. Return the value length.
         * <li><code>int getKey(byte[] buffer, int offset)</code>. Convenience function,
         * get key.
         * <li><code>int getValue(byte[] buffer, int offset)</code>. Convenience
         * function, get value.
         * <li><code>void get(BytesWritable key, BytesWritable value)</code>.
         * Convenience function, get key and value in one shot.
         * <li><code>DataInputStream find(byte[] key)</code> (and its variants).
         * Convenience function, same as <code>
         * if (seek(key)) {
         *   return getValueStream();
         * }
         * return null;
         * </code>
         * <li><code>boolean find(byte[] key, BytesWritable value)</code> (and its
         * variants). Convenience function, same as <code>
         * if (seek(key)) {
         *   getValue(value);
         *   return true;
         * }
         * return false;
         * </code>
         * <li><code>boolean isValueLengthKnown()</code>. Test whether we know the value
         * length.
         * </ul>
         * </ol>
         */
        public interface Scanner extends Closeable {
          /**
           * Get the beginning row's RowID.
           * 
           * @return RowID for the first row.
           */
          long begin();
        
          /**
           * Get the last row's RowID plus 1.
           * 
           * @return RowID for the last row + 1.
           */
          long end();
        
          /**
           * Get the current row's RowID.
           * 
           * @return RowID for the current row.
           */
          long current();
        
          /**
           * Test whether we reach the end of scan range.
           * 
           * @return (current()==end());
           */
          boolean atEnd();
        
          /**
           * Move the cursor to the first row whose key is greater than or equal to the
           * input key. Equivalent to <code>lowerBound(key, 0, key.length)</code>.
           * 
           * @param key
           *          input key.
           * @throws IOException
           */
          void lowerBound(byte[] key) throws IOException;
        
        
          /**
           * Move the cursor to the first row whose key is greater than or equal to the
           * input key.
           * 
           * @param key
           *          input key.
           * @param offset
           *          offset into the key buffer.
           * @param len
           *          length of the key.
           * @throws IOException
           */
          void lowerBound(byte[] key, int offset, int len) throws IOException;
        
          /**
           * Move the cursor to the first row whose key is greater than the input key.
           * Equivalent to <code>upperBound(key, 0, key.length)</code>.
           * 
           * @param key
           *          input key.
           * @throws IOException
           */
          void upperBound(byte[] key) throws IOException;
        
          /**
           * Move the cursor to the first row whose key is greater than the input key.
           * 
           * @param key
           *          input key.
           * @param offset
           *          offset into the key buffer.
           * @param len
           *          length of the key.
           * @throws IOException
           */
          void upperBound(byte[] key, int offset, int len) throws IOException;
        
          /**
           * Seek to a particular key. Equivalent to <code>lowerBound(key)</code>,
           * except that it returns true if we find an exact match.
           * 
           * @param key
           *          input key.
           * @return whether we find an exact match.
           * @throws IOException
           */
          boolean seek(byte[] key) throws IOException;
        
          /**
           * Seek to a particular key. Equivalent to
           * <code>lowerBound(key, offset, len)</code>, except that it returns true if
           * we find an exact match.
           * 
           * @param key
           *          input key.
           * @param offset
           *          offset into the key buffer.
           * @param len
           *          length of the key.
           * @return whether we find an exact match.
           * @throws IOException
           */
          boolean seek(byte[] key, int offset, int len) throws IOException;
        
          /**
           * Seek to a particular RowID.
           * 
           * @param rowID
           *          the RowID to seek to.
           * @throws IOException
           */
          void seek(long rowID) throws IOException;
        
          /**
           * Advance the cursor to the next row. An efficient implementation of
           * <code>seek(current()+1)</code> that returns a boolean indicating whether
           * the cursor actually moved.
           * 
           * @return whether the cursor actually advances.
           * @throws IOException
           */
          boolean advance() throws IOException;
        
          /**
           * Move the cursor to the first row. Equivalent to <code>seek(begin())</code>.
           */
          void rewind() throws IOException;
        
          /**
           * Streaming access to the key. <code>atEnd()</code> must be tested false.
           * 
           * @return The input stream.
           * @throws IOException
           */
          DataInputStream getKeyStream() throws IOException;
        
          /**
           * Get the length of the key. <code>atEnd()</code> must be tested false.
           * 
           * @return the length of the key.
           */
          int getKeyLength();
        
          /**
           * Copy the key into user supplied buffer. <code>atEnd()</code> must be tested
           * false. Equivalent to <code>getKey(key, 0)</code>.
           * 
           * @param key
           *          The buffer supplied by user. The length of the buffer must not be
           *          shorter than the key length.
           * @return The length of the key.
           * 
           * @throws IOException
           */
          int getKey(byte[] key) throws IOException;
        
          /**
           * Copy the key into user supplied buffer. <code>atEnd()</code> must be tested
           * false.
           * 
           * @param key
           *          The buffer supplied by user.
           * @param offset
           *          The starting offset in the user buffer where the key should
           *          be copied. The following condition must hold:
           *          <code>getKeyLength() + offset <= key.length</code>.
           * @return The length of the key.
           * 
           * @throws IOException
           */
          int getKey(byte[] key, int offset) throws IOException;
        
          /**
           * Copy the key into user supplied buffer. <code>atEnd()</code> must be tested
           * false.
           * 
           * @param key
           *          The BytesWritable buffer supplied by user.
           * @return The length of the key.
           * 
           * @throws IOException
           */
          int getKey(BytesWritable key) throws IOException;
        
          /**
           * Streaming access to the value. <code>atEnd()</code> must be tested false.
           * 
           * @return The input stream.
           * @throws IOException
           */
          DataInputStream getValueStream() throws IOException;
        
          /**
           * Test whether we know the value length. <code>atEnd()</code> must be tested
           * false.
           * 
           * @return Boolean indicating whether we know the length of the value.
           */
          boolean isValueLengthKnown();
        
          /**
           * Get the length of the value. <code>atEnd()</code> must be tested false.
           * 
           * @return the length of the value.
           */
          int getValueLength();
        
          /**
           * Copy the value into user supplied buffer. <code>atEnd()</code> must be
           * tested false. Equivalent to <code>getValue(value, 0)</code>.
           * 
           * @param value
           *          The buffer supplied by user. The length of the buffer must not be
           *          shorter than the value length.
           * @return The length of the value.
           * 
           * @throws IOException
           */
          int getValue(byte[] value) throws IOException;
        
          /**
           * Copy the value into user supplied buffer. <code>atEnd()</code> must be
           * tested false.
           * 
           * @param value
           *          The buffer supplied by user.
           * @param offset
           *          The starting offset in the user buffer where the value should
           *          be copied. The following condition must hold:
           *          <code>valueLength + offset <= value.length</code>.
           * @return The length of the value.
           * 
           * @throws IOException
           */
          int getValue(byte[] value, int offset) throws IOException;
        
          /**
           * Copy the value into user supplied BytesWritable buffer.
           * <code>atEnd()</code> must be tested false.
           * 
           * @param value
           *          The BytesWritable buffer supplied by user.
           * @return The length of the value.
           * 
           * @throws IOException
           */
          int getValue(BytesWritable value) throws IOException;
        
          /**
           * Get both key and value in one shot. Equivalent to
           * <code>getKey(key); getValue(value)</code>
           * 
           * @param key
           *          The BytesWritable buffer to hold the key.
           * @param value
           *          The BytesWritable buffer to hold the value.
           * @throws IOException
           */
          void get(BytesWritable key, BytesWritable value) throws IOException;
        
          /**
           * Search based on a particular key. Equivalent to
           * <code>find(key, 0, key.length, value)</code>
           * 
           * @param key
           *          The input key buffer
           * @param value
           *          The BytesWritable buffer to hold the value.
           * @return true if we find a value for the specified key.
           * @throws IOException
           */
          boolean find(byte[] key, BytesWritable value) throws IOException;
        
          /**
           * Search based on a particular key.
           * 
           * @param key
           *          The input key
           * @param offset
           *          The offset into the key buffer.
           * @param len
           *          The length of the key.
           * @param value
           *          The BytesWritable buffer to hold the value.
           * @return true if we find a value for the specified key.
           * @throws IOException
           */
          boolean find(byte[] key, int offset, int len, BytesWritable value)
              throws IOException;
        }
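The cursor-positioning contract above (lowerBound() moves to the first key greater than or equal to the input, upperBound() to the first key strictly greater, and seek() is lowerBound() plus an exact-match test) can be illustrated with a small self-contained sketch. The class below is a hypothetical illustration over an in-memory sorted key array standing in for a TFile, not the actual TFile implementation:

```java
/**
 * Illustration of the cursor-positioning contract proposed above: lowerBound()
 * moves to the first key greater than or equal to the input key, upperBound()
 * to the first key strictly greater, and seek() is lowerBound() plus an exact
 * match test. An in-memory sorted key array stands in for a TFile; the class
 * and its binary searches are a hypothetical sketch, not the TFile code.
 */
public class CursorSemantics {
  private final byte[][] sortedKeys; // assumed sorted in unsigned byte order
  private int cursor;                // current RowID; sortedKeys.length is end()

  public CursorSemantics(byte[][] sortedKeys) {
    this.sortedKeys = sortedKeys;
  }

  // Unsigned lexicographic comparison -- the usual raw-byte key order.
  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }

  public long begin() { return 0; }
  public long end() { return sortedKeys.length; }
  public long current() { return cursor; }
  public boolean atEnd() { return cursor == sortedKeys.length; }

  /** Move the cursor to the first key >= the input key (binary search). */
  public void lowerBound(byte[] key) {
    int lo = 0, hi = sortedKeys.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (compare(sortedKeys[mid], key) < 0) { lo = mid + 1; } else { hi = mid; }
    }
    cursor = lo;
  }

  /** Move the cursor to the first key > the input key (binary search). */
  public void upperBound(byte[] key) {
    int lo = 0, hi = sortedKeys.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (compare(sortedKeys[mid], key) <= 0) { lo = mid + 1; } else { hi = mid; }
    }
    cursor = lo;
  }

  /** lowerBound() plus an exact-match test, mirroring the proposed seek(). */
  public boolean seek(byte[] key) {
    lowerBound(key);
    return !atEnd() && compare(sortedKeys[cursor], key) == 0;
  }
}
```

With duplicate keys, lowerBound() lands on the first duplicate and upperBound() on the row just past the last one, so the pair brackets the run of equal keys.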
        
        Reader.java
        /**
         * TFile Reader. Typical usage patterns:
         * <ul>
         * <li>Single-threaded application: Users may directly use the Reader object.
         * <li>Multi-threaded application: Create one scanner for each thread through
         * <code>Reader.createScanner()</code>. To restrict a thread to access only a
         * region of the rows, use the reader to find the begin and end row RowID and
         * call <code>createScanner(begin, end)</code>. Note the end RowID is exclusive.
         * <li>MapReduce: The InputFormat would use Reader to find out the RowIDs
         * corresponding to split boundaries. Each mapper then uses reader to create a
         * scanner with the begin and end RowID.
         * </ul>
         */
        public class Reader implements Scanner {
          /**
           * Constructor.
           * 
           * @param path
           *          Path to the TFile.
           * @param conf
           *          Configuration object.
           * @throws IOException
           */
          public Reader(Path path, Configuration conf) throws IOException {
            // TODO
          }
        
          /**
           * Constructor
           * 
           * @param fsin
           *          The seekable input stream containing data of a TFile.
           * @param length
           *          The input stream length.
           * @param conf
           *          Configuration object.
           * @throws IOException
           */
          public Reader(FSDataInputStream fsin, long length, Configuration conf)
              throws IOException {
            // TODO:
          }
        
          /**
           * Test whether the TFile is sorted.
           * 
           * @return whether the TFile is sorted. Key-based search in Scanner requires
           *         the underlying TFile to be sorted.
           */
          public boolean isSorted() {
            // TODO:
            return false;
          }
        
          /**
           * Get an input stream to read the Meta block.
           * 
           * @param name
           *          Name of the Meta Block.
           * @return The input stream for the Meta block.
           */
          public DataInputStream getMetaBlock(String name) throws IOException {
            // TODO:
            return null;
          }
        
          /**
           * Get the name of the comparator.
           * 
           * @return If the TFile is not sorted, an empty string is returned. Otherwise,
           *         the name of the comparator is returned.
           */
          public String getComparatorName() {
            // TODO:
            return null;
          }
        
          /**
           * Get the RowID of the beginning row in the first compressed block whose byte
           * offset in the TFile is greater than or equal to the specified offset.
           * 
           * @param offset
           *          the user supplied offset.
           * @return the RowID of the row that fits the specified criteria; or
           *         <code>end()</code> if no such entry exists.
           */
          public long getRowIDNear(long offset) {
            // TODO:
            return 0;
          }
        
          /**
           * Get the total # of rows in TFile.
           * 
           * @return total # of rows.
           */
          public long getRowCount() {
            // TODO:
            return 0;
          }
        
          /**
           * Create a scanner that scans the whole TFile. Equivalent to
           * <code>createScanner(begin(), end())</code>.
           * 
           * @return The scanner object.
           */
          public Scanner createScanner() {
            // TODO:
            return null;
          }
        
          /**
           * Create a scanner that scans a specified portion of TFile.
           * 
           * @param begin
           *          The beginning RowID (inclusive).
           * @param end
           *          The end RowID (exclusive).
           * @return The scanner object.
           */  
          public Scanner createScanner(long begin, long end) {
            // TODO:
            return null;
          }
        }
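The Reader javadoc above says multi-threaded applications and MapReduce InputFormats hand each worker a half-open [begin, end) RowID range for createScanner(begin, end). A hedged sketch of how a caller might carve [0, getRowCount()) into contiguous ranges follows; the helper is hypothetical, not part of the proposed API:

```java
/**
 * Hypothetical helper (not part of the proposed API) showing how a caller
 * might carve [0, rowCount) into numSplits contiguous half-open RowID ranges
 * for createScanner(begin, end), following the exclusive-end convention in
 * the Reader javadoc above. Assumes rowCount * numSplits fits in a long.
 */
public class RowRangeSplitter {
  /** Returns {begin, end} for split i (0-based) of numSplits over rowCount rows. */
  public static long[] split(long rowCount, int numSplits, int i) {
    long begin = rowCount * i / numSplits;
    long end = rowCount * (i + 1) / numSplits;
    return new long[] { begin, end };
  }
}
```

Because each split's end equals the next split's begin, the ranges tile [0, rowCount) with no overlap and no gap, which is exactly what the exclusive-end convention buys.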
        
        Amir Youssefi added a comment -

        1) Currently there are two versions: one for the TFile API and one for the BCFile API. A version consists of a Major and a Minor number.

        TFile/BCFile API versions are stored when a new file is written. Later, when we read a file, the API versions used for reading are compared to the versions stored in the file. Currently we only check the Major version and throw an exception if there is a mismatch in Major.

        We can add more compatibility logic to this code when necessary.

        2) There are some compatibility ideas incorporated into the design (see Page 11 of the attached spec for more details).
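The Major/Minor rule described above (tolerate Minor differences, reject a Major mismatch) can be sketched as follows; the class and method names here are hypothetical, not the ones in the patch:

```java
import java.io.IOException;

/**
 * Sketch of the compatibility rule described above: the reader compares its
 * own API version against the version stamped in the file and rejects only a
 * Major mismatch (a Minor difference is tolerated). Class and field names are
 * hypothetical; the actual patch may differ.
 */
public class Version {
  final short major;
  final short minor;

  public Version(short major, short minor) {
    this.major = major;
    this.minor = minor;
  }

  /** True when this (reader) version can read a file stamped with 'fromFile'. */
  public boolean compatibleWith(Version fromFile) {
    return major == fromFile.major;
  }

  /** Throw, as the comment above describes, on a Major mismatch. */
  public void check(Version fromFile) throws IOException {
    if (!compatibleWith(fromFile)) {
      throw new IOException("Incompatible TFile version: file is "
          + fromFile.major + "." + fromFile.minor + ", reader is "
          + major + "." + minor);
    }
  }
}
```

Keeping the rule in one place makes it easy to add finer-grained compatibility logic later, as the comment anticipates.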

        eric baldeschwieler added a comment -

        yup. There is that danger. Of course if it is in contrib, not making
        strong guarantees seems fair. We plan to use this on lots of data, so
        if we make format changes, we'll need to provide a solution to
        upgrades. The format should support version checking on day one.

        Pete Wyckoff added a comment -

        Let's hold the bar relatively low and wait until it proves itself to graduate it elsewhere.

        Since this includes a file format, won't that mean we won't then be able to make any changes to the format or if we have to make changes, we will have to version it/deal with "legacy" data right away?

        eric baldeschwieler added a comment -

        I'm all for putting this in contrib. Let's hold the bar relatively low and wait until it proves itself to graduate it elsewhere.

        We could record the start of the first duplicate of a block's leading key earlier in the file. This might allow a relatively efficient skip to the start of a block of duplicate keys that cross a block boundary. If you just store the distance in uncompressed bytes, you can probably sort the rest out.

        stack added a comment -

        ...would your auxiliary index remember how many records start with the same row-key

        Since it's not designed yet, it could (smile). Thanks for the hint.

        Hong Tang added a comment -

        Sorry if this need seems exotic but I think we can get away with casting this need under the 'Extensibility' TFile Design Principle. In our application, keys are row/column/timestamp. If there are millions of columns in a row and we want to skip to the next row, we can't next-next-next through the keys. It'll be too slow. We need to skip ahead to the new row. The block index won't help in this regard.

        Yes, it sounds reasonable to change the various indices in TFile to protected instead of private.

        Just curious: would your auxiliary index remember how many records start with the same row-key, so that you could take advantage of this to advance quickly? If so, a better way than opening up advanceCursorInBlock() is to provide an advanceCursor API on the Scanner.

        stack added a comment -

        To Owen's list I'd suggest adding amenable random-access and access to the block index by subclasses.

        @Jim. MapFile is ill-suited to hbase use. It needs to be replaced. Discussions above are about looking at TFile as possible foundation for replacement. For hbase log files, where we need append, we can just keep on with the SequenceFile we use now.

        @Hong

        .bq ...is to write your own key appender and value appender classes...

        This sounds fine for creating auxillary index information. I then suggested advanceCursorInBlock needed to be made accessible so I could then exploit the auxillary index at read time but on study, I see this is the wrong place. Where would you suggest I plugin to exploit my extra index information at Read time?

        Sorry if this need seems exotic, but I think we can get away with casting it under the 'Extensibility' TFile design principle. In our application, keys are row/column/timestamp. If there are millions of columns in a row and we want to skip to the next row, we can't next-next-next through the keys. It'll be too slow. We need to skip ahead to the new row. The block index won't help in this regard.

        Jim Kellerman added a comment -

        I don't think it needs to include a table framework, but it does need to support random I/O and comparators at a minimum.

        Owen O'Malley added a comment -

        The other option is holding off until it includes a table framework.

        Owen O'Malley added a comment -

        I think that a reasonable path to getting this committed is to make it useful to map/reduce. That would require, precisely:
        1. Support for user comparators.
        2. An ObjectFile class that extends (or wraps) TFile and uses the serialization libraries to provide objects.
        3. {In,Out}putFormats that use ObjectFile.
        4. A testcase that has a map/reduce job that reads and writes ObjectFiles.

        Doug Cutting added a comment -

        Thinking a bit more about this, I'm not sure contrib is right either. Since portability and support for mapreduce are stated goals, we should perhaps not commit this at all until it implements those goals. I'd like to see an inputformat, an outputformat and c++ APIs before we commit this. I'd also very much like to see support for comparators and random access. We do not want code committed that's not yet useful--that leads to bloat and sets a bad precedent for the project. If language-independence is a primary goal then it should be addressed up-front. The implementation might also be much simpler if two implementations must be developed and maintained.

        Jim Kellerman added a comment -

        Doug Cutting - 24/Sep/08 10:04 AM
        >> this jira is not supposed to be a replacement for MapFile
        >
        > That's unfortunate. I thought that was the point of including indexes in the file, rather than in a side file.

        >Owen O'Malley - 26/Apr/08 09:31 AM
        >> Is this a format just for compressed sequence files, or for all sequence files?
        >
        > The issue is most critical for compressed sequence files, but it would make sense to make the
        > compression optional. I would not support value compression.
        >
        >> Is this intended as a replacement for MapFile too?
        >
        > yes

        Ok, I guess my misunderstanding was due to Owen's comment.

        Doug Cutting - 24/Sep/08 10:04 AM
        > I wonder whether tfile should start out in contrib until it is more full-featured? Without support for java
        > comparators or random access it is not yet a replacement for SequenceFile. It also doesn't yet have
        > any inputformats, so it cannot be used from mapreduce. Nor does it yet have bindings for other
        > programming languages. So my preference is that, until tfile is proven to be of general utility to
        > Hadoop applications, it should live in contrib. We don't want code in core that's not both widely
        > usable and actually used.

        +1

        Since TFile is not a replacement for MapFile and does not support reading up to the latest sync while writing, we probably cannot use it for HBase.

        Doug Cutting added a comment -

        > this jira is not supposed to be a replacement for MapFile

        That's unfortunate. I thought that was the point of including indexes in the file, rather than in a side file.

        I wonder whether tfile should start out in contrib until it is more full-featured? Without support for java comparators or random access it is not yet a replacement for SequenceFile. It also doesn't yet have any inputformats, so it cannot be used from mapreduce. Nor does it yet have bindings for other programming languages. So my preference is that, until tfile is proven to be of general utility to Hadoop applications, it should live in contrib. We don't want code in core that's not both widely usable and actually used.

        stack added a comment -

        @Hong. Yeah. You got all my questions. Here's some follow-on comments.

        Would be great if you could figure some way of avoiding double-decompress doing a single random access. Yeah on name of alternate comparator, etc.

        On "indexing the region by the endKey", pardon me. I thought ceiling was a Scanner-only method. I see now what you are suggesting, and that could almost work (leaving aside the need to rewrite a bunch of ornery hbase code to make the change).

        But what if the following in TFile.Reader:

            private final MetaBlockTFileDataIndex metaTFileDataIndex;
            private final MetaBlockTFileMeta metaTFileMeta;
        

        ...had protected accessors so they were accessible to a subclass? Then a subclass could do its own implementation of ceiling to return the key closest before rather than closest after, as currently implemented. To do this, I'd replicate the body of ceiling – I would need the accessors so I could do the following call in my subclass:

              int blkIndex = metaTFileDataIndex.lowerBound(inKey);
        

        and if searchInBlock were also protected instead of private, a subclass could ask it to find the entry equal to or before rather than equal to or after?

        Hong Tang added a comment -

        To stack:

        Any remarks on my questions in 'stack - 23/Sep/08 02:26 PM'? Thanks.

        Can you check my comments in "Hong Tang - 23/Sep/08 04:52 PM"? Did I miss some of your questions?

        Hong Tang added a comment -

        On "indexing the region by the endKey", pardon me, I'm not sure I follow. Currently index is block-based, not key-based IIUC so can I even make an index that has all keys? Or, can you make an index that is key-based? (Even if I could index all keys, if key/values are small, might make for a big index so might need something like the MapFile interval).

        Sorry for the confusion. My comment is in response to your question wrt whether TFile can support something similar to MapFile's getClosest() call. The answer to that question is that we cannot implement such semantics efficiently, because the API would require a bidirectional iterator and the underlying decompression stream only moves forward.

        My understanding of your use case is that you currently have a MapFile whose key is the <region startKey> and whose value may contain <region endKey, ...>. Given a client key, you perform getClosest(before==true) to get to the right region entry in the MapFile. To support this use case in TFile, you may use <region endKey> as the TFile key and <region startKey, ...> as the TFile value. Then TFile.Reader.ceiling(clientKey) will get you to the right entry.
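        The endKey-indexing idea can be sketched outside TFile with a plain java.util.TreeMap, which offers the same ceiling semantics. All names below are illustrative, not TFile API; region endKeys are assumed inclusive:

        ```java
        import java.util.Map;
        import java.util.TreeMap;

        // Sketch of the suggestion above: index each region by its endKey and keep
        // the startKey in the value. A ceiling lookup on a client key then returns
        // the region whose range [startKey, endKey] contains that key.
        class RegionIndex {
            // endKey -> startKey (the value could also carry other region metadata)
            private final TreeMap<String, String> byEndKey = new TreeMap<>();

            void addRegion(String startKey, String endKey) {
                byEndKey.put(endKey, startKey);
            }

            // Returns the startKey of the region containing clientKey, or null
            // when clientKey sorts past the last region's endKey.
            String regionFor(String clientKey) {
                Map.Entry<String, String> e = byEndKey.ceilingEntry(clientKey);
                return e == null ? null : e.getValue();
            }
        }
        ```

        TFile.Reader.ceiling(clientKey) would play the role of ceilingEntry here, with the region details read out of the matched entry's value.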

        Mahadev konar added a comment -

        jim,
        TFile, as per this JIRA, is not supposed to be a replacement for MapFile. We can create another JIRA in which we implement an interface that allows fast random lookups, which would be easy to do. For now, our intention is to get TFile in as per the specification posted.

        Jim Kellerman added a comment -

        > Hong Tang - 23/Sep/08 04:34 PM
        > What I mean is, currently TFile does not directly support random lookup. The only way for you to perform
        > this is by doing the following:

        -1

        If TFile does not directly support random lookup, then it is hardly a replacement for MapFile.

        stack added a comment -

        Oh, I see what you are saying about "not directly supporting random lookup". I'd say that's a bit of a hole in TFile, especially if you want to replace MapFile (MapFile.Reader.get(key)).

        On "indexing the region by the endKey", pardon me, I'm not sure I follow. Currently index is block-based, not key-based IIUC so can I even make an index that has all keys? Or, can you make an index that is key-based? (Even if I could index all keys, if key/values are small, might make for a big index so might need something like the MapFile interval).

        When you say attribute, do you mean attribute of the key? Or something else.

        Thanks for entertaining my questions Hong.

        Any remarks on my questions in 'stack - 23/Sep/08 02:26 PM'? Thanks.

        Hong Tang added a comment -

        + How about defines for at least the common compression types and comparator name(s) at least for the common case where TFile is used by java?

        Agree.

        + If no compression, does that mean there is only one block in a file or do we still make blocks of size minBlockSize (raw size == compressed size)?

        Multiple blocks, raw size == compressed size.

        + If I wanted to ornament the index - say, I wanted to add a metadata block per BCFile block that had in it the offset of every added key (or the offset of every 'row' in hbase) in the name of improving random access speeds - it looks like I would override prepareAppendKey and then do my own KeyRegister class that keeps up the per-block index? KeyRegister is currently private. Can it be made subclassable?

        That is a use case I have not thought about. A (slightly) less performant way I can recommend is to write your own key appender and value appender classes as filter streams on top of the key/value appending streams returned by TFile, and add customized actions in close() (before/after calling close() on the downstream).
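        The filter-stream approach described here can be sketched with a plain java.io.FilterOutputStream; the TFile appender streams are assumed to behave like ordinary OutputStreams, and the class and field names below are illustrative:

        ```java
        import java.io.FilterOutputStream;
        import java.io.IOException;
        import java.io.OutputStream;
        import java.util.List;

        // Sketch: wrap the key-appending stream handed back by the writer, count
        // the bytes of each key, and record the key's end offset as a customized
        // action in close(), before closing the downstream. A real auxiliary index
        // would persist these offsets into a TFile meta block.
        class IndexingKeyStream extends FilterOutputStream {
            private final List<Long> keyOffsets; // auxiliary index under construction
            private long bytesWritten;

            IndexingKeyStream(OutputStream downstream, List<Long> keyOffsets,
                              long startOffset) {
                super(downstream);
                this.keyOffsets = keyOffsets;
                this.bytesWritten = startOffset;
            }

            // FilterOutputStream funnels the array write() variants through this
            // method by default, so counting each byte here stays correct.
            @Override
            public void write(int b) throws IOException {
                out.write(b);
                bytesWritten++;
            }

            @Override
            public void close() throws IOException {
                keyOffsets.add(bytesWritten); // custom action before downstream close
                super.close();
            }
        }
        ```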

        advanceCursorInBlock is also private which doesn't help if I want to exploit my ancillary-index info. Or what would you suggest if I want to make a more-involved index (I can't use the BCFile block index since key/values might be of variable size - or, maybe I can set the blocksize to zero and index every element?).

        Hmm, it is not clear how intercepting advanceCursorInBlock may help. Would it suffice for you to know by how many <key, value> pairs a seek() call moves forward or backward?

        To add support for alternate comparators and for exposing the index at least to subclasses, should we add a patch atop your patch or just wait till whats here gets committed?

        I'd say let's wait until it gets committed. Also, to keep in line with the original design objective, language-specific customized comparators should have string names like "jclass:path/to/java/package/ClassName" or "clib:path/to/C/library/functionName".
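        The proposed naming scheme could be parsed as below. Only the "jclass:" and "clib:" prefixes come from the comment above; the class and method names are a sketch:

        ```java
        // Sketch: split a comparator name like "jclass:org/example/MyComparator"
        // or "clib:path/to/lib/cmpFn" into its language tag and implementation
        // path, so each language binding can load only the comparators it knows.
        class ComparatorName {
            final String scheme; // e.g. "jclass" or "clib"
            final String path;   // implementation location within that scheme

            private ComparatorName(String scheme, String path) {
                this.scheme = scheme;
                this.path = path;
            }

            static ComparatorName parse(String name) {
                int colon = name.indexOf(':');
                if (colon <= 0 || colon == name.length() - 1) {
                    throw new IllegalArgumentException("bad comparator name: " + name);
                }
                return new ComparatorName(name.substring(0, colon),
                                          name.substring(colon + 1));
            }
        }
        ```

        A Java reader would then reject (or refuse to open) a file whose comparator scheme it cannot resolve, matching the restriction mentioned earlier in the thread.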

        It looks like I could do an in-memory TFile if I wanted, since I provide the stream? Is that so? If so, that's sweet!

        Guess so. (We haven't thought about this usage though.)

        Hong Tang added a comment -

        Regards discussing our use case, our need is plain. Our catalog table keeps a list of table regions (Catalog table is basically made from MapFiles). Regions are denoted by their start and end keys. To update a particular row, client asks the catalog table for the region that contains the key to update. Currently we depend on MapFile#getClosest (with the before flag set) to do this.

        You may just index the region by the endKey (and keep startKey as an attribute if you want to). Then ceiling() will give you the right region.

        Hong Tang added a comment -

        -1 on decompressing a block twice if it can be avoided.

        What I mean is, currently TFile does not directly support random lookup. The only way for you to perform this is by doing the following:
            Location l = reader.locate(key);
            Scanner scanner = reader.createScanner(l, reader.end());
            scanner.getKey(key);
            scanner.getValue(value);
            scanner.close();
        
        And the above code will lead to reading the compressed block twice. There are multiple ways to support random look up without incurring this overhead, and I need to take a closer look and figure out what is the best way to support it.

        stack added a comment -

        Not getting key+value in the one go is a nice improvement.

        -1 on decompressing a block twice if it can be avoided.

        Regards discussing our use case, our need is plain. Our catalog table keeps a list of table regions (Catalog table is basically made from MapFiles). Regions are denoted by their start and end keys. To update a particular row, client asks the catalog table for the region that contains the key to update. Currently we depend on MapFile#getClosest (with the before flag set) to do this.

        Hong Tang added a comment -

        In Data Model/Operations (page 2) it says: "A TFile is not accessible for read while it is being created." While I can understand why that is reasonable for a sorted TFile since the index is written last, for a TFile that is unsorted (so it behaves as a SequenceFile) it should be possible to open a Reader on the TFile to read up to the last sync, provided you do a getFileStatus to get the "current length". Is this not the case?

        TFile is not meant to be used in such situation. Attempting to read a TFile that is being written out will result in failure because it misses the various signatures and data structures stored at the end of the file.

        Hong Tang added a comment -

        hbase depends on the MapFile#getClosest (and MapFile#getClosestBefore): i.e. "Finds the record that is the closest match to the specified key." returning either an exact match as TFile#locate does or the next key (before or after dependent on provided parameters). TFile does not expose its index so this facility would be hard to build on TFile even in a subclass.

        The reason we did not support this is that TFile.Reader.Scanner is intrinsically a forward iterator (borrowing the C++ STL concept: it only supports ++, not --). So to support the operation of looking for the largest key smaller than or equal to the search key, you may need to decompress a block twice. We can discuss your use case in detail and see if there may be some workaround for it.

        Note that this is also an issue with the current implementation of the scanner, which requires you to seek twice in the block (first in TFile.Reader.locate(), and then in the constructor of Scanner). This is not an issue when using TFile with MapReduce, because locate() and the construction of the Scanner happen in different processes (one in the JobClient, one in the Mapper). But for direct random access to TFile, we need to avoid such a problem (by providing a separate lookup call that returns the key and value stream instead of a Location object).

        Hong Tang added a comment -

        We would like to leverage TFile in hbase. memcmp-sort of keys won't work for us. Will it be hard to add support for another comparator?

        Yes, we plan to add this feature later - at least for Java. But we may restrict TFiles with such comparators to being accessed from Java only.

        On page 8, the awkward-looking for-loop at the head of the page, with its isCursorAtEnd and advanceCursor, has some justification in practicality, I presume. Otherwise, why not use the common Java hasNext/next Iterator idiom?

        Yes, the original consideration behind it is that the Java Iterator interface always parks the cursor on the entry that was already read, and you use next() to move the cursor to the next entry and fetch the result atomically. On the other hand, the TFile scanner separates cursor movement from data access (because we have two ways of moving the cursor: advanceCursor() and seek()), so next() does not make sense here. [ Note that the idiom is close to the iterator concept in the C++ STL design: you first get a begin and end iterator from a container, then you can do for_each(begin, end, Op). advanceCursor() corresponds to ++iter, and isCursorAtEnd corresponds to (iter == end). ]
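        The separation of cursor movement from data access can be mimicked in plain Java. The method names below mirror the spec's idiom but are illustrative, not the actual Scanner API:

        ```java
        // Sketch of the STL-like forward-cursor idiom described above:
        // advanceCursor() plays the role of ++iter, isCursorAtEnd() the role of
        // (iter == end). Reading the entry is a separate call, so a caller can
        // inspect a key and decide whether to fetch the value at all.
        class ForwardCursor {
            private final String[] entries;
            private int pos = 0;

            ForwardCursor(String[] entries) {
                this.entries = entries;
            }

            boolean isCursorAtEnd() {
                return pos >= entries.length;
            }

            void advanceCursor() {
                if (isCursorAtEnd()) {
                    throw new IllegalStateException("cursor already at end");
                }
                pos++; // forward only: there is no retreat operation
            }

            String getEntry() {
                return entries[pos];
            }
        }
        ```

        A scan is then written as `for (; !c.isCursorAtEnd(); c.advanceCursor()) use(c.getEntry());`, matching the for-loop shape on page 8 of the spec.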

        As for why we do not get the key and value in one call: this is because we want to allow people to first get the key, then decide whether to read the value or not (consider the application of doing an inner join). But conceivably, we can provide various convenience utility methods to get both key and value in one shot (just as we did for append).

        bq. On the performance numbers above, how about adding in a test of random accesses into TFiles/SequenceFiles?
        Yes, we will follow up on that.

        stack added a comment -

        Other comments on TFile.java:

        + How about defines for at least the common compression types and comparator name(s) at least for the common case where TFile is used by java?
        + If no compression, does that mean there is only one block in a file or do we still make blocks of size minBlockSize (raw size == compressed size)?
        + If I wanted to ornament the index – say, I wanted to add a metadata block per BCFile block that had in it the offset of every added key (or the offset of every 'row' in hbase) in the name of improving random access speeds – it looks like I would override prepareAppendKey and then do my own KeyRegister class that keeps up the per-block index? KeyRegister is currently private. Can it be made subclassable? advanceCursorInBlock is also private which doesn't help if I want to exploit my ancillary-index info. Or what would you suggest if I want to make a more-involved index (I can't use the BCFile block index since key/values might be of variable size – or, maybe I can set the blocksize to zero and index every element?).
        + The code looks really good.

        To add support for alternate comparators and for exposing the index at least to subclasses, should we add a patch atop your patch or just wait till whats here gets committed?

        It looks like I could do an in-memory TFile if I wanted since I provide the stream? Is that so? If so, thats sweet!

        Jim Kellerman added a comment -

        In Data Model/Operations (page 2) it says: "A TFile is not accessible for read while it is being created." While I can understand why that is reasonable for a sorted TFile since the index is written last, for a TFile that is unsorted (so it behaves as a SequenceFile) it should be possible to open a Reader on the TFile to read up to the last sync, provided you do a getFileStatus to get the "current length". Is this not the case?

        If TFile does not support sync, then deprecating SequenceFile (as has been discussed above) would be a problem for HBase.

        I'd also like to see a comparison between TFile's random access performance vs MapFile's random access performance.

        stack added a comment -

        hbase depends on MapFile#getClosest (and MapFile#getClosestBefore): i.e., "Finds the record that is the closest match to the specified key", returning either an exact match, as TFile#locate does, or the next key (before or after, depending on the provided parameters). TFile does not expose its index, so this facility would be hard to build on TFile even in a subclass.
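        To make the getClosest semantics concrete, here is a minimal sketch of the lookup over a sorted in-memory key list, assuming memcmp-style (unsigned lexicographic) key ordering; the class and method names are invented for illustration and are not part of MapFile or TFile.

```java
public class ClosestKey {
    // Smallest index whose key is >= probe (getClosest semantics),
    // or -1 if every key is smaller than the probe.
    static int closest(byte[][] sortedKeys, byte[] probe) {
        int lo = 0, hi = sortedKeys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(sortedKeys[mid], probe) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo < sortedKeys.length ? lo : -1;
    }

    // Unsigned lexicographic (memcmp) comparison.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

        An exact match falls out as compare(sortedKeys[idx], probe) == 0; a getClosestBefore variant would return lo - 1 instead.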

        stack added a comment -

        Taking a quick look at the patch, changing the comparator isn't currently possible but doesn't look hard to change (The comparator 'name' is already part of the intrinsic metadata).

        stack added a comment -

        I took a look at the new PDF doc (Haven't looked at the patch yet – excuse me if my questions are answered therein)

        We would like to leverage TFile in hbase. memcmp-sort of keys won't work for us. Will it be hard to add support for another comparator?

        On page 8, the awkward-looking for-loop at the head of the page with its isCursorAtEnd and advanceCursor has some justification in practicality, I presume; otherwise, why not use the common Java hasNext()/next() Iterator idiom?

        On the performance numbers above, how about adding in test of random accesses into TFiles/SequenceFiles? (If you need a tool to to test random accessing, the MapFilePerformanceEvaluation tool in hbase-test.jar might help: http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/test/org/apache/hadoop/hbase/MapFilePerformanceEvaluation.java?view=markup)
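        The cursor-style loop in question can be sketched as below. The Scanner interface here is a made-up stand-in for illustration (the real TFile API may differ in names), but it shows both the STL-like iteration and the key-first, value-on-demand access pattern:

```java
import java.util.Arrays;

// Made-up scanner interface mirroring the cursor idiom under discussion.
interface Scanner {
    boolean isCursorAtEnd();
    void advanceCursor();
    byte[] getKey();
    byte[] getValue();
}

// Toy in-memory implementation for the demo.
class ArrayScanner implements Scanner {
    private final byte[][] keys, values;
    private int pos = 0;
    ArrayScanner(byte[][] k, byte[][] v) { keys = k; values = v; }
    public boolean isCursorAtEnd() { return pos >= keys.length; }
    public void advanceCursor() { pos++; }
    public byte[] getKey() { return keys[pos]; }
    public byte[] getValue() { return values[pos]; }
}

class ScanDemo {
    // Reads each key; fetches the value only when the key matches,
    // as in the inner-join use case mentioned in the discussion.
    static int countMatches(Scanner s, byte[] wanted) {
        int hits = 0;
        for (; !s.isCursorAtEnd(); s.advanceCursor()) {
            if (Arrays.equals(s.getKey(), wanted)) {
                s.getValue(); // value read only on demand (discarded here)
                hits++;
            }
        }
        return hits;
    }
}
```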

        Doug Cutting added a comment -

        > Used SequenceFile and TFile to implement two common interfaces [ ... ]

        This calls out the need, as we add TFile, to have a common API for container files. This would, e.g., reduce code duplication in input and output formats for mapreduce. This is related to HADOOP-4065.

        Hong Tang added a comment -

        A preliminary study on performance comparison between TFile and SequenceFile.

        Settings:

        • OS: RHEL AS4 Nahant Update 2. Linux 2.6.9-55.ELsmp.
        • Hardware: dual 2GHz CPU, 4GB main memory, WD Caviar 400GB (with 60GB free space).
        • Key length: uniform random 50-100B.
        • Value length: uniform random 5K-10K.
        • Both keys and values are composed by using a "dictionary" of 1000 "words", each word's length is uniformly distributed between 5-20B.
        • Compression schemes: none, lzo, and gz. The # of <key, value> pairs are 600K, 3M, 3M for the three cases. The output file sizes in the three cases are 4.4G, 10G, and 6G respectively. This eliminates the file caching effect.
        • Used SequenceFile and TFile to implement two common interfaces: one for writing, and one for reading. As follows:
        private interface KVAppendable {
          public void append(BytesWritable key, BytesWritable value)
              throws IOException;
          public void close() throws IOException;
        }
        
        private interface KVReadable {
          public byte[] getKey();
          public byte[] getValue();
          public int getKeyLength();
          public int getValueLength();
          public boolean next() throws IOException;
          public void close() throws IOException;
        }
        

        Both interfaces allow for efficient implementations by either TFile or SequenceFile, avoiding any object creation or buffer copying to conform to the interface.
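        As a sketch of how a read loop looks against KVReadable, with a trivial in-memory stand-in for the SequenceFile/TFile-backed implementations in the patch (MemoryKV and ReadLoop are invented names for this illustration):

```java
import java.io.IOException;

// The KVReadable interface from the benchmark, restated here so the
// sketch is self-contained.
interface KVReadable {
    byte[] getKey();
    byte[] getValue();
    int getKeyLength();
    int getValueLength();
    boolean next() throws IOException;
    void close() throws IOException;
}

// Trivial in-memory stand-in; the patch's implementations wrap
// SequenceFile and TFile readers instead.
class MemoryKV implements KVReadable {
    private final byte[][] keys, values;
    private int pos = -1;
    MemoryKV(byte[][] k, byte[][] v) { keys = k; values = v; }
    public byte[] getKey() { return keys[pos]; }
    public byte[] getValue() { return values[pos]; }
    public int getKeyLength() { return keys[pos].length; }
    public int getValueLength() { return values[pos].length; }
    public boolean next() { return ++pos < keys.length; }
    public void close() { }
}

class ReadLoop {
    // Mirrors the timed loop: next() plus the two length getters,
    // touching no buffers and allocating nothing per record.
    static long totalBytes(KVReadable r) throws IOException {
        long total = 0;
        while (r.next()) {
            total += r.getKeyLength() + r.getValueLength();
        }
        r.close();
        return total;
    }
}
```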

        Some finer details:

        • For writing, the timing includes the append() call (so TFile metadata writing is not included); it also includes the time used to compose keys and values. The same seed is used to construct a Random object that does the key/value composition.
        • For reading, the timing includes the next() call followed by getKeyLength() and getValueLength(). (getKey() and getValue() simply return internally cached key/value buffers, and are O(1) operations.)
        • For each compression scheme, I run and time the following tasks: create sequence file, read sequence file, create TFile, read TFile, create TFile, read TFile, create sequence file, read TFile. Then I only pick the better performance of the two. This is to remove the possible effects of JVM hotspot compilation, garbage collection, and/or occasional host-related activities (I have seen tar and yum in the top screen from time to time).
        • Memory footprint: SeqFile needs to cache the full block of uncompressed data and compressed data, each on the order of 1MB, so the total buffering is about 2MB. For TFile, the block size is set to 10MB, but the amount of buffering is 4K (buffering for small writes before the compression/decompression stream) + 256K (FS read/write buffering) + 1MB (for writes if value length is not known; this is also tunable). They are comparable, but TFile is more malleable.

        Finally, the results:

                             SeqFile-none  TFile-none  SeqFile-lzo  TFile-lzo  SeqFile-gz  TFile-gz
        Data sizes (MB)           4435.98     4435.98     22179.13   22179.13    22179.13  22179.13
        File sizes (MB)           4456.58     4438.31     10080.23   10063.48     6236.91   5943.07
        Write Eff BW (MB/s)         36.86       35.53        38.46      39.97       13.59     13.54
        Write I/O BW (MB/s)         37.03       35.54        17.48      18.14        3.82      3.63
        Read Eff BW (MB/s)          41.13       40.16        86.77      91.04       52.73     75.15
        Read I/O BW (MB/s)          41.32       40.18        39.44      41.31       14.83     20.14

        Things to notice:

        • In most cases, SeqFile and TFile performance are similar.
        • TFile sizes are usually smaller than SeqFile size - SeqFile encodes length using 4B, and TFile uses VInt. The setup of the benchmark favors TFile in this regard, fixed key/value sizes may make SeqFile smaller.
        • For no compression, SeqFile outperforms TFile, because both only need one layer of buffering and the TFile interface requires the creation of more small objects during setup.
        • For lzo and gz compression, TFile outperforms SeqFile due to the reduction of extra buffer copying.
        • Particularly for gz compression, the read performance of TFile is 42% faster. This is because DecompressorStream always reads as much data as possible from the downstream using the internal buffer size, and I took advantage of that by skipping my own FS buffering; so bulk reads (reading a value) incur the least amount of buffer copying. (This is not true for LZO, which uses block compression; its BlockDecompressorStream always reads small blocks on the order of 20KB.) Reduced buffer copying saves CPU cycles, and in the case of gz, CPU is the bottleneck.
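        The VInt-versus-fixed-4-byte point above can be quantified. The function below mirrors my understanding of Hadoop's WritableUtils.getVIntSize, reimplemented here so the sketch is self-contained; treat it as an approximation rather than the authoritative codec:

```java
class VIntSize {
    // Bytes taken by a Hadoop-style variable-length integer encoding.
    static int size(long i) {
        if (i >= -112 && i <= 127) return 1;   // single-byte fast path
        if (i < 0) i ^= -1L;                   // one's complement for negatives
        int dataBits = 64 - Long.numberOfLeadingZeros(i);
        return (dataBits + 7) / 8 + 1;         // payload bytes plus marker byte
    }
}
```

        So a typical key length (50-100) costs 1 byte and a value length (5K-10K) costs 3 bytes per record, instead of a fixed 4 bytes each, which is consistent with the file-size gap in the table above.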

        The above results are still preliminary. No YourKit profiling is done on TFile side. And results could vary for different settings - different key value lengths, compression ratios, underlying I/O speed, etc.

        Amir Youssefi added a comment -

        Utils.java is very light now.

        memcmp is updated to use Hadoop's WritableComparator.compareBytes(..).

        We plan to replace VInt, VLong with alternative options soon.

        Amir Youssefi added a comment -

        Update with refactoring and performance improvements.

        Hong Tang added a comment -

        > If you really want this as a top-level goal, I really think you could define the TFileHeaders as a struct in something like protocol buffers or Thrift or even RecordIO (which would at least give you C) and then no C++, Python, Perl, ... implementor ever has to worry about data format. This is probably a day or two of work and in the long run should really pay off.

        > this is also aided greatly by using thrift/protobuffers/record io since they can parse newer versions of the same struct and just ignore unknown fields

        Yes, I agree that using Protocol Buffer or Thrift for headers (rather, tails) is probably a good way to help achieve language neutrality and compatibility, and even shrink the # of lines. I will take a closer look at them.

        > Stupid question, but if I am not really using the key field (i.e. just storing an opaque row), does the TFile incur any more overhead for all the empty keys than does a SequenceFile?

        Each empty key would take one byte (length=0) to represent, so it does have some overhead.

        Pete Wyckoff added a comment -

        > but also forward compatibility (old software read new data).

        this is also aided greatly by using thrift/protobuffers/record io since they can parse newer versions of the same struct and just ignore unknown fields

        Stupid question, but if I am not really using the key field (i.e. just storing an opaque row), does the TFile incur any more overhead for all the empty keys than does a SequenceFile?

        Pete Wyckoff added a comment - - edited

        > Admittedly, we cannot claim language neutrality

        If you really want this as a top-level goal, I really think you could define the TFileHeaders as a struct in something like protocol buffers or Thrift or even RecordIO (which would at least give you C) and then no C++, Python, Perl, ... implementor ever has to worry about data format. This is probably a day or two of work and in the long run should really pay off.

        If one didn't need to worry about format and ordering of headers, how much easier is it to implement the read side of TFiles which is often the first thing you want.

        Hong Tang added a comment -

        Adding to Owen's comments.

        • One of the hidden assumptions of the TFile design is that once data are stored in TFile format, they will stay there for a long time, and may need to sustain software changes/improvements. That is where the requirements of extensibility and compatibility come from. Note that we are not only talking about backward compatibility (new software reads old data), but also forward compatibility (old software reads new data). Various mechanisms have been built into our design to facilitate this goal (named meta blocks, length-guarded array entries, <offset, length> region specification, etc.). This is also the reason we are required to provide a storage spec that goes down to the details of every bit.
        • Another unstated design requirement is that the design must be performant. This not only means that scanning/writing TFiles should be done at close to bulk I/O throughput; it also means that we need to support reading/writing many TFiles concurrently, and thus we need to reduce our memory footprint. The internal requirement is to support 100+ TFiles with a single-digit MB memory footprint per TFile, regardless of the settings of minimum block size and/or value size.
        • We also want to make sure data stored in TFile may be exchanged among different groups who may have different preferences of programming language. Admittedly, we cannot claim language neutrality until we actually implement it in more than one programming language. However, having this objective from the very beginning and reviewing it from time to time certainly helps. And the fact that we are able to spec out the storage format down to every bit without referring to Java or any other library gives us confidence for achieving this goal.
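        As an illustration of the "length-guarded" mechanism mentioned above, each entry carries its total length, so an older reader can consume the fields it knows and skip whatever a newer writer appended. The record layout here is invented for the sketch; TFile's actual layouts are defined in the spec PDF:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

class LengthGuarded {
    // Reads the one field this (old) version knows, then skips the
    // remainder of the length-guarded entry, whatever it contains.
    static int readKnownField(DataInputStream in) throws IOException {
        int len = in.readInt();      // guard: bytes that follow
        int field = in.readInt();    // known field (4 bytes)
        long remaining = len - 4;    // unknown, newer fields
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) throw new EOFException("truncated entry");
            remaining -= skipped;
        }
        return field;
    }
}
```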

        With regard to the questions why we need so many lines for the implementation, I did a quick run-down of the code:

        • Over 1100 lines are comments.
        • Over 500 lines are blank lines and import lines.
        • Many classes and methods in Utils are fairly general and could be reused, such as chunk encoder/decoder, lowerBound(), upperBound(), BoundedByteArrayOutputStream, BoundedRangeFileInputStream, etc. They take about 400 lines of code (excluding blank lines and comments).

        So that brings the actual TFile-specific code to 1200 lines. Among them,

        • Supporting streaming-style key-value appending and reading. This adds some adaptor classes to DataInputStream and DataOutputStream classes to perform setup and finalization. There are about 120 lines of code for that. Streaming not only helps performance by avoiding unnecessary buffering. It is also relatively easier to use than asking users to write adaptor classes in Writable interface.
        • Defense against misuse of API (enforcing state transitions, etc) and resource clean up takes about 100 lines of code.
        • Serialization and deserialization of various metadata of TFiles take about 250 lines of code.

        So the code that deals with core TFile read/write logic is about 600-800 lines.
        Owen O'Malley added a comment -

        Sorry, "# of columns" should have been "# of rows", so it is already included. Since the types are binary, there are clearly no columns.

        Amir Youssefi added a comment -

        +1 on Util.java break-up.

        Initially, we had parts scattered in separate files; then we refactored those, put the relevant parts into TFile and BCFile, and the ones which didn't fit well went into Util.java so we could deal with that mix later. It's time to break up Util.java into a few files and hopefully get rid of some baggage and find a sweet spot. I suspect it may never be as sweet as we want.

        Amir Youssefi added a comment - - edited

        Did initial benchmarks with 100B keys and 5KB values. We found some performance improvement opportunities and an LZO issue (HADOOP-4162). I will make a few changes to make a scenario usable for comparison with Sequence Files.

        Owen O'Malley added a comment -

        I believe the primary advantages over SequenceFile are:

        • Support for large values (> heap size)
        • 1 codec for compression/decompression instead of 4, for a lower memory footprint
        • TFile doesn't require the value to be entirely buffered in RAM before being written
        • no required scanning for sync/block boundaries
        • number of columns included in header
        • metadata at end so can include data summary
        • replaces map files and sequence files with a single format
        • can support seek to row #
        • can sample keys very efficiently

        I agree with Doug's comments on the Util class, which should be broken up into separate classes.

        I would either stick with Hadoop vints or use protocol buffer vints. If we use protocol buffer vints, they should be named differently. My preference would be to stick with the current Hadoop vints.

        Since a lot of the focus on TFile has been on making it performant, it would be nice to see a benchmark that uses a 100 byte Text as key and a 5k byte Text as value and benchmark SequenceFile relative to TFile. I'd suggest both with compression none and possibly lzo.

        Pete Wyckoff added a comment -

        Additionally, protocol buffer's decoding requires you to read byte after byte, while both WritableUtils and my VLong can detect the length of the whole encoding after the first byte.

        InputStreams are buffered, and isn't the case you are optimizing for mainly the 2-byte case, in which case this doesn't help? Even if you don't want to pull in serialization from Thrift or protocol buffers, wouldn't using RecordIO's libraries make more sense, per Doug's point?

        Pete Wyckoff added a comment -

        If Hadoop will now support at least TFiles and SequenceFiles, is there any thought around an interface that TFile, SequenceFile, and other self-describing files or files with user-defined metadata would implement?

        Doug Cutting added a comment -

        > The whole implementation only consists of the two writes()

        reset(), size(), close() and much of the ctor could be shared with ByteArrayOutputStream.

        > We will consider refactoring based on your suggestions later.

        We generally don't want to commit duplicated logic with a promise to clean it up later.

        Doug Cutting added a comment -

        Looking at the specification document, I see the major stated goals are (1) language neutrality; (2) extensibility, and (3) compatibility. I assume these are relative to SequenceFile.

        Language neutrality without an implementation in another language seems a risky claim. SequenceFile's only language dependence is in the naming of key and value classes, but implementations of these classes are not required to process a SequenceFile. SequenceFile, like TFile, lacks implementations in other languages, so I don't yet see a clear advantage there.

        (2) and (3) are very related. SequenceFile has proven extensible and back-compatible. Many features have been added without breaking back-compatibility. I don't see a qualitative advantage here to the TFile format.

        Perhaps you should include a section specifically addressing the advantages of TFile over SequenceFile, how they are achieved and how they can be measured.

        I suspect there may be other unstated goals in TFile. The case for TFile should be clearly made, as it adds a lot of code to Hadoop that must now be supported. If it has demonstrable advantages to SequenceFile and the case can be made that we will be able to retire SequenceFile after it is added, then TFile should go forward. Or if it is significantly simpler than SequenceFile while providing the same features, that might make the case that it will be easier to reimplement in other languages. But if it is equivalently complex and supports more-or-less the same features then it only adds baggage to the project.

        Hong Tang added a comment -

        > If you instead extended ByteArrayOutputStream, BoundedByteArrayOutputStream would only need to override the two write() methods, rather than implement all of the methods.
        [Hong] This class is optimized for TFile key appending. The whole implementation only consists of the two write() methods and needs nothing else from ByteArrayOutputStream.

        > Then it should be named something different. Why support negative values at all if you don't use them? Lucene also defines a VInt format that might be considered. Personally, I'd prefer Hadoop used a single VInt format, and I don't think defining yet another VInt and String format for TFile is wise.
        [Hong] We can certainly rename it to avoid confusion. We do need to represent negative integers, but they are either very small (such as -1 or -2) or very large (-1M).

        > Then CompressionCodecFactory should be extended, rather than duplicated.
        [Hong] This is a first-cut implementation. As it stands, it does not duplicate much code. We will consider refactoring based on your suggestions later.

        Doug Cutting added a comment -

        > There is little to be shared between the two except for the buffer and count definition.

        If you instead extended ByteArrayOutputStream, BoundedByteArrayOutputStream would only need to override the two write() methods, rather than implement all of the methods.
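        As an illustration of this suggestion, such a bounded stream could look roughly like the sketch below. This is a hypothetical example, not the patch's actual class; the constructor signature and the choice to throw IndexOutOfBoundsException are assumptions.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of Doug's suggestion: bound the stream by overriding
// only the two write() methods of ByteArrayOutputStream.
class BoundedByteArrayOutputStream extends ByteArrayOutputStream {
    private final int limit; // maximum number of bytes this stream will accept

    public BoundedByteArrayOutputStream(int capacity, int limit) {
        super(capacity);
        this.limit = limit;
    }

    @Override
    public synchronized void write(int b) {
        if (count + 1 > limit) {
            throw new IndexOutOfBoundsException("bound of " + limit + " bytes exceeded");
        }
        super.write(b);
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        if (count + len > limit) {
            throw new IndexOutOfBoundsException("bound of " + limit + " bytes exceeded");
        }
        super.write(b, off, len);
    }
}
```

        Everything else (reset(), size(), close(), the buffer management) is inherited from ByteArrayOutputStream, which is the point of the suggestion.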

        > The new VLong format enlarges the range of integers that can be encoded with 2-4 bytes [ ... ]

        Then it should be named something different. Why support negative values at all if you don't use them? Lucene also defines a VInt format that might be considered. Personally, I'd prefer Hadoop used a single VInt format, and I don't think defining yet another VInt and String format for TFile is wise.

        > we did not directly use CompressionCodecFactory is because CompressionCodecFactory.getCodec() expects a path

        Then CompressionCodecFactory should be extended, rather than duplicated.

        Hong Tang added a comment -

        The static inner classes for TFile are very tightly related to TFile.

        But various static classes in Utils are indeed for quite disparate purposes. They are currently lumped together as a sort of implementation detail. We can certainly regroup them into independent modules. Are there any opposing concerns about having too many small modules?

        Alejandro Abdelnur added a comment -

        More of a coding-style question: why are Utils and TFile full of static inner classes? Wouldn't it be better to have them as separate Java files in a subpackage?

        Hong Tang added a comment -

        BTW, in my previous post, n is the # of effective bits needed to represent the integer without the sign bit.

        Hong Tang added a comment -

        Just checked the code (CodedInputStream.java). The protocol buffer VInt (or VLong) is pretty interesting. They first transform the integer through ZigZag encoding, which essentially transforms the long into leading 000+[n]+[sign]. They then encode the n+1 bits using ceiling((n+1)/7) bytes (in little-endian style). So effectively, 1B can represent -64 to 63, 2B: -8K to 8K-1, 3B: -1M to 1M-1, 4B: -128M to 128M-1. Compared to my encoding scheme, I basically traded off some 2B encoding space for expanded 1B coverage. Additionally, protocol buffer's decoding requires you to read byte after byte, while both WritableUtils and my VLong can detect the length of the whole encoding after the first byte.
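        The ZigZag-plus-varint scheme described above can be sketched as follows. This is a minimal illustration of the encoding as described, with invented class and method names; it is not the protocol buffer library's own API.

```java
// Sketch of protocol-buffer-style varints: ZigZag-map the signed value,
// then emit 7 bits per byte, little-endian, high bit as a continuation flag.
class ZigZagVarint {
    // ZigZag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small-magnitude
    // negatives encode as small unsigned values.
    static long zigZagEncode(long v) { return (v << 1) ^ (v >> 63); }
    static long zigZagDecode(long v) { return (v >>> 1) ^ -(v & 1); }

    // Encode v into buf at off; returns the number of bytes written.
    static int writeVarint(long v, byte[] buf, int off) {
        v = zigZagEncode(v);
        int n = 0;
        while ((v & ~0x7FL) != 0) {
            buf[off + n++] = (byte) ((v & 0x7F) | 0x80); // continuation bit set
            v >>>= 7;
        }
        buf[off + n++] = (byte) v; // final byte, continuation bit clear
        return n;
    }

    static long readVarint(byte[] buf, int off) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[off++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return zigZagDecode(result);
    }
}
```

        With this scheme one byte covers -64 to 63, matching the ranges above, and small-magnitude negative numbers stay short.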

        Pete Wyckoff added a comment -

        They do have varints and are under the Apache license. I don't know how they compare compression-wise:
        http://code.google.com/apis/protocolbuffers/docs/encoding.html#varints

        I would of course prefer to recommend thrift, but they don't yet have variable sized ints.

        Hong Tang added a comment -

        > Why did you write your own variable-length integer encoding? Is this superior to protocol buffers' encoding? It seems like writing your own encoding/decoding format makes it harder for people to implement this in other languages, no?

        No, I have not looked at protocol buffer. Does it have a VInt/VLong implementation that can be used independently?

        As I commented previously, the only variable-length integer format I found is the one from WritableUtils, which has the odd property of only doubling the range of representation when we increase from 1B to 2B.

        I am open to adopting VInt/VLong encoding schemes that are relatively standardized and provide good coverage for small to medium integers (which is the primary purpose of having variable-length integers).
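        For reference, the WritableUtils format referred to above marks sign and length in the first byte, which is what allows detecting the full encoding length up front. The sketch below follows my reading of WritableUtils.decodeVIntSize()/isNegativeVInt(); treat the exact constants as an assumption rather than the authoritative definition.

```java
// Sketch of WritableUtils-style VLong length detection: values in [-112, 127]
// are stored directly in one byte; otherwise the first byte encodes the sign
// and the number of bytes that follow.
class VLongSketch {
    // Total encoded length (1-9 bytes), determined from the first byte alone.
    static int decodeVIntSize(byte firstByte) {
        if (firstByte >= -112) {
            return 1;                 // value stored directly in this byte
        } else if (firstByte < -120) {
            return -119 - firstByte;  // negative value: marker plus 1-8 data bytes
        }
        return -111 - firstByte;      // positive value: marker plus 1-8 data bytes
    }

    // Whether the encoded value is negative, again from the first byte alone.
    static boolean isNegativeVInt(byte firstByte) {
        return firstByte < -120 || (firstByte >= -112 && firstByte < 0);
    }
}
```

        This also illustrates the "odd property": a 2-byte encoding carries only one payload byte after the marker, so the range only grows to one byte's worth of values.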

        Pete Wyckoff added a comment -

        Why did you write your own variable-length integer encoding? Is this superior to protocol buffers' encoding? It seems like writing your own encoding/decoding format makes it harder for people to implement this in other languages, no?

        Hong Tang added a comment -
        • Is Util#memcmp different from WritableComparator#compareBytes()?
          [Hong] Looks like they are the same. My oversight.
        • Shouldn't BoundedByteArrayOutputStream extend ByteArrayOutputStream?
          [Hong] No, it should not. ByteArrayOutputStream does not bound the # of bytes written to the output stream. It automatically increases the size of the internal buffer, which is not what we want. There is little to be shared between the two except for the buffer and count definition.
        • the VLong code duplicates code in WritableUtils, no?
          [Hong] No. The new VLong format enlarges the range of integers that can be encoded with 2-4 bytes (at the expense of a reduced range of negative integers in the 1-byte case). The new format can represent integers from -32 to 127 with 1B, -5120 to 5119 with 2B, -1M to 1M-1 with 3B, and -128M to 128M-1 with 4B. Compare WritableUtils.VLong: 1B: -112 to 127, 2B: -256 to 255, 3B: -64K to 64K-1, 4B: -16M to 16M-1. This encoding scheme is more efficient for TFile, where we may have lots of small integers but never small negative integers.
        • readString/writeString duplicates Text methods.
          [Hong] Sort of, but because Text.readString and writeString use WritableUtils.VInt, if we were to use these methods directly we would have to document WritableUtils's VInt/VLong encoding as well, and it is confusing to define two VInt/VLong standards in one spec.
        • should the Compression enum be simply a new method on CompressionCodecFactory? If not, shouldn't it go in the io.compress package?
          [Hong] This part is a quick implementation of what should be a more extensible compression-algorithm management in the future. The reason we did not directly use CompressionCodecFactory is that CompressionCodecFactory.getCodec() expects a path and then uses the suffix portion of the path to find the codec based on some configuration. Directly using it would break the TFile spec's requirement for language/implementation neutrality. On the other hand, it may be nice to include a standard string-name-to-compression-codec definition in Hadoop.
        Hong Tang added a comment -

        The TFile specification.

        TFile is an immutable <Key, Value> storage format, meant to replace SequenceFile. Compared with SequenceFile, TFile provides block compression, sorted and unsorted keys, block indexing (for sorted TFiles only), scans (bulk reads), and application-level metadata. The data abstraction and storage representation are language-neutral, supporting implementations in various languages (although this specification only describes the API in Java).

        Doug Cutting added a comment -

        Util.java serves no purpose. It mostly contains a number of package-private classes that should be independent files. Also much of its code seems redundant:

        • Is Util#memcmp different from WritableComparator#compareBytes()?
        • Shouldn't BoundedByteArrayOutputStream extend ByteArrayOutputStream?
        • the VLong code duplicates code in WritableUtils, no?
        • readString/writeString duplicates Text methods.
        • should the Compression enum be simply a new method on CompressionCodecFactory? If not, shouldn't it go in the io.compress package?

        If these differ in subtle ways from the versions elsewhere in Hadoop then the justification for near-duplication of implementation should be made in a comment.

        I got tired of reading the patch after that. For a simpler implementation, it seems very complicated. The implementation of TFile is around 3.9k lines of Java. SequenceFile, which is thought to be too complicated, is only 3.1k lines.

        Pete Wyckoff added a comment -

        Hi Amir,

        could you also update the description of this JIRA which currently says only "SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs"; with the design goals of the new format?

        thanks, pete

        Amir Youssefi added a comment -

        Alejandro,

        Thanks for reading the above spec. It has changed and evolved dramatically. I will e-mail you a copy of the new one and remove the old ones.

        Hong will attach an updated version shortly.

        Alejandro Abdelnur added a comment -

        On the read API, the finalRowId() and getRowId(): I assume this is the row count and it is 0..N, with N being the number of rows (records) in the file, right?

        If that is the case, are you planning to provide an API to get the total number of rows in a directory when all files in the directory are TFiles?

        Amir Youssefi added a comment -

        Attaching the first preview of the TFile implementation to get feedback. In a few days we will have a beta release.

        Dev Team (alphabetically):

        Mahadev Konar, Hong Tang, Nathan Wang and Amir Youssefi

        Initially we used some parts contributed by Venkata Srikanth Kakani.

        Hong will attach updated spec shortly.

        Tested with revision: 693321

        stack added a comment -

        any progress on this issue? just wondering....

        Srikanth Kakani added a comment -

        At the beginning of each block. We need to record the first key that goes into the block for making an index entry. A raw write may preclude us from doing that, unless we wrap it around a "BoundedOutputStream" that captures the first n bytes written out.
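        Such a capturing wrapper might look like the sketch below; the class name and API are invented here based on the description (pass all writes through while copying the first n bytes for the block's index entry).

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical "BoundedOutputStream" as described: a pass-through stream that
// also keeps a copy of the first n bytes written (e.g. a block's first key).
class FirstBytesCapturingStream extends FilterOutputStream {
    private final byte[] captured;
    private int capturedLen = 0;

    public FirstBytesCapturingStream(OutputStream out, int n) {
        super(out);
        captured = new byte[n];
    }

    @Override
    public void write(int b) throws IOException {
        if (capturedLen < captured.length) {
            captured[capturedLen++] = (byte) b;
        }
        out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        int take = Math.min(len, captured.length - capturedLen);
        if (take > 0) {
            System.arraycopy(b, off, captured, capturedLen, take);
            capturedLen += take;
        }
        out.write(b, off, len); // everything still reaches the real stream
    }

    // The first n (or fewer) bytes seen, e.g. for building an index entry.
    public byte[] capturedBytes() {
        return Arrays.copyOf(captured, capturedLen);
    }
}
```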

        Owen O'Malley added a comment -

        Srikanth, I don't understand your concern. When the user calls append(long, long), the writer can decide whether to start a new block or not based on the lengths. So as the client calls write(byte[], int, int) on the output stream, it can be written directly to the file stream or codec's ByteBuffer. For codecs like lzo, the write may be broken into multiple calls to handle the required chunking.

        And yes, to make this efficient, you need to be able to get the serialized length of the objects.

        Srikanth Kakani added a comment -

        Owen,

        There would be one complication in exposing the append(long keyLength, long valueLength) that we did not discuss earlier, although it can be handled.

        If the key/value is at the beginning of a block, we need to copy to a byte array in the key.serialize(outputstream) call. We can do this by having a keyValueOutputStream(keybytes, valuebytes, outputstream) that captures the first keybytes of data written into a buffer. This needs to be done to generate an index, but it starts getting ugly.

        I would also suggest that ObjectFile should extend TFile; it can do all this in a neater fashion without exposing append(keyLength, valueLength).

        Additionally, to make any of this feasible (you mentioned this earlier; I just want to record it), serializers should also have getSerializedLength().

        Owen O'Malley added a comment -

        and then we add a level of abstraction on top that handles objects:

        public class ObjectFile<KEY_CLASS, VALUE_CLASS> {
          public static class Reader {
            public boolean next(KEY_CLASS key, VALUE_CLASS value) throws IOException;
          }
          ...
        }

        Owen O'Malley added a comment -

        I would suggest splitting the API into two parts a cooked part and raw part.

        public class TFile {
          static public class Writer {
             public void append(byte[] keyData, int keyOffset, int keyLength,
                                  byte[] valueData, int valueOffset, int valueLength) throws IOException;
            /**
             * Get an output stream to write the key and value to. Only valid until the next call to append.
             * The number of bytes written to the stream must equal keyLength + valueLength.
             */
            public OutputStream append(long keyLength, long valueLength) throws IOException;
          }
          public static Writer create(Path p, RawComparator comparator) throws IOException;
          static public class Reader {
            /**
             * Get an input stream to read the key from. available() will return the number of bytes left.
             * Only valid until the next next() call on this stream.
             */
            public InputStream nextKeyInputStream() throws IOException;
            /**
             * Get an input stream to read the value from. available() will return the number of bytes left.
             * Only valid until the next next() call.
             */
            public InputStream nextValueInputStream() throws IOException;
          }
          public static Reader open(Path p, RawComparator comparator) throws IOException;
        }
        
        Raymie Stata added a comment -

        In the imeta section, we need to be careful about using class names to represent the formatting of data in a TFile. In particular, we should support standardized data and compression formats whose existence is independent of any implementing Java classes. (To Pete's point, this would support cross-language accessibility of the data.)

        One could use class names for this purpose. We could claim that the name of some Java codec class is the name of the data format. The problem is that class names tend to change (and multiple implementations of the same codec tend to get written), so this is not a good way to go. So perhaps the strings in the imeta section should be names of formats, and the codec classes should be found through some registry.

        On the other hand, for "local" Hadoop installations that want to make up their own data formats, being able to wire the class name directly into the file and not having to maintain a registry might be convenient. I'm not convinced: I suspect most people will use standard formats we can bake into a standard registry. Still, if we think we need to support both requirements ("universal" names for standardized formats, and class names for local ones), then perhaps a convention should be defined: a string that starts with the character "#" is the name of a standard format, the codec for which will be found in some registry; other strings are treated as Java class names and are used to find the codec classes directly.
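The two-namespace convention sketched above could look something like this (a hypothetical sketch; `CodecResolver`, `register`, and `resolve` are illustrative names, not real Hadoop APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class CodecResolver {
    // Registry mapping standard format names to codec class names.
    private final Map<String, String> registry = new HashMap<>();

    public void register(String formatName, String codecClassName) {
        registry.put(formatName, codecClassName);
    }

    /**
     * Resolve an imeta format string to a codec class name:
     * strings starting with '#' are standard format names looked up
     * in the registry; anything else is treated as a Java class name.
     */
    public String resolve(String imetaString) {
        if (imetaString.startsWith("#")) {
            String codec = registry.get(imetaString.substring(1));
            if (codec == null) {
                throw new IllegalArgumentException("Unknown format: " + imetaString);
            }
            return codec;
        }
        return imetaString; // local convention: direct class name
    }
}
```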

        Pete Wyckoff added a comment -

        It would be really nice if TFiles (and SequenceFiles) were accessible by non-Java programs - e.g., pipes or Python over FUSE. If the header/metadata were versioned Jute or Thrift records, it would make it much easier to implement bindings for other languages - especially Thrift, since it's in like 50 languages.

        pete

        Hong Tang added a comment -

        Sorry I am late to the party. (I just joined the Grid group on 05/12).

        A few questions/suggestions:

        • Do we assume a key-value pair always fits in one block?
        • It seems that each TFile would contain 1 or 2 columns (1 for keyless columns, and 2 for key-ed columns). May we have a need to put more than 2 columns in one TFile?
        • It looks like TFile does not concern itself with support for sub-columns (as in BigTable's column families). One downside is that the application must read a whole column-family value and deserialize the internal sub-column key-value pairs. This may be okay if we assume a column-family cell fits in a block, but may not be feasible if a column-family cell contains thousands or more sub-columns and spans multiple blocks.
        • What is vint? Is it some sort of variable-length integer encoding? AFAIK, byte-oriented decoding is very slow (you may end up using only a small fraction of the available memory bandwidth). An alternative scheme could be to use special characters to separate the keys and values, and escape any occurrence of those special characters in keys and values when we write the file. If the key and value sizes are small, this may give us better performance. [Needs verification with experimentation.]
        • Should we put a size limit on the TFile (Owen, is your 5GB figure an average size or a worst case)? It could help simplify the design by letting us assume the index part of a TFile fully fits in memory. Other than DFS having issues with too many files, what might be the reasons for not restricting the size of a TFile? Do we simply offload such decisions to the application side (i.e., they must pick the right partition factor)?
        • I think we need to support embedding uninterpreted application-specific (application here means the direct user of TFile) metadata in a TFile. Without such support, we may end up having to save such information in a separate file that goes in parallel with each TFile.
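To the vint question above: a vint is typically a base-128 variable-length integer, seven payload bits per byte with the high bit as a continuation flag. A minimal sketch for non-negative ints (a generic varint, not necessarily the exact encoding Hadoop's WritableUtils uses):

```java
import java.io.ByteArrayOutputStream;

public class VIntDemo {
    // Write a non-negative int in base-128: 7 bits per byte,
    // high bit set on every byte except the final one.
    public static byte[] writeVInt(int v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    // Decode: accumulate 7-bit groups until a byte with a clear high bit.
    public static int readVInt(byte[] buf) {
        int v = 0, shift = 0, i = 0;
        int b;
        do {
            b = buf[i++] & 0xFF;
            v |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }
}
```

This is what makes a 100-byte key length cost only a single index byte, per Owen's arithmetic below.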
        Doug Cutting added a comment -

        > You might consider implementing TFile as an Interface with an easily subclassable or reuseable implementation?

        TFile should not be an interface, but perhaps we should have an abstract class for such files. Then we might have a SequenceFile-based implementation, so that it would be easy to have code that can, e.g., process both SequenceFiles and TFiles, with a simple way to select one or the other.

        Owen O'Malley added a comment -

        Note that for map/reduce you have two options: either getInputSplits could read all of the row indexes and plan the splits exactly, or you could take the current approach and cut blindly and have the maps adjust the boundaries. In either case, you want it to be cheap to read the row indexes, which is much harder if the keys are intermixed.

        Owen O'Malley added a comment -

        That is my fault. Let's say that you have:

        block size = 1 mb
        key size = 100 bytes
        file size = 5 gb
        

        that means that you have:

        key length vint = 1 byte
        row idx vint = 4 bytes
        block offset vint = 4 bytes
        blocks = 5000
        key index = 500kb
        row index = 40kb
        

        which makes it very cheap to read the row index if you don't need the key index (ie. a single 64k read of the tail of the file will likely get the whole thing). The whole key index is not huge, but it is much bigger than the row index and may be compressed and therefore more expensive to read.
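The arithmetic above can be sketched as a few helper functions (illustrative only; the names are made up, and the key index estimate ignores the 1-byte key length vints, matching the rounded figures in the comment):

```java
public class IndexSizeEstimate {
    // Number of blocks in the file (decimal units, as in the comment:
    // 5 gb / 1 mb = 5000).
    static long blocks(long fileSizeBytes, long blockSizeBytes) {
        return fileSizeBytes / blockSizeBytes;
    }

    // Key index: one key per block.
    static long keyIndexBytes(long blocks, int keySizeBytes) {
        return blocks * keySizeBytes;
    }

    // Row index: one (row idx vint, block offset vint) pair per block.
    static long rowIndexBytes(long blocks, int rowIdxVintBytes, int blockOffsetVintBytes) {
        return blocks * (rowIdxVintBytes + blockOffsetVintBytes);
    }
}
```

With the comment's figures (5 gb file, 1 mb blocks, 100-byte keys, 4-byte vints) this gives 5000 blocks, a 500 kb key index, and a 40 kb row index.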

        stack added a comment -

        Pardon me if I'm being thick, but it is still not clear to me how rows and keys relate and why we need two indices, especially if "Number of key entries equals number of RO entries, and they have 1-1 correspondence". Is a row made of keys? You might extend the introduction to include an explication of how they relate.

        The doc still says metadata keys and values are Strings. I thought above you allowed that values could be byte arrays?

        You have setMeta and then setMetaBytes? Why not just name the second method the same as the first?

        In your implementation, please keep in mind that others will most likely want to extend it. You might consider implementing TFile as an interface with an easily subclassable or reusable implementation?

        Good stuff

        Srikanth Kakani added a comment -

        We are starting to work on the TFile implementation, getting most of the stuff in; we will use byte compare for now.

        Srikanth Kakani added a comment -

        Changed the API.
        Restructured the blockidx: split it into keyidx and roidx.
        Added offsets to each.
        Fixed typos.
        Restructured the datablocks: entry, keylength, key, value.

        Hopefully this is good.

        Doug Cutting added a comment -

        We should add comparators to the Serialization framework (HADOOP-3380) before we implement TFile.

        Doug Cutting added a comment -

        > Map file uses: WritableComparator.get(keyClass) to get the comparator.

        TFile should be independent of Writable, so that it supports other serialization frameworks, like Thrift. Our generic Serialization framework does not yet include comparators, but it should. We should add a method:

        RawComparator Serialization#getComparator();

        TFile should use 'new SerializationFactory(conf).getSerialization(keyClass)' to get the serialization, and then get the serializer, deserializer and comparator from that.

        Srikanth Kakani added a comment -

        Map file uses: WritableComparator.get(keyClass) to get the comparator. I will go with that for getting the comparator if the key is not byte comparable.

        Srikanth Kakani added a comment -

        It seems that the append API and the next API for the sequence file are not complementary.

        append ( Object key, Object value)

        next (WritableComparable key, Writable Value)

        Are we planning to change it? Or should I have the same API as SequenceFile for read?

        Owen O'Malley added a comment -

        appendRaw should take:

        void appendRaw(byte[] key, int keyOffset, int keyLength,
                         byte[] value, int valueOffset, int valueLength);
        

        so that you can pass in subsets of byte arrays.

        The InputStream appender is a raw appender, so it breaks the abstraction to pass a non-raw key with a raw value.

        If we support user-specified metadata, it should be settable until the close method is called, since you might want to include summary information about the contents of the file.

        Doug Cutting added a comment -

        > The only problem I see is when do we allow the setMeta functions [ ...]

        Perhaps you could instead make it so that a value can only be set once? Then, if the ctor sets things like the key and value classes, application code cannot.

        Srikanth Kakani added a comment - - edited

        > Does RO stand for something, or is it short for "row"?
        RO = Rowid Offset

        >The RO entry values can be more compactly represented as differences from the prior entry. Is this intended? If so, we should state this.
        I did not intend it that way, but we should do it: since the RO index is always loaded into memory, we can compute the entries while loading the index.
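Delta-encoding the RO entries and reconstructing the absolute offsets at load time could look like this (a hypothetical sketch; `DeltaIndex` is an illustrative name, and the real format would presumably store the deltas as vints):

```java
public class DeltaIndex {
    // Encode absolute block offsets as differences from the prior entry.
    public static long[] toDeltas(long[] offsets) {
        long[] d = new long[offsets.length];
        long prev = 0;
        for (int i = 0; i < offsets.length; i++) {
            d[i] = offsets[i] - prev;
            prev = offsets[i];
        }
        return d;
    }

    // Reconstruct absolute offsets while loading the index into memory.
    public static long[] fromDeltas(long[] deltas) {
        long[] abs = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            abs[i] = running;
        }
        return abs;
    }
}
```

Since block offsets are monotonically increasing, the deltas stay small (roughly one compressed block size apart), which is what makes them compact as vints.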

        > In data blocks, we might use something like <entryLength><keyLength><key><value>. This would permit one to skip entire entries more quickly. The valueLength can be computed as entryLength-keyLength. Do folks think this is worthwhile?

        I think it should not matter much, but since this is the third time I have heard this request, it seems like a good thing to do.

        > Owen O'Malley - 12/May/08 11:44 AM

        > getClosest should specify that it means closest after the given key
        Will update the document

        > Doug Cutting - 12/May/08 10:53 AM
        > row ids should be longs
        There is an inconsistency in the Read API. Will fix that.

        > does the compression of the key index follow tfile.compressionCodec?
        I believe so, I will explicitly state that in the document.

        > should the ioffset and moffset (or my key, row, and meta offsets) be vints?
        We will need to store the offsets for these somewhere, so probably not.

        > I think the append method that takes an input stream should be:

        void appendRaw(int keyLength, InputStream key, int valueLength, InputStream value) throws IOException;
        

        Keys are supposed to be memory-loadable, with a maximum length of 64K. I am not sure if this interface will be / should be used. We may want to keep that for later.

        > Most of the methods should have "throws IOException"
        Will update the spec with this.

        > It is useful to be able to get the key/value class names without the class. I'd replace the getKeyClass and getValueClass with string equivalents:
        I think the getMeta should take care of it.

        > stack - 12/May/08 12:15 PM
        > In the description of the blockidx, says 'Spare index of keys into the datablocks'. Whats this mean? The key that is at the start of each block will be in the block index? And only this? Or will index have entries keys from the middle of blocks in it?
        It currently means that the key at the start of each block will be in the block index. The minBlock size should be adjusted to handle the "sparseness".

        > In 'Goals', says 'support all kinds of columns'? Do you mean all column data types? Also says 'support seek to key and seek to row'. What is the difference between a key and a row?
        Kinds: Keyless columns, keyed columns, valueless columns, fixed size, and variable size. Maybe I should put it up there.

        > Its not plain that user can add their own metadata to imeta. You might explicitly state this.
        Will do.

        > In the Writer API, you state that a null key class is for a keyless column. Whats a null value class imply?
        A null value implies a valueless column; a keyless column has some implication on the keyidx, viz. it compresses excellently. I thought it was worth mentioning a keyless column.

        > Doug Cutting - 12/May/08 12:37 PM
        > I'd vote for supporting both String and byte[] as metadata values, perhaps with methods like:

        Writer#setMeta(String key, String value);
        Writer#setMetaBytes(String key, byte[] value);
        
        String Reader#getMeta(String key);
        byte[] Reader#getMetaBytes(String key);
        

        I think it makes sense to add these (as opposed to having them in the constructor). The getMeta functions are a no-brainer.
        The only problem I see is when we allow the setMeta functions, since setting things like compression and KeyClass names may be problematic.
        So maybe we allow the TFile.* keys to be set only using the constructor, and the rest of the user variables to be set via setMeta(?)

        Doug Cutting added a comment -

        > So the writer's constructor should have Serializer<K> and Serializer<V> parameters [ ... ]

        On second thought, we should just use a SerializationFactory to construct serializers and deserializers for the key and value classes.

        > It is useful to be able to get the key/value class names without the class.

        We should have constants defining metadata keys. So getting the key class should look something like:

        Class.forName(tfile.getMeta(TFile.KEY_CLASS));

        With such constants, we may not need methods like getKeyClass() at all...

        > Does the metadata value have to be a String?

        I'd vote for supporting both String and byte[] as metadata values, perhaps with methods like:

        Writer#setMeta(String key, String value);
        Writer#setMeta(String key, byte[] value);

        String Reader#getMeta(String key);
        byte[] Reader#getMetaBytes(String key);

        stack added a comment -

        This proposal looks great. Here's a couple of comments:

        In 'Goals', says 'support all kinds of columns'? Do you mean all column data types? Also says 'support seek to key and seek to row'. What is the difference between a key and a row?

        In the description of the blockidx, it says 'Spare index of keys into the datablocks'. What does this mean? The key that is at the start of each block will be in the block index? And only this? Or will the index have entries for keys from the middle of blocks in it?

        Does the metadata value have to be a String? It looks like it doesn't have to be – that I can specify my own keyClass and valClass. For example, I would like to be able to write a bloom filter into the metadata.

        It's not plain that users can add their own metadata to imeta. You might explicitly state this.

        Section 3.2 where you describe two different kinds of index is a little confusing (I'm not clear on RO vs. Key as per above).

        In the Writer API, you state that a null key class is for a keyless column. What does a null value class imply?

        Is the Writer API missing metadata writing? Same for reading.

        Reading talks about rowids but writer does not. Is this intentional?

        For the reader API, expose methods for getting the key only, without reading the value?

        Owen O'Malley added a comment -
        • getClosest should specify that it means closest after the given key
        • row ids should be longs
        • does the compression of the key index follow tfile.compressionCodec?
        • should the ioffset and moffset (or my key, row, and meta offsets) be vints?
        • I think the append method that takes an input stream should be:
          void appendRaw(int keyLength, InputStream key, int valueLength, InputStream value) throws IOException;
          
        • Most of the methods should have "throws IOException"
        • It is useful to be able to get the key/value class names without the class. I'd replace the getKeyClass and getValueClass with string equivalents:
          String getKeyClassName();
          String getValueClassName();
          
        • I assume that seekToKey returns false if the key does not exist, but the file pointer may have moved.
        Owen O'Malley added a comment -

        It would be good to have a way to find the offset of the start of the RO entries. I'd propose that we have:

          kindexoffset - the start of the key index
          rindexoffset - the start of the row index
          moffset - the start of the meta data
        
        Doug Cutting added a comment -

        Latest draft looks much better!

        • Does RO stand for something, or is it short for "row"?
        • The RO entry values can be more compactly represented as differences from the prior entry. Is this intended? If so, we should state this.
        • In data blocks, we might use something like <entryLength><keyLength><key><value>. This would permit one to skip entire entries more quickly. The valueLength can be computed as entryLength-keyLength. Do folks think this is worthwhile?

        > We should not depend on keys/values being Writables in TFile.

        Good point. So the writer's constructor should have Serializer<K> and Serializer<V> parameters, and the reader Deserializer<K> and Deserializer<V> parameters. This will permit us to, e.g., store Thrift or other objects in a TFile.
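The delta representation of the RO entries suggested above can be sketched as follows. This is a minimal illustration, not code from the patch: the DeltaOffsets name and the plain long[] arrays are invented, and in the real format the small deltas would then be vint-encoded to realize the space savings.

```java
// Sketch of delta-encoding index entry values: each entry stores the
// difference from the prior entry, which keeps the numbers small (and
// therefore short once vint-encoded). Names here are illustrative.
public class DeltaOffsets {
    public static long[] encode(long[] offsets) {
        long[] deltas = new long[offsets.length];
        long prev = 0;
        for (int i = 0; i < offsets.length; i++) {
            deltas[i] = offsets[i] - prev; // difference from prior entry
            prev = offsets[i];
        }
        return deltas;
    }

    public static long[] decode(long[] deltas) {
        long[] offsets = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];             // running sum restores offsets
            offsets[i] = prev;
        }
        return offsets;
    }
}
```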

        Enis Soztutar added a comment -

        We should not depend on keys/values being Writables in TFile; the API should be:

        
        public class TFile<K extends Comparable<K>, V> {
          public static class Writer<K extends Comparable<K>, V> {
             void append(K key, V value);
             Class<K> getKeyClass();
             ...
          }
          public static class Reader<K extends Comparable<K>, V> {
            V get(K key, V val);
            boolean next(K key, V val);
            ...
          }
        }
        
        
        Doug Cutting added a comment -

        I think for at least the first version we should assume that the index fits in memory.

        As a subsequent enhancement, we might permit one to limit the size of the index. Then, when writing, if the index gets too big, we can downsample at that point, discarding every-other entry or somesuch, to keep the index within a certain bound. Similarly, when the file is opened, if the index is larger than available memory, we can downsample further then.
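The downsampling enhancement described above can be sketched like this. It is only an illustration of the idea, assuming an in-memory list of (first key, offset) entries with an invented SparseIndex/IndexEntry naming and a configurable bound; every block remains reachable by scanning forward from the preceding retained entry.

```java
import java.util.*;

// Sketch of index downsampling: when the in-memory block index grows past
// a bound, discard every other entry, halving memory at the cost of a
// sparser index. Class names and the maxEntries bound are illustrative.
public class SparseIndex {
    static class IndexEntry {
        final byte[] firstKey;
        final long offset;
        IndexEntry(byte[] firstKey, long offset) {
            this.firstKey = firstKey;
            this.offset = offset;
        }
    }

    private final int maxEntries;
    private final List<IndexEntry> entries = new ArrayList<>();

    SparseIndex(int maxEntries) { this.maxEntries = maxEntries; }

    void add(byte[] firstKey, long offset) {
        entries.add(new IndexEntry(firstKey, offset));
        if (entries.size() > maxEntries) {
            // Keep entries 0, 2, 4, ... — blocks whose entries are dropped
            // are still reachable by scanning from the previous entry.
            for (int i = entries.size() - 1; i >= 0; i--) {
                if (i % 2 == 1) entries.remove(i);
            }
        }
    }

    int size() { return entries.size(); }
}
```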

        Owen O'Malley added a comment -

        Hmm, I think in general that would strongly suggest that you have your block size too small. :)

        However, you could write the key index to a side file while you were building it and then copy it into place. The offset, row id table should never be too big for ram.

        Srikanth Kakani added a comment -

        Owen, one concern separating the keys from offsets:

        In the rare case that the keys do not fit in memory or reach a limit:

        We might have to write out both the keys and offsets at that point and start another index "block".

        Srikanth Kakani added a comment -

        Owen suggested the following:

        magic: 4 bytes
        vints: for all offsets and lengths
        separate index keys from offsets+ rowids, index keys are compressed.
        store extra virtual block row (for final key and final rowid).
        flag for memcomparable keys in imeta.

        Adding doug's request as well,

        I am updating the document and will post it soon, will have a document by the eod.

        Doug Cutting added a comment -
        • Magic should be at start of file too.
        • Why use 8-bytes for index length, rowid & offset – could be VInts of deltas no?
        • Shouldn't metadata be an extensible list of <String,String> pairs? This could be encoded as <count>(<string><string>)^count, where <count> is a vint, and strings are vint-prefixed binary data. Some metadata keys might be reserved, like "hasKeys", "hasValues", "keyClass", "valueClass", "entryCount", etc.
        • Need support for total entries.
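The <count>(<string><string>)^count layout proposed above can be sketched as follows. For brevity this uses writeInt/writeUTF rather than the vint count and vint-prefixed binary strings described; the structure is the same, and the MetaBlock name is illustrative.

```java
import java.io.*;
import java.util.*;

// Sketch of the proposed metadata layout: <count>(<key><value>)^count.
// writeUTF's own length-prefixed strings stand in for vint-prefixed
// binary data here; only the overall structure is the point.
public class MetaBlock {
    public static void write(DataOutput out, Map<String, String> meta)
            throws IOException {
        out.writeInt(meta.size());                 // <count>
        for (Map.Entry<String, String> e : meta.entrySet()) {
            out.writeUTF(e.getKey());              // <string>
            out.writeUTF(e.getValue());            // <string>
        }
    }

    public static Map<String, String> read(DataInput in) throws IOException {
        int count = in.readInt();
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            meta.put(in.readUTF(), in.readUTF());
        }
        return meta;
    }
}
```

Reserved keys such as "entryCount" or "keyClass" would simply be well-known entries in this map.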
        Srikanth Kakani added a comment -

        Specification of a TFile. Binary block compressed file.

        Owen O'Malley added a comment -

        I like the record count being in the tail. If we are saving the first key of each block, we should probably also save the last key of the file too.

        I would think this is just a new class that people migrate to. For most applications, I would expect it to be an easy transition.

        Srikanth, it is useful to have the magic bytes at the front so that commands like "file" on unix can work. It is just a constant 4 bytes, it doesn't really complicate the format at all.

        You absolutely do want the record length in the data, because deserialization can be slow.

        I absolutely don't think there should be any special code for fixed width types. The cost of the variable width types is really small with the vint encoding.
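To make the cost concrete, here is a minimal variable-length integer codec in the spirit of the vints discussed in this thread. This is the common LEB128-style encoding (7 payload bits per byte, high bit as a continuation flag); Hadoop's actual WritableUtils vint format differs in its details, so treat this purely as an illustration.

```java
import java.io.*;

// Minimal varint codec: small values take one byte, larger values grow a
// byte per 7 bits. Illustrative only; not Hadoop's WritableUtils format.
public class VarInt {
    public static void write(OutputStream out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // high bit: more follows
            v >>>= 7;
        }
        out.write((int) v);                       // final byte, high bit clear
    }

    public static long read(InputStream in) throws IOException {
        long v = 0;
        int shift = 0, b;
        do {
            b = in.read();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }
}
```

A fixed-width type such as a 4-byte int costs at most one extra byte under this scheme when its value is large, and usually saves bytes, which is the sense in which the variable-width cost is small.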

        Srikanth Kakani added a comment -

        > Doug Cutting - 28/Apr/08 09:27 AM
        > Do you expect to make this a drop-in replacement for SequenceFile and MapFile, or rather something that we expect code to migrate to? I'm guessing the latter.

        I think it will be the latter as well.

        > So we should have a magic number there too, but, if it doesn't harm things, I'd prefer leaving an (frequently unread) magic number in the header too.
        When will the header magic be read? If it is always read, wouldn't that result in two seeks anyway? If not, why do we have to complicate the format?

        > Jim Kellerman - 26/Apr/08 10:19 AM >
        > Dropping the record length would seriously slow down random reads unless the index is 'complete', i.e., every key/offset is represented. If the index is sparse like MapFile's, you would only get an approximate location of the desired record and then have to do a lot of work to seek forward to the desired one.

        Each block in this file would be memory-loadable, so it doesn't really matter (much) whether we store the key length or not, as the total operation is bounded by seek and read. Even going through the index with variable-sized keys is linear. Maybe bounding the index to one read block makes sense as well.

        The only case this can change is if we have some metadata about the key being fixed size: in which case all the seek-to-keys are O(1)

        One more thought is an index purely based on record ids (fixed size encoded) that may keep the index skippable/seekable.

        Andrzej Bialecki added a comment -

        Writers know exactly how many records have been written, so they can write the record count easily. I suggest using MapWritable for the metadata section, so that we can store other data, not just Strings. Record counts are just an example of information that becomes available only when closing a file - other info may include timestamps, or min / max values, etc.

        Doug Cutting added a comment -

        > Owen: I can't see any cases where we need the key length

        I think we use it when sorting. We read raw keys into memory and call raw comparators on them, passing the key length to the raw comparator.

        The sorting code pre-dates MapReduce, though, from back when we manually partitioned, shuffled, sorted, merged, etc. things in Nutch. Perhaps we no longer need to sort and merge files directly, since folks can use MapReduce for that. Does any application code still use SequenceFile#Sorter?

        > Owen: my desire is to make applications not read the header and only read the tail

        So we should have a magic number there too, but, if it doesn't harm things, I'd prefer leaving an (frequently unread) magic number in the header too.

        Do you expect to make this a drop-in replacement for SequenceFile and MapFile, or rather something that we expect code to migrate to? I'm guessing the latter.

        > Alejandro: Our use case is specifically the example he mentions, the record count.

        I think we can include record-count as a base feature of this file format. But permitting a Map<String,String> of other metadata might also be good.

        Alejandro Abdelnur added a comment -

        I would like to see something like Andrzej is mentioning, the metadata section at the end.

        Our use case is specifically the example he mentions, the record count.

        Currently we keep the count in a shadow file, i.e. _NAME.counter, using a custom SequenceOutputFormat. The problem with this approach is that we are doubling the number of files in HDFS for SequenceFiles.

        Runping Qi added a comment -

        Will the new code be able to read the old format?

        Owen O'Malley added a comment -

        I guess since I don't expect to ever read the front of these, we should put the magic value at the back too.

        magic (4 bytes)
        block 0 .. n
        index
        tail
        magic
        
        Owen O'Malley added a comment -

        I agree that we shouldn't drop the record length. Having to deserialize the values when seeking would be unacceptable, since deserialization can be quite slow. I was asking about dropping the key length. I can't see any cases where we need the key length and I was asking if anyone else could see one...

        Jim Kellerman added a comment -

        > Owen O'Malley - 26/Apr/08 09:46 AM
        > One other possibility would be to represent key/values as:

        record length (vint)
        key
        value
        

        > That would work for all writables, because they all record their sizes internally. In theory you could drop
        > the record length, but that would mean that you would have to deserialize all of the keys and values as
        > you skip over records.
        >
        > Thoughts?

        Dropping the record length would seriously slow down random reads unless the index is 'complete', i.e., every key/offset is represented. If the index is sparse like MapFile's, you would only get an approximate location of the desired record and then have to do a lot of work to seek forward to the desired one.
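The lookup pattern against a sparse index can be sketched as follows: the index gives only the block-start position, and the reader scans forward record by record, using the length prefix to skip values it does not need. The layout assumed here (<recordLength:int><keyLength:int><key><value>, plain ints instead of vints) and the class names are illustrative, not from any patch.

```java
import java.io.*;
import java.util.Arrays;

// Sketch of a forward scan from a sparse-index position. The record
// length lets us skip each non-matching value without deserializing it.
public class SparseSeek {
    public static void writeRecord(DataOutput out, byte[] key, byte[] value)
            throws IOException {
        out.writeInt(4 + key.length + value.length); // recordLength
        out.writeInt(key.length);                    // keyLength
        out.write(key);
        out.write(value);
    }

    // Scan from 'in' (positioned at a block start) until 'wanted' is found;
    // returns the value bytes, or null if we run off the end.
    public static byte[] seekForward(DataInputStream in, byte[] wanted)
            throws IOException {
        while (in.available() > 0) {
            int recordLength = in.readInt();
            int keyLength = in.readInt();
            byte[] key = new byte[keyLength];
            in.readFully(key);
            int valueLength = recordLength - 4 - keyLength;
            if (Arrays.equals(key, wanted)) {
                byte[] value = new byte[valueLength];
                in.readFully(value);
                return value;
            }
            in.skipBytes(valueLength); // not our key: skip the value cheaply
        }
        return null;
    }
}
```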

        Owen O'Malley added a comment -

        One other possibility would be to represent key/values as:

        record length (vint)
        key
        value
        

        That would work for all writables, because they all record their sizes internally. In theory you could drop the record length, but that would mean that you would have to deserialize all of the keys and values as you skip over records.

        Thoughts?

        Owen O'Malley added a comment -

        > Is this a format just for compressed sequence files, or for all sequence files?

        The issue is most critical for compressed sequence files, but it would make sense to make the compression optional. I would not support value compression.

        > Is this intended as a replacement for MapFile too?

        yes

        > I think some kind of a magic number header at the start of files is good to have. That would also permit back-compatibility with SequenceFile in this case.

        that probably makes sense, although my desire is to make applications not read the header and only read the tail, which has the meta data including the index.

        > In the index, what is "first record idx" - is that the key or the ordinal position of the first entry?

        It is the row id of the first row of that block. So that you can support seek to a given row number, which is useful if you have a bunch of files that correspond to different columns in a big table. You would make splits that look like rows 1000-2000 and you can map that across multiple files.

        magic (4 bytes)
        block 0 .. n
        index
        tail
        

        and the tail looks like:

        key class name
        value class name
        index kind (none, keys, keys+bloom filter)
        compression kind (none, default, lzo)
        format version
        offset of tail
        offset of index
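The reason a reader can get by with only the tail can be sketched as follows: the writer streams blocks forward, appends the tail, then writes the tail's offset as a fixed-width long followed by the trailing magic, so a reader seeks to (length - 12), verifies the magic, and jumps straight to the tail. Everything here (the TailReader name, the placeholder MAGIC value, byte arrays standing in for a seekable file) is illustrative, not the actual TFile trailer.

```java
import java.io.*;

// Sketch of a tail-at-the-end layout: leading magic, data blocks, tail,
// fixed-width tail offset, trailing magic. Byte arrays stand in for files.
public class TailReader {
    static final int MAGIC = 0x54464C45; // invented placeholder value

    public static byte[] writeFile(byte[] blocks, byte[] tail) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(MAGIC);           // leading magic, so `file` can work
        out.write(blocks);
        long tailOffset = 4 + blocks.length;
        out.write(tail);
        out.writeLong(tailOffset);     // fixed-width: findable from EOF
        out.writeInt(MAGIC);           // trailing magic
        return bos.toByteArray();
    }

    public static byte[] readTail(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        in.skipBytes(file.length - 12);         // "seek" to the trailer
        long tailOffset = in.readLong();
        if (in.readInt() != MAGIC) throw new IOException("bad magic");
        int tailLength = file.length - 12 - (int) tailOffset;
        byte[] tail = new byte[tailLength];
        System.arraycopy(file, (int) tailOffset, tail, 0, tailLength);
        return tail;
    }
}
```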
        
        Owen O'Malley added a comment -

        I would propose that we move to:

        block 1
        block 2
        ...
        index
        tail
        

        where block is compressed, and contain:

        key/value1: key len (vint), key, value len (vint), value
        key/value 2
        ...
        

        The index would be compressed and contain:

        block 1: offset, first record idx
        block 2: offset, first record idx
        block 3: offset, first record idx:
        ...
        

        and the tail would look like:

        key class name
        value class name
        index kind (none, keys, keys+bloom filter)
        format version
        offset of tail
        offset of index
        

        Then extensions of this format would put more indexes between the last block and the start of the index. So for example, the first key of each block:

        first key of block 1: key len (vint), key
        first key of block 2
        ...
        offset of start key index
        

        Another reasonable extension of the key index would be a bloom filter of the keys:

        bloom filter serialization
        offset of bloom filter index start
        
        Andrzej Bialecki added a comment -

        Currently SequenceFile-s may contain Metadata. In my applications I never found use for this, because this Metadata needs to be written out right after the file is opened, and cannot be updated later. A much better model for my needs would be to write the Metadata record right before closing the file (in the "tail" section above), so that it can be updated until the end, with e.g. record count.

        Doug Cutting added a comment -

        [Meta comment: I wish folks would just describe problems in an issue's description, and leave solutions to the comments. Descriptions are appended to every email message. Also, solutions change as a result of discussion, while the problem should not.]

        Is this a format just for compressed sequence files, or for all sequence files?

        Is this intended as a replacement for MapFile too?

        I think some kind of a magic number header at the start of files is good to have. That would also permit back-compatibility with SequenceFile in this case.

        In the index, what is "first record idx" – is that the key or the ordinal position of the first entry?


          People

          • Assignee:
            Hong Tang
            Reporter:
            Owen O'Malley
          • Votes:
            0
            Watchers:
            46
