Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-2886

Add Scan TimeRange to HBaseStorage

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 0.9.0, 0.9.1, 0.9.2
    • 0.11
    • None
    • Added the ability to set HBase Scan's maxTimestamp, minTimestamp and timestamp in HBaseStorage. I also added unit tests.
    • HBaseStorage timeRange timeStamp

    Description

      I have a client that wants to use pig. They are using MR now. They can't use PIG right now because they only want to fetch the last day's worth of data in HBase. A filter with time range would require reading all the HStore files. If we hold major compaction until after the fetch and use Scan Time Range we only need to read very little in compression.

      Attachments

        1. PIG-2886-0.patch
          6 kB
          Theodore michael Malaska
        2. PIG-2886-1.patch
          13 kB
          Theodore michael Malaska
        3. PIG-2886-2.patch
          10 kB
          Theodore michael Malaska
        4. PIG-2886-3.patch
          11 kB
          Theodore michael Malaska

        Issue Links

          Activity

            Submitted fix with pull request from tmalaska/pig github. Let me know what I need to get this into PIG. My client would like to use HBaseStorage from PIG instead of a fixed version I gave them in a jar.

            ted.m Theodore michael Malaska added a comment - Submitted fix with pull request from tmalaska/pig github. Let me know what I need to get this into PIG. My client would like to use HBaseStorage from PIG instead of a fixed version I gave them in a jar.
            cheolsoo Cheolsoo Park added a comment -

            Hi Ted, I think that you need to post your patch to this jira and wait for a committer to review/commit it. Please refer to:

            https://cwiki.apache.org/confluence/display/PIG/HowToContribute

            cheolsoo Cheolsoo Park added a comment - Hi Ted, I think that you need to post your patch to this jira and wait for a committer to review/commit it. Please refer to: https://cwiki.apache.org/confluence/display/PIG/HowToContribute

            OK, Sounds good. I can't do it tonight. I'll read the directions and do it tomorrow.

            ted.m Theodore michael Malaska added a comment - OK, Sounds good. I can't do it tonight. I'll read the directions and do it tomorrow.

            Adds timeRange to HBaseStorage

            ted.m Theodore michael Malaska added a comment - Adds timeRange to HBaseStorage

            Thanks for the patch Ted! The code looks good, just a few nits about style mainly.

            • That patch has a bunch of diff info about your git internal files, so it doesn't apply.
            • Standard indents in Pig are 4 spaces (no tabs).
            • Use a single space after brackets and between close parens and open brackets in your else/if statements.
            • else should be on one line, i.e., } else {
            • A pair of empty newlines were added after the ignoreWhitespace_ block, which should be removed.
            • Typos: TimeRagne and "Timestamp most be"

            Also would you please add a unit test to TestHBaseStorage.

            billgraham William W. Graham Jr added a comment - Thanks for the patch Ted! The code looks good, just a few nits about style mainly. That patch has a bunch of diff info about your git internal files, so it doesn't apply. Standard indents in Pig are 4 spaces (no tabs). Use a single space after brackets and between close parens and open brackets in your else/if statements. else should be on one line, i.e., } else { A pair of empty newlines were added after the ignoreWhitespace_ block, which should be removed. Typos: TimeRagne and "Timestamp most be" Also would you please add a unit test to TestHBaseStorage.

            Making progress. Need to check some more things before I'm totally done.

            ted.m Theodore michael Malaska added a comment - Making progress. Need to check some more things before I'm totally done.

            Question.

            I added the test cases and ran the following command and I noticed the TestHBaseStore doesn't run.

            ant -Djavac.args="-Xlint -Xmaxwarns 1000" clean jar test-commit

            I'm thinking that because TestHBaseStorage doesn't extend TestCase, also no other classes call TestHBaseStorage.

            So the question is: Is there a design reason why TestHBaseStorage is not running when running unit test? Is it ok if I make TestHBaseStorage run during unit tests?

            ted.m Theodore michael Malaska added a comment - Question. I added the test cases and ran the following command and I noticed the TestHBaseStore doesn't run. ant -Djavac.args="-Xlint -Xmaxwarns 1000" clean jar test-commit I'm thinking that because TestHBaseStorage doesn't extend TestCase, also no other classes call TestHBaseStorage. So the question is: Is there a design reason why TestHBaseStorage is not running when running unit test? Is it ok if I make TestHBaseStorage run during unit tests?

            Only a subset of tests run during test-commit. test will run all of them (and take a while). Also annotations are used to indicate that that class contains tests.

            You can do this to test just one test:

            ant clean test -Dtestcase=TestHBaseStorage
            
            billgraham William W. Graham Jr added a comment - Only a subset of tests run during test-commit. test will run all of them (and take a while). Also annotations are used to indicate that that class contains tests. You can do this to test just one test: ant clean test -Dtestcase=TestHBaseStorage

            Thanks Bill,

            I tried running TestHBaseStorage and it freezes on SetUp.

            >public void setUp() throws Exception

            { > // This is needed by Pig > > cluster = MiniCluster.buildCluster(); > conf = cluster.getConfiguration(); > > util = new HBaseTestingUtility(conf); > util.startMiniZKCluster(); > util.startMiniHBaseCluster(1, 1); > }

            Just wondering if you know what I'm missing to make this work. Hopefully I will get time in the next couple of days to research this.

            ted.m Theodore michael Malaska added a comment - Thanks Bill, I tried running TestHBaseStorage and it freezes on SetUp. >public void setUp() throws Exception { > // This is needed by Pig > > cluster = MiniCluster.buildCluster(); > conf = cluster.getConfiguration(); > > util = new HBaseTestingUtility(conf); > util.startMiniZKCluster(); > util.startMiniHBaseCluster(1, 1); > } Just wondering if you know what I'm missing to make this work. Hopefully I will get time in the next couple of days to research this.
            cheolsoo Cheolsoo Park added a comment -

            Hi Ted,

            Regarding TestHBaseStorage, does it hang in hadoop 20 or 23? I assume that you're not setting "-Dhadoopversion" so using hadoop 20 by default. In hadoop 20, TestHBaseStorage passes for me with your patch. I.e. "ant clean test -Dtestcase=TestHBaseStorage -Dhadoopversion=20" passes.

            [junit] Running org.apache.pig.test.TestHBaseStorage
            [junit] Tests run: 23, Failures: 0, Errors: 0, Time elapsed: 131.728 sec
            

            If it doesn't pass for you, it should be some environment issue. (e.g. did you set umask 0022?)

            However, it does time out in hadoop 23, and I believe that it's expected since hbase jar from the maven repository is not binary compatible with hadoop 23. I.e. "ant clean test -Dtestcase=TestHBaseStorage -Dhadoopversion=23" fails with time out error, and the following error can be found in the test log (build/test/logs/TEST-org.apache.pig.test.TestHBaseStorage.txt):

            Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.protocol.FSConstants$SafeModeAction
                at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
                at java.security.AccessController.doPrivileged(Native Method)
                at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
                at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
                at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
                at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
                ... 7 more
            

            I ran into the same issue while bumping hbase to 0.94, but it seem applied to 0.90 (current version in trunk) as well. Please see HBASE-5680 for more details.

            Please anyone corrects me if I am wrong about TestHBaseStorage in hadoop 23.

            Thanks!

            cheolsoo Cheolsoo Park added a comment - Hi Ted, Regarding TestHBaseStorage, does it hang in hadoop 20 or 23? I assume that you're not setting "-Dhadoopversion" so using hadoop 20 by default. In hadoop 20, TestHBaseStorage passes for me with your patch. I.e. "ant clean test -Dtestcase=TestHBaseStorage -Dhadoopversion=20" passes. [junit] Running org.apache.pig.test.TestHBaseStorage [junit] Tests run: 23, Failures: 0, Errors: 0, Time elapsed: 131.728 sec If it doesn't pass for you, it should be some environment issue. (e.g. did you set umask 0022?) However, it does time out in hadoop 23, and I believe that it's expected since hbase jar from the maven repository is not binary compatible with hadoop 23. I.e. "ant clean test -Dtestcase=TestHBaseStorage -Dhadoopversion=23" fails with time out error, and the following error can be found in the test log (build/test/logs/TEST-org.apache.pig.test.TestHBaseStorage.txt): Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.protocol.FSConstants$SafeModeAction at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang. ClassLoader .loadClass( ClassLoader .java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang. ClassLoader .loadClass( ClassLoader .java:247) ... 7 more I ran into the same issue while bumping hbase to 0.94, but it seem applied to 0.90 (current version in trunk) as well. Please see HBASE-5680 for more details. Please anyone corrects me if I am wrong about TestHBaseStorage in hadoop 23. Thanks!

            Great thanks. Got it.

            I was first doing in on my local (no Hadoop) and it would freezy. Then I tried it on CDH4 and it didn't work either. I will try it on CDH3 tonight.

            By the way do you see anything else in the code I should add or clean up.

            I should have time to work on it tonight.

            Ted Malaska

            ted.m Theodore michael Malaska added a comment - Great thanks. Got it. I was first doing in on my local (no Hadoop) and it would freezy. Then I tried it on CDH4 and it didn't work either. I will try it on CDH3 tonight. By the way do you see anything else in the code I should add or clean up. I should have time to work on it tonight. Ted Malaska

            Hi Ted,
            Great to see clouderians contributing to Pig again!

            Couple of notes:

            minTimeRange, maxTimeRange – maybe better names would be minTimestamp and maxTimestamp ?
            That's the signature for HBase's scanTimeRange.

            Also, please fix up documentation – minTimestamp in scan.setTimeRange is inclusive (so, not strictly greater then). maxTimestamp is, indeed, exclusive – the range is [min, max)

            space between } and "else" around maxTimeRange handling.

            HBase scan also provides setTimestamp(). Might as well throw that in?

            Does your client care about # of returned versions? That's a much tricker change..

            dvryaboy Dmitriy V. Ryaboy added a comment - Hi Ted, Great to see clouderians contributing to Pig again! Couple of notes: minTimeRange, maxTimeRange – maybe better names would be minTimestamp and maxTimestamp ? That's the signature for HBase's scanTimeRange. Also, please fix up documentation – minTimestamp in scan.setTimeRange is inclusive (so, not strictly greater then). maxTimestamp is, indeed, exclusive – the range is [min, max) space between } and "else" around maxTimeRange handling. HBase scan also provides setTimestamp(). Might as well throw that in? Does your client care about # of returned versions? That's a much tricker change..

            Thx will work on it now.

            ted.m Theodore michael Malaska added a comment - Thx will work on it now.

            Cleaned things up. Change maxTimeRange and minTimeRange to maxTimestamp and minTimestamp. Plus I added a timestamp option. Along with unit tests.

            ted.m Theodore michael Malaska added a comment - Cleaned things up. Change maxTimeRange and minTimeRange to maxTimestamp and minTimestamp. Plus I added a timestamp option. Along with unit tests.

            Hey Dmitriy,

            I made the changes you requested and added in the setTimestamp() option.

            I would love to do the # of versions change, but can I do that in another jira issue so I can have this one closed.

            Thanks

            ted.m Theodore michael Malaska added a comment - Hey Dmitriy, I made the changes you requested and added in the setTimestamp() option. I would love to do the # of versions change, but can I do that in another jira issue so I can have this one closed. Thanks

            Found two type-os. I will have the fix and the maxVersion functionality soon

            ted.m Theodore michael Malaska added a comment - Found two type-os. I will have the fix and the maxVersion functionality soon

            Making more changes

            ted.m Theodore michael Malaska added a comment - Making more changes

            Let's keep the issue of multiple versions separate – it's not entirely clear how those should be returned (a bag?)

            dvryaboy Dmitriy V. Ryaboy added a comment - Let's keep the issue of multiple versions separate – it's not entirely clear how those should be returned (a bag?)

            OK sounds good. I will just update the type-o then

            ted.m Theodore michael Malaska added a comment - OK sounds good. I will just update the type-o then

            Ted, let me know if you want me to review, k? Wasn't clear to me from the last message if you are in a 'done' state or if you are just posting intermediate work right now.

            dvryaboy Dmitriy V. Ryaboy added a comment - Ted, let me know if you want me to review, k? Wasn't clear to me from the last message if you are in a 'done' state or if you are just posting intermediate work right now.

            Give me a couple minutes I need to check in a new comment and help message

            ted.m Theodore michael Malaska added a comment - Give me a couple minutes I need to check in a new comment and help message

            Ok I'll take a look tomorrow. Going rogue for a bit, disconnecting .

            dvryaboy Dmitriy V. Ryaboy added a comment - Ok I'll take a look tomorrow. Going rogue for a bit, disconnecting .

            Fixed type-os

            ted.m Theodore michael Malaska added a comment - Fixed type-os

            Urk.. git patch. You need to generate it with 'git diff --no-prefix' otherwise we can't apply it. I mean, we can, and I did, but for next time, --no-prefix makes life easier .

            dvryaboy Dmitriy V. Ryaboy added a comment - Urk.. git patch. You need to generate it with 'git diff --no-prefix' otherwise we can't apply it. I mean, we can, and I did, but for next time, --no-prefix makes life easier .

            Applied to trunk.

            As there any need to apply this to 0.10 branch?

            Not sure we'll release a 0.10.1 branch at this point..

            dvryaboy Dmitriy V. Ryaboy added a comment - Applied to trunk. As there any need to apply this to 0.10 branch? Not sure we'll release a 0.10.1 branch at this point..

            Wow thx. Sorry about the --no-prefix I will make sure to do that in the future.

            ted.m Theodore michael Malaska added a comment - Wow thx. Sorry about the --no-prefix I will make sure to do that in the future.

            seems to duplicate (and fix) PIG-1832

            zeph Guido Serra aka Zeph added a comment - seems to duplicate (and fix) PIG-1832

            People

              malaskat Ted Malaska
              ted.m Theodore michael Malaska
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: