Apache Gora
GORA-119

implement a filter enabled scan in gora

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.4
    • Component/s: None
    • Environment:
      gora hbase gora-core gora-hbase

      Description

      It would be very helpful to implement a filtered scan to reduce scan time in the gora-core and gora-hbase components.

      1. GORA-119v3.patch
        55 kB
        Lewis John McGibbney
      2. GORA-119v3_94port.patch
        55 kB
        Lewis John McGibbney
      3. GORA-119v2.patch
        55 kB
        Lewis John McGibbney
      4. GORA-119-v1.txt
        27 kB
        Ferdy Galema
      5. gora-119-v1.1.patch
        55 kB
        Tien Nguyen Manh
      6. gora-119_v2.patch
        41 kB
        Enis Soztutar

          Activity

          Hudson added a comment -

          FAILURE: Integrated in Nutch-nutchgora #1015 (See https://builds.apache.org/job/Nutch-nutchgora/1015/)
          NUTCH-1674 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index (Tien Nguyen Manh and Alparslan Avcı via jnioche) (jnioche: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1594813)

          • /nutch/branches/2.x/CHANGES.txt
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdaterJob.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherJob.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingJob.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/Mark.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java
          Lewis John McGibbney added a comment -

          Ported to GORA_94 branch committed @revision 1562607

          Lewis John McGibbney added a comment -

          Patch for GORA_94 branch.
          Can someone please test.
          Thank you.
          This is an excellent addition to Gora.

          Hudson added a comment -

          SUCCESS: Integrated in gora-trunk #984 (See https://builds.apache.org/job/gora-trunk/984/)
          GORA-119 implement a filter enabled scan in gora (lewismc: http://svn.apache.org/viewvc/gora/trunk/?view=rev&rev=1560408)

          • /gora/trunk/gora-core/src/main/java/org/apache/gora/filter
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/filter/Filter.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/filter/FilterList.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/filter/FilterOp.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/filter/MapFieldValueFilter.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/filter/SingleFieldValueFilter.java
          • /gora/trunk/gora-core/src/test/java/org/apache/gora/filter
          • /gora/trunk/gora-core/src/test/java/org/apache/gora/filter/TestMapFieldValueFilter.java
          • /gora/trunk/gora-core/src/test/java/org/apache/gora/filter/TestSingleFieldValueFilter.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/util/BaseFactory.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/util/DefaultFactory.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/util/FilterFactory.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/util/HBaseFilterUtil.java

          GORA-119 (lewismc: http://svn.apache.org/viewvc/gora/trunk/?view=rev&rev=1560407)

          • /gora/trunk/CHANGES.txt
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/query/impl/PartitionQueryImpl.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/query/impl/QueryBase.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/query/impl/ResultBase.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/query/ws/impl/PartitionWSQueryImpl.java
          • /gora/trunk/gora-core/src/main/java/org/apache/gora/query/ws/impl/QueryWSBase.java
          • /gora/trunk/gora-dynamodb/src/main/java/org/apache/gora/dynamodb/query/DynamoDBQuery.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/query/HBaseQuery.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseColumn.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java
          • /gora/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/util/HBaseByteInterface.java
          Julien Nioche added a comment -

          Brilliant! thanks everyone

          Lewis John McGibbney added a comment -

          Committed at revisions 1560407 and 1560408 in trunk.
          Good work all involved.
          I'll port to GORA_94 tonight.

          Alexander Uretsky added a comment -

          As promised, tried this out with NUTCH-1674 and it works great! Thanks everyone for your help!

          Renato Javier Marroquín Mogrovejo added a comment -

          Great! Let's push this one forward then!

          Lewis John McGibbney added a comment -

          Renato Javier Marroquín Mogrovejo the v3 patch retains support for the non-Avro-based web services API and the gora-dynamodb module. If you're happy with this then let's get it committed.

          Renato Javier Marroquín Mogrovejo added a comment -

          Great catch Lewis John McGibbney!

          This is for sure one thing we must not lose: the ability to persist objects into non-Avro-based data stores.

          -public class PartitionWSQueryImpl<K, T extends Persistent>
          +public class PartitionWSQueryImpl<K, T extends PersistentBase>

          Would you like a deeper explanation on this?

          Alexander Uretsky added a comment -

          Lewis John McGibbney Well, this one builds correctly. I will try it with NUTCH-1674 (on Nutch 2.2.1) in the near future and will comment on the results. Thanks for your help, really looking forward to this patch being committed!

          Lewis John McGibbney added a comment -

          Alexander Uretsky, yeah this is what I was explaining above.
          This v3 patch fixes this and adds the unimplemented methods to the classes affected in gora-dynamodb. If you could give this a spin and report your results here it would be great.

          We are very close to getting this committed now.

          Alexander Uretsky added a comment -

          Lewis John McGibbney - There seems to be a problem when applying the patch (the latest v2) to the trunk version. I get a build failure in gora-dynamodb, something along the lines of: "DynamoDBQuery is not abstract and does not override abstract method isLocalFilterEnabled() in Query". Does this look familiar? Thanks in advance.
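
          For readers unfamiliar with this compiler message: javac reports it when an interface (or abstract class) gains a new abstract method that an existing concrete implementation has not overridden. A minimal sketch of the situation, using stand-in types rather than the real Gora classes (`QuerySketch` and `DynamoDBQuerySketch` are hypothetical names, and the default value of the flag is an assumption):

```java
/** Stand-ins for illustration only; these are not the real Gora types. */
final class AbstractMethodSketch {

  /** Hypothetical query interface that has just gained the local-filter methods. */
  interface QuerySketch {
    boolean isLocalFilterEnabled();

    void setLocalFilterEnabled(boolean enable);
  }

  /**
   * Hypothetical store-specific query. Removing either override below makes javac
   * report that the class "is not abstract and does not override abstract method".
   */
  static final class DynamoDBQuerySketch implements QuerySketch {
    private boolean localFilterEnabled = true; // assumed default

    @Override
    public boolean isLocalFilterEnabled() {
      return localFilterEnabled;
    }

    @Override
    public void setLocalFilterEnabled(boolean enable) {
      this.localFilterEnabled = enable;
    }
  }
}
```

          Adding the missing overrides to every implementing class is the usual way to clear such an error.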

          Lewis John McGibbney added a comment -

          Updated patch for trunk which people can apply cleanly to trunk.

          There are issues here (which I did not recognise until just now) concerning proposed changes to the webservices-based APIs within gora-core. These need to be resolved before we can commit this.
          The proposed changes make the webservice implementation extend PersistentBase as opposed to Persistent. This is impossible unless the WebService stores are Avro-based... currently the DynamoDB module is NOT Avro-based so we therefore extend Persistent.

          Renato Javier Marroquín Mogrovejo, can you comment/are you able to provide a bit more context so we can resolve and move on? Thanks if you are.
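
          The generic-bound problem described above can be sketched as follows. This is only an illustration under assumptions: `Persistent`, `PersistentBase`, `DynamoBean` and `WSQuery` below are simplified stand-ins declared locally, not the real Gora types, and `describe()` is a hypothetical helper.

```java
/** Simplified stand-ins; the real Gora interfaces carry many more methods. */
final class PersistentBoundSketch {

  /** Store-agnostic marker for persistable beans. */
  interface Persistent {}

  /** Stand-in for the Avro-backed base class. */
  abstract static class PersistentBase implements Persistent {}

  /** A bean for a non-Avro store: it implements Persistent only. */
  static final class DynamoBean implements Persistent {}

  /** Query bounded by Persistent, so it accepts both Avro and non-Avro beans. */
  static final class WSQuery<K, T extends Persistent> {
    private final Class<T> type;

    WSQuery(Class<T> type) {
      this.type = type;
    }

    String describe() {
      return "query over " + type.getSimpleName();
    }
  }

  // If the bound were tightened to "T extends PersistentBase", then
  // new WSQuery<String, DynamoBean>(DynamoBean.class) would no longer compile,
  // because DynamoBean is not a subclass of PersistentBase.
}
```

          With the looser bound, `new PersistentBoundSketch.WSQuery<String, PersistentBoundSketch.DynamoBean>(PersistentBoundSketch.DynamoBean.class)` type-checks, which is the property the web services API needs for non-Avro stores.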

          Lewis John McGibbney added a comment -

          Daniel Kugel yes

          Daniel Kugel added a comment -

          So the patch to use is v1.1?

          Lewis John McGibbney added a comment -

          Thanks Otis Gospodnetic, I'll commit in 24hrs unless any other objections.

          Otis Gospodnetic added a comment -

          No objections. It's been working for us.

          Lewis John McGibbney added a comment -

          Any objections to commit this?

          Lewis John McGibbney added a comment -

          Hi Talat UYARER please see my most recent comment on GORA-245. If you are able to patch up with the most recent patch and look at my comments then maybe we can get the gora-cassandra module stable before moving on to gora-accumulo and getting the branch stable. Thank you for the interest.

          Talat UYARER added a comment -

          Hi Lewis John McGibbney,

          I am waiting for Gora-94. Gora-94 is very important for Nutch. May I ask what is still missing from Gora-94? I volunteer to work on GORA-94. I will finish it for the 0.4 release.

          Lewis John McGibbney added a comment -

          I think that we should think about pushing this into trunk ASAP and shooting to release trunk. We can make the GORA_94 stuff work after we do a 0.4 release. I will volunteer to maintain the GORA_94 branch in line with the release changes.

          Tien Nguyen Manh added a comment -

          I added one new change to support byte[] in the Gora filter operators and fixed a bug in the previous patch.

          Otis Gospodnetic added a comment -

          Tien Nguyen Manh do you have anything else for this issue or is this ready to commit?

          Lewis John McGibbney added a comment - - edited

          Talat UYARER I now see tests for SingleValue and MapValue filters... sorry about that.

          Renato Javier Marroquín Mogrovejo are you able to check the changes to the WS API? Specifically that they now extend PersistentBase as opposed to Persistent... I thought only Avro-based datastore implementations were to extend PersistentBase but I may be wrong. Your input would be very valuable.

          Talat UYARER added a comment -

          Hi Lewis John McGibbney, I see a test class in the patch (TestSingleFieldValueFilter). What else do we need?

          Lewis John McGibbney added a comment -

          Hi Folks, is there any motivation to put some tests into this patch?

          Enis Soztutar added a comment -

          Awesome. This looks almost there. Just a couple of comments:

          • typo in getHbaseFitlerUtil()
          • can you update the javadoc for MapFieldValueFilter
          • FilterOp can be {EQUAL, NOT_EQUAL, LESS, LESS_OR_EQUAL, GREATER, GREATER_OR_EQUAL}
          • Are you going to implement FilterList.filter() ? We can leave that to a follow up patch if you want, just asking.
          • I guess you can remove Query.setLocalFilterEnabled() now.
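
          One way to picture the operator set proposed in the list above is an enum whose constants each interpret the result of `compareTo`. This is a sketch only: the constant names are taken from the comment, while the `matches` helper and its signature are assumptions made for illustration and are not claimed to be the API in the patch.

```java
import java.util.function.IntPredicate;

/** Sketch of a comparison-operator enum; the matches() helper is hypothetical. */
enum FilterOp {
  EQUAL(c -> c == 0),
  NOT_EQUAL(c -> c != 0),
  LESS(c -> c < 0),
  LESS_OR_EQUAL(c -> c <= 0),
  GREATER(c -> c > 0),
  GREATER_OR_EQUAL(c -> c >= 0);

  /** Interprets the sign of fieldValue.compareTo(operand). */
  private final IntPredicate signTest;

  FilterOp(IntPredicate signTest) {
    this.signTest = signTest;
  }

  /** Returns true when "fieldValue OP operand" holds. */
  <V extends Comparable<V>> boolean matches(V fieldValue, V operand) {
    return signTest.test(fieldValue.compareTo(operand));
  }
}
```

          For example, `FilterOp.LESS.matches(1, 2)` evaluates the predicate "1 < 2". Keeping the comparison logic inside the enum means a field filter only needs to store the operator and the operand.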
          Tien Nguyen Manh added a comment -

          Enis Soztutar Updated patch

          Tien Nguyen Manh added a comment - - edited

          Otis Gospodnetic I added the ASL.
          Enis Soztutar OK, I will make that change. It makes sense to use MapFieldValueFilter.

          Enis Soztutar added a comment -

          Yep, per my comments above, I would like to make these changes at least:

          • GoraFilter -> Filter
          • redo isLocalFilterEnabled
          • A filter against a map field can go to a MapFieldValueFilter. We should not be needing FieldOp.EQUALS_IN_MAP

          Tien Nguyen Manh what do you think about these?

          Otis Gospodnetic added a comment - - edited

          Tien Nguyen Manh - FilterList, BaseFactory, DefaultFactory, and FilterFactory classes need the ASL blurb at the top of the class.

          Enis Soztutar looks good to you now or should anything else be changed?

          Tien Nguyen Manh added a comment -

          Updated the patch for HBaseFilterUtil and other files.

          Tien Nguyen Manh added a comment -

          Aha, I don't know why Eclipse didn't include the new file in the patch.

          Enis Soztutar added a comment -

          Thanks for working on this.

          > we should keep filter in query after convert successfully at least for debugging or maybe submit query again.

          If the filter is 1-to-1 transferable, we do not want to filter twice, once on the client side and once on the server side.
          The problem with Query.isLocalFilterEnabled() is that the logic deciding that the filter is not needed is kept outside of the filter. This means that it is not straightforward to decompose a filter into multiple parts and execute some of the logic on the server and some on the client. The query has an all-or-nothing approach for replacing the filter. For example, if you want to have a FilterList or two filters, where one is convertible to a server-side filter and one is not, the data store should decompose this into pieces. If we want to keep whatever filter was converted to the server side, we can have a wrapper filter which passes everything, and replace the filter with this wrapper.

          > And GoraFilter to not confuse with Filter in HBase

          We do not need to differentiate by class name. We have different package names already. It is better to be consistent with the rest of the Gora API.

          The new patch does not include the HBaseFilterUtil changes.

          Tien Nguyen Manh added a comment - - edited

          I investigated Enis's change, and prefer not to merge it.
          We should keep the filter in the query after it has been converted successfully, at least for debugging or perhaps to submit the query again.
          Sometimes we can filter again: for example, the PageFilter in HBase returns the first N records from each region, so it may return more than N records to the client, and we can filter again in Gora to return exactly N docs.
          Also, I prefer the name GoraFilter, to avoid confusion with Filter in HBase.
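
          The page-limit case mentioned above can be illustrated with a small simulation. This is a sketch under assumptions: regions are modelled as plain in-memory lists, and `scanWithPerRegionLimit` and `limitOnClient` are hypothetical helper names, not HBase or Gora calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates a per-region page limit followed by a client-side trim. */
final class PageLimitSketch {

  /** Each region independently returns at most n rows, as a per-region page filter would. */
  static <T> List<T> scanWithPerRegionLimit(List<List<T>> regions, int n) {
    List<T> merged = new ArrayList<>();
    for (List<T> region : regions) {
      merged.addAll(region.subList(0, Math.min(n, region.size())));
    }
    return merged;
  }

  /** Client-side pass that trims the merged result to at most n rows. */
  static <T> List<T> limitOnClient(List<T> rows, int n) {
    return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
  }
}
```

          With two regions and n = 2, the simulated server side can hand back up to four rows, and only the second pass brings the result down to two. This is why keeping the filter available on the client can still be useful after a server-side conversion.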

          Tien Nguyen Manh added a comment - - edited

          Sure, I will merge Enis's change.
          I used this filter feature for Nutch; it reduces the time to scan the whole HBase table in the map task from 80 min to 40 min in most crawling jobs. The HBase table holds 20M URLs and my batch has about 100k URLs.

          Otis Gospodnetic added a comment -

          Tien Nguyen Manh - I see Enis Soztutar already made some changes to the original patch and added v2 of the patch. I am assuming your v1-1 version of the patch doesn't include Enis' changes? Would it be possible to "unify" the patches?
          Also, can you share how much this patch speeds up things?

          Tien Nguyen Manh added a comment -

          I updated GORA-119-v1.txt:

          • make it work with current Gora trunk
          • add support for other compareOp values (LESS, LESS_OR_EQUAL, GREATER, ...)
          • add FilterList (similar to the HBase FilterList) to allow using multiple filters
          • change HbaseFilterUtil to let users plug custom HBase filters into Gora. This is required because I am adding a custom HBase TopNFilter in Nutch to push the selection of the TopN URLs from each host from the generate mapper down to HBase.
          Enis Soztutar added a comment -

          Ferdy, the patch is great. I've rebased the patch and made some changes. I hope you guys do not mind.

          • removed localFilterEnabled, instead the DataStore inspects the filter, and returns another filter (or null) to replace the current filter. This way, the filter is decomposed into pieces that the datastore understands and can convert into efficient server side filters, while the rest of the filter can be executed from the client.
          • renamed GoraFilter to Filter to be consistent with the rest of the naming (DataStore, Query, Result, etc do not use prefix)
          • There are still some more changes I want to do: rename FilterOp, maybe redo MapFieldValueFilter. Nutch needs something like MapFieldValueFilter for marks, so we can start with this, and add the others later as needed.
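
          The decomposition idea in the first bullet above (the data store inspects the filter and returns another filter, or null, to replace it) might look roughly like the following. This is a sketch under stated assumptions, not the code in the patch: `Filter`, `PushableFilter`, `FilterList` and `pushDown` are simplified local stand-ins, the list is treated as an AND of its members, and "pushable" simply marks filters that a hypothetical store can convert to a server-side filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Simplified stand-ins; not the real org.apache.gora.filter types. */
final class FilterDecompositionSketch {

  interface Filter<T> {
    boolean accepts(T row);
  }

  /** A filter the hypothetical store knows how to run on the server. */
  static final class PushableFilter<T> implements Filter<T> {
    private final Predicate<T> predicate;

    PushableFilter(Predicate<T> predicate) {
      this.predicate = predicate;
    }

    @Override
    public boolean accepts(T row) {
      return predicate.test(row);
    }
  }

  /** AND-combination of several filters. */
  static final class FilterList<T> implements Filter<T> {
    final List<Filter<T>> filters;

    FilterList(List<Filter<T>> filters) {
      this.filters = filters;
    }

    @Override
    public boolean accepts(T row) {
      for (Filter<T> f : filters) {
        if (!f.accepts(row)) {
          return false;
        }
      }
      return true;
    }
  }

  /**
   * Moves every pushable part of the filter into serverSide and returns the
   * remainder that must still run on the client, or null if nothing is left.
   */
  static <T> Filter<T> pushDown(Filter<T> filter, List<Filter<T>> serverSide) {
    if (filter instanceof PushableFilter) {
      serverSide.add(filter);
      return null;
    }
    if (filter instanceof FilterList) {
      List<Filter<T>> rest = new ArrayList<>();
      for (Filter<T> part : ((FilterList<T>) filter).filters) {
        Filter<T> remainder = pushDown(part, serverSide);
        if (remainder != null) {
          rest.add(remainder);
        }
      }
      if (rest.isEmpty()) {
        return null;
      }
      return rest.size() == 1 ? rest.get(0) : new FilterList<>(rest);
    }
    return filter; // unknown filter type: keep it on the client
  }
}
```

          Because the list is an AND, splitting it this way does not change which rows pass overall: a row must satisfy the server-side parts and the client-side remainder. An OR-style list could not be split so simply and would have to stay on the client unless every member is pushable.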
          Otis Gospodnetic added a comment -

          Tien Nguyen Manh Could you please upload your modified patch that works with trunk? Also, you mentioned to me you'll do something to make Nutch's Generate phase faster. If that will involve changes in Gora, please share that patch, too, and please link the Nutch issue that I assume you will create, so we can track the relationship between issues. Thanks.

          Tien Nguyen Manh added a comment -

          I tested the patch; it worked with some changes for current Gora trunk.
          I use this patch for Nutch; the time for fetch, updatedb and indexer is reduced by about 50%, from 80 min to 40 min.

          Ferdy Galema added a comment -

          If it's any help I can say that we are still using this modification on our Gora branch, so it works pretty solid. (However with the limitations mentioned above, only HBase uses server side filtering).

          Otis Gospodnetic added a comment -

          Lewis John McGibbney - a colleague of mine is trying Ferdy's patch from April right now and will soon comment here about whether it worked for him or not.

          Lewis John McGibbney added a comment -

          Hi Otis. I am putting my time into stabilizing the GORA_94 branch right now. If you were able to take this one on, it would be a help to us all. We could roll it into the 0.4 release along with the Avro upgrade and the HBase upgrade (the latter two will be addressed when we merge GORA_94 back into trunk).

          Julien Nioche added a comment -

          A big fat +1 from me on this one. Would be great to have that.

          Otis Gospodnetic added a comment -

          Quick check to see if Ferdy Galema or Keith Turner or anyone else made any progress with this or have a newer version of the patch? This seems to be a bottleneck in Nutch for us. Thanks.

          Ferdy Galema added a comment - edited

          Hi Renato,

          Unfortunately it is not yet in trunk. I think it is best to go with Keith's suggestion to introduce a new maven project for the filter implementations. If anyone wants to pick this up, feel free to do so.

          Renato Javier Marroquín Mogrovejo added a comment -

          Did this ever make it to trunk? Are we planning to include this in Gora's API? The DynamoDB module is able to use filters in scan operations. I implemented them by creating two different types of queries (range queries and scan queries), and I build one or the other depending on the parameters obtained while configuring the query. Yeah, I now see this is not such an elegant solution, and I like the query optimizer idea much better.

          Ferdy Galema added a comment -

          I think it's a good idea. It's possibly the only solution to the dependency issue. However, I would like to give it a good night's rest; perhaps there is an easier method. (Also, I'm far from a Maven buff, so any help with regard to getting this started would be greatly appreciated.)

          Keith Turner added a comment -

          I agree w/ what you said about being defensive. For the example design I just took what was passed to HBaseFilterUtil.setFilter() in your patch. What do you think about having a gora-filters maven project?

          Ferdy Galema added a comment -

          (Not to undermine your comments about it becoming part of the API, I agree that this is an important design consideration too).

          Ferdy Galema added a comment -

          Yeah, therefore it is very important to design the optimizer interfaces in the most defensive manner possible. For example:

          interface HBaseQueryOptimizer extends QueryOptimizer {
            void optimize(Scan scan, HBaseStore store);
          }

          vs

          interface HBaseQueryOptimizer extends QueryOptimizer {
            Filter getHBaseFilter(HBaseMapping mapping);
          }

          The first variant exposes both the Scan and HBaseStore objects to the filter. This arguably gives filters too much control. A filter could modify the Scan in the wrong way or call the wrong HBaseStore methods, so that nasty bugs may be introduced.

          The latter simply asks the optimizer to return the HBase filter. It provides one or two objects (HBaseMapping in this case) that are harmless (immutable, copied or whatever) and can be used to build the filter. The filter has no way to modify crucial parts.
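To make the difference concrete, here is a minimal, self-contained sketch of the second variant. It does not use the real HBase or Gora classes: `ReadOnlyMapping`, `ServerFilter`, `StoreScan` and `buildScan` are hypothetical stand-ins invented for this illustration, and the `QueryOptimizer` method shown here is only one possible shape. The point is that the optimizer receives an immutable copy and hands back a value, while only the store touches the scan.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class DefensiveOptimizerSketch {

  /** Hypothetical stand-in for HBaseMapping: an immutable field-to-column lookup. */
  static final class ReadOnlyMapping {
    private final Map<String, String> columns;

    ReadOnlyMapping(Map<String, String> columns) {
      // Defensive copy: later changes made by the caller cannot leak in.
      this.columns = Collections.unmodifiableMap(new HashMap<>(columns));
    }

    String columnFor(String field) {
      return columns.get(field);
    }
  }

  /** Hypothetical stand-in for a server-side filter description. */
  static final class ServerFilter {
    final String column;
    final String expectedValue;

    ServerFilter(String column, String expectedValue) {
      this.column = column;
      this.expectedValue = expectedValue;
    }
  }

  /** Defensive variant: the optimizer only returns a filter, it never sees the scan. */
  interface QueryOptimizer {
    ServerFilter getServerFilter(ReadOnlyMapping mapping);
  }

  /** Hypothetical stand-in for the scan; only the store-side code below mutates it. */
  static final class StoreScan {
    private ServerFilter filter;

    ServerFilter getFilter() {
      return filter;
    }

    private void setFilter(ServerFilter filter) {
      this.filter = filter;
    }
  }

  /** Store-side code: asks the optimizer for a filter and applies it itself. */
  static StoreScan buildScan(QueryOptimizer optimizer, ReadOnlyMapping mapping) {
    StoreScan scan = new StoreScan();
    ServerFilter filter = optimizer.getServerFilter(mapping);
    if (filter != null) {
      scan.setFilter(filter);
    }
    return scan;
  }

  public static void main(String[] args) {
    Map<String, String> columns = new HashMap<>();
    columns.put("status", "f:st");
    ReadOnlyMapping mapping = new ReadOnlyMapping(columns);

    QueryOptimizer optimizer = m -> new ServerFilter(m.columnFor("status"), "fetched");
    StoreScan scan = buildScan(optimizer, mapping);
    System.out.println(scan.getFilter().column + " == " + scan.getFilter().expectedValue);
  }
}
```

Because the optimizer can only return a value, a buggy user-written optimizer can at worst produce a wrong filter; it cannot change scan settings or call store methods.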

          Keith Turner added a comment -

          I think there is one peril to the design I suggested: it exposes the internals of a datastore to users who write filters. Those internals therefore become part of the public API, possibly impacting the ability to make drastic changes to a datastore. If the optimizer interface only exposes stable HBase and Accumulo APIs, then maybe this is not an issue.

          Keith Turner added a comment -

          For the dependency issue, one solution is putting filters in another top-level Maven project. We could have something like the following:

          gora-core # contains the filter interfaces
          gora-accumulo
          gora-hbase
          gora-filters # depends on core and stores like hbase and accumulo....
          .
          .
          .

          I like the filter of filters that supports AND, OR. It will be interesting trying to push this one to the server side. I could develop an iterator for Accumulo that does this. The list of filters would influence the design of the AccumuloOptimizer, because it would need info from the child filter optimizers.

          Ferdy Galema added a comment -

          Hey Keith,

          I like your idea about QueryOptimizer! I was struggling a bit on my own about how to cleanly design the filter interface so that the optimization is coupled with the implementation of the filter itself. Your suggestion does exactly that! And indeed it has the additional benefit of allowing users to implement optimizers for their own filters. The only thing is that when implementing a QueryOptimizer, a user might not want to be bothered with implementing the generic non-optimized variant. (Not that it matters much; they could leave it empty of course, since it is a locally implemented filter.) A slightly bigger issue is that the Gora-supplied filters would have knowledge about stores. (Dependency inversion?) But then again, we could simply leave the optimization empty so that it is up to each store to implement optimizations for those filters.

          About the list of filters: this can easily be implemented with a Gora-supplied filter that accepts a list with an optional operator (e.g. AND, OR).
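A filter that wraps a list of filters plus an operator could look roughly like the sketch below. This is an illustration only, not the code from any of the attached patches: `RowFilter`, its `accept` method (true means "keep the row"), `Operator` and `FilterList` are names assumed here for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FilterListSketch {

  /** Hypothetical filter contract for this sketch: return true to keep the row. */
  interface RowFilter<K, T> {
    boolean accept(K key, T row);
  }

  enum Operator { AND, OR }

  /** Combines child filters; short-circuits as soon as the outcome is known. */
  static final class FilterList<K, T> implements RowFilter<K, T> {
    private final Operator operator;
    private final List<RowFilter<K, T>> filters;

    FilterList(Operator operator, List<RowFilter<K, T>> filters) {
      this.operator = operator;
      this.filters = new ArrayList<>(filters); // copy, so the list cannot change under us
    }

    @Override
    public boolean accept(K key, T row) {
      for (RowFilter<K, T> filter : filters) {
        boolean ok = filter.accept(key, row);
        if (operator == Operator.AND && !ok) {
          return false; // one rejection is enough for AND
        }
        if (operator == Operator.OR && ok) {
          return true; // one acceptance is enough for OR
        }
      }
      // AND: no child rejected the row. OR: no child accepted it.
      return operator == Operator.AND;
    }
  }

  public static void main(String[] args) {
    RowFilter<String, Integer> positive = (key, value) -> value > 0;
    RowFilter<String, Integer> even = (key, value) -> value % 2 == 0;
    List<RowFilter<String, Integer>> both = Arrays.asList(positive, even);

    RowFilter<String, Integer> and = new FilterList<>(Operator.AND, both);
    RowFilter<String, Integer> or = new FilterList<>(Operator.OR, both);

    System.out.println(and.accept("row1", 3)); // false: 3 is positive but not even
    System.out.println(or.accept("row1", 3));  // true: 3 is positive
  }
}
```

Since a `FilterList` is itself a `RowFilter`, lists can be nested to express conditions such as (A AND B) OR C. Pushing such a tree to the server side is the harder part, as noted earlier in the thread.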

          Thanks.

          Keith Turner added a comment -

          In Query, we may want to consider letting a user set a list of filters. Accumulo supports setting multiple server side iterators for a scan.

          public void setFilters(Collection<GoraFilter<K, T>> filters);

          Keith Turner added a comment -

          I like option C. Accumulo could support this too, it has server side iterators which support arbitrary filtering.

          I took a look at the patch. The strategy seems to be that the store has special-case code where it checks the type of the filter and determines if it can optimize for it. So the stores that ship w/ Gora may support the filters that ship with Gora. However, if a user writes their own filter, it will always be applied client side w/o any optimization. To address this, maybe optimizations for filters could be associated with the filter instead of the store. Then if a user writes a filter, they could optionally write optimizations for the data stores they care about. Below is a sketch of how this might be done.

          interface GoraFilter<K, T> {
            boolean filter(K key, T persistent);
            QueryOptimizer getOptimizer(DataStore ds);
          }

          interface QueryOptimizer {
          }

          interface HBaseQueryOptimizer extends QueryOptimizer {
            void optimize(Scan scan, HBaseStore store);
          }

          class SingleFieldValueFilter<K, T> implements GoraFilter<K, T> {
            public boolean filter(K key, T persistent) {
              // generic implementation for stores w/o optimization
            }

            public QueryOptimizer getOptimizer(DataStore ds) {
              if (ds instanceof HBaseStore) {
                return new HBaseQueryOptimizer() {...};
              } else if (ds instanceof AccumuloStore) {
                return new AccumuloQueryOptimizer() {...};
              }
              return null; // no optimizer available for this store
            }
          }

          Ferdy Galema added a comment -

          I went ahead and made a proposal implementation for C.

          It currently consists of a single filter, SingleFieldValueFilter. It is able to include or exclude a row based on value of a single field. Tests are included. It will work for every store, because it filters client-side. Additionally, the HBaseStore sets the filter regionserver-side for efficiency. Of course the GoraFilter has been designed with the HBase filter in mind, more or less.
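As a rough illustration of the client-side behaviour described above, the sketch below keeps or drops a row based on the value of one field. It is not the code from the attached patch: rows are modelled as plain maps, only equality is compared, and the `Mode` enum and the `accept` method (true means "keep the row") are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class SingleFieldValueSketch {

  /** Whether rows whose field equals the expected value are kept or dropped. */
  enum Mode { INCLUDE_MATCHING, EXCLUDE_MATCHING }

  /** Hypothetical client-side filter over rows modelled as field-to-value maps. */
  static final class SingleFieldValueFilter {
    private final String field;
    private final Object expected;
    private final Mode mode;

    SingleFieldValueFilter(String field, Object expected, Mode mode) {
      this.field = field;
      this.expected = expected;
      this.mode = mode;
    }

    /** Returns true when the row should be passed on to the application. */
    boolean accept(Map<String, Object> row) {
      // A missing field yields null, which only matches an expected value of null.
      boolean matches = Objects.equals(row.get(field), expected);
      return mode == Mode.INCLUDE_MATCHING ? matches : !matches;
    }
  }

  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>();
    row.put("status", "fetched");

    SingleFieldValueFilter include =
        new SingleFieldValueFilter("status", "fetched", Mode.INCLUDE_MATCHING);
    SingleFieldValueFilter exclude =
        new SingleFieldValueFilter("status", "fetched", Mode.EXCLUDE_MATCHING);

    System.out.println(include.accept(row)); // true
    System.out.println(exclude.accept(row)); // false
  }
}
```

A store that can translate the same field, value and mode into a native server-side filter would return fewer rows over the wire, while producing the same result as this client-side check.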

          This is just a start. There is still a lot of work to do, such as adding more filters (for example RowKeyFilter, ColumnPrefixFilter, FilterList etcetera) and more compare operations.

          Ferdy Galema added a comment -

          This would be very useful indeed. Nutchgora, for example, could use it for skipping rows that do not match the (generate/fetch..) mark. There are a few ways to implement it:

          A) The quick, cheap workaround: pass filter options through Configuration so that the store instance applies them to all queries. This would only work for HBase. Easy to implement.
          B) Make it a generic Query option, but make it optional. This allows some stores to implement it without requiring it of all stores. Clients need to accept that filtering might be done, but still have to check every row in order to skip where necessary.
          C) Like B, the generic option, but create a generic implementation that makes sure filtering is always applied, even for a store that does not explicitly implement it. This way it is still optional for a store to implement it (e.g. HBaseStore applying region-side filtering), but when a store does not, the generic implementation on the Gora client side will still skip the rows before they are passed to the application client.

          I would like to opt for C.
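Option C can be pictured as a thin wrapper around whatever the store returns. The sketch below is a generic illustration using only the Java standard library; `FilteringIterator`, `results` and the `storeAppliedFilter` flag are hypothetical names, not Gora API. If the store already filtered on the server, its results pass through untouched; otherwise the wrapper skips non-matching rows before the application sees them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class ClientSideFallbackSketch {

  /** Wraps a result iterator and silently skips rows the predicate rejects. */
  static final class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T next;
    private boolean hasNext;

    FilteringIterator(Iterator<T> source, Predicate<T> keep) {
      this.source = source;
      this.keep = keep;
      advance();
    }

    /** Moves to the next row that passes the filter, if there is one. */
    private void advance() {
      hasNext = false;
      while (source.hasNext()) {
        T candidate = source.next();
        if (keep.test(candidate)) {
          next = candidate;
          hasNext = true;
          return;
        }
      }
    }

    @Override
    public boolean hasNext() {
      return hasNext;
    }

    @Override
    public T next() {
      if (!hasNext) {
        throw new NoSuchElementException();
      }
      T result = next;
      advance();
      return result;
    }
  }

  /** Applies the client-side fallback only when the store did not filter itself. */
  static <T> Iterator<T> results(Iterator<T> storeResults, Predicate<T> filter,
                                 boolean storeAppliedFilter) {
    return storeAppliedFilter ? storeResults : new FilteringIterator<>(storeResults, filter);
  }

  public static void main(String[] args) {
    List<String> marks = Arrays.asList("fetch", "generate", "fetch");
    Iterator<String> it = results(marks.iterator(), "fetch"::equals, false);

    List<String> kept = new ArrayList<>();
    while (it.hasNext()) {
      kept.add(it.next());
    }
    System.out.println(kept); // [fetch, fetch]
  }
}
```

With this shape the application code is identical for every store; only the amount of data transferred differs between stores that filter server side and stores that rely on the fallback.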

          Lewis John McGibbney added a comment -

          Set and classify

          Lewis John McGibbney added a comment -

          Hi Raf. Do you have a patch for trunk? Or a suggestion of how this issue should be addressed? Thanks


            People

            • Assignee:
              Unassigned
              Reporter:
              raf shin
            • Votes:
              3
              Watchers:
              14
