Nutch / NUTCH-1533

Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.2
    • Component/s: storage
    • Labels:
      None

      Description

      NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to indexing. This is currently not available because we do not store that information in the WebPage. Additionally, we do not store the other modified times (such as prevModifiedTime), yet we set them, incorrectly, in o.a.n.crawl.FetchSchedule#setFetchSchedule.
      All the above accessors should be implemented.
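
      As an illustration only, the requested accessors might look like the sketch below. SimpleWebPage is a hypothetical, simplified stand-in: the real o.a.n.storage.WebPage is generated by the Gora compiler from webpage.avsc, and the field types used here (long and String) and the sample values are assumptions.

```java
// Hypothetical, simplified stand-in for o.a.n.storage.WebPage.
// The real class is generated from webpage.avsc and differs in detail.
class SimpleWebPage {
    private long prevModifiedTime; // assumed: 0L means "not known"
    private String batchId;        // assumed: null until a batch is assigned

    long getPrevModifiedTime() {
        return prevModifiedTime;
    }

    void setPrevModifiedTime(long prevModifiedTime) {
        this.prevModifiedTime = prevModifiedTime;
    }

    String getBatchId() {
        return batchId;
    }

    void setBatchId(String batchId) {
        this.batchId = batchId;
    }
}

class AccessorDemo {
    public static void main(String[] args) {
        SimpleWebPage page = new SimpleWebPage();
        page.setBatchId("example-batch");          // made-up batch id
        page.setPrevModifiedTime(1364300000000L);  // made-up timestamp (ms)
        System.out.println(page.getBatchId() + " " + page.getPrevModifiedTime());
    }
}
```

      With accessors like these, an indexer could read the batchId from the page before building the document, which is what NUTCH-1532 asks for.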

      1. NUTCH-1533.patch
        23 kB
        lufeng
      2. NUTCH-1533v2.patch
        23 kB
        Lewis John McGibbney
      3. NUTCH-1533-v3.patch
        24 kB
        lufeng

        Issue Links

          Activity

          lufeng added a comment -

          Yes, I think this patch is OK.

          Feng committed at revision 1460380 in 2.x HEAD.

          Thanks Lewis.

          Lewis John McGibbney added a comment -

          So are you happy with the patch which has been applied?
          I see you resolved the issue.
          Do you have a commit number, please? It really helps to end Jira issues with a simple message saying that person X committed to branch Y at commit number Z.
          Thanks Feng.

          lufeng added a comment -

          Hi Lewis,
          I also found a problem when I committed this patch, but I cannot find what is causing it. Thanks Lewis.

          Lewis John McGibbney added a comment -

          I committed a trivial fix in gora-sql-mapping.xml

          <field name="batchId" column="batchId" length="32"/>
          

          Committed revision 1460464.

          Lewis John McGibbney added a comment -

          I just found out that this commit broke the build. There seems to be a problem, possibly within the gora-sql-mapping file or elsewhere. I am getting problems like the following:

          http://s.apache.org/E5O

          Lewis John McGibbney added a comment -

          Hi Feng. Can you please provide us with a commit number and also resolve this issue if you are happy with the commit?
          Thank you

          Hudson added a comment -

          Integrated in Nutch-nutchgora #541 (See https://builds.apache.org/job/Nutch-nutchgora/541/)
          NUTCH-1533 - Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage (Revision 1460380)

          Result = FAILURE
          fenglu : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1460380
          Files :

          • /nutch/branches/2.x/CHANGES.txt
          • /nutch/branches/2.x/conf/gora-accumulo-mapping.xml
          • /nutch/branches/2.x/conf/gora-cassandra-mapping.xml
          • /nutch/branches/2.x/conf/gora-hbase-mapping.xml
          • /nutch/branches/2.x/conf/gora-sql-mapping.xml
          • /nutch/branches/2.x/src/gora/webpage.avsc
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/ProtocolStatus.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/WebPage.java
          Hudson added a comment -

          Integrated in Nutch-2.x-Windows #77 (See https://builds.apache.org/job/Nutch-2.x-Windows/77/)
          NUTCH-1533 - Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage (Revision 1460380)

          Result = FAILURE
          fenglu : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1460380
          Files :

          • /nutch/branches/2.x/CHANGES.txt
          • /nutch/branches/2.x/conf/gora-accumulo-mapping.xml
          • /nutch/branches/2.x/conf/gora-cassandra-mapping.xml
          • /nutch/branches/2.x/conf/gora-hbase-mapping.xml
          • /nutch/branches/2.x/conf/gora-sql-mapping.xml
          • /nutch/branches/2.x/src/gora/webpage.avsc
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/ProtocolStatus.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/storage/WebPage.java
          Lewis John McGibbney added a comment -

          Nice one. I assigned the issue to you. I also sent away the remaining documentation to get your account set up. Apologies for the delay.
          Please commit this when you can.

          lufeng added a comment -

          Hi Lewis

          Yes, I can commit this as soon as I receive my Apache account.

          Thanks Lewis.

          Lewis John McGibbney added a comment -

          Hi Feng.
          I am +1 for committing the most recent patch. Can you please commit this?
          This way we can check if you have been set up with your Apache account, etc. properly.
          Thank you and great work on this one.

          lufeng added a comment -

          Added prevModifiedTime to both FetchSchedule method calls in the DbUpdateReducer class, for the cases where the crawl status is retry or gone. Thanks Lewis.

          lufeng added a comment -

          Hi Lewis

          I'm sorry I did not make it clear. In my opinion, prevFetchTime and prevModifiedTime are used together: either both are set to 0L, as in the CrawlStatus.STATUS_RETRY and CrawlStatus.STATUS_GONE cases, or both are set to a value, as in the CrawlStatus.STATUS_NOTMODIFIED case.

          Yes, you are right, both methods should be passed prevModifiedTime. I will modify the patch later.

          Thanks Lewis.

          Lewis John McGibbney added a comment - - edited

          Hi Feng,

          lufeng wrote: "I see that prevFetchTime is not fed into schedule#setPageRetrySchedule, so I did not feed prevModifiedTime into it either. What do you think about it?"

          I do not quite understand you here. I did not mention prevFetchTime; we are solely talking about long prevModifiedTime. Can you please expand upon your comment?

          • My point is as follows: so far this patch (correctly) accounts for the CrawlStatus.STATUS_NOTMODIFIED case; however, it does not account for CrawlStatus.STATUS_RETRY and CrawlStatus.STATUS_GONE, which call setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) and setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) respectively.

          In the current code, the long prevModifiedTime argument for both method calls is set to 0L, which IMHO is incorrect.
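
          To make the suggestion concrete, here is a minimal sketch of passing the stored value instead of 0L. It is not the committed patch: SketchPage, SketchSchedule, RecordingSchedule, UpdateSketch and the integer status values are hypothetical stand-ins, and only the parameter order of the two schedule methods is taken from the signatures quoted above. prevFetchTime is left at 0L, as in the existing code.

```java
// Hypothetical minimal stand-ins; these are not the actual Nutch classes.
class SketchPage {
    private final long fetchTime;
    private final long prevModifiedTime;

    SketchPage(long fetchTime, long prevModifiedTime) {
        this.fetchTime = fetchTime;
        this.prevModifiedTime = prevModifiedTime;
    }

    long getFetchTime() { return fetchTime; }
    long getPrevModifiedTime() { return prevModifiedTime; }
}

// Parameter order follows the signatures quoted in the comment above.
interface SketchSchedule {
    void setPageRetrySchedule(String url, SketchPage page,
            long prevFetchTime, long prevModifiedTime, long fetchTime);

    void setPageGoneSchedule(String url, SketchPage page,
            long prevFetchTime, long prevModifiedTime, long fetchTime);
}

// Records the prevModifiedTime it receives so the effect can be inspected.
class RecordingSchedule implements SketchSchedule {
    long seenPrevModifiedTime = -1L;

    public void setPageRetrySchedule(String url, SketchPage page,
            long prevFetchTime, long prevModifiedTime, long fetchTime) {
        seenPrevModifiedTime = prevModifiedTime;
    }

    public void setPageGoneSchedule(String url, SketchPage page,
            long prevFetchTime, long prevModifiedTime, long fetchTime) {
        seenPrevModifiedTime = prevModifiedTime;
    }
}

class UpdateSketch {
    static final int STATUS_RETRY = 1; // illustrative values only
    static final int STATUS_GONE = 2;

    static void update(int status, String url, SketchPage page, SketchSchedule schedule) {
        switch (status) {
            case STATUS_RETRY:
                // Pass the stored value rather than a hard-coded 0L.
                schedule.setPageRetrySchedule(url, page, 0L,
                        page.getPrevModifiedTime(), page.getFetchTime());
                break;
            case STATUS_GONE:
                schedule.setPageGoneSchedule(url, page, 0L,
                        page.getPrevModifiedTime(), page.getFetchTime());
                break;
            default:
                break;
        }
    }
}
```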

          Do you have a comment on this?

          With regards to point two, I agree with you. We should address this in a different issue if and when one wishes to do so. Thanks for the insight.

          lufeng added a comment -

          Hi Lewis

          Thanks for your reviews.

          Issues:

          • I see that prevFetchTime is not fed into schedule#setPageRetrySchedule, so I did not feed prevModifiedTime into it either. What do you think about it?
          • Currently the Host table may not be affected by batchId. If we want to add a batchId to the Host table metadata, we should probably add multiple batchIds, because two pages from one host may have different batchIds.

          Thanks Lewis.

          Lewis John McGibbney added a comment - - edited

          Hi lufeng, great work. I have uploaded a new patch and comment on it below:

          Added

          • Correct mappings for the other Gora datastores.
          • License headers for the WebPage classes generated by GoraCompiler.

          Issues


          • I have an issue about the following cases in DbUpdateReducer#reduce()
                  case CrawlStatus.STATUS_RETRY:
                    schedule.setPageRetrySchedule(url, page, 0L, 0L, page.getFetchTime());
                    if (page.getRetriesSinceFetch() < retryMax) {
                      page.setStatus(CrawlStatus.STATUS_UNFETCHED);
                    } else {
                      page.setStatus(CrawlStatus.STATUS_GONE);
                    }
                    break;
                  case CrawlStatus.STATUS_GONE:
                    schedule.setPageGoneSchedule(url, page, 0L, 0L, page.getFetchTime());
                    break;
            

          We still pass 0L as prevModifiedTime into the respective FetchSchedule methods.

          • Is the Host table affected by batchId at all? If so do we wish to associate a batchId field to the Host table metadata?

          Thanks for your work on this, it was unexpected and a real nice surprise.

          lufeng added a comment -

          Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage


            People

            • Assignee:
              lufeng
              Reporter:
              Lewis John McGibbney
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Development