NUTCH-1341

NotModified time set to now but page not modified

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.6
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Servers often respond with an incorrect LastModified value, or with none at all. The reducer correctly sets the db_notmodified status for the CrawlDatum, either by comparing signatures or when (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED). The modifiedTime value, however, is not set accordingly: it ends up as the current time even though the page has not been modified.

        Activity

        Hudson added a comment -

        Integrated in Nutch-trunk #1996 (See https://builds.apache.org/job/Nutch-trunk/1996/)
        NUTCH-1341 NotModified time set to now but page not modified (Revision 1401288)

        Result = ABORTED
        markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401288
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
        Hudson added a comment -

        Integrated in nutch-trunk-maven #465 (See https://builds.apache.org/job/nutch-trunk-maven/465/)
        NUTCH-1341 NotModified time set to now but page not modified (Revision 1401288)

        Result = SUCCESS
        markus :
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
        Markus Jelsma added a comment -

        Committed for 1.6 in rev. 1401288.
        Thanks Lewis!

        Lewis John McGibbney added a comment -

        No, Markus, I don't have anything to add to this issue. I am +1 for the commit.

        Markus Jelsma added a comment -

        Any final thoughts about this?

        Markus Jelsma added a comment -

        Thanks. This has been running in production for quite some time now and has kept the modifiedTime stable over successive fetches. We use it together with the dateExtractorParseFilter, and the times no longer shift if we force a complete reindex (we usually don't index notModified pages).

        Lewis John McGibbney added a comment -

        Initially I shared your concern about exactly where this should be set, Markus. However, having looked at the usage of prevModifiedTime throughout the class (~4 occurrences) and the knock-on effect, it looks fine.

        Julien Nioche added a comment -

        Looks like a reasonable thing to do

        Markus Jelsma added a comment -

        Any comments on this one?

        Julien Nioche added a comment -

        Let's release 1.5.1 first then add new bugs (sorry) functionalities later for 1.6

        Markus Jelsma added a comment -

        I'll commit this one shortly unless there are objections or improvements.

        Markus Jelsma added a comment -

        Some comments on this change of functionality would be very much appreciated.

        Markus Jelsma added a comment -

        Here's a patch for 1.6. It simply resets the modifiedTime to the CrawlDatum's previous value right after the reducer sets a STATUS_DB_NOTMODIFIED status value. Since I believe the status is correct, I assume the modifiedTime value can be reset here as well.

        Please comment. Did I overlook something? Should it be implemented differently?

        Thanks
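
The approach described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the committed patch: the `Datum` class, the status constants and the `reduce` method are hypothetical stand-ins for Nutch's CrawlDatum and the reducer logic, and only the idea (carry the previous modifiedTime over once the entry is marked not modified) comes from the issue.

```java
import java.util.Arrays;

public class NotModifiedSketch {
    // Hypothetical status codes; the real constants are defined on Nutch's CrawlDatum.
    static final int STATUS_DB_FETCHED = 1;
    static final int STATUS_DB_NOTMODIFIED = 2;

    /** Hypothetical, stripped-down stand-in for a CrawlDb entry. */
    static final class Datum {
        int status;
        long modifiedTime;
        byte[] signature;

        Datum(int status, long modifiedTime, byte[] signature) {
            this.status = status;
            this.modifiedTime = modifiedTime;
            this.signature = signature;
        }
    }

    /**
     * Merges the stored entry with a fresh fetch result.
     * serverSaidNotModified models a fetch status of "not modified".
     */
    static Datum reduce(Datum old, Datum fetched, boolean serverSaidNotModified) {
        // Remember the stored value before anything overwrites it.
        long prevModifiedTime = old.modifiedTime;

        // The page counts as unchanged if the server said so or the signatures match.
        boolean unchanged = serverSaidNotModified
                || Arrays.equals(old.signature, fetched.signature);

        Datum result = new Datum(STATUS_DB_FETCHED, fetched.modifiedTime, fetched.signature);
        if (unchanged) {
            result.status = STATUS_DB_NOTMODIFIED;
            // The idea from this issue: the page did not change, so keep the
            // previous modifiedTime instead of the time of the latest fetch.
            result.modifiedTime = prevModifiedTime;
            result.signature = old.signature;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sig = {1, 2, 3};
        Datum old = new Datum(STATUS_DB_FETCHED, 1000L, sig);
        // The fetch carries "now" as its modified time, but the content is identical.
        Datum fetched = new Datum(STATUS_DB_FETCHED, System.currentTimeMillis(), sig.clone());

        Datum result = reduce(old, fetched, false);
        System.out.println("not modified: " + (result.status == STATUS_DB_NOTMODIFIED));
        System.out.println("modifiedTime: " + result.modifiedTime);
    }
}
```

In this sketch the stored modifiedTime stays the same over repeated fetches of an unchanged page, which matches the behaviour reported elsewhere in this thread.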


          People

          • Assignee:
            Markus Jelsma
            Reporter:
            Markus Jelsma
          • Votes:
            0
            Watchers:
            4
