Nutch / NUTCH-1259

Store detected content type in crawldatum metadata

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The MIME type detected by Tika's detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this, bad Content-Type values end up in the indexed documents.

          Activity

          Markus Jelsma added a comment -

          A solution would be to prevent the type from being added, just as is already done for the title field. That way, only a reliable Content-Type value is added to the ParseMetaData.

          // populate Nutch metadata with Tika metadata
          String[] TikaMDNames = tikamd.names();
          for (String tikaMDName : TikaMDNames) {
            if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
              continue;

            // DO NOT ADD Content-Type FROM HTTP_HEADERS, ONLY ADD THE DETECTED TYPE SEE https://issues.apache.org/jira/browse/NUTCH-1259
            if (tikaMDName.equalsIgnoreCase(Metadata.CONTENT_TYPE))
              continue;

            // TODO what if multivalued?
            nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
          }
          // Only add the detected TYPE
          nutchMetadata.add("Content-Type", mimeType);
          
          Markus Jelsma added a comment -

          Comments, please.

          Markus Jelsma added a comment -

          I'll comment on it myself then: the code above fixes the issue and adds a proper content-type to parsemeta. Consider the following URL with a very bad content-type:

          http://kam.mff.cuni.cz/conferences/GraDR/

          I'll upload a patch in a minute that sets the detected content type in the metadata instead.

          Markus Jelsma added a comment -

          Here's a patch for 1.5. Comments? We have this running in production and it works very well. It completely solves the big problem of ending up with many thousands of crap content-types.

          I'll commit this one tomorrow unless there are objections.

          Julien Nioche added a comment -

          "I'll commit this one tomorrow unless there are objections."

          Markus, I understand that you could be frustrated with having your issues not reviewed as quickly as you'd wish, but it would be nice to have a bit more notice. There aren't many active committers in the project and I can't follow the pace at which you submit patches.

          Markus Jelsma added a comment -

          You're right. But since you're most of the time the only person reviewing, and this issue has your attention now, what is your opinion on this problem?

          Lewis John McGibbney added a comment -

          Hey Markus. I'm literally up to my eyeballs with stuff the now, so sorry for not having the time to look through your work. The best I can do is have a look tomorrow; I'll give it my all then. Thanks.

          Julien Nioche added a comment -

          "// DO NOT ADD Content-Type FROM HTTP_HEADERS, ONLY ADD THE DETECTED TYPE SEE https://issues.apache.org/jira/browse/NUTCH-1259"

          Hmmm, isn't that the content-type from the HTML headers instead?

          Anyway, it is probably a good idea NOT to add it to the parse metadata, as it has already been detected from the content and stored in the content metadata. However, I can't think of a reason why we'd want to duplicate that to the parse metadata as well. The value in the content metadata is the one set by the detector and should be the correct one. Or am I missing something?

          Markus Jelsma added a comment -

          Hi,

          Consider the following URL, which produces bad output. It is not the only one that does. We've seen countless examples that produce funky values in both content meta and parse meta, or no value at all.

          http://kam.mff.cuni.cz/conferences/GraDR/

          The current Nutch trunk shows us the following metadata for this URL, obtained via parsechecker with only parse-tika enabled:

          Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 14:37:47 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip Content-Location=index.html.bak Content-Type=application/x-trash Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g 
          Parse Metadata: Content-Encoding=ISO-8859-1
          

          According to content meta it's an application/x-trash, and parse meta contains no content type at all. But it's just an ordinary HTML page. This cannot be right: from an indexing point of view we will never know that this is an HTML page. With this patch enabled we get the following output:

          Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 14:40:15 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip Content-Location=index.html.bak Content-Type=application/x-trash Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g 
          Parse Metadata: Content-Encoding=ISO-8859-1 Content-Type=text/html
          

          For us, this solves all problems, as we now rely only on Tika's MIME detector and store its result in parse meta. The value in content meta cannot be trusted. It's the same as with languages: when we do not use Tika to detect the language, we get all sorts of crap.

          Since the upgrade to Tika 1.0 and with NUTCH-1230 we obtain the detected MIME-type but it's not added to the parse meta. Now it is.

          Do you have another suggestion?

          Julien Nioche added a comment -

          Thanks for the example. Here is a summary of what is happening.
          The correct MIME type guessed by Tika is stored in the Content object. This is what is then used during the parsing step to determine which parser implementation should be used. This value is what you can see displayed by the parser checker, e.g.:

           
          fetching: http://kam.mff.cuni.cz/conferences/GraDR/
          parsing: http://kam.mff.cuni.cz/conferences/GraDR/
          contentType: text/html
          signature: 575aecee981b1aa03a145e3dc5b4de72
          

          This is different from the value displayed in the content metadata, which corresponds to what is returned in the protocol headers. It is also different from the value found in the parse metadata, which is what can be found in the content. Note that there is no guarantee that these two values can be found.

          Now the problem with https://issues.apache.org/jira/browse/NUTCH-1258 is that while the ParserFilters have access to the Content object, this is not the case for the IndexingFilters. One option would be to have a bespoke Parser implementation copy the content type from the Content object (i.e. the one Tika guessed) into a custom metadata field, then use that in the indexing filter. That's unnecessarily messy.

          I think a cleaner approach would be to store the guessed content-type in the crawldatum metadata. This way we:

          • can access it from the indexing filters (the parsing filter would still get it from Content if necessary)
          • do not override the value stored in parse metadata
          • can access it regardless of whether a document has been parsed or not
          • have a mechanism which is independent from the actual parser used (html / tika / other)
          • have the possibility of taking a different decision as to which value should be used (guessed vs protocol vs content)
          • keep a trace of why such or such parser was used on a given document

          This would be done in the output method of the class Fetcher.

          What do you think?
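
Editorial sketch of the proposal above: the detected type is copied into the crawldatum metadata at fetch output time, so an indexing filter can read it back without access to the Content object. The thread does not show the committed Fetcher code, so everything here is an assumption: `ContentStub` and `CrawlDatumStub` are hypothetical stand-ins for Nutch's Content and CrawlDatum (which use Hadoop Writable types rather than plain strings), and `storeDetectedType` and `resolveType` are illustrative names, not Nutch APIs.

```java
import java.util.HashMap;
import java.util.Map;

class DetectedTypeSketch {
    // Metadata key used for the detected type (assumed to match "Content-Type").
    static final String CONTENT_TYPE_KEY = "Content-Type";

    // Fetch side: copy the type guessed by the detector into the datum metadata.
    // The content object is assumed to be non-null in this sketch.
    static void storeDetectedType(ContentStub content, CrawlDatumStub datum) {
        String guessed = content.getContentType();
        if (guessed != null) {
            datum.getMetaData().put(CONTENT_TYPE_KEY, guessed);
        }
    }

    // Indexing side: prefer the guessed type, fall back to the protocol header.
    static String resolveType(CrawlDatumStub datum, String protocolType) {
        String guessed = datum.getMetaData().get(CONTENT_TYPE_KEY);
        return guessed != null ? guessed : protocolType;
    }

    public static void main(String[] args) {
        CrawlDatumStub datum = new CrawlDatumStub();
        storeDetectedType(new ContentStub("text/html"), datum);
        // The protocol header claimed application/x-trash; the guessed type wins.
        System.out.println(resolveType(datum, "application/x-trash"));
    }
}

// Hypothetical stand-in for org.apache.nutch.protocol.Content.
class ContentStub {
    private final String contentType;
    ContentStub(String contentType) { this.contentType = contentType; }
    String getContentType() { return contentType; }
}

// Hypothetical stand-in for org.apache.nutch.crawl.CrawlDatum.
class CrawlDatumStub {
    private final Map<String, String> metaData = new HashMap<>();
    Map<String, String> getMetaData() { return metaData; }
}
```

Keeping the choice between guessed and protocol values in a separate method mirrors the point that the decision about which value to use can be taken later and independently of the parser.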

          Markus Jelsma added a comment -

          Sounds good! We already store the Content-Type in the CrawlDatum's metadata for NUTCH-1024 via db.parsemeta.to.crawldb. Wouldn't it be better to store it in the CrawlDatum object itself, just like the signature? Then no one can override it by accident.

          Julien Nioche added a comment -

          I haven't looked at NUTCH-1024. Does it take the detected value from Content or the one from the parse md?
          As for storing it in the CrawlDatum, that would require changing the object, its version, making sure it remains compatible, etc., so I'd rather store it in the crawldatum md for now. It means that it can indeed be overridden, but this is quite unlikely to happen unless you write a custom resource etc. Let's keep this option in mind for later, maybe.

          Markus Jelsma added a comment -

          NUTCH-1024 relies on the Content-Type being added to the crawldatum metadata via db.parsemeta.to.crawldb.

          Anyway, I agree. Will you open another issue?

          Have a nice weekend.

          Julien Nioche added a comment -

          Nah, might as well do it in this one. Will rename it, that's all.
          Have a nice weekend too.

          Markus Jelsma added a comment -

          Great. Thanks!

          Julien Nioche added a comment -

          trunk => Committed revision 1243482.

          Hudson added a comment -

          Integrated in nutch-trunk-maven #146 (See https://builds.apache.org/job/nutch-trunk-maven/146/)
          NUTCH-1259 Store detected content type in crawldatum metadata (Revision 1243482)

          Result = SUCCESS
          jnioche :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
          Markus Jelsma added a comment -

          Hey Julien, there's something wrong with this commit. We're now seeing NPEs in the Fetcher without a stack trace. The fetcher doesn't die, but the generated fetch list is terminated quickly and only a few records get processed instead of millions. It looks like it's triggered when a fetch error occurs. You can reproduce this error by injecting an unknown host, but it's likely to happen as well when socket timeouts and related errors are thrown.

          fetch of http://idonotexist.openindex.io/ failed with: java.net.UnknownHostException: idonotexist.openindex.io
          fetch of http://idonotexist.openindex.io/ failed with: java.lang.NullPointerException
          fetcher caught:java.lang.NullPointerException
          

          Can you look at it?

          Julien Nioche added a comment -

          Good catch. I had overlooked the fact that the content object can be null. Could you svn up to revision 1243928 and give it a try?
          Thanks
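
Editorial sketch of the failure mode described above: when a fetch fails (for example on an unknown host) there is no content object, so reading its content type without a check throws a NullPointerException. Revision 1243928 itself is not shown in this thread, so this is only an assumption about the shape of the guard; `FetchedContentStub` and `output` are hypothetical names, and a plain map stands in for the crawldatum metadata.

```java
import java.util.HashMap;
import java.util.Map;

class FetchOutputSketch {
    static final String CONTENT_TYPE_KEY = "Content-Type";

    // Called for every fetched URL, including failed fetches where content is null.
    // The detected type is stored only when there is content to read it from.
    static void output(Map<String, String> datumMetaData, FetchedContentStub content) {
        if (content != null && content.getContentType() != null) {
            datumMetaData.put(CONTENT_TYPE_KEY, content.getContentType());
        }
    }

    public static void main(String[] args) {
        Map<String, String> failed = new HashMap<>();
        output(failed, null); // failed fetch: no content, nothing stored, no NPE
        System.out.println(failed.containsKey(CONTENT_TYPE_KEY));

        Map<String, String> fetched = new HashMap<>();
        output(fetched, new FetchedContentStub("text/html"));
        System.out.println(fetched.get(CONTENT_TYPE_KEY));
    }
}

// Hypothetical stand-in for the fetched content object.
class FetchedContentStub {
    private final String contentType;
    FetchedContentStub(String contentType) { this.contentType = contentType; }
    String getContentType() { return contentType; }
}
```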

          Markus Jelsma added a comment -

          Splendid work, my friend! The fetcher runs smoothly again! I'll check out your patch for NUTCH-1258 this week.
          But what about segments fetched with and without this new feature and the db.parsemeta.to.crawldb=Content-Type property?

          I assume I'd have to update the segments fetched before this change with the property enabled, and update the segments fetched with this feature without the db.parsemeta.to.crawldb property.

          Julien Nioche added a comment -

          "But what about segments fetched with and without this new feature and db.parsemeta.to.crawldb=Content-Type property?

          I assume i'd have to update the segments before this change with the property enabled and update the segments fetched with this feature without the db.parsemeta.to.crawldb property."

          Yep.

          Hudson added a comment -

          Integrated in nutch-trunk-maven #149 (See https://builds.apache.org/job/nutch-trunk-maven/149/)
          BugFix : NUTCH-1259 Store detected content type in crawldatum metadata (Revision 1243928)

          Result = SUCCESS
          jnioche :
          Files :

          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

            People

            • Assignee:
              Julien Nioche
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              0
