Nutch / NUTCH-1258

MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The MoreIndexingFilter reads the Content-Type from parse metadata. However, this usually contains a lot of crap because web developers can set it to anything they like. The filter must be able to read the Content-Type field from content metadata as well because that contains the type detected by Tika's Detector.

      1. NUTCH-1258-v2.patch
        2 kB
        Julien Nioche
      2. NUTCH-1258-1.5-1.patch
        2 kB
        Markus Jelsma

          Activity

          Markus Jelsma added a comment -

          Patch for 1.5. Adds configuration to read from contentmeta, parsemeta or first parsemeta and fallback to contentmeta (default).

          Markus Jelsma added a comment -

          Comments? Tested and things work as expected; tests pass. I'll commit shortly unless there are objections.

          Julien Nioche added a comment -

          What about using a mechanism for the parameters similar to the one we use for language extraction:

          <property>
            <name>lang.extraction.policy</name>
            <value>detect,identify</value>
          </property>

          and explicitly specifying the order in which we should get the data, e.g.

          <property>
            <name>moreIndexingFilter.mimeTypeSource</name>
            <value>parse,content</value>
          </property>

          or

          <property>
            <name>moreIndexingFilter.mimeTypeSource</name>
            <value>parse</value>
          </property>

          if we don't want the content at all?
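
          The ordered-source idea above could be sketched roughly as follows. This is only an illustration, not the committed patch: the class name, the "parse" and "content" source names, and the plain Java maps standing in for Nutch's parse and content metadata are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MimeTypeSourceSketch {

  // Assumed metadata key for this sketch; not taken from the Nutch sources.
  static final String CONTENT_TYPE = "Content-Type";

  /** Splits a policy such as "parse,content" into an ordered list of source names. */
  public static List<String> parsePolicy(String policy) {
    List<String> sources = new ArrayList<>();
    for (String part : policy.split(",")) {
      String name = part.trim().toLowerCase(Locale.ROOT);
      if (!name.isEmpty()) {
        sources.add(name);
      }
    }
    return sources;
  }

  /**
   * Walks the sources in policy order and returns the first non-blank
   * Content-Type, or null if no listed source provides one.
   */
  public static String resolve(String policy,
                               Map<String, Map<String, String>> metadataBySource) {
    for (String source : parsePolicy(policy)) {
      Map<String, String> metadata = metadataBySource.get(source);
      if (metadata == null) {
        continue; // unknown or absent source: skip it
      }
      String value = metadata.get(CONTENT_TYPE);
      if (value != null && !value.trim().isEmpty()) {
        return value.trim();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Hypothetical stand-ins: parse metadata has no Content-Type, content metadata does.
    Map<String, String> parseMeta = new HashMap<>();
    Map<String, String> contentMeta = new HashMap<>();
    contentMeta.put(CONTENT_TYPE, "text/html");

    Map<String, Map<String, String>> bySource = new HashMap<>();
    bySource.put("parse", parseMeta);
    bySource.put("content", contentMeta);

    // Policy "parse,content": parse is tried first, then content.
    System.out.println(resolve("parse,content", bySource));
    // Policy "parse": content metadata is never consulted.
    System.out.println(resolve("parse", bySource));
  }
}
```

          Keeping the policy as an ordered list means a new source can be supported by registering one more map under a new name, without changing the lookup loop.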

          Markus Jelsma added a comment -

          That may indeed be a good idea, but we need to extend it too. This patch fixes some issues with bad content types, but the problem seems to be bigger. The example URL [1] doesn't provide any Content-Type in ParseMeta and provides a bad Content-Type in ContentMeta, application/x-trash, which comes from the HTTP response header. However, parserchecker and indexchecker both show contentType: text/html at the top, but this value is not added to any metadata AFAIK. In this case only contentType = content.getContentType(); returns the desired Content-Type.

          Any idea how we can get hold of that value when we have an instance of ParseData in the MoreIndexingFilter?

          [1]: http://kam.mff.cuni.cz/conferences/GraDR/

          Markus Jelsma added a comment -

          Ah, the Content-Type detected by Tika is never added to ParseMeta in the first place! I've modified TikaParser with nutchMetadata.add("Content-Type", mimeType);. In cases where I at first had a bad Content-Type in ParseMeta (but a good one in ContentMeta), I now have good old text/html. The remaining problem is with Content-Types already added to the metadata by the parser: in that case both the good and the bad Content-Type are present in ParseMeta.

          Just as commented in the code, we now have a problem with multi-valued fields.

          		// populate Nutch metadata with Tika metadata
          		String[] TikaMDNames = tikamd.names();
          		for (String tikaMDName : TikaMDNames) {
          			if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
          				continue;
          			// TODO what if multivalued?
          			nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
          		}
          

          This needs another issue opened, but some comments would be much appreciated first.

          Thanks
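
          One possible way to address the "what if multivalued?" TODO quoted above is to copy every value of a field rather than only the first. The sketch below shows that idea with plain Java collections; the class and method names are hypothetical, and the maps are stand-ins for the Tika and Nutch metadata objects. In real code the values would presumably be read with Tika's Metadata.getValues(name), which is not used here so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedCopySketch {

  /**
   * Copies all values of every field from source into a new map,
   * skipping the field whose name matches skippedName (case-insensitive),
   * mirroring how the quoted loop skips the title.
   */
  public static Map<String, List<String>> copyAllValues(
      Map<String, List<String>> source, String skippedName) {
    Map<String, List<String>> target = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : source.entrySet()) {
      if (entry.getKey().equalsIgnoreCase(skippedName)) {
        continue;
      }
      for (String value : entry.getValue()) {
        // Append each value instead of keeping only the first one.
        target.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(value);
      }
    }
    return target;
  }

  public static void main(String[] args) {
    // Hypothetical input: one multi-valued field and a title that should be skipped.
    Map<String, List<String>> tikaLike = new LinkedHashMap<>();
    tikaLike.put("Content-Type", Arrays.asList("text/html", "application/x-trash"));
    tikaLike.put("title", Arrays.asList("Example page"));

    System.out.println(copyAllValues(tikaLike, "Title"));
  }
}
```

          Note that copying all values does not by itself decide which Content-Type is the good one; it only avoids silently dropping values.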

          Markus Jelsma added a comment -

          With the patch of NUTCH-1258 this is no longer a real requirement, as the good, detected content type is put in ParseMeta and that field has priority when reading the content type via ParseData.

          Julien Nioche added a comment -

          We now have access to the detected content-type from the crawldatum metadata as of NUTCH-1259. This patch tries to get it from there first, then falls back to the parse metadata.

          Markus Jelsma added a comment -

          The patch won't apply; patch complains that it is malformed. Also, the Writable class is not imported for some reason. Otherwise it seems to work. Want me to commit?

          Julien Nioche added a comment -

          Weird. Yes, please do fix and commit if you can
          Thanks!

          Markus Jelsma added a comment -

          Committed for 1.5 in rev 1295624.
          Thanks Jul.

          Hudson added a comment -

          Integrated in nutch-trunk-maven #178 (See https://builds.apache.org/job/nutch-trunk-maven/178/)
          NUTCH-1258 MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata (Revision 1295624)

          Result = SUCCESS
          markus :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
          Hudson added a comment -

          Integrated in Nutch-trunk #1774 (See https://builds.apache.org/job/Nutch-trunk/1774/)
          NUTCH-1258 MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata (Revision 1295624)

          Result = SUCCESS
          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295624
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              0
