Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3, nutchgora
    • Fix Version/s: 1.4
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      A simple plugin, called at indexing time, that adds fields with static data. You can specify a list of <fieldname>:<fieldcontent> pairs per Nutch job.
      It is useful when collections cannot be defined by URL patterns, as the subcollection plugin does, but have to be assigned on a per-job basis.

      1. NUTCH-940-trunk-20110911.patch
        7 kB
        Lewis John McGibbney
      2. NUTCH-940-branch-1.4-20110911-final.patch
        2 kB
        Lewis John McGibbney
      3. NUTCH-940-branch-1.4-20110910-v3.patch
        9 kB
        Lewis John McGibbney
      4. NUTCH-940-branch-1.4-20110825-v2.patch
        2 kB
        Lewis John McGibbney
      5. NUTCH-940-branch-1.4-20110824.patch
        2 kB
        Lewis John McGibbney
      6. index-static.diff
        8 kB
        Claudio Martella
      7. index-static.diff
        8 kB
        Claudio Martella
      8. static-field.diff
        8 kB
        Claudio Martella
      9. static-field.tar.gz
        2 kB
        Claudio Martella

        Issue Links

          Activity

          bronco added a comment -

          Is it possible to add a static field per domain that contains the user ID from a CMS such as Drupal? I am looking for a way to group crawled domains on a per-user basis. For example, I want to search only websites owned by user xyz. If this is possible, how can I do it?

          Thx in advance

          Claudio Martella added a comment -

          If you want to add fields based on URL patterns, I suggest you have a look at the subcollection plugin. That's exactly what it does.

          Julien Nioche added a comment -

          removed 1.2 flag as it has already been released.
          Claudio please attach a patch generated against the latest SVN code with 'svn diff', this will make it easier for others to review your contribution
          Thanks

          Claudio Martella added a comment -

          you mean nutchbase or branch-1.3?

          Julien Nioche added a comment -

          nutchbase is not an active branch. please diff against branch-1.3 and trunk thanks

          Julien Nioche added a comment -

          still needs a patch for trunk and 1.x branch
          won't be part of the next 1.3 release

          Claudio Martella added a comment -

          sorry julien, i've been quite busy. I'll fix it next week. Still gotta understand how you worked out the delegation to solr to remove dependencies on lucene in the plugin.

          Claudio Martella added a comment -

          this one is done with svn diff and is synced with nutch 1.3

          Julien Nioche added a comment -

          Claudio,

          It would be better to follow the implicit convention for naming the plugins and call it index-static for instance. This will be a better indication of what the plugin does.

          Would be better to be able to specify multiple values for a field as well i.e have a Map<String,String[]>

          Julien

          Claudio Martella added a comment -

          changed naming conventions from static-field to index-static

          Claudio Martella added a comment -

          About the multiple values, i split on commas and on colons, so values can already have multiple tokens with spaces. They will not be divided in the map, but does it make a difference at indexing time?

          i.e. this is reasonable:

          field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...

          Julien Nioche added a comment -

          Map<String,String[]> => multiple values for the same key, which is useful for multivalued fields in Solr, e.g. anchors.

          See the functionality of https://issues.apache.org/jira/browse/NUTCH-924 (and my comments there). Since your plugin is more generic, I'd rather use it as soon as it provides at least the same functionality.

          Don't forget to add to src/plugin/build.xml

          <ant dir="index-static" target="clean"/>

          Thanks

          Julien

          Julien Nioche added a comment -

          Also please add a description of the parameter used by the plugin in nutch-default.xml

          Claudio Martella added a comment -
          • fixed src/plugin/build.xml
          • changed the map to HashMap<String,String[]>

          I hope this works this way. Looks like NutchField expects a Collection.

          Is it fine to do doc.add(String, String[]) for what you required?
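
          A minimal sketch of the parsing discussed in this exchange, not the plugin's actual source. It assumes entries are separated by commas, the first colon separates the field name from its content, and whitespace separates multiple values for one field. The class and method names (StaticFieldParser, parseFields) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from the index-static patch.
class StaticFieldParser {

  // Parses a spec such as "field1:v1 v2,field2:v3" into
  // {field1=[v1, v2], field2=[v3]}. Malformed entries are skipped.
  static Map<String, String[]> parseFields(String spec) {
    Map<String, String[]> fields = new LinkedHashMap<String, String[]>();
    if (spec == null) {
      return fields;
    }
    for (String entry : spec.split(",")) {
      int colon = entry.indexOf(':');
      if (colon <= 0) {
        continue; // no colon, or nothing before it: no usable field name
      }
      String name = entry.substring(0, colon).trim();
      String content = entry.substring(colon + 1).trim();
      if (name.isEmpty() || content.isEmpty()) {
        continue; // blank name or blank content
      }
      // Assumption: whitespace separates the individual values of a field.
      fields.put(name, content.split("\\s+"));
    }
    return fields;
  }
}
```

          Each element of a resulting array could then be added to the document under the same field name to produce a multivalued field. How that call looks depends on the NutchDocument API of the branch being patched.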

          Claudio Martella added a comment -

          Just to understand, do you think anything more is needed on this one?

          Marseld Dedgjonaj added a comment - - edited

          Hello Claudio,
          I see that the patch static-field.diff does not contain the latest version of the Java file.
          So I think you should fix it.

          Regards,
          Marseld

          Markus Jelsma added a comment -

          +1 for including this in 1.4. Objections?

          Julien Nioche added a comment -

          Still needs a description of the parameter used by the plugin in nutch-default.xml
          Looks OK, apart from that

          Markus Jelsma added a comment -

          Any volunteers to assign this issue to? If not i'll pick it up for 1.4 in the next couple of weeks.

          Lewis John McGibbney added a comment -

          From reading the reasonable amount of correspondence on this one I am happy to pick it up and get a (hopefully) final patch submitted next week if this is OK with everyone. Thanks

          Markus Jelsma added a comment -

          cool, thanks! make sure to mention claudio in changes.txt

          Lewis John McGibbney added a comment -

          Final patch for review, including the previously discussed property within nutch-site.xml. My only concern is the naming of the final directory, which currently exists as static field instead of index-static as per its parent folder. Please also review and comment on the nutch-site.xml description.

          Lewis John McGibbney added a comment -

          please ignore the patch and my comments as above. Too late I'm away to bed and have lost concentration. I will deal with this tomorrow.

          Markus Jelsma added a comment - - edited

          Looks fine. If the naming is inconsistent you can change everything to index-static including directory names.

          Lewis John McGibbney added a comment - - edited

          After a bit of confusion last night I've got this working and attach a new patch which passes all tests.

          My concern was that the final directory was not consistently named and that this should change, but upon viewing the new patch it does not include all of the new files as per claudio's original.

          I'm creating the patch as per

          svn diff > patch-name.patch
          

          but it doesn't seem to want to include the new index-static plugin src and associated plugin .xml files... any ideas please.

          In addition, is it required that we get some JUnit tests together to accompany this new plugin?

          Lewis John McGibbney added a comment -

          Finally I've got this working and the attached patch passes tests and compiles without any errors. If I can get the thumbs up i'll commit.
          Thanks

          Chris A. Mattmann added a comment -

          Thumbs up. Please commit! No need to wait on issues like this where there's been tons of discussion and time in between.

          Lewis John McGibbney added a comment -

          Committed revision 1167651.

          I'll work on this for trunk and get a patch submitted shortly (hopefully tomorrow).

          Lewis John McGibbney added a comment -

          A patch with some simple documentation to keep consistency across the plugins. This completes the new index-static plugin for branch 1.4.

          Lewis John McGibbney added a comment -

          Final patch for branch 1.4 committed @ revision 1169502.
          This just adds some documentation to the class as well as adding a package.html description of the class.

          Lewis John McGibbney added a comment -

          this patch is a work in progress. There are various issues as I am getting used to the changes in code base and classes etc. If anyone feels like picking this up and giving me some pointers then it would be appreciated. I will try and pick it back up shortly and complete.

          Markus Jelsma added a comment -

          Hi Lewis,

          All looks fine except the formatting is tabbed and not 2 spaces. You could also opt for a comma separated list of key/values. Hadoop Configuration offers a convenience method for that: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html#getStrings%28java.lang.String%29

          Then all you need is split by colon.

          Cheers
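
          The suggestion above could look roughly like this. The sketch assumes its String[] argument is what Configuration.getStrings would return for the plugin's property, that is, the value already split on commas. It collects repeated keys into a list so that one field can carry several values. KeyValueSplitter and toFields are hypothetical names, and the Hadoop call itself is left out so the example has no dependencies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from any committed patch.
class KeyValueSplitter {

  // 'entries' stands in for the comma-split array returned by
  // Configuration.getStrings(...); it may be null when the property is unset.
  static Map<String, List<String>> toFields(String[] entries) {
    Map<String, List<String>> fields = new LinkedHashMap<String, List<String>>();
    if (entries == null) {
      return fields;
    }
    for (String entry : entries) {
      // Split on the first colon only, so values may themselves contain colons.
      String[] keyValue = entry.split(":", 2);
      if (keyValue.length != 2) {
        continue; // no colon in this entry
      }
      String key = keyValue[0].trim();
      String value = keyValue[1].trim();
      if (key.isEmpty() || value.isEmpty()) {
        continue;
      }
      List<String> values = fields.get(key);
      if (values == null) {
        values = new ArrayList<String>();
        fields.put(key, values);
      }
      values.add(value); // a repeated key yields a multivalued field
    }
    return fields;
  }
}
```

          Splitting on the first colon only is a design choice of this sketch; it lets a value such as a URL keep its own colons.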

          Lewis John McGibbney added a comment -

          thanks Markus please give me time to get this sorted out. I appreciate your direction.

          Lewis John McGibbney added a comment -

          My contribution for Nutch 2.0 is nearly there, so it should not be any bother should anyone wish to pick this up and complete it. The issue has been logged with NUTCH-1104 for clarity.

          Lewis John McGibbney added a comment -

          Fixed as per the previous 1.4 commit. For 2.0 issues a 'patch' has been supplied as per previous comments.

          Lewis John McGibbney added a comment -

          I took it upon myself to close this, as Claudio is no longer around...
          Hopefully this is OK.

          Claudio Martella added a comment -

          Yes, I'm around. Thanks for getting this to the end


            People

            • Assignee:
              Lewis John McGibbney
              Reporter:
              Claudio Martella
            • Votes:
              0
              Watchers:
              0
