Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Per the discussion in the Nutch-User mailing list, there is a wish for an "Image Search" add-on component that will index images.

      Must have:

      • retrieve outlinks to image files from fetched pages
      • generate thumbnails from images
      • store thumbnails in the segments as an ImageWritable that contains the compressed binary data and some metadata
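
The ImageWritable record described in the last item could look roughly like the following. This is only a sketch under stated assumptions: the class name `ImageRecordSketch` and every field (`sourceUrl`, `mimeType`, `width`, `height`, `data`) are hypothetical, since the issue specifies only "compressed binary data and some metadata". To stay self-contained it mirrors the `write(DataOutput)` / `readFields(DataInput)` method shapes of Hadoop's `Writable` interface without depending on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical thumbnail record; field names are assumptions, not from the issue. */
final class ImageRecordSketch {
    String sourceUrl = "";       // URL the image was fetched from
    String mimeType = "";        // e.g. "image/jpeg"
    int width;                   // thumbnail width in pixels
    int height;                  // thumbnail height in pixels
    byte[] data = new byte[0];   // compressed thumbnail bytes

    /** Serializes the record; same method shape as Hadoop's Writable.write. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(sourceUrl); // writeUTF is limited to 65535 encoded bytes
        out.writeUTF(mimeType);
        out.writeInt(width);
        out.writeInt(height);
        out.writeInt(data.length); // length prefix so the reader knows how much to read
        out.write(data);
    }

    /** Restores the record; same method shape as Hadoop's Writable.readFields. */
    void readFields(DataInput in) throws IOException {
        sourceUrl = in.readUTF();
        mimeType = in.readUTF();
        width = in.readInt();
        height = in.readInt();
        data = new byte[in.readInt()];
        in.readFully(data);
    }

    /** Convenience helper: serialize a record to a byte array. */
    static byte[] toBytes(ImageRecordSketch record) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        }
        return buffer.toByteArray();
    }

    /** Convenience helper: deserialize a record from a byte array. */
    static ImageRecordSketch fromBytes(byte[] bytes) throws IOException {
        ImageRecordSketch record = new ImageRecordSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        }
        return record;
    }
}
```

A real implementation would declare `implements org.apache.hadoop.io.Writable` so the record can be used as a value type in a MapReduce job and written into segment files.
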

      Should have:

      • implemented as a Hadoop MapReduce job
      • should be separate from the main Nutch codeline, as it breaks the general Nutch logic of one URL == one index document.

      Could have:

      • store the original image in the segments

      Would like to have:

      • search interface for image index
      • parameterizable thumbnail generation (width, height, quality)
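
Parameterizable thumbnail generation (width, height, quality) can be sketched with the JDK's standard `javax.imageio` classes. This is a minimal illustration, not the code from this issue: the class name `ThumbnailSketch` and its method names are hypothetical, and it assumes JPEG output, that the aspect ratio should be preserved, and that images are never upscaled.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

/** Hypothetical helper; names and behavior are assumptions for illustration. */
final class ThumbnailSketch {

    /** Scales src to fit inside maxWidth x maxHeight, keeping the aspect ratio. */
    static BufferedImage scale(BufferedImage src, int maxWidth, int maxHeight) {
        double ratio = Math.min((double) maxWidth / src.getWidth(),
                                (double) maxHeight / src.getHeight());
        ratio = Math.min(ratio, 1.0); // assumption: never upscale small images
        int w = Math.max(1, (int) Math.round(src.getWidth() * ratio));
        int h = Math.max(1, (int) Math.round(src.getHeight() * ratio));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    /** Encodes img as JPEG; quality must be in [0.0, 1.0], higher means better. */
    static byte[] toJpeg(BufferedImage img, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(img, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage original = new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB);
        BufferedImage thumb = scale(original, 100, 100);
        byte[] jpeg = toJpeg(thumb, 0.8f);
        System.out.println(thumb.getWidth() + "x" + thumb.getHeight()
                + ", " + jpeg.length + " bytes");
    }
}
```

The resulting byte array is the kind of compressed binary payload that the ImageWritable in the "Must have" list would carry.
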
      1. sandbox svn folder (Sub-task; Closed; Unassigned)


        Activity

        Sanjib Narzary added a comment -

        I am working on content-based image retrieval that uses Nutch as the main search engine, with the help of the LIRe library. I will be happy if this project is ongoing.

        Lewis John McGibbney added a comment -

        This issue can be reopened should we wish to include it within a future release of Nutch which includes a web application.

        Lewis John McGibbney added a comment -

        Having had a look at this, it is not appropriate for inclusion in current Nutch implementations and would have suited a JSP-based web application, e.g. Nutch-1.2.

        I'm going to reclose the issue for now. Should we get another web application up and running, at least there has been some recent correspondence, and the code is available should anyone wish to pursue the issue further.

        Lewis John McGibbney added a comment -

        I haven't looked too deeply into this, but I think we have just missed the window where this could have been easily integrated into Nutch 1.2. As it concerns searching and viewing of images as described here [1], I really think it would be mostly useful for folks at Solr to look at; however, Nutch would obviously need some sort of image processing in the form of a plugin. My main concern is dealing with the API changes...

        Any comments are welcome from anyone familiar with the original project, or anyone who has time to look through the README linked below.

        [1] http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/imagesearch/README.txt

        Lewis John McGibbney added a comment -

        This issue is back open...

        The code was developed for integration with NutchWAX. The link to the project is:
        https://webarchive.jira.com/wiki/display/SOC06/Text-based+image+search+capability+for+NutchWAX

        The code has been made available to check out, but it works with a
        previous version of Nutch.
        http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/imagesearch/

        SVN revisions required for the code to work:

        • nutch: REV 678533
        • nutchwax: REV 2587
        • imagesearch: HEAD
        Lewis John McGibbney added a comment -

        Hi Simão, any chance we could obtain the code? If this is the case, we will reopen this issue and mark it somewhere down the line of things to deal with.

        Thank you for getting back to us on this one.

        Simão Fontes added a comment -

        The GSoC did generate some code. There have been no contributions to Nutch or Nutchwax for that matter, but the code is available.
        -1 Close

        Lewis John McGibbney added a comment -

        As there has been no progress on this issue for years, and it deviates from the direction in which Nutch branch-1.4 and trunk 2.0 are moving, it is being closed. This might not have been the case had there been some contribution from the GSoC suggestions; sadly, there has not.

        Due to no objections we are closing this issue.

        Markus Jelsma added a comment -

        Would be a nice feature but no patches. +1 close.

        Lewis John McGibbney added a comment -

        The parsing and extraction of metadata from images is handled by Apache Tika. If we were still working with a web app, it would have been possible to build a plugin that combined metadata extraction with indexable thumbnail image snippets available at search time. However, this is no longer the case, as search and indexing have been shifted to Solr.

        What is the status of this issue? Personally, I am tempted to suggest we close it, the reasoning being that it has not been given any attention in years, it reflects a requirement from an old generation of Nutch functionality, all image-related processing is covered by parse-tika, and finally there are far more important issues to be dealt with.

        One last thing: there has been no code contribution from the 2008 GSoC, so I'm guessing it was never pursued.

        Gordon Mohr added a comment -

        FYI: We've suggested image-search extensions to Nutch as a possible InternetArchive-mentored Google Summer of Code 2008 student project. (See our ideas page at <http://webteam.archive.org/confluence/display/SOC06/Summer+of+Code+2008>.) Too early to say if we'll get any good proposals or if that project will make the cut when we see the final list of proposals and how many projects we get.

        Otis Gospodnetic added a comment -

        Steve:
        I was going to say "Great to see you started work on this!", but then noticed your comment is from March 2007, not March 2008.

        So, instead, let me ask: "Did anything come out of this?" I, too, am seeing a need for a Nutch-based search engine.

        Steve Severance added a comment -

        I know the committers are hard at work on the 0.9.0 release, but I have begun to work on the first piece of this, the parser. I am looking for guidance as to how the images and thumbnails should be stored. One file per image is probably too inefficient. Are there existing file formats that the community would like to use?

        I am building a parser that can handle most image types. Should I break them out into individual plugins so there is one per file type? E.g. jpg will have an extension, gif will have a separate extension, etc. This may be more flexible in the long run. This is the first project that I am undertaking on the Nutch codebase, so any guidance would be great.

        Steve


          People

          • Assignee: Lewis John McGibbney
          • Reporter: Thomas Delnoij
          • Votes: 3
          • Watchers: 1
