  1. Nutch
  2. NUTCH-2030

ParseZip plugin is not able to extract the language from a zip document; this could solve that problem.

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.21
    • Component/s: parser, plugin
    • Labels: None
    • Environment: Linux Mint 17 Qiana, 4 GB RAM, Core i3.

    Description

      Currently the parse-zip plugin doesn't extract the language from zip documents, so the lang field is empty in Solr or Elasticsearch. If the package (.zip) contains a list of documents, the lang field could be multivalued to support that list of languages. A simple change to the parse-zip plugin could fix this problem. I will use the Language Identifier class from Tika and analyze each document inside.
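
      A rough sketch of the idea described above: walk every entry of the archive, run a language detector on its text, and collect the distinct results as a multivalued lang field. To keep it self-contained it uses only java.util.zip and does not touch the actual parse-zip code. The class name ZipLanguageSketch, the helpers detectLanguages and buildZip, and the keyword-based toyDetector are all hypothetical; in the plugin the detector would presumably be replaced by Tika's language identifier, and the entry text would come from the per-entry parser rather than a raw UTF-8 decode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipLanguageSketch {

    /**
     * Returns the distinct languages detected across all file entries of a zip
     * stream, in first-seen order. The detector is pluggable (assumption: it maps
     * plain text to a language code, or to "" when it cannot decide).
     */
    static Set<String> detectLanguages(InputStream zipData,
                                       Function<String, String> detector) throws IOException {
        Set<String> languages = new LinkedHashSet<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no text
                }
                // readAllBytes() stops at the end of the current entry.
                // Assumption: entries are UTF-8 text; real documents need a parser.
                String text = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                String lang = detector.apply(text);
                if (lang != null && !lang.isEmpty()) {
                    languages.add(lang);
                }
            }
        }
        return languages;
    }

    /** Hypothetical stand-in for a real language identifier; demo only. */
    static String toyDetector(String text) {
        String padded = " " + text.toLowerCase() + " ";
        if (padded.contains(" the ")) {
            return "en";
        }
        if (padded.contains(" el ") || padded.contains(" la ")) {
            return "es";
        }
        return "";
    }

    /** Builds an in-memory zip from {name, content} pairs so the demo needs no files. */
    static byte[] buildZip(String[][] files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (String[] file : files) {
                out.putNextEntry(new ZipEntry(file[0]));
                out.write(file[1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildZip(new String[][] {
            {"a.txt", "the crawler fetched the page"},
            {"b.txt", "el rastreador descarga la pagina"},
            {"c.txt", "the index is ready"}
        });
        Set<String> langs = detectLanguages(new ByteArrayInputStream(zip),
                                            ZipLanguageSketch::toyDetector);
        System.out.println(langs); // values that would populate a multivalued lang field
    }
}
```

      Collecting into a set keeps each language once even when several documents in the archive share it, which matches the multivalued lang field proposed in the description.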

          People

            Assignee: Unassigned
            Reporter: Eyeris Rodriguez Rueda
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: 336h
                Remaining: 336h
                Logged: Not Specified