  1. OODT (Retired)
  2. OODT-652

New TikaCmdLineMetExtractor

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: metadata container
    • Labels: None
    • Don't Know (Unsure) - The default level

    Description

      Often we want to ingest a product and have basic metadata extracted from it automatically, with little effort. The Apache Tika project supports detecting and extracting the metadata associated with a product. The purpose of this issue is to integrate Tika's metadata extraction capabilities so that OODT can use them easily.

      At a minimum, this issue seeks to:

      • Use Tika's 'parse' method to extract metadata automatically
      • Include the text content (if any) of a document in a new metadata element named 'content', which will be useful for Lucene- and Solr-based free-text searches
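
      The second goal above can be sketched as follows. This is a minimal, standard-library-only illustration, not the code in the attached patch: the class name ContentMetadataSketch and the method merge are hypothetical, and the inputs stand in for what a Tika parse would yield (in real code the key/value pairs would likely come from Tika's Metadata object and the body text from a content handler such as BodyContentHandler). Only the 'content' key name is taken from this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: fold already-extracted fields and body text into one
// multi-valued metadata map, adding the text under the 'content' key.
public class ContentMetadataSketch {

    // Key name from the issue description; everything else here is assumed.
    static final String CONTENT_KEY = "content";

    static Map<String, List<String>> merge(Map<String, String[]> extracted, String bodyText) {
        Map<String, List<String>> met = new LinkedHashMap<>();

        // Copy every extracted field, keeping all of its values.
        for (Map.Entry<String, String[]> e : extracted.entrySet()) {
            met.put(e.getKey(), new ArrayList<>(Arrays.asList(e.getValue())));
        }

        // Add 'content' only when the document actually had text ("if any").
        if (bodyText != null && !bodyText.trim().isEmpty()) {
            met.computeIfAbsent(CONTENT_KEY, k -> new ArrayList<>()).add(bodyText.trim());
        }
        return met;
    }

    public static void main(String[] args) {
        // Stand-in for fields a parser might report for a plain-text file.
        Map<String, String[]> extracted = new LinkedHashMap<>();
        extracted.put("Content-Type", new String[] {"text/plain"});

        System.out.println(merge(extracted, "hello world\n"));
        // prints {Content-Type=[text/plain], content=[hello world]}
    }
}
```

      Skipping the 'content' key for documents with no text keeps empty values out of a downstream free-text index; whether the actual extractor does this is not stated in the issue.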

      Attachments

        1. OODT-652.rverma.08-27-2013.patch.txt
          4 kB
          Rishi Verma
        2. extractor-config.properties
          0.0 kB
          Rishi Verma

          People

            Assignee: Rishi Verma (riverma)
            Reporter: Rishi Verma (riverma)
            Votes: 0
            Watchers: 3
