TIKA-1227

Apache Tika 1.4 Duplicate extract data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: general
    • Environment:

      Ubuntu 12.04, Python 2.7, Apache Tika 1.4

      Description

      When extracting text with Apache Tika 1.4, the extracted text is duplicated in the output.

      APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))

      sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, document), shell=True)

      sout contains duplicate text.

      The issue occurs with both DOC and PDF files.
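
For anyone trying to reproduce this, here is a minimal sketch of the same invocation, written so that the command line and the duplication check can be inspected separately. It assumes the jar sits under the current directory in the layout used above. The helper names build_tika_command, extract_text and is_doubled are hypothetical, not part of Tika. is_doubled is only a crude heuristic (the output is the same block twice) and is not a diagnosis of this report.

```python
import os
import subprocess

# Assumption: the jar lives under the current directory, mirroring the report.
PROJECT_ROOT = os.getcwd()
APACHE_TIKA_PATH = os.path.abspath(
    os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))


def build_tika_command(jar_path, document):
    """Return the argument list for `java -jar <jar> -t <document>`."""
    return ['java', '-jar', jar_path, '-t', document]


def extract_text(document, jar_path=APACHE_TIKA_PATH):
    """Run tika-app in plain-text mode and return its standard output."""
    # An argument list (no shell=True) keeps paths containing spaces intact.
    return subprocess.check_output(build_tika_command(jar_path, document))


def is_doubled(text):
    """Return True if text is non-empty and is the same block repeated twice."""
    stripped = text.strip()
    if not stripped:
        return False
    half = len(stripped) // 2
    # Compare the two halves, ignoring whitespace around the split point.
    return stripped[:half].strip() == stripped[half:].strip()
```

Calling is_doubled(extract_text('tt1.doc')) on the attached file would show whether the whole output is repeated, as opposed to only some passages appearing twice.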

        Attachments

        1. tt1.doc
          41 kB
          vivek joshi

            People

            • Assignee: Unassigned
            • Reporter: vivek joshi (joshivj22)
            • Votes: 0
            • Watchers: 2

              Dates

              • Created:
                Updated:
                Resolved: