  Tika / TIKA-3125

rmeta/text and unpack - the __TEXT__ file and X-TIKA:content differ by some leading newline characters


Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Using the attached docx file, when I parse it with the /unpack endpoint, I get a __TEXT__ file that contains this:

      [[bookmark: _GoBack]Launching ms word
      
      Sadfsadfsaf
      
      Asdfsafsafasfsafd
      Asdf2
      Asfd3
      asfd
      

      But when I parse it with /rmeta/text, I get an X-TIKA:content field that contains:

      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      Launching ms word
      
      Sadfsadfsaf
      
      Asdfsafsafasfsafd
      Asdf2
      Asfd3
      asfd
      

      Why do these differ? It seems there are a bunch of leading \n characters at the start of the /rmeta/text output. There is also this strange [[bookmark: _GoBack] in the /unpack output that I wasn't expecting, and I'm not sure what it means. Perhaps the two outputs are just fundamentally different and this is normal behavior?
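
One way to check whether the two outputs differ only by the leading newlines and the bookmark marker is to normalize both strings and compare them. The following is a minimal Python sketch, not part of Tika. It assumes both outputs have already been saved as strings. The sample values are retyped from the text above, the count of 26 newlines matches the blank lines as pasted here and may not match the real response, and the helper names are hypothetical.

```python
# Marker seen at the start of the /unpack __TEXT__ output in this report.
BOOKMARK = "[[bookmark: _GoBack]"


def count_leading_newlines(text):
    """Return how many '\n' characters appear before the first other character."""
    return len(text) - len(text.lstrip("\n"))


def normalize(text):
    """Drop leading newlines, then drop the bookmark marker if it comes first."""
    text = text.lstrip("\n")
    if text.startswith(BOOKMARK):
        text = text[len(BOOKMARK):]
    return text


# Sample data retyped from the report (assumption: 26 leading newlines).
body = "Launching ms word\n\nSadfsadfsaf\n\nAsdfsafsafasfsafd\nAsdf2\nAsfd3\nasfd\n"
unpack_text = BOOKMARK + body
rmeta_text = "\n" * 26 + body

print(count_leading_newlines(unpack_text))
print(count_leading_newlines(rmeta_text))
print(normalize(unpack_text) == normalize(rmeta_text))
```

If the last comparison is true for the real responses, the two endpoints agree on the body text and differ only in the leading newlines and the bookmark marker. Note that `lstrip("\n")` does not remove `\r\n` pairs or other whitespace, so a different line-ending convention would need a wider strip set.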

      Attachments

        1. test-ooxml.docx
          11 kB
          Nicholas DiPiazza

          People

            Assignee: Unassigned
            Reporter: Nicholas DiPiazza (ndipiazza_gmail)
            Votes: 0
            Watchers: 2
