Tika / TIKA-1794

TXTParser removes form feed characters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.11
    • Fix Version/s: None
    • Component/s: parser
    • Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)

    Description

      I just noticed that Apache Tika does not preserve form feed characters (U+000C, byte 0C in UTF-8) when parsing a text file.

      If I compare the hex bytes of the original file with the hex bytes of the extracted text, I can see that each 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of the replacement character (U+FFFD).
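
The byte values mentioned above can be checked without Tika. The following is a minimal sketch using only the JDK; the class name FormFeedBytes and the helper utf8Hex are hypothetical names introduced here for illustration and are not part of the original report.

```java
import java.nio.charset.StandardCharsets;

public class FormFeedBytes {
    /** Returns the lowercase hex encoding of the UTF-8 bytes of the given text. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to an unsigned value so negative bytes format as two hex digits
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\f"));     // form feed, U+000C: prints 0c
        System.out.println(utf8Hex("\uFFFD")); // replacement character: prints efbfbd
    }
}
```

Seeing efbfbd in the hex dump of the extracted text therefore means the output contains U+FFFD where the input had a form feed.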

      Test.java
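      import java.io.FileInputStream;
      import java.io.InputStream;
      import java.io.StringWriter;
      import java.io.Writer;
      import java.nio.charset.StandardCharsets;

      import org.apache.commons.codec.binary.Hex;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.sax.BodyContentHandler;
      import org.xml.sax.ContentHandler;

      public class Test {
      	public static void main(String[] args) {
      		// try-with-resources closes the stream even if parsing fails
      		try (InputStream is = new FileInputStream("form_feed.txt")) {
      			AutoDetectParser parser = new AutoDetectParser();
      			Writer stringWriter = new StringWriter();
      			ContentHandler handler = new BodyContentHandler(stringWriter);
      			Metadata metadata = new Metadata();
      			parser.parse(is, handler, metadata);

      			String extractedText = stringWriter.toString();
      			System.out.println(extractedText);

      			// Hex-encode the extracted text to inspect individual bytes
      			String hex = Hex.encodeHexString(extractedText.getBytes(StandardCharsets.UTF_8));
      			System.out.println(hex); // 0C replaced by EFBFBD
      		} catch (Exception e) {
      			e.printStackTrace();
      		}
      	}
      }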
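
As a follow-up to the test above, one way to confirm that every form feed in the input maps to exactly one replacement character in the output is to count both. This is only a sketch: the class CharCount and its count method are hypothetical helpers, and the strings in main are stand-ins for the original file contents and the extracted text, not actual Tika output.

```java
public class CharCount {
    /** Counts how many times the character target occurs in text. */
    static int count(String text, char target) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == target) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Stand-in values (assumptions): replace with the real file text and extracted text
        String original = "page one\fpage two\fpage three";
        String extracted = "page one\uFFFDpage two\uFFFDpage three";

        int formFeeds = count(original, '\f');
        int replacements = count(extracted, '\uFFFD');
        System.out.println("form feeds in input: " + formFeeds);
        System.out.println("replacement chars in output: " + replacements);
    }
}
```

If the two counts match and the extracted text contains no remaining form feeds, the characters are being substituted one for one rather than dropped.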
      

      Attachments

        1. form_feed.txt (0.0 kB, attached by Olivier Masseau)

          People

            Assignee: Unassigned
            Reporter: Olivier Masseau (maol)
            Votes: 0
            Watchers: 2
