Description
I just noticed that Apache Tika does not preserve form feed characters (U+000C, byte 0C in UTF-8) when parsing a text file.
If I compare the hex bytes of the original file with the hex bytes of the extracted text, I can see that each 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of the Unicode replacement character (U+FFFD).
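For reference, the two byte sequences mentioned above can be reproduced with the JDK alone. This is only a minimal sketch to make the comparison concrete; the class name `FormFeedBytes` and the helper `utf8Hex` are made up for illustration and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;

public class FormFeedBytes {
    // Hypothetical helper: returns the UTF-8 bytes of a string as lowercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\f"));      // form feed (U+000C): prints 0c
        System.out.println(utf8Hex("\uFFFD"));  // replacement character (U+FFFD): prints efbfbd
    }
}
```

So a single-byte form feed in the input shows up as three bytes in the extracted text.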
Test.java
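import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;

// Assumed imports: Commons Codec for Hex, Commons IO for IOUtils.
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.io.IOUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class Test {
    public static void main(String[] args) {
        InputStream is = null;
        try {
            // Text file containing at least one form feed (0x0C) byte
            is = new FileInputStream("form_feed.txt");

            AutoDetectParser parser = new AutoDetectParser();
            Writer stringWriter = new StringWriter();
            ContentHandler handler = new BodyContentHandler(stringWriter);
            Metadata metadata = new Metadata();

            parser.parse(is, handler, metadata);

            String extractedText = stringWriter.toString();
            System.out.println(extractedText);

            // Dump the extracted text as hex to compare with the original bytes
            String hex = Hex.encodeHexString(extractedText.getBytes("UTF-8"));
            System.out.println(hex); // 0C replaced by EFBFBD
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(is);
        }
    }
}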