[TIKA-1794] TXTParser removes form feed characters - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 1.11
Fix Version/s: None
Component/s: parser
Labels:
- parser
- txt
Environment:

Java(TM) SE Runtime Environment (build 1.8.0_60-b27)

Description

I just noticed that Apache Tika does not preserve form feed characters (U+000C, encoded as the single byte 0C in UTF-8) when parsing a text file.

If I compare the hex bytes of the original file with the hex bytes of the extracted text, I can see that the 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of the Unicode replacement character (U+FFFD).

Test.java

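	import java.io.FileInputStream;
	import java.io.InputStream;
	import java.io.StringWriter;
	import java.io.Writer;
	import java.nio.charset.StandardCharsets;

	import org.apache.commons.codec.binary.Hex;
	import org.apache.tika.metadata.Metadata;
	import org.apache.tika.parser.AutoDetectParser;
	import org.apache.tika.sax.BodyContentHandler;
	import org.xml.sax.ContentHandler;

	public class Test {
		public static void main(String[] args) {
			// try-with-resources closes the stream even if parsing fails
			try (InputStream is = new FileInputStream("form_feed.txt")) {
				AutoDetectParser parser = new AutoDetectParser();
				Writer stringWriter = new StringWriter();
				ContentHandler handler = new BodyContentHandler(stringWriter);
				Metadata metadata = new Metadata();
				parser.parse(is, handler, metadata);

				String extractedText = stringWriter.toString();
				System.out.println(extractedText);

				// Hex-encode the extracted text so it can be compared with the original bytes
				String hex = Hex.encodeHexString(extractedText.getBytes(StandardCharsets.UTF_8));
				System.out.println(hex); // 0C replaced by EFBFBD
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}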
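
To make the byte comparison described above easier to check without Tika on the classpath, here is a minimal standard-library sketch. It only shows how a form feed and the Unicode replacement character encode in UTF-8. It does not reproduce the parser behaviour itself. The class name `FormFeedBytes` and the helper `utf8Hex` are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;

public class FormFeedBytes {
    // Hypothetical helper: lowercase hex of a string's UTF-8 bytes, no separators.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to an unsigned value so negative bytes format as two hex digits
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\f"));     // form feed (U+000C): prints 0c
        System.out.println(utf8Hex("\uFFFD")); // replacement character: prints efbfbd
    }
}
```

Seeing `efbfbd` in the extracted text's hex dump where the original file has `0c` therefore means the character was substituted with U+FFFD rather than dropped.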

Attachments

form_feed.txt
16/Nov/15 10:00
0.0 kB
Olivier Masseau

People

Assignee: Unassigned
Reporter: Olivier Masseau
Votes: 0
Watchers: 2

Dates

Created: 16/Nov/15 09:58
Updated: 16/Nov/15 15:40
Resolved: 16/Nov/15 15:40