[TIKA-40] Tika needs to support diverse character encodings. - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.1-incubating
Fix Version/s: 0.1-incubating
Component/s: general
Labels: None

Description

Currently, the text parser implementation uses the default encoding of the Java runtime when instantiating a Reader for the passed input stream. We need to support other encodings as well.

It would be helpful if the parse method allowed the caller to specify an encoding.
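As a rough illustration of the request above, the following Java sketch decodes an input stream with a caller-supplied charset and falls back to the runtime default only when none is given. The class name `ExplicitEncodingExample`, the `readText` helper, and its `encoding` parameter are hypothetical names chosen for this example; they are not Tika's parser API. The sketch also assumes Java 7 or later for `StandardCharsets` and try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingExample {

    // Hypothetical helper (not part of Tika): decodes the stream with the
    // given charset, or with the runtime default when encoding is null.
    static String readText(InputStream stream, Charset encoding) throws IOException {
        Charset charset = (encoding != null) ? encoding : Charset.defaultCharset();
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(stream, charset)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as ISO-8859-1: the last character is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset recovers the original text.
        System.out.println(readText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));

        // Decoding the same bytes as UTF-8 does not, which is the failure mode
        // a mismatched default encoding would cause.
        System.out.println(readText(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly keeps the result independent of the machine the code runs on, which is the point of the request.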

Ideally, Tika would also be able to determine the encoding automatically from the data stream. Unicode files may begin with a byte order mark (http://unicode.org/faq/utf_bom.html#BOM), but I don't know whether other encodings can be inferred from content.
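The byte order mark check mentioned above could be sketched as follows in Java. This is only an illustration under stated assumptions: `BomSniffer` and `detect` are hypothetical names, not Tika classes, and the sketch recognizes just the UTF-8 and UTF-16 marks. UTF-32 is deliberately left out, so a UTF-32LE stream (which starts with FF FE 00 00) would be misreported as UTF-16LE here. Content without a BOM returns `null`, leaving any further guessing to the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {

    // Hypothetical helper (not part of Tika): returns the charset indicated by
    // a leading byte order mark, or null when the data starts with no known BOM.
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            // Caveat: UTF-32LE also begins with FF FE; it is not handled in this sketch.
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the first bytes of data against the expected unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would typically read the first few bytes of the stream, pass them to `detect`, and fall back to an explicitly supplied or default encoding when the result is `null`.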

Attachments

- TIKA-40.patch, 10/Oct/07 09:19, 6 kB, Jukka Zitting

People

Assignee: Jukka Zitting

Reporter: Keith Bennett

Dates

Created: 01/Oct/07 23:02

Updated: 22/May/09 22:06

Resolved: 10/Oct/07 12:08