[TIKA-394] Missing spaces on html parsing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6
Fix Version/s: 0.8
Component/s: parser
Labels:
- html
- spaces
- tag
Environment:

Tomcat 6, Windows XP (Russian locale)

Description

When parsing the following HTML:

text<p>more<br>yet<select><option>city1<option>city2</select>

the resulting text is:

textmore
yetcity1city2

But it should be:

text
more
yet city1 city2

Code sample:

import java.io.*;
import org.apache.tika.metadata.*;
import org.apache.tika.parser.*;

public class test {

    public static void main(String[] args) throws Exception {
        // Declare the content type so the input is handled as HTML.
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, "text/html");
        String content = "text<p>more<br>yet<select><option>city1<option>city2</select>";

        InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
        AutoDetectParser parser = new AutoDetectParser();
        Reader reader = new ParsingReader(parser, in, metadata, new ParseContext());

        // Read the extracted text to the end of the stream.
        char[] buf = new char[10000];
        int len;
        StringBuffer text = new StringBuffer();
        while ((len = reader.read(buf)) > 0) {
            text.append(buf, 0, len);
        }

        System.out.print(text);
    }
}
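
To make the expected output concrete, here is a minimal sketch that is independent of Tika and is not the fix attached to this issue. It flattens HTML to text with a regular expression and emits a separator at every tag boundary: a newline for tags in a hypothetical LINE_BREAK_TAGS set, a space for all others. The class name TagSpacingSketch, the method extractText, and the choice of tags are assumptions made for illustration only.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagSpacingSketch {

    // Assumption: these tags should end the current line in the extracted text.
    private static final Set<String> LINE_BREAK_TAGS = Set.of("p", "br", "div");

    // Matches an opening or closing tag and captures its name.
    private static final Pattern TAG =
            Pattern.compile("<\\s*/?\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Strips tags from the input, leaving a newline or a space where each tag was.
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());
            String name = m.group(1).toLowerCase();
            appendSeparator(out, LINE_BREAK_TAGS.contains(name) ? '\n' : ' ');
            pos = m.end();
        }
        out.append(html, pos, html.length());
        return out.toString().trim();
    }

    // Appends a separator without stacking up consecutive whitespace.
    private static void appendSeparator(StringBuilder out, char sep) {
        int last = out.length() - 1;
        if (last < 0) {
            return; // nothing emitted yet, so nothing to separate
        }
        char prev = out.charAt(last);
        if (prev == '\n') {
            return; // a newline already separates the text
        }
        if (prev == ' ') {
            if (sep == '\n') {
                out.setCharAt(last, '\n'); // upgrade a space to a newline
            }
            return;
        }
        out.append(sep);
    }
}
```

Calling TagSpacingSketch.extractText on the HTML from the description returns the three lines listed under the expected result. Treating every tag as a separator is a simplification: it would also split a word that contains inline markup such as <b>, so a real parser needs to distinguish between element types.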

Attachments

TIKA-394.patch
26/Oct/10 16:59
3 kB
Kenneth William Krugler

Issue Links

is duplicated by

TIKA-532 missing spaces in text extraction of BodyContentHandler

Closed

People

Assignee: Kenneth William Krugler

Reporter: Andrey Barhatov

Votes: 0

Watchers: 0

Dates

Created: 25/Mar/10 15:56

Updated: 25/Aug/11 16:28

Resolved: 26/Oct/10 17:02
