[COR-20] Write an XML/HTML parser - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 0.5
Component/s: DocFormats - core, DocFormats - platform
Labels: None

Description

Currently we rely on libxml2 and HTML Tidy for parsing XML and HTML, respectively. In both cases we use only the parsing functions of these libraries, not their other features such as the DOM tree.

Parsing XML is not very difficult. HTML is somewhat harder because of the ambiguities that arise from the poorly defined parsing rules in earlier versions of the spec ("make a best effort" became "replicate what Internet Explorer does" because almost every site violated the rules). However, the HTML5 spec now defines a proper parsing algorithm that deals with these ambiguities. We will also need to take into account which tags must have a corresponding close tag and which do not require one.
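
As a rough illustration of the close-tag question, the sketch below classifies an element name by its end-tag rule. It is written in Python for brevity and is not part of DocFormats. The function name `end_tag_rule` and the three category labels are hypothetical, and the element lists are taken from the HTML Living Standard as best recalled, so they should be checked against the spec before being relied on.

```python
# Hypothetical sketch: classify HTML elements by their end-tag rule.
# The lists below are assumed from the HTML Living Standard and should
# be verified against the spec.

# Void elements never have an end tag (e.g. <br>, <img>).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Elements whose end tag may be omitted in certain contexts
# (the exact conditions are defined per element in the spec).
OPTIONAL_END_TAG = frozenset({
    "html", "head", "body", "li", "dt", "dd", "p", "rt", "rp",
    "optgroup", "option", "colgroup", "caption", "thead", "tbody",
    "tfoot", "tr", "td", "th",
})


def end_tag_rule(tag_name: str) -> str:
    """Return 'void', 'optional', or 'required' for an HTML element name."""
    name = tag_name.lower()  # HTML element names are case-insensitive
    if name in VOID_ELEMENTS:
        return "void"
    if name in OPTIONAL_END_TAG:
        return "optional"
    return "required"


if __name__ == "__main__":
    for tag in ("br", "P", "div"):
        print(tag, end_tag_rule(tag))
```

A parser would consult a table like this when it reaches a new start tag or the end of a parent element, to decide whether an unclosed element should be closed implicitly or reported as an error.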

Having our own parser would simplify our dependencies considerably, particularly by removing the somewhat awkward HTML Tidy.

People

Assignee: Peter Kelly

Reporter: Peter Kelly

Votes: 0

Watchers: 2

Dates

Created: 28/Dec/14 13:49

Updated: 18/Jan/15 09:30