Solr
  1. Solr
  2. SOLR-2347

Use InputStream and not Reader for XML parsing

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Labels:
      None

      Description

      Followup to SOLR-96:

      Solr mostly uses java.io.Reader and passes this Reader to the XML parser. According to XML spec, a XML file should be initially seen as a binary stream with a default charset of UTF-8 or another charset given by the network protocol (like Content-Type header in HTTP). But very important, this default charset is only a "hint" to the parser - mandatory is the charset from the XML header processing inctruction. Because of this, the parser must be able to change the charset when reading the XML headers (possibly also when seeing BOM markers). This is not possible if the XML parser gets a java.io.Reader instead of java.io.InputStreams. SOLR-96 already fixed this for the XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This issue should fix the rest to be conforming to XML-spec (open schema.xml and config.xml as InputStream not Reader and others).

      This change would not break anything in Solr (perhaps only backwards compatibility in the API), as the default used by XML parsers is UTF-8.

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          To again post the most important info from XML spec how to handle charset detection:

          XML parsers by definition only take a byte stream and a charset hint and use the XML 1.0 spec to determince the charset: http://www.w3.org/TR/REC-xml/#charencoding and http://www.w3.org/TR/REC-xml/#sec-guessing

          Show
          Uwe Schindler added a comment - To again post the most important info from XML spec how to handle charset detection: XML parsers by definition only take a byte stream and a charset hint and use the XML 1.0 spec to determince the charset: http://www.w3.org/TR/REC-xml/#charencoding and http://www.w3.org/TR/REC-xml/#sec-guessing
          Hide
          Uwe Schindler added a comment -

          The same problem applies to DIH, this may be the problem behind SOLR-2346!

          DataSource in DIH always is generified to <Reader>, so all input from files or URLs are read using a charset. When you want to stream XML data using DIH, the XML parser is also unable to use the encoding as declared in the XML file itsself. In my opinion, all DataSources should simply be generified to DataSource<ContentStream>, which makes also a lot of code easier and the consumer can choose between binary or chars.

          Show
          Uwe Schindler added a comment - The same problem applies to DIH, this may be the problem behind SOLR-2346 ! DataSource in DIH always is generified to <Reader>, so all input from files or URLs are read using a charset. When you want to stream XML data using DIH, the XML parser is also unable to use the encoding as declared in the XML file itsself. In my opinion, all DataSources should simply be generified to DataSource<ContentStream>, which makes also a lot of code easier and the consumer can choose between binary or chars.
          Hide
          Uwe Schindler added a comment -

          I checked Schema and SolrConfig loading: It's fine, no changes needed. It uses InputStream to load.

          Show
          Uwe Schindler added a comment - I checked Schema and SolrConfig loading: It's fine, no changes needed. It uses InputStream to load.
          Hide
          Mark Miller added a comment -

          Thanks for looking into this Uwe!

          Show
          Mark Miller added a comment - Thanks for looking into this Uwe!
          Hide
          Uwe Schindler added a comment -

          This now only affects DIH anymore.

          DIH have to be changed that all the base classes that open files, read from network don't use Readers but InputStreams. This is easy to do, but breaks backwards (which is no problem as DIH is contrib and experimental).

          Show
          Uwe Schindler added a comment - This now only affects DIH anymore. DIH have to be changed that all the base classes that open files, read from network don't use Readers but InputStreams. This is easy to do, but breaks backwards (which is no problem as DIH is contrib and experimental).
          Hide
          Mark Miller added a comment -

          which is no problem as DIH is contrib and experimental

          +1

          Show
          Mark Miller added a comment - which is no problem as DIH is contrib and experimental +1
          Hide
          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Show
          Robert Muir added a comment - Bulk move 3.2 -> 3.3
          Hide
          Robert Muir added a comment -

          3.4 -> 3.5

          Show
          Robert Muir added a comment - 3.4 -> 3.5
          Hide
          Uwe Schindler added a comment -

          Won't fix for 3.6, in 4.0 we should move DIH from Reader to InputStream

          Show
          Uwe Schindler added a comment - Won't fix for 3.6, in 4.0 we should move DIH from Reader to InputStream
          Hide
          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

          Show
          Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Hide
          Robert Muir added a comment -

          moving all 4.0 issues not touched in a month to 4.1

          Show
          Robert Muir added a comment - moving all 4.0 issues not touched in a month to 4.1
          Hide
          James Dyer added a comment -

          Uwe, Does your concern here entirely have to do with when DIH is indexing XML files?

          Show
          James Dyer added a comment - Uwe, Does your concern here entirely have to do with when DIH is indexing XML files?
          Hide
          Uwe Schindler added a comment -

          It is not only XML files. In general, the encoding information of textual content should be determined by the parser. E.g. if you write a DIH instance reading from a network stream, the encoding might be defined by the headers (e.g. HTTP). In the case of XML it is defined by both headers and the data itsself (<?xml> header). Data import handler should in this case work with InputStreams, so the encoding could be determined later (e.g. when reading unknown text files, e.g. ICU4J could autodetect the encoding from language, etc.). This would also fit DIH better with TIKA processing.

          My proposal is to let DIH take InputStreams and let the encoding be determined in a later stage of processing.

          Show
          Uwe Schindler added a comment - It is not only XML files. In general, the encoding information of textual content should be determined by the parser. E.g. if you write a DIH instance reading from a network stream, the encoding might be defined by the headers (e.g. HTTP). In the case of XML it is defined by both headers and the data itsself (<?xml> header). Data import handler should in this case work with InputStreams, so the encoding could be determined later (e.g. when reading unknown text files, e.g. ICU4J could autodetect the encoding from language, etc.). This would also fit DIH better with TIKA processing. My proposal is to let DIH take InputStreams and let the encoding be determined in a later stage of processing.
          Hide
          James Dyer added a comment -

          Here is an attempt to fix DIH. But we cannot make DataSource always deal in InputStreams. JDBCDataSource, for instance, is entirely different and returns Maps.

          One difficult thing here is FieldReaderDataSource which can take a java.sql.Clob and pass it down to an XML or other text processor. Clob#getCharacterStream() returns a Reader, not an Inputstream. So I ended up having XpathEntityProcessor take a DataSource<?> and checking instanceof for Reader or InputStream.

          All of this makes me wonder if having DataSource separate from EntityProcessor is really good design here. The EntityProcessors are very much married to their DataSources and you really can't mix-n-match very much as the conceptualization would lend one to believe...

          Show
          James Dyer added a comment - Here is an attempt to fix DIH. But we cannot make DataSource always deal in InputStreams. JDBCDataSource, for instance, is entirely different and returns Maps. One difficult thing here is FieldReaderDataSource which can take a java.sql.Clob and pass it down to an XML or other text processor. Clob#getCharacterStream() returns a Reader, not an Inputstream. So I ended up having XpathEntityProcessor take a DataSource<?> and checking instanceof for Reader or InputStream. All of this makes me wonder if having DataSource separate from EntityProcessor is really good design here. The EntityProcessors are very much married to their DataSources and you really can't mix-n-match very much as the conceptualization would lend one to believe...
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Solr 4.9.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:

                Development