ManifoldCF
CONNECTORS-623

stream_size and stream_name can't be sent

      Description

      These metadata fields can be sent to Solr in MCF 1.0.1 but cannot be sent in MCF 1.1.
      I think it is because of SolrJ.

      1. CONNECTORS-623.patch
        0.8 kB
        Shinichiro Abe

          Activity

          Shinichiro Abe added a comment -

          Though this is a temporary fix, can I commit this?

          Karl Wright added a comment -

          You can commit to trunk but I would like to figure out the right fix. If you do commit, please attach the revision number in a comment to this ticket.

          How were you setting stream_size and stream_name before? The old Solr connector did not set these, I think, or at least I don't remember any such thing. Were you explicitly setting these somehow? What repository connector are you using?

          Karl Wright added a comment -

          These metadata names do not appear at all in the old Solr connector:

          C:\wip\mcf\release-1.0-branch\connectors\solr\connector\src\main\java\org\apache\manifoldcf\agents\output\solr>grep "stream_name" *.java
          
          C:\wip\mcf\release-1.0-branch\connectors\solr\connector\src\main\java\org\apache\manifoldcf\agents\output\solr>grep "stream_size" *.java
          
          C:\wip\mcf\release-1.0-branch\connectors\solr\connector\src\main\java\org\apache\manifoldcf\agents\output\solr>
          
          Shinichiro Abe added a comment -

          Yes. The old Solr connector did not set these, but it seems that Tika extracted these metadata.
          I can see them by using ExtractingRequestHandler with "<str name="uprefix">attr_</str>" set (attr_stream_size and attr_stream_name).
          I'm using the JCIFS and Filesystem connectors.
          I fixed this before. See CONNECTORS-424.

          Shinichiro Abe added a comment -

          I want to commit not only to trunk but also to the release branch, because I know a user who uses the stream_name field.

          Karl Wright added a comment -

          Ok - so Tika was extracting the filename and stream size from the multipart header?

          I wonder how best to emulate that. The current fix emulates Tika's behavior, but probably only partially. The alternative is to make SolrJ use multipart forms for posting. There may be a way to do this, but I am not sure there is a good way to set the filename and stream size in the multipart data.

          I'll look at the SolrJ code and see what I can find.

          Karl Wright added a comment -

          If multipart is used, this is where SolrJ gets its info:

              parts.add(new FormBodyPart(content.getName(),
                  new InputStreamBody(
                      content.getStream(),
                      contentType,
                      content.getName())));
          

          So it all should work if I hook up the getName() to do the right thing, AND we can get SolrJ to use multipart post.

          Karl Wright added a comment -

          Unfortunately, I can't find a good way to make SolrJ use a multipart form, other than to provide a SECOND dummy content stream. Here's the code:

                      boolean isMultipart = ( streams != null && streams.size() > 1 );
          

          So this is what I think is necessary. (1) Please commit your change to trunk, and I will pull it up to the release branch. (2) I will also commit a change which sets the stream's content name properly, and pull that up to the release branch also. (3) I will open another ticket in Solr to deal with the fact that SolrJ may lose key information when it does not multi-part post.
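
The condition quoted in this comment can be illustrated in isolation. What follows is a minimal sketch, not SolrJ code: `MultipartDecision` and `usesMultipart` are hypothetical names, and the streams are stand-in strings rather than real content stream objects. It only shows why a single content stream never triggers a multipart post under that check, while a second dummy stream would.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Hypothetical stand-alone illustration; not part of SolrJ or ManifoldCF.
public class MultipartDecision {
    // Mirrors the quoted check: multipart is chosen only when there is more than one stream.
    static boolean usesMultipart(Collection<?> streams) {
        return streams != null && streams.size() > 1;
    }

    public static void main(String[] args) {
        // No streams at all.
        System.out.println(usesMultipart(null));
        // The usual case for a crawled document: exactly one stream.
        System.out.println(usesMultipart(Collections.singletonList("document")));
        // Adding a second dummy stream is the only way to flip the check.
        System.out.println(usesMultipart(Arrays.asList("document", "dummy")));
    }
}
```

Under this check, the file name and length that normally travel in the multipart section headers are never sent for a single-document post.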

          Shinichiro Abe added a comment -

          r1438528(trunk)

          Karl Wright added a comment -

          My change: r1438529(trunk)

          Karl Wright added a comment -

          r1438533 (both changes, release branch)

          Karl Wright added a comment -

          This ticket may need to be revisited if and when there is a change in SolrJ behavior.

          Karl Wright added a comment -

          The ticket is SOLR-4358.

          Karl Wright added a comment -

          A client is noticing that the field that winds up in Solr seems to be multivalued:

          <arr name="attr_stream_size">
          <str>null</str>
          <str>57344</str>
          </arr>
          

          Does anyone else see this? Can anyone tell me where the spurious null value is coming from?

          Karl Wright added a comment -

          The client is using unmodified Solr 3.6 with an unmodified SolrJ 4.1.0 jar that we include.

          Shinichiro Abe added a comment -

          I've never seen that in Solr 4.1; this "null" may be specific to Solr 3.6. In fact, before the code for this issue was added, attr_stream_size showed "null". At that time, I thought this occurred because the new Solr connector was introduced.

          Karl Wright added a comment -

          I think the null is coming from the Solr side. When SolrJ does not use multipart forms, there is no name and no content length for the content being sent to Solr, so Solr uses "null" instead. So your patch was just a workaround for that SolrJ bug. Unfortunately, it's not a very good workaround.

          The only ways to really fix this are:

          (1) Solr accepts our SolrJ patch, which they have shown no sign of doing, OR
          (2) We modify the SolrJ jar with our own patches just for MCF.

          I am thinking we should start thinking about doing (2).

          Karl Wright added a comment -

          I looked at a simple extension of HttpSolrServer, with the thought of just overriding the useMultiPartPost flag and setting it to "true" rather than its current default of "false". Unfortunately, while it is possible to extend the class, it is not possible to set this variable because it is private.

          Instead, we're going to have to override this method:

            public NamedList<Object> request(final SolrRequest request,
                final ResponseParser processor) throws SolrServerException, IOException {
            ...
            }
          

          Duplicating the code for this method is unfortunate, but is not the end of the world by any stretch. So my current recommendation is that we go ahead and do it, and then we'll be able to control the way content gets posted so that multipart indeed gets used.

          Karl Wright added a comment -

          r1446153 represents a (hopefully better) hack for this problem.

          Karl Wright added a comment -

          Abe-san, would you be willing to try the new hack in your environment? Do the right parameters make it through to Solr now?

          Shinichiro Abe added a comment -

          Hi, I synced up to trunk and checked. After crawling a file system and a Windows shared drive, I didn't find the attr_stream_name field. I could find the attr_stream_size field, but its value was "null". Instead of stream_name, there was a resourcename field. I'd like to get the correct stream_name and stream_size.

          Karl Wright added a comment -

          Can you verify that with current trunk Solr connector, multipart post is being used? And, that it looks similar to the multipart post done in 1.0.1? You may need WireShark to do this, but I need to know if the difference is because of multipart post differences or because of Solr differences.

          Karl Wright added a comment - - edited

          Actually, it should be possible to get the necessary information with HttpComponents HttpClient wire logging in trunk. But in 1.0.1 wire logging won't work - you have to use WireShark there.

          Looking at the 1.0.1 code, the way the info is transmitted is as follows:

              String value = "Content-Disposition: form-data";
              if (name != null)
                value += "; name=\""+name+"\"";
              if (fileName != null)
                value += "; filename=\""+fileName+"\"";
              value += "\r\n";
              byte[] tmp = value.getBytes("UTF-8");
              rval += tmp.length;
              tmp = ("Content-Type: "+contentType+"\r\n\r\n").getBytes("ASCII");
          

          ... which means that there are two headers in the multipart form section of the document:

          Content-Disposition: form-data; name=<name>; filename=<filename>
          Content-Type: <content-type>
          

          ... where, for the content, <name> is "myfile", and <filename> is the file name.

          If this is not what the multipart form poster in 1.1.1 is actually doing, I should be able to fix it to do what we need. But I'd like first to understand what it's currently doing before I start changing things, because if it is already working this way then the problem is that Solr changed too.
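
For reference, the two section headers described in this comment can be reproduced with a small stand-alone program. This is only a sketch modelled on the 1.0.1 snippet, not the connector's actual code: the class name `PartHeaderSketch`, the method `buildPartHeaders`, and the sample values `report.pdf` and `application/pdf` are all made up for illustration. Unlike the 1.0.1 snippet, it encodes both header lines as UTF-8 in one step.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper modelled on the 1.0.1 snippet; not the actual connector code.
public class PartHeaderSketch {
    // Builds the header section of one multipart/form-data part, ending with the blank line.
    static String buildPartHeaders(String name, String fileName, String contentType) {
        StringBuilder sb = new StringBuilder("Content-Disposition: form-data");
        if (name != null) {
            sb.append("; name=\"").append(name).append('"');
        }
        if (fileName != null) {
            sb.append("; filename=\"").append(fileName).append('"');
        }
        sb.append("\r\n");
        sb.append("Content-Type: ").append(contentType).append("\r\n\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // "myfile" is the part name mentioned in the comment; the other values are illustrative.
        String headers = buildPartHeaders("myfile", "report.pdf", "application/pdf");
        byte[] wire = headers.getBytes(StandardCharsets.UTF_8);
        System.out.print(headers);
        System.out.println("header bytes: " + wire.length);
    }
}
```

Comparing this output with the wire log from the trunk connector is one way to check whether the two versions post the same section headers.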

          Shinichiro Abe added a comment -

          Unfortunately, I'm not good with WireShark (I don't know how to filter), sorry. Could you reproduce and confirm that? BTW, when I post a PDF via cURL, I get the correct stream_name and stream_size. So the problem lies not in Solr but in SolrJ?

          Karl Wright added a comment -

          Abe-san, I don't know for sure where the problem lies. I am just hoping that it's on the SolrJ side so I can fix it in some way or another. If it is in Solr, then there is nothing I can do and your fix is as good as any, and Shigeki will have to live with multiple values in his setup.

          I will try to set this up sometime today.

          Karl Wright added a comment -

          There was a problem and multi-part forms were still not getting enabled. So I checked in a fix:
          r1447262.

          I verified that the form posting all looks reasonable. Please give this a try, Abe-san, and let me know what happens on the Solr side.

          Karl Wright added a comment -

          r1447526 fixes a problem with deletes and commits. Now the Solr integration test passes too.

          Karl Wright added a comment -

          r1447556 is yet another fix, this time for an illegal argument exception.

          Shinichiro Abe added a comment -

          Now I can get stream_name and stream_size on both Solr 3.6 and Solr 4.1. Thank you.

          Shinichiro Abe added a comment -

          Hi,
          I realized that "stream_name" is garbled in MCF 1.2 when a file name contains Japanese characters.
          There is no problem when a file name is in English.
          The value of the "resourcename" field is not garbled even if the file name is written in Japanese.
          I think I may have forgotten to check a Japanese file name at that time.
          Could you fix that?

          Karl Wright added a comment -

          Hi Abe-san,

          The place where this happens may be in Solr; it is not clear.

          Please look at ModifiedHttpSolrServer.java, starting at line 156. The content name field gets added to the form, I believe, at line 186. The org.apache.http.entity.mime.FormBodyPart class is what would be responsible for any encoding.

          Can you do the following:
          (1) Include the INFO output from Solr for the post?
          (2) Turn on wire debugging (in logging.ini, see http://hc.apache.org/httpcomponents-client-ga/logging.html), and include the actual post where you see garbled characters? Then I can figure out where the problem is occurring.

          Thanks,
          Karl

          Shinichiro Abe added a comment -

          Hi Karl,

          (1) Include the INFO output from Solr for the post?

          INFO  - 2013-06-14 16:56:44.083; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.id=file:/Users/abe/Desktop/test2/ロンウイット.txt&resource.name=ロンウイット.txt&wt=xml&version=2.2&literal.uri=/Users/abe/Desktop/test2/ロンウイット.txt} {add=[file:/Users/abe/Desktop/test2/ロンウイット.txt (1437803850320838656)]} 0 5
          

          (2) Turn on wire debugging

          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "Content-Disposition: form-data; name="??????.txt"; filename="??????.txt"[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "Content-Type: text/plain[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "Content-Transfer-Encoding: binary[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "test"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "28[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "[\r][\n]"
          DEBUG 2013-06-14 16:56:44,077 (Thread-2261) - >> "--wm_LDbvcKR7PUhcAeC4tnPfvpdABhNLZ--[\r][\n]"
          

          When I debugged in Eclipse, contentName at line 185 was not garbled.
          It seems that the filename becomes garbled (shown as "??????.txt") in the wire debugging output.
          I don't understand why.
          Thank you.

          Karl Wright added a comment -

          Hi Abe-san,

          Thank you for the data.

          Headers in HTTP are supposed to be 7-bit ASCII, which is probably why it comes up as "??????.txt". The extension comes through correctly because it is always ASCII, but transmitting an accurate and complete file name this way may never have worked. In my code, before we went to SolrJ and HttpClient, I used UTF-8, as follows:

              String value = "Content-Disposition: form-data";
              if (name != null)
                value += "; name=\""+name+"\"";
              if (fileName != null)
                value += "; filename=\""+fileName+"\"";
              value += "\r\n";
              byte[] tmp = value.getBytes("UTF-8");
          

          I will do research to see if there is any W3C specification that seems to permit using UTF-8 rather than 7-bit ASCII for form headers in form data, because without such an identified specification it is unlikely I could reasonably change HttpComponents HttpClient to work in this way. It is also possible that this never really worked; it would be good to confirm that in (say) ManifoldCF release 1.0.

          Thanks,
          Karl
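
The "??????.txt" in the wire log is consistent with how Java encodes characters that a charset cannot represent. The sketch below is a stand-alone illustration, not HttpClient code; `HeaderCharsetDemo` and `roundTrip` are hypothetical names. The file name is ロンウイット.txt, written with Unicode escapes so the source compiles under any file encoding. `String.getBytes(Charset)` substitutes the charset's replacement byte for unmappable characters, which for US-ASCII is `?`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses the standard String/Charset API.
public class HeaderCharsetDemo {
    // The six katakana characters from the log, followed by the ASCII extension.
    static final String FILE_NAME = "\u30ED\u30F3\u30A6\u30A4\u30C3\u30C8.txt";

    // Encodes the text to bytes with the given charset and decodes it back again.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Each katakana character is unmappable in US-ASCII and becomes '?'.
        System.out.println(roundTrip(FILE_NAME, StandardCharsets.US_ASCII));
        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(FILE_NAME, StandardCharsets.UTF_8).equals(FILE_NAME));
    }
}
```

The six katakana characters become six question marks while ".txt" survives, matching the header seen in the wire log.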

          Karl Wright added a comment -

          Ah! Here it is:

          http://tools.ietf.org/html/rfc6266

          I will open an HttpClient ticket.
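
RFC 6266 lets an HTTP Content-Disposition header carry a non-ASCII file name in a `filename*` parameter, using the percent-encoded form defined in RFC 5987. The following is a rough sketch of that encoding only; `ExtValueSketch` and `encodeExtValue` are hypothetical names, and nothing in this ticket establishes that HttpClient or Solr's multipart parsing accepts `filename*` inside a multipart form section.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical encoder for the RFC 5987 ext-value form; not HttpClient or ManifoldCF code.
public class ExtValueSketch {
    // Characters RFC 5987 allows unescaped (attr-char).
    private static final String ATTR_CHARS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!#$&+-.^_`|~";

    // Produces UTF-8''<percent-encoded value>, escaping every byte outside attr-char.
    static String encodeExtValue(String value) {
        StringBuilder sb = new StringBuilder("UTF-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && ATTR_CHARS.indexOf((char) v) >= 0) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same file name as in the wire log, written with Unicode escapes.
        String fileName = "\u30ED\u30F3\u30A6\u30A4\u30C3\u30C8.txt";
        System.out.println("filename*=" + encodeExtValue(fileName));
    }
}
```

The encoded value is pure ASCII, so it would survive a 7-bit header path, at the cost of requiring the receiver to understand the `filename*` syntax.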

          Karl Wright added a comment -

          HTTPCLIENT-1372

          Karl Wright added a comment -

          I will try and find a way around needing a new HttpClient release, but probably this won't happen.

          Karl Wright added a comment -

          Hi Abe-san,

          It does not look like the HttpClient team will be willing to release a new 4.2.x HttpClient for this issue. Unfortunately, we cannot yet upgrade to HttpClient 4.3 because SolrJ still uses HttpClient 4.2, and there are significant changes between the clients.

          Before I put a lot of work into figuring out a workaround for HttpClient 4.2.x, can you please try to show whether ManifoldCF 1.0.1 worked properly with Solr, as far as the file name is concerned? If it did not, I think we can postpone this ticket. The file extension does make it through, and that is what Tika needs for decoding the document, so hopefully this is not going to be of critical importance.

          You will not be able to use wire debugging for this purpose, unfortunately, but you should be able to see what Solr does.

          Karl Wright added a comment -

          r1493058 turns on UTF-8 encoding for these headers. Unfortunately it also affects the content-type, so I am unsure whether it will work properly in a Japanese setting. Please give it a try and let me know what happens.

          Shinichiro Abe added a comment -

          Hi,
          With r1493058 the stream_name is no longer garbled, which is good,
          but the "id" and "resourcename" fields are odd. The Solr INFO output shows garbled characters.

          INFO  - 2013-06-17 11:08:30.636; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.id=file:/Users/abe/Desktop/test2/ロンウイット.txt&resource.name=ロンウイット.txt&wt=xml&version=2.2&literal.uri=/Users/abe/Desktop/test2/ロンウイット.txt} {add=[file:/Users/abe/Desktop/test2/ロンウイット.txt (1438053732817305600)]} 0 405
          

          Thank you.

          Karl Wright added a comment - - edited

          Hi Abe-san,

          I need to see what the wire logging looks like now, in order to decide where the problem is. Can you include the wire logging for one whole multipart form?

          The problem could be in several places:

          (1) The terminal encoding does not match the encoding of Solr's log output. You can set Java's default encoding with -Dfile.encoding=SJIS when you start Solr.
          (2) The POST itself is incorrect; the characters do not match the content-type stated in the multipart post.
          (3) Solr's form decoding is incorrect; it ignores or misinterprets the content-type and always decodes with one fixed charset.

          Thanks!
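To illustrate cases (2) and (3) above, here is a minimal, self-contained sketch of what happens when the sender encodes text with one charset and the receiver decodes it with another. The class and method names (`CharsetMismatchDemo`, `roundTrip`) are hypothetical and are not part of ManifoldCF, SolrJ, or HttpClient; the file name is the one from the Solr log in this ticket, written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    /** Encodes text with the sender's charset, then decodes the bytes with the receiver's. */
    static String roundTrip(String text, Charset sender, Charset receiver) {
        byte[] wire = text.getBytes(sender);   // what goes over the wire
        return new String(wire, receiver);     // what the receiver reconstructs
    }

    public static void main(String[] args) {
        // The Japanese file name from the ticket, followed by ".txt"
        String name = "\u30ED\u30F3\u30A6\u30A4\u30C3\u30C8.txt";

        // Both sides agree on UTF-8: the name survives unchanged.
        String ok = roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("same charset, intact: " + ok.equals(name));

        // Sender uses UTF-8 but the receiver assumes ISO-8859-1:
        // every UTF-8 byte becomes its own character, so the name is garbled.
        String garbled = roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("mismatch, intact: " + garbled.equals(name));
        System.out.println("original length " + name.length()
            + ", garbled length " + garbled.length());
    }
}
```

Case (1) is different in kind: there the data is correct and only the display is wrong, which is why wire logging is needed to tell the cases apart.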

          Karl Wright added a comment -

          Looking at the HttpClient code, it looks like BROWSER_COMPATIBLE mode only writes Content-Type for a section if there is a filename in that section:

                          MinimalField cd = part.getHeader().getField(MIME.CONTENT_DISPOSITION);
                          writeField(cd, this.charset, out);
                          String filename = part.getBody().getFilename();
                          if (filename != null) {
                              MinimalField ct = part.getHeader().getField(MIME.CONTENT_TYPE);
                              writeField(ct, this.charset, out);
                          }
          

          So it is likely that the non-filename parts of the form are not interpreted as UTF-8 on the Solr side. I don't think this is correct behavior for HttpClient, and there seems to be no way to override it.
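The effect of that branch can be sketched with a simplified model that needs no HttpClient dependency. Everything here is hypothetical (`MultipartHeaderSketch`, `Part`, `browserCompatibleHeaders`, `allHeaders`); it only mimics the quoted logic and is not HttpClient's code, nor necessarily what the eventual ManifoldCF change does.

```java
import java.util.ArrayList;
import java.util.List;

class MultipartHeaderSketch {
    /** Hypothetical stand-in for one form part; filename is null for plain fields. */
    static final class Part {
        final String name;
        final String filename;
        final String contentType;

        Part(String name, String filename, String contentType) {
            this.name = name;
            this.filename = filename;
            this.contentType = contentType;
        }
    }

    private static String disposition(Part p) {
        StringBuilder cd = new StringBuilder("Content-Disposition: form-data; name=\"")
            .append(p.name).append('"');
        if (p.filename != null) {
            cd.append("; filename=\"").append(p.filename).append('"');
        }
        return cd.toString();
    }

    /** Mimics the quoted branch: Content-Type is written only when a filename exists. */
    static List<String> browserCompatibleHeaders(Part p) {
        List<String> headers = new ArrayList<>();
        headers.add(disposition(p));
        if (p.filename != null) {
            headers.add("Content-Type: " + p.contentType);
        }
        return headers;
    }

    /** One conceivable alternative: always write Content-Type, so each part states its charset. */
    static List<String> allHeaders(Part p) {
        List<String> headers = new ArrayList<>();
        headers.add(disposition(p));
        headers.add("Content-Type: " + p.contentType);
        return headers;
    }

    public static void main(String[] args) {
        Part field = new Part("literal.id", null, "text/plain; charset=UTF-8");
        Part file = new Part("myfile", "test.txt", "application/octet-stream");

        System.out.println(browserCompatibleHeaders(field)); // no Content-Type line
        System.out.println(browserCompatibleHeaders(file));  // both header lines
        System.out.println(allHeaders(field));               // both header lines
    }
}
```

In this model a plain field such as `literal.id` carries no charset declaration at all, so the receiver has to guess how to decode its value.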

          Karl Wright added a comment -

          I can override this, but I will need to copy essentially all of HttpClient's multipart form code in a couple of classes. That's going to take some work; I won't be done for a couple of days most likely.

          Karl Wright added a comment -

          r1493700 creates functionally unmodified local forms of the HttpClient classes I will need to modify.

          Karl Wright added a comment -

          I will resolve this ticket when I'm done hacking.

          Karl Wright added a comment -

          r1493704 undoes the BROWSER_COMPATIBLE change, and causes the modified HttpClient classes to be used instead of the ones in httpmime.jar.

          Karl Wright added a comment -

          r1493717 encodes all headers with the specified charset. This should complete the work of this ticket. Abe-san, if you could synch up and try again I'd greatly appreciate it.

          Thanks!
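As a rough illustration of why the charset used for the header bytes matters, the sketch below encodes a header line with a caller-supplied charset and checks whether a value can be represented in it. The names (`HeaderCharsetDemo`, `encodeHeader`, `survives`) are hypothetical, and the choice of US-ASCII as the lossy charset is an assumption made for the example, not a statement about what any particular HttpClient version does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HeaderCharsetDemo {
    /** Serializes one header line to bytes using the given charset. */
    static byte[] encodeHeader(String name, String value, Charset charset) {
        return (name + ": " + value + "\r\n").getBytes(charset);
    }

    /** True if every character of the value can be represented in the charset. */
    static boolean survives(String value, Charset charset) {
        return charset.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        // The Japanese file name from the ticket, written with Unicode escapes
        String filename = "\u30ED\u30F3\u30A6\u30A4\u30C3\u30C8.txt";
        String value = "form-data; name=\"myfile\"; filename=\"" + filename + "\"";

        // UTF-8 can represent the name; US-ASCII cannot, so getBytes() would
        // silently substitute a replacement byte for each Japanese character.
        System.out.println("UTF-8 lossless: " + survives(value, StandardCharsets.UTF_8));
        System.out.println("US-ASCII lossless: " + survives(value, StandardCharsets.US_ASCII));

        byte[] utf8 = encodeHeader("Content-Disposition", value, StandardCharsets.UTF_8);
        byte[] ascii = encodeHeader("Content-Disposition", value, StandardCharsets.US_ASCII);
        System.out.println("UTF-8 bytes: " + utf8.length + ", US-ASCII bytes: " + ascii.length);
    }
}
```

Checking `survives` before sending is one way to detect a lossy encoding early instead of discovering it in the indexed field values.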

          Shinichiro Abe added a comment -

          Hi Karl,

          Sorry for the delay.
          I synced up with trunk and checked it. It works: "id", "resourcename" and "stream_name" all show the correct Japanese file name, not garbled.

          Thanks!


            People

            • Assignee:
              Karl Wright
              Reporter:
              Shinichiro Abe