Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: Response Writers
    • Labels: None
    • Environment: indep. of env.

Description

As part of some work I'm doing, I put together a CSV Response Writer. It currently takes all the documents returned by a query and outputs their metadata in simple CSV format. The delimiter is configurable (by default, if there are multiple values for a particular field, they are separated with a | symbol).

Attachments

      1. SOLR-1925.Mattmann.053010.patch.txt
        10 kB
        Chris A. Mattmann
      2. SOLR-1925.Mattmann.053010.patch.2.txt
        10 kB
        Chris A. Mattmann
      3. SOLR-1925.Mattmann.053010.patch.3.txt
        10 kB
        Chris A. Mattmann
      4. SOLR-1925.Mattmann.061110.patch.txt
        10 kB
        Chris A. Mattmann
      5. SOLR-1925.Chheng.071410.patch.txt
        10 kB
        tommy chheng
      6. SOLR-1925.patch
        5 kB
        Yonik Seeley
      7. SOLR-1925.patch
        22 kB
        Yonik Seeley
      8. SOLR-1925.patch
        32 kB
        Yonik Seeley

Activity

          Erik Hatcher added a comment -

          I was just thinking of writing this very thing the other day.

          I think this should use the same default delimiters and header as the CSV update handler does, so that data is easily ingested and output in the same format (provided the field data is stored, of course).

          Chris A. Mattmann added a comment -

          Hey Erik, cool!

          Sure, I'd love to collaborate with you on this. Patch forthcoming, then let's work it...

          Cheers,
          Chris

          Yonik Seeley added a comment -

          This is something that some people have asked for since my CNET days... I thought there was already an open issue for this, but I can't seem to find it (so I guess not!)

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Chris A. Mattmann added a comment -

          Okey dok, here's the patch, I'll post some sample queries and response writer config to show how it's used in one sec.

          Chris A. Mattmann added a comment - edited

          Hey Guys:

          Here are some samples of how to call it:

          This example queries Solr for children's hospital, turns on CSV output, and requests the fields site_id, id, and agency_name:

          curl "http://localhost:8080/solr/select/?q=children%27s%20AND%20hospital&version=2.2&start=0&rows=10&indent=on&wt=csv&fl=site_id,id,agency_name"

          This example queries Solr for children's hospital, turns on CSV output, requests the fields site_id, id, and agency_name, and then changes the default delimiter to a semicolon (only in the context of this request):

          curl "http://localhost:8080/solr/select/?q=children%27s%20AND%20hospital&version=2.2&start=0&rows=10&indent=on&wt=csv&fl=site_id,id,agency_name&delimiter=;"

          This example queries Solr for children's hospital, turns on CSV output, requests the fields site_id, id, and agency_name, and then specifies (by turning Excel mode off) that CR LF should be left inside fields and not replaced (only in the context of this request):

          curl "http://localhost:8080/solr/select/?q=children%27s%20AND%20hospital&version=2.2&start=0&rows=10&indent=on&wt=csv&fl=site_id,id,agency_name&excel=false"
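
          For illustration, the first request might return something along these lines (the rows and values here are hypothetical; as discussed further down, this iteration of the writer encapsulates every value in double quotes):

            "site_id","id","agency_name"
            "s001","121","Children's Hospital"
            "s002","122","Hospital for Sick Children"
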
          Chris A. Mattmann added a comment -
          • Small bug in my initial patch where the field column headers didn't respect the "delimeter" parameter. Fixed.
          Erik Hatcher added a comment -

          Shouldn't that be spelled "delimiter"? Or are we hossifying this thing?

          Chris A. Mattmann added a comment -

          haha, crap, I can't spell. Hold on, let me fix it... I caught some typos in the Javadoc too, so I'm fixing those as well!

          Chris A. Mattmann added a comment -

          Fix typos: nice catch, Erik!

          Kevin Black added a comment -

          Thanks for this response writer!
          Question (not sure if this is a universal problem, or a problem with my Solr install):
          if a document is missing a field value, is the CSV output missing a column?

          e.g.
          Doc 1: id=121; field1=string1.1; field2=string1.2; field3=string1.3
          Doc 2: id=122; field1=string2.1; field3=string2.3 [field2 is empty with no value]
          Doc 3: id=123; field1=string3.1; field2=string3.2; field3=string3.3

          my CSV output is:

          id,field1,field2,field3
          121,string1.1,string1.2,string1.3
          122,string2.1,string2.3
          123,string3.1,string3.2,string3.3

          For the 2nd record, the 3rd field appears in the 2nd column, which annoys the user who downloads the CSV data.

          Chris A. Mattmann added a comment -

          Hi Kevin,

          Thanks for the feedback!

          I hear ya on the annoying part. Right now the CSVResponseWriter has the following behavior:

          • if you define output fields using fl=field1,field2 as part of your Solr URL request, then you can control how the columns are displayed and ensure that a uniform set of columns is output in the CSV, based on your indexed documents
          • if you don't define output fields, the response writer simply assumes that your indexed documents have uniform fields, and it will write the output fields based on the fields indexed per document

          The assumption might be limiting, but it was the best I could think of for the case where the requested output columns aren't uniformly specified.
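
          For illustration, with fl=id,field1,field2,field3 the intent for Kevin's example above is aligned columns, with a blank placeholder for any missing value (a sketch of the intended output, not a verbatim capture):

            id,field1,field2,field3
            "121","string1.1","string1.2","string1.3"
            "122","string2.1","","string2.3"
            "123","string3.1","string3.2","string3.3"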

          Hope that helps clarify what you're seeing! Of course, ideas for improvement are always welcome.

          Cheers,
          Chris

          Kevin Black added a comment -

          Hi Chris,

          In your 1st point, are you saying that defining "fl=field1,field2,field3" will populate a column with empty quotes if field2 is missing in a document?

          However, in my URL I have "&fl=id,field3,field1,field2", but since some documents do not have field2 data in the index, the output is missing columns (quotes not shown, for clarity):

          id,field1,field3
          121,string1.1,string1.3
          122,string2.1,string2.3
          123,string3.1,string3.2,string3.3

          In this example, both the 1st and 2nd documents are missing "field2" data, so 'field2' does not even appear in the header of my output, and neither record is populated with an empty placeholder for "field2".

          Hopefully I'm being clear with my example.

          Thanks!

          Chris A. Mattmann added a comment -

          Hey Kevin,

          Oh ok, then yep, that's a bug; I can fix that. I'll throw up a new patch for it today. It should still output a blank...

          Cheers,
          Chris

          Chris A. Mattmann added a comment -

          Updated patch forthcoming. I tested it out with, e.g.:

          http://localhost:8080/solr/select/?wt=csv&excel=true&delimeter=;&q=%28taxonomy_term%3A%22Medicaid+Applications%22%29&start=0&rows=10&fl=site_id,site_name,site_hours,foo

          where I knew that "foo" wasn't a valid field, and thus would have no value in the resultant doc.
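
          The expectation being that each row then ends with an empty placeholder for foo, e.g. (values hypothetical):

            "s001","Main Street Clinic","9am-5pm",""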

          Chris A. Mattmann added a comment -
          • Updated patch that addresses the comments from Kevin.
          Chris A. Mattmann added a comment -
          • Oops, named it wrong the first time I attached it.
          Kevin Black added a comment -

          Beautiful! Thanks very much Chris!

          tommy chheng added a comment -

          Thanks for this!
          I fixed the path of the queryResponseWriter class in the example solrconfig.xml. This applied successfully against Solr 4.0 trunk.

          A few quirks:

          • When I didn't specify a default delimiter, it printed out null as the delimiter. I couldn't figure out why, because init(NamedList args) specifies it'll use a default of ","
            "organization"null"2"null"
          • If I don't specify the column names, the output doesn't put in empty "" correctly,
            e.g. the output has a mismatched number of commas:
            "organization","1","Test","Name","2"," ","2000000","8",
            "organization","4","Solar","4","0",
          Chris A. Mattmann added a comment -

          Hi Tommy:

          > I fixed the path of the queryResponseWriter class in the example solrconfig.xml. This applied successfully against Solr 4.0 trunk.

          What does "fixing the path" mean?

          Onto your comments below:

          > When I didn't specify a default delimiter, it printed out null as the delimiter. I couldn't figure out why, because init(NamedList args) specifies it'll use a default of "," "organization"null"2"null"

          It didn't do that for me. Where did you initialize your response writer? Note that the patch I attached included updates to solrconfig.xml, where the default delimiter is provided as an initParam.
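
          For reference, the relevant initParam block from the patch's solrconfig.xml (visible in the diff further down) is essentially:

            <queryResponseWriter name="csv" class="org.apache.solr.response.CSVResponseWriter">
              <str name="delimiter">,</str>
            </queryResponseWriter>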

          > If I don't specify the column names, the output doesn't put in empty "" correctly, e.g. the output has a mismatched number of commas: "organization","1","Test","Name","2"," ","2000000","8", "organization","4","Solar","4","0",

          Yep, that's the intention. It's because Lucene/Solr documents can have an arbitrary number of fields in them. I saw no good, intuitive way to overcome this, so unless you tell me what the default field name list is, the writer won't do anything special (because I believe doing so would be somewhat limiting and non-generic); it will simply output the fields that are in the doc, which can be a non-uniform number, causing what you're seeing. To say that's not correct, well, I don't agree with that.

          I tried diffing your latest patch against my latest:

          [chipotle:~/Desktop/Apache/solr-dev] mattmann% diff -u SOLR-1925.Chheng.071410.patch.txt SOLR-1925.Mattmann.061110.patch.txt
          --- SOLR-1925.Chheng.071410.patch.txt	2010-07-14 15:22:57.000000000 -0700
          +++ SOLR-1925.Mattmann.061110.patch.txt	2010-06-11 20:34:58.000000000 -0700
          @@ -1,3 +1,5 @@
          +### Eclipse Workspace Patch 1.0
          +#P solrcene
           Index: solr/src/java/org/apache/solr/response/CSVResponseWriter.java
           ===================================================================
           --- solr/src/java/org/apache/solr/response/CSVResponseWriter.java	(revision 0)
          @@ -259,7 +261,7 @@
                <int name="xsltCacheLifetimeSeconds">5</int>
              </queryResponseWriter>
           +  
          -+  <queryResponseWriter name="csv" class="org.apache.solr.response.CSVResponseWriter">
          ++  <queryResponseWriter name="csv" class="solr.CSVResponseWriter">
           +    <str name="delimiter">,</str>
           +    <!--  this specifies that
           +          you are writing CSV that you expect to load into M$
          @@ -290,7 +292,6 @@
                m.put("raw", new RawResponseWriter());
                m.put("javabin", new BinaryResponseWriter());
           +    m.put("csv", new CSVResponseWriter());
          -     m.put("velocity", new VelocityResponseWriter());
                DEFAULT_RESPONSE_WRITERS = Collections.unmodifiableMap(m);
              }
              
          [chipotle:~/Desktop/Apache/solr-dev] mattmann% 
          

          All you did was use the virtual solr package resolution string instead of the fully qualified class name. Either one works fine. Also, I'm not sure I get the part about removing Velocity from the default response writer map...

          Cheers,
          Chris

          Yonik Seeley added a comment -

          I took a quick look... here are some of the issues I see:

          • loses info by removing newlines
          • always encapsulates with quotes - not as readable
          • doesn't escape encapsulator in values
          • doesn't escape separator in multi-valued fields
          • isn't really nested CSV, so it's not compatible with the CSVLoader
          • uses System.getProperty("line.separator")... we should avoid different behavior on different platforms
          • doesn't stream documents (dumping your entire index will be one use case)
          • performance: patterns shouldn't be compiled per-doc
          Lance Norskog added a comment -

          Please make sure that it does not print floats/doubles in scientific notation, but sticks with the canonical form. That is, it should print 0.0000001 as written, not '1e-7'.

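          For context, a minimal sketch of the problem (and one possible workaround), assuming Java's default formatting is what produces the scientific notation:

            import java.math.BigDecimal;

            public class FloatFormatDemo {
                public static void main(String[] args) {
                    // Double.toString switches to scientific notation for small magnitudes
                    System.out.println(Double.toString(0.0000001)); // prints "1.0E-7"

                    // one way to keep the canonical plain form
                    System.out.println(new BigDecimal("0.0000001").toPlainString()); // prints "0.0000001"
                }
            }
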
          Chris A. Mattmann added a comment -

          Hi Yonik:

          Thanks. Replies below:

          > loses info by removing newlines

          Only does this when excel=true, and it actually adds functionality in doing so (without it, you can't load the data into Excel; see my comments above and in the code).

          > always encapsulates with quotes - not as readable

          See the CSV spec, via the Wikipedia link in the code. Doing so reduces ambiguity, and clearly delineates where a value starts and where it stops.

          > doesn't escape encapsulator in values

          Is there a need to do this? I don't think so...

          > doesn't escape separator in multi-valued fields

          Same as above: no need, really.

          > isn't really nested CSV, so it's not compatible with the CSVLoader

          What do you mean, not compatible with the CSV loader?

          > uses System.getProperty("line.separator")... we should avoid different behavior on different platforms

          Hmm, I've never been dinged before for writing platform-independent code. That's why they put the property in there: line.separator means the same thing, programming-construct-wise, across platforms. So I don't really get your ding here.

          > doesn't stream documents (dumping your entire index will be one use case)

          I actually implemented both the streaming method (#writeDoc) and the aggregate method (#writeAllDocs). I set #isStreaming to false because it makes for clean CSV header writing, rather than hacky code in #writeDoc to take care of the (potential) non-uniformity. Additionally, I'm using this in production right now, on the solr-1.5 branch with an index of over 1M documents, and the write overhead is quite low.

          > performance: patterns shouldn't be compiled per-doc

          This only matters when excel=true, and I think the performance hit isn't really an issue. If you feel strongly about it, though, we could always compile the pattern above the loop and reuse it...

          Yonik Seeley added a comment -

          Excel (at least the version I just tried) handled embedded newlines just fine.

          > > always encapsulates with quotes - not as readable
          > See the CSV spec, via Wikipedia in the links in the code

          AFAIK, the CSV spec doesn't recommend always using encapsulators.

          > > doesn't escape encapsulator in values
          > Is there a need to do this? I don't think so...

          Proper escaping is an absolute necessity. You can't represent arbitrary text field values without it.

          > > What do you mean not compatible with CSV loader?

          If we do things correctly, we should be able to round-trip with http://wiki.apache.org/solr/UpdateCSV

          > > uses System.getProperty("line.separator")... we should avoid different behavior on different platforms
          > Hmm, I've never been dinged before for writing platform independent code.

          Having a server process act differently on different hosts is bad. We strive to never use the default locale - it's a recipe for non-portability. All file encodings (stopword lists, etc) default to UTF-8 instead of the system locale. Date and number formatting is standardized and does not use the system locale. We missed some of these in the past (and sure enough, Solr wouldn't work properly when installed on a machine of a certain locale), but Robert cleaned all that up.

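          Concretely, the usual CSV escaping rule is to encapsulate any field that contains the encapsulator and double the embedded quote; a hypothetical field value of He said "hi" would be emitted as:

            "He said ""hi"""
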
          Chris A. Mattmann added a comment -

          > Excel (at least the version I just tried) handled embedded newlines just fine.

          Well, not for me. I'm using MS Office 2008 on Mac OS X 10.5.6. I also tried Office XP SP2, and saw the same behavior, on a Win XP SP2 instance I have running in VMware. What version are you looking at?

          > AFAIK, the CSV spec doesn't recommend always using encapsulators.

          See here: http://en.wikipedia.org/wiki/Comma-separated_values, 1st Paragraph:

          > Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes.

          Since we don't know what the contents of each field's value are, it's best to account for that by encapsulating within double quotes. This doesn't break anything, and arguably isn't any uglier than going without (that's a judgment call).

          > Proper escaping is an absolute necessity. You can't represent arbitrary text field values without it.

          How would you recommend doing so?

          > If we do things correctly, we should be able to round-trip with http://wiki.apache.org/solr/UpdateCSV

          What's your rationale that this isn't compatible with that? Have you tried it? Also, I think that's a good thing to make happen in the end, but not a blocker to getting this into the sources. My rationale is that, for instance, XML given to Solr doesn't always round-trip through the XMLResponseWriter (especially if the schema weeds out fields, discards them, etc.).

          > Having a server process act differently on different hosts is bad. We strive to never use the default locale - it's a recipe for non-portability. All file encodings (stopword lists, etc) default to UTF-8 instead of the system locale. Date and number formatting is standardized and does not use the system locale. We missed some of these in the past (and sure enough, Solr wouldn't work properly when installed on a machine of a certain locale), but Robert cleaned all that up.

          Admittedly, I'm not an expert here, so I'll take your word for it. What's the host-independent way to do System.getProperty("line.separator")?

          Yonik Seeley added a comment -

          I tried Excel 2003 and Excel 2007 - both work fine with embedded newlines. Perhaps you're not encoding your test file correctly.

          > > AFAIK, the CSV spec doesn't recommend always using encapsulators.
          > See here: http://en.wikipedia.org/wiki/Comma-separated_values, 1st Paragraph:
          > Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes.

          I'll repeat: "the CSV spec doesn't recommend always using encapsulators". Why do you keep suggesting that it does?

          > > If we do things correctly, we should be able to round-trip with http://wiki.apache.org/solr/UpdateCSV
          > What's your rationale that this isn't compatible with that? Have you tried it?

          I didn't need to try it... I just looked at this patch, which doesn't do proper CSV encoding/escaping.

          > What's the host-independent way to do System.getProperty("line.separator")?

          You pick one (like \n)... if there is a need for a different one, you let it be configured / specified by the client.

          Chris A. Mattmann added a comment -

          > I tried Excel 2003 and Excel 2007 - both work fine with embedded newlines. Perhaps you're not encoding your test file correctly.

          Huh? It has nothing to do with any test file; there is no test file.

          Process

          1. Load data into Solr with embedded newlines
          2. Query Solr with wt=csv and save the results to a file, without specifying excel=true so embedded newlines are preserved (see the example after this list)
          3. Load .csv file into Excel on the platforms I mentioned and watch it break
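
          For step 2, the command would be something like:

            curl "http://localhost:8080/solr/select/?q=*:*&wt=csv&rows=100" > results.csv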

          You continue to omit the platforms you're testing on. Why do you continue to do this?

          > I'll repeat: "the CSV spec doesn't recommend always using encapsulators". Why do you keep suggesting that it does?

          Because your statement that it doesn't recommend always using encapsulators neglects to take into account my perspective. Also, the spec isn't black and white, as you imply. There is room for interpretation. I believe my interpretation allows for the most flexibility.

          > I didn't need to try it... I just looked at this patch, which doesn't do proper CSV encoding/escaping.

          Right, and you selectively quoted me. If you're going to quote me, include the whole quote. Look at the 2nd part of what I wrote, which mentions that there are plenty of cases with Solr data loading and Response Writers where it doesn't round-trip.

          > You pick one (like \n)... if there is a need for a different one, you let it be configured / specified by the client.

          Hmmm, not sure I understand this statement.

          All I know is that regardless, I'm using this and it's been working fine for me in production for weeks by applying this patch to branch-1.5.

          Robert Muir added a comment -

          Hi, what happens if a field has a double quote in it?

          Chris A. Mattmann added a comment -

          > Hi, what happens if a field has a double quote in it?

          Good question. I haven't seen any glaring bad things that have happened in my use of it on a moderately sized dataset of around 1M documents.

          Robert Muir added a comment -

          > Good question. I haven't seen any glaring bad things that have happened in my use of it on a moderately sized dataset of around 1M documents.

          I don't see any code that properly "quotes the quotes". So if the data has even one double quote in it, I think the output is wrong.
          Personally, I think the simplest solution to many of these problems is to use a CSV lib instead (preferably the same one used on the parsing side).

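          For what it's worth, a minimal sketch of that approach against the released Commons CSV 1.x API (which differs from the sandbox snapshot bundled at the time, so treat the exact calls as an assumption):

            import java.io.StringWriter;
            import org.apache.commons.csv.CSVFormat;
            import org.apache.commons.csv.CSVPrinter;

            public class CsvQuoteDemo {
                public static void main(String[] args) throws Exception {
                    StringWriter out = new StringWriter();
                    // CSVFormat.DEFAULT encapsulates any field that needs it and
                    // doubles embedded double quotes, RFC 4180 style
                    try (CSVPrinter printer = new CSVPrinter(out, CSVFormat.DEFAULT)) {
                        printer.printRecord("organization", "He said \"hi\"");
                    }
                    System.out.print(out); // organization,"He said ""hi""" (plus a record separator)
                }
            }
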
          Chris A. Mattmann added a comment -

          > I don't see any code that properly "quotes the quotes". So if the data has even one double quote in it, I think the output is wrong.

          Yeah I guess it's b/c I'm looking at loading the data into Excel, and what's in there seems to work for doing that.

          > Personally, I think the simplest solution to many of these problems is to use a CSV lib instead (preferably the same one used on the parsing side).

          That's probably the easiest thing to do too. I was just looking to do it w/o adding any dependencies to the core.

          Robert Muir added a comment -

          > That's probably the easiest thing to do too. I was just looking to do it w/o adding any dependencies to the core.

          There is already one present (commons-csv-1.0-SNAPSHOT-r609327.jar)

          Chris A. Mattmann added a comment -

          Hi Robert:

          > There is already one present (commons-csv-1.0-SNAPSHOT-r609327.jar)

          Ah cool, I didn't know that! w00t.

          Cheers,
          Chris

          Yonik Seeley added a comment -

          Here's a patch that produces output that can correctly round-trip with the CSV loader.
          It uses the current version of Commons CSV, including the CSVPrinter overhaul I did in SANDBOX-322.

          Parameters:
          csv.encapsulator (defaults to ")
          csv.escape (defaults to none)
          csv.separator (defaults to ,)
          csv.header (defaults to true... if false, we skip printing out the column headers)
          csv.newline (defaults to \n)
          csv.null (defaults to "")
          // csv.mv params are the defaults for how multi-valued fields are encoded
          // per-field overrides can be done via f.myfield.csv.separator=|
          csv.mv.encapsulator (defaults to none)
          csv.mv.escape (defaults to \)
          csv.mv.separator (defaults to csv.separator)

          Notes:

          • works on fields in the index that aren't even defined in the schema
          • maintains the order of fields passed in by the user (and allows the header to be omitted)
          • efficiently streamable... avoids intermediate creation of SolrDocument instances, and reuses a single buffer & writer across all instances of multi-valued fields for the sub-CSVPrinters

          I'll start adding some tests now.
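
          For illustration, a request exercising a few of these parameters (the field names id, name, and cat are hypothetical):

            curl "http://localhost:8080/solr/select/?q=*:*&wt=csv&fl=id,name,cat&csv.separator=%3B&csv.header=false&f.cat.csv.separator=|"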

          Yonik Seeley added a comment -

          Here's the final patch w/ tests.
          I think we're all ready to go... I'll commit this along with updating the commons-csv lib soon.
          I think it makes sense to backport this to 3.x too?

          Yonik Seeley added a comment -

          Committed to trunk and merged to 3x.
          TODO: docs

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Sirisha added a comment -

          How can we apply the CSVResponseWriter to a copyField? For example, when we have phone and altphone (a copyField), retrieving with the CSVResponseWriter returns "123-345-3456,126-737-5838", but when we want to load this file back into Solr it stores the whole value in the phone field as one record. How can we parse them separately, so that phone contains 123-345-3456 and altphone (the copyField) contains 126-737-5838?


People

    • Assignee: Erik Hatcher
    • Reporter: Chris A. Mattmann
    • Votes: 0
    • Watchers: 4
