SOLR-2731: CSVResponseWriter should optionally return numfound

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1, 3.3, 4.0-ALPHA
    • Fix Version/s: 3.1.1, 4.9, 5.0
    • Component/s: Response Writers

      Description

      An optional parameter, "csv.numfound=true", could be added to the request, causing the first line of the response to be the numFound value. This would have no impact on existing behavior, and those who are interested in that value can simply read off the first line before handing the rest to their usual CSV parser.
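The client-side handling the description suggests might look like the following sketch (Python stdlib only; the "csv.numfound" parameter is only proposed here, and the response body below is a made-up example):

```python
import csv
import io

# Hypothetical response body for a request with the proposed
# csv.numfound=true parameter; the first line is numFound, the rest
# is ordinary CSV. All values here are invented for illustration.
raw = "2038\nid,score\ndoc1,1.3\ndoc2,1.1\n"

buf = io.StringIO(raw)
num_found = int(buf.readline())   # read off the first line
rows = list(csv.DictReader(buf))  # hand the rest to the usual CSV parser
```

Nothing about the CSV parsing itself changes; the extra line is consumed before the parser ever sees the stream.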

      Attachments

      1. SOLR-2731.patch
        4 kB
        Jon Hoffman
      2. SOLR-2731-R1.patch
        5 kB
        Jon Hoffman

        Activity

        Yonik Seeley added a comment -

        It seems like if we go down this road, it should somehow be a more generic mechanism (since others will then want values like maxScore, etc).

        Here are some alternatives:

        numFound,maxScore,start
        2038,1.414,100
        id,score
        doc1,1.3
        doc2,1.1
        doc3,1.05
        
        numFound,2038,maxScore,1.414,start,100
        id,score
        doc1,1.3
        doc2,1.1
        doc3,1.05
        
        numFound=2038,maxScore=1.414,start=100
        id,score
        doc1,1.3
        doc2,1.1
        doc3,1.05
        
        #numFound=2038,maxScore=1.414,start=100
        id,score
        doc1,1.3
        doc2,1.1
        doc3,1.05
        
        

        Perhaps "numFound=2038,maxScore=1.414,start=100" would be the most human-readable (and it could alternately be commented, if that's supported).
        But the first option could be attractive since it's more in the spirit of CSV, and might be desirable if the output is imported into Excel, for example.
        Thoughts?

        Jon Hoffman added a comment -

        I like maintaining consistency with the CSV format because you don't have to reinvent any parsing logic. It should be pretty easy for the client developer to read off the first two lines and parse them with the same tool that's used for the rest of the document. Preferences around separator, newline, etc. can be reused (except that this meta header should perhaps always include a column-name header).

        What should the parameter be called? csv.metaheader?
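Reading off a two-line meta header with the same CSV tooling might look like this sketch (Python's stdlib csv module; the response body is a made-up example in the option-1 format shown earlier):

```python
import csv
import io

# Invented response in the option-1 layout: a meta header row and its
# values, followed by the normal document CSV.
raw = (
    "numFound,maxScore,start\n"
    "2038,1.414,100\n"
    "id,score\n"
    "doc1,1.3\n"
    "doc2,1.1\n"
)

buf = io.StringIO(raw)
reader = csv.reader(buf)
meta = dict(zip(next(reader), next(reader)))  # lines 1-2: names, then values
docs = list(csv.DictReader(buf))              # remainder parsed as usual
```

The same separator/newline preferences apply to both the meta header and the document rows, which is the appeal of keeping it plain CSV.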

        Jon Hoffman added a comment -

        To be clear, I like the first option best.

        Simon Rosenthal added a comment -

        In addition to loading CSV results into a spreadsheet, I often use CSV as a quick-and-dirty way of dumping the contents of an index to be re-read into Solr, and adding lines which would need manual removal would be rather inconvenient.

        I'd go for option 4, with the comment symbol and result metadata on one line. org.apache.commons.csv has an option (which is not currently enabled in the CSVRequestHandler) to recognize and discard comment lines - adding a request parameter to the handler to recognize comment lines would be straightforward, and would at least solve my use case, though I admit not all others.
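Discarding comment lines before parsing, as the commons-csv comment-handling option would do on the input side, can be sketched client-side like this (Python; the '#'-prefixed meta line is the option-4 format shown earlier, with invented values):

```python
import csv
import io

# Invented response in the option-4 layout: one commented meta line,
# then ordinary CSV.
raw = (
    "#numFound=2038,maxScore=1.414,start=100\n"
    "id,score\n"
    "doc1,1.3\n"
    "doc2,1.1\n"
)

# Drop comment lines, then parse the remainder normally; a round-trip
# back into Solr would need the same filtering on the input side.
lines = (ln for ln in io.StringIO(raw) if not ln.startswith("#"))
docs = list(csv.DictReader(lines))
```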

        Jon Hoffman added a comment -

        Simon,

        Keep in mind that this additional header would only appear if you asked for it via a request parameter like "csv.metaheader=true". Existing behavior would remain unchanged. Is that still a problem?

        Simon Rosenthal added a comment -

        Good point. In that case I'm agnostic - option 1) would be fine.

        Hoss Man added a comment -

        I think Yonik's 1st example would be the best for people loading the data into a spreadsheet tool or parsing with conventional CSV tools (even better than #2, because it's easy to cut/paste that data into a different sheet and still have clean separation between headers and data).

        But I would suggest that if we're at the point of thinking about having a "metadata" section and a "results" section, we shouldn't limit ourselves to two sections.

        Instead of just including metadata about the main doclist, we could allow arbitrary sections of arbitrary lengths (like facet counts). I haven't thought hard about what the params should look like, but for easy output parsing, a simple one-row/one-column row-count prefix telling you the number of (CSV) rows in each "section", followed by the (CSV) rows of data (including a header row for each section if "csv.header=true"), would be easy for people to parse (assuming they were expecting it because they asked for it).

        ie...

        2
        numFound,maxScore,start
        103,1.414,100
        4
        id,score
        doc1,1.3
        doc2,1.1
        doc3,1.05
        

        ..or if csv.header=false ...

        1
        103,1.414,100
        3
        doc1,1.3
        doc2,1.1
        doc3,1.05
        

        We can worry about what other "sections" might be supported later, as long as the basic param syntax gets fleshed out. I would suggest maybe something like:

        • multivalued "csv.section" param
        • sections are written out in the order that they are passed as param
        • default is "csv.section=results"
        • if only one value is specified for csv.section, then no row count prefix is used for that section
        • only one other value for csv.section supported initially: "csv.section=results.meta"
          • adds the numFound,maxScore,start for the results
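Parsing the row-count-prefixed sections described above might look like this sketch (Python; the response body reuses the example values from the comment, and the whole section layout is only a proposal, not shipped behavior):

```python
import csv
import io

# Invented response matching the proposed layout: each section is
# preceded by its row count (counts include the per-section header
# rows, since this example assumes csv.header=true).
raw = (
    "2\n"
    "numFound,maxScore,start\n"
    "103,1.414,100\n"
    "4\n"
    "id,score\n"
    "doc1,1.3\n"
    "doc2,1.1\n"
    "doc3,1.05\n"
)

buf = io.StringIO(raw)
sections = []
while True:
    count_line = buf.readline()
    if not count_line:
        break  # end of response
    count = int(count_line)  # number of CSV rows in the next section
    rows = list(csv.reader(buf.readline() for _ in range(count)))
    sections.append(rows)

meta = dict(zip(*sections[0]))  # pair the meta header row with its values
```

The row-count prefix means a client never has to guess where one section ends and the next begins, even when a section's values could be mistaken for a header.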
        Lance Norskog added a comment -

        -1

        • When you do the same query twice, the second time it usually takes 0ms. If it doesn't, turn on query caching.
        • You can code these variations with Velocity. I would stick with keeping the very simplest CSV output and then coding any additions yourself.
        Erik Hatcher added a comment -

        I'm mostly with Lance here, actually. I want pure CSV. So long as there is always an option (which should be the default) to keep the output pure CSV, then I'm OK with whatever extras folks want to add as options.

        We really should get the response writer framework able to return custom HTTP headers though.

        Erik Hatcher added a comment -

        Perhaps we could have an Excel response writer that could create a multi-sheet spreadsheet file?

        Jon Hoffman added a comment -

        This new patch includes the CSV-style metaheader with "numFound,maxScore,start".

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Solr 4.9.


  People

  • Assignee: Unassigned
  • Reporter: Jon Hoffman
  • Votes: 0
  • Watchers: 0

  Time Tracking

  • Original Estimate: 1h
  • Remaining Estimate: 1h
  • Time Spent: Not Specified