Solr / SOLR-15041

CSV update handler can't handle line breaks/new lines together with field split/separators for multivalued fields

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.4
    • Fix Version/s: None
    • Component/s: update
    • Environment: Ubuntu 20.04 8 CPU 60GB+ ram

    Description

      I've been using the /update/csv handler to bulk import large amounts of data with great success, but I believe I've found a corner case in the CSV parsing: a multi-valued string field whose value contains a new-line character.

      As soon as you specify f.[fieldname].split=true&f.[fieldname].separator=[something], the multi-value split parsing stops at the first line break.

      My managed schema:

      <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
      <fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
      <dynamicField name="*_str" type="string" indexed="true" stored="false" />    
      <dynamicField name="*_strs" type="strings" indexed="true" stored="false"/>

      Example POST URL (I'm using ! as the split character for test1_strs and test2_strs):

      http://[myserver]/solr/[mycore]/update/csv?commitWithin=1000&f.test1_strs.split=true&f.test1_strs.separator=!&f.test2_strs.split=true&f.test2_strs.separator=!
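
For reference, the per-field parameters in that URL can be assembled programmatically. The following is a minimal sketch, not part of the original report: the helper name build_csv_update_url, the host http://localhost:8983 and the core name mycore are placeholders chosen for illustration. Note that urlencode percent-encodes the ! separator as %21, which is equivalent on the wire.

```python
from urllib.parse import urlencode


def build_csv_update_url(base_url, core, split_fields, separator="!", commit_within_ms=1000):
    """Build an /update/csv URL with split/separator parameters per field.

    base_url and core are placeholders; the parameter names
    (commitWithin, f.<field>.split, f.<field>.separator) are the ones
    used in the report above.
    """
    params = [("commitWithin", str(commit_within_ms))]
    for field in split_fields:
        params.append((f"f.{field}.split", "true"))
        params.append((f"f.{field}.separator", separator))
    return f"{base_url}/solr/{core}/update/csv?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
url = build_csv_update_url("http://localhost:8983", "mycore", ["test1_strs", "test2_strs"])
print(url)
```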

      CSV content (note that the new-lines are enclosed in double quotes; they need to be preserved as-is):

      id,title,test1_strs,test2_strs,test3_str
      csv_test,title,"first line
      with break!second line","first line!second_line","a line
      break"
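
As a sanity check that the payload above is well-formed CSV, it can be parsed outside Solr. This is a small sketch using Python's standard csv module (an independent parser, not Solr's); it only shows that the quoted new-lines belong to the field values and that the data row has five fields. The sketch uses \n line endings as an assumption.

```python
import csv
import io

# The CSV payload from the report, with \n line endings assumed.
CSV_TEXT = (
    "id,title,test1_strs,test2_strs,test3_str\n"
    'csv_test,title,"first line\n'
    'with break!second line","first line!second_line","a line\n'
    'break"\n'
)

# newline="" keeps the embedded line breaks untouched while reading.
rows = list(csv.reader(io.StringIO(CSV_TEXT, newline="")))
header, record = rows
doc = dict(zip(header, record))

for name, value in doc.items():
    print(f"{name}: {value!r}")
```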
      

      Resulting Solr Doc:

      {
              "id":"csv_test",
              "title":"title",
              "_version_":1685718010076069888,
              "test1_strs":["first line "], 
              "test2_strs":["first line", "second_line"],
              "test3_str":"a line\r\nbreak"
      }
      

      Note that in the single-valued test3_str the new-line is correctly preserved as \r\n (or just \n when this is done via code instead of manually).

      test2_strs shows that the multi-value split on ! worked correctly.

      For test1_strs, processing stops at the new-line inside the first value instead of continuing to the actual separator after it, so the rest of the first value and the entire second value are lost.
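
One way to get a similar symptom, shown here purely as an illustration, is to re-parse the field value as its own single-record CSV document with ! as the delimiter, so that the embedded new-line is read as the end of the record. The sketch below uses Python's csv module to mimic that; it is a hypothesis about the kind of behavior that could produce this result, not a description of how Solr's CSV loader is actually implemented.

```python
import csv
import io

# Field value after the outer CSV parse (quotes already removed).
value = "first line\nwith break!second line"

# Hypothetical sub-parse: treat the value as a one-record CSV document
# delimited by "!". The reader stops at the line break, so everything
# after it never reaches the split step.
sub_parsed = next(csv.reader(io.StringIO(value, newline=""), delimiter="!"))

print(sub_parsed)
```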

      Expected values should look like:

      {
              "id":"csv_test",
              "title":"title",
              "_version_":1685718010076069888,
              "test1_strs":["first line\r\nwith break", "second line"], 
              "test2_strs":["first line", "second_line"],
              "test3_str":"a line\r\nbreak"
      }
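
The expected semantics can be stated as: parse the CSV record first, then split only the configured fields on their separator, leaving any line breaks inside the values alone. A minimal sketch of that expectation follows; the function name split_multivalued and the dictionary layout are illustrative assumptions, not Solr code.

```python
def split_multivalued(doc, separators):
    """Split the configured fields on their separator and keep all
    other fields (and any embedded line breaks) unchanged."""
    result = {}
    for name, value in doc.items():
        if name in separators:
            result[name] = value.split(separators[name])
        else:
            result[name] = value
    return result


# Field values as they look after the outer CSV parse.
parsed = {
    "id": "csv_test",
    "title": "title",
    "test1_strs": "first line\r\nwith break!second line",
    "test2_strs": "first line!second_line",
    "test3_str": "a line\r\nbreak",
}

expected_doc = split_multivalued(parsed, {"test1_strs": "!", "test2_strs": "!"})
print(expected_doc)
```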
      

       
      I've tried pre-escaping the line breaks, but all that gives me is the escaped new-line in Solr, which would then have to be post-processed on the consuming end to turn it back into \r\n (or \n), and that would be nontrivial to do. Solr handles \n just fine in all other cases, so I consider preserving it the expected behavior here as well.

       

       

       

      People

        Assignee: Unassigned
        Reporter: Matt Hov (mhov)