Solr > SOLR-3434

CSVRequestHandler does not trim header when using header=true&trim=true

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.6
    • Fix Version/s: None
    • Component/s: None
    • Environment:

      Linux

      Description

      When using header=true&trim=true, the field names in the header row are not trimmed.

      This is consistent with the documentation, but that doesn't mean it makes sense.

      It would be good to change this so trim=true also applies to the header row (at least by default).

        Activity

        Erick Erickson added a comment -

        Odd, it works fine for me, but I did have to remove the spaces after the header names. Is there anything else in the log file?

        Oh, and it is usually better, the first time you have a problem, to bring it up on the users list rather than in a JIRA; you tend to get faster responses there.

        Hoss Man added a comment -

        but I did have to remove the spaces after the header names

        That right there seems to be the crux of the issue.

        The header=true parsing is working fine, but the devil is in the details of the docs for the "trim" option...

        http://wiki.apache.org/solr/UpdateCSV#trim

        If true remove leading and trailing whitespace from values. ...

        ...it was only ever designed to trim the values, not the names of the fields in the header.

        using the 3.6 example, you can see this clearly with data like...

        |foo_s   |book_d_i   |id           |name_id_i
        |--------|-----------|-------------|-----------
        |20120420|      15600|   2070469502|      12787
        |20120420|      64400|   2070469503|      12787
        

        Which, when using header=true, generates a very clear error...

        SEVERE: org.apache.solr.common.SolrException: undefined field: "foo_s   "
        	at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1261)
        	at org.apache.solr.handler.CSVLoader.prepareFields(CSVRequestHandler.java:290)
        

        I suspect the reason david didn't get this kind of an error with his fields is because of a "*" dynamicField.

        I'm not sure that there is really a bug here since it's working as documented, but I think it would certainly make sense to enhance the handler to also trim the header if trim=true.
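
        The values-vs-header distinction can be sketched in plain shell (a hypothetical illustration of the documented behavior, not Solr's actual code; the data is the example from this issue):

```shell
# Hypothetical illustration: trim=true trims the *values* on each data
# line, but the field names taken from the header keep their padding.
header='|foo_s   |book_d_i   |id           |name_id_i'
line='|20120420|      15600|   2070469502|      12787'

# values: split on '|' and trim each token (what trim=true does)
printf '%s\n' "$line" | sed 's/^|//' | tr '|' '\n' | sed 's/^ *//;s/ *$//'

# header names: split only, no trimming, so "foo_s   " keeps its
# trailing spaces and fails the schema lookup
printf '%s\n' "$header" | sed 's/^|//' | tr '|' '\n'
```

        The second pipeline is why the error above complains about the field name "foo_s   " with trailing spaces intact.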

        Hoss Man added a comment -

        Two other things i wanted to explicitly note:

        • the error message that david is getting ("Document is missing mandatory uniqueKey field: "jq_idn") is most likely because of the "*" dynamicField I mentioned – all of the fields in the file are getting added with the verbatim field name (ie: "jq_idn_______") so then there is an error when the required fields aren't found
        • even if/when the header logic is improved to also trim, this file would still likely not do what you expect because of that second line (consisting of many "------" values) ... the loader will also try to index that line as a document, so if any of your field types have any constraints on them (ie: numbers or dates) that row will fail – you'll need to use the "skipLines" option to ignore it
        • you could work around this header trimming issue by listing the field names explicitly (using the "fieldnames" param), in addition to using skipLines=2
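
        Concretely, that last workaround might look something like this (a sketch only; the field names and the skipLines count are assumptions taken from the example file in this issue, so adjust to your data):

```shell
# Sketch of the workaround: declare field names explicitly and skip the
# header and separator lines, so no untrimmed header names are ever used.
# Field names here are assumed from the example file; adjust to your schema.
FIELDNAMES="busdate,book_id,jq_idn,name_id"
URL="http://localhost:8983/solr/update/csv?commit=true&separator=|&trim=true&header=false&skipLines=2&fieldnames=${FIELDNAMES}"
echo "$URL"
# curl --proxy "" "$URL" --data-binary @/tmp/file_with_header.txt \
#      -H 'Content-type:text/plain; charset=utf-8'
```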
        Hoss Man added a comment -

        edited summary & trimmed down description to reduce verbosity

        original issue description with formatting fixes

        The documentation says:
        header
        true if the first line of the CSV input contains field or column names. The default is header=true. If the fieldnames parameter is absent, these field names will be used when adding documents to the index.

        My command:

        /usr/bin/curl  --proxy ""   'http://localhost:8983/solr/update/csv?commit=true&debug=true&separator=|&escape=\&trim=true&header=true&overwrite=true'  --data-binary @/tmp/file_with_header.txt   -H 'Content-type:text/plain; charset=utf-8'
        

        My data file (/tmp/file_with_header.txt) :

        |busdate |book_id    |jq_idn       |name_id
        |--------|-----------|-------------|-----------
        |20120420|      15600|   2070469502|      12787
        |20120420|      64400|   2070469503|      12787
        |20120420|     100000|   2070469501|      12787
        |20120420|      60000|   2070469504|      12787
        |20120420|      60000|   2070538002|      12787
        |20120420|     206501|   2070538003|      12787
        |20120420|     199418|   2070538004|      12787
        |20120420|       7000|   2070538005|      12787
        

        schema.xml: (tried different variations)

            <field name="jq_idn" type="string" indexed="true" stored="true" required="false" />
            <uniqueKey>jq_idn</uniqueKey>
        

        Stack trace:

        SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: jq_idn
                at org.apache.solr.update.UpdateHandler.getIndexedId(UpdateHandler.java:118)
                at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:229)
                at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
                at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
                at org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:416)
                at org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHandler.java:431)
                at org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:393)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
                at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
                at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
                at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
                at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
                at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
                at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
                at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
                at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
                at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
                at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
                at org.mortbay.jetty.Server.handle(Server.java:326)
                at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
                at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
                at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
                at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
                at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
                at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
                at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
        
        david babits added a comment -

        Thank you Hoss and Erick for your help.

        I trimmed all headers, still can't get it to work.

        dynamicField commented out:

        trim=true&header=true&skipLines=1:
        =>undefined field: "--------"

        trim=true&header=true&skipLines=2:
        =>undefined field: "20120420"

        dynamicField uncommented: <dynamicField name="*" type="ignored" multiValued="true" />:
        =>Document is missing mandatory uniqueKey field: jq_idn

        The field is in the file. Does skipLines=2 also skip the header?

        david babits added a comment -

        <!-- uncomment the following to ignore any fields that don't already match an existing
             field name or dynamic field, rather than reporting them as an error.
             alternately, change the type="ignored" to some other type e.g. "text" if you want
             unknown fields indexed and/or stored by default -->

        Tried what the doc suggests:
        <dynamicField name="*" type="text" multiValued="true" />

        Got:
        SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype 'text' specified on field *
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:511)

        Erick Erickson added a comment -

        1> You should not need the dynamic field.
        2> pare down your CSV file and remove the spaces from the first line, and remove the second line
        OR
        specify skipLines=2 and, as Hoss said, explicitly define your fields on your URL using the fieldnames parameter (http://wiki.apache.org/solr/UpdateCSV#fieldnames)

        david babits added a comment -

        Yes, specifying fieldnames works, and worked yesterday too, I forgot to mention it.

        To close this out:
        My goal is to accept a random file, generated by extract from a database, and load it into Solr.
        Database extract comes with fields aligned, hence the white space in the header and values.
        I do not know the fieldnames ahead of time, so I was hoping to specify header=true&trim=true and have Solr take care of parsing.
        This proved not to work.
        Since I have to massage the data anyway to remove spaces, I might as well parse out the header line at the same time using sed and construct fieldnames variable.

        I also found that I need <dynamicField name="*" type="string" multiValued="true" /> since I do not know header up front, and can't rely on _s etc, and it wouldn't work otherwise.

        So, trim=true&header=false&skipLines=2&fieldnames=$fieldnames
        This is the workaround.
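
        The sed step described above might look like this (a sketch assuming an isql-style header like the one in this issue; in practice the header line would come from `head -1` on the extract file):

```shell
# Sketch: turn an isql-style header line into a fieldnames value by
# stripping the leading '|' and converting padded '|' separators to commas.
# (In practice: header=$(head -1 /tmp/file_with_header.txt))
header='|busdate |book_id    |jq_idn       |name_id'
fieldnames=$(printf '%s\n' "$header" | sed -e 's/^|//' -e 's/ *|/,/g' -e 's/ *$//')
echo "$fieldnames"   # busdate,book_id,jq_idn,name_id
```

        The result can then be passed as fieldnames=$fieldnames on the update URL.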

        My opinion is: 'trim' should be true by default, and certainly apply to both data and header, although I understand it would break backward-compatibility.

        Thanks again for your help.

        Yonik Seeley added a comment -

        My opinion is: 'trim' should be true by default, and certainly apply to both data and header, although I understand it would break backward-compatibility.

        IMO, we should always trim the headers - the CSV loader certainly never intended to support fieldnames with leading or trailing whitespace... this was simply a case of lacking any examples that added whitespace.

        As for your other problem, it seems like we also need a skipLinesAfterHeader option?

        david babits added a comment -

        I don't think skipLinesAfterHeader is needed.
        skipLines should do; I believe skipLines=1 is implicit when header=true, which is what the docs seem to say.
        What's needed is an example, in addition to the docs.
        My use case is generic – this is how the data looks when extracted from a database (using isql in my case) – so it shouldn't take a newbie two days of trial and error to figure out how to load it into Solr; an example, along with the necessary schema.xml change, would go a long way.

        skipLines

        Specifies the number of lines in the input stream to discard before the CSV data starts (including the header, if present). Default is skipLines=0.


          People

          • Assignee: Unassigned
          • Reporter: david babits
          • Votes: 0
          • Watchers: 0
