Solr / SOLR-2549

DIH LineEntityProcessor support for delimited & fixed-width files

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Labels:
      None

      Description

      Provides support for fixed-width and delimited files without needing to write a Transformer.

      The following XML properties are supported by this version of LineEntityProcessor:

      For fixed-width files:

      • colDef[#]

      For delimited files:

      • fieldDelimiterRegex
      • firstLineHasFieldnames
      • delimitedFieldNames
      • delimitedFieldTypes

      These properties are described in the API documentation included in the patch.

      When combined with the cache improvements from SOLR-2382, this allows you to join a flat-file entity with other entities (SQL, etc.).

      1. SOLR-2549.patch
        23 kB
        zakaria benzidalmal
      2. SOLR-2549.patch
        23 kB
        James Dyer
      3. SOLR-2549.patch
        16 kB
        James Dyer
      4. SOLR-2549.patch
        18 kB
        James Dyer
      5. v400-SOLR-2549.patch
        24 kB
        zakaria benzidalmal

          Activity

          James Dyer added a comment -

          This patch depends on the enum class "DIHCacheTypes.java" from SOLR-2382, which is included here for convenience. Should this issue be considered for commit without SOLR-2382, the class could be renamed and included here by itself. This is the only dependency on SOLR-2382.

          This patch includes unit tests for delimited and fixed-width files.

          James Dyer added a comment -

          Here is a version synced with the current trunk.

          Pulkit Singhal added a comment - edited

          @jdyer Can you please post some data-config.xml samples in the comments?
          The patch docs are good, but samples would also be very helpful, if you don't mind.

          James Dyer added a comment -

          A long time ago someone on the users' list asked for better support for delimited files. This version supports most of the same features as the CSVRequestHandler, using the same CSV parser and most of the same parameter names.

          The reason for using DIH instead of CSVRequestHandler would be cases where the flat file needs to be joined to other entities, the data needs to be cached, and/or transformers need to be applied.

          This patch also retains the same support for fixed-width files.

          The unit tests have been enhanced to test these new possibilities.
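
A minimal sketch of the transformer case mentioned above, assuming the patched LineEntityProcessor accepts the CSVRequestHandler-style `header` and `separator` parameters on the entity. The base URL, the file name, and the `ID` column are hypothetical; TemplateTransformer and the `${entity.COLUMN}` placeholder syntax are standard DIH features.

```xml
<dataConfig>
  <!-- Hypothetical location of the flat file -->
  <dataSource name="FILES" type="URLDataSource" baseUrl="file:///path/to/input/" />
  <document>
    <!-- Assumes the first line of data.csv names its columns, one of them ID -->
    <entity name="sites"
            processor="org.apache.solr.handler.dataimport.LineEntityProcessor"
            dataSource="FILES"
            url="data.csv"
            header="true"
            separator=","
            transformer="TemplateTransformer">
      <!-- Builds the Solr id from the ID column of each parsed line -->
      <field column="id" template="site-${sites.ID}" />
    </entity>
  </document>
</dataConfig>
```

Applying a transformer to each parsed line in this way is one of the cases where DIH fits better than posting the file to the CSV update handler.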

          James Dyer added a comment -

          The dependency here on SOLR-2943 is only for the "DIHCacheTypes" enum, which defines a data type for each column of flat-file data. This is particularly helpful when joining to SQL data sources, as DIH requires the join keys to be of the same type. It might be beneficial to rename the enum to "DIHType" or something more generic, should either issue become a candidate for commit.

          zakaria benzidalmal added a comment -

          Fixes an NPE that occurs when the escape parameter is not specified.

          zakaria benzidalmal added a comment -

          Data config example:

          <dataConfig>
            <dataSource name="URL" baseUrl="file:///c:/work/solr/example/example-DIH/solr/csv/in/" type="URLDataSource" />
            <document name="FixedWidthCounts">

              <!-- for delimited files -->
              <entity
                name="sites"
                processor="org.apache.solr.handler.dataimport.LineEntityProcessor"
                dataSource="URL"
                url="data.csv"
                header="true"
                separator=","
                ... <!-- you can specify other CSV update request handler parameters here -->
              />

              <!-- for fixed-width files -->
              <entity
                name="sites"
                processor="org.apache.solr.handler.dataimport.LineEntityProcessor"
                dataSource="URL"
                url="data.csv"
                colDef1="ID,0,6,STRING,0,LEFT"
                colDef2="NAME,6,26,STRING,0,LEFT"
                ...
              />

            </document>
          </dataConfig>
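
Building on the delimited-file entity above, a join against a SQL entity might be sketched as follows. This is an illustration under stated assumptions, not a tested configuration: the JDBC driver, connection URL, table, and column names are hypothetical, and the file is assumed to have a header column named ID. The child entity uses the standard DIH CachedSqlEntityProcessor with its `where` lookup syntax.

```xml
<dataConfig>
  <!-- Hypothetical file location and database connection -->
  <dataSource name="FILES" type="URLDataSource" baseUrl="file:///path/to/input/" />
  <dataSource name="DB" type="JdbcDataSource"
              driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/path/to/db" />
  <document>
    <!-- Parent entity: one row per line of the delimited file -->
    <entity name="sites"
            processor="org.apache.solr.handler.dataimport.LineEntityProcessor"
            dataSource="FILES"
            url="data.csv"
            header="true"
            separator=",">
      <!-- Child entity: SQL rows are cached and looked up by the file's ID column -->
      <entity name="site_details"
              processor="CachedSqlEntityProcessor"
              dataSource="DB"
              query="SELECT SITE_ID, REGION FROM SITE_DETAILS"
              where="SITE_ID=sites.ID" />
    </entity>
  </document>
</dataConfig>
```

Because DIH requires join keys to be of the same type, the SITE_ID column in this sketch would need to match the type assigned to the file's ID column.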

          zakaria benzidalmal added a comment - edited

          Thanks to James for his help.

          zakaria benzidalmal added a comment - edited

          A patch for Solr 4.0.0 is available: v400-SOLR-2549.patch

          yuanyun.cn added a comment -

          Very useful feature. When can we expect it to be available?


            People

            • Assignee: Unassigned
            • Reporter: James Dyer
            • Votes: 2
            • Watchers: 4
