Commons CSV / CSV-196

Store the information of raw data read by lexer

Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: Parser

    Description

      It would be good for the CSVParser class to record whether a field was enclosed in quotes in the original source file.
      For example, for this data sample:

      A, B, C
      a1, "b1", c1

      CSVParser gives us the record a1, b1, c1. This is helpful because the double quotes have been parsed, but the information about the original data is lost at the same time: we cannot tell from the returned CSVRecord whether the original field was enclosed in double quotes.

      In our use case, we are integrating the Apache Hadoop APIs with Commons CSV. CSV is one kind of input for Hadoop jobs, which should support splitting input data. To accurately split a CSV file into pieces, we need to count the bytes of data that CSVParser actually read. CSVParser does not have accurate information about whether a field was enclosed in quotes, nor does it store the raw data of the original source, so downstream users of CSVParser are not able to get this information.
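
      To make the byte-count gap concrete, here is a minimal sketch of the arithmetic for the sample row above. The class and method names are hypothetical, and UTF-8 encoding with a single "\n" line terminator is an assumption; none of this is Commons CSV API.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Commons CSV or Hadoop. */
final class RawRecordBytes {

    /** Bytes the raw record text occupies in the file, assuming UTF-8. */
    static int rawByteLength(String rawRecord) {
        return rawRecord.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Bytes one would guess from the parsed values alone: the values,
     * one single-byte delimiter between neighbours, and the line terminator.
     * Quotes and ignored surrounding spaces are invisible at this level.
     */
    static int estimateFromValues(String[] values, int lineTerminatorBytes) {
        int total = lineTerminatorBytes;
        for (String value : values) {
            total += value.getBytes(StandardCharsets.UTF_8).length;
        }
        if (values.length > 1) {
            total += values.length - 1; // delimiters between fields
        }
        return total;
    }
}
```

      For the raw text `a1, "b1", c1` followed by "\n", rawByteLength returns 13, while estimateFromValues on the parsed values a1, b1, c1 returns 9. The difference of 4 bytes is the two quote characters plus the two skipped spaces, which is exactly the information a caller cannot recover from the parsed record.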

      Suggested fix: extend the token/CSVRecord with a boolean field indicating whether the column was enclosed in quotes. While the Lexer is executing getNextToken, set the flag if a field is encapsulated and successfully parsed.
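
      The idea can be sketched with a small standalone tokenizer. This is not the Commons CSV Lexer: the class QuoteAwareLexer, the Field type and its quoted flag are all hypothetical names, and the sketch assumes a comma delimiter, a double-quote encapsulator with "" as the escape, and that spaces around fields are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a quote-aware tokenizer; not the Commons CSV Lexer. */
final class QuoteAwareLexer {

    /** A parsed field plus the proposed flag. */
    static final class Field {
        final String value;
        final boolean quoted; // true if the field was enclosed in double quotes in the source

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Tokenizes one line; the flag is set only after the closing quote was seen. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            while (i < n && line.charAt(i) == ' ') {
                i++; // skip spaces before the field
            }
            if (i < n && line.charAt(i) == '"') {
                StringBuilder sb = new StringBuilder();
                boolean closed = false;
                i++; // consume the opening quote
                while (i < n && !closed) {
                    char c = line.charAt(i);
                    if (c != '"') {
                        sb.append(c);
                        i++;
                    } else if (i + 1 < n && line.charAt(i + 1) == '"') {
                        sb.append('"'); // "" inside quotes is an escaped quote
                        i += 2;
                    } else {
                        closed = true; // closing quote
                        i++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("Unterminated quoted field");
                }
                while (i < n && line.charAt(i) != ',') {
                    i++; // this sketch ignores anything between closing quote and delimiter
                }
                fields.add(new Field(sb.toString(), true));
            } else {
                int start = i;
                while (i < n && line.charAt(i) != ',') {
                    i++;
                }
                fields.add(new Field(line.substring(start, i).trim(), false));
            }
            if (i >= n) {
                return fields;
            }
            i++; // consume the delimiter
        }
    }

    public static void main(String[] args) {
        for (Field f : parseLine("a1, \"b1\", c1")) {
            System.out.println(f.value + " quoted=" + f.quoted);
        }
    }
}
```

      Run on the second row of the sample, this prints a1 quoted=false, b1 quoted=true and c1 quoted=false, so the distinction survives parsing. In a real change the flag would presumably live on the library's own token and record types rather than on a separate class.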

      I found another issue with a similar request, but it was marked as resolved: [CSV-91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22

            People

              Assignee: Unassigned
              Reporter: Matt Sun (mattsun)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: 48h
                  Remaining: 47h 20m
                  Logged: 40m