Apache OpenOffice (AOO) Bugzilla – Issue 80385
CSV import of quoted data gives unpredictable results
Last modified: 2007-08-08 18:50:16 UTC
I was trying to figure out how Excel and Calc handle quoted strings in CSV files, so I created a file with some nicely confusing data. It loaded almost perfectly in Excel (the exception being the last line of the file), but produced funny results in Calc, to say the least. I have two test cases, plus the relevant screenshots from Excel and Calc. The only difference between the two test cases is an extra quote at the end of the first line.
Created attachment 47340 [details] First test case
Created attachment 47341 [details] Second test case
Created attachment 47342 [details] Excel - first test case
Created attachment 47343 [details] Excel - second test case
Created attachment 47344 [details] Calc - first test case
Created attachment 47345 [details] Calc - second test case
These are most probably broken CSV files, so this is a duplicate of issue 78926. Please note that a *line break* (CRLF) is allowed inside a field; this is accomplished by including the break within double quotes, e.g.: "this is a single field that continues on the 2nd line". Therefore, IF a quote is NOT closed, the field continues on the next line. This is valid behaviour! (And IF it were changed, one would NOT be able to include line breaks inside fields!) See http://tools.ietf.org/html/rfc4180 for further details.
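This quoted-line-break rule is easy to demonstrate with Python's csv module, which follows RFC 4180 on this point (the file content below is a made-up example, not one of the attachments):

```python
import csv
import io

# A line break inside a quoted field belongs to the field, so this
# two-physical-line input is a single CSV record with two fields.
data = '"this is a single field\nthat continues on the 2nd line",second field\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)
```

The reader yields one record whose first field spans both physical lines, which is exactly why an unclosed quote makes a parser swallow the rest of the file.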
Excuse me? Did you even look at anything that I attached? The files I attached both have the following structure:

Line 1: 13 fields
Line 2: 3 fields
Line 3: 3 fields, 2 of which have multiple lines
Line 4: 1 field
Line 5: 2 fields
Line 6: 2 fields
Line 7: 1 field with multiple lines, no ending quote

The only thing funny about my input files is that the lines are not all the same length. However, Calc attempts to read the file and gives no error or warning message, so I assume that this is allowed.

Now let me describe the field layout when Calc reads in the first file:

Line 1: 13 fields
Line 2: 3 fields
Line 3: 3 fields
Line 4: 1 field
Line 5: *3 fields*

The 3 fields in line 5 are made up of the contents of the rest of the file. The first field is fine. The second field contains the last field of line 5 and the first field of line 6. The third field contains the last field of line 6 and the field in line 7. Is this right?

When Calc reads in the second file (which is exactly the same apart from a quote at the end of the last field in line 1, which should not make a difference), it suddenly comes out with only *4* lines, because the last field of the first line has been combined with the first field of the second line, and the other fields of the second line have been appended to the first line!

I know I am stating in words what my screenshots are supposed to describe, but at least now one can see my problem without looking at the screenshots or attachments, which I specifically created to describe the problem.
Hi, discoleo is right. This one is a duplicate of issue 78926. For now, a starting quote or double quote has to have an ending quote. Everything between these quotes (or rather, text delimiters) goes into one cell. Mixing up these quotes will result in badly imported cells, but this is not a bug in OOo; the creator of such a file is to blame for it. Frank *** This issue has been marked as a duplicate of 78926 ***
Closed as a duplicate, but I tend to close it as invalid.
Dear friends discoleo and fst, It still appears that nobody has looked at my data, or even read my nice problem description. Instead you seem to have become fixated on the idea that I have invalid data in my file. Well, I will accept that the last line of data is incorrect, as it should have a closing quote. So I decided to add a closing quote to the last line of data and try again. Wow! Now Calc has silently dropped the last three lines of data in my CSV file, although the lines are visible in the text import dialog. Is this still correct behaviour?
Created attachment 47358 [details] Third test case - valid file
Created attachment 47359 [details] Fourth test case - valid file
Created attachment 47360 [details] Excel - third test case
Created attachment 47361 [details] Excel - fourth test case
Created attachment 47362 [details] Calc - third test case
Created attachment 47363 [details] Calc - fourth test case
Please note that all your files are broken CSV files. There are a few simple rules for CSV files:

1. Text containing quotes shall be enclosed in quotes, and the text-quote shall be doubled, e.g. this "-quote must be doubled => csv: "this ""-quote must be doubled"
2. The whole field shall be enclosed in quotes => therefore "a"b is undefined. I would interpret it as enclosing quotes => ab [OOo differs here => "a"b]
3. Breaks inside a quoted field are considered part of the field.

Therefore, Excel is really broken, but your file is the primary culprit:

1. |test| , |->"a"<-b| , |->"a"<-| , |d->"a"<-c| , |m| , |->"j"<- | , |d| , |b->"A"<-| , |D->"E,"<-f| , |1->","<-a| , |b->","<-de->"b,a"<-|
2. |->"abcdef"<- test ->"abc"<-| , |->"def"<-g->"h"<-| , |def->"gh"<-|
3. |->"this is a test"<-| , |->"yes it works"<-| ,
4. |No it doesn't|
5. |->"yes it,"","<-does->","<-no->" """"a,cd""e "<-a->""<-???BIG PROBLEM UNDEFINED???<-b->"""???<-NO ENDING QUOTE |<== ??? DOES THE FIELD REALLY END

So, understood? [OOo has some quirks, too, but it is really the undefined behaviour of your broken file that is the problem!]
It seems that OOo does NOT import the last quoted field IF the quote is NOT closed. Eventually, we should rename this issue to reflect that fact. And that should be fixed. [Still, the CSV is badly broken.]
Sir, here is what I have observed regarding string fields in CSV files in Calc (and Excel, for that matter):

- A field is considered a delimited string ONLY IF there is a quote IMMEDIATELY after a field separator.
- Leading and trailing space in a field IS SIGNIFICANT. This means that if a field begins with a space and then a quote, it is not interpreted as a string, but is stored as is.
- A delimited string may continue with other data (that may contain quotes) until the end of the field is reached. This subsequent data is stored AS IS, i.e. no quote translation.
- As a result of the preceding point, ONLY THE FIRST PROPERLY QUOTED STRING IS RECOGNISED AND TRANSLATED.

According to the rules I described above, here is how the fields in test case 3 would have been read in:

1. |test| , |->"a"<-b| , | "a"| , | d"a"c | , | m | , | "j" | , |d| , |b"A"| , |D"E| , |->"f,1"<-| , |->"a,b"<-| , |->"de"<-b| , |a"|
2. |->"abcdef"<- test "abc"| , |->"def"<-g"h"| , |def"gh"|
3. |->"this is a test"<-| , |->"yes it works"<-| , ||
4. |No it doesn't|
5. |->"yes it,"","<-does"| , |->"no"<-|
6. |->""""<-a| , |cd""e|
7. |->"a""b"""<-|

This is consistent with how Excel reads in the file. Whilst I appreciate your attempt at explaining how Calc reads in the file, I must disagree with it, because you interpret strings wherever they occur in a field, whereas Calc interprets the first 4 lines of test case 1 in exactly the way I described above. Therefore, I assumed its behaviour would be consistent throughout the file.

In response to your statements that my CSV file is "broken": that may be so, but let me quote some of what is written under "Interoperability considerations" in RFC 4180: "Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files."
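The observed rules above can be sketched as a small parser. This is a hypothetical Python sketch of the behaviour the reporter describes (the function name and structure are mine, not taken from Calc's or Excel's source): it splits one physical line, treats a field as quoted only when the quote immediately follows a separator, and copies anything after the closing quote verbatim, untranslated.

```python
def split_line(line, sep=","):
    """Split one CSV line using the observed Calc/Excel rules (sketch only)."""
    fields, i, n = [], 0, len(line)
    while i <= n:
        if i < n and line[i] == '"':
            # Quote immediately after a separator: a delimited string.
            buf = []
            i += 1
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        buf.append('"')   # doubled quote -> literal "
                        i += 2
                    else:
                        i += 1            # closing quote
                        break
                else:
                    buf.append(line[i])
                    i += 1
            # Anything after the closing quote is copied AS IS, no translation.
            while i < n and line[i] != sep:
                buf.append(line[i])
                i += 1
            fields.append("".join(buf))
            i += 1                        # skip the separator
        else:
            # No quote right after the separator (e.g. a leading space):
            # the field is stored exactly as written.
            j = line.find(sep, i)
            if j == -1:
                fields.append(line[i:])
                i = n + 1
            else:
                fields.append(line[i:j])
                i = j + 1
    return fields

# Matches the reporter's reading of the start of line 1 of test case 3:
print(split_line('test,"a"b, "a"'))   # ['test', 'ab', ' "a"']
```

Under these rules only the first properly quoted run in each field is translated, which reproduces readings such as |->"a"<-b| => 'ab' and | "a"| kept as-is.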
From RFC 4180:

6. Fields containing line breaks (CRLF), *double quotes*, and commas should be enclosed in double-quotes.
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

Therefore, you CAN'T MIX QUOTES as TEXT-Delimiters AND as unescaped QUOTES, like in: ...,"this is enclosed in quotes" BUT "THIS IS NOT",...

A. So, applying these rules, something like: d"a"c, or "abc"de"f" is ILLEGAL and undefined. [When selecting '"' as the TEXT-Delimiter.]

B. IF the user selects TO USE '"' AS THE TEXT DELIMITER, then he acknowledges that every occurrence of '"' that is NOT escaped is actually a FIELD-Quote. Otherwise, he (and the application saving this) would have escaped it properly. Therefore, IF you really want: d"a"c then code it properly: "d""a""c" Otherwise, TREAT ALL QUOTES AS STRINGS. You have the option to select this in the import dialog.

My handling of the error
========================
IF such a broken csv-file is encountered, and the user insists on '"' as the TEXT-Delimiter, then: EVERY occurrence of '"' that is NOT escaped SHALL be considered a TEXT-Delimiter; the USER opted for this, so you can't consider some of them TEXT-Delimiters and some NOT - this is INCONSISTENT! Therefore, d"a"c really becomes dac. Similarly, "abcdef"test"abc" becomes abcdeftestabc. This is consistent!

Else, how should this be handled:
,"abc" d "ef",
IS the FIELD actually:
1. 'abc" d "ef' AND the application forgot to double the QUOTES (valid assumption)
2. 'abc d ef' - my view - CONSISTENT HANDLING
3. 'abc d "ef"' - your view - the least consistent of these 4 methods
4. "abc" d "ef" - NO ESCAPING QUOTES at all, application does NOT use them (valid assumption, but the user has the freedom to select NOT to use them)
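The "code it properly" advice can be checked mechanically. In this Python illustration (my own example data, not from the attachments), writing a field that contains quotes through a conforming CSV writer produces the doubled form "d""a""c", and reading it back recovers the original text:

```python
import csv
import io

# Writing the field d"a"c: a conforming writer doubles the embedded
# quotes, producing the properly escaped form "d""a""c".
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(['d"a"c', "x"])
encoded = out.getvalue()
print(repr(encoded))          # '"d""a""c","x"\n'

# Reading it back recovers the unescaped field without ambiguity.
rows = list(csv.reader(io.StringIO(encoded)))
print(rows)                   # [['d"a"c', 'x']]
```

A properly escaped file round-trips losslessly; the four competing interpretations above only arise because the escaping rule was not followed.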
Sir, let us be clear on one thing. I am discussing the existing behaviour of Calc. What I want is not an enhancement or a new way to handle CSV files; I merely want the handling to be consistent, and to reflect what I have in the file reasonably accurately. I discovered an inconsistency in the current behaviour, and reported it. The examples and screenshots I have given are not wishful thinking on my part as to how CSV files should be read in. Calc actually does read these files in this way, up to a point. Therefore my observations about quoted-string behaviour (only the first string gets converted) are ACTUALLY WHAT HAPPENS, and not something I made up in my head.

> Else, how should this be handled:
> ,"abc" d "ef",
> IS the FIELD actually:
> 1. 'abc" d "ef' AND the application forgot to double the QUOTES
> (valid assumption)
> 2. 'abc d ef' - my view - CONSISTENT HANDLING
> 3. 'abc d "ef"' - your view - the least consistent of these 4 methods

This is not my view, it is actually what happens in Calc and Excel. If you don't believe me, please try it yourself.
> Let us be clear on one thing. I am discussing the existing behaviour of Calc.
> What I want is not an enhancement or new way to handle CSV files, I merely want
> the handling to be consistent, and to reflect what I have in the file reasonably
> accurately.

Isn't that somewhat contradictory? You have broken data in your file and want it reflected reasonably accurately? Yes, we can add some more magic to try to guess broken data correctly. Yes, we will do so eventually; this is what issue 78926 is about, so this issue here is in fact a duplicate.

> > Else, how should this be handled:
> > ,"abc" d "ef",
> > IS the FIELD actually:
> > 1. 'abc" d "ef' AND the application forgot to double the QUOTES
> > (valid assumption)

This is how it should be treated, IMHO. Doing so would also lead to ,"a"b, "a", resulting in _one_ field 'a"b, "a' and not two, 'ab' and ' "a"', as is currently the case. I could bet sooner or later someone will show up and complain that we don't do it like Excel ...

> > 2. 'abc d ef' - my view - CONSISTENT HANDLING

Consistent, but it also loses information.

> > 3. 'abc d "ef"' - your view - the least consistent of these 4 methods
>
> This is not my view, it is actually what happens in Calc and Excel. If you don't
> believe me, please try it yourself.

Yes, that's how this case of broken data that doesn't follow the escaped-quotes rule currently may be treated.

*** This issue has been marked as a duplicate of 78926 ***
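As a neutral data point (not part of the original discussion): Python's csv reader in its default non-strict mode also behaves like interpretation 3 for the contested field — the first properly quoted run is unquoted, and everything after the closing quote, quotes included, is kept verbatim. The padding fields a and z below are my addition.

```python
import csv
import io

# The contested field ,"abc" d "ef", from the discussion.
rows = list(csv.reader(io.StringIO('a,"abc" d "ef",z\n')))
print(rows)   # [['a', 'abc d "ef"', 'z']]

# The ,"a"b, "a" example from the same comment: two fields,
# 'ab' and ' "a"', matching the behaviour described as current.
rows2 = list(csv.reader(io.StringIO('"a"b, "a"\n')))
print(rows2)  # [['ab', ' "a"']]
```

Passing strict=True to csv.reader instead raises an error on such input, which is yet another defensible way to handle a file that mixes escaped and unescaped quotes.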
Closing dup.
Friend er,

> Isn't that somewhat contradictory? You have broken data in your file and
> want it reflected reasonably accurately? Yes, we can add some more magic
> to try to guess broken data correctly. Yes, we will do so eventually; this
> is what issue 78926 is about, so this issue here is in fact a duplicate.

I do not ask for any magic or guessing of data. All I ask is for the treatment of data to be consistent. I thought I knew what Calc was doing, and that it would do the same or a similar thing as Excel, but it looks like I was wrong there.

To be honest with you, I was writing a C routine to read in CSV files, and I wanted to find out how they were handled in production applications, hence my creation of the "broken" CSV files. Excel appears to handle "broken" data rather gracefully. Calc... rather less gracefully, losing some data in the process. Surely it is important not to lose data? Surely it is important to apply rules consistently?

I guess I might as well give up. At the very least, could you put something in the documentation somewhere that specifies exactly how Calc reads in CSV files? At least then, people like me wouldn't be unpleasantly surprised.