Apache OpenOffice (AOO) Bugzilla – Issue 91028
CSV import chops off characters for lines >64k
Last modified: 2022-12-31 12:22:39 UTC
I try to import a tab-delimited UTF-8 text file (on Ubuntu 8.04), but some lines are strangely chopped. It's not the column-limit problem (the last column is GF); the problem happens at cell EP3: instead of "Domisi naujausiomis..." it leaves just "Domi" :/ It seems to be an issue with line length, because if I strip characters from the beginning, it leaves as many characters more at the end (now it chops at the 63036th character). The file: http://popmokslas.projektas.lt/etc/example.txt (is it possible to upload/attach it to the bug?) Copying from the clipboard also gives similar problems (the chop places seem to depend a bit on the encoding: windows-1257 chops ~20-50 characters later).
In version OOo-dev300-m21_en, when I opened the attachment (example.txt) in Calc, it opened in Writer instead of Calc. Then I created a new .txt document and tried again; it was the same, so I can't reproduce it.
I use Insert -> Sheet from File (not sure about the exact naming, as I use it in my native language).
Can reproduce it in DEV300_m28_EN on WinXP. To reproduce: Insert -> Sheet from File, to insert the attachment (example.txt) into a Calc file. Result: in cell EP3, it leaves "Domis" instead of "Domisi naujausiomis..."
dz0,
>> The file: http://popmokslas.projektas.lt/etc/example.txt (is it possible to
>> upload/attach it to the bug?)
Yes, you can upload an attachment via "create a new attachment" on this page.
confirmed
"Domis" instead of "Domisi naujausiomis" means the line is cut off after its 65535th character. There seems to be a relation between this and the 64k limit of class String, which is used to read from SvStream.
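To make the symptom concrete, here is a minimal self-contained model of that behavior (hypothetical code, not the actual tools String implementation): a string type whose length field is capped at 0xFFFF silently clips appends, so the content stops growing at exactly 65535 characters, matching the cut-off seen in the file.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical model of a string with a 16-bit maximum length, mirroring
// the STRING_MAXLEN = 0xFFFF cap described below. Appends past the cap
// are silently clipped, so the visible content stops at 65535 characters.
struct CappedString {
    static constexpr std::uint32_t kMaxLen = 0xFFFF; // 65535

    std::string data;

    void append(const char* buf, std::uint32_t n) {
        if (data.size() + n > kMaxLen)
            n = kMaxLen - static_cast<std::uint32_t>(data.size());
        data.append(buf, n); // anything past character 65535 is dropped
    }
};
```

Feeding such a string more than 64k of input, chunk by chunk, reproduces the "stuck at 65535" effect regardless of how much more data follows.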
Resolved in LibreOffice 3.6: https://bugs.freedesktop.org/show_bug.cgi?id=48501
And will it be corrected in OOCalc? Two years and still no correction: the max line length is 65536 characters (64k). DWORD instead of INTEGER... Please let me know if any correction is available for OOCalc.
Let's see, this might be an easy fix. Calc's filters are in main/sc/source/filters, but CSV is not there. Importing a CSV must set values in cells, so if we put a breakpoint on whatever sets values in cells, we should be able to backtrace to the CSV code. Unfortunately that was hard to find. Setting a breakpoint on the POSIX open() function ("break --qualified open") instead showed many files opening, and the one I was trying to test was only being opened while detecting the file type, and never again. Grepping through main/sc and looking around, I eventually found that main/sc/source/ui/docshell/impex.cxx does get involved in importing CSV, and calls ScDocument::SetString() to populate cells. Setting a breakpoint on that finally got me a nice stack trace through the CSV import code path:

---snip---
#0  ScImportExport::ExtText2Doc(SvStream&) (this=this@entry=0x7fffffffc1a8, rStrm=...)
    at source/ui/docshell/impex.cxx:1154
#1  0x000000080c61b55c in ScImportExport::ImportStream(SvStream&, String const&, unsigned long)
    (this=0x7fffffffc1a8, rStrm=..., rBaseURL=..., nFmt=1) at source/ui/docshell/impex.cxx:450
#2  0x000000080c5c27fb in ScDocShell::ConvertFrom(SfxMedium&) (this=0x80b27ae10, rMedium=...)
    at source/ui/docshell/docsh.cxx:1211
#3  0x000000080141a88d in SfxObjectShell::DoLoad(SfxMedium*) (this=0x80b27ae10, pMed=0x81168e4b0)
    at source/doc/objstor.cxx:753
#4  0x000000080144b5ee in SfxBaseModel::load(com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue> const&)
    (this=0x8116b9a00, seqArguments=<optimized out>) at source/doc/sfxbasemodel.cxx:1878
---snip---

Starting at frame #1, sal_Bool ScImportExport::ImportStream( SvStream& rStrm, const String& rBaseURL, sal_uLong nFmt ) already receives an opened SvStream, which explains why it was never calling open(). Now let's understand how it works, and where the 64K line length limit comes in.
---snip---
sal_Bool ScImportExport::ImportStream( SvStream& rStrm, const String& rBaseURL, sal_uLong nFmt )
{
    if( nFmt == FORMAT_STRING )
    {
        if( ExtText2Doc( rStrm ) )      // pExtOptions auswerten (evaluate pExtOptions)
            return sal_True;
    }
    ...
---snip---

nFmt is FORMAT_STRING, so it immediately calls ExtText2Doc( rStrm ), as is visible in frame #0 in my previous comment. Let's set a breakpoint there and load a CSV file with huge lines: 20000 each of a, b, c, d, e, separated by commas, 100004 characters on one line. This line:

---line---
rStrm.ReadCsvLine( aLine, !bFixed, rSeps, cStr );
---line---

calls SvStream::ReadCsvLine() in main/tools/source/stream/stream.cxx, which ultimately ends up in SvStream::ReadLine( ByteString& rStr ). There, it incrementally reads chunks of 256 bytes or less from the file, looks for the end of line, and appends each chunk to the string that was passed in:

---line---
rStr.Append( buf, n );
---line---

Putting a breakpoint there and printing the length of the rStr string shows that it increases on each loop round, reaches 65535, then remains stuck there:

---snip---
Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...)
    at source/stream/stream.cxx:736
736         rStr.Append( buf, n );
$396 = 65024

Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...)
    at source/stream/stream.cxx:736
736         rStr.Append( buf, n );
$397 = 65280

Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...)
    at source/stream/stream.cxx:736
736         rStr.Append( buf, n );
$398 = 65535

Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...)
    at source/stream/stream.cxx:736
736         rStr.Append( buf, n );
$399 = 65535
---snip---

That's because rStr is of type ByteString, which in tools/inc/tools/string.hxx is defined with a 16-bit maximum size limit:

---snip---
#ifdef STRING32
#define STRING_NOTFOUND     ((xub_StrLen)0x7FFFFFFF)
#define STRING_MATCH        ((xub_StrLen)0x7FFFFFFF)
#define STRING_LEN          ((xub_StrLen)0x7FFFFFFF)
#define STRING_MAXLEN       ((xub_StrLen)0x7FFFFFFF)
#else
#define STRING_NOTFOUND     ((xub_StrLen)0xFFFF)
#define STRING_MATCH        ((xub_StrLen)0xFFFF)
#define STRING_LEN          ((xub_StrLen)0xFFFF)
#define STRING_MAXLEN       ((xub_StrLen)0xFFFF)
#endif
---snip---

There are multiple ways to fix this:
1. Globally define STRING32, which could have many unintended consequences, such as larger spreadsheet cell strings.
2. Pass a different string buffer type to SvStream::ReadCsvLine(), so that it is unaffected by that limit.
3. Drop the stream entirely and switch to push-model CSV parsing, which is the lightest on memory and could be paused and resumed, but has more complex code.
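As a rough illustration of option 2, here is a self-contained sketch (hypothetical names, standard library types rather than the AOO API) of a line reader that accumulates into std::string, which has no 16-bit length cap, while still reading the stream in 256-byte chunks the way SvStream::ReadLine() does:

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Sketch, not the AOO API: read one line of unbounded length into
// std::string, consuming the stream in 256-byte chunks. On finding a
// newline inside a chunk, rewind the stream to just past it so the
// next call starts at the following line.
bool readLongLine(std::istream& in, std::string& out) {
    out.clear();
    char buf[256];
    bool readAny = false;
    for (;;) {
        in.read(buf, sizeof(buf));
        std::streamsize n = in.gcount();
        if (n == 0)
            return readAny; // nothing left: success only if we read something
        readAny = true;
        const void* hit = std::memchr(buf, '\n', static_cast<std::size_t>(n));
        if (hit) {
            std::streamsize used = static_cast<const char*>(hit) - buf;
            out.append(buf, static_cast<std::size_t>(used));
            in.clear();                             // a short read may have set eofbit
            in.seekg(used + 1 - n, std::ios::cur);  // rewind past the newline
            return true;
        }
        out.append(buf, static_cast<std::size_t>(n)); // no 65535 cap here
    }
}
```

With the 100004-character test line from the previous comment, this keeps the whole line intact instead of stopping at 65535.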
And why don't we define rStr as a vector container? It is C++ after all.
(In reply to Peter from comment #11)
> And why don't we define rStr as a vector container?
> It is C++ after all.
LibreOffice was planning at one stage to change the various custom containers to standard C++ classes, like the UNO Sequence to ::std::vector; I don't know how that went. But ByteString isn't a vector, it's a string; the C++ standard equivalent would be ::std::string. I don't think we should be generally changing ByteString to ::std::string though, because sometimes it is intentionally used in places that want limited-length strings. The 32-bit-length ::rtl::OString and ::rtl::OUString would be better candidates.
Looking at the code, modifying ByteString itself would be the preferable approach: we replace the ByteStringData structure in the ByteString class (line 163 in inc/string.hxx) with a container. I think the change would have limited risk, and it would extend the limits we see at the moment. I am not sure how the ByteString::Append method works, since I could not find an Append method in source/string/tstring.cxx, but I think it underlies the implementations of various other ByteString methods. Because we append a lot, vector does not feel suitable as the container; I would pick std::deque<sal_Char>. We would need to redefine the ByteString methods to use the container properly.
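A minimal sketch of that idea (hypothetical class, not the real ByteString): a string-like wrapper backed by std::deque<char>, which appends in amortized constant time without relocating existing characters and has no 16-bit length field to overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical container-backed string, sketching Peter's proposal.
// std::deque<char> grows by appending blocks, so Append() never hits
// a STRING_MAXLEN-style cap and never copies the existing contents.
class DequeString {
    std::deque<char> mData;
public:
    void Append(const char* buf, std::size_t n) {
        mData.insert(mData.end(), buf, buf + n); // unbounded growth
    }
    std::size_t Len() const { return mData.size(); }
    // Materialize a contiguous copy when callers need one.
    std::string GetBuffer() const { return std::string(mData.begin(), mData.end()); }
};
```

One trade-off to note: deque storage is not contiguous, so any caller that expects a flat char buffer would need a conversion like GetBuffer(), which copies.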
*** Issue 113658 has been marked as a duplicate of this issue. ***
Update: I found the implementation. In total I think it is a bit difficult for me to follow. I found the following files that implement pieces:
tools/inc/tools/string.hxx - the header file we found
tools/source/string/strascii.cxx - implements the friend class UniString
tools/source/string/tustring.cxx - implements some methods (createFrom..., To...) for the friend class UniString
tools/source/string/tstring.cxx - implements some methods (createFrom..., To...) for ByteString
Once it becomes clear how all these files relate, it should be easy to update the class to use std::deque<sal_Char> instead of ByteStringData* mpData and UniStringData* mpData. A comment suggests we should move to a new string class eventually, but I think in order to fix the bug we should still update the existing class first, and then make a plan for whether we want to replace it. This deprecated approach is a mess; ByteString is in wide use.
The CSV parsing is done twice: once by SvStream::ReadCsvLine(), to join consecutive physical lines that fields can be spread over into one logical "line", and then again, by code in both Base and Calc, to split that line into separate fields. The field splitting already works well; only SvStream::ReadCsvLine() needs patching for this bug. I don't think redefining strings across the whole of AOO is a good idea, and 16-bit string lengths reduce RAM usage on big spreadsheets too. Even Excel only allows 32K chars per cell, IIRC.
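The two passes described above can be sketched as follows (simplified CSV rules with hypothetical function names, not AOO's exact dialect or API): pass 1 keeps appending physical lines while a quote is unmatched, the way ReadCsvLine() joins a record whose quoted field contains embedded newlines; pass 2 splits the joined logical line on the separator, ignoring separators inside quotes.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Pass 1 (sketch of the ReadCsvLine role): join physical lines into one
// logical record while an opening quote is still unmatched.
std::string readCsvRecord(std::istream& in) {
    std::string record, line;
    bool inQuotes = false;
    while (std::getline(in, line)) {
        if (!record.empty())
            record += '\n'; // re-insert the newline embedded in the quoted field
        record += line;
        for (char c : line)
            if (c == '"')
                inQuotes = !inQuotes;
        if (!inQuotes)
            break; // quotes balanced: the record is complete
    }
    return record;
}

// Pass 2 (sketch of the Base/Calc field splitting): split the logical
// record on the separator, treating quoted separators as field content.
std::vector<std::string> splitFields(const std::string& record, char sep) {
    std::vector<std::string> fields(1);
    bool inQuotes = false;
    for (char c : record) {
        if (c == '"')
            inQuotes = !inQuotes;       // quotes toggle state, are not emitted
        else if (c == sep && !inQuotes)
            fields.push_back(std::string());
        else
            fields.back() += c;
    }
    return fields;
}
```

Since pass 2 already copes with long input, fixing the accumulation buffer used in pass 1 is enough to lift the 64k limit, which matches the conclusion above.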