Issue 91028 - CSV import chops off characters for lines >64k
Summary: CSV import chops off characters for lines >64k
Status: CONFIRMED
Alias: None
Product: Calc
Classification: Application
Component: open-import
Version: OOo 2.4.1 RC2
Hardware: All / All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Duplicates: 113658
Depends on:
Blocks:
 
Reported: 2008-06-25 01:26 UTC by dz0
Modified: 2022-12-31 12:22 UTC
CC: 9 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description dz0 2008-06-25 01:26:12 UTC
I tried to import a tab-delimited UTF-8 text file (in Ubuntu 8.04),
but some lines are strangely chopped.
It is not a column-limit problem (the last column is GF).
The problem happens at cell EP3:
instead of "Domisi naujausiomis..."
it leaves just "Domi" :/

It seems to be an issue with line length, because if I strip characters from
the beginning, it leaves as many more characters at the end (it now chops at
the 63036th character).

The file: http://popmokslas.projektas.lt/etc/example.txt (is it possible to
upload/attach it to the bug?)

Copying from the clipboard gives similar problems; the chop places seem to
depend a bit on the encoding (windows-1257 chops ~20-50 characters later).
Comment 1 chl207 2008-07-07 03:35:19 UTC
In version OOo-dev300-m21_en, when I opened the attachment (example.txt) in
Calc, it opened in Writer instead of Calc. Then I created a new .txt
document and tried again; it was the same, so I can't reproduce it.
Comment 2 dz0 2008-07-07 14:37:49 UTC
I use
Insert -> Sheet from File
(not sure about the exact naming, as I use it in my native language)
Comment 3 amy2008 2008-08-15 03:38:35 UTC
Can reproduce it in DEV300_m28_EN on WinXP.

To reproduce:
Insert -> Sheet from File, to insert the attachment (example.txt) into a Calc
file.

Result: 
In cell EP3, it leaves "Domis" instead of "Domisi naujausiomis..."
Comment 4 amy2008 2008-08-15 03:48:48 UTC
dz0,
>>The file: http://popmokslas.projektas.lt/etc/example.txt (is it possible to
>>upload/attach it to the bug?)

Yes, you can upload an attachment by "create a new attachment" in this page.
Comment 5 redflagzhulihua 2008-08-20 09:43:32 UTC
confirmed
Comment 6 ooo 2008-09-15 14:38:07 UTC
The line is cut off after its 65535th character, leaving "Domis" instead of
"Domisi naujausiomis". There seems to be some relation between this and the
64k limit of class String, which is used to read from SvStream.
Comment 7 vulcain 2013-01-28 09:43:46 UTC
Resolved in LibreOffice 3.6:
https://bugs.freedesktop.org/show_bug.cgi?id=48501
Comment 8 JBO 2015-05-30 20:53:41 UTC
And will it be corrected in OOCalc? Two years and still no correction.

The max line length is 65536 characters (64k). DWORD instead of INTEGER...

Please let me know if any correction is available for OOCalc.
Comment 9 damjan 2022-12-26 08:43:29 UTC
Let's see, this might be an easy fix.

Calc's filters are in main/sc/source/filters, but CSV is not there.

Importing a CSV must set values in cells, so if we put a breakpoint on whatever function sets cell values, we should be able to backtrace into the CSV code. Unfortunately that function was hard to find.

Setting a breakpoint on the POSIX open() function ("break --qualified open") instead showed many files being opened, but the one I was trying to test was only opened while detecting the file type, and never again.

Grepping through main/sc and looking around, I eventually found that main/sc/source/ui/docshell/impex.cxx is involved in importing CSV, and calls ScDocument::SetString() to populate cells. Setting a breakpoint on that finally got me a nice stack trace through the CSV import code path:

#0  ScImportExport::ExtText2Doc(SvStream&) (this=this@entry=0x7fffffffc1a8, rStrm=...) at source/ui/docshell/impex.cxx:1154
#1  0x000000080c61b55c in ScImportExport::ImportStream(SvStream&, String const&, unsigned long) (this=0x7fffffffc1a8, rStrm=..., rBaseURL=..., nFmt=1) at source/ui/docshell/impex.cxx:450
#2  0x000000080c5c27fb in ScDocShell::ConvertFrom(SfxMedium&) (this=0x80b27ae10, rMedium=...) at source/ui/docshell/docsh.cxx:1211
#3  0x000000080141a88d in SfxObjectShell::DoLoad(SfxMedium*) (this=0x80b27ae10, pMed=0x81168e4b0) at source/doc/objstor.cxx:753
#4  0x000000080144b5ee in SfxBaseModel::load(com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue> const&) (this=0x8116b9a00, seqArguments=<optimized out>) at source/doc/sfxbasemodel.cxx:1878

Starting at frame #1,
sal_Bool ScImportExport::ImportStream( SvStream& rStrm, const String& rBaseURL, sal_uLong nFmt )

already receives an opened SvStream, which explains why it never called open(). Now let's understand how it works, and where the 64K line length limit comes in.
Comment 10 damjan 2022-12-26 09:35:22 UTC
---snip---
sal_Bool ScImportExport::ImportStream( SvStream& rStrm, const String& rBaseURL, sal_uLong nFmt )
{
    if( nFmt == FORMAT_STRING )
    {
        if( ExtText2Doc( rStrm ) )      // pExtOptions auswerten
            return sal_True;
    }
...

---snip---

The nFmt is FORMAT_STRING, so it immediately calls "ExtText2Doc( rStrm )", as is visible in frame #0 in my previous comment.

Let's set a breakpoint there and load a CSV file with huge lines: 20000 each of a, b, c, d, e, separated by commas, for 100004 characters on one line.

This line:
---line---
rStrm.ReadCsvLine( aLine, !bFixed, rSeps, cStr);
---line---

calls SvStream::ReadCsvLine() in main/tools/source/stream/stream.cxx, which ultimately ends up in SvStream::ReadLine( ByteString& rStr ). There, it incrementally reads chunks of up to 256 bytes from the file, looks for the end of line, and appends them to the string that was passed in:

---line---
rStr.Append( buf, n );
---line---

Putting a breakpoint there and printing the length of the rStr string shows that it increases on each loop iteration, reaches 65535, then remains stuck there:

---snip---
Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...) at source/stream/stream.cxx:736
736	            rStr.Append( buf, n );
$396 = 65024

Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...) at source/stream/stream.cxx:736
736	            rStr.Append( buf, n );
$397 = 65280

Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...) at source/stream/stream.cxx:736
736	            rStr.Append( buf, n );
$398 = 65535

Thread 1 hit Breakpoint 17, SvStream::ReadLine (this=this@entry=0x80b409830, rStr=...) at source/stream/stream.cxx:736
736	            rStr.Append( buf, n );
$399 = 65535
---snip---

That's because rStr is of type ByteString, which in tools/inc/tools/string.hxx is defined with a 16-bit maximum length:

---snip---
#ifdef STRING32
#define STRING_NOTFOUND    ((xub_StrLen)0x7FFFFFFF)
#define STRING_MATCH       ((xub_StrLen)0x7FFFFFFF)
#define STRING_LEN                 ((xub_StrLen)0x7FFFFFFF)
#define STRING_MAXLEN      ((xub_StrLen)0x7FFFFFFF)
#else
#define STRING_NOTFOUND    ((xub_StrLen)0xFFFF)
#define STRING_MATCH       ((xub_StrLen)0xFFFF)
#define STRING_LEN                 ((xub_StrLen)0xFFFF)
#define STRING_MAXLEN      ((xub_StrLen)0xFFFF)
#endif
---snip---

There are multiple ways to fix this:

1. Globally define STRING32. This could have many unintended consequences, such as larger spreadsheet cell strings.
2. Pass a different string buffer type to SvStream::ReadCsvLine(), so that it is unaffected by that limit.
3. Drop the stream entirely and switch to push-model CSV parsing, which is the lightest on memory and could be paused and resumed, but needs more complex code.
Comment 11 Peter 2022-12-26 13:52:05 UTC
And why do we not define rStr as a vector container?
It is C++, after all.
Comment 12 damjan 2022-12-26 15:36:08 UTC
(In reply to Peter from comment #11)
> And why we do not define rStr as a Vector Container?
> It is C++ after all.

LibreOffice was planning at one stage to change its various custom containers to standard C++ classes, like UNO Sequence to ::std::vector. I don't know how that went.

But ByteString isn't a vector, it's a string; the C++ standard equivalent would be ::std::string. I don't think we should generally change ByteString to ::std::string though, because it is sometimes used precisely where limited-length strings are wanted. The 32-bit-length ::rtl::OString and ::rtl::OUString would be better candidates.
Comment 13 Peter 2022-12-26 16:25:09 UTC
Looking at the code, modifying ByteString itself would be the preferable approach: we replace the ByteStringData structure inside the ByteString class (line 163 in inc/string.hxx) with a container. I think the change would carry limited risk, and would extend the limits we see at the moment.

I am not sure how ByteString::Append works, since I could not find an append method in source/string/tstring.cxx, but I think it underlies the implementations of various other ByteString methods.

Because we append a lot, a vector does not feel suitable as a container; I would pick std::deque<sal_Char>. We would need to redefine the ByteString methods to use the container properly.
Comment 14 damjan 2022-12-27 13:17:11 UTC
*** Issue 113658 has been marked as a duplicate of this issue. ***
Comment 15 Peter 2022-12-31 09:58:49 UTC
Update: I found the implementation.

Overall I find the implementation a bit difficult to follow.
I found the following files that implement pieces of it:

tools/inc/tools/string.hxx - the header file we found
tools/source/string/strascii.cxx - implements the friend class UniString
tools/source/string/tustring.cxx - implements some methods (createFrom..., To...) of the friend class UniString
tools/source/string/tstring.cxx - implements some methods (createFrom..., To...) of ByteString


Once it becomes clear how all these files relate, it should be easy to update the classes to use std::deque<sal_Char> instead of ByteStringData* mpData and UniStringData* mpData.


A comment in the code suggests we should move to a different string class. But I think that to fix this bug we should still update the existing class, and then make a plan for whether we want to replace it. The deprecated approach is a mess, and ByteString is in wide use.
Comment 16 damjan 2022-12-31 12:22:39 UTC
The CSV parsing is done twice: once by SvStream::ReadCsvLine(), to join the consecutive physical lines that fields can be spread over into one "line", and then again, by code in both Base and Calc, to split that line into separate fields.

The field splitting already works well; only SvStream::ReadCsvLine() needs patching for this bug.

I don't think redefining strings across the whole of AOO is a good idea, and 16-bit string lengths also reduce RAM usage on big spreadsheets. Even Excel only allows 32K chars per cell, IIRC.