Apache Arrow / ARROW-3139

[Python] ArrowIOError: Arrow error: Capacity error during read

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.10.0
    • Fix Version/s: None
    • Component/s: Python
    • Environment: pandas=0.23.1=py36h637b7d7_0
      pyarrow==0.10.0

    Description

      My assumption: the problem is caused by a large object column containing strings up to 27 characters long. In total that column holds far more than 2 GB of string data, so this looks like a chunking issue.

      This looks similar to https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574

      Code

      • basket_plateau = pq.read_table("basket_plateau.parquet")  # pq is pyarrow.parquet
      • basket_plateau = pd.read_parquet("basket_plateau.parquet")  # pd is pandas

      Error produced

      • ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
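
      The limit quoted in the message equals 2^31 - 2, which would be consistent with the array addressing its data through signed 32-bit offsets. That reading is an assumption drawn from the number itself, not something stated in this report. A minimal sketch of the arithmetic follows; the constant names and the helper `fits_in_one_binary_array` are hypothetical and not part of pyarrow:

```python
# Both values are copied from the error message above.
LIMIT_BYTES = 2147483646   # "cannot contain more than 2147483646 bytes"
HAVE_BYTES = 2147483655    # "have 2147483655"


def fits_in_one_binary_array(n_bytes, limit=LIMIT_BYTES):
    """Hypothetical helper: True if n_bytes stays within the reported limit."""
    return n_bytes <= limit


# How far the read went past the limit before the error was raised.
overflow = HAVE_BYTES - LIMIT_BYTES

print(LIMIT_BYTES == 2**31 - 2)
print(fits_in_one_binary_array(HAVE_BYTES))
print(overflow)
```

      The read fails only a few bytes past the limit, which suggests the column data was being accumulated into a single array rather than split into several.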

      Dataset

      • Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
      • 2.7 billion records, 4 columns (int64/object/datetime64/float64)
      • approx. 90 GB in memory
      • example values of the object column: "Fresh Vegetables", "Alcohol Beers", ... (think food retail categories)
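
      A rough size estimate for the object column, as a sketch only. The average string length is an assumption (16 bytes, the length of "Fresh Vegetables"), since the real average is not given in this report. `min_chunks` is a hypothetical helper, not a pyarrow function:

```python
ROWS = 2_700_000_000                        # "2.7 billion records" from the report
LIMIT_BYTES = 2147483646                    # per-array limit quoted in the error message
ASSUMED_AVG_LEN = len("Fresh Vegetables")   # 16 bytes; an assumption, not a measurement


def min_chunks(rows, avg_len, limit=LIMIT_BYTES):
    """Return (total string bytes, minimum number of arrays needed to hold them)."""
    total = rows * avg_len
    # Integer ceiling division avoids floating-point rounding.
    chunks = -(-total // limit)
    return total, chunks


total_bytes, chunks = min_chunks(ROWS, ASSUMED_AVG_LEN)
print(total_bytes)
print(chunks)
```

      Under that assumption the column holds about 43 GB of string data, so it would have to be split across at least 21 arrays to stay under the limit.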

      History to bug:

            People

              Assignee: Unassigned
              Reporter: Frédérique Vanneste (MarkiesFredje)