  1. Apache Arrow
  2. ARROW-17887

[R] [Doc] Improve readability of the Get Started and README pages

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 11.0.0
    • Component/s: R

    Description

      In their current form, the pkgdown Get Started and README pages are a little hard for new users to follow. I would argue that both pages are written in a way that makes sense to someone who is already familiar with core Arrow concepts, but is potentially intimidating to an R user who is curious about Arrow and has never used it. The issue affects both the main [README page](https://arrow.apache.org/docs/r/index.html) and the [Get Started](https://arrow.apache.org/docs/r/articles/arrow.html) page. A few examples:

      • The README page opens with the sentence *"Apache Arrow is a cross-language development platform for in-memory data".* This is a problem for multiple reasons. Firstly, it's not really true anymore, because we encourage users to rely on `Dataset` for on-disk datasets. Secondly, the sentence simply assumes the user has a clear mental model of the difference between in-memory and on-disk data. I don't think that's true for data scientists in general. A data engineer likely has a more precise mental model here, but R users are typically focused on analytics, so unless they have extensive experience working with large data sets, this isn't something we can assume. Thirdly, and maybe most importantly, it doesn't explain to the user why they should care about arrow: it doesn't say what the arrow package does. It's too vague.
      • There are (IMO) too many boldfaced sections in the README page, and it's very cluttered. This gives the page an intensity and a feeling of "denseness" that I think we should avoid at all costs. Arrow already has a reputation for being a complicated project (because it is!), but we don't want our documentation to have that feeling. I think we ought to be aiming for something gentler and more welcoming. Readers don't need to be told everything on the very first page: it's probably better to give a simpler description there and push the details into additional vignettes.
      • The Get Started page has some of the same problems as the main README. The "object hierarchy" and "data object" tables only make sense once you already understand core Arrow concepts. In both cases, the tables need to be wrapped in explanatory text that provides the missing context for users, with the additional details pushed out to vignettes that explain them more fully.
      • The data types mapping section on the Get Started page has the same issue. A novice user doesn't necessarily have a clear understanding of how fundamental types are represented in R, much less how they are represented in Arrow. A section that simply assumes these types are meaningful concepts and gives a lookup table with various footnotes isn't at all helpful to that kind of user. I think it makes more sense to split the work again: the Get Started page should have something simple, and a longer discussion of these mappings should be pushed to a vignette.

      The concrete proposal here is to restructure the content of these two pages to be more novice-friendly: specifically, to add more "Arrow 101" explanatory notes to these pages, and to move more of the technical information to new vignettes (e.g., there should be a new "data types" vignette).

            People

              Assignee: djnavarro Danielle Navarro
              Reporter: djnavarro Danielle Navarro
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 17h 10m
