Nutch / NUTCH-444

Possibly use a different library to parse RSS feeds for improved performance and compatibility

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None

    Description

      As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:

      • OutOfMemory errors when parsing feeds larger than 100k, since it has to convert the entire feed to a JDOM document first
      • no support for Atom 1.0
      • there has been no development in the last year

      Alternatives are:

      • Rome
      • Informa
      • custom implementation based on Stax
      • ??
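
      To make the StAX option concrete, here is a minimal sketch of a streaming parser that reads RSS <item> and Atom <entry> elements one event at a time, so no in-memory tree of the whole feed is built. The names StaxFeedSketch, Entry, and parse are hypothetical, chosen only for illustration. This is not the code from any of the attached patches, and it ignores namespaces, dates, and content bodies.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: streams a feed with StAX instead of building a full tree. */
final class StaxFeedSketch {

    /** One feed entry; this sketch keeps only the title and the link. */
    static final class Entry {
        public final String title;
        public final String link;

        Entry(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Returns the entries found in an RSS 2.0 or Atom 1.0 stream. */
    static List<Entry> parse(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted hosts, so turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);

        List<Entry> entries = new ArrayList<>();
        boolean inEntry = false;
        String title = null;
        String link = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Local names only: namespaces are ignored in this sketch.
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inEntry = true;
                        title = null;
                        link = null;
                    } else if (inEntry && name.equals("title")) {
                        // Assumes text-only titles; getElementText() throws on child elements.
                        title = reader.getElementText().trim();
                    } else if (inEntry && name.equals("link")) {
                        // Atom puts the URL in href, RSS puts it in the element text.
                        String href = reader.getAttributeValue(null, "href");
                        String text = reader.getElementText().trim();
                        if (link == null) { // keep the first link of each entry
                            link = (href != null) ? href : text;
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (inEntry && (name.equals("item") || name.equals("entry"))) {
                        entries.add(new Entry(title, link));
                        inEntry = false;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return entries;
    }
}
```

      Memory use in this sketch grows with the number of entries kept rather than with the size of a parsed document tree. A library such as Rome or Informa would cover far more of the RSS and Atom formats than this does.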

      Attachments

        1. parse-feed.tar.bz2
          332 kB
          Dogacan Guney
        2. parse-feed-v2.tar.bz2
          332 kB
          Dogacan Guney
        3. feed.tar.bz2
          327 kB
          Dogacan Guney
        4. NUTCH-444.patch
          2 kB
          Dogacan Guney
        5. NUTCH-444.Mattmann.061707.patch.txt
          32 kB
          Chris A. Mattmann
        6. NUTCH-444.1-1.patch
          0.7 kB
          Dennis Kubes

          People

            Assignee: Chris A. Mattmann
            Reporter: Renaud Richardet
            Votes: 1
            Watchers: 0
