Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: regionserver
    • Labels: None

      Description

      From Bryan Duxbury supplied log:

         2007-12-15 04:37:56,052 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : spider_pages,7_202623541,1197662034823
         2007-12-15 04:37:56,060 ERROR org.apache.hadoop.hbase.HRegionServer: error opening region spider_pages,7_202623541,1197662034823
         java.io.EOFException
             at java.io.DataInputStream.readByte(DataInputStream.java:250)
             at org.apache.hadoop.hbase.HStoreFile.loadInfo(HStoreFile.java:594)
             at org.apache.hadoop.hbase.HStore.<init>(HStore.java:613)
             at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:287)
             at org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1182)
             at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1133)
             at java.lang.Thread.run(Thread.java:619)
         2007-12-15 04:37:56,061 FATAL org.apache.hadoop.hbase.HRegionServer: Unhandled exception
         java.lang.NullPointerException
             at org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1066)
             at org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1188)
             at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1133)
             at java.lang.Thread.run(Thread.java:619)
         2007-12-15 04:37:56,061 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting
      

      I see the same exception when we try to deploy the same region on another server, so the info file must be horked. It seems like something we could recover from by reading through the store file looking for the highest sequence number; that would be expensive, but the alternative is a lost region.

          Activity

          Bryan Duxbury added a comment -

          Since there's data loss potential here, this seems like it should be higher priority.

          Bryan Duxbury added a comment -

          Has this issue come up again? If not, we should close the issue.

          Jim Kellerman added a comment -

          I think this could still happen if a region server crashes in the middle of a cache flush (which could leave a zero length MapFile or info file).

          We could put some defensive measures in HStore to prevent the problem, and if HLog does not guard against zero length files it should as well.

          Finally, hbase-fsck (or whatever we call it) should check for zero length files.
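The zero-length check suggested above could look roughly like the following. This is only a sketch under stated assumptions: it walks a local directory with java.nio.file as a stand-in for HDFS, and the class and method names (ZeroLengthFileScan, findZeroLengthFiles) are hypothetical, not part of HBase or of any existing fsck tool.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not an HBase class): lists zero-length regular files under a root. */
final class ZeroLengthFileScan {
    private ZeroLengthFileScan() {}

    /** Returns every regular file below root whose size is 0 bytes, in sorted order. */
    static List<Path> findZeroLengthFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(ZeroLengthFileScan::isEmpty)
                .sorted()
                .collect(Collectors.toList());
        } catch (UncheckedIOException e) {
            // Files.walk and isEmpty report I/O failures unchecked; unwrap for callers.
            throw e.getCause();
        }
    }

    private static boolean isEmpty(Path file) {
        try {
            return Files.size(file) == 0L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A tool built on this would only report suspect files; deciding whether a zero-length file can be deleted or rebuilt is a separate step.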

          Bryan Duxbury added a comment -

          Let's add the suggested zero-length checks where appropriate. Jim, do you want to tackle this one?

          Jim Kellerman added a comment -

          I thought that the worker thread would no longer exit in trunk. Are you running trunk or an older version?

          Bryan Duxbury added a comment -

          I haven't seen this problem recently. I was responding to Jim's above comment regarding other situations when this could occur.

          Jim Kellerman added a comment -

          The worker thread will no longer exit.

          However, we still need to deal with bad files.

          HLog can handle zero length log files in splitLog.

          HStoreFile still needs to be made more robust. There are up to 3 files in each HStoreFile:

          • the mapFile
          • the infoFile
          • bloomFilter file (optional)

          For this immediate problem, HStore.loadHStoreFiles should throw an error if one of these files does not exist or is zero length (ignoring the bloomFilter file if the column is not configured with one). The worker thread would catch this error and mark the region offline in the meta (so the master won't try to reassign it), and the region server should tell the master about the bad region. The master and/or the region server needs to notify the user. (How? People complain about stuff like this only being in the logs. In the web UI? On stdout?)
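A minimal sketch of that validation, assuming local paths via java.nio.file rather than HDFS, might look like this. StoreFileCheck, validate, and the bloomFilterConfigured flag are illustrative names, not the actual HStore.loadHStoreFiles code, and each of the three files is treated as a single plain file (a real Hadoop MapFile is a directory holding data and index files, so a real check would look inside it).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; names are illustrative and do not mirror HBase's classes. */
final class StoreFileCheck {
    private StoreFileCheck() {}

    /**
     * Throws IOException if the map file or info file is missing or empty.
     * The bloom filter file is checked only when the column is configured with one.
     */
    static void validate(Path mapFile, Path infoFile, Path bloomFilterFile,
                         boolean bloomFilterConfigured) throws IOException {
        requireNonEmpty(mapFile, "mapFile");
        requireNonEmpty(infoFile, "infoFile");
        if (bloomFilterConfigured) {
            requireNonEmpty(bloomFilterFile, "bloomFilter file");
        }
    }

    private static void requireNonEmpty(Path file, String role) throws IOException {
        if (file == null || !Files.isRegularFile(file)) {
            throw new IOException(role + " is missing: " + file);
        }
        if (Files.size(file) == 0L) {
            throw new IOException(role + " is zero length: " + file);
        }
    }
}
```

Failing with a descriptive IOException before any bytes are read turns the bare EOFException from the log into an error that names the bad file.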

          Moving forward, the user should be directed to run HBase-fsck (or whatever it will be called).

          Additionally, if we are going to create our own 'mapFile' format, why not combine all of these into a single file? Seems kind of silly to have those little info files around, and there is no reason that a bloom filter couldn't be stored in the same file as well.

          Bryan Duxbury added a comment -

          One problem with these busted regions just getting permanently offlined is that there will be holes in the table. We should consider updating the start and end keys for the two regions on either side so this won't be a problem.
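One way to picture the key update is the sketch below. It is an assumption-laden illustration, not HBase code: regions are modelled as plain start/end key strings (start inclusive, end exclusive, "" meaning open-ended), the names RegionHoleRepair and closeHole are made up, and for simplicity a single neighbor absorbs the whole range instead of splitting it between both sides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of closing a key-range hole left by a dropped region. */
final class RegionHoleRepair {
    private RegionHoleRepair() {}

    /** Simplified region: startKey inclusive, endKey exclusive, "" means open-ended. */
    static final class Region {
        final String startKey;
        final String endKey;

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /**
     * Removes the region at brokenIndex from a key-sorted list and stretches a
     * neighbor over its range: the left neighbor if there is one, else the right.
     * The input list is not modified.
     */
    static List<Region> closeHole(List<Region> sorted, int brokenIndex) {
        Region broken = sorted.get(brokenIndex);
        List<Region> result = new ArrayList<>(sorted);
        result.remove(brokenIndex);
        if (brokenIndex > 0) {
            Region left = sorted.get(brokenIndex - 1);
            result.set(brokenIndex - 1, new Region(left.startKey, broken.endKey));
        } else if (!result.isEmpty()) {
            Region right = result.get(0);
            result.set(0, new Region(broken.startKey, right.endKey));
        }
        return result;
    }
}
```

This only makes the key space contiguous again; the rows that lived in the broken region are still gone unless something else recovers them.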

          I agree that the custom mapfile implementation could encompass all of these little external files. Moreover, there will probably be a mandatory bloom filter in the new mapfile.

          Jim Kellerman added a comment -

          I don't envision regions being permanently offline if HBase-fsck can fix them.

          stack added a comment -

          +1 on combining info and bloom filter all into one file (I'd imagine info and bloom filter might be done as metadata on a SequenceFile). Should we also add the index (BigTable adds them to the end of the file)? It will help w/ the too-many-open-files issue.

          Bryan Duxbury added a comment -

          Let's spend some more time planning for the custom mapfile, and integrate these last few comments in the design. If only someone had drafted a provisional new store file interface spec...

          Jim Kellerman added a comment -

          Unmarking this as a blocker for 0.1.0. In 0.1.0, any exception that occurs during region open is caught, and the region is marked off-line.

          More extensive recovery should be slated for 0.2.0.
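The catch-and-offline behavior described for 0.1.0 can be sketched as follows. This is an illustration under assumptions, not the HRegionServer worker: RegionOpenWorker, RegionOpener, and the in-memory online/offline lists are hypothetical stand-ins for the real region open path and the meta table update.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a worker that survives failures while opening regions. */
final class RegionOpenWorker {
    /** Stand-in for whatever actually opens a region; may fail with any exception. */
    interface RegionOpener {
        void open(String regionName) throws Exception;
    }

    private final List<String> online = new ArrayList<>();
    private final List<String> offline = new ArrayList<>();

    /** Opens each region; a failure marks that region offline and the loop continues. */
    void openAll(List<String> regionNames, RegionOpener opener) {
        for (String name : regionNames) {
            try {
                opener.open(name);
                online.add(name);
            } catch (Exception e) {
                // A real server would log the error and update the meta table here.
                offline.add(name);
            }
        }
    }

    List<String> onlineRegions() {
        return Collections.unmodifiableList(new ArrayList<>(online));
    }

    List<String> offlineRegions() {
        return Collections.unmodifiableList(new ArrayList<>(offline));
    }
}
```

The point of the design is that one corrupt region cannot end the worker thread, which is what the FATAL entry in the log above shows happening.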

          Jim Kellerman added a comment -

          hbase-fsck (or whatever we call it) should check for zero length files.

          Since we no longer offline broken regions, as long as hbase-fsck can fix them, then it is no longer a part of HBASE-236

          Jim Kellerman added a comment -

          HBASE-61 will address the new MapFile format discussed in this issue.

          Jim Kellerman added a comment -

          HBASE-433 addressed HLog issues processing zero length files

          Jim Kellerman added a comment -

          HBASE-11 will address EOFExceptions while replaying logs.

          Jim Kellerman added a comment -

          Since all the remaining issues associated with this bug are (or will be) addressed by other Jiras, I am resolving this issue.


            People

            • Assignee: Jim Kellerman
            • Reporter: stack
            • Votes: 0
            • Watchers: 0
