Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as the protocol-level response code, the lastModified time, and possibly other messages.
I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value.
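To make the proposal concrete, here is a minimal sketch of the idea. It does not use the real Nutch classes: `Status` and `Datum` are simplified stand-ins for ProtocolStatus and CrawlDatum, the metadata key `protocol.status` is hypothetical, and treating a lastModified value greater than zero as "valid" is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class ProtocolStatusSketch {

    // Hypothetical metadata key; not taken from Nutch.
    static final String STATUS_KEY = "protocol.status";

    /** Simplified stand-in for ProtocolStatus (assumption, not the real class). */
    static final class Status {
        final int code;
        final long lastModified; // ms since epoch; 0 means "not reported" (assumption)
        final String message;

        Status(int code, long lastModified, String message) {
            this.code = code;
            this.lastModified = lastModified;
            this.message = message;
        }

        @Override
        public String toString() {
            return "code=" + code + ", lastModified=" + lastModified + ", message=" + message;
        }
    }

    /** Simplified stand-in for CrawlDatum (assumption, not the real class). */
    static final class Datum {
        final Map<String, String> metaData = new HashMap<>();
        long modifiedTime;
    }

    /** Copies the protocol status into the datum's metadata before it is written out. */
    static void attach(Datum datum, Status status) {
        datum.metaData.put(STATUS_KEY, status.toString());
        // Only overwrite the modified time when the protocol actually reported one.
        if (status.lastModified > 0) {
            datum.modifiedTime = status.lastModified;
        }
    }

    public static void main(String[] args) {
        Datum datum = new Datum();
        attach(datum, new Status(200, 1_700_000_000_000L, "OK"));
        System.out.println(datum.metaData + " modified=" + datum.modifiedTime);
    }
}
```

In the real Fetcher the equivalent of `attach` would presumably be called in FetcherThread.output() just before the datum is collected into crawl_fetch.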
Additionally, Fetcher doesn't store redirected pages: the content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status, it should probably store such pages using the following translation of status codes:
- ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad.
- ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects.
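The translation above could be sketched as follows. The enums are simplified stand-ins for the ProtocolStatus and CrawlDatum constants (the real ones are not enums), and the `OTHER` value and the `Optional` return type are illustrative assumptions for codes this proposal does not cover.

```java
import java.util.Optional;

final class RedirectStatusSketch {

    /** Stand-in for the protocol-level codes; OTHER is a hypothetical catch-all. */
    enum ProtocolCode { MOVED, TEMP_MOVED, OTHER }

    /** Stand-in for the crawldb-level statuses named in the proposal. */
    enum DbStatus { DB_RETRY, DB_GONE }

    /** Maps a redirect code to the proposed crawldb status; empty for non-redirects. */
    static Optional<DbStatus> toDbStatus(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL and retry it later.
                return Optional.of(DbStatus.DB_RETRY);
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return Optional.of(DbStatus.DB_GONE);
            default:
                // Not a redirect; handled elsewhere (outside this sketch).
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (ProtocolCode code : ProtocolCode.values()) {
            System.out.println(code + " -> " + toDbStatus(code));
        }
    }
}
```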