[SVN-2286] Identical files should share storage space in repository - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: trunk
Fix Version/s: 1.6.0
Component/s: libsvn_fs
Labels: None

Description

See the link at the end of this description for discussion.

When branches are used, identical changes are often made to copied files;
this results in wasted storage space.

Using e.g. the MD5 hash as an index, it should be possible to find such
duplicates and, instead of storing a new delta or even a fulltext, just save a
reference to the other "inode" in the repository (in the simplest case
[filename,revision]; some internal pointer would be better for speed reasons).
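
The lookup described above could be sketched roughly as follows. This is only an illustration in Python, not FSFS code: the class name `Md5Index`, its `store` method, and the in-memory dictionary are assumptions made for the example, and the "inode" is the simplest-case [filename,revision] pair.

```python
import hashlib


class Md5Index:
    """Hypothetical in-memory index: MD5 hex digest -> (filename, revision)."""

    def __init__(self):
        self._by_digest = {}

    def store(self, filename, revision, content):
        """Return the (filename, revision) that holds `content`.

        If identical content was seen before, the earlier location is
        returned and nothing new is recorded. Otherwise this location
        is registered as the first holder of that content.
        """
        digest = hashlib.md5(content).hexdigest()
        return self._by_digest.setdefault(digest, (filename, revision))


index = Md5Index()
text = b"int main(void) { return 0; }\n"
first = index.store("trunk/foo.c", 10, text)
second = index.store("branches/b1/foo.c", 12, text)  # same bytes as `first`
```

Here `second` points back at the trunk copy, so a real implementation would store only that reference instead of a new delta or fulltext. A production version would presumably also compare the contents themselves before sharing, since equal digests alone do not prove equal files.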

Con: FSFS could no longer be append-only; the indices would have to be written and rewritten.

Furthermore, I'd like this index to be more of a cache, so that it can be
generated, deleted and regenerated at any time (at very high speed, as an MD5
is already archived for every file).

For FSFS I'd suggest adding a new directory that uses two levels of indirection
down the hierarchy.
E.g. for a file with an MD5 of 8a04f87ad04f4a1d3c7e6ca12e07290d:

repository/
  dav/
  ...
  db/
    revs/
    revprops/
    transactions/
    md5index/
      8a/
        04/
          f87a.index
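
The mapping from digest to index file in the layout above can be written down as a small sketch. The function name `index_path` is made up for this example, and the split points (two, two, then four hex digits) are read off the single example path given here rather than from any specification.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def index_path(md5_hex):
    """Map a 32-character lowercase MD5 hex digest to its index file.

    The first two hex digits name the first directory level, the next
    two the second level, and the following four the index file itself.
    """
    if len(md5_hex) != 32 or not set(md5_hex) <= HEX_DIGITS:
        raise ValueError("expected a 32-character lowercase hex digest")
    return PurePosixPath(
        "db", "md5index", md5_hex[0:2], md5_hex[2:4], md5_hex[4:8] + ".index"
    )


path = index_path("8a04f87ad04f4a1d3c7e6ca12e07290d")
print(path)  # db/md5index/8a/04/f87a.index
```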

If this index has more than, say, 256 entries (which should be kept sorted in
the file), the file could be split into 16 new parts.
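
One possible reading of the split, sketched in Python: the 16 parts are keyed by the next hex digit after the prefix the index file already covers. That keying, the function name `split_index`, and the representation of entries as plain digest strings are all assumptions for illustration; the proposal itself only gives the 256-entry threshold and the number of parts.

```python
def split_index(prefix, entries, limit=256):
    """Split a sorted list of MD5 hex digests that share `prefix`.

    Returns a dict mapping a prefix to its sorted entries. At or below
    `limit` entries the index stays whole; above it, the entries are
    distributed over 16 parts by the hex digit that follows `prefix`.
    """
    if len(entries) <= limit:
        return {prefix: list(entries)}
    parts = {prefix + digit: [] for digit in "0123456789abcdef"}
    for digest in entries:  # input order is kept, so each part stays sorted
        parts[digest[: len(prefix) + 1]].append(digest)
    return parts


# 320 made-up digests under the prefix from the example above.
entries = sorted(
    "8a04f87a" + "0123456789abcdef"[i % 16] + format(i, "023x")
    for i in range(320)
)
parts = split_index("8a04f87a", entries)
```

Because the input is already sorted, each part comes out sorted as well, so no re-sorting is needed after a split.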


I believe that could save a lot of space, especially for scenarios with many
branches.

http://marc.theaimsgroup.com/?l=subversion-dev&m=111319801911398&w=2

Original issue reported by pmarek

People

Assignee: Unassigned

Reporter: Subversion Importer

Dates

Created: 28/Apr/05 11:31

Updated: 27/Oct/18 16:34

Resolved: 10/Mar/09 15:22