Technical details

This document is mostly of interest to those who want to read, understand, or muck around with the code, but anyone who really wants to understand what happens to the repository, and why, might want to read it as well.

Some concepts

Revision: I assume anyone who has come this far knows what a revision is. In Subversion, revisions are consecutively numbered starting from 0, and each contains an arbitrary number of actions. In svndumpsanitizer this is represented by the revisions array. revisions[8] will unsurprisingly contain the actions of revision 8.

Node: Any single action constitutes a "node" in svndumpsanitizer, and is the most primitive data unit used. Each revision contains a "nodes" array, so that revisions[8].nodes[1] points to the second action of revision 8. The fundamental idea behind svndumpsanitizer is to parse all the metadata in the repository, and arrive at a "take it" or "leave it" judgement for each and every node in the repository.
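The indexing described above could be laid out roughly as follows. This is a minimal sketch, not svndumpsanitizer's actual declarations: the field names path, wanted and node_len, and the helper get_node, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout; the real structs carry more fields. */
typedef struct {
    const char *path;   /* path the action applies to */
    int wanted;         /* 1 = "take it", 0 = "leave it" */
} node;

typedef struct {
    node *nodes;        /* actions of this revision, in dump order */
    size_t node_len;    /* number of entries in nodes */
} revision;

/* Return the idx-th action of revision rev, or NULL if either index is out of range. */
static node *get_node(revision *revisions, size_t rev_len, size_t rev, size_t idx)
{
    assert(revisions != NULL);
    if (rev >= rev_len || idx >= revisions[rev].node_len) {
        return NULL;
    }
    return &revisions[rev].nodes[idx];
}
```

With this layout, get_node(revisions, rev_len, 8, 1) returns the same address as &revisions[8].nodes[1], provided revision 8 has at least two actions.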

Fake node: An svn repository can be viewed as a large (and rather fragile) state machine. The result of some actions will therefore depend on the state of the repository when the action was made. The result of "svn mv trunk/foo trunk/bar" will depend on the contents of trunk/foo when the command was run; the same goes for "svn rm branches/baz". Such nodes will therefore spawn many implicit actions - things that happen, even though it's not immediately visible from just looking at the node itself. In order to make the repository analyzable in an atomic manner, svndumpsanitizer creates fake nodes to make the implicit actions explicit. They are stored separately in the "fakes" array of each revision.
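To illustrate what "making the implicit explicit" means for a directory delete, here is a small sketch. It assumes that the set of paths alive at the time of the delete is already known as a flat array, which is a simplification; the function names is_beneath and implied_deletes are hypothetical and not taken from the code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if path lies strictly beneath dir, e.g. "branches/baz/a.c" beneath "branches/baz". */
static int is_beneath(const char *dir, const char *path)
{
    size_t len;
    assert(dir != NULL && path != NULL);
    len = strlen(dir);
    /* The '/' check keeps "branches/bazaar" from matching "branches/baz". */
    return strncmp(dir, path, len) == 0 && path[len] == '/';
}

/* Collect the live paths that a delete of dir removes implicitly.
   Writes at most out_cap entries to out and returns how many were written. */
static size_t implied_deletes(const char *dir, const char **live, size_t live_len,
                              const char **out, size_t out_cap)
{
    size_t i, n = 0;
    for (i = 0; i < live_len && n < out_cap; ++i) {
        if (is_beneath(dir, live[i])) {
            out[n++] = live[i];
        }
    }
    return n;
}
```

Each path returned here would correspond to one fake delete node attached to the revision that contains the real "svn rm".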

Node action: Since nodes are just actions in a repository, every node belongs to one of four types. Why four? Because that's the number of types defined by svn. They are "add", "delete", "change" and (the very rare) "replace". Whenever I refer to "an add node", I mean the action behind the node. In the code it can be referenced by node.action.
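In a dump file the action appears as the value of the "Node-action:" header. A sketch of mapping that value to an enum might look like this; the enum and function names are assumptions for illustration, and the code's own representation of node.action may differ.

```c
#include <assert.h>
#include <string.h>

/* The four actions defined by svn, plus a marker for anything unrecognised. */
typedef enum {
    ACTION_ADD,
    ACTION_DELETE,
    ACTION_CHANGE,
    ACTION_REPLACE,
    ACTION_UNKNOWN
} node_action;

/* Map the text after "Node-action: " to the matching enum value. */
static node_action parse_action(const char *value)
{
    assert(value != NULL);
    if (strcmp(value, "add") == 0)     return ACTION_ADD;
    if (strcmp(value, "delete") == 0)  return ACTION_DELETE;
    if (strcmp(value, "change") == 0)  return ACTION_CHANGE;
    if (strcmp(value, "replace") == 0) return ACTION_REPLACE;
    return ACTION_UNKNOWN;
}
```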

Dependency: Most nodes depend on one or more other nodes for their existence. (The only exceptions are add nodes in the repository root.) There are a few different dependency types:

Node dependencies. Red color indicates a fake node.

Previous version dependency: Any non-add node will depend on the previous version of the file. E.g. the revision 7 node that changes file trunk/p1/foo.txt depends on the node that changed the same file in revision 4. That in turn depends on the node that added it in revision 2.

Directory dependency: Unlike version control systems like git, where directories only exist as epiphenomena of file locations (which is why you can't add an empty directory to a git repo), in svn directories exist by themselves. This means that you can't add the file trunk/p1/foo.txt unless the directory trunk/p1 exists first. Consequently foo.txt depends on trunk/p1, which in turn depends on trunk. To reduce redundancy, these dependencies are attached to the add nodes only.
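The chain trunk/p1/foo.txt -> trunk/p1 -> trunk can be derived from the path alone. The helper below is a hypothetical sketch of that step (the name parent_dir and its signature are not from the code); finding the add node that actually created the parent directory is a separate lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the parent directory of path into out.
   Return 1 on success, 0 if path sits in the repository root or out is too small. */
static int parent_dir(const char *path, char *out, size_t out_cap)
{
    const char *slash;
    size_t len;

    assert(path != NULL && out != NULL);
    slash = strrchr(path, '/');
    if (slash == NULL) {
        return 0;               /* no parent: an add here has no directory dependency */
    }
    len = (size_t)(slash - path);
    if (len + 1 > out_cap) {
        return 0;
    }
    memcpy(out, path, len);
    out[len] = '\0';
    return 1;
}
```

Calling it repeatedly on its own result walks up the tree until it returns 0, at which point the repository root has been reached.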

Copyfrom dependency: A node that is copied from another will unsurprisingly depend on the node it was copied from. The tricky bit is that the node in question may not reside in the revision indicated by the copyfrom-revision attribute. Due to the state machine nature of svn it may be (and frequently is) situated in an earlier revision.
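The "earlier revision" lookup amounts to finding the latest node for the source path whose revision does not exceed the copyfrom-revision. A minimal sketch, assuming the revisions in which the source path was touched are available as an ascending array (the function name find_copy_source is hypothetical, and binary search is my choice here, not necessarily what the code does):

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the last entry with revs[i] <= copyfrom_rev, or -1 if there is none.
   revs must be sorted in ascending order. */
static long find_copy_source(const long *revs, size_t len, long copyfrom_rev)
{
    size_t lo = 0, hi = len;    /* search window is [lo, hi) */

    assert(revs != NULL || len == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (revs[mid] <= copyfrom_rev) {
            lo = mid + 1;       /* revs[mid] qualifies; look for a later one */
        } else {
            hi = mid;
        }
    }
    /* lo is now the number of entries <= copyfrom_rev. */
    return (long)lo - 1;
}
```

For a path touched in revisions 2, 4 and 7, a copy with copyfrom-revision 6 resolves to the revision 4 node. A real lookup would also have to check that the node found is not a delete.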

Mergefrom dependency: When one branch of the code is merged into another, the resulting file will depend on the file which was merged into it. This requires that "nice and clean" mergeinfo exists to begin with. If the mergeinfo can't be deciphered, it will be ignored.
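As a rough illustration of what "nice and clean" could mean, the sketch below parses a single svn:mergeinfo line of the form "/branches/b1:3-7,9" and rejects anything it cannot decipher. The names rev_range and parse_mergeinfo_line are assumptions, and the strictness rules are mine; the actual parser may accept more or less.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct { long start; long end; } rev_range;

/* Parse one mergeinfo line. On success, copy the source path into path,
   fill ranges, and return the number of ranges. Return -1 if the line is malformed. */
static int parse_mergeinfo_line(const char *line, char *path, size_t path_cap,
                                rev_range *ranges, int range_cap)
{
    const char *colon, *p;
    char *end;
    size_t plen;
    int n = 0;

    assert(line != NULL && path != NULL && ranges != NULL);
    colon = strrchr(line, ':');             /* last colon separates path from revisions */
    if (colon == NULL || colon == line) return -1;
    plen = (size_t)(colon - line);
    if (plen + 1 > path_cap) return -1;
    memcpy(path, line, plen);
    path[plen] = '\0';

    p = colon + 1;
    while (*p != '\0') {
        rev_range r;
        if (n == range_cap) return -1;
        if (!isdigit((unsigned char)*p)) return -1;
        r.start = strtol(p, &end, 10);
        r.end = r.start;                    /* a single revision is a one-element range */
        p = end;
        if (*p == '-') {
            ++p;
            if (!isdigit((unsigned char)*p)) return -1;
            r.end = strtol(p, &end, 10);
            p = end;
        }
        if (*p == '*') ++p;                 /* skip the non-inheritable marker */
        if (r.end < r.start) return -1;
        ranges[n++] = r;
        if (*p == ',') {
            ++p;
            if (*p == '\0') return -1;      /* trailing comma */
        } else if (*p != '\0') {
            return -1;
        }
    }
    return n > 0 ? n : -1;
}
```

A return value of -1 corresponds to the "ignore it" case: no mergefrom dependency is recorded for that line.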

Repotree: If the repository structure organizes the data by time, the repotree structure is a construct that takes the same data and arranges it by space (path) and time. Each repotree struct has a unique path and contains a chronologically sorted map of all nodes with that exact path. Note that although the map would typically contain different versions of the same file, it's possible (across time) to have files with the same path and name that have nothing else in common. A delete node in the map is the natural watershed between such files.
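The watershed idea can be sketched as follows. The struct layout and the name incarnation_start are assumptions for illustration, and the "map" is modelled as a plain array sorted by revision; the real structure may look different.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ACT_ADD, ACT_DELETE, ACT_CHANGE, ACT_REPLACE } act;

typedef struct {
    long rev;               /* revision the node belongs to */
    act action;             /* what the node did to the path */
} tree_entry;

typedef struct {
    const char *path;       /* unique for each repotree struct */
    tree_entry *history;    /* all nodes with this path, ascending by revision */
    size_t len;
} repotree;

/* Return the index of the first entry that belongs to the same file as history[idx],
   by scanning back to just after the nearest earlier delete node. */
static size_t incarnation_start(const repotree *t, size_t idx)
{
    size_t i = idx;

    assert(t != NULL && idx < t->len);
    while (i > 0 && t->history[i - 1].action != ACT_DELETE) {
        --i;
    }
    return i;
}
```

For a path that is added in revision 2, changed in 4, deleted in 5, added again in 8 and changed in 9, the revision 9 node traces back to the add in revision 8, not to the one in revision 2.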

Repo tree. Any path ever featured in the repository will be added