Svndumpsanitizer

Background

This is the home page for my tool svndumpsanitizer. It's a small project born from my experiences with the official subversion tool "svndumpfilter". Svndumpfilter unfortunately does not work with every valid repository, and even though I can't vouch for my program either, I have certainly tried to make it that way. If it doesn't work with some valid repository, that is to be considered a bug. I know it can handle every file I've thrown at it that svndumpfilter couldn't.

Download

The latest stable version can be downloaded here.

The old stable version can be downloaded here.

Please scroll down to the "Differences between v1 and v2"-section for an explanation of the differences.

If you prefer, I've also made this code available via github. Check out branch "v1" for the old stable. (If you for some odd reason need an older version, this is also the place to go.)

Misc info

Language: C
License: GNU GPL v3 (or later)
Latest stable version: 2.0.7
Release date of latest stable version: May 16, 2018
The changelog

The program has been tested on Linux (i386 and x86_64 architectures) and should work out-of-the-box on any system using the GNU toolchain. It uses only standard libraries and should be easily portable, though. The only thing that might cause some snags is the 64 bit file API. (As of 1.0.2 it contains a modification by $ergi0 that should make it possible to build under Windows. I haven't tested that myself, though.)
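To give an idea of what the 64 bit file API issue looks like on a GNU system, here is a minimal sketch. It is not code from svndumpsanitizer; the helper name file_size is made up for illustration, and the assumption is a POSIX system where fseeko and ftello are available. A port to another platform would need that platform's equivalents.

```c
/* Must come before any system header so that off_t becomes 64 bits wide
   even on 32-bit (i386) builds. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: returns the size of a file in bytes, or -1 on error.
   fseeko/ftello are the POSIX variants of fseek/ftell that use off_t
   instead of long, so they keep working past the 2 GB mark. */
off_t file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    off_t size;

    if (f == NULL)
        return -1;
    if (fseeko(f, 0, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    size = ftello(f);
    fclose(f);
    return size;
}
```

With plain fseek/ftell on a 32-bit build, the offset is a long and a dump file larger than 2 GB can't be addressed, which is why this is the part most likely to need attention when porting.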

To compile it, just run:

        gcc svndumpsanitizer.c -o svndumpsanitizer

What svndumpsanitizer doesn't do

To save yourself the time and frustration of trying to use this tool for something it was never intended to do, please note that it does not:

Repair broken dump files. Svndumpsanitizer assumes that it's reading valid data. If you try to give it a broken dump file, the principle of "Garbage in, garbage out" applies.
Work with partial dumps. There is a good technical reason for this. The short version is that svndumpsanitizer uses a different approach than svndumpfilter. Svndumpfilter only reads the data once, but the price is that it has to make assumptions about how the repository is constructed. If even one assumption turns out to be incorrect (almost inevitable in bigger repositories) the operation will fail. Svndumpsanitizer on the other hand first reads all the metadata, then analyzes it, then reads the data again copying it sans the parts the user wanted to omit. The price for this approach is that all the data needs to be there when the process is launched.*) Please scroll down to the "How it works" section for more details.
Read from stdin. As explained above, svndumpsanitizer needs to read the data twice. That means it needs the actual dump file, and cannot read from stdin or any other stream.

*) I know some people have used svndumpsanitizer with partial dumps, and it might work under some circumstances, just be aware that it's neither supported nor recommended. Yes, I know; downtime sucks. Sometimes there just is no other way, though...
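The "needs to read the data twice" restriction can be illustrated with a small sketch. This is not how svndumpsanitizer is actually written; the function names are hypothetical, and the line matching is deliberately naive (a real parser would use the Content-length headers to skip file contents, and would cope with lines longer than the buffer).

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the stream can be repositioned (a regular file),
   0 if it cannot (for example a pipe). */
int can_read_twice(FILE *f)
{
    return fseek(f, 0L, SEEK_CUR) == 0;
}

/* First pass: count the lines that start with "Node-path: ".
   Afterwards the stream is rewound so that a second pass could copy
   the data. Returns -1 if the stream can't be read twice. */
long count_node_paths(FILE *f)
{
    char line[4096];
    long count = 0;

    if (!can_read_twice(f))
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Node-path: ", 11) == 0)
            count++;
    }
    rewind(f); /* back to the start for the second pass */
    return count;
}
```

On a pipe the fseek call fails, so a tool built this way has nothing sensible to do with stdin except refuse it.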

Why svndumpsanitizer is needed

If you've found this page you probably already know. You have a large subversion repository, and you've been charged with the task to filter out some of the paths in the repository while keeping some others - naturally maintaining the entire history of the paths and files that should be kept. You create the dump file and you google the problem. You quickly discover svndumpfilter, and you start feeling hopeful about your task. This is going to work out...

You proceed with reading the man pages of svndumpfilter and when you think you've got it figured out you give it your first shot:

        cat foobar.dump | svndumpfilter include trunk/dowant > clean.dump

You see the program start doing its thing, and you're convinced that you're almost done. Then some time later disaster strikes.

        Revision 8932 committed as 8932.
        Revision 8933 committed as 8933.
        Revision 8934 committed as 8934.
        svndumpfilter: Invalid copy source path '/trunk/donotwant/hello.c'

Hmm... That looks bad. So you try a new strategy. Maybe if you just exclude the unwanted stuff instead.

        cat foobar.dump | svndumpfilter exclude trunk/donotwant branches > clean.dump

You watch svndumpfilter spring into action, but your optimism has already suffered a blow, and a nagging suspicion has taken its place. A while later your worst fears are realized.

        Revision 3461 committed as 3461.
        Revision 3462 committed as 3462.
        Revision 3463 committed as 3463.
        svndumpfilter: Invalid copy source path '/branches/george-test-branch/bork.py'

D'oh! Apparently someone has moved stuff from a place you wanted to exclude to some directory you didn't even know existed (because it's long since been deleted), and which you have thus not been able to exclude. Svndumpfilter in its wisdom naturally didn't tell you what that directory is either. Realizing that this is probably a dead end, you become somewhat discouraged, but since you really need to get the file cleaned up, you go for the brute force approach. It's back to the include strategy, but you add the offending file to the includes.

        cat foobar.dump | svndumpfilter include trunk/dowant trunk/donotwant/hello.c > clean.dump

        ...
        Revision 8932 committed as 8932.
        Revision 8933 committed as 8933.
        Revision 8934 committed as 8934.
        svndumpfilter: Invalid copy source path '/trunk/donotwant/hello.h'

Argh! The header file of the .c file that got you the last time nailed you this time. Well, try, try again. You could of course include the trunk/donotwant directory, but that would defeat the purpose of the filtering, so you add only the offending file. (Also if you do include the entire directory you can run into "fun" surprises where svndumpfilter craps out due to stuff that you're not really interested in that was moved from an excluded directory to the newly included one.)

        cat foobar.dump | svndumpfilter include trunk/dowant trunk/donotwant/hello.c trunk/donotwant/hello.h > clean.dump

        ...
        Revision 8932 committed as 8932.
        Revision 8933 committed as 8933.
        Revision 8934 committed as 8934.
        svndumpfilter: Invalid copy source path '/trunk/donotwant/someotherstupidfile.c'

You can feel your blood pressure rising as you realize that at some point someone has apparently moved a load of files around between the areas of the repository that you want to keep and the areas you do not. Not only is svndumpfilter unable to handle this - it is also unable to tell you about more than one offending file at a time! In order to not have to go through filtering 8934 revisions n times, where n may be an annoyingly large value, you start digging through the dumpfile instead. You locate the offending commit and dig out all the offending files. It takes a fair amount of time, because even though svn dumpfiles are human readable, they aren't all that pleasant to read. With a combination of fury and agony, you try again.

        cat foobar.dump | svndumpfilter include trunk/dowant trunk/donotwant/hello.c trunk/donotwant/hello.h trunk/donotwant/someotherstupidfile.c \
        trunk/donotwant/someotherstupidfile.h trunk/donotwant/foo.c trunk/donotwant/foo.h trunk/dowant trunk/donotwant/blah.c trunk/donotwant/blah.h \
        trunk/donotwant/spreadsheetsarefun.ods trunk/donotwant/andsoarerandombinaries.bin trunk/donotwant/foobar.c trunk/donotwant/foobar.h \
        trunk/donotwant/randomcrud.h trunk/donotwant/randomcrud.c trunk/donotwant/main.c trunk/dowant trunk/donotwant/stop.c trunk/donotwant/stop.h \
        > clean.dump

        ...
        Revision 1713 committed as 1713.
        Revision 1714 committed as 1714.
        svndumpfilter: Invalid copy source path '/branches/quux/hello.c'

It's the same story all over again. At an earlier point in time (some of) these files were moved from another location, and svndumpfilter failed to handle it or tell you about it until it ran head first into a brick wall. As if to taunt you, the failing revision is situated at an earlier point of the repository. (You thought that you had at least progressed to revision 8934? Well, think again. Nyah, nyah, nyah!) Also, even if you fix this issue, you have no idea how many fun little surprises lie ahead.

If you google for this problem you will mostly run into things like suggestions on how to manage your repository in order to avoid files that svndumpfilter can't handle. That's nice advice, but also useless when you're expected to fix a b0rked 40GB dumpfile with a huge number of commits, made by people you don't know and are not allowed to murder.

I could keep going, but I think I've made my point by now. Parsing tons of boring data is a job for a computer, not a human being. And if the dumpfile is valid (i.e. represents any actual repository - not just a repository built in a certain manner) the tool should be able to handle it. Hence svndumpsanitizer.

How it works

It is in fact quite understandable that svndumpfilter doesn't work. It's an aptly named program, because all it does is take a data stream and output the contents to stdout after filtering it on the fly. The problem is that the subversion repository structure is too complicated for such an approach to even have a theoretical chance of working. When the filter is at revision 10 it has no way of knowing whether a node the user wants to discard will be moved in revision 113 to a location the user wishes to keep. So it does the only thing it can do - it discards the node, and at revision 113 craps out because it has already discarded the data it turns out it would have needed.

Svndumpsanitizer works in a different manner. It scans the nodes several times in order to discover which nodes should actually be kept. After it has determined which nodes to keep it writes only these nodes to the outfile. Finally - if necessary - it adds a commit that deletes any unwanted nodes that had to be kept in order not to break the repository. There are 5 steps in total, not counting additional things the user might request.

Read and organize all the nodes and their relevant data.
Parse through the metadata, creating dependencies between the nodes, as well as creating "fake nodes" in order to make implicit operations explicit.
Parse through all the nodes, matching them against the criteria provided. If a node is to be kept, mark it as wanted, and recursively mark all its dependencies wanted as well.
Since nothing depends on delete nodes, some delete nodes that would be useful may have been set to unwanted. The delete nodes are parsed again, and these nodes are identified and brought back.
Write the source file to the outfile sans the unwanted nodes.

For the purpose of keeping the user informed of the progress, steps 1, 2 and 5 are shown, because these will take the lion's share of the time. Those who are interested in the technical details can look here.
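Step 3 above (mark a node as wanted, then recursively mark all its dependencies) can be sketched roughly as follows. This is only an illustration of the idea under simplifying assumptions: struct node and the function names are invented here and are not the data structures svndumpsanitizer really uses, and fake nodes and delete nodes are left out entirely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified node record. */
struct node {
    const char *path;    /* the Node-path from the dump */
    struct node **deps;  /* nodes this node depends on, e.g. copy sources */
    size_t dep_count;
    int wanted;
};

/* Mark a node as wanted, then do the same for everything it depends on.
   The early return also stops the recursion if a node is reached twice. */
void mark_wanted(struct node *n)
{
    size_t i;

    if (n->wanted)
        return;
    n->wanted = 1;
    for (i = 0; i < n->dep_count; i++)
        mark_wanted(n->deps[i]);
}

/* Returns 1 if path equals prefix or lies beneath it. */
int path_matches(const char *path, const char *prefix)
{
    size_t len = strlen(prefix);

    return strncmp(path, prefix, len) == 0 &&
           (path[len] == '\0' || path[len] == '/');
}

/* Match every node against one include path and pull in dependencies. */
void mark_included(struct node *nodes, size_t count, const char *include)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (path_matches(nodes[i].path, include))
            mark_wanted(&nodes[i]);
}
```

If trunk/dowant/hello.c was copied from trunk/donotwant/hello.c, which in turn was copied from branches/quux/hello.c, then including trunk/dowant marks all three as wanted, while an unrelated file in trunk/donotwant stays unwanted. That is exactly the chain a single-pass filter can't know about in advance.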

Differences between v1 and v2

Version 2 of svndumpsanitizer came about because of a few shortcomings of v1 that turned out to be too problematic to deal with using the simplistic v1 logic. More specifically:

The purely directory-based logic of v1 had grown quite complicated and unwieldy. Bugfixing was a nightmare.
There is a bug in v1 that can leave in too much unnecessary data.
Version 1 completely ignores svn:mergeinfo.
The redefine-root-feature doesn't work properly with all repositories.

Version 2 deals with these things in the following ways:

Uses a tree-based structure with strictly defined dependencies to keep track of everything. Details here.
Thanks to the clearer logic, keeping only that which is necessary is much easier.
Tries to deal with svn:mergeinfo on a "best effort" basis. If it can't make sense of the data it will be ignored, but it will at least try.
Version 2 will try to identify problematic repositories. For these a warning will be displayed, and the operation performed without the redefine.

Version 2 also has a new feature "query", which allows the user to query the rationale behind keeping a particular file. Furthermore, by default the new version does not delete the "remaining cruft" that is left in the repo because it had to be kept due to dependencies. I realized there was a risk of this feature papering over bugs if activated by default. The feature is still available, but has to be explicitly requested.

In addition version 2 is typically faster than version 1 (even if I've had reports of the opposite with really big repositories and lots of include-conditions). The one area where v1 is clearly superior to v2 is memory consumption. The lack of data structures in v1 means it doesn't need any memory to hold them. This is also just about the only case where I would recommend v1: if you have a really big repository, are strapped for memory and don't consider the above issues to be a problem.

Q: What's the deal with svn:mergeinfo?
A: Basically svn:mergeinfo is a huge freaking mess that bears all the hallmarks of a half-assed feature that was slapped on as an afterthought when the system was already in production. I'm not exactly sure when it was added, but the first red-bean book to include it is version 1.5. The repos I originally needed svndumpsanitizer for didn't have it, so I blissfully didn't even know about this part of svn when I wrote the tool. If you google for svn:mergeinfo, you'll mostly find frustrated users who want to know how to sort out the mess it's created. The red-bean book recommends that you "not change the value of this property yourself, unless you really know what you're doing." so I'm expecting this to be a source of bugs and neverending annoyance... If you really want your mergeinfo untouched, you need to avoid using the "drop-empty" feature.

Q: Will you fix the issues for the v1 series?
A: Short answer: No. Long answer: If I could have done that there would have been a 1.2.13, rather than a 2.0.0. When I said that bugfixing the v1 logic was horrible, I meant it. Version 1 is no longer supported.

Q: So how much memory does the new version use?
A: Depends on your repository. In my own tests, a 71 GB dumpfile with nearly 380,000 revisions almost (but not quite) brought a machine with 8 GB RAM + 2 GB swap to its knees. Having plenty of swap seems to compensate, though, as a machine with 4 GB RAM + 4 GB swap made it past revision 350,000 before the OOM killer whacked the process.

Limitations and upcoming features

Version 0.8.0 adds support for dropping and renumbering revisions, ~~so starting then there are no serious known limitations.~~

Unlike svndumpfilter, svndumpsanitizer does not have 2 different switches for dropping and renumbering. If you want to drop the empty revisions, you typically want to renumber them as well, and having 2 switches would have added complexity to the code, so I decided against it.
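Dropping and renumbering in one go boils down to building a table from old revision numbers to new ones. The following is a minimal sketch of that idea, not the actual implementation; the function name and the keep array are assumptions made for the example, as is the choice to always keep revision 0.

```c
#include <stddef.h>

/* keep[i] is nonzero if old revision i still contains something after
   filtering. new_number[i] receives the new revision number, or -1 if
   the revision is dropped. Revision 0 is always kept in this sketch.
   Returns the number of the last revision after renumbering. */
long renumber(const int *keep, long *new_number, size_t count)
{
    long next = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (i == 0 || keep[i])
            new_number[i] = next++;
        else
            new_number[i] = -1;
    }
    return next - 1;
}
```

Anything in the dump that refers to a revision by number (such as a Node-copyfrom-rev header) would presumably have to be translated through the same table, which is one reason why dropping without renumbering is rarely what you want.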

The feature to move/rename directories on the fly was too cumbersome to implement, and would have required a complete re-design of how svndumpsanitizer works. Instead, I settled for implementing the most common use case, which is redefining the repository root (i.e. moving stuff up in the directory structure). This feature was implemented in version 1.1.0, and is still considered to be of beta-quality. Starting with version 2.0.0 beta 2 the feature has been re-enabled. It now checks for problematic cases, and if the repository structure makes the redefinition problematic, it simply issues a warning and does not do the renaming.
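At the level of a single path, redefining the root is just stripping a prefix. The sketch below shows that operation and nothing more; strip_root is a made-up name, and the guess that a kept path outside the new root is one kind of "problematic case" is my assumption for the example, not a description of the actual checks.

```c
#include <stddef.h>
#include <string.h>

/* Returns the part of path below root ("" if path is the root itself),
   or NULL if path does not lie inside root at all. The returned pointer
   points into path; nothing is allocated. */
const char *strip_root(const char *path, const char *root)
{
    size_t len = strlen(root);

    if (strncmp(path, root, len) != 0)
        return NULL;
    if (path[len] == '\0')
        return path + len;      /* the new root directory itself */
    if (path[len] == '/')
        return path + len + 1;  /* skip the separator as well */
    return NULL;                /* e.g. "trunk/dowantmore" vs "trunk/dowant" */
}
```

With root set to trunk/dowant, the path trunk/dowant/src/main.c becomes src/main.c. A NULL result for a node that has to be kept would mean the node has no place under the new root, in which case not doing the redefine at all is the safe choice.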

It has been pointed out to me that svndumpsanitizer creates a lot of unnecessary newlines in the sanitized files. This is true. If you look at the changelog you'll see that I tried to address this in version 0.8.2, but eventually got rid of it, because the "fix" introduced a bug, and the problem is only cosmetic. Svnadmin ignores surplus newlines anyway. If you absolutely must have a clean dump file (instead of "merely" a working one), the workaround is to import the dump and then dump again.

Bug reports

Bugs can be reported to daniel[dot]suni[at]protonmail[dot]com. If the problem is with a specific dumpfile, please include the offending dumpfile. If the contents of the repository are too sensitive/secret/embarrassing/too freakin' huge to post, then please try to recreate the problem with a simple non-sensitive dumpfile.

You can also try creating a non-sensitive dumpfile by using dumpstrip, a tool that strips out all the data, leaving only the metadata (which is usually the interesting part from a debugging perspective). Dumpstrip is used like this:

        dumpstrip --infile foobar.dump --outfile stripped.dump

Oh, and if you can code and use gdb, patches are of course welcome. Thanks, Gary. :-)
