svndumpapi's Introduction

Badges: build status, code coverage, license (GNU Affero GPL v3)

SVN Dump API

An API for reading, editing, and writing SVN dump files.

Background

SVN dump files are created via the svnadmin dump command, and contain all the history of an SVN repository. An SVN dump file contains a list of revisions (see Revision), and each revision contains a list of nodes (see Node).

Revisions can have properties such as author, date, and commit message. Nodes can have properties too, which are maintained on a node-by-node basis.

Related Work

I'm not the first one to have this idea. Here are some links:

Model

SvnDumpFileParser

The SvnDumpFileParser is an auto-generated parser for SVN dump files (files created with svnadmin dump). It parses SVN dump files into a Repository object. The Repository representation is meant to be very lightweight and does minimal validation.

The parser is auto-generated using JavaCC (Java Compiler Compiler) from the svndump.jj grammar file. This grammar generates a parser that is dependent on the Java interfaces and classes in this project.

Repository Summary

To get an svn log-like summary of your dump file, you can use the RepositorySummary (sample output here).

Consumers

A RepositoryConsumer consumes the various pieces of a Repository. Its specializations include the mutators, validators, and writers described below.

Consumers (and therefore any of their specializations) can be chained together to achieve complex operations on SVN dump files using the continueTo(RepositoryConsumer) method.

Mutators

The API allows changing an SVN dump file via RepositoryMutator implementations.

Some useful mutators are:

  • ClearRevision - empties a revision (removes all changes; the revision itself is preserved so that references to revision numbers still work)
  • PathChange - updates file/dir paths
  • NodeRemove - removes an individual file change from a revision
  • NodeAdd - adds a newly crafted change to a specific revision
  • NodeHeaderChange - changes a specific property on an existing SvnNode

To apply multiple mutators in sequence, you can chain them together using RepositoryConsumer.continueTo(RepositoryConsumer).
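For example, a short chain that clears a revision, renames a path, and writes the result to standard output might look like the sketch below. The constructor arguments are illustrative, and it assumes continueTo(...) appends to the tail of the chain; check each class for its actual signature.

import java.io.FileInputStream;

// Illustrative chain; constructor arguments are assumptions.
RepositoryConsumer chain = new ClearRevision(5);                    // empty revision 5
chain.continueTo(new PathChange("trunk/oldName", "trunk/newName")); // rename a path
chain.continueTo(new SvnDumpWriter(System.out));                    // write the result

SvnDumpFileParser.consume(new FileInputStream("svn.dump"), chain);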

Validators

When you start messing with your SVN history via the mutators, you can be left with an SVN dump file that cannot be imported back into an SVN repository. To make changing SVN history easier, the API has the concept of a RepositoryValidator.

Validation is done while the data is in memory, which is much faster than running it through svnadmin load.

Some useful validators:

  • PathCollisionValidator - checks that file operations are valid (don't delete non-existent files, don't double-add files, and check that source files exist when making copies)
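Since a validator is just another RepositoryConsumer, it can be chained ahead of a writer so the stream is validated as it is rewritten. A sketch follows; how a validation failure is surfaced depends on the validator implementation, and the writer constructor is an assumption.

import java.io.FileInputStream;

// Illustrative: validate path operations while writing the dump back out.
RepositoryConsumer chain = new PathCollisionValidator();
chain.continueTo(new SvnDumpWriter(System.out)); // writer constructor is an assumption
SvnDumpFileParser.consume(new FileInputStream("svn.dump"), chain);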

Usage

Command Line Interface

The bin/run-java shell script will run the CliConsumer.
The current usage pattern is to modify the CliConsumer and create your chain programmatically, then do:

mvn clean install dependency:copy-dependencies
cat file.dump | ./bin/run-java > output.dump

or, if your repository is too large for a single file:

mvn clean install dependency:copy-dependencies
svnadmin create /path/to/newrepo
svnadmin dump /path/to/repo | ./bin/run-java | svnadmin load -q /path/to/newrepo

Example: AgreementMaker

To see how all these pieces fit together to allow you to edit SVN history, you can look at an SVN repository cleanup that I did for the AgreementMaker project. All the operations on the SVN dump file are detailed in this test.

Reading an SVN dump file

Parsing an SVN dump file is straightforward. Here's an example that uses a single consumer to read the SVN dump into memory:

import java.io.FileInputStream;
import java.io.InputStream;

// Read the whole dump into an in-memory Repository via a single consumer.
RepositoryInMemory inMemory = new RepositoryInMemory();
InputStream is = new FileInputStream("svn.dump");
SvnDumpFileParser.consume(is, inMemory);

Repository svnRepository = inMemory.getRepo();

See SvnDumpFileParserTest for usage patterns of the parser.

Developing

Coverage Report

To get a JaCoCo coverage report, run the following:

mvn clean test jacoco:report

The coverage report output will be in HTML format in target/site/jacoco/index.html.

svndumpapi's People

Contributors

cstroe, dependabot[bot]


svndumpapi's Issues

Describe our domain model

This project works in the domain of rewriting Subversion history. It fills in the tooling gap that exists in this domain, especially around the problems of upgrading Subversion 1.6 repositories to be used with newer Subversion versions, removing large binaries from Subversion history, and getting a Subversion repository in a proper form to be converted to Git.

Part of the documentation should include some sort of graphical representation of how we modeled the domain, and of the process by which an SVN dump file is read. We can use these documents to bootstrap a common language that people can use when speaking about this project. The chart should be easy to update over time, as needed.

SvnDumpFileParser.readByteArray() is very slow

Reading large files that are part of an SvnNode takes a disproportionately long time. We currently read the file content of an SvnNode with the readByteArray() method. This is slow for some reason, and I don't know why.

We need to fix this so that reading streams doesn't take a very long time.
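A usual suspect for this kind of slowdown is reading one byte per call from an unbuffered stream. As a point of comparison, here is a minimal sketch of reading a fixed-length span in bulk; it is illustrative only, not the project's actual readByteArray().

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative: read exactly `length` bytes with bulk read calls
// instead of one InputStream.read() per byte.
static byte[] readSpan(InputStream in, int length) throws IOException {
    byte[] buffer = new byte[length];
    new DataInputStream(in).readFully(buffer); // loops internally over read(byte[], off, len)
    return buffer;
}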

Add mutator to modify svn:mergeinfo

svn:mergeinfo is an SVN property that can be present on any SvnNode, but is usually found on root directories.

A well-known problem happens when using Subversion 1.7+ with repositories that were created with Subversion 1.6. An svn merge will fail with the following error:

svn: E200020: Invalid revision number '0' found in range list

The little internet chatter I saw about this recommends changing the svn:mergeinfo to replace any mention of revision 0 with revision 1.

Write an SvnDumpMutator that will replace svn:mergeinfo mentions of a specific revision with another.
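The core of the transform could be as small as the sketch below. It is illustrative: it only handles ranges that start at revision 0, and wiring it into the consumer chain is omitted.

// Illustrative: rewrite svn:mergeinfo ranges like "/trunk:0-42" to "/trunk:1-42".
static String replaceRevisionZero(String mergeinfo) {
    return mergeinfo.replace(":0-", ":1-");
}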

FastCharStream should not buffer file content

PR #9 tried to fix loading of large files into memory. It broke up the parsing of file contents into chunks.

That's fine; however, FastCharStream has its own buffer that tries to store the file content in memory. Therefore, parsing large files still doesn't work.

Fix FastCharStream and SvnDumpFileParser to not buffer file content in the FastCharStream buffer. This will require adding some shortcutting methods to FastCharStream.

Make SvnPropertyChange more efficient

Right now, SvnPropertyChange will accept a name matcher and a value transform function. The transform function will transform the property value if the name of the property matches the criteria.

In order to match on another property name, you have to create a new SvnPropertyChange consumer. The problem with this is that the SvnPropertyChange consumer rebuilds every properties map in the SvnDump object.

Rebuilding every map once for each SvnPropertyChange you add seems very inefficient. We can instead give SvnPropertyChange a list of (name matcher, value transformer) pairs, so it only needs to rebuild each properties map once.
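A sketch of that shape (the names are illustrative, not the current API): register several (matcher, transformer) pairs up front, then rebuild each properties map in a single pass.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative: all registered rules are applied during one rebuild of the map.
class MultiPropertyChange {
    private final List<Predicate<String>> matchers = new ArrayList<>();
    private final List<Function<String, String>> transforms = new ArrayList<>();

    void addRule(Predicate<String> nameMatcher, Function<String, String> transform) {
        matchers.add(nameMatcher);
        transforms.add(transform);
    }

    Map<String, String> apply(Map<String, String> properties) {
        Map<String, String> rebuilt = new LinkedHashMap<>(properties.size());
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            String value = entry.getValue();
            for (int i = 0; i < matchers.size(); i++) {
                if (matchers.get(i).test(entry.getKey())) {
                    value = transforms.get(i).apply(value);
                }
            }
            rebuilt.put(entry.getKey(), value);
        }
        return rebuilt; // one rebuild, no matter how many rules are registered
    }
}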

Support for reading extremely large files (>12 GB) and revisions (>30 GB)

SVN allows users to commit extremely large files to a repository. Currently svndumpgui cannot parse an SVN dump that contains large files (>12 GB), or a single revision of many smaller files that add up to a large size.

The problem is that when we read the file contents for SvnNodes, we read them into memory to create the SvnRevision object, and then pass the SvnRevision on to the SvnDumpConsumers. This is a problem because loading large files into RAM requires raising the max heap size unnecessarily high, and for sufficiently large files we won't have enough RAM to store them in memory at all.

Even if you have a ton of RAM, the parser defined in svndump.jj expects a single file's content length to be less than Integer.MAX_VALUE. This won't be the case for files larger than 2^31 - 1 bytes.

Reading of file contents should be handled specially, separate from the machinery of JavaCC. File contents should be read in chunks, and the chunks passed on to the SvnConsumer. It's very rare that an SvnConsumer will need to operate on the actual file content, and each SvnConsumer should know how to handle file content chunks appropriately for whatever it's doing.

For example, in the case of an SvnDumpWriterImpl, the file chunks should be written directly back out to the output stream. For an SvnDumpSummary writer, the file content chunks can be ignored.
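A sketch of the chunked flow follows. The FileContentChunk constructor and the consumer parameter are assumptions; consume(FileContentChunk) and endChunks() are the method names mentioned elsewhere in this document.

import java.io.IOException;
import java.io.InputStream;

// Illustrative: forward a node's content in fixed-size chunks instead
// of materializing the whole file as one byte[].
static void streamContent(InputStream in, long contentLength,
                          RepositoryConsumer consumer) throws IOException {
    byte[] buffer = new byte[64 * 1024];
    long remaining = contentLength;
    while (remaining > 0) {
        int want = (int) Math.min(buffer.length, remaining);
        int read = in.read(buffer, 0, want);
        if (read < 0) throw new IOException("unexpected end of stream");
        consumer.consume(new FileContentChunk(buffer, 0, read)); // hypothetical constructor
        remaining -= read;
    }
    consumer.endChunks();
}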

PathCollisionValidator is not very memory efficient

PathCollisionValidator will choke quickly on a large repository, because revisionSnapshots stores a map with every full path in the repository for each revision. With many tags this grows very quickly, and Java runs out of memory :)

I solved that same issue (I'm doing something very similar to help me refactor a giant repository, but for work, so I can't share code ;) ) by instead using a tree structure, with each node storing its 'file' name and its revision. The big trick is that for each new revision only the root node is replaced; the sub-nodes are the nodes from the previous revision. On each add/delete/replace/modify, the nodes along the path are also replaced, but no others. When adding a copy, again only the top node is new, and it links to the nodes from the copied revision.
This creates a structure which holds every revision completely, in an easily traversable format, so you can perform the checks, but which is far friendlier in its memory use.

Hope this makes sense, hope this helps ;)
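A minimal sketch of that structure-sharing tree (illustrative, from neither codebase): each revision gets a new root, and only the nodes along a changed path are copied, so unchanged subtrees are shared across revisions.

import java.util.HashMap;
import java.util.Map;

// Illustrative persistent tree: an add copies only the nodes along the
// changed path; every other subtree is shared with the previous revision.
final class PathNode {
    final String name;
    final int revision;
    final Map<String, PathNode> children;

    PathNode(String name, int revision, Map<String, PathNode> children) {
        this.name = name;
        this.revision = revision;
        this.children = children;
    }

    // Returns a new root for `newRevision` with `path` added under it.
    PathNode withPathAdded(String[] path, int index, int newRevision) {
        Map<String, PathNode> copied = new HashMap<>(children); // shallow copy: subtrees shared
        PathNode child = copied.get(path[index]);
        if (child == null) {
            child = new PathNode(path[index], newRevision, new HashMap<>());
        }
        if (index < path.length - 1) {
            child = child.withPathAdded(path, index + 1, newRevision);
        }
        copied.put(path[index], child);
        return new PathNode(name, newRevision, copied);
    }
}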

SvnDumpWriter never checks for an exception caught by PrintStream

Based on the Java documentation, PrintStream never throws an IOException. This means that if there is a problem writing to the output stream, the SvnDumpWriter will never know, and won't terminate early.

This is a problem when piping input to CliConsumer: if there is an error in the output stream (for example, svnadmin load encounters unknown input), SvnDumpWriter will continue "writing" to the bad stream as if nothing ever happened.

We should check for the exception and rethrow it, or use another stream class that doesn't swallow exceptions.
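A sketch of the first option, using PrintStream.checkError(), which reports whether the stream has silently caught an IOException:

import java.io.IOException;
import java.io.PrintStream;

// Illustrative: surface the swallowed error so the chain can stop early.
static void checkedPrint(PrintStream out, String text) throws IOException {
    out.print(text);
    if (out.checkError()) {
        throw new IOException("error writing to the output stream");
    }
}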

ClearRevision calls endChunks() and endNode() for removed nodes

When clearing a revision, ClearRevision will filter out the consume(SvnNode) and consume(FileContentChunk) calls and not call them on the next consumer. However, there's no logic to keep endChunks() and endNode(SvnNode) from being called on the next consumer.

This will cause a follow-up consumer to receive endChunks() and endNode(SvnNode) without the matching consume(...) calls, which will lead to undesired behavior.
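A sketch of the missing bookkeeping (illustrative; the method names come from this issue, while the shouldClear predicate and the forwarding super calls assume a pass-through base class):

// Illustrative: remember filtered-out nodes and suppress their end calls.
private final java.util.Set<SvnNode> removedNodes = new java.util.HashSet<>();

@Override
public void consume(SvnNode node) {
    if (shouldClear(node)) {   // hypothetical predicate for cleared revisions
        removedNodes.add(node);
        return;                // not forwarded to the next consumer
    }
    super.consume(node);
}

@Override
public void endNode(SvnNode node) {
    if (removedNodes.remove(node)) {
        return;                // suppress the end call for a removed node
    }
    super.endNode(node);
}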

Create a DSL to describe a consumer chain

Currently, to create a consumer chain that operates on an SVN dump stream, you must write Java code, compile it into your JAR, and then run it on your SVN dump file. It is a tedious process.

To make this easier, we can create a DSL that describes a consumer chain. The file that describes the consumer chain can be read at runtime and the consumer chain generated dynamically, requiring no recompilation.

This will make running the program from the CLI much easier.

The DSL should be implemented in JavaCC.

Fix svn:log property to be compatible with newer versions of Subversion

If you try to rebuild an SVN 1.6 repo with an svnadmin command from a newer version of SVN (in my case 1.8.8), you may encounter the following:

svnadmin: E125005: Invalid property value found in dumpstream; consider repairing the source or using --bypass-prop-validation while loading.
svnadmin: E125005: Cannot accept non-LF line endings in 'svn:log' property

So it seems that in Subversion 1.6 you could have an svn:log property that didn't end in a line feed character, but in Subversion 1.8 that's no longer allowed.

It seems like a trivial exercise to make a property mutator that will guarantee that all svn:log properties end with a line feed character. Maybe do this?
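The transform itself is tiny (a sketch; wiring it into a property mutator is omitted):

// Illustrative: guarantee that an svn:log value ends with a line feed.
static String fixLogMessage(String logMessage) {
    return logMessage.endsWith("\n") ? logMessage : logMessage + "\n";
}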
