svndumpapi's Introduction

Badges: build status, code coverage, license (GNU Affero GPL v3)

SVN Dump API

An API for reading, editing, and writing SVN dump files.

Background

SVN dump files are created via the svnadmin dump command, and contain all the history of an SVN repository. An SVN dump file contains a list of revisions (see Revision), and each revision contains a list of nodes (see Node).

Revisions can have properties such as author, date, and commit message. Nodes can have properties too, which are maintained on a node-by-node basis.

Related Work

I'm not the first one to have this idea. Here are some links:

Model

SvnDumpFileParser

The SvnDumpFileParser is an auto-generated parser for SVN dump files (files created with svnadmin dump). It parses SVN dump files into a Repository object. The Repository representation is meant to be very lightweight and does minimal validation.

The parser is auto-generated using JavaCC (Java Compiler Compiler) from the svndump.jj grammar file. This grammar generates a parser that is dependent on the Java interfaces and classes in this project.

Repository Summary

To get an svn log-like summary of your dump file, you can use the RepositorySummary (sample output here).

Consumers

A RepositoryConsumer consumes the various pieces of a Repository. Its specializations include the mutators, validators, and writers described below.

Consumers (and therefore any of their specializations) can be chained together to achieve complex operations on SVN dump files using the continueTo(RepositoryConsumer) method.

Mutators

The API allows changing an SVN dump file via RepositoryMutator implementations.

Some useful mutators are:

  • ClearRevision - empties a revision (removes all changes; the revision itself is preserved so that references to revision numbers still work)
  • PathChange - updates file/dir paths
  • NodeRemove - removes an individual file change from a revision
  • NodeAdd - adds a newly crafted change to a specific revision
  • NodeHeaderChange - changes a specific property on an existing SvnNode

To apply multiple mutators in sequence, you can chain them together using RepositoryConsumer.continueTo(RepositoryConsumer).
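For example, a short chain that clears a revision, renames a path, and writes the result to standard output might look like the sketch below. The constructor arguments are illustrative, and it assumes continueTo(...) appends to the tail of the chain; check each class for its actual signature.

import java.io.FileInputStream;

// Illustrative chain; constructor arguments are assumptions.
RepositoryConsumer chain = new ClearRevision(5);                    // empty revision 5
chain.continueTo(new PathChange("trunk/oldName", "trunk/newName")); // rename a path
chain.continueTo(new SvnDumpWriter(System.out));                    // write the result

SvnDumpFileParser.consume(new FileInputStream("svn.dump"), chain);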

Validators

When you start messing with your SVN history via the mutators, you can be left with an SVN dump file that cannot be imported back into an SVN repository. To make changing SVN history easier, the API has the concept of a RepositoryValidator.

Validation is done while the data is in memory, which is much faster than running it through svnadmin load.

Some useful validators:

  • PathCollisionValidator - checks that file operations are valid (don't delete non-existent files, don't double-add files, and check that source files exist when making copies)
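Since a validator is just another RepositoryConsumer, it can be chained ahead of a writer so the stream is validated as it is rewritten. A sketch follows; how a validation failure is surfaced depends on the validator implementation, and the writer constructor is an assumption.

import java.io.FileInputStream;

// Illustrative: validate path operations while writing the dump back out.
RepositoryConsumer chain = new PathCollisionValidator();
chain.continueTo(new SvnDumpWriter(System.out)); // writer constructor is an assumption
SvnDumpFileParser.consume(new FileInputStream("svn.dump"), chain);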

Usage

Command Line Interface

The bin/run-java shell script will run the CliConsumer.
The current usage pattern is to modify the CliConsumer and create your chain programmatically, then do:

mvn clean install dependency:copy-dependencies
cat file.dump | ./bin/run-java > output.dump

or, if your repository is too large for a single file:

mvn clean install dependency:copy-dependencies
svnadmin create /path/to/newrepo
svnadmin dump /path/to/repo | ./bin/run-java | svnadmin load -q /path/to/newrepo

Example: AgreementMaker

To see how all these pieces fit together to allow you to edit SVN history, you can look at an SVN repository cleanup that I did for the AgreementMaker project. All the operations on the SVN dump file are detailed in this test.

Reading an SVN dump file

Parsing an SVN dump file is straightforward. Here's an example that uses a single consumer to read the SVN dump into memory:

import java.io.FileInputStream;
import java.io.InputStream;

// Read the whole dump into an in-memory Repository via a single consumer.
RepositoryInMemory inMemory = new RepositoryInMemory();
InputStream is = new FileInputStream("svn.dump");
SvnDumpFileParser.consume(is, inMemory);

Repository svnRepository = inMemory.getRepo();

See SvnDumpFileParserTest for usage patterns of the parser.

Developing

Coverage Report

To get a JaCoCo coverage report, run the following:

mvn clean test jacoco:report

The coverage report output will be in HTML format in target/site/jacoco/index.html.

svndumpapi's People

Contributors

cstroe, dependabot[bot]


svndumpapi's Issues

Describe our domain model

This project works in the domain of rewriting Subversion history. It fills in the tooling gap that exists in this domain, especially around the problems of upgrading Subversion 1.6 repositories to be used with newer Subversion versions, removing large binaries from Subversion history, and getting a Subversion repository in a proper form to be converted to Git.

Part of the documentation should include some sort of graphical representation of how we modeled the domain, and of the process by which an SVN dump file is read. We can use these documents to bootstrap a common language that people can use when speaking about this project. The chart should be easy to update over time, as needed.

SvnDumpFileParser.readByteArray() is very slow

Reading large files that are part of an SvnNode takes a disproportionately long time. We currently read the file content of an SvnNode with the readByteArray() method. This is slow for some reason, and I don't know why.

We need to fix this so that reading streams doesn't take a very long time.
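A usual suspect for this kind of slowdown is reading one byte per call from an unbuffered stream. As a point of comparison, here is a minimal sketch of reading a fixed-length span in bulk; it is illustrative only, not the project's actual readByteArray().

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative: read exactly `length` bytes with bulk read calls
// instead of one InputStream.read() per byte.
static byte[] readSpan(InputStream in, int length) throws IOException {
    byte[] buffer = new byte[length];
    new DataInputStream(in).readFully(buffer); // loops internally over read(byte[], off, len)
    return buffer;
}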

Add mutator to modify svn:mergeinfo

svn:mergeinfo is an SVN property that can be present on any SvnNode, but is usually found on root directories.

A well-known problem happens when using Subversion 1.7+ with repositories that were created with Subversion 1.6. An svn merge will fail with the following error:

svn: E200020: Invalid revision number '0' found in range list

The little internet chatter I saw about this recommends changing the svn:mergeinfo to replace any mention of revision 0 with revision 1.

Write an SvnDumpMutator that will replace svn:mergeinfo mentions of a specific revision with another.
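The core of the transform could be as small as the sketch below. It is illustrative: it only handles ranges that start at revision 0, and wiring it into the consumer chain is omitted.

// Illustrative: rewrite svn:mergeinfo ranges like "/trunk:0-42" to "/trunk:1-42".
static String replaceRevisionZero(String mergeinfo) {
    return mergeinfo.replace(":0-", ":1-");
}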

FastCharStream should not buffer file content

PR #9 tried to fix loading of large files into memory. It broke up the parsing of file contents into chunks.

That's fine; however, FastCharStream has its own buffer that tries to store the file content in memory. Therefore, parsing large files still doesn't work.

Fix FastCharStream and SvnDumpFileParser to not buffer file content in the FastCharStream buffer. This will require adding some shortcutting methods to FastCharStream.

Make SvnPropertyChange more efficient

Right now, SvnPropertyChange will accept a name matcher and a value transform function. The transform function will transform the property value if the name of the property matches the criteria.

In order to match on another property name, you have to create a new SvnPropertyChange consumer. The problem with this is that the SvnPropertyChange consumer rebuilds every properties map in the SvnDump object.

Rebuilding every map once for each SvnPropertyChange you add seems very inefficient. We can instead give SvnPropertyChange a list of (name matcher, value transformer) pairs, so it only needs to rebuild each properties map once.
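A sketch of that shape (the names are illustrative, not the current API): register several (matcher, transformer) pairs up front, then rebuild each properties map in a single pass.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative: all registered rules are applied during one rebuild of the map.
class MultiPropertyChange {
    private final List<Predicate<String>> matchers = new ArrayList<>();
    private final List<Function<String, String>> transforms = new ArrayList<>();

    void addRule(Predicate<String> nameMatcher, Function<String, String> transform) {
        matchers.add(nameMatcher);
        transforms.add(transform);
    }

    Map<String, String> apply(Map<String, String> properties) {
        Map<String, String> rebuilt = new LinkedHashMap<>(properties.size());
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            String value = entry.getValue();
            for (int i = 0; i < matchers.size(); i++) {
                if (matchers.get(i).test(entry.getKey())) {
                    value = transforms.get(i).apply(value);
                }
            }
            rebuilt.put(entry.getKey(), value);
        }
        return rebuilt; // one rebuild, no matter how many rules are registered
    }
}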

Support for reading extremely large files (>12 GB) and revisions (>30 GB)

SVN allows users to commit extremely large files to a repository. Currently svndumpgui cannot parse an SVN dump that contains large files (>12 GB), or a single revision of many smaller files that add up to a large size.

The problem is that when we read the file contents for SvnNodes, we read them into memory to create the SvnRevision object, and then pass the SvnRevision on to the SvnDumpConsumers. This is a problem because loading large files into RAM requires raising the max heap size unnecessarily high, and for sufficiently large files we won't have enough RAM to store them in memory at all.

Even if you have a ton of RAM, the parser defined in svndump.jj expects a single file's content length to be less than Integer.MAX_VALUE. This won't be the case for files larger than 2^31 - 1 bytes.

Reading of file contents should be handled specially, separate from the machinery of JavaCC. File contents should be read in chunks, and the chunks passed on to the SvnConsumer. It's very rare that an SvnConsumer will need to operate on the actual file content, and each SvnConsumer should know how to handle file content chunks appropriately for whatever it's doing.

For example, in the case of an SvnDumpWriterImpl, the file chunks should be written directly back out to the output stream. For an SvnDumpSummary writer, the file content chunks can be ignored.
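A sketch of the chunked flow follows. The FileContentChunk constructor and the consumer parameter are assumptions; consume(FileContentChunk) and endChunks() are the method names mentioned elsewhere in this document.

import java.io.IOException;
import java.io.InputStream;

// Illustrative: forward a node's content in fixed-size chunks instead
// of materializing the whole file as one byte[].
static void streamContent(InputStream in, long contentLength,
                          RepositoryConsumer consumer) throws IOException {
    byte[] buffer = new byte[64 * 1024];
    long remaining = contentLength;
    while (remaining > 0) {
        int want = (int) Math.min(buffer.length, remaining);
        int read = in.read(buffer, 0, want);
        if (read < 0) throw new IOException("unexpected end of stream");
        consumer.consume(new FileContentChunk(buffer, 0, read)); // hypothetical constructor
        remaining -= read;
    }
    consumer.endChunks();
}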

PathCollisionValidator is not very memory efficient

PathCollisionValidator will choke quickly on a large repository, because revisionSnapshots stores a map with every full path in the repository for each revision. With many tags this grows very quickly, and Java runs out of memory :)

I solved that same issue (I'm doing something very similar to help me refactor a giant repository, but for work, so I can't share code ;) ) by instead using a tree structure, with each node storing its 'file' name and its revision. The big trick is that for each new revision only the root node is replaced; the sub-nodes are the nodes from the previous revision. On each add/delete/replace/modify, the nodes along the path are also replaced, but no others. When adding a copy, again only the top node is new, and it links to the nodes from the copied revision.
This creates a structure which holds every revision completely, in an easily traversable format, so you can perform the checks, but which is far friendlier in its memory use.

Hope this makes sense, hope this helps ;)
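A minimal sketch of that structure-sharing tree (illustrative, from neither codebase): each revision gets a new root, and only the nodes along a changed path are copied, so unchanged subtrees are shared across revisions.

import java.util.HashMap;
import java.util.Map;

// Illustrative persistent tree: an add copies only the nodes along the
// changed path; every other subtree is shared with the previous revision.
final class PathNode {
    final String name;
    final int revision;
    final Map<String, PathNode> children;

    PathNode(String name, int revision, Map<String, PathNode> children) {
        this.name = name;
        this.revision = revision;
        this.children = children;
    }

    // Returns a new root for `newRevision` with `path` added under it.
    PathNode withPathAdded(String[] path, int index, int newRevision) {
        Map<String, PathNode> copied = new HashMap<>(children); // shallow copy: subtrees shared
        PathNode child = copied.get(path[index]);
        if (child == null) {
            child = new PathNode(path[index], newRevision, new HashMap<>());
        }
        if (index < path.length - 1) {
            child = child.withPathAdded(path, index + 1, newRevision);
        }
        copied.put(path[index], child);
        return new PathNode(name, newRevision, copied);
    }
}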

SvnDumpWriter never checks for an exception caught by PrintStream

Based on the Java documentation, PrintStream never throws an IOException. This means that if there is a problem writing to the output stream, the SvnDumpWriter will never know, and won't terminate early.

This is a problem when piping input to CliConsumer: if there is an error in the output stream (for example, svnadmin load encounters unknown input), SvnDumpWriter will continue "writing" to the bad stream as if nothing ever happened.

We should check for the exception and rethrow it, or use another stream class that doesn't swallow exceptions.
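A sketch of the first option, using PrintStream.checkError(), which reports whether the stream has silently caught an IOException:

import java.io.IOException;
import java.io.PrintStream;

// Illustrative: surface the swallowed error so the chain can stop early.
static void checkedPrint(PrintStream out, String text) throws IOException {
    out.print(text);
    if (out.checkError()) {
        throw new IOException("error writing to the output stream");
    }
}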

ClearRevision calls endChunks() and endNode() for removed nodes

When clearing a revision, ClearRevision will filter out the consume(SvnNode) and consume(FileContentChunk) calls and not call them on the next consumer. However, there's no logic to keep endChunks() and endNode(SvnNode) from being called on the next consumer.

This will cause a follow-up consumer to receive endChunks() and endNode(SvnNode) without the matching consume(...) calls, which will lead to undesired behavior.
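A sketch of the missing bookkeeping (illustrative; the method names come from this issue, while the shouldClear predicate and the forwarding super calls assume a pass-through base class):

// Illustrative: remember filtered-out nodes and suppress their end calls.
private final java.util.Set<SvnNode> removedNodes = new java.util.HashSet<>();

@Override
public void consume(SvnNode node) {
    if (shouldClear(node)) {   // hypothetical predicate for cleared revisions
        removedNodes.add(node);
        return;                // not forwarded to the next consumer
    }
    super.consume(node);
}

@Override
public void endNode(SvnNode node) {
    if (removedNodes.remove(node)) {
        return;                // suppress the end call for a removed node
    }
    super.endNode(node);
}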

Create a DSL to describe a consumer chain

Currently, to create a consumer chain that operates on an SVN dump stream, you must write Java code, compile it into your JAR, and then run it on your SVN dump file. It is a tedious process.

To make this easier, we can create a DSL that describes a consumer chain. The file that describes the consumer chain can be read at runtime and the consumer chain generated dynamically, requiring no recompilation.

This will make running the program from the CLI much easier.

The DSL should be implemented in JavaCC.

Fix svn:log property to be compatible with newer versions of Subversion

If you try to rebuild an SVN 1.6 repo with an svnadmin command from a newer version of SVN (in my case 1.8.8), you may encounter the following:

svnadmin: E125005: Invalid property value found in dumpstream; consider repairing the source or using --bypass-prop-validation while loading.
svnadmin: E125005: Cannot accept non-LF line endings in 'svn:log' property

So it seems that in Subversion 1.6 you could have an svn:log property that didn't end in a line feed character, but in Subversion 1.8 that's no longer allowed.

It seems like a trivial exercise to make a property mutator that will guarantee that all svn:log properties end with a line feed character. Maybe do this?
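The transform itself is tiny (a sketch; wiring it into a property mutator is omitted):

// Illustrative: guarantee that an svn:log value ends with a line feed.
static String fixLogMessage(String logMessage) {
    return logMessage.endsWith("\n") ? logMessage : logMessage + "\n";
}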
