This page outlines the steps that had to be executed to migrate Kuali Rice from Kuali's Subversion repository to GitHub.

For additional information (and an alternative approach), see also the migration process that the Kuali Coeus project went through: SVN to Git

The Approach

The Kuali Rice project is planning to migrate the most important tags and branches in the codebase over to GitHub. Once this process is complete, the original Subversion repository will be put into read-only mode and all new work (on all active branches of Rice) will continue within GitHub.

Migration Analysis

As part of planning our migration, we chose the specific directories we would like to remain in the repository once it has been migrated. This analysis can be found in our Branch and Tag Migration Plan.

Unfortunately, while our aspirations of cleaning up the repository are commendable, the exclusions noted in the Branch and Tag Migration Plan cannot be fully achieved with the svn2git toolset. This is because the project's trunk and development branches include code that was merged into them from some of the directories we want to exclude. The main example is the "sandbox" directory, which has traditionally been used as a place to create "feature" branches that eventually get reintegrated into trunk. Such directories therefore have to be included in the migration process and then deleted after the import into a Git repository has been completed.

The Tools

We will use a combination of tools for this effort:

  • Amazon EC2
  • svnadmin (dump|load|create)
  • svndumpfilter
  • svn2git
  • git
  • GitHub!

Scripts and Config Files

A number of scripts and config files used during this migration have been checked in here for safekeeping: https://github.com/ewestfal/rice-svn2git-support

Preliminary Steps

Preliminary Step 1 - Setting up the Migration Server

Kuali Rice has a large amount of history in Subversion (approximately 48,000 revisions dating back to 2007). Due to this, it takes a long time to run the various steps of the migration process (just the svn2git process itself takes around 18 hours). Therefore, it is helpful to have an Amazon EC2 instance to run these various jobs in the background. Additionally, that allows us to avoid the network latency to the Kuali SVN server because we can clone the repository to that machine (more on that later). We are sure to use an instance with an SSD since the migration process involves a lot of disk-heavy activity.

Using Amazon EC2, we will create a new server of the following type:

  • Ubuntu 12.04 LTS server, c3.xlarge, 256 GB SSD root volume

Once this server is set up, we can log in and run a number of commands to get the distribution upgraded and the migration tools in place. Our server is set up to allow us to log in as root; if that is not possible, "sudo" will need to be used in front of each of these commands.

  • First, we make sure that our distribution is up to date 

  • Next, we will make sure that all of our existing packages are upgraded

  • After this is complete, the system will indicate that a restart is required, so we go ahead and do that just to be safe

  • Next, we install git, git-svn, ruby, and ruby-gems since those are required to run the svn2git tool

  • Since we will be loading a Subversion repository locally, we need to make sure that subversion is installed as well

  • At this point, we have an older version of git on our machine (1.7.9.5). Let's go ahead and get the latest stable version of Git (version 2)

  • Now if we check our git version we should be on the latest

  • We will also upgrade to version 1.7 of Subversion

  • Now if we check our subversion version we should be on the latest

  • Still, at this point our git-svn application is using a different version of SVN than the one we just installed

  • git-svn is actually written in Perl and uses a Perl module for its SVN integration, so we will upgrade the SVN version using CPAN

  • You can accept all of the defaults during this install process. Note that the install will likely appear to fail with the error message "error: no suitable apr found". Despite this, it has succeeded in upgrading the SVN version used by git-svn

  • Finally, we can run the command to get the svn2git tool installed

At this point, everything that we need has been installed.
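
The individual commands are not reproduced on this page. The following is a rough sketch of the steps above for a root shell on Ubuntu 12.04, not the exact commands that were run. The package names and the `ppa:git-core/ppa` source are assumptions based on what was commonly used at the time, and the Subversion 1.7 and CPAN upgrades are left as a comment because their exact sources are not recorded here. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences so the sequence can be previewed before anything is executed.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

setup_migration_server() {
    run apt-get update                                   # refresh package lists
    run apt-get -y upgrade                               # upgrade installed packages
    # A reboot is recommended at this point before continuing.
    run apt-get -y install git-core git-svn ruby rubygems   # svn2git prerequisites
    run apt-get -y install subversion                    # for the local repository
    run apt-get -y install python-software-properties    # provides add-apt-repository
    run add-apt-repository -y ppa:git-core/ppa           # assumed source for a newer Git
    run apt-get update
    run apt-get -y install git
    run git --version                                    # confirm the upgrade
    # Subversion 1.7 and the Perl SVN bindings used by git-svn were upgraded
    # separately (the latter via CPAN); those commands are not shown here.
    run gem install svn2git
}
```

Calling `setup_migration_server` as-is only prints the commands; setting `DRY_RUN=0` first executes them.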

Preliminary Step 2 - Mapping Authors

As part of the migration process, authors need to be mapped from the historical Subversion commits to GitHub. A list of authors from the Subversion repo can be assembled using this command:

Once we have a list of authors, we need to assign each one a first name, last name, and email address in the following format:

svn_user_id = First_Name Last_Name <email_address>

Where possible, we have committers self-identify their preferred email address to go into the git commits. For those who are no longer committers, we can use kis.kuali.org to look up current name and email information. If an email address is used which matches an email address in someone's GitHub account, then those historical commits will be associated with those accounts. If an author is missed, svn2git will kick out an error message related to an author not being found, so the list must be comprehensive.
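
The original author-listing command is not reproduced on this page. As a sketch of how the list could be assembled and checked for completeness, the two helpers below work on `svn log --quiet` output. The function names and the file names in the usage comments are hypothetical.

```shell
# Read `svn log --quiet` output on stdin; print each committer ID once.
extract_authors() {
    awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 }' | sort -u
}

# Read committer IDs on stdin (one per line) and print those that have no
# "id = Name <email>" entry in the authors file given as $1.
missing_authors() {
    while IFS= read -r id; do
        sed 's/ = .*//' "$1" | grep -q -x -F -- "$id" || printf '%s\n' "$id"
    done
}

# Usage (REPO_URL is a placeholder for the repository URL):
#   svn log --quiet "$REPO_URL" | extract_authors > svn-authors.txt
#   missing_authors authors.txt < svn-authors.txt
```

An empty result from `missing_authors` suggests that every committer in the log has a mapping, which is what svn2git requires.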

The list for the Kuali Rice migration can be found here: https://github.com/ewestfal/rice-svn2git-support/blob/master/authors.txt

Step-by-Step Migration

Step 0 - Notify the Developers and Prepare the SVN Repo

When performing the real migration a few things should be done before the migration process begins. Anyone with commit access to Rice should be notified that this migration is beginning and that all commits should be halted. It will be easier for the developers to not have to worry about integrating changes from their working copy after the switch to GitHub is completed, so everyone should try to have all outstanding work committed.

Also, prior to the dump, the Infrastructure team should put the Subversion repository in read-only mode. The expectation should be set with committers that the entire process could take a day or more and commits to the project will be offline during this time.

Step 1 - Dumping the SVN Repository

To help us perform the migration more quickly, we will use a dump of the current Kuali Rice SVN repository so that we can run the entire migration locally without any network dependencies.

We need the Infrastructure team to help us with this (thanks Farooq!) since we do not have access to the Kuali Subversion environment. The command that needs to be run on the Subversion server is as follows:
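
The exact command is not reproduced on this page. A minimal sketch of a full (non-delta) dump might look like the following; the repository path and dump file name are hypothetical, since the real locations on the Subversion server are not given here.

```shell
# Hypothetical locations; override them in the environment as needed.
REPO_DIR=${REPO_DIR:-/srv/svn/rice}
DUMP_FILE=${DUMP_FILE:-rice.dump}

dump_repository() {
    # Full-text dump of all revisions: --deltas is deliberately left out.
    # --quiet suppresses the per-revision progress messages on stderr.
    svnadmin dump --quiet "$REPO_DIR" > "$DUMP_FILE"
}
```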

It is important that the "--deltas" option not be used, since the svndumpfilter tool that we use in a later step does not support a delta-based dump file. The resulting file will be very large; it is around 7 GB for Rice.

Finally, have the Infrastructure team copy the SVN dump to a directory on the migration server so that it is accessible during migration.

Step 2 - Filter the SVN Dump

The Kuali Rice project has a very large number of deleted tags and branches. These don't show up when browsing the repository at svn.kuali.org, but they are included in the migration since they are part of the history of the project. If we attempt to migrate all of these over, the svn2git migration will take too long to complete (approximately 1 week!). The svn2git tool allows us to exclude specific directories from migration, but unfortunately it does not have an option to include specific directories. Given the large number of deleted tags, branches, and directories in the Rice repository (on the order of thousands), we would rather use an inclusion-based approach.

Thankfully, the svndumpfilter tool exists, which allows us to filter an svn dump to remove certain paths. It supports pattern-based inclusion during its processing, so our first step is to set up a file containing the patterns we want the filter to include.

In order to set this up, we must determine which directories we want to include in the dump. Note however that directories that serve as source copies for other included directories cannot be excluded.

We have this in many places in the Rice code base as we have a "sandbox" directory that is used for feature branching. So if a developer had branched "trunk" to "sandbox/my-awesome-feature", made changes, and later merged back into trunk, then we would need to be sure that sandbox/my-awesome-feature was in our list of inclusions. Otherwise svndumpfilter would fail with an "Invalid copy source path" error.

For Rice, to make things simpler we just include the entire sandbox directory. However, we are at least able to include only the tags that we absolutely want (or need) to keep. Through trial and error (and running svndumpfilter numerous times to identify all invalid copy source paths so we can add them to our filter file) we can create our final filter-targets.txt file. This file can be found here: https://github.com/ewestfal/rice-svn2git-support/blob/master/filter-targets.txt

Note the syntax in this file. When we run the svndumpfilter command, we are going to use the "--pattern" option. This allows our paths to be standard globs (http://en.wikipedia.org/wiki/Glob_%28programming%29) which are checked against the path of each file. Therefore, if we just have "trunk" in our filter file, it will only include the root trunk directory, nothing underneath it. So we either need to include both "trunk" and "trunk/*" or just "trunk*".
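
To illustrate the difference between these patterns, the sketch below uses shell `case` matching as a stand-in for the glob matching described above. It assumes that, as in `case` patterns, `*` can also match `/`. The `matches` helper and the sample paths are hypothetical.

```shell
# Succeed if path $1 matches glob $2 (shell `case` semantics, where `*`
# may span `/`; assumed here to mirror svndumpfilter --pattern).
matches() {
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

for pattern in 'trunk' 'trunk/*' 'trunk*'; do
    for path in 'trunk' 'trunk/pom.xml' 'trunk/impl/src'; do
        if matches "$path" "$pattern"; then
            printf '%-8s matches  %s\n' "$pattern" "$path"
        fi
    done
done
```

Under these semantics "trunk" matches only the directory itself, "trunk/*" matches only what lies beneath it, and "trunk*" matches both. Note that "trunk*" would also match any sibling whose name merely starts with "trunk".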

Once the filter-targets.txt file is ready, the tool can be executed using this command:
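
The original command is not reproduced on this page. A sketch of what the invocation might look like follows; the dump and log file names are hypothetical, while `include`, `--pattern`, and `--targets` are standard svndumpfilter options.

```shell
# Hypothetical file names; override them in the environment as needed.
DUMP_FILE=${DUMP_FILE:-rice.dump}
FILTERED_DUMP=${FILTERED_DUMP:-rice-filtered.dump}
FILTER_LOG=${FILTER_LOG:-svndumpfilter.log}

filter_dump() {
    # include     keep only the listed paths
    # --pattern   treat each target as a glob
    # --targets   read the targets from a file
    # The filtered dump goes to stdout; progress and errors go to stderr.
    nohup svndumpfilter include --pattern --targets filter-targets.txt \
        < "$DUMP_FILE" > "$FILTERED_DUMP" 2> "$FILTER_LOG" &
}
```

With this layout, progress can be followed with `tail -f "$FILTER_LOG"` after calling `filter_dump`.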

We run this with nohup because it takes a while to complete and we don't want a terminated network connection to kill our SSH login and stop the process. During the execution of this, we can observe the progress by tailing the output:

If it fails with the "Invalid copy source path" error, add the missing path to filter-targets.txt and try again.

For Rice, once this has completed we are able to filter approximately 18,000 revisions and 183,000 different paths.

See the svndumpfilter tool documentation for more information: http://svnbook.red-bean.com/en/1.7/svn.ref.svndumpfilter.html

When testing the migration process for Rice, we originally attempted to run this command with the "--drop-empty-revs" option. However, when doing that the "svnadmin load" command (see next step) failed, so we re-ran it without this option and were successful in loading the filtered dump.

Step 3 - Load the Dump into a Local SVN Repository

Now that we have our filtered dump, we can load it into a local Subversion repository. To do this, we must first create the repository:

Next, we can load the repository. We do this using nohup as well since it will take a while to complete:

We can test that this was successful by executing a command like the following:

That will show the last 10 revision log messages from the newly loaded repository.
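
The three commands for this step are not reproduced on this page. As a sketch, they might look like the following; the repository path and file names are hypothetical, and `REPO_DIR` must be an absolute path for the `file://` URL to be valid.

```shell
# Hypothetical locations on the migration server.
REPO_DIR=${REPO_DIR:-/data/ricerepo}
FILTERED_DUMP=${FILTERED_DUMP:-rice-filtered.dump}

create_repository() {
    svnadmin create "$REPO_DIR"
}

load_repository() {
    # Runs in the background; output is collected in svnadmin-load.log.
    nohup svnadmin load "$REPO_DIR" < "$FILTERED_DUMP" > svnadmin-load.log 2>&1 &
}

show_recent_log() {
    # -l 10 limits the output to the ten most recent revisions.
    svn log -l 10 "file://$REPO_DIR"
}
```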

Step 4 - Migrate the SVN Repository to Git

At this stage we are ready to migrate the SVN repo to a local Git repository on our migration server. To do this, we will create a new directory on the migration server and "cd" into it, and then execute the following command:
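
The exact svn2git invocation used for Rice is not reproduced on this page and may well have carried additional options. A minimal sketch, assuming a standard trunk/branches/tags layout and hypothetical file locations, could be:

```shell
# Hypothetical locations; REPO_URL should point at the repository loaded in Step 3.
REPO_URL=${REPO_URL:-file:///data/ricerepo}
AUTHORS_FILE=${AUTHORS_FILE:-authors.txt}

run_svn2git() {
    # --authors  maps Subversion user IDs to Git identities
    # --verbose  logs the underlying git-svn activity
    nohup svn2git "$REPO_URL" --authors "$AUTHORS_FILE" --verbose \
        > svn2git.log 2>&1 &
}
```

Run from inside the new, empty directory, this leaves the converted repository in that directory and the log in svn2git.log.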

It is especially important to run this command using nohup because it will take a very long time to complete. For the filtered Kuali Rice repository, it takes approximately 18 hours.

Once it is complete we will have a directory containing the migrated git repository.

Step 5 - Cleaning up Branches and Tags

Even though our git repository has been created, we still have some branch and tag cleanup to do. To delete a branch the following command can be used:

And for tags:

Notice the use of capital "D" for branch deletion.

For Rice, we have a number of branches and tags to delete, so we have created a couple of scripts to execute those deletions:
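
The scripts themselves are not reproduced on this page. A sketch of how such a script could work is shown below: it reads branch or tag names from a list file and deletes each one with `git branch -D` or `git tag -d`. The `delete_refs` function and the list file names are hypothetical.

```shell
# Delete every branch (or tag) named in a file, one name per line.
# Blank lines and lines starting with '#' are skipped.
delete_refs() {
    kind=$1
    list=$2
    while IFS= read -r name; do
        case "$name" in
            ''|'#'*) continue ;;
        esac
        if [ "$kind" = branch ]; then
            git branch -D "$name"   # -D deletes even if the branch is unmerged
        else
            git tag -d "$name"
        fi
    done < "$list"
}

# Usage, from inside the migrated repository:
#   delete_refs branch branches-to-delete.txt
#   delete_refs tag tags-to-delete.txt
```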

Step 6 - Deleting Large Files

GitHub recommends that repositories stay below 1 GB. We can check what size ours is by running this command:

For Rice, this returns "750M". So we are under the 1 GB limit, but not by much. It would be great to have a bit more breathing room and make the repository smaller.

We can identify what the largest files are through the following process:

  1. First we will get a hash of all of the files in the repository:

  2. Next we generate a list of all potentially large objects:

  3. Finally, we use both the list of hashes and file sizes to create a list of largest to smallest files (be sure to cd to the directory where you created these two files):

  4. Now in "bigtosmall.txt" we have a list of the largest files to the smallest. For Rice, the top of the file looks like this:

  5. As can be seen here (and throughout the rest of the file) we have a lot of jars in lib. This is from a time in the history of the repository when we had not switched over to Maven yet. These directories and files have been long deleted on the current and active development branch. Therefore, we decide that the savings in space is worth removing these files and losing the (very) old history. We also identify a number of other files through this process which can be deleted. Our final command uses git's filter-branch command and looks like the following:

  6. It will take a while for this to complete, but once it is done, we execute one final command:

  7. Finally, we can check what our new size is. The easiest way to do this is to clone the repository and remove the hard links:

  8. Change into the ricegit4-clone directory and check the size again:

Much better! We have reduced the repository size by almost a third. We can delete this clone once we are done checking it. We will push to GitHub from the original directory, not from the clone.
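
The commands for the numbered steps above are not reproduced on this page. The sketch below shows one common way to carry out steps 1 to 3, 5, 7, and 8; it is not necessarily what was run for Rice. The intermediate file name `bigobjects.txt`, the function names, and their arguments are hypothetical, and the command for step 6 is omitted because it is not recorded here.

```shell
# Step 1: one "<sha> <path>" line per object reachable from any ref.
list_objects() {
    git rev-list --objects --all > allfileshas.txt
}

# Step 2: repack, then list the blobs in the pack files as
# "<sha> blob <size> <packed-size> <offset> ...".
list_blobs() {
    git gc
    git verify-pack -v .git/objects/pack/pack-*.idx |
        awk '$2 == "blob"' > bigobjects.txt
}

# Step 3: join the two lists into "<size> <path>", largest first.
# (Paths containing spaces are truncated at the first space.)
rank_files() {
    awk 'NR == FNR { path[$1] = $2; next }
         $2 == "blob" && ($1 in path) { print $3, path[$1] }' \
        allfileshas.txt bigobjects.txt |
        sort -k 1,1nr > bigtosmall.txt
}

# Step 5: remove the given paths (no spaces) from every commit on every
# branch and tag.
purge_paths() {
    git filter-branch --prune-empty \
        --index-filter "git rm -r --cached --ignore-unmatch $*" \
        --tag-name-filter cat -- --all
}

# Steps 7-8: clone without hard links and measure the clone.
# $1 is the absolute path of the original repository.
measure_clone() {
    git clone --no-hardlinks "file://$1" ricegit4-clone &&
        du -sh ricegit4-clone/.git
}
```

The initial size check at the start of this step can be done the same way, with `du -sh .git` in the original repository.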

Step 7 - Push to GitHub

We are finally ready to push our slim-and-trim repository up to GitHub. Before doing anything, we should review all of the branches and tags in the local repo. Execute the following commands to get a list. We want to verify that all the branches and tags we want to keep exist and are named properly.

Now we are ready to push our repository:

  1. Go to github.com and create a new repository in the Kuali GitHub (or your own personal GitHub account if you are just testing). For our example, we will call it "rice".
  2. Add the repository as our remote origin for our converted repository:

  3. Execute a push for all of our branches. Keep in mind that trunk is now called master and is considered a branch:

  4. Finally, push all of our tags:
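
The review and push commands are not reproduced on this page. A sketch of what they might look like follows. The remote URL is a placeholder, not the actual Kuali repository address, and the function names are hypothetical.

```shell
# Placeholder remote; substitute the URL GitHub shows for the new repository.
REMOTE_URL=${REMOTE_URL:-git@github.com:example-org/rice.git}

# Review step: list what is about to be pushed.
review_refs() {
    git branch      # local branches, including master (formerly trunk)
    git tag -l      # all tags
}

# Steps 2-4: add the remote, then push every branch and every tag.
push_to_github() {
    git remote add origin "$REMOTE_URL" &&
        git push -u origin --all &&
        git push origin --tags
}
```

Each command in `push_to_github` runs only if the previous one succeeded, so a failed branch push does not go on to push tags.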

Check the repository on github.com and verify that all branches and tags are there. The migration is now complete!

Step 8 - Final Steps

Notify the committers that the migration is complete and that they can fork the GitHub repository and begin doing development again. At this point, changes can be applied to pom files, etc. to replace SVN-based tooling (for Maven release, tagging, etc.) with Git-based tooling. See Kuali Rice SVN to GitHub Tooling Migration for more details.

 

 


1 Comment

  1. We should also include updating our README.txt to README.md for better viewing on github