Find and remove large files in a Mercurial repository

Sometimes stuff happens, and a large file gets committed to your Mercurial source repository by mistake.

This post shows you how to find changesets with large files, and how to remove these files from the repository's history.

Here's a one-liner to list the large files (say, larger than 1MB) in a Mercurial repository.

    hg grep -l ".*" "set:size('>1MB')"

This command does a grep for all files in all revisions that match the specified file set, returning the path and first revision of these files.

You can limit the file set to large binary files:

    set:size('>1MB') and binary()

The output is a list of file paths, one per line, each followed by the revision number where the file first appeared.

Update: According to a question on Stack Overflow, it appears that hg grep only searches files in the current working directory when given a file set.
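One way to look beyond the working directory is to evaluate the file set against every revision in turn. This is a minimal sketch, assuming that `hg files -r REV` evaluates a `set:` pattern against that revision; the function name `large_files_in_history` is made up for illustration. It starts one hg process per revision, so expect it to be slow on long histories.

```shell
# Sketch: list every path that exceeds a size threshold in any revision.
# The function name is hypothetical; the threshold defaults to 1MB.
large_files_in_history() {
    threshold=${1:-1MB}
    # Walk all revision numbers and query each one with the file set.
    hg log --template '{rev}\n' | while read -r rev; do
        hg files -r "$rev" "set:size('>$threshold')"
    done | sort -u   # a file present in many revisions is printed once
}

# Usage, from inside the repository:
#   large_files_in_history 1MB
```

The output contains paths only, so you still need `hg log path/to/file` to see which changesets touched a file.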

Now, you just need to erase the files from history. (Which is easier said than done.)

If the revision with the large file has not been pushed to a public repository (where other people may have accessed it), and it's the last commit you made, you can simply hg rollback to undo the commit, remove the file, and then commit again. This effectively replaces the changeset containing the large file with a changeset without the large file.
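As a sketch, that sequence looks roughly like this. The function name, file path, and commit message are placeholders. Note that newer versions of Mercurial mark `hg rollback` as deprecated in its help text and suggest `hg commit --amend` for fixing the last commit.

```shell
# Sketch: redo the last, unpushed commit without one file.
# $1 = path of the large file, $2 = commit message (both placeholders).
undo_last_commit_without() {
    hg rollback          # undo the last commit; the file is back in "added" state
    hg forget "$1"       # stop tracking the file, but keep it on disk
    hg commit -m "$2"    # commit the remaining changes again
}

# Usage:
#   undo_last_commit_without data/huge-dump.bin "Add import script"
```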

If the large file is in an older revision, or the revision has been pushed to a public repository, you can't just replace the changeset: the original changeset will come back as soon as someone pushes from a repository that still contains it.

To erase the file from the repository's history, use the convert extension with a file map that excludes the large file. This creates a clone of the repository without the large file.
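Here is a minimal sketch of such a conversion. The repository names, the file path, and the helper names `make_filemap` and `strip_files` are placeholders. It assumes the convert extension that ships with Mercurial, enabled here for a single invocation through `--config`.

```shell
# Sketch: convert a repository while leaving out unwanted files.

# Print one "exclude" line per path given as an argument.
# (Paths containing spaces would need quoting inside the file map.)
make_filemap() {
    for path in "$@"; do
        printf 'exclude %s\n' "$path"
    done
}

# Convert repository $1 into new repository $2, excluding all later arguments.
strip_files() {
    src=$1; dest=$2; shift 2
    make_filemap "$@" > filemap.txt
    # Enable the bundled convert extension just for this command.
    hg --config extensions.convert= convert --filemap filemap.txt "$src" "$dest"
}

# Usage (hypothetical names):
#   strip_files old-repo new-repo data/huge-dump.bin
```

Changeset hashes in the converted repository will generally differ from the originals, which is why the old copies cannot be mixed with the new one.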

You then have to remove all public versions of the old repository, and also get people you are collaborating with to delete their copies, and then you can distribute the new repository without the large file.