The power of git-sed

Replacing content in thousands of files? No problem!

In recent weeks and months, the FSFE Web Team has been doing some heavy work on the FSFE website. We moved and replaced thousands of files and their respective links to improve the structure of a historically grown website (19+ years, 23243 files, almost 39k commits). But how do you do that most efficiently in a version-controlled system like Git?

In our scenarios, the steps executed often looked like the following:

  1. Move/rename a directory full of XML files representing website pages
  2. Find all links that pointed to this directory, and change them
  3. Create a rewrite rule

For the first step, Git's built-in git mv is perfectly fine.
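A minimal sketch of that first step in a scratch repository. The directory names old and new and the file page.xhtml are placeholders, not paths from the FSFE website:

```shell
# Scratch repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir old
printf '<html/>\n' > old/page.xhtml
git add old

# Rename the whole directory; Git stages the move for the next commit.
git mv old new

# List what Git now tracks.
moved_files=$(git ls-files)
echo "$moved_files"
```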

For the second, we would usually need a combination of grep and sed, e.g.:

grep -lr "/old/page.html" | xargs sed -i 's;/old/page.html;/new/page.html;g'

This has a few major flaws:

  • In a Git repository, this also greps inside the .git directory where we do not want to edit files directly
  • The grep is slow in huge repositories
  • The old link has to be typed twice, which is tedious when replacing a large number of links semi-manually
  • Depending on the complexity of the regex we need, the command becomes long, and we have to take care to use the correct flags for both grep and sed.
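Some of these flaws can be softened with plain grep and sed. The following is only a sketch, assuming GNU grep, sed and xargs; the variable names and the scratch directory are made up for illustration. It skips .git and states the old link once, but it does nothing about speed or command length:

```shell
# Scratch directory with a fake .git folder, only to show what gets touched.
work=$(mktemp -d)
mkdir -p "$work/.git" "$work/news"
printf '<a href="/old/page.html">x</a>\n' > "$work/news/item.xml"
printf 'ref /old/page.html\n' > "$work/.git/internal"

old='/old/page.html'   # the search pattern is written once and reused
new='/new/page.html'

# -Z / -0: NUL-separated file names (safe for spaces in paths)
# --exclude-dir=.git: do not descend into Git's internals
# xargs -r: run nothing if grep found no file
# Note: the dot in $old is still a regex wildcard for both grep and sed.
grep -rlZ --exclude-dir=.git -e "$old" "$work" \
  | xargs -0 -r sed -i "s;$old;$new;g"

page_line=$(cat "$work/news/item.xml")
git_line=$(cat "$work/.git/internal")
printf '%s\n' "$page_line" "$git_line"
```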

git-sed to the rescue

After some research, I found git-sed, basically a Bash file in the git-extras project. With some modifications (pull request pending) it’s the perfect tool for mass search and replacement.

It solves all of the above problems:

  • It uses git grep, which ignores the .git/ directory and is much faster because it uses Git’s index.
  • The command is much shorter and easier to understand and write
  • Flags are easy to add, and this only has to be done once per command
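Conceptually, the core of such a tool is a git grep piped into sed. The sketch below is my own simplified illustration of that idea, assuming GNU sed and xargs; it is not the actual git-sed script, and the variable names are hypothetical:

```shell
# Scratch repository with one tracked and one untracked file.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
printf 'see /old/page.html\n' > tracked.xml
git add tracked.xml
printf 'see /old/page.html\n' > untracked.xml   # never added to Git

search='/old/page.html'
replacement='/new/page.html'

# git grep only looks at tracked files, so .git/ and untracked files are skipped.
# -l lists matching file names, -z separates them with NUL bytes for xargs -0.
git grep -lz -e "$search" | xargs -0 -r sed -i "s;$search;$replacement;g"

tracked=$(cat tracked.xml)
untracked=$(cat untracked.xml)
printf '%s\n' "$tracked" "$untracked"
```

Only the tracked file is rewritten, which is the behaviour you usually want before reviewing the result with git diff.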

Install

You can just install the git-extras package which also contains a few other scripts.

I opted for using it standalone, so I downloaded the shell file, put it in a directory that is in my $PATH, and removed one dependency on a script which is only available in git-extras (see my aforementioned PR). For instance, you could copy git-sed.sh to /usr/local/bin/ and make it executable. To enable calling it via git sed, put this in your ~/.gitconfig:

[alias]
  sed = !sh git-sed.sh

Usage

After installing git-sed, the command above would become:

git sed -f g "/old/page.html" "/new/page.html"

My modifications also allow extended regular expressions, which enable things like reference captures, so I hope they will be merged soon. With them, some more advanced replacements are possible:

# Use reference capture (save absolute link as \1)
git sed -f g "http://fsfe.org(/.*?\.html)" "https://fsfe.org\1"

# Optional tokens (.html is optional here)
git sed -f g "/old/page(\.html)?" "/new/page.html"
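To see what such expressions do before running them on a whole repository, they can be tried on a single line with sed -E (extended regex). This is only a sketch with made-up sample input, and it uses a plain greedy .* with an escaped dot, so it is not character-for-character the pattern above:

```shell
# Reference capture: the path matched inside ( ... ) is reused as \1.
link=$(printf '%s\n' '<a href="http://fsfe.org/about/index.html">' \
  | sed -E 's;http://fsfe\.org(/.*\.html);https://fsfe.org\1;g')
echo "$link"

# Optional token: both spellings end up as /new/page.html.
pages=$(printf '%s\n' '/old/page /old/page.html' \
  | sed -E 's;/old/page(\.html)?;/new/page.html;g')
echo "$pages"
```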

And if you would like to limit git-sed to a certain directory, e.g. news/, that’s also no big deal:

git sed -f g "oldstring" "newstring" -- news/
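Before a directory-limited replacement, it can help to preview which files would be affected. A small sketch using plain git grep with the same pathspec syntax; the repository layout here is invented for the example:

```shell
# Scratch repository with the same string in two directories.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir news about
printf 'oldstring\n' > news/item.xml
printf 'oldstring\n' > about/index.xml
git add .

# Everything after "--" is a pathspec; only files under news/ are listed.
affected=$(git grep -l -e "oldstring" -- news/)
echo "$affected"
```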

You may have noticed the -f flag with the g argument. People used to sed know that g replaces all occurrences of the searched pattern on a line, not only the first one. You could also make it gi if you want a case-insensitive search and replace.
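The effect of these flags is easy to check with sed alone. A short sketch on a sample string; the case-insensitive i flag assumes GNU sed:

```shell
sample='foo Foo foo'

# No flag: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/foo/bar/')

# g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$sample" | sed 's/foo/bar/g')

# gi: every match, ignoring case (GNU sed extension).
any_case=$(printf '%s\n' "$sample" | sed 's/foo/bar/gi')

printf '%s\n' "$first_only" "$all_matches" "$any_case"
```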

Conclusion

As you can see, using git-sed really saves time and nerves when doing mass changes on your repositories. Of course, there is also room for improvement. For instance, it could be useful to use the Perl Regex library (PCRE) for the grep and sed to also allow for look-aheads or look-behinds. I encourage you to try git-sed and send suggestions directly upstream to improve this handy tool.


