The Linux Rain Linux General/Gaming News, Reviews and Tutorials

I think I like backreferences (sometimes)

By Bob Mesibov, published 01/12/2015 in Tutorials


Backreferences are part of the complicated and sometimes confusing world of regular expressions. The basic idea is this: if you surround part of a regular expression with round brackets, you can refer later in the same expression to whatever that part matched, using a backslash followed by a number. The command that processes the regular expression will 'remember' the text matched by each bracketed part by its number.

For an example, I'll do a regular expression search with grep (GNU grep 2.20) in the word list in /usr/share/dict/words on my Debian system. I'm going to use the '-E' option, so that grep can see patterns as extended regular expressions (ERE); more on that in a moment. What I'll look for is any word that contains 'ee' in 2 places:
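A minimal sketch of that search, run on a tiny made-up word list rather than /usr/share/dict/words (the list and the variable names are assumptions for illustration):

```shell
# Stand-in word list; the article searches /usr/share/dict/words instead
words='beekeeper
coffee
freewheel
keeper
squeegee
teepee'

# 'ee', then 1 or more characters, then 'ee' again; keep the first 5 hits
result=$(printf '%s\n' "$words" | grep -E 'ee.+ee' | head -n 5)
printf '%s\n' "$result"
```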

Here grep looks for any word in the list that contains 'ee' followed by 1 or more characters (.+) followed by 'ee'. I've piped the result to head to get just the first 5 results.

If I surround the first occurrence of the repeated pattern 'ee' with round brackets and refer to it with the backreference '\1', the command can be simplified a little:
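A sketch of the backreferenced form, again on a small made-up word list (the list and variable names are assumptions). Backreferences with '-E' are a GNU grep feature, so other grep implementations may behave differently:

```shell
# Stand-in word list; the article searches /usr/share/dict/words instead
words='beekeeper
coffee
freewheel
keeper
squeegee
teepee'

# (ee) is remembered as group 1; \1 must match the same text again
result=$(printf '%s\n' "$words" | grep -E '(ee).+\1' | head -n 5)
printf '%s\n' "$result"
```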

Without the '-E' option, grep defaults to basic regular expressions (BRE) and to get the same result I'd have to escape both the round brackets and the '+' metacharacter:
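A sketch of the BRE form on a made-up word list (the list and variable names are assumptions). Note that '\+' in a basic regular expression is a GNU extension:

```shell
# Stand-in word list; the article searches /usr/share/dict/words instead
words='beekeeper
coffee
freewheel
keeper
squeegee
teepee'

# No -E: the brackets and the + each need a backslash to act as metacharacters
result=$(printf '%s\n' "$words" | grep '\(ee\).\+\1' | head -n 5)
printf '%s\n' "$result"
```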

...complicated and sometimes confusing...

A backreference can be used more than once in a pattern, too, as shown below. Here I'm looking for words that contain 3 separate occurrences of 'ss':
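A sketch of a pattern that reuses one backreference twice, tried on a few sample words (the word list and variable names are assumptions for illustration):

```shell
# Stand-in word list with two- and three-'ss' words
words='assesses
Mississippi
possessiveness
stresslessness'

# (ss), then 1+ characters, then 'ss' again, then 1+ characters, then 'ss' again
result=$(printf '%s\n' "$words" | grep -E '(ss).+\1.+\1')
printf '%s\n' "$result"
```

Words with only two separate 'ss' pairs, such as 'assesses', are not matched.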

Single patterns

In an earlier Linux Rain article I explained how I built a tab-separated Australian gazetteer table, gazOz, in which the second field contains placenames. How many placenames are doubles, as in Wagga Wagga (New South Wales) and Nowa Nowa (Victoria)?

This command will find the doubled placenames, sort them alphabetically, uniquify them ignoring case, and count the total:
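A sketch of such a pipeline, run on a few made-up rows rather than the real gazOz table. The sample rows, the use of cut to pull out field 2, and the variable names are all assumptions for illustration:

```shell
# Made-up tab-separated rows: id, placename, state
rows=$(printf '1\tWagga Wagga\tNSW\n2\tAlice Springs\tNT\n3\tNowa Nowa\tVIC\n4\tWAGGA WAGGA\tNSW\n5\tBong Bong\tNSW\n6\tWagga Wagga\tNSW')

# Take field 2, keep doubled names, sort, drop duplicates ignoring case, count
count=$(printf '%s\n' "$rows" \
  | cut -f2 \
  | grep -E '^([[:alpha:]]+) \1$' \
  | sort \
  | uniq -i \
  | wc -l)
echo "$count"
```

Replacing the final `wc -l` with `head -n 5` would list the first 5 names instead of counting them.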

grep is here looking for a line that begins (^) with some combination of 1 or more letters ([[:alpha:]]+) followed by a space, followed by the same combination of letters (\1) followed by end-of-line ($). The first 5 results are:

And here are the first 5 doubled 'W' placenames:

Backreferences work with all characters, not just letters. The table gazOz has latitude and longitude in decimal degrees as fields 5 and 6. Here's a backreferenced command that looks for the same string of numbers after the decimal point in both latitude and longitude:
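A sketch of such a command on made-up rows in which latitude and longitude are fields 5 and 6 (the rows, names and coordinates are invented for illustration, and '\s' relies on GNU grep):

```shell
# Made-up tab-separated rows: id, name, state, type, latitude, longitude
rows=$(printf '101\tAlpha\tNSW\tLOCB\t-35.1234\t147.1234\n102\tBeta\tVIC\tLOCB\t-37.5000\t145.2500\n103\tGamma\tTAS\tLOCB\t-42.8800\t147.8800')

# Digits after the first decimal point are group 1; they must reappear
# after the decimal point of the next whitespace-separated number
result=$(printf '%s\n' "$rows" \
  | grep -E '\.([0-9]+)\s-?[0-9]+\.\1(\s|$)' \
  | cut -f2)
printf '%s\n' "$result"
```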

Note that the decimal point is escaped to prevent grep seeing it as 'any character', and that '\s' is GNU grep shorthand for any whitespace character, such as a space or a tab.

Being an AWK enthusiast, I wouldn't use grep and backreferences for this job. For me it would be simpler to tell AWK that fields are separated by a decimal point or a tab, then find the lines where the 4-digit number strings are the same:
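A sketch of the AWK approach on made-up rows (the rows and variable names are assumptions). With a decimal point or a tab as the separator, the digits after each decimal point land in fields 6 and 8, provided no earlier field contains a period:

```shell
# Made-up tab-separated rows: id, name, state, type, latitude, longitude
rows=$(printf '101\tAlpha\tNSW\tLOCB\t-35.1234\t147.1234\n102\tBeta\tVIC\tLOCB\t-37.5000\t145.2500\n103\tGamma\tTAS\tLOCB\t-42.8800\t147.8800')

# Appending "" forces a string comparison, so leading zeros are respected
result=$(printf '%s\n' "$rows" | awk -F'[.\t]' '$6"" == $8"" {print $2}')
printf '%s\n' "$result"
```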

A common exercise on online forums and regular expression websites is finding doubled words in a block of text, as in the sentence 'I went to to the market and and I bought a pie'. Two equivalent commands to do this are shown here:
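A sketch of two such commands, using the '-o' option to print only the matched text (the use of '-o' and the variable names are assumptions; '\w' and '\b' are GNU grep extensions):

```shell
text='I went to to the market and and I bought a pie'

# Variant 1: -w requires the whole match to stand alone as a 'word'
with_w=$(printf '%s\n' "$text" | grep -owE '(\w+) \1')

# Variant 2: explicit word boundaries at both ends of the match
with_b=$(printf '%s\n' "$text" | grep -oE '\b(\w+) \1\b')

printf '%s\n' "$with_w"
printf '%s\n' "$with_b"
```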

In both cases grep is searching for a single 'word'. That 'word' begins with 1 or more word elements (\w+) followed by a single space, followed by the same 1 or more word elements as a backreference. The 'word' search is specified for grep either by using the '-w' option or by beginning and ending the 'word' with the word boundary character '\b'.

Multiple patterns

Backreferences are especially handy when substituting text strings using sed, as in the example below. Notice that when interpreting backreferences, the round-bracketed expressions are numbered from left to right: '(dog)' is \1, '(bit a)' is \2 and '(man)' is \3.
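A sketch consistent with the grouping described above; the input sentence and the reordering in the replacement are assumptions for illustration:

```shell
# Groups are numbered left to right: (dog) is \1, (bit a) is \2, (man) is \3
result=$(echo 'dog bit a man' | sed -E 's/(dog) (bit a) (man)/\3 \2 \1/')
echo "$result"   # man bit a dog
```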

I recently audited a list of names followed by a comma followed by a space followed by a year, as in 'A. Smith, 1900' and 'van der Heusen, 1883'. That was the correct format. Some of the lines were missing a comma, as in 'Jones 1900' and 'W. Yao and T. Wang 1983'. To fix these I ran the list through a sed command with backreferences. Examples:
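A sketch of one such sed command, applied to the four example lines quoted above (the exact pattern and the variable names are assumptions):

```shell
names='A. Smith, 1900
van der Heusen, 1883
Jones 1900
W. Yao and T. Wang 1983'

# \1 is the letter before the gap, \2 is the first numeral of the year;
# lines that already have a comma do not match and pass through unchanged
fixed=$(printf '%s\n' "$names" | sed -E 's/([[:alpha:]]) *([0-9])/\1, \2/')
printf '%s\n' "$fixed"
```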

In this command sed searches for a letter (first bit to be backreferenced), followed by zero or more spaces, followed by a numeral (second bit to be backreferenced). sed replaces this pattern with 'letter, numeral'.

Want more?

Backreferences are good for finding palindromes — strings that read the same forwards and backwards. I honestly can't remember ever needing to find a palindrome, but anyway, these two commands have found 6- and 7-letter palindromes in my word list:
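A sketch of two such commands on a short made-up word list (the list and variable names are assumptions). The groups capture the first three characters, and the backreferences require them again in reverse order:

```shell
# Stand-in word list; the article searches /usr/share/dict/words instead
words='banana
deified
letter
redder
rotator'

# 6 letters: three characters, then the same three reversed
six=$(printf '%s\n' "$words" | grep -E '^(.)(.)(.)\3\2\1$')

# 7 letters: as above, with any single character in the middle
seven=$(printf '%s\n' "$words" | grep -E '^(.)(.)(.).\3\2\1$')

printf '%s\n' "$six" "$seven"
```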

Finally, for a very clear and thorough explanation of backreferences see

http://www.regular-expressions.info/backref.html
http://www.regular-expressions.info/backref2.html

by regex guru Jan Goyvaerts.



About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

Tags: tutorials scripting bash backreferences awk regular-expressions