The Linux Rain Linux General/Gaming News, Reviews and Tutorials

Sorting numbers inside text strings

By Bob Mesibov, published 15/06/2017 in Tutorials


I have a script that compiles and sorts museum collection data. One of the data fields is for museum registration codes, and a plain sort doesn't understand that humans expect "KS.99", for example, to come before "KS.100":

My script instead uses the magic "-V" option for GNU sort, and gets the human-friendly result:

Behind the magic trick

"V" is short for "version" (the long option is "--version"). As it says in the GNU coreutils manual, the purpose is to:

Sort by version name and number, lowest first. It behaves like a default sort, except that each sequence of decimal digits is treated numerically as an index/version number.

The "-V" trick also works if the numbers are in the middle of a string:

and it gives the order that humans would expect even when there are numbers embedded in several places in the string:

You can also use sort -V on particular fields in a sort:

Limits to the magic

The GNU developers warn that version-sorting looks for strings built like this:

prefix[number]suffix

where "suffix" matches the regular expression

(\.[A-Za-z~][A-Za-z0-9~]*)*

and that some strings may not sort as expected. I haven't had a problem with my museum registration codes, though, even for "sub-sample" codes:

Another warning is that non-numeric "prefix" strings are sorted using C-style collation, with uppercase letters before lowercase. This could indeed be a problem:

Sorting headaches

Sorting by computer is horribly complicated. I'm very grateful to the GNU folk for sort's "-V" option, and just wish there was something equally straightforward for "version sorting" in Gnumeric and LibreOffice Calc spreadsheets. The simplest way I know to get the expected result in those two programs is to split the field to be sorted into non-numeric and numeric pieces (for example, "KS.99" becomes "KS" and "99"), sort the numbers, then join the pieces together again. There's a "natural sort" option in Calc, but it doesn't work for some strings and is still a little buggy.

Top image: sorting crabs in Scotland; public domain, Wikimedia Commons



About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

Tags: sort sorting tutorials coreutils cli gnu
blog comments powered by Disqus