The Linux Rain Linux General/Gaming News, Reviews and Tutorials

Split a table and number the pieces: two methods

By Bob Mesibov, published 28/05/2015 in Tutorials


I worked on these bits of code to simplify a shell script of mine. You really wouldn't want to see the script's original commands here — they're way too ugly! The new commands are much prettier and work well.

The problem

To simplify the story, here's a comma-separated table called demo, sorted on the strings in the second field:

What I want to do is split demo into 3 tables named A1, A2 and A3, based on the unique, sorted values in field 2. A1 will contain the 'xxx' lines, A2 the 'yyy' lines and A3 the 'zzz' lines.

Method 1

The first solution is based on code in a Stack Overflow contribution from user 'Birei':

awk -F"," '{if (!a[$2]) {a[$2]=++c}; print >> "A"c}' demo

Like some other elegant AWK commands, this one has to be teased apart carefully to see how it works.

AWK is first told with -F"," that lines in demo are broken into fields by a comma.

The main part of the command begins with an 'if' test which asks a question about the array 'a'. (For an introduction to AWK arrays, see this article by Ramesh Natarajan.) Asking the question causes AWK to create 'a' as it reads the first line of demo. The question is: are the contents of field 2 ($2) not already an index of 'a'?

If the contents of field aren't already in 'a', AWK sets the value of the array element for which $2 is an index to an auto-incremented variable, '++c'. The prefixing of '++' means that the incremented value of 'c' can be used elsewhere in the command. The default starting value of an auto-incremented variable is 0, and the default increment is 1.

When AWK reads the first line of demo and finds that 'xxx' isn't already an index of 'a', i.e. th answer to the 'if' question is 'true', AWK increments 'c' to 1. AWK then prints the first line as an appendix to a file it creates whose name is the current value of 'c', namely 1, prefixed with the letter A.

AWK now reads the second line of demo. This time the answer to the 'if' question is 'false', so AWK doesn't increment 'c'. The array 'a' still contains just one element, with index 'xxx' and value '1'. The second line is appended to A1. Same with the third line of demo.

When AWK reads the fourth line, it sees 'yyy' in field 2 and the answer to the 'if' question is again 'true' — 'yyy' isn't an index in the array. AWK makes 'yyy' an index, but this time the value of the array element is incremented to 2. AWK prints the fourth line to a new file A2 because the current value of 'c' is now 2.

And so on. To watch this work line by line, we can modify the command so that after reading each line AWK prints field 2 (also the array index), the corresponding value in the array, and the value of 'c':

Method 2

If you're not comfortable with AWK arrays, this 2-step solution might be more appealing.

awk -F"," '{print >> "@@"$2}' demo
c=1; for i in @@*; do printf -v nom "A%01d" "$((c++))"; mv $i $nom; done

In the first command, AWK works through demo one line at a time, sending the 'xxx' lines to a new file called '@@xxx', the 'yyy' lines to '@@yyy' and the 'zzz' lines to '@@zzz'. This command is a simplification of one I demonstrated in a 2014 Linux Rain article.

The second command is a 'for' loop in which each of the '@@' files is renamed one at a time by the mv command. The new name is a value of the variable 'nom', which is refreshed by the printf command for each '@@' file. printf prints 'A' followed by an integer incremented 1 by 1 with BASH double-bracket arithmetic on the variable 'c'. The command starts by giving 'c' an initial value of 1.

Caution

The AWK commands works well if the number of unique values in the 'splitting field' is small. In the real-world case that inspired this article, there are usually less than 20 unique values. If the number of unique values is very much larger, it would be wise to close the split file at each reading of a line, to avoid having too many open files:

awk -F"," '{if (!a[$2]) {a[$2]=++c}; print >> "A"c; close("A"c)}' demo

awk -F",t" '{print >> "zz"$2; close("zz"$2)}' demo


About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

Tags: tutorials scripting bash awk arrays tables
blog comments powered by Disqus