Several of my data-auditing scripts take a long time to execute. Until recently I had no way of knowing how long they'd take. I'd enter the command for the script, press Enter and wait. And wait.
I've now discovered pv, the "pipe viewer" command, which tracks data as it moves through a pipe and so shows me how far along my scripts are.
An example script is gremlins2, described on my data-cleaning website. The script looks in a text file for non-printing characters other than space, horizontal tab, soft hyphen and non-breaking space, then tallies the various invisible characters it finds and gives their hexadecimal values. The command takes the filename as its one argument.
To run gremlins2 on the 250+ MB file refs0, which contains a total of 263,737,236 characters, I just enter "gremlins2 refs0" in a terminal. And wait.
To follow the progress of the script with pv I have a couple of choices. One is to use pv like cat and pipe refs0 to gremlins2, like this:
pv refs0 | gremlins2
My preferred options for pv are "-p", which generates a simple progress bar; "-b", which reports the number of bytes processed; "-t", which keeps track of elapsed time; and "-w 50", which constrains the pv output in my terminal to a width of 50 characters. So:
pv -w 50 -pbt refs0 | gremlins2
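To try these options without gremlins2, here's a minimal sketch that builds a throwaway file and pipes it through pv to "wc -c". The temporary file, the function name pipe_bytes and the use of "wc -c" as a stand-in for the real script are all assumptions for illustration. The sketch falls back to cat if pv isn't installed.

```shell
# Hypothetical helper: send a file through pv (or cat) and count the bytes
# that arrive at the other end of the pipe.
pipe_bytes() {
    if command -v pv >/dev/null 2>&1; then
        # -p progress bar, -b byte count, -t elapsed time, -w 50 display width
        pv -w 50 -pbt "$1" | wc -c
    else
        # No pv available: same data flow, but no progress display
        cat "$1" | wc -c
    fi
}

# Build a throwaway test file containing the numbers 1 to 200000, one per line.
sample=$(mktemp)
seq 1 200000 > "$sample"

# pv draws its progress display on stderr, so only the file's data reaches wc.
bytes=$(pipe_bytes "$sample")
echo "bytes piped: $bytes"

rm -f "$sample"
```

Because the progress display goes to stderr, the command at the end of the pipe sees exactly the same data it would get from cat.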
And here's what I see after the process has finished:
Does running pv slow down the overall process? Not according to the time command:
A second way to use pv is to build it into the script itself. In other words, instead of the script's first command reading the file named in the argument directly, as in
command-in-script "$1"
you have the first line of the script saying
pv "$1" | command-in-script
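As a rough sketch of that pattern (not the actual gremlins2 code), the functions below feed the file named in the first argument through pv and into a stand-in command. The names feed, count_lines and run_with_progress are hypothetical, and "wc -l" simply takes the place of whatever the script really does. The fallback to cat is my own addition, so the script still works on a machine without pv.

```shell
# Hypothetical helper: read the file named in $1 through pv,
# or through plain cat if pv isn't installed.
feed() {
    if command -v pv >/dev/null 2>&1; then
        pv -w 50 -pbt "$1"
    else
        cat "$1"
    fi
}

# Stand-in for command-in-script: count the lines arriving on stdin.
count_lines() {
    wc -l
}

# Equivalent of:  pv "$1" | command-in-script
run_with_progress() {
    feed "$1" | count_lines
}
```

With these defined, "run_with_progress refs0" shows the progress display in the terminal while the result of the stand-in command appears on standard output as usual.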
Here's the result of running a version of the gremlins2 script with pv included:
So how come I'd never heard of this excellent pv program before? (Scratches grey beard in puzzlement...)
Top image by "btr" from Wikimedia Commons