The Linux Rain Linux General/Gaming News, Reviews and Tutorials

How to tidy copied PDF text with a CoPa script

By Bob Mesibov, published 30/06/2017 in Tutorials


The image below shows a small, demonstration PDF:

When I highlight the text in the PDF, only the paragraphs and their internal spaces go to the X primary selection (the "primary clipboard"):

When I middle-click-paste from the primary clipboard into a text editor, each line of text in the PDF appears on a new line, with the spaces outside the paragraph text (in the original) missing:

I'd prefer each paragraph on its own line, with hyphenated words stitched back together again, like this:

I could edit the pasted text by hand, but that would get tedious with big slabs of copied PDF text. A quicker way is to use the command line.

Tidying commands

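I first add a "flag" to the end of any line that ends with a hyphenated word or a full stop, and a space to the end of every line, using sed. Then I collapse the text into a single line with tr, and finally split that line into paragraphs and remove the flags. Here's how: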

The first sed command looks for a letter followed by a hyphen followed by the end of a line, indicating a hyphenated word. It replaces the hyphen with a flag made of three "@" characters ("@@@").
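As a minimal sketch of this first step, the snippet below runs the hyphen-flagging expression from lfkill on a made-up two-line sample. The function name flag_hyphens and the sample text are my own, for illustration only.

```shell
# Hypothetical helper wrapping the first sed expression
flag_hyphens() {
    sed 's/\([[:alpha:]]\)-$/\1@@@/'
}

# Made-up sample: one word broken across lines, one ordinary hyphen mid-line
printf '%s\n' 'A well-known word is hyphen-' 'ated here.' | flag_hyphens
# Output:
# A well-known word is hyphen@@@
# ated here.
```

Only a hyphen at the very end of a line is flagged, so "well-known" is left alone. Because the pattern requires a letter before the hyphen, a line ending in something like "1990-" would not be flagged either.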

The next sed command looks for a full stop (".") at the end of a line, and appends a flag of three "q"s ("qqq") after it. This flag usually (but not always) indicates the end of a paragraph in the original PDF.
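A small sketch of the full-stop flag, again on invented sample text and with a function name (flag_full_stops) that is mine rather than part of lfkill:

```shell
# Hypothetical helper wrapping the second sed expression
flag_full_stops() {
    sed 's/\.$/.qqq/'
}

# Made-up sample: full stops inside a line, and one at the end of a line
printf '%s\n' 'See e.g. the manual' 'for details.' | flag_full_stops
# Output:
# See e.g. the manual
# for details.qqq
```

Full stops inside a line are ignored. A sentence that happens to end exactly at a line break in the middle of a paragraph would still be flagged, which is why the flag is only a "usually" reliable paragraph marker.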

The third sed command in this chain adds a trailing space to each line.
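Trailing spaces are invisible in a terminal, so this sketch pipes the result through sed's "l" command, which prints each line with a "$" marking its end. The function name add_trailing_space and the sample lines are assumptions for illustration.

```shell
# Hypothetical helper wrapping the third sed expression
add_trailing_space() {
    sed 's/$/ /'
}

# sed -n 'l' prints each line unambiguously, marking its end with "$"
printf '%s\n' 'first line' 'second line' | add_trailing_space | sed -n 'l'
# Output:
# first line $
# second line $
```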

I use tr to collapse the text into a single line by deleting all newlines.
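The sketch below shows why the trailing space matters once the newlines are deleted. The function name collapse and the sample lines are mine; the echo after each call only adds a final newline so the terminal prompt starts on a fresh line.

```shell
# Hypothetical helper wrapping the tr step
collapse() {
    tr -d '\n'
}

# Without trailing spaces, words at line ends run together
printf '%s\n' 'end of one' 'line' | collapse; echo
# Output: end of oneline

# With trailing spaces, the word boundary survives
printf '%s\n' 'end of one ' 'line ' | collapse; echo
# Output: "end of one line " (note the final space)
```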

The final tidying command uses sed to delete each hyphen flag and its following space, and to replace each end-of-paragraph flag and its following space with a pair of newlines.
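To see the final step in context, here is a sketch of the whole chain run on a made-up four-line sample instead of the clipboard. The function name tidy and the sample text are assumptions; the sed and tr expressions are the ones from lfkill. It assumes GNU sed, which accepts "\n" in the replacement part of the s command.

```shell
# Hypothetical wrapper around the full tidying chain (assumes GNU sed)
tidy() {
    sed 's/\([[:alpha:]]\)-$/\1@@@/;s/\.$/.qqq/;s/$/ /' \
    | tr -d '\n' \
    | sed 's/@@@ //g;s/qqq /\n\n/g'
}

# Made-up sample standing in for pasted PDF text
sample='Copied PDF text often breaks long para-
graphs across several lines.
A second paragraph fol-
lows the first one.'

printf '%s\n' "$sample" | tidy
# Output (each paragraph is followed by a blank line):
# Copied PDF text often breaks long paragraphs across several lines.
#
# A second paragraph follows the first one.
```

Deleting "@@@ " removes both the hyphen flag and the space added after it, so "para@@@ graphs" becomes "paragraphs". Replacing "qqq " rather than ".qqq " keeps the full stop at the end of each paragraph.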

The CoPa trick

My "CoPa" scripts (I named them for Copy and Paste) rely on the xclip utility. All of them work by doing something to text or an image between copying and pasting:

  • Highlight [something] to copy it to the X clipboard
  • Launch CoPa script to modify [something]
  • Middle-click-paste the modified [something]
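The three steps above suggest a general skeleton, sketched below. This is my own illustration rather than one of the author's scripts: the transform function (here it just upper-cases the text) and the check for an X display are assumptions, and any filter that reads stdin and writes stdout could take the place of transform.

```shell
#!/bin/bash
# Hypothetical CoPa skeleton: read the X primary selection, modify it,
# and write the result back so a middle-click pastes the modified text

# Placeholder transformation, for illustration only
transform() {
    tr '[:lower:]' '[:upper:]'
}

# Guard (my addition): only touch the selection when X and xclip are available
if [ -n "${DISPLAY:-}" ] && command -v xclip >/dev/null 2>&1; then
    xclip -o | transform | xclip -i
fi
```

With no -selection option, xclip reads from and writes to the primary selection, which is the one a middle-click pastes.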

Below is a CoPa script for tidying copied PDF text. I call it lfkill (for "linefeed killer") and use it quite a bit for grabbing chunks of text in PDFs and pasting them into text or LibreOffice Writer documents. In other words, it does what's shown above without the copied text having to be saved to a new file.

I launch lfkill with the keyboard shortcut Super+k because I'm a keyboard addict, but it might be more efficient if I launched lfkill from a desktop icon, making three mouse actions in a row: highlight, left-click, middle-click.

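#!/bin/bash
# lfkill: tidy copied PDF text held in the X primary selection

# Read the selection; flag line-end hyphens (@@@) and full stops (qqq),
# and add a trailing space to every line
xclip -o \
| sed 's/\([[:alpha:]]\)-$/\1@@@/;s/\.$/.qqq/;s/$/ /' \
| tr -d '\n' \
| sed 's/@@@ //g;s/qqq /\n\n/g' \
| xclip -i
# tr joins everything into one line; the last sed removes the hyphen flags
# and turns paragraph flags into blank lines; xclip -i stores the result

exit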


About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.
