神刀安全网

Proofreading for illusions with grep and AWK

Lexical illusions are very hard to find when proofreading. The most common lexical illusion is a duplicated word, as in this well-known example:

A lexical illusion: many people are not aware that the the brain will automatically ignore a second instance of the word 'the' when it starts a new line.
Proofreading for illusions with grep and AWK

But if you let grep and AWK do the proofreading — problem solved!

Duplications on one line:grep

Word duplications are sometimes just as easily overlooked when they’re on a single line, as in this example:

The Liberal Party has taken the old line so beloved of economic think tanks since the 1980s of cutting company taxes and and assisting innovation, while the ALP's approach involves an emphasis on on education spending and concerns over inequality that is increasingly becoming the new standard.
Proofreading for illusions with grep and AWK

A good way to find the dupes is with this grep code, which I found here :

grep -onE '(/b.+) /1/b'
Proofreading for illusions with grep and AWK

The grep options used are ‘-o’, which returns only the looked-for string, rather than the whole line; ‘-n’, which gives the number of the line with the looked-for string; and ‘-E’, which allows grep to use nifty things likebackreferences.

The looked-for string is a regular expression. The bit in brackets is word boundary (/b) followed by 1 or more appearances (+) of any character (.) . This is followed by a space, then a backreference (/1) representing the bit in brackets, and closing with another word boundary. If that last ‘/b’ wasn’t in the regex, grep would return non-duplications like ‘ on on e occasion’.

Duplications across successive lines: AWK

Here’s an AWK command that finds line pairs in which the last word of the first line is also the first word of the second line. It relies on the fact that AWK recognises words as separate fields, because they’re separated by whitespace, and that’s the default field separator for AWK.

awk '$1==a {print b"/n"$0; a=""; b=""} {a=$NF; b=$0}'
Proofreading for illusions with grep and AWK

AWK reads the text line by line, and the command begins with a pattern-action statement: if the first field (word) equals the variable ‘a’, AWK does the action in brackets, namely print… But wait! No variable ‘a’ has been defined yet, so that action can’t happen on the first line. AWK moves to the second action, which is to set the variable ‘a’ equal to the contents of the last field/word ( $NF ) in the first line, and the variable ‘b’ to the contents of the whole first line ( $0 ).

Next, AWK reads the second line. If the first word in this line is the same as the last word in the last line, AWK prints the last line (as stored in ‘b’), followed by a newline, followed by the whole current line. It then ’empties’ the two variables by setting them equal to the empty string ‘ "" ‘. Whether the first word equals the last word or not, AWK also does the second action, namely fill the two variables with the last word in the current line and the whole current line, ready for a test of the third line. If no line pairs pass the test, AWK doesn’t print any lines.

We can make this command slightly cooler by allowing for the possibility that there are lexical illusions on successive lines, as here:

A lexical illusion: many people are not aware that the the brain will automatically ignore ignore a second instance of the word 'the' when it starts a new line.  awk '$1==a {print b"/n"$0"/n"; a=""; b=""} {a=$NF; b=$0}'
Proofreading for illusions with grep and AWK

The command now adds a blank line after each two-line match, so that successive illusion-pairs appear separately in the output. Note that AWK will ignore spaces at the beginnings and ends of lines, since they’re separators, not fields:

Proofreading for illusions with grep and AWK

About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

转载本站任何文章请注明:转载至神刀安全网,谢谢神刀安全网 » Proofreading for illusions with grep and AWK

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址