Saving Time with Regular Expressions
By John Bealle
I remember distinctly sitting in a coffee shop a few years ago, reading about a strange computer concept called "regular expressions." Then, a few years later, thinking, "This may be the most important computer skill I'll learn in this decade." So here I am, and I've just finished an index where regular expressions, or "regex" as they're called, were an absolute lifesaver.
Regular expressions are a way to search and replace patterns in text strings. You use a faint trace of them when you search with wildcards, such as "*" for any string of text. Most well-developed software has some wildcard search capacity, and a few programs contain full regex engines. There's no single standard that I know of, so regex modules differ in their details. In this article, I just want to give a few examples to show how regex is used.
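To make the wildcard comparison concrete, here is a minimal sketch using Python's re module (one regex engine among many; the sample titles are invented for illustration). A shell-style wildcard like "Son*" loosely corresponds to the regex "Son.*":

```python
import re

# "." matches any single character; "*" means "zero or more of the
# preceding item" -- so "Son.*" matches "Son" followed by anything.
titles = ["Sonata in C", "Song Cycle", "Prelude"]
matches = [t for t in titles if re.search(r"Son.*", t)]
print(matches)
# → ['Sonata in C', 'Song Cycle']
```

The payoff of regex over plain wildcards is that "." and "*" are just two of many building blocks, as the examples below show.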
The infamous last-minute request
In a recent index—a huge catalog of sheet music using catalog numbers as locators—the author decided at the last minute that combining strings of locators in a single series could shorten some of the locator lists. The locators that could be combined were actually catalog numbers in this form:
Eaton, Jonas, 2.34a, 2.433c, 4.567a, 4.567b, 4.567c, 4.567d, 4.567e, 6.788m
I wanted to shorten the sequential catalog numbers to ranges:
Eaton, Jonas, 2.34a, 2.433c, 4.567a-e, 6.788m
Catalog numbers contained as many as two digits before the period and up to four after.
Rather than look manually through the whole file, I imported the file into a text editor (more about that later) and wrote a regex search to find any five sequential catalog numbers:
So the first part—"(\d\d?\.\d\d?\d?\d?)a"—finds "4.567a" or any other catalog number with an "a" suffix. And the search string as a whole finds any instance where the same number is repeated five times with sequential alphabetic suffixes. The regex search highlighted the sequential catalog numbers for me:
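The full search string is not reproduced here, but based on the first part and the description ("the same number repeated five times with sequential alphabetic suffixes"), a plausible reconstruction uses a backreference, "\1", which requires the text captured by the first group to appear again. A Python sketch of that idea (my reconstruction, not necessarily the author's exact pattern):

```python
import re

line = "Eaton, Jonas, 2.34a, 2.433c, 4.567a, 4.567b, 4.567c, 4.567d, 4.567e, 6.788m"

# (\d\d?\.\d\d?\d?\d?) captures a catalog number: one or two digits,
# a period, then one to four digits.  "\1" repeats that same capture,
# so the pattern only matches five copies suffixed a through e.
pattern = r"(\d\d?\.\d\d?\d?\d?)a, \1b, \1c, \1d, \1e"
shortened = re.sub(pattern, r"\1a-e", line)
print(shortened)
# → Eaton, Jonas, 2.34a, 2.433c, 4.567a-e, 6.788m
```

In an editor's Find dialog the same idea applies, though editors differ on whether a backreference is written "\1" or "$1" in the Replace field.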
Eaton, Jonas, 2.34a, 2.433c, 4.567a, 4.567b, 4.567c, 4.567d, 4.567e, 6.788m
(with the run from "4.567a" through "4.567e" highlighted)
Converting the original PDF to a workable format
The most important use of regex in this index was in importing the data at the beginning. The proofs were in PDF format, so I was able to use the Adobe Save-As-Text command to create a very messy .txt file. However, each catalog entry began with the catalog number, which could be easily recognized in the form used above. (The catalog numbers would eventually become locators, but in the .txt file it was easiest to work with them at the beginning of the line.)
The first task was to strip away line breaks so that each catalog entry was a single line. So, where the catalog might have
4.567a Eaton, Jonas
Etude
Milburn Publishers, 1933.
I wanted to work with a single line
4.567a Eaton, Jonas Etude Milburn Publishers, 1933
This was the regular expression that I used:
Since the line break “\n” is not included in the replace, this procedure will remove line breaks that precede a line that is not a catalog number. The first replace (“$1”) combines the “Eaton, Jonas” line with “Etude,” the second (“$2”) combines that with “Milburn Publishers,” until eventually all is on one line. I repeat it a few times, tweak the "\D" to eliminate some digit patterns that aren't catalog numbers, and pretty soon I have a list of catalog entries with the catalog number flush left. I sort the lines, and voilà, I have a sorted list of catalog entries.
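The author's exact pattern is not shown, but the technique described — remove a line break only when the following line is not a catalog number — can be sketched in Python with a negative lookahead. The sample text below is hypothetical:

```python
import re

raw = "4.567a Eaton, Jonas\nEtude\nMilburn Publishers, 1933.\n6.788m Sinfonia"

# (?!\d\d?\.\d) is a negative lookahead: the "\n" is replaced by a
# space only when the next line does NOT begin with a catalog number.
joined = re.sub(r"\n(?!\d\d?\.\d)", " ", raw)
```

After this runs, "joined" holds one full line per catalog entry, with the line break before "6.788m Sinfonia" preserved because that line starts a new entry.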
Importing the .txt file into an indexing program
I'll just illustrate one more thing that's important but at this point not that difficult—how to get this modified file into an indexing program. Fortunately, the catalog entries were comma-separated fields that were mostly in the same order:
4.567a, Etude, Jonas Eaton, Milburn Publishers, 1933
so I could use this to produce a csv-style format that I could then import into the index software.
In the search pattern, "^" means the next character must be at the beginning of a line.
After the regex search and replace, entries in the .txt file (csv format) looked like this:
“4.567a”, “Etude”, “Jonas Eaton”, “Milburn Publishers”, “1933”
I can use the same Find text and adjust the Replace to produce different types of import files. If the index format is "Main | Sub1 | Page" then I can import records such as these:
"$1","$2","" — imports as "Catalog Number | Title | blank"
"$3","","$1" — "Composer | blank | Catalog Number"
In our example entry, the above replacements give the following records:
“4.567a”, “Etude”, [blank]
“Jonas Eaton”, [blank], “4.567a”
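Reusing one Find string with different Replace strings can be sketched the same way, again assuming a hypothetical five-field pattern:

```python
import re

entry = "4.567a, Etude, Jonas Eaton, Milburn Publishers, 1933"
find = r"^(.*?), (.*?), (.*?), (.*?), (.*)$"  # five comma-separated fields

# "Main | Sub1 | Page" import records built from the same groups:
record1 = re.sub(find, r'"\1","\2",""', entry)   # Catalog Number | Title | blank
record2 = re.sub(find, r'"\3","","\1"', entry)   # Composer | blank | Catalog Number
print(record1)  # → "4.567a","Etude",""
print(record2)  # → "Jonas Eaton","","4.567a"
```

One search, many Replace strings — that is what makes it possible to generate several kinds of import records from the same cleaned-up file.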
The bottom line is that in about an hour I was able to import over 4,000 catalog records into my index without typing a single one. In truth it's a bit more complicated than that—page numbers, for example, are difficult to extract from PDF text. Still, this was a huge timesaver.
The Where and the How
So you might be wondering how to get started using regular expressions. In word processors, this is one area where OpenOffice far surpasses Microsoft Word. As for text editors, a few years ago I did an exhaustive survey to find the best one. I would download one, try it a few days, then move on to the next, entering my impressions in a table of features I was evaluating. When I got to a free program called jEdit, I simply gasped in awe and stopped looking. It is a little complicated and not very pretty, but in features it outclasses the competition by light years.
jEdit operates on .txt files the way Notepad does, so I can export or import from my indexing software to .txt or csv, or just copy selected text from MS Word and paste into jEdit. As with most programs, text editors use Ctrl-F to search. There will be a "Use regular expressions" check box if that feature is available.
Regular expressions are all about being clever—about seeing and representing patterns in text. I remember discovering them when I was learning to program in Perl, which was a popular web development platform not long ago. Perl is a favorite in hacker culture, where cleverness is highly regarded. They even have a game called Perl Golf, where they assign a task and see who can write the shortest program that solves it. That's the spirit involved in using regex.
© 2010 by Heartland Chapter of ASI. All rights reserved.