Heartland Chapter of the American Society for Indexing

Saving Time with Regular Expressions

By John Bealle
Spring 2010


I remember distinctly sitting in a coffee shop a few years ago, reading about a strange computer concept called "regular expressions." A few years later, I thought, "This may be the most important computer skill I'll learn in this decade." So here I am, and I've just finished an index where regular expressions, or "regex" as they're called, were an absolute lifesaver.

Regular expressions are a way to search and replace patterns in text strings. You use a faint trace of it when you search with wildcards, such as "*" for any string of text. Most well-developed software has some wildcard search capacity, and a few contain regex engines. There's no standard that I know of, so all regex modules differ in their details. In this article, I just want to give a few examples to show how regex is used.
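For a taste of the difference, here is a small sketch in Python, whose re module is one such engine. The sample text is invented for illustration; the pattern uses the {1,4} shorthand for "one to four digits," which some engines write out as \d\d?\d?\d?.

```python
import re

# A wildcard like "*" matches anything; a regular expression can say
# exactly what each position may contain.
text = "Eaton, Jonas, 2.34a, 4.567b"

# \d\d? = one or two digits, \. = a literal period, \d{1,4} = one to
# four digits, [a-z] = a single lowercase suffix letter.
pattern = r"\d\d?\.\d{1,4}[a-z]"

print(re.findall(pattern, text))  # → ['2.34a', '4.567b']
```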

The infamous last-minute request

In a recent index—a huge catalog of sheet music using catalog numbers as locators—the author decided at the last minute that combining strings of locators in a single series could shorten some of the locator lists. The locators that could be combined were actually catalog numbers in this form:

          Eaton, Jonas, 2.34a, 2.433c, 4.567a, 4.567b, 4.567c, 4.567d, 4.567e, 6.788m

I wanted to shorten the sequential catalog numbers to ranges:

          Eaton, Jonas, 2.34a, 2.433c, 4.567a-e, 6.788m

Catalog numbers contained as many as two digits before the period and up to four after.

Rather than look manually through the whole file, I imported the file into a text editor (more about that later) and wrote a regex search to find any five sequential catalog numbers:
          (\d\d?\.\d\d?\d?\d?)a, \1b, \1c, \1d, \1e

Where
          (…)   parentheses are a grouping operator, with each set of parentheses taking a number in the order of appearance. Since this is the first, it is number one, and the text it finds can be referenced later in the search or replace by "\1" or "$1"
          \d    any digit
          \d?   any digit that might or might not occur
          \.    the period character
          a     the letter "a"—the five letters "a, b, c, d, e" and the comma-spaces that follow are the only real text characters that appear in the search
          \1    a "backreference," which matches the text found in the first group (the preceding parentheses)

So the first part—"(\d\d?\.\d\d?\d?\d?)a"—finds "4.567a" or any other catalog number with an "a" suffix. And the search string as a whole finds any instance where the same number is repeated five times with sequential alphabetic suffixes. The regex search highlighted the sequential catalog numbers for me:

          Eaton, Jonas, 2.34a, 2.433c, 4.567a, 4.567b, 4.567c, 4.567d, 4.567e, 6.788m 

The same regex search could be modified for any number of sequential locators. I chose to collapse the sequential catalog numbers into ranges manually, but using the regex searches allowed me to make my way quickly through a very large index.
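Here is how this might look in Python, whose re module supports the same syntax. The search pattern is the one above, verbatim; the re.sub call goes one step further and automates the range conversion that I did by hand.

```python
import re

line = ("Eaton, Jonas, 2.34a, 2.433c, 4.567a, 4.567b, "
        "4.567c, 4.567d, 4.567e, 6.788m")

# The search pattern from above: the group captures the catalog number,
# and each \1 backreference requires that same number to repeat before
# the b, c, d, and e suffixes.
pattern = r"(\d\d?\.\d\d?\d?\d?)a, \1b, \1c, \1d, \1e"

# Replace the five sequential entries with a single range.
print(re.sub(pattern, r"\1a-e", line))
# → Eaton, Jonas, 2.34a, 2.433c, 4.567a-e, 6.788m
```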

Converting the original PDF to a workable format

The most important use of regex in this index was in importing the data at the beginning. The proofs were in PDF format, so I was able to use the Adobe Save-As-Text command to create a very messy .txt file. However, each catalog entry began with the catalog number, which could be easily recognized in the form used above. (The catalog numbers would eventually become locators, but in the .txt file it was easiest to work with them at the beginning of the line.)

The first task was to strip away line breaks so that each catalog entry was a single line. So, where the catalog might have

          4.567a  Eaton, Jonas
                  Etude
                  Milburn Publishers, 1933.

I wanted to work with a single line:

          4.567a Eaton, Jonas Etude Milburn Publishers, 1933

This was the regular expression that I used:

Search--
          (\d\d?\.\d\d?\d?\d?\w\w?,.*?)\n(\D)

Replace--
          $1 $2

Where
          \w    the catalog number search string is the same as before, except that a \w code, representing any letter, now follows the digits
          \w?   any letter that might or might not appear, since some sequences ran past "z" and used "aa, bb," etc.
          .*?   a common search string: "." means any character of any type; "*" means the previous character repeated any number of times; "?" after a repeat operator represents a "stingy" search that repeats until the first (rather than the last) instance of the next character
          \n    the "newline" character, or line break
          \D    any character that is not a digit (e.g., the first letter of "Etude")

Since the line break "\n" is not included in the replace, this procedure removes line breaks that precede a line that does not begin with a catalog number. The first pass of the replace combines the "Eaton, Jonas" line with "Etude"; the next pass combines that with "Milburn Publishers," until eventually everything is on one line. I repeat it a few times, tweak the "\D" to eliminate some digit patterns that aren't catalog numbers, and pretty soon I have a list of catalog entries with the catalog number flush left. I sort the lines, and voilà, I have a sorted list of catalog entries.
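A Python sketch of this pass follows. The sample entry is simplified, and I have given it a comma after the catalog number, which the literal "," in the search string requires; the replace is looped until nothing changes, mirroring the repeat-it-a-few-times step.

```python
import re

# A simplified three-line entry; note the comma after the catalog
# number, which the literal "," in the search string expects.
entry = "4.567a, Eaton, Jonas\nEtude\nMilburn Publishers, 1933"

pattern = r"(\d\d?\.\d\d?\d?\d?\w\w?,.*?)\n(\D)"

# Each pass joins one continuation line onto its catalog entry;
# repeat until the text stops changing.
previous = None
while entry != previous:
    previous = entry
    entry = re.sub(pattern, r"\1 \2", entry)

print(entry)  # → 4.567a, Eaton, Jonas Etude Milburn Publishers, 1933
```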

Importing the .txt file into an indexing program
I'll just illustrate one more thing that's important but at this point not that difficult—how to get this modified file into an indexing program. Fortunately, the catalog entries were comma-separated fields that were mostly in the same order:

          4.567a, Etude, Jonas Eaton, Milburn Publishers, 1933

so I could use this to produce a csv-style format that I could then import into the index software.

Search--
          ^(.*?), (.*?), (.*?), (.*?), (.*)$

          ^       means the next character is the beginning of a line
          $       means the preceding character is the end of a line (the final group is greedy, ".*" rather than ".*?", so it captures everything up to the end of the line)

Replace--
          "$1","$2","$3","$4","$5"

          $1      the "backreference" for a replace operation, which fills in the text from the first found group (the first set of parentheses in the Find text)
After the regex search and replace, entries in the .txt file (csv format) looked like this:

          "4.567a","Etude","Jonas Eaton","Milburn Publishers","1933"

I can use the same Find text and adjust the Replace to produce different types of import files. If the index format is "Main | Sub1 | Page" then I can import records such as these: 

 Replace--
          "$1","$2","" — imports as "Catalog Number | Title | blank"
          "$3","","$1" — "Composer | blank | Catalog Number"

In our example entry, the above replacements give the following records:

          "4.567a", "Etude", [blank]
          "Jonas Eaton", [blank], "4.567a"
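A Python sketch of this conversion is below. The pattern here allows for the space after each comma and anchors the final greedy group with "$" so the last field runs through the end of the line; those are my adjustments so the example behaves cleanly.

```python
import re

line = "4.567a, Etude, Jonas Eaton, Milburn Publishers, 1933"

# Four lazy groups split the first four comma-separated fields; the
# final greedy group, anchored with $, captures the rest of the line.
pattern = r"^(.*?), (.*?), (.*?), (.*?), (.*)$"

# One Replace string per import layout, as described above.
print(re.sub(pattern, r'"\1","\2",""', line))  # Catalog Number | Title | blank
print(re.sub(pattern, r'"\3","","\1"', line))  # Composer | blank | Catalog Number
```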

The bottom line is that in about an hour I was able to import over 4,000 catalog records into my index without typing a single one. In truth it's a bit more complicated than that—page numbers, for example, are difficult to extract from PDF text. Still, this was a huge timesaver. 

The Where and the How

So you might be wondering how to get started using regular expressions. In word processors, this is one area where OpenOffice far surpasses Microsoft Word. As for text editors, a few years ago I did an exhaustive survey to find the best one. I would download one, try it for a few days, then move on to the next, entering my impressions in a table of features I was evaluating. When I got to a free program called jEdit, I simply gasped in awe and stopped looking. It is a little complicated and not very pretty, but in features it outclasses the competition by light years.

jEdit operates on .txt files the way Notepad does, so I can export or import from my indexing software to .txt or csv, or just copy selected text from MSWord and paste it into jEdit. As with almost all programs, text editors use Ctrl-F to search. There will be a "Use regular expressions" check box if that feature is available.

Regular expressions are all about being clever—about seeing and representing patterns in text. I remember discovering them when I was learning to program in Perl, which was a popular web development language not long ago. Perl is a favorite in hacker culture, where cleverness is highly regarded. There is even a game called Perl golf, where players are assigned a task and compete to write the program with the fewest characters. That's the spirit involved in using regex.


© 2010 by Heartland Chapter of ASI. All rights reserved.