Backreferences in replacements - OpenOffice.org Ninja

Backreferences in replacements

Posted by Andrew Z at Monday, December 31, 2007 | Permalink

Say no to manual labor (and use automation in OpenOffice.org)

OpenOffice.org 2.4 introduces backreferences in replacements. The feature is also called backward references, subexpression substitutions, or submatches in search and replace. Regardless of the name, the feature opens new, sophisticated possibilities in automation through regular expressions.

Backreferences themselves are not new: OpenOffice.org has supported backreferences in the Search for field, and now you can also use them in the Replace With field.

Example: Changing YYYY-MM-DD dates to MM/DD/YYYY

Assume you have a long, 100-page Writer document and you need to convert all dates in the format YYYY-MM-DD to MM/DD/YYYYY. It would be tedious to do it manually, so automate it with regular expressions and backreferences.

  1. Open the document in Writer.
  2. Choose Edit > Find & Replace from the menu.
  3. In Search for field, type ([0-9]{4})-([0-9]{2})-([0-9]{2})
  4. In the Replace with field, type $2/$3/$1
  5. Click the button More Options.
  6. Check the box Regular Expressions.
  7. Click Replace All.
The find and replace dialog demonstrates how to use a regular expression (also called a regex, regexp) with backreference substitutions (backward references, subexpression substitutions, or submatches)

How it works

The diagram illustrates how the regular expression with backreference substitution works (OpenOffice.org Writer 2.4+)

In the Search For field, each group is enclosed in parenthesis. In the example above, there are three groups. The first matches the year, the second the month, the third the year. In the Replace With field, each $x code substitutes the matched text found in the respective group.

Example: Redacting a Social Security Number

Say you want to turn many Social Security Numbers (SSN) like 987-65-4320 into a format like XXX-XX-4320. With backreferences in replacement, it is easy to automate. Using the instructions in the first example as a guide, search for this: [0-9]{3}-[0-9]{2}-([0-9]{4}) and replace with this: XXX-XX-$1

Related articles

9 comments:

Anonymous said...

Looking at your two examples would the 'replace' in the "Redacting the SSN" be XXX-XX-$3 instead of XXX-XX-$1 ??

Andrew Z said...

Anonymous,

I could see how that is confusing. The $1 refers to the first group in parenthesis. There is only one group, so $1 is correct.

If the search expression were ([0-9]{3})-([0-9]{2})-([0-9]{4}) the replacement could be XXX-XX-$3 and produce the same results.

I did forget the closing parenthesis, so I'm glad you encouraged me to take another look.


Andrew

Dad 2.0 (beta) said...

Even with "Regular expressions" checked, my backreferences are treated as literals. For example, cell contains "foobar", search is "(foobar)", replace is "$1"; this gives me "$1" in the cell _not_ "foobar".

I am familiar with regex but not OpenOffice - I guess I am missing a trick somewhere. Have you come across this behaviour before?

(OOO 2.3.0 on Fedora Core 7)

Andrew Z said...

Dan,

You need OpenOffice.org 2.4 which is due as a final release March 2008. Until then, you can test the unstable developer's snapshot.


Andrew

Dad 2.0 (beta) said...

D'oh! This is pretty clear from the title of the article - I should have spotted it, sorry. Many thanks for taking the time to respond.

Anonymous said...

In your first example, doesn't the example have the stated goal of changing the year to 2 digits YY from original 4 digits YYYY? (besides changing year location)?

Andrew Z said...

Orn: No, the year remains in YYYY, 4-digit format, so the first example just shuffles the locations and changes the delimiters. However, I updated the article to be a little more clear.

Unknown said...

Fantastic example! That was the "trick" I needed for my data!!

sbufe said...

Hi,
I'm looking to do a RegEx find and replace in OpenOffice where the backreference is made lower case. For example

1St

becomes

1st

I'm doing fine with the number matching for the first reference, and I can spit out the second reference as is, but is it possible to change to lower case?
Alternatively, I could return the whole thing ($) lower case. That would work just as well.

Thanks so much!