Powerful text matching with regular expressions - OpenOffice.org Ninja

Powerful text matching with regular expressions

Posted by Andrew Z on Thursday, December 27, 2007

[Image: R-G-B by code poet, from Flickr]

Regular expressions (also called regex or regexp) are a sophisticated system for precisely matching text patterns. Regexps give you a high level of control and flexibility that is not possible with a regular find or simple wildcards. When writing a regexp, you can specify what to match, where to match it, and how many times to match it (quantification). We'll cover those, give examples, and walk through using regexps in OpenOffice.org. Once you find patterns with regexps, you can globally perform any of the normal operations on the selected text, including replacing, formatting, and deleting.

Regexps come in many dialects and are used in various systems including Perl, PHP, C#, and Microsoft Office. The dialect of regexps in Writer, Calc, Impress, and Base resembles (but does not exactly mimic) POSIX regexps, which are based on traditional Unix regexps. Do not expect any two dialects to be exactly the same.

Important notice

In the example matches, only the underlined portions are matched. The unmatched parts are included to give the examples better context.

Literals

Most non-regexp search patterns are literal strings, or simply literals. When you search for "bar", the pattern "bar" is a plain literal.

Expression   Matches                           Does not match
bar          bar barb bars crowbar embarrass   foo
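
The literal example can be tried outside OpenOffice.org as well. The following is a minimal sketch in Python, whose re module is a different regexp dialect from the one in OpenOffice.org; the word list and the variable names are only for illustration.

```python
import re

# "bar" is a plain literal: it matches wherever those three letters appear in order.
words = ["bar", "barb", "bars", "crowbar", "embarrass", "foo"]
literal_hits = [w for w in words if re.search("bar", w)]

print(literal_hits)
```

Only "foo" is left out, because it does not contain the letters b, a, r in sequence.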

Escaping special characters

Certain characters have special meanings (covered later in this article), so they must be escaped to be used as literals. These special characters are the following:
. ^ $ * + ? \ | [ ( ) {
To escape one of these, precede it with a backslash (\). Here are some examples:
Description               Expression   Matches                  Does not match
Not escaped period        .            . foo bar                End of line characters
Escaped period            \.           .                        foo bar End of line characters
Not escaped dollar sign   $            End of line characters   The price is $30
Escaped dollar sign       \$           The price is $30         End of line characters
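
To see the effect of escaping, here is a small sketch in Python. Its re module is not the OpenOffice.org dialect, but the period and the dollar sign are escaped the same way; the sample text and variable names are assumptions made for this example.

```python
import re

text = "The price is $30."

# Unescaped, "." matches any character, so every character in the text is a hit.
unescaped_period = re.findall(r".", text)

# Escaped, "\." matches only a literal period and "\$" only a literal dollar sign.
escaped_period = re.findall(r"\.", text)
escaped_dollar = re.findall(r"\$", text)

# re.escape() adds the backslashes when the search text comes from somewhere else.
escaped_literal = re.escape("$30.")

print(len(unescaped_period), escaped_period, escaped_dollar)
```

The re.escape() helper is specific to Python; in the Find & Replace dialog you type the backslashes yourself.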

Period

The period (.) matches any one character.
Expression   Matches          Does not match
b.r          bar beryl burn   bear

Microsoft Office uses a question mark (?) where OpenOffice.org uses the period (.).
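
A quick way to check the "exactly one character" rule is the sketch below, written in Python (a different dialect, though the period behaves the same for this purpose). The word list is made up for illustration.

```python
import re

words = ["bar", "beryl", "burn", "br", "boar"]

# b.r requires exactly one character between "b" and "r".
period_hits = [w for w in words if re.search(r"b.r", w)]

print(period_hits)
```

"br" fails because nothing sits between the two letters, and "boar" fails because two characters do.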

Sets and classes

A set is a collection of characters enclosed in square brackets, and the whole set matches exactly one character. A set can contain multiple individual characters, a range of characters, or any combination of the two. A range is designated with a hyphen (-). Here are some examples:

Expression   Matches           Does not match
[ab]         a b banana        c d kiwi
[0-3]        2001 2456         456 456.7
[a-z145]     banana 2001 984   236 :)
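
The set examples above can be reproduced with the following sketch. It uses Python's re module, which is a separate dialect but treats simple sets and ranges the same way; the variable names are illustrative.

```python
import re

# [0-3] matches a single character in the range 0 through 3.
range_samples = ["2001", "2456", "456", "456.7"]
range_hits = [s for s in range_samples if re.search(r"[0-3]", s)]

# A set can mix a range with individual characters:
# any lowercase letter, or one of the digits 1, 4, or 5.
mixed_samples = ["banana", "2001", "984", "236", ":)"]
mixed_hits = [s for s in mixed_samples if re.search(r"[a-z145]", s)]

print(range_hits)
print(mixed_hits)
```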
You can exclude characters from matching by using the negation operator, the caret (^), as the first character inside the square brackets. If the caret is not first in the brackets, it is treated as a literal caret instead of negation. Here are some examples:

Expression   Matches        Does not match
[^ab]        c d banana ^   a b
[a^b]        a b banana ^   c d
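
The position of the caret is easy to get wrong, so here is a short sketch in Python (not the OpenOffice.org dialect, but the caret rule inside brackets is the same). A space is added to the negated set only so that the spaces in the sample text are not reported as matches.

```python
import re

text = "a b c ^"

# Caret first: match any single character that is NOT "a", "b", or a space.
negated = re.findall(r"[^ab ]", text)

# Caret elsewhere: it is an ordinary member of the set.
literal_caret = re.findall(r"[a^b]", text)

print(negated)        # ['c', '^']
print(literal_caret)  # ['a', 'b', '^']
```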
There are a few predefined classes (also called POSIX bracket expressions) for your convenience. They are shorthand. For example, [:digit:] is basically the same as [0-9] or [0123456789].

Name        Description
[:alpha:]   Alphabetic characters (depending on the document language)
[:digit:]   Decimal digits
[:alnum:]   Alphanumeric characters
[:space:]   Just a space, and not other whitespace characters (see issue 41706)
[:print:]   Most printable characters (see issue 83290)
[:cntrl:]   Non-printable characters
[:lower:]   Lowercase characters
[:upper:]   Uppercase characters
The use of these predefined classes may be confusing (see issue 64368). You can use predefined classes like the following examples:

Expression                  Matches                 Does not match
[:digit:]                   Nothing                 Everything
[:digit:]+ or ([:digit:])   A single digit: 1 123   Non-digits: abc
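
Python's re module does not accept the [:digit:] notation at all, which is a good reminder that dialects differ. As a rough equivalent, and only as an assumption about how you might translate the pattern, the sketch below uses the range [0-9] in its place.

```python
import re

text = "abc 1 123"

# [0-9]+ plays the role of [:digit:]+ here: one or more consecutive digits.
digit_runs = re.findall(r"[0-9]+", text)

# Without the plus sign, each digit is a separate match.
single_digits = re.findall(r"[0-9]", text)

print(digit_runs)     # ['1', '123']
print(single_digits)  # ['1', '1', '2', '3']
```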

Alternation

The pipe operator (|) matches either the element on its left or the element on its right. For example:

Expression                       Matches                                        Does not match
foo|bar                          foo bar                                        monkey
(mail|sand)box                   mailbox sandbox                                breadbox
([0-1][0-9]|2[0-3]):[0-5][0-9]   Time of day in 24-hour notation: 09:41 23:59   09:60 24:00
See issue 84828 for more information about using alternation with the caret (^).
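
The 24-hour time pattern is a good one to test before relying on it. The sketch below does so in Python, whose alternation works the same way for this pattern even though the dialect is not identical; fullmatch() is used so that the whole candidate must be a valid time.

```python
import re

# Hours 00-19 or 20-23, a colon, then minutes 00-59.
time_of_day = re.compile(r"([0-1][0-9]|2[0-3]):[0-5][0-9]")

candidates = ["09:41", "23:59", "09:60", "24:00"]
valid_times = [c for c in candidates if time_of_day.fullmatch(c)]

# Alternation inside a group limits the choice to part of the pattern.
boxes = ["mailbox", "sandbox", "breadbox"]
box_hits = [w for w in boxes if re.search(r"(mail|sand)box", w)]

print(valid_times)
print(box_hits)
```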

Groups

Parentheses form groups. They are typically combined with quantifiers, with backreferences, or in replacements. Groups differ from sets because a set matches one character while a group matches all characters in the group.
Expression   Matches                             Does not match
(foo)        foo foofo foofoo                    f fo of
f(o)+        foo                                 bar
(.).(\1)     Three-letter palindromes: bob mom   foo
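
The difference between a group and a set shows up as soon as a quantifier is attached. Here is a minimal sketch in Python (a different dialect, but groups and sets behave alike in this respect); the test strings are invented for the example.

```python
import re

# The quantifier applies to the whole group: one or more repetitions of "foo".
group_match = re.fullmatch(r"(foo)+", "foofoo")

# Square brackets form a set instead: one or more characters,
# each of which is "f" or "o", in any order.
set_match = re.fullmatch(r"[fo]+", "off")

# The group insists on the exact sequence, so "off" is rejected.
group_on_off = re.fullmatch(r"(foo)+", "off")

print(bool(group_match), bool(set_match), bool(group_on_off))
```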

Location

There are several ways to match by location:
Symbol   Matches
^        Beginning of a paragraph
$        End of a paragraph
\n       Manual line break
\t       Tab
\<       Beginning of a word
\>       End of a word
Here are some examples:
Expression   Matches                              Does not match
^Monkey      Monkey Island is up ahead!           Welcome to Monkey Island!
Island!$     Welcome to Monkey Island!            Monkey Island is up ahead!
and\>        island island! hand                  handy
\<key        key keys keyboard                    monkey
\<.{4}\>     Words with exactly 4 letters: Code   I am
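
Location matching is one of the areas where dialects differ most. Python's re module has no \< or \> symbols; the sketch below assumes \b (a word boundary on either side) as a stand-in, and its ^ and $ anchor to the string rather than to a paragraph. It also uses [A-Za-z] rather than the period so that only letters are counted.

```python
import re

# ^ and $ anchor to the start and end of the string being searched.
starts_with = bool(re.search(r"^Monkey", "Monkey Island is up ahead!"))
not_at_start = bool(re.search(r"^Monkey", "Welcome to Monkey Island!"))
ends_with = bool(re.search(r"Island!$", "Welcome to Monkey Island!"))

# \b stands in for \> here: "and" must be followed by the end of a word.
end_of_word = [w for w in ["island", "hand", "handy"] if re.search(r"and\b", w)]

# \b on both sides: whole words of exactly four letters.
four_letter_words = re.findall(r"\b[A-Za-z]{4}\b", "I am writing some Code here")

print(starts_with, not_at_start, ends_with)
print(end_of_word)
print(four_letter_words)
```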

Quantification

Use asterisk (*) to match an element zero or more times. Here are some examples:
Expression   Matches                                                                    Does not match
cd*          c followed by zero or more d's: c cd cdd dcdc dc cc                        dd
[cd]*        either c or d zero or more times in any order: cd cdd dcd                  abefeghi
Use the plus sign (+) to match an element one or more times.
Expression   Matches                                  Does not match
cd+          c followed by one or more d's: cd cdd    c dc
Use the question mark (?) to match an element zero or one time.

Expression   Matches                                     Does not match
cd?          c followed by zero or one d's: c cd cdd     a d
fees?        fee fees feed                               free
(wo)?man     man woman                                   men women
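
The three quantifiers are easiest to compare side by side. The sketch below does that in Python, where *, +, and ? have the same meaning as described here. Note one assumption: it uses fullmatch(), so the entire sample must fit the pattern, whereas Find & Replace also reports partial matches such as the "cd" inside "cdd".

```python
import re

samples = ["c", "cd", "cdd", "dd"]

# * : zero or more d's after the c
star_hits = [s for s in samples if re.fullmatch(r"cd*", s)]

# + : one or more d's after the c
plus_hits = [s for s in samples if re.fullmatch(r"cd+", s)]

# ? : zero or one d after the c
optional_hits = [s for s in samples if re.fullmatch(r"cd?", s)]

# A quantifier after a group applies to the whole group.
people = ["man", "woman", "men", "women"]
man_hits = [p for p in people if re.fullmatch(r"(wo)?man", p)]

print(star_hits, plus_hits, optional_hits, man_hits)
```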
Use the curly brackets, or braces, to specify any number of repetitions. You may use the curly brackets in the forms
  • {x}
  • {y,z}
  • {y,}
where x is an exact number of repetitions, y is a minimum, and z is the maximum. Here are some examples:
Expression   Matches                                    Does not match
[aeiou]{3}   exactly three consecutive vowels: adieu    foo
[0-9]{2,3}   two to three digits: 23                    2 3
[a-z]{7,}    seven or more letters: expiration          expire
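
Brace quantifiers can be checked the same way. This is a sketch in Python, which shares the {x}, {y,z}, and {y,} forms with the dialect described here; the sample strings are made up.

```python
import re

# {3}: exactly three vowels in a row ("ieu" inside "adieu").
three_vowels = re.findall(r"[aeiou]{3}", "adieu foo")

# {2,3}: two to three digits; single digits are skipped, and a
# four-digit run is consumed three digits at a time.
two_to_three = re.findall(r"[0-9]{2,3}", "23 2 3 2008")

# {7,}: seven or more lowercase letters.
long_words = re.findall(r"[a-z]{7,}", "expire expiration")

print(three_vowels)  # ['ieu']
print(two_to_three)  # ['23', '200']
print(long_words)    # ['expiration']
```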

Backreferences

Use a backreference to match the same text that a previous group matched. Form the backreference using a backslash followed by a number that corresponds to the group. Use \1 to match the first group, \2 for the second group, \3 for the third, and so on.

Expression                  Matches                                            Does not match
\<(.).\1\>                  Three-letter palindromes, including gig nun        gigabyte
\<(.)*\>[:space:]\<(\1)\>   Duplicate words, such as: I see the the monkey.    I see the monkey.
Read more about backreferences in replacements, a new feature in OpenOffice.org 2.4.
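
Backreferences are worth experimenting with, because small changes in where the parentheses sit change what \1 refers to. The sketch below uses Python, so it assumes \b in place of \< and \>, a plain space in place of [:space:], and \w+ for "a word"; it is an illustration and not the exact OpenOffice.org expression.

```python
import re

# \1 must repeat whatever group 1 captured: a three-letter word
# whose first and last characters are the same.
palindrome = re.compile(r"\b(.).\1\b")
palindromes = [w for w in ["gig", "nun", "gigabyte"] if palindrome.search(w)]

# A duplicated word: capture a whole word, then require it again after a space.
duplicate = re.compile(r"\b(\w+) \1\b")
found = duplicate.search("I see the the monkey.")
not_found = duplicate.search("I see the monkey.")

# With the quantifier OUTSIDE the group, the group keeps only its last
# repetition, so a later \1 would refer to a single character.
last_repetition = re.fullmatch(r"(.)*", "the").group(1)

print(palindromes)
print(found.group(1), not_found)
print(last_repetition)
```

In Python, the last line shows why the quantifier belongs inside the parentheses when you want \1 to stand for the whole word.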

How to search with regular expressions

Here is the basic way to search with regexps:
  1. Open a Writer, Calc, or Impress document.
  2. Choose Edit > Find & Replace from the menu.
  3. Type your own regexp in the Search for field.
  4. Click the More Options button.
  5. Check the box Regular expressions.
  6. Click the Find button or the Find All button.
Then, the matched text is selected, so you can perform all the normal operations including delete and format.

Other ways to use regular expressions

You can also use regexps in these places:
  • Base: Find Record
  • Calc: Standard Filter
  • Calc: functions including COUNTIF, DCOUNT, DCOUNTA, DGET, DMAX, DMIN, DAVERAGE, DPRODUCT, DSTDEV, DSTDEVP, DSUM, DVAR, DVARP, HLOOKUP, LOOKUP, MATCH, SEARCH, SUMIF, and VLOOKUP
  • Macros
  • Writer: filter comments in Accept/Reject changes

Common examples

See this long list of example regular expressions.

Microsoft Office 2003

Microsoft Office 2003 has a similar regexp feature called wildcards. A few regexps may transfer between Office and OpenOffice.org without modification, but many will require changes.

13 comments:

Anonymous said...

Small typo (C&P error) - in the cd+ example you say "Matches: c followed by zero or more d's" but that should be "Matches: c followed by one or more d's".

Interesting article, thanks.

Andrew Z said...

Huw, nice catch! Thanks! I fixed it now. I learned it is quite the task to proofread an article such as this.

Anonymous said...

Developing nicely - a good resource.

Here are a few things I noticed:

Sets and Classes
[^ab] matches C and D too

Groups
three letter palindromes (palindromes not italicised)

Location
Words with exactly 4 (not 3) letters

Quantification
where x (not z) is an exact number of repetitions

Common Examples
Link still points to blogspot, not new domain.

Anonymous said...

Oh. The backreferences link also still points to blogspot.

Andrew Z said...

Huw,

You're sharp. I fixed those, and then I noticed a few where it was written incorrectly that the white space was matched. Thanks! :)

Andrew

Unknown said...

I could not get the duplicate word check to work. I set up several examples in writer 2.3.1.2 on openSuse 10.2. Is there a typo, or has something changed in my version? Thanks for the help.

Unknown said...

I must update my post. My dictionary is set to Spanish, and not only does Spanish not seem to spellcheck, it also renders the regexp useless. Sorry for the post. Yes, myspell-spanish is installed. :) I'm sure it's a configuration thing on my end.

Chao,

Anonymous said...

Thanks very much for the detailed list.

Now I'm one step further into solving my problem with cleaning up text.
You all maybe know the problem yourself: Some people prefer fewer or more new lines, no new line between paragraphs or even a double one, and others may want to add a new paragraph or at least a new line before every "direct speech". Now I can find the problematic sections, but I still need to find a way to replace them with other options, like a second new line. \p replace \p\n won't do that, at least not for me.

Anonymous said...

I'm looking for a way to select a sentence, and assign this to a macro. Triple-clicking doesn't record as a macro, but if there's a way to put it into a macro that would solve it.

To this end I've been trying to reliably find the start of sentences.

Searching backwards for "." or even "\.\>" and then moving forward x spaces doesn't work perfectly, e.g. if there's a heading, or if there's full stops within a sentence to signify abbreviation "approx."

The OOo inbuilt functions of Go To Previous Sentence and Go To Next Sentence don't work if the sentence is the first, or last, in a paragraph, depending on how I'm trying to select the sentence from that point.

Any help would be much appreciated.

Charlie

Anonymous said...

Never mind, found this:
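Sub SelectSentence2
    ' Create a text cursor at the position of the visible (view) cursor
    oViewCursor = ThisComponent.CurrentController.getViewCursor()
    oTextCursor = ThisComponent.Text.createTextCursorByRange(oViewCursor)
    ' Move to the sentence start without selecting, then extend to the sentence end
    oTextCursor.gotoStartOfSentence(False)
    oTextCursor.gotoEndOfSentence(True)
    ' Apply the resulting range to the view cursor so the sentence is selected
    oViewCursor.gotoRange(oTextCursor, False)
End Sub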

Charlie

Claudio Romeo said...

With regard to the Backreferences section, I think there is a little mistake. The correct form should be
\<(.*)\>[:space:]\<(\1)\>
with the star INSIDE the round brackets, shouldn't it? Otherwise, it finds every long sentence ending with a one-letter word if it is identical to the last letter of the previous word.
Thanks for your work.

nemelk said...

I cannot make something like:
eval\\((((?!eval).)+)\\)
work in OpenOffice regexp Find.

Does it support: ?!

Capt_Safety said...

Hi. I need to add file name prefixes and suffixes to a couple thousand file names in cells. The file names are currently "BBD304 24x26" I want to change them to "public:\\BBD304 24x26.jpg" The numerals are different in each file name, but the pattern is all the same.

Thanks,
Bill