Powerful text matching with regular expressions

Regular expressions (also called regex or regexp) are a sophisticated system for precisely matching text patterns. Regexps give you a high level of control and flexibility that is not possible with a regular find or simple wildcards. When writing a regexp, you can specify what to match, where to match it, and how many times to match it (quantification). We'll cover those, give examples, and walk through using regexps in OpenOffice.org. Once you find patterns with regexps, you can perform any of the normal operations on the selected text, including replacing, formatting, and deleting. Regexps come in many dialects and are used in various systems including Perl, PHP, C#, and Microsoft Office. The dialect of regexps in Writer, Calc, Impress, and Base resembles (but does not exactly mimic) POSIX regexps, which are based on traditional Unix regexps. Do not expect any two dialects to be exactly the same.

Important notice

In the example matches, only the underlined portions are matched. The unmatched parts are included to provide better context for the examples.

Literals

Most non-regexp search patterns are literal strings, or simply literals. When you search for "bar", then "bar" is a plain literal.

Expression	Matches	Does not match
bar	bar barb bars crowbar embarrass	foo

Escaping special characters

Certain characters have special meanings (covered later in this article), so they must be escaped to be used as literals. These special characters are the following:

. ^ $ * + ? \ | [ ( ) {

To escape one of these, precede it with a backslash (\). Here are some examples:

Expression	Matches	Does not match
Not escaped period .	. foo bar	End of line characters
Escaped period \.	.	foo bar End of line characters
Not escaped dollar sign $	End of line characters	The price is $30
Escaped dollar sign \$	The price is $30	End of line characters
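
The effect of escaping can be tried out in a short script. This is a minimal sketch using Python's re module, whose dialect is not identical to the one in OpenOffice.org; the sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen to contain one "$" and one ".".
text = "The price is $30. foo bar"

# An unescaped period matches any single character (except a newline in Python).
unescaped_period = re.findall(r".", text)

# An escaped period matches only a literal period.
escaped_period = re.findall(r"\.", text)

# An escaped dollar sign matches a literal "$" instead of an end position.
escaped_dollar = re.findall(r"\$", text)

# re.escape adds the backslashes for you when a literal contains special characters.
literal_pattern = re.escape("$30.")

print(len(unescaped_period), "characters matched by the unescaped period")
print(escaped_period)
print(escaped_dollar)
print(literal_pattern)
```

Python's re.escape is a convenience of that language; in the Find & Replace dialog you type the backslashes yourself.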

Period

The period (.) matches any one character.

Expression	Matches	Does not match
b.r	bar beryl burn	bear

Microsoft Office uses a question mark ? where OpenOffice.org uses the period (.).
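
As a quick check of the period, here is a small sketch in Python's re module (an assumption for illustration; the period behaves the same way in most dialects). The word list is invented for this example.

```python
import re

# "bear" has two characters between b and r, so b.r cannot match inside it.
words = "bar beryl burn bear"

# findall returns every non-overlapping match of b, any one character, then r.
matches = re.findall(r"b.r", words)

print(matches)
```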

Sets and classes

A set is a collection of characters enclosed in square brackets, and the set matches one character. A set can contain multiple individual characters, a range of characters, or any combination. A range is designated with a hyphen (-). Here are some examples:

Expression	Matches	Does not match
[ab]	a b banana	c d kiwi
[0-3]	2001 2456	456 456.7
[a-z145]	banana 2001 984	236 :)

You can exclude characters from matching by using the negation operator, the caret (^), as the first character inside the square brackets. If the caret is not first in the brackets, it matches a literal caret instead of negating the set. Here are some examples:

Expression	Matches	Does not match
[^ab]	c d banana ^	a b
[a^b]	a b banana ^	c d
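
Sets, ranges, and negation can be sketched the same way. The following uses Python's re module as a stand-in for the OpenOffice.org dialect (bracketed sets behave alike in both, as far as these simple cases go); the sample strings are made up.

```python
import re

# A set matches exactly one character out of the listed ones.
either_a_or_b = re.findall(r"[ab]", "banana kiwi")

# A range: any one digit from 0 through 3.
digit_range = re.findall(r"[0-3]", "2456 456")

# Ranges and individual characters can be combined in one set.
mixed = re.findall(r"[a-z145]", "984 236 :)")

# A leading caret negates the set: any one character except a or b.
negated = re.findall(r"[^ab]", "banana^")

# A caret that is not first is just a literal caret inside the set.
literal_caret = re.findall(r"[a^b]", "a^c")

for result in (either_a_or_b, digit_range, mixed, negated, literal_caret):
    print(result)
```

Note that the negated set also matches the caret character itself, because a caret is neither a nor b.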

There are a few predefined classes (also called POSIX bracket expressions) for your convenience. They are shorthand. For example, [:digit:] is basically the same as [0-9] or [0123456789].

Name	Description
[:alpha:]	Alphabetic characters (depending on the document language)
[:digit:]	Decimal digits
[:alnum:]	Alphanumeric characters
[:space:]	Only the space character, not other whitespace characters (see issue 41706)
[:print:]	Most printable characters (see issue 83290)
[:cntrl:]	Non-printable characters
[:lower:]	Lowercase characters
[:upper:]	Uppercase characters

The use of these predefined classes may be confusing (see issue 64368). You can use predefined classes like the following examples:

Expression	Matches	Does not match
[:digit:]	Nothing	Everything
[:digit:]+ or ([:digit:])	A single digit 1 123	Non-digits abc

Alternation

The pipe operator (|) matches elements on either side of the operator. For example:

Expression	Matches	Does not match
foo|bar	foo bar	monkey
(mail|sand)box	mailbox sandbox	breadbox
([0-1][0-9]|2[0-3]):[0-5][0-9]	Time of day in 24-hour notation 09:41 23:59	09:60 24:00

See issue 84828 for more information about using alternation with the caret (^).
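
The time-of-day pattern is a good candidate for a quick test. This is a sketch in Python's re module, not OpenOffice.org itself; the pipe is written unescaped, and the helper name is_24h_time is hypothetical.

```python
import re

# Hours 00-19 or 20-23, a colon, then minutes 00-59.
TIME_24H = re.compile(r"([0-1][0-9]|2[0-3]):[0-5][0-9]")


def is_24h_time(value: str) -> bool:
    """Return True if the whole string is a valid 24-hour time such as 09:41."""
    return TIME_24H.fullmatch(value) is not None


# Alternation inside a group: either "mail" or "sand", followed by "box".
box_words = [m.group(0) for m in re.finditer(r"(mail|sand)box", "mailbox sandbox breadbox")]

for candidate in ("09:41", "23:59", "09:60", "24:00"):
    print(candidate, is_24h_time(candidate))
print(box_words)
```

The sketch uses fullmatch so that the entire string must be a time; a search in a document instead finds the pattern anywhere in the text.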

Groups

Parentheses form groups. They are typically combined with quantifiers, with backreferences, or in replacements. Groups differ from sets because a set matches one character while a group matches all of its characters in sequence.

Expression	Matches	Does not match
(foo)	foo foofo foofoo	f fo of
f(o)+	foo	bar
(.).(\1)	three letter palindromes bob mom	foo
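
To make the difference between a set and a group concrete, here is a small sketch in Python's re module (an assumption for illustration; the sample strings are invented).

```python
import re

# A set matches one character at a time, in any order.
set_matches = re.findall(r"[fo]", "of")

# A group matches its characters in sequence, so "fo" is not found in "of".
group_match = re.search(r"(fo)", "of")

# A quantifier after a group repeats the whole group.
repeated = re.search(r"f(o)+", "foo")

# (foo) is found once in "foo", once in "foofo", and twice in "foofoo".
foo_count = len(re.findall(r"(foo)", "foo foofo foofoo"))

print(set_matches)
print(group_match)
print(repeated.group(0))
print(foo_count)
```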

Location

There are several ways to match by location:

Symbol	Matches
^	Beginning of a paragraph
$	End of a paragraph
\n	Manual line break
\t	Tab
\<	Beginning of a word
\>	End of a word

Here are some examples:

Expression	Matches	Does not match
^Monkey	Monkey Island is up ahead!	Welcome to Monkey Island!
Island!$	Welcome to Monkey Island!	Monkey Island is up ahead!
and\>	island island! hand	handy
\<key	key keys keyboard	monkey
\<.{4}\>	Words with exactly 4 letters Code	I am
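
The location symbols can be approximated in a script. This sketch uses Python's re module, which differs from OpenOffice.org in two ways that matter here: Python has no \< or \>, so \b (a boundary at either end of a word) is swapped in, and lines of a string stand in for paragraphs via the MULTILINE flag. The sample text is made up.

```python
import re

# Two lines standing in for two paragraphs.
text = "Monkey Island is up ahead!\nWelcome to Monkey Island!"

# ^ anchors at the beginning of each line when MULTILINE is set.
starts = re.findall(r"^Monkey", text, flags=re.MULTILINE)

# $ anchors at the end of each line when MULTILINE is set.
ends = re.findall(r"Island!$", text, flags=re.MULTILINE)

# \b before "key" plays the role of \< (beginning of a word).
word_starts = re.findall(r"\bkey\w*", "key keys keyboard monkey")

# \b after "and" plays the role of \> (end of a word).
word_ends = re.findall(r"\w*and\b", "island hand handy")

# Words of exactly four word characters; \w is used so spaces cannot be counted.
four_letter_words = re.findall(r"\b\w{4}\b", "I am Code")

for result in (starts, ends, word_starts, word_ends, four_letter_words):
    print(result)
```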

Quantification

Use asterisk (*) to match an element zero or more times. Here are some examples:

Expression	Matches	Does not match
cd*	c followed by zero or more d's c cd cdd dcd	dd
[cd]*	either c or d zero or more times in any order cd cdd dcd	abefeghi

Use the plus sign (+) to match an element one or more times.

Expression	Matches	Does not match
cd+	c followed by one or more d's cd cdd	c dc

Use the question mark (?) to find an element zero or one times.

Expression	Matches	Does not match
cd?	c followed by zero or one d's c cd cdd	a d
fees?	fee fees feed	free
(wo)?man	man woman	men women

Use the curly brackets, or braces, to specify any number of repetitions. You may use the curly brackets in the forms

{x}
{y,z}
{y,}

where x is an exact number of repetitions, y is a minimum, and z is the maximum. Here are some examples:

Expression	Matches	Does not match
[aeiou]{3}	exactly three consecutive vowels adieu	foo
[0-9]{2,3}	two to three digits 23	2 3
[a-z]{7,}	seven or more letters expiration	expire
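
The quantifiers can be compared side by side in a short sketch. It uses Python's re module as a stand-in (the quantifier syntax shown here is shared by most dialects); the sample strings are invented for the example.

```python
import re

# * : zero or more d's after c, so a lone c also matches.
star = re.findall(r"cd*", "c cd cdd dcd")

# + : at least one d is required after c.
plus = re.findall(r"cd+", "c cd cdd dc")

# ? : the s is optional.
optional = re.findall(r"fees?", "fee fees feed free")

# {3} : exactly three consecutive vowels.
exactly_three = re.findall(r"[aeiou]{3}", "adieu foo")

# {2,3} : two to three digits.
two_to_three = re.findall(r"[0-9]{2,3}", "23 2 3")

# {7,} : seven or more lowercase letters.
seven_or_more = re.findall(r"[a-z]{7,}", "expiration expire")

for result in (star, plus, optional, exactly_three, two_to_three, seven_or_more):
    print(result)
```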

Backreferences

Use a backreference to match a previously matched group. Form the backreference using a backslash followed by a number that corresponds to the group. Use \1 to match the first group, \2 for the second group, \3 for the third, and so on.

Expression	Matches	Does not match
\<(.).\1\>	Three letter palindromes including gig nun	gigabyte
\<(.*)\>[:space:]\<(\1)\>	Duplicate words such as I see the the monkey.	I see the monkey.

Read more about backreferences in replacements, a new feature in OpenOffice.org 2.4.
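
Backreferences are easy to experiment with outside the office suite. The sketch below uses Python's re module, so it is an approximation and not the OpenOffice.org dialect: \b replaces \< and \>, \w replaces the period so that only word characters are captured, and \s replaces [:space:]. The sample sentences come from the table above.

```python
import re

# A word, whitespace, then the same word again (\1 repeats group 1).
DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b")

# One character, any word character, then the first character again.
PALINDROME_3 = re.compile(r"\b(\w)\w\1\b")

duplicates = DUPLICATE_WORD.findall("I see the the monkey.")
no_duplicates = DUPLICATE_WORD.findall("I see the monkey.")

# finditer with group(0) returns the whole match, not only the captured group.
palindromes = [m.group(0) for m in PALINDROME_3.finditer("gig nun gigabyte")]

print(duplicates)
print(no_duplicates)
print(palindromes)
```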

How to search with regular expressions

Here is the basic way to search with regexps:

Open a Writer, Calc, or Impress document.
Choose Edit > Find & Replace from the menu.
Type your own regexp in the Search for field.
Click the More Options button.
Check the box Regular expressions.
Click the Find button or the Find All button.

Then, the matched text is selected, so you can perform all the normal operations, including deleting and formatting.

Other ways to use regular expressions

You can also use regexps in these places:

Base: Find Record
Calc: Standard Filter
Calc: functions including COUNTIF, DCOUNT, DCOUNTA, DGET, DMAX, DMIN, DAVERAGE, DPRODUCT, DSTDEV, DSTDEVP, DSUM, DVAR, DVARP, HLOOKUP, LOOKUP, MATCH, SEARCH, SUMIF, and VLOOKUP
Macros
Writer: filter comments in Accept/Reject changes

Common examples

See this long list of example regular expressions.

Microsoft Office 2003

Microsoft Office 2003 has a similar regexp feature called wildcards. A few regexps may transfer between Office and OpenOffice.org without modification, but many will require changes.

13 comments:

Anonymous said...: Small typo (C&P error) - in the cd+ example you say "Matches: c followed by zero or more d's" but that should be "Matches: c followed by one or more d's".

Interesting article, thanks.; January 2, 2008 at 4:28 AM
Andrew Z said...: Huw, nice catch! Thanks! I fixed it now. I learned it is quite the task to proofread an article such as this.; January 2, 2008 at 7:24 AM
Anonymous said...: Developing nicely - a good resource.

Here are a few things I noticed:

Sets and Classes
[^ab] matches C and D too

Groups
three letter palindromes (palindromes not italicised)

Location
Words with exactly 4 (not 3) letters

Quantification
where x (not z) is an exact number of repetitions

Common Examples
Link still points to blogspot, not new domain.; February 4, 2008 at 2:22 AM
Anonymous said...: Oh. The backreferences link also still points to blogspot.; February 4, 2008 at 2:25 AM
Andrew Z said...: Huw,

You're sharp. I fixed those, and then I noticed a few where it was written incorrectly that the white space was matched. Thanks! :)

Andrew; February 4, 2008 at 7:47 AM
Unknown said...: I could not get the duplicate word check to work. I set up several examples in writer 2.3.1.2 on openSuse 10.2. Is there a typo, or has something changed in my version? Thanks for the help.; March 28, 2008 at 8:39 PM
Unknown said...: I must update my post. My dictionary is set to Spanish, and not only does spanish not seem to spellcheck, it also renders the regexp useless. Sorry for the post. Yes myspell-spanish is installed. :) I'm sure it's a configuration thing on my end.

Chao,; March 28, 2008 at 9:17 PM
Anonymous said...: Thanks very much for the detailed list.

Now I'm one step further into solving my problem with cleaning up text.
You all maybe know the problem yourself: Some people prefer fewer or more new lines, no new line between paragraphs or even a double one, others may want to add a new paragraph or at least a new line before every "direct speech". Now I can find the problematic sections, but I still need to find a way to replace them with other options, like a second new line. \p replace \p\n won't do that, at least not for me.; August 18, 2009 at 5:22 AM
Anonymous said...: I'm looking for a way to select a sentence, and assign this to a macro. Triple-clicking doesn't record as a macro, but if there's a way to put it into a macro that would solve it.

To this end I've been trying to reliably find the start of sentences.

Searching backwards for "." or even "\.\>" and then moving forward x spaces doesn't work perfectly, e.g. if there's a heading, or if there's full stops within a sentence to signify abbreviation "approx."

The OOo inbuilt functions of Go To Previous Sentence and Go To Next Sentence don't work if the sentence is the first, or last, in a paragraph, depending on how I'm trying to select the sentence from that point.

Any help would be much appreciated.

Charlie; September 1, 2009 at 4:41 AM
Anonymous said...: Never mind, found this:
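sub SelectSentence2
    ' Create a text cursor at the position of the visible (view) cursor.
    oViewCursor = ThisComponent.CurrentController.getViewCursor
    oTextCursor = ThisComponent.Text.createTextCursorByRange(oViewCursor)
    ' Jump to the start of the sentence, then extend the selection to its end.
    oTextCursor.gotoStartOfSentence(false)
    oTextCursor.gotoEndOfSentence(true)
    ' Apply the selected range to the view cursor.
    oViewCursor.gotoRange(oTextCursor,false)
end sub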

Charlie; September 1, 2009 at 5:04 AM
Claudio Romeo said...: With regard to the Backreferences section, I think there is a little mistake. The correct form should be
\<(.*)\>[:space:]\<(\1)\>
with the star INSIDE the round brackets, shouldn't it? Otherwise, it finds every long sentence ending with a one-letter word if it is identical to the last letter of the previous word.
Thanks for you work.; November 10, 2009 at 10:52 AM
nemelk said...: I cannot make something like:
eval\$(((?!eval).)+)\$
work in OpenOffice regexp Find.

Does it support: ?!; September 30, 2010 at 4:10 AM
Capt_Safety said...: Hi. I need to add file name prefixes and suffixes to a couple thousand file names in cells. The file names are currently "BBD304 24x26" I want to change them to "public:\\BBD304 24x26.jpg" The numerals are different in each file name, but the pattern is all the same.

Thanks,
Bill; June 13, 2014 at 12:46 PM
