
Regular expressions (also called regex or regexp) are a sophisticated system for precisely matching text patterns. Regexps give you a high level of control and flexibility for matching text that is not possible with a regular find or simple wildcards. When writing a regexp, you can specify what to match, where to match it, and how many times to match it (quantification). We'll cover each of those, give examples, and walk through using regexps in OpenOffice.org.
Once you find patterns with regexps, you can globally perform any of the normal operations on the selected text, including replacing, formatting, and deleting.
Regexps come in many dialects and are used in various systems including Perl, PHP, C#, and Microsoft Office. The dialect of regexps in Writer, Calc, Impress, and Base resembles (but does not exactly mimic) POSIX regexps, which are based on traditional Unix regexps. Do not expect any two dialects to be exactly the same.
Important notice
In the example matches, only the underlined portions are matched. The unmatched parts are included to give the examples better context.
Literals
Most non-regexp search patterns are literal strings, or simply literals. When you search for "bar", then "bar" is a plain literal.
Expression | Matches | Does not match |
bar | bar, barb, bars, crowbar, embarrass | foo |
Escaping special characters
Certain characters have special meanings (covered later in this article), so they must be escaped to be used as literals. These special characters are the following:
. ^ $ * + ? \ | [ ( ) {
To escape one of these, precede it with a backslash (\). Here are some examples:
Expression | Matches | Does not match |
. (period, not escaped) | ., foo, bar | End-of-line characters |
\. (escaped period) | . | foo, bar, end-of-line characters |
$ (dollar sign, not escaped) | End-of-line characters | The price is $30 |
\$ (escaped dollar sign) | The price is $30 | End-of-line characters |
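As a rough illustration of escaping, here is a minimal Python sketch. Python's re module is a different dialect from the one in OpenOffice.org, so treat this as a demonstration of the idea rather than of Writer's exact behavior. The sample sentence and the variable names are made up for the example, and re.escape is a Python helper, not an OpenOffice.org feature.

```python
import re

text = "The price is $30."

# An unescaped period matches any single character, so every character matches.
unescaped_period = re.findall(r".", text)

# An escaped period matches only a literal period.
escaped_period = re.findall(r"\.", text)

# An escaped dollar sign matches the literal "$" instead of an end position.
escaped_dollar = re.findall(r"\$", text)

# re.escape adds the backslashes for you (Python only, shown for convenience).
escaped_literal = re.escape("$30.")

print(len(unescaped_period), escaped_period, escaped_dollar, escaped_literal)
```

When in doubt about whether a character is special, escaping it with a backslash is the safer choice.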
Period
The period (.) matches any one character.
Expression | Matches | Does not match |
b.r | bar, beryl, burn | bear |
Microsoft Office uses a question mark ? where OpenOffice.org uses the period (.).
Sets and classes
A set is a collection of characters enclosed in square brackets, and the set matches exactly one character. A set can contain multiple individual characters, a range of characters, or any combination. A range is designated with a hyphen (-). Here are some examples:
Expression | Matches | Does not match |
[ab] | a, b, banana | c, d, kiwi |
[0-3] | 2001, 2456 | 456, 456.7 |
[a-z145] | banana, 2001, 984 | 236, :) |
You can exclude characters from matching by using the negation operator, the caret (^), as the first character inside the square brackets. If the caret is not first in the brackets, it is matched as an ordinary character (instead of negating the set). Here are some examples:
Expression | Matches | Does not match |
[^ab] | c, d, banana | a, b |
[a^b] | a, b, banana, ^ | c, d |
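The set examples above can be tried out with a short Python sketch. Python's re dialect is not identical to OpenOffice.org's, but plain sets, ranges, and negation behave the same way in both. The variable names are only illustrative.

```python
import re

# [ab] matches one character that is either "a" or "b".
either = re.findall(r"[ab]", "banana")

# [0-3] matches one digit from 0 through 3.
low_digits = re.findall(r"[0-3]", "2456")

# [^ab] matches one character that is neither "a" nor "b".
negated = re.findall(r"[^ab]", "banana")

# A caret that is not first in the brackets is an ordinary character.
literal_caret = re.findall(r"[a^b]", "a^c")

print(either, low_digits, negated, literal_caret)
```

Each list holds the individual characters that matched, which makes it easy to see that a set only ever consumes one character at a time.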
There are a few predefined classes (also called POSIX bracket expressions) for your convenience. They are shorthand: for example, [:digit:] is basically the same as [0-9] or [0123456789].
Name | Description |
[:alpha:] | Alphabetic characters (depending on the document language) |
[:digit:] | Decimal digits |
[:alnum:] | Alphanumeric characters |
[:space:] | Just a space, and not other whitespace characters (see issue 41706) |
[:print:] | Most printable characters (see issue 83290) |
[:cntrl:] | Non-printable characters |
[:lower:] | Lowercase characters |
[:upper:] | Uppercase characters |
The use of these predefined classes may be confusing (see issue 64368). You can use predefined classes like the following examples:
Expression | Matches | Does not match |
[:digit:] | Nothing | Everything |
[:digit:]+ or ([:digit:]) | A single digit, 1, 123 | Non-digits, abc |
Alternation
The pipe operator (|) matches elements on either side of the operator. For example:
Expression | Matches | Does not match |
foo|bar | foo, bar | monkey |
(mail|sand)box | mailbox, sandbox | breadbox |
([0-1][0-9]|2[0-3]):[0-5][0-9] | Time of day in 24-hour notation, such as 09:41 and 23:59 | 09:60, 24:00 |
See issue 84828 for more information about using alternation with the caret (^).
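To see the alternation examples in action, here is a small Python sketch that checks the time-of-day expression against the sample values from the table. It assumes Python's re module, and the helper name is_time_of_day is made up for this example. It uses fullmatch so the whole string must be a time, which is stricter than a search inside a document.

```python
import re

# Either 00-19 or 20-23 for the hour, then a colon, then 00-59 for the minute.
TIME_24H = re.compile(r"([0-1][0-9]|2[0-3]):[0-5][0-9]")
BOX = re.compile(r"(mail|sand)box")


def is_time_of_day(text):
    """Return True if the whole string is a 24-hour time such as 09:41."""
    return TIME_24H.fullmatch(text) is not None


for candidate in ["09:41", "23:59", "09:60", "24:00"]:
    print(candidate, is_time_of_day(candidate))

for word in ["mailbox", "sandbox", "breadbox"]:
    print(word, BOX.search(word) is not None)
```

Note how the parentheses limit the alternation: without them, the pipe would split the entire expression into two alternatives.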
Groups
Parentheses form groups. They are typically combined with quantifiers, with backreferences, or in replacements. Groups differ from sets because a set matches one character, while a group matches the whole sequence of characters inside it.
Expression | Matches | Does not match |
(foo) | foo, foofo, foofoo | f, fo, of |
f(o)+ | foo | bar |
(.).(\1) | Three-letter palindromes such as bob and mom | foo |
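The group examples can be sketched in Python as follows. This assumes Python's re module rather than the OpenOffice.org engine, and the sample strings and variable names are chosen only for illustration.

```python
import re

# A group matches its whole sequence: (foo) needs "f", "o", "o" in that order.
group_hits = re.findall(r"(foo)", "foofoo foofo fo")

# A quantifier after a group repeats the whole group.
repeated = re.search(r"f(o)+", "foo").group(0)

# \1 must repeat whatever the first group captured, so the first and
# third letters have to be identical.
words = ["bob", "mom", "foo"]
palindromes = [w for w in words if re.fullmatch(r"(.).(\1)", w)]

print(group_hits, repeated, palindromes)
```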
Location
There are several ways to match by location:
Symbol | Matches |
^ | Beginning of a paragraph |
$ | End of a paragraph |
\n | Manual line break |
\t | Tab |
\< | Beginning of a word |
\> | End of a word |
Here are some examples:
Expression | Matches | Does not match |
^Monkey | Monkey Island is up ahead! | Welcome to Monkey Island! |
Island!$ | Welcome to Monkey Island! | Monkey Island is up ahead! |
and\> | island, island!, hand | handy |
\<key | key, keys, keyboard | monkey |
\<.{4}\> | Words with exactly 4 letters, such as Code | I am |
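Here is an approximate Python version of the location examples. This is only a sketch under some assumptions: Python's re module has no \< or \> symbols, so the generic word boundary \b stands in for both, and \w{4} replaces .{4} so that spaces cannot be counted as letters. The anchors ^ and $ apply to the whole string here, where Writer applies them to a paragraph.

```python
import re

# ^ and $ anchor to the start and end of the string (a paragraph in Writer).
starts = bool(re.search(r"^Monkey", "Monkey Island is up ahead!"))
not_at_start = bool(re.search(r"^Monkey", "Welcome to Monkey Island!"))
ends = bool(re.search(r"Island!$", "Welcome to Monkey Island!"))

# \b marks a word boundary; it plays the role of both \< and \> here.
word_ends = re.findall(r"and\b", "island hand handy")
word_starts = re.findall(r"\bkey", "key keys keyboard monkey")

# Whole words of exactly four word characters.
four_letter_words = re.findall(r"\b\w{4}\b", "I am writing Code here")

print(starts, not_at_start, ends)
print(word_ends, word_starts, four_letter_words)
```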
Quantification
Use asterisk (*) to match an element zero or more times. Here are some examples:
Expression | Matches | Does not match |
cd* | c followed by zero or more d's: c, cd, cdd, dcd | dd |
[cd]* | Either c or d zero or more times in any order: cd, cdd, dcd | abefeghi |
Use the plus sign (+) to match an element one or more times.
Expression | Matches | Does not match |
cd+ | c followed by one or more d's: cd, cdd | c, dc |
Use the question mark (?) to match an element zero or one time.
Expression | Matches | Does not match |
cd? | c followed by zero or one d: c, cd, cdd | a, d |
fees? | fee, fees, feed | free |
(wo)?man | man, woman | men, women |
Use curly brackets, or braces, to specify a number of repetitions. You may use the curly brackets in the forms {x}, {y,}, and {y,z}, where x is an exact number of repetitions, y is a minimum, and z is a maximum. Here are some examples:
Expression | Matches | Does not match |
[aeiou]{3} | Exactly three consecutive vowels: adieu | foo |
[0-9]{2,3} | Two to three digits: 23 | 2, 3 |
[a-z]{7,} | Seven or more letters: expiration | expire |
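The quantifiers above behave the same way in most dialects, so a quick Python sketch can show exactly which text each one picks up. The sample strings and variable names are assumptions made for the example.

```python
import re

# * means zero or more, + means one or more, ? means zero or one.
star = re.findall(r"cd*", "c cd cdd dcd")
plus = re.findall(r"cd+", "c cd cdd dc")
optional = re.findall(r"cd?", "c cd cdd")

# Braces give explicit counts: {3} exact, {2,3} a range, {7,} a minimum.
three_vowels = re.findall(r"[aeiou]{3}", "adieu foo")
two_or_three_digits = re.findall(r"[0-9]{2,3}", "2 23 3")
seven_or_more = re.findall(r"[a-z]{7,}", "expiration expire")

print(star, plus, optional)
print(three_vowels, two_or_three_digits, seven_or_more)
```

Comparing the three lists on the first print line shows the difference clearly: only the asterisk and question mark accept a bare c, and the question mark never takes more than one d.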
Backreferences
Use a backreference to match a previously matched group. Form the backreference using a backslash followed by a number that corresponds to the group. Use \1 to match the first group, \2 for the second group, \3 for the third, and so on.
Expression | Matches | Does not match |
\<(.).\1\> | Three-letter palindromes, including gig and nun | gigabyte |
\<(.*)\>[:space:]\<(\1)\> | Duplicate words, such as: I see the the monkey. | I see the monkey. |
Read more about backreferences in replacements, a new feature in OpenOffice.org 2.4.
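As a hedged illustration of backreferences, the following Python sketch finds and removes duplicate words. It is not the OpenOffice.org expression: it assumes Python's re module, uses \b and \w+ in place of \<, \>, and [:space:], and the function names are invented for the example. Python also writes the replacement reference as \1, which may differ from the syntax used in the Replace with field.

```python
import re

# A word, some whitespace, then the same word again as a whole word.
DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_duplicates(text):
    """Return each doubled-word span found in the text."""
    return [m.group(0) for m in DUPLICATE_WORD.finditer(text)]


def remove_duplicates(text):
    """Replace each doubled word with a single copy of the word."""
    return DUPLICATE_WORD.sub(r"\1", text)


print(find_duplicates("I see the the monkey."))
print(find_duplicates("I see the monkey."))
print(remove_duplicates("I see the the monkey."))
```

The trailing \b matters: without it, a phrase such as "the theme" would be reported as a duplicate because "the" is followed by text that merely starts with "the".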
How to search with regular expressions
Here is the basic way to search with regexps:
- Open a Writer, Calc, or Impress document.
- Choose Edit > Find & Replace from the menu.
- Type your own regexp in the Search for field.
- Click the More Options button.
- Check the box Regular expressions.
- Click the Find button or the Find All button.
Then, the matched text is selected, so you can perform all the normal operations including delete and format.
Other ways to use regular expressions
You can also use regexps in these places:
- Base: Find Record
- Calc: Standard Filter
- Calc: functions including COUNTIF, DCOUNT, DCOUNTA, DGET, DMAX, DMIN, DAVERAGE, DPRODUCT, DSTDEV, DSTDEVP, DSUM, DVAR, DVARP, HLOOKUP, LOOKUP, MATCH, SEARCH, SUMIF, VLOOKUP
- Macros
- Writer: filter comments in Accept/Reject changes
Common examples
See this long list of example regular expressions.
Microsoft Office 2003
Microsoft Office 2003 has a similar regexp feature called wildcards. A few regexps may transfer between Office and OpenOffice.org without modification, but many will require changes.
13 comments:
Small typo (C&P error) - in the cd+ example you say "Matches: c followed by zero or more d's" but that should be "Matches: c followed by one or more d's".
Interesting article, thanks.
Huw, nice catch! Thanks! I fixed it now. I learned it is quite the task to proofread an article such as this.
Developing nicely - a good resource.
Here are a few things I noticed:
Sets and Classes
[^ab] matches C and D too
Groups
three letter palindromes (palindromes not italicised)
Location
Words with exactly 4 (not 3) letters
Quantification
where x (not z) is an exact number of repetitions
Common Examples
Link still points to blogspot, not new domain.
Oh. The backreferences link also still points to blogspot.
Huw,
You're sharp. I fixed those, and then I noticed a few where it was written incorrectly that the white space was matched. Thanks! :)
Andrew
I could not get the duplicate word check to work. I set up several examples in writer 2.3.1.2 on openSuse 10.2. Is there a typo, or has something changed in my version? Thanks for the help.
I must update my post. My dictionary is set to Spanish, and not only does Spanish not seem to spellcheck, it also renders the regexp useless. Sorry for the post. Yes, myspell-spanish is installed. :) I'm sure it's a configuration thing on my end.
Chao,
Thanks very much for the detailed list.
Now I'm one step further into solving my problem with cleaning up text.
You all may know the problem yourselves: some people prefer fewer or more new lines, no new line between paragraphs or even a double one; others may want to add a new paragraph or at least a new line before every "direct speech". Now I can find the problematic sections, but I still need to find a way to replace them with other options, like a second new line. \p replace \p\n won't do that, at least not for me.
I'm looking for a way to select a sentence, and assign this to a macro. Triple-clicking doesn't record as a macro, but if there's a way to put it into a macro that would solve it.
To this end I've been trying to reliably find the start of sentences.
Searching backwards for "." or even "\.\>" and then moving forward x spaces doesn't work perfectly, e.g. if there's a heading, or if there's full stops within a sentence to signify abbreviation "approx."
The OOo inbuilt functions of Go To Previous Sentence and Go To Next Sentence don't work if the sentence is the first, or last, in a paragraph, depending on how I'm trying to select the sentence from that point.
Any help would be much appreciated.
Charlie
Never mind, found this:
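sub SelectSentence2
    ' Create a text cursor at the position of the visible (view) cursor.
    oViewCursor = ThisComponent.CurrentController.getViewCursor
    oTextCursor = ThisComponent.Text.createTextCursorByRange(oViewCursor)
    ' Move to the sentence start, then extend the selection to the sentence end.
    oTextCursor.gotoStartOfSentence(false)
    oTextCursor.gotoEndOfSentence(true)
    ' Apply the selection to the view cursor so it shows in the document.
    oViewCursor.gotoRange(oTextCursor, false)
end sub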
Charlie
With regard to the Backreferences section, I think there is a little mistake. The correct form should be
\<(.*)\>[:space:]\<(\1)\>
with the star INSIDE the round brackets, shouldn't it? Otherwise, it finds every long sentence ending with a one-letter word if it is identical to the last letter of the previous word.
Thanks for your work.
I cannot make something like:
eval\\((((?!eval).)+)\\)
work in OpenOffice regexp Find.
Does it support: ?!
Hi. I need to add file name prefixes and suffixes to a couple thousand file names in cells. The file names are currently "BBD304 24x26" I want to change them to "public:\\BBD304 24x26.jpg" The numerals are different in each file name, but the pattern is all the same.
Thanks,
Bill