Example regular expressions for Writer - OpenOffice.org Ninja

Example regular expressions for Writer

Posted by Andrew Z at Sunday, December 30, 2007 | Permalink

Here are some sample regular expressions for OpenOffice.org Writer. Use these examples as is or as a basis for building your own regular expressions.

In the Find & Replace dialog box, don't forget to check the box Regular Expressions. Also, you usually will want Match case to be unchecked.

Description                                         Search for
Empty paragraph without whitespace                  ^$
Empty paragraph with whitespace                     ^[ \t]+$
MM/DD/YYYY and M/D/YY dates                         [01]?[0-9]/[0-3]?[0-9]/[21]?[0-9]{0,3}
Email addresses (not a perfect regex pattern)       \<[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\>
10-digit US phone number such as 123-456-6789       [0-9]{3}-[0-9]{3}-[0-9]{4}
Second letter in a word is capitalized like FOo.    \<[A-Za-z][A-Z][a-z]*\>
  Make sure the box Match case is checked.
Long words (10 characters or more)                  \<[a-z]{10,}\>
Paragraphs beginning with demonstrative pronouns    ^(This|That|These|Those)
Palindromes with three letters                      \<(.).\1\>
HTML, XML, SGML, and similar tags                   <[a-z/][a-z]*>
IP addresses                                        \<((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\>
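
To try a few of these patterns outside Writer, here is a minimal Python sketch. It assumes Python's re module, which is not the engine Writer uses: Writer's word-boundary markers \< and \> are written as \b here, and re.IGNORECASE stands in for leaving Match case unchecked. The names PATTERNS and find_all are made up for this illustration.

```python
import re

# Writer marks word start/end with \< and \>; Python's re module uses \b for both.
PATTERNS = {
    "phone": r"[0-9]{3}-[0-9]{3}-[0-9]{4}",
    "long_word": r"\b[a-z]{10,}\b",
    "palindrome3": r"\b(.).\1\b",
    "ip": (r"\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
           r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"),
}


def find_all(name, text):
    """Return every whole match of the named pattern in text."""
    # IGNORECASE mirrors leaving Match case unchecked in the dialog.
    regex = re.compile(PATTERNS[name], re.IGNORECASE)
    # finditer + group(0) yields whole matches even when the pattern has groups.
    return [m.group(0) for m in regex.finditer(text)]


if __name__ == "__main__":
    print(find_all("phone", "Call 123-456-6789 today"))
    print(find_all("ip", "Server 192.168.0.1 replied"))
```

Because the palindrome pattern uses a dot, it can also match non-letter sequences such as a single letter between two spaces, so results may need a second look.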

Search and replace

Here are some common regular expressions for search and replace.

Description                                         Search for                         Replace with
Replace each tab with three spaces                  \t                                 three spaces
Replace three spaces with a tab                     three spaces                       \t
Replace non-breaking spaces with regular spaces     ([:space:])                        space
Replace manual line breaks with paragraph breaks    \n                                 \n
Replace double smart quotes (aka book quotes or     [\x201C\x201D\x201F]               "
  curly quotes) with straight quotes (aka dumb
  quotes)
Replace single smart quotes with straight quotes    [\x2018\x2019\x201B]               '
Converting dates in YYYY-MM-DD format to            ([0-9]{4})-([0-9]{2})-([0-9]{2})   $2/$3/$1
  MM/DD/YYYY format
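
For comparison, a few of these replacements can be sketched in Python. This is only an illustration under the assumption that Python's re module is used: the backreferences that Writer writes as $1, $2, $3 are written \1, \2, \3 in a Python replacement string, and the function names are made up for this example.

```python
import re


def iso_to_us_date(text):
    # Writer's replacement $2/$3/$1 becomes \2/\3/\1 in Python.
    return re.sub(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", r"\2/\3/\1", text)


def straighten_quotes(text):
    # Double curly quotes (U+201C, U+201D, U+201F) become a straight double quote.
    text = re.sub("[\u201C\u201D\u201F]", '"', text)
    # Single curly quotes (U+2018, U+2019, U+201B) become a straight apostrophe.
    return re.sub("[\u2018\u2019\u201B]", "'", text)


def tabs_to_spaces(text):
    # Each tab becomes three spaces; no regular expression is needed for this one.
    return text.replace("\t", "   ")


def nbsp_to_space(text):
    # U+00A0 is the non-breaking space.
    return text.replace("\u00A0", " ")


if __name__ == "__main__":
    print(iso_to_us_date("Posted on 2007-12-30."))
```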

Removing non-breaking spaces without regular expressions

Non-breaking spaces (gray rectangles) often appear when pasting from PDFs or web pages. To view them, choose View > Field Shadings. An alternate way to remove them is to:

  1. Highlight one non-breaking space.
  2. Copy it to the clipboard.
  3. Paste it into the Search for field.
  4. Type a space in Replace with.

Replacing with line breaks

Using the standard Find & Replace dialog, it is not possible to use manual line breaks in the Replace with field (issue 46165).

One source of confusion is that \n has different meanings in the Search for and Replace with fields. In Search for, \n matches a line break. In Replace with, \n inserts a paragraph ending. (Paragraph breaks are more common, and they are inserted in a document by simply striking the ENTER key.) Another source of confusion is the term carriage return, which may refer to either paragraph breaks or line breaks.

To replace with a line break, your choices include these:

  1. Do it manually.
  2. Record a macro to do it once. Assign the macro to a keyboard shortcut. Then, type the shortcut key as many times as required.
  3. Use Tomas Bilek's Alternative dialog Find & Replace for Writer. It is an extension, so it is relatively easy to install.
  4. Use Ian Laurenson's Find and Replace macro. It is not packaged as an extension, so it requires more steps to install.

If you are replacing paragraph breaks with line breaks, keep in mind that you could end up creating a very long paragraph, and Writer has a limitation of 65,534 characters per paragraph (issue 17171).

Replacing dumb quotes with smart quotes

To replace all the straight quotation marks with curly quotation marks, follow these steps:

  1. Type any word in smart quotes.
  2. Highlight the right smart quote (”).
  3. Copy the selection to the clipboard (shortcut is CTRL+C).
  4. Choose Edit > Find & Replace from the menu.
  5. If you place periods inside the quotation mark, type [\.\?!]" in the Search for field.
  6. If you place periods outside the quotation mark, type "[\.\?!] in the Search for field.
  7. In the Replace with field, paste the clipboard contents (by right clicking or typing CTRL+V).
  8. Press the More Options button.
  9. Check the box Regular Expressions.
  10. Click the Replace All button.
  11. Highlight the left smart quote (“).
  12. Copy the selection to the clipboard.
  13. In the Search for field, type a single dumb quote (").
  14. In the Replace with field, paste the clipboard contents.
  15. Click the Replace All button.
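
The same two passes can be sketched in Python for plain text. This is a hedged variation rather than an exact copy of the steps above: it assumes periods go inside the quotation marks, and it captures the punctuation mark and puts it back (\1) so that it survives the replacement. The function name is made up for this example.

```python
import re

LEFT, RIGHT = "\u201C", "\u201D"  # left and right curly double quotes


def curl_double_quotes(text):
    # First pass: a straight quote right after . ? or ! closes a quotation.
    # The punctuation is captured and re-inserted so it is not lost.
    text = re.sub(r'([.?!])"', r"\1" + RIGHT, text)
    # Second pass: every remaining straight quote is treated as an opening quote.
    return text.replace('"', LEFT)


if __name__ == "__main__":
    print(curl_double_quotes('He said "Stop!" and "Go."'))
```

A closing quote that follows a comma or a plain word is not caught by the first pass and ends up as an opening quote, so the result still needs proofreading.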

Removing manual page breaks

This is the simplest way to remove all manual page breaks:

  1. Highlight all text in the document (shortcut is CTRL+A).
  2. Choose Format > Paragraph from the menu.
  3. Choose the Text Flow tab.
  4. Uncheck the box Break > Insert.

Finding, removing, or inserting manual page breaks cannot be done with regular expressions or with the Find & Replace dialog (issues 26719 and 63606).

OpenOffice.org version

A few of these examples use backreferences (like $1) which require OpenOffice.org 2.4.

Introduction to regular expressions

In case you missed it, read the introduction to regular expressions.

26 comments:

Anonymous said...

I read dozens of pages in forums and did not find such a natural thing: how can one automatically replace all double blank lines with a single blank line? (If configured correctly, it will also change a triple repetition of a blank line into a single blank line.)

Something so easy in emacs and so difficult in OOW.

Fionn said...

To the Anonymous commenter regarding replacing blank lines....

Remember that paragraphs are single lines in Writer (or any Word Processor, for that matter), so the caret (beginning of line) and DollarSign (End of Line) combo (^$) is a blank line, not just a blank paragraph as indicated in the blog entry above.

You then want to repeat it (because the RegEx in Writer doesn't have {min,max} features) with a backreference (use "$1"), with a plus-sign, since you want at least one (contrast this with the asterisk/star ("*"), which matches zero or more).

You'll want something like:

(^$)$1+ as your search pattern.

Juegos said...

Interesting article, thanks for all.

chuck said...

Regarding the double newlines, or the empty paragraph.
In my text, removing the empty paragraph won't work, because it eliminates the format information of the next line...

end of prev paragraph.new paragraph
new paragraph
HEADING 2

- would end up as ... -

end of prev paragraph.new paragraph
heading 2 with default format

heading 2 would actually be stripped of its heading formatting and document hierarchy.

Such an everyday task, do we have to do it by hand?

danlj said...

I am searching how to use regular expressions properly to REPLACE wildcards.

e.g. I have numerous occurrences of
" This is a sentence. "
I want
"This is a sentence."
"\ [A-Z]
finds all instances of a double quote separated by a spurious space from a following capital letter, but I cannot find how to use the 'found' character in the replacement string.
I had expected ? would do, but the found letter in this instance is replaced with the literal question mark.

Where can I find instructions on replacement wildcards in OO (ver 2.3, 2.4)?

Tim Richardson said...

I want to replace every paragraph that is followed by a tab at the beginning of the next line with a newline and a tab.

How do I do that? the find pattern
$\t
doesn't match :-)

Anonymous said...

In response to the comment earlier today by T. Richardson:

The problem you're experiencing is a common one, so don't feel too bad about it! What you're telling the system to look for is an end-of-line (the '$') followed by a tab (the '\t'). Regexp's weren't originally designed to cover more than one line at a time, so we need to look at the beginning of the line for our match rather than at the end of the line.

Try looking for either 1) a 'New line' character followed by a tab, or 2) a 'start of line' marker followed by a tab, or 3) a 'carriage return' character followed by a tab.

Each of these tells the system to look for a tab at the beginning of a paragraph.

The search string would be either 1) '\n\t', 2)^\t, or 3) a '\r\t' (and of course, don't include the quote-marks!). My personal preference is choice #2.

The problem is that it will catch the first paragraph, and give it a tab. This does not solve your criteria that the tab must always follow a paragraph-end -- but in a large document, a manual fix for just one paragraph at the top of the document shouldn't be too much trouble, right?

-Fionn

Wheat said...

Can you include a regular expression for replacing a long em-dash with two regular hyphens?

Anonymous said...

To "Wheat", earlier today:

Frankly, the easiest way to find-&-replace an em-dash with two regular hyphens is to use the standard non-regular-expression way in OpenOffice; just copy an em-dash from somewhere in the document, and paste it in the 'Search for' field, and then put two '--'s (hyphens) in the 'Replace with' field.

If you're *really* going to go through the trouble of using a regular expression, you could use the unicode for the em-dash (U+2014), and replace it with two hyphens, but -- my goodness -- why do that?

OpenOffice's find-and-replace feature really uses the regexp portions to locate positions and contextual clues to locate search results, rather than specifying *what* to find... It's just the flavor the codewriters settled-on.

Good luck with the em-dash to double-hyphen work; let us know how it goes.

-Fionn

Yaffle said...

This all seems massively more difficult than in W4W. Why doesn't the Find/Replace dialog box give you the option of inserting non-printing characters?

Anonymous said...

This is directed towards 's reply earlier, today.

As I'm not a developer, I couldn't speak towards why it's not 'like in WFW'. That's a design choice by them who do the programming.

This thread has been about using 'Regular Expressions' as a means to do a search in OpenOffice.org. 'Regular Expressions' ('Regex' for short) can give the user an extremely fine-tuned search-and-replace action -- with a very strong emphasis on 'extremely'.

Sometimes you might want to change a word (or part of a word) only if it's in the past tense, and then only if the paragraph is talking about a certain topic. With a general find-and-replace (as found in WFW), the user has to manually check each potential change (including the paragraph's topic / context!), one after another -- and this can take quite a long time in a document that's several hundred pages long.

A well-crafted Regex can have the program (the word processor, in this case) do the checking for you -- hit 'enter' once, and it's done.

Sure, using Regex takes a bit of effort to learn at first. But compare riding a bicycle to riding a tricycle: one takes effort to learn but is much more effective at getting you from one place to another; the other is easy and you never fall off, but it is very slow and you really wouldn't want to travel far with it.

arnotixe said...

Hi I scanned a few books and used tesseract to read them. The problem is that the words in the original book were hyphenated a lot, resulting in a hyphen and a lineshift, like this:

'The Encyclapedia Arnericana libro-
pika kashnami nin:

Now, I needed to remove all lineshifts preceded by a hyphen (-$ in regexish). Couldn't find a way to do it directly in openoffice, so I
1) Searched for "-$" and replaced with "DASHHERE"
2) Searched for "$" and replaced with "NEWLINE"
3) Searched for "DASHHERENEWLINE" and replaced with "" (nothing)
4) Searched for "NEWLINE" and replaced with \n

Of course, DASHHERE, NEWLINE and DASHHERENEWLINE must not occur in the original text.
Now I'll try out whether the alternative search and replace works better...

arnotixe said...

Appending my own comment:
De-hyphenation was very easy with the alternative search-and-replace tool.

Search for "-\p" (without " 's)

Replace with nothing.

Bob's your uncle.
\p means new paragraph, while $ has the normal regexp "end of line" meaning.

Anonymous said...

"Replacing dumb quotes with smart quotes

To replacing all the straight quotation marks with curly quotation marks, follow these steps:
[...]
5. If you place periods inside the quotation mark, type [\.\?!]" in the Search for field."

Did you actually bother to test this? Probably not, because it is worse than useless. It also removes your punctuation, and it does not replace the dumb quotes which do not end sentences or follow a comma.

Tip: go the other way around. First replace all opening dumb quotes ("\<), since they are most likely to come right before a word. Then replace all the remaining dumb quotes with a closing smart quote. Repeat for single quotes.

olegkirillov said...

The problem is that I cannot reformat Project Gutenberg's txt. I need to replace double newlines with single ones and single newlines with spaces. In W4W I just place a special character (double Paragraph mark) in the search box and the special word PARAGRAPHMARK in the replacement box. After that I turn all other paragraph marks to spaces and then PARAGRAPHMARK to a paragraph mark. Fine.
It's impossible in Writer and I do not have Word installed so I'm stuck...

Andrew Z said...

olegkirillov: Check out the Alternative dialog Find & Replace for Writer which provides advanced find and replace features

Ely said...

I am very disappointed with how OOw treats paragraph marks. I often have need to replace paragraph marks with manual line breaks. The only way I could find to do this is either manually, or by using the Alternative Find & Replace. But the add-in is buggy and won't do a replace all, so it's almost like doing it manually.

Is there a single point of contact somewhere in charge of search & replace who could possibly be persuaded to consider that paragraph marks need to be handled like any other flow or non-flow character? What is so difficult about this concept? IT'S JUST A CHARACTER. DEAL WITH IT!

Andrew Z said...

Ely: Unzip the .odt, look at the .xml, and you should see that paragraphs are not actually characters. They are represented by markup similar to HTML

Ely said...

You are so correct. OK, I understand the greater complexity. In the html files I'm working with, the \n is a br tag and the paragraph mark is an enclosing p.../p. So as I migrate from Vista with MS Office I battle the formatting issue. Word has no problems searching for and replacing paragraph marks. OOW has no problems displaying the paragraph marks. If it can display them, it's frustrating that it can't search and replace them.

I wouldn't need to replace paragraph marks if OOW could generate properly formatted paragraphs. But what I create with OOW is not how it appears in a browser. Didn't have that problem with Word either.

Praneel said...

Does OpenOffice support LookAhead & LookBehind assertions in regexp?

NunoB said...

Does anybody know how can I make a find and replace and change the format of part of the replace?
For example, substitute VIN by VIN, where IN is in subscript.

Is it possible to do ?

flyer said...

Hello

I want to find all lines in a document which are enclosed between two special tags

like between "BLABLA"

BLABLA
sth sth sth

sth sth sth
BLABLA

I tried it with your above-mentioned solution of using "^" but I cannot get it working. My attempts have been to use

"BLABLA$(^.*$)+^BLABLA"
"BLABLA$(^.*)+^BLABLA"
"BLABLA.*^.*^BLABLA"
...

Do you have any idea how to solve this? It works when I change all $ to \n, but since it is quite a big document I think I will run into trouble with the paragraph limit.

Thanks
Florian

Jim said...

To "Flyer," above:

You've already found the solution to your problem -- you've said "it works" when you substitute with the '$' as the end-of-line mark in your regex. I'm not sure I understand what you mean when you say you'll have problems with the 'paragraph limit'.

I'm not aware of a 'paragraph limit'.

Have you thought about breaking a Very Large Document into several smaller sub-documents, each called by a Master Document (an '.ODM' file)? Document management -- especially on large documents -- is *so* very much easier when they're used that way.

Try it (the 'Master Document' method); you just might find it's a feature you've been missing and never knew you needed.

Eliodoro said...

Thanks for this kind of information, really useful.

I want to comment something that I found in Calc and in the use of Regular Expressions inside OO.

I am using OO in Spanish. In formulas and regular expressions, many of the boolean parameters (and, or, etc.) don't work in Calc if I write them in Spanish (y, o). The same happens with things like [:digit:]: the Spanish form [:digito:] doesn't work. OO accepts these parameters only in English, even though the OO help shows them in Spanish.

An internationalization problem, I think.

Switched to Wordpress said...

I am trying to find paragraph breaks (the pilcrow, $ sign), and replace them with TWO paragraph breaks, so I can upload them somewhere where indents aren't supported. Thing is, it FINDS paragraph marks but it REPLACES with dollar signs. Help?

Luis Roberto Rodriguez said...

Hi, I have read every single page on Google about regex and it has proven impossible for me to do this. I need to find the word DOWN in a paragraph that is like this:
Interface Status Portstt
Gigaethernet admin down down
ethernet down down

So I need to find the interfaces in state "DOWN" but not the ones that say "administratively down"; I need to make an exception for those. I used the regular expression ^*down.*down.*$, but when I search with it I get every word that says down, even the administratively down interfaces, which doesn't work for me! :( Please help me!!!!