PDF import and hybrid PDFs - OpenOffice.org Ninja

PDF import and hybrid PDFs

Posted by Andrew Z on Wednesday, June 4, 2008

Now available for testing is the PDF import extension, which also includes hybrid PDF-ODF export. PDFs are designed for layout rather than for further editing, so when a word processor, spreadsheet, or presentation application exports to a PDF, the document structure needed for editing is lost even though the visual layout is preserved. To avoid disappointment, keep in mind that creating a PDF is not a reversible process because of the limitations of PDF as a format.

Implementations may favor either editing or layout. This extension's current implementation favors layout, so it imports PDFs into Draw. Keeping all that in mind, this extension still has many uses, such as adding annotations, filling out forms, making minor edits, using a PDF as a picture, and reusing PDFs for which the original source is lost. Because of the limitations of PDFs, these uses may not be completely painless.

PDF import

The extension installs as easily as any OpenOffice.org or Firefox extension. OpenOffice.org extensions cannot register file associations with the operating system (though you can set them up manually), but importing a PDF is as simple as clicking on File and then Open. The import process takes a long time compared to opening an OpenOffice.org document because of the necessary guesswork caused by the limitations of PDF.

For a test, I exported ODF_text_reference_v1_1.odt from OpenOffice.org and imported it again. When the initial screen appeared with the results, I stared at it in disbelief. It looked just like the original. The text, layout, font faces, text colors, bold, italics, underline, and picture were well preserved.

Below are the original in Writer and the imported document in Draw. Doesn't it take more than a glance to identify which is the original?

[Screenshot: OpenOffice.org Writer showing the original document used in this test]
[Screenshot: OpenOffice.org Draw showing the result of the PDF import extension]

Only later, after a closer look, did imperfections appear. For example, interactive PDF form elements (including a text input form and a button) were visibly mangled. That may be fixed in a future release. Then, there are the limitations of PDF import. Each line of text is one or more text boxes. Hyperlinks are merely blue, underlined text without interaction. Superscript is just a smaller font positioned in a higher text box. Comments are discarded.

Overall, this is a remarkable result for an early release. Future versions of the extension may take on other forms of PDF import, such as favoring text streams.

Alternative PDF import

OpenOffice.org did not pioneer PDF import, not even in the open source market. Some of the work in OpenOffice.org is done by xpdf, a PDF viewer. To import PDFs, open source alternatives include pdftohtml, AbiWord, KWord, and Inkscape. There is also a host of proprietary applications.

Depending on your needs, there are other ways to import PDFs into OpenOffice.org. To import PDFs into Writer or Impress, you may be able to combine the new PDF import extension with copy and paste. If you just need to extract text, copy the text in Adobe Acrobat Reader and paste it into OpenOffice.org. This retains some formatting. If you just need to place a picture from a PDF into OpenOffice.org, take a screenshot or use Adobe Acrobat Reader's snapshot tool. To place a whole PDF as a read-only image, insert the PDF as an OLE object: on Windows, click Insert - Object - OLE Object - Create from file. On Linux, you can convert PDFs to bitmap images (such as PNGs) using ImageMagick's convert tool with a command such as:

convert foo.pdf bar.png
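
If you convert PDFs often, the same ImageMagick call can be wrapped in a small script. The sketch below is only an illustration: the helper names build_convert_command and pdf_to_png and the 150 DPI default are assumptions, not something from the extension or from ImageMagick. It uses ImageMagick's -density option to raise the rendering resolution, and it assumes convert is installed and able to read PDFs (ImageMagick typically relies on Ghostscript for that).

```python
import subprocess


def build_convert_command(pdf_path, png_path, dpi=150):
    """Build the ImageMagick command line for rasterizing a PDF.

    -density sets the rendering resolution in DPI. It is placed before
    the input file so it applies while the PDF is being read.
    The 150 DPI default is an arbitrary choice for this sketch.
    """
    return ["convert", "-density", str(dpi), str(pdf_path), str(png_path)]


def pdf_to_png(pdf_path, png_path, dpi=150):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(pdf_path, png_path, dpi), check=True)
```

Calling pdf_to_png("foo.pdf", "bar.png") runs the equivalent of the command above at a higher resolution. With a multi-page PDF, ImageMagick typically writes one numbered file per page (bar-0.png, bar-1.png, and so on) rather than a single image.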

What makes OpenOffice.org stand out from these other solutions is hybrid PDFs.

Hybrid ODF-PDF files

Combine the viewing and printing portability of a PDF with the editing capabilities of OpenDocument Format. "Have your cake and eat it too," promises "ODF is embedded in PDF." When these two open standards team up, you had better watch your back, OOXML.

Most applications (such as Adobe Acrobat Reader) ignore the ODF bits and treat the whole hybrid file as a normal PDF. Presentation is pixel perfect. Wait. That's not all. OpenOffice.org 3.0 with this extension treats the hybrid as a normal ODF, so the ODF document opens in Writer, Impress, Calc, or Draw according to the original. (You didn't just expect Writer, did you?) Now you have lossless, editable, round-trip PDFs.

To export hybrid PDF files, you need the (inaccurately named) PDF import extension which adds a new checkbox to the PDF export dialog box. To import hybrid files, you also need this extension.
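
If you want to tell a hybrid from a plain PDF without opening it in OpenOffice.org, a rough byte-level check may be enough. The sketch below assumes that the exporter marks the embedded ODF with an /AdditionalStreams entry in the PDF trailer. This article does not confirm that detail, so treat the marker and the helper names as assumptions and verify them against a hybrid file you exported yourself.

```python
# ASSUMPTION: hybrid PDFs carry this key in the PDF trailer.
# Check a known hybrid file before relying on it.
HYBRID_MARKER = b"/AdditionalStreams"


def looks_like_hybrid_pdf(data):
    """Heuristic: True if the bytes look like a PDF with an embedded ODF.

    This only searches the raw bytes for the marker; it does not parse
    the PDF, so false positives and false negatives are possible.
    """
    return data.startswith(b"%PDF-") and HYBRID_MARKER in data


def is_hybrid_pdf_file(path):
    """Read the whole file and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_like_hybrid_pdf(handle.read())
```

A byte search is used instead of a PDF library to keep the sketch dependency-free. The definitive test remains opening the file in OpenOffice.org with the extension installed: a hybrid opens in the original application rather than in Draw.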

OpenOffice.org 3.0 PDF export dialog showing the new hybrid PDF export option

One downside of this hybrid system is that adoption by users will be slow. Especially at first, not everyone will have OpenOffice.org 3.0 and this extension. Other applications may not adopt support for hybrid PDF-ODFs, but the genius is that the dual-format strategy mitigates the problem, so everyone will gain at least some use from these hybrids.

Another downside is that hybrid documents are larger files because some of the information is duplicated. For what it's worth, ODFs are compressed, and storage capacities are steadily climbing. Monster 2 TB drives are coming next year. Space is cheap, and you probably won't store all your documents as hybrids anyway.

What will Microsoft's reaction be? Office 2007 already has PDF export, and next year Office 2007 SP2 will support ODF natively. Will Microsoft Office one day see hybrid OOXML-PDFs, ODF-PDFs, both, or neither? Don't forget that any move Microsoft makes will likely favor XPS, its PDF competitor.

Download

Test this extension in OpenOffice.org 3.0 or later. Though 3.0 comes out in September, the 3.0 beta and developer snapshots are available now. Currently the developer snapshot DEV300_m14 is newer than the 3.0 beta.

Remember this extension is not yet a stable 1.0 release. PDF import extension builds are currently available for Linux, for Windows, or for Mac (download links updated 9 June 2008) courtesy of Pavel Janik. (Update 11 June 2008) The extension is available on the OpenOffice.org extensions web site, but it does not exactly require OpenOffice.org 3.0 Beta 2 (not yet available) as written. It works in DEV300_m18 (available now).

28 comments:

Anonymous said...

Hybrid PDFs seem incredibly useful. Good job Sun!

Anonymous said...

I would like to add that there is another free (GPL) cross-platform tool which can import PDFs very accurately. It's Inkscape (www.inkscape.org/). I was really surprised by the excellent quality in the latest release, 0.46. I'm in academia, where I have to deal with charts, diagrams, and so on. Colleagues of mine frequently hand me PDFs for review. If needed, editing charts or whatever for the final print is totally easy. Of course one can save it in another format from there, be it vector graphics (svg, eps, "pdf", odg, ...) or pixel graphics (png).

Anonymous said...

At last! -the fonts can get saved into an .odt, odg etc. This will be -SO- handy!!

Andrew Z said...

Anonymous: While OpenOffice.org embeds fonts in PDFs (including hybrid PDFs), it doesn't exactly embed fonts in OpenDocument Format files. If that is what you need, add 1-2 votes for issue 20370.

Fox Cole said...

Just a caution about the suggested alternative, inserting PDFs as OLE objects. It doesn't work unless whoever is viewing the document has something that can edit PDFs... one of the main reasons why this extension was needed and developed, I think.

mandehu said...

Your OO extension pdfimport.oxt poses a problem on my Intel iMac under Tiger 11: when I download it, it gets converted to pdfimport.oxt.jar. Unfortunately my system can't cope with .jar files -- I get an error message saying "Jar Launcher The jar file "pdfimport.oxt.jar" couldn't be launched. Check the Console for possible error messages". However the Java Console.log is *empty*.
Thank you for any help

Jakub Narebski said...

Does this extension use CDF (Compound Document Format) from W3C?

(Although CDF is described as being used for XHTML, SVG, SMIL and XForms composite formats, I don't see why it could not work for PDF+ODF... well, perhaps except for the fact that PDF is not an XML-based format, but then again PNG is not XML either).

Andrew Z said...

mandehu: Have you tried to rename the .jar to .oxt? OpenOffice.org expects extensions to be .oxt. When I download using Firefox on Linux, it works fine.

Jakub Narebski: I think the hybrid PDF's container format is simply PDF. Wouldn't applications such as Acrobat Reader require special support for CDF? Also, I don't see references to any actual implementations of CDF.

Anonymous said...

well, with this extension PDF-Import is not working

Martin

Anonymous said...

Hm, every time I want to open a PDF file (downloaded from the Internet) I get a runtime error on my Windows system!

The message:

"R6034
An application has made an attempt to load the C runtime library incorrectly."

The pdf file:
http://svn.reactos.org/media/reactos_0.3.4_booklet_cover.pdf

Unknown said...

Same error here:
R6034
An application has made an attempt to load the C runtime library incorrectly.

- WinXP Pro SP2 full update
- OOo-DEV300m17
- OOo 2.4

Andrew Z said...

I filed a bug report for the C++ runtime error on Windows. There's a chance an OpenOffice.org build from Pavel's site would be a workaround.

Anonymous said...

Seems to be fixed in DEV300_m18

Unknown said...

> Seems to be fixed in DEV300_m18

Yes, it works! Thanks.

Anonymous said...

You added a permalink for this article on http://wiki.services.openoffice.org/wiki/Pdf_Import_Extension but this link is broken because of a missing l at the end of the link.

Andrew Z said...

Anonymous: Thanks for the tip! :)

Anonymous said...

Hi, Just installed ver 3 from a mirror site and added the pdf extension.

Noticed with Hybrid...

if you start with a passworded odt, then export a hybrid without adding pdf security, you get the pdf viewed immediately in a reader as expected, but when opening the hybrid pdf in OpenOffice, Writer opens the odt without needing any pass phrase.

So, it seems that to protect the odt in a hybrid, you have to protect the hybrid pdf?

BTW, I don't use certificates and such, but I guess an "MD5" of the odt if saved from the PDF to disk again would be different than the passworded original?

(Maybe I'm missing something but thought I'd mention it sooner than later, and not to confuse the forum if the documentation hasn't caught up to the new feature yet)

Thanks

Bh

Ubu Walker said...

I downloaded this extension and was surprised at how poorly it imported the US Internal Revenue Service's 1040 form. Text ran over into other fields, the '20' in 2008 was a few cm too high, and I couldn't figure out how to add my own text to blank spaces, so that I could fill out the form.

Boo...Hiss...this extension sucks and is not ready for prime time. This is alpha quality, not beta.

Anonymous said...

I have followed your writing for a long time. Really, you have given very successful information.
In spite of my English trouble, I am trying to read and understand your writing.
And I am following frequently. I hope that you will be with us together with much more sharing.
I hope that your success will go on.

Anonymous said...

This worked great for PDFs that were setup in Portrait. But I had problems with a PDF setup as Landscape...

Jonathan Hayward said...

I've found that import seems to work for Windows 32-bit but not Linux 64-bit: when I try to open a PDF I had earlier accessed in Linux 32-bit, it opened without trouble, and it also opened without trouble under Windows 32-bit, but was opened as text under Linux 64-bit with the PDF import extension installed and (apparently) active.

Is this a known issue?

Jonathan
Jonathan's Corner: A Library of Free Online Books to Read

Anonymous said...

Does anyone know how to add hyperlinks to the PDF file once it has been imported?

The PDF file had working hyperlinks before the import, after exporting the new version of the PDF the hyperlinks did not work.

Any suggestions?

Thank you.

Anonymous said...

http://uwf.edu/envhs/pdffiles/ELECSAFt.pdf

I had a problem with this page, some of the images arrive flipped.

otherwise great program.

Andrew Z said...

Anonymous (Apr 17): That may be bug 92908 "ignores transformation matrix on an image."

dexter said...

how do you identify if it is a regular pdf file or a hybrid pdf file?

Andrew Z said...

Dexter: Try and open it in OpenOffice.org? :)

EricMarceau said...

(N.B. Previously submitted at http://wiki.services.openoffice.org/wiki/Writer/ToDo/PDF_Import)

A key tool for many documents is "marking up" and saving such personal highlights and commentaries for documents we download for review/research purposes. The PDF-import function would be a key enabler of this irreplaceable activity. Allow me to expand. Though not often consciously considered for this specific purpose, PDF is also deemed an archival format, meaning a frozen snapshot in time. If one expects to make use of this original, unmodified form every time, the PDF is often stored in a common repository for search and retrieval. However, for more personal use, as for researchers identifying citations relevant to a given study, this original form must be either

a) printed and highlighted to bring attention to relevant excerpts, or

b) excerpted and copy/pasted into a separate file for such references.

It would be desirable to be able to save the highlighting mask, with possible commentary, either into a personalized version of the PDF file, or as a separate "commentary" file. The commentaries, if stored separately, would minimize what needs to be forwarded when collaborating, and would minimize the data growth if large groups of reviewers need to store such commentaries in the same central repository. If a first level of capability for the import was ONLY to facilitate this highlighting overlay process, it would truly address a widespread need.
