Batch command line file conversion with PyODConverter - OpenOffice.org Ninja

Posted by Andrew Z at Wednesday, February 27, 2008


You want to batch convert .doc to .pdf using the command line on a server without a GUI? Or you need automated .ppt to .swf conversion through cron, a sysvinit service, or a remote web server? Online conversion services such as Zamzar.com and Media-convert.com not working for you? Whichever formats you need to batch convert, PyODConverter is a simple Python script for just this purpose.

On Linux

Make sure you have the headless package installed. For the vanilla OpenOffice.org 2.4, the package is named openoffice.org-headless-2.4.0-9286.i586.rpm.

This shell script demonstrates the full use of PyODConverter for Linux:

#!/bin/bash

# Try to autodetect OOFFICE and OOOPYTHON.
# (Errors from ls about missing paths are discarded.)
OOFFICE=`ls /usr/bin/openoffice.org2.4 /usr/bin/ooffice /usr/lib/openoffice/program/soffice 2>/dev/null | head -n 1`
OOOPYTHON=`ls /opt/openoffice.org*/program/python /usr/bin/python 2>/dev/null | head -n 1`
if [ ! -x "$OOFFICE" ]
then
    echo "Could not auto-detect OpenOffice.org binary"
    exit 1
fi
if [ ! -x "$OOOPYTHON" ]
then
    echo "Could not auto-detect OpenOffice.org Python"
    exit 1
fi
echo "Detected OpenOffice.org binary: $OOFFICE"
echo "Detected OpenOffice.org python: $OOOPYTHON"

# Reference: http://wiki.services.openoffice.org/wiki/Using_Python_on_Linux
# If you use the OpenOffice.org that comes with Fedora or Ubuntu, uncomment the following line:
# export PYTHONPATH="/usr/lib/openoffice.org/program"

# If you want to simulate for testing that there is no X server, uncomment the next line.
#unset DISPLAY

# Kill any running OpenOffice.org processes.
killall -u `whoami` -q soffice

# Download the converter script if necessary.
test -f DocumentConverter.py || wget http://www.artofsolving.com/files/DocumentConverter.py

# Start OpenOffice.org in listening mode on TCP port 8100.
$OOFFICE "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &

# Wait a few seconds to be sure it has started.
sleep 5s

# Convert as many documents as you want serially (but not concurrently).
# Substitute whichever documents you wish.
$OOOPYTHON DocumentConverter.py sample.ppt sample.swf
$OOOPYTHON DocumentConverter.py sample.ppt sample.pdf

# Close OpenOffice.org.
killall -u `whoami` soffice
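The fixed sleep 5s in the script above is a guess at how long OpenOffice.org needs to start. On a slow or busy server you may prefer to poll the listening port instead. Here is a minimal Python sketch of that idea. The function name wait_for_port and its defaults are my own, not part of PyODConverter, and it is written for a current Python 3, which may not be the interpreter bundled with OpenOffice.org.

```python
import socket
import time


def wait_for_port(host="localhost", port=8100, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout` seconds
    have passed. Hypothetical helper, not part of PyODConverter.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

You would call wait_for_port() right after launching soffice and give up if it returns False. Note that an open port only shows that the process is listening; it does not prove that a conversion will succeed.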

Xvfb vs -headless

To use the -headless command line parameter, you must use OpenOffice.org 2.3.0 or later with the RPM package openoffice.org-headless-2.3.1-9238.i586.rpm. If you use an older OpenOffice.org version, you will need Xvfb to simulate an X server where one is not available:

#!/bin/bash

# Set DISPLAY to something besides :1 (because :1 is the standard display).
# Export it so the conversion script started below inherits it.
export DISPLAY=:1000

# Kill any existing virtual framebuffers.
killall -u `whoami` Xvfb

# Start the framebuffer.
Xvfb $DISPLAY -screen 0 800x600x24 &

# Run the OpenOffice.org conversion script above.
ooo-convert.sh

# Clean up
killall -u `whoami` Xvfb

On Windows

On Windows, convert documents as follows. First download PyODConverter, and adjust the OpenOffice.org pathnames to match the version of OpenOffice.org you have installed.

Windows users don't have to worry about Xvfb or the headless package.

"C:\Program Files\OpenOffice.org2.4\program\soffice.exe" -headless -nologo -norestore -accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager
"C:\Program Files\OpenOffice.org2.4\program\python" DocumentConverter.py test.odt test.pdf

UPDATE: On Windows XP in OpenOffice.org 2.4, you may need to switch the socket to a pipe if you get the following error:

__main__.com.sun.star.connection.NoConnectException: Connector : couldn't connect to socket (WSANOTINITIALISED, WSAStartup() has not been called)
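To make the switch concrete: the connection description appears twice, once in the -accept argument given to soffice and once in the UNO URL that the Python side resolves. Here is a minimal Python sketch of both strings for a socket and for a pipe. It assumes the usual UNO connection-string syntax; the pipe name pyodconverter and the two helper functions are made up for illustration, and you would still have to edit DocumentConverter.py by hand so that it resolves the pipe URL instead of its built-in socket URL.

```python
# Connection descriptions in UNO connection-string syntax.
SOCKET = "socket,host=localhost,port=8100"
PIPE = "pipe,name=pyodconverter"  # hypothetical pipe name, pick your own


def accept_argument(connection):
    """Build the -accept argument passed to soffice for this connection."""
    return "-accept=%s;urp;StarOffice.ServiceManager" % connection


def resolver_url(connection):
    """Build the UNO URL a Python client resolves to reach that office."""
    return "uno:%s;urp;StarOffice.ComponentContext" % connection


if __name__ == "__main__":
    for connection in (SOCKET, PIPE):
        print(accept_argument(connection))
        print(resolver_url(connection))
```

Both sides must use the same connection description, so if you change one to the pipe form, change the other as well.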

Supported formats

PyODConverter supports all of the formats that you can import or export manually in OpenOffice.org. They include:

  • odt OpenDocument (ODF) Text
  • sxw OpenOffice.org version 1 Text
  • doc Microsoft Word 97/2000/XP
  • rtf Rich Text Format
  • txt Plain text
  • wpd WordPerfect
  • html Web page
  • ods OpenDocument (ODF) Spreadsheet
  • sxc OpenOffice.org version 1 Spreadsheet
  • xls Microsoft Excel 97/2000/XP
  • odp OpenDocument (ODF) Presentation
  • sxi OpenOffice.org version 1 Presentation
  • ppt Microsoft PowerPoint 97/2000/XP
  • swf Adobe Flash

To convert .docx, .xlsx, .pptx, substitute OxygenOffice for OpenOffice.org.

To export to .csv, use JODConverter.

Reliability

Especially when batch converting many documents, keep in mind that the OpenOffice.org process may crash. If it does, the conversion script fails too. A reliable method is to start OpenOffice.org, convert one document, tear down the process, and repeat; however, this is slow. A compromise is to reuse the OpenOffice.org process for, say, 10 conversions, while checking for errors to decide whether the process needs to be restarted early.
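Here is a rough Python sketch of that reuse-and-restart loop. Everything in it is an assumption for illustration: convert_all and the three hooks (convert, start_office, stop_office) are names I made up, and the example convert function assumes that DocumentConverter.py exits with a non-zero status when a conversion fails. Each document gets one retry on a fresh process before it is reported as failed.

```python
import subprocess


def convert_all(jobs, convert, start_office, stop_office, max_uses=10):
    """Convert (source, target) pairs, recycling the office process.

    The process is restarted after `max_uses` conversions or right after
    any failed conversion. Returns the jobs that failed twice in a row.
    """
    failed = []
    uses = 0
    running = False
    for source, target in jobs:
        for _attempt in (1, 2):  # one retry on a fresh process
            if not running:
                start_office()
                running, uses = True, 0
            try:
                convert(source, target)
                ok = True
            except Exception:
                ok = False
            uses += 1
            if not ok or uses >= max_uses:
                stop_office()
                running = False
            if ok:
                break
        else:
            failed.append((source, target))
    if running:
        stop_office()
    return failed


def convert_with_pyodconverter(source, target, python="python"):
    """Example `convert` hook: run the converter as in the shell script.

    Assumes a non-zero exit status on failure (raises CalledProcessError).
    """
    subprocess.check_call([python, "DocumentConverter.py", source, target])
```

The start_office and stop_office hooks would wrap the soffice launch and killall commands from the Linux script. Keeping them as plain callables lets you test the restart logic without OpenOffice.org installed.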

JODConverter 3.0 (currently in development) will automatically restart OpenOffice.org in the event of a crash.

PDF printer method

If you just want to generate PDF files, you don't need a Python script, a Basic macro, Java code, or any other kind of programming. Just install a PDF printer such as PDFCreator (Windows) or CUPS-PDF. Then, use the -pt option with the printer name as the first argument and the source document as the second.

Linux:

openoffice.org2.4 -norestore -nofirststartwizard -nologo -headless -pt Cups-PDF sample.ppt

Windows:

"C:\Program Files\OpenOffice.org2.4\program\soffice" -norestore -nofirststartwizard -nologo -headless -pt PDFCreator sample.ppt

Thanks to bikram for this tip.

Converting to image formats

Do you need to convert to .png, .jpg, or .tif? Install ImageMagick. First convert the source document to .pdf, then convert the .pdf to the image format like this:

convert sample.pdf sample.png
convert sample.pdf sample.jpg
convert sample.pdf sample.tif

If the document has multiple pages, ImageMagick creates filenames like sample-0.png.
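If you post-process those per-page files, remember that a plain alphabetical sort puts sample-10.png before sample-2.png. Here is a small Python sketch that collects the pages in numeric order. The helper name page_images is my own, and it assumes the stem-N.ext naming shown above.

```python
import os
import re


def page_images(directory, stem, extension):
    """Return paths of stem-N.extension files in `directory`, ordered by N."""
    pattern = re.compile(
        r"^%s-(\d+)\.%s$" % (re.escape(stem), re.escape(extension))
    )
    pages = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            # Sort on the integer page index, not on the file name.
            pages.append((int(match.group(1)), os.path.join(directory, name)))
    return [path for _, path in sorted(pages)]
```

For example, page_images(".", "sample", "png") lists the pages of sample.pdf in reading order.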

If you prefer a more direct approach, use a script such as unoconv.

Other methods

There are many ways to batch convert files with OpenOffice.org:

7 comments:

Aidan PangChi said...

Great blog. But in "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststart -nologo -headless &
it's -nofirststartwizard and not -nofirststart

Andrew Z said...

Aidan PangChi: Nice catch! I fixed that and made small updates.

Anonymous said...

Has anyone an idea how to convert something like html into mediawiki text?

Thank you very much!

Andrew Z said...

Anonymous: Try HTML::WikiConverter which I use online, but you can download it too.

Anonymous said...

Hello,
thx for the solution but i don't want to use the online tool. I want to use my own server. Has anyone an idea?

Thank you !

Andrew Z said...

Are you the same Anonymous? :) If so, the answer is that you can download HTML::WikiConverter for use on your own server.

Frank Woeckener said...

first of all - thank you a lot for the very helpful script - we appreciate it a lot!
When we start OOffice as a headless server on a linux machine and use your script on another host (mainly ppt to html), the OO instance grows endlessly. Limiting it with ulimit to, say, 1 GByte leads to a crash (alloc error) after a while. Has anyone a hint how we could get around that?
Kindly Your's f.woeckener