Batch command line file conversion with PyODConverter

Posted by Andrew Z at Wednesday, February 27, 2008 | Permalink

You want to batch convert .doc to .pdf using the command line on a server without a GUI? Or you need automated .ppt to .swf conversion through cron, a sysvinit service, or a remote web server? Online conversion services such as Zamzar.com and Media-convert.com not working for you? Whichever formats you need to batch convert, PyODConverter is a simple Python script for just this purpose.

On Linux

Make sure the headless package is installed. For vanilla OpenOffice.org 2.4, the package is openoffice.org-headless-2.4.0-9286.i586.rpm.

This shell script demonstrates the full use of PyODConverter for Linux:

#!/bin/bash
# Try to autodetect OOFFICE and OOOPYTHON.
OOFFICE=$(ls /usr/bin/openoffice.org2.4 /usr/bin/ooffice /usr/lib/openoffice/program/soffice 2>/dev/null | head -n 1)
OOOPYTHON=$(ls /opt/openoffice.org*/program/python /usr/bin/python 2>/dev/null | head -n 1)
if [ ! -x "$OOFFICE" ]
then
    echo "Could not auto-detect OpenOffice.org binary" >&2
    exit 1
fi
if [ ! -x "$OOOPYTHON" ]
then
    echo "Could not auto-detect OpenOffice.org Python" >&2
    exit 1
fi
echo "Detected OpenOffice.org binary: $OOFFICE"
echo "Detected OpenOffice.org python: $OOOPYTHON"
# Reference: http://wiki.services.openoffice.org/wiki/Using_Python_on_Linux
# If you use the OpenOffice.org that comes with Fedora or Ubuntu, uncomment the following line:
# export PYTHONPATH="/usr/lib/openoffice.org/program"
# If you want to simulate for testing that there is no X server, uncomment the next line.
#unset DISPLAY
# Kill any running OpenOffice.org processes.
killall -u `whoami` -q soffice
# Download the converter script if necessary.
test -f DocumentConverter.py || wget http://www.artofsolving.com/files/DocumentConverter.py
# Start OpenOffice.org in listening mode on TCP port 8100.
"$OOFFICE" "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &
# Wait a few seconds to be sure it has started.
sleep 5
# Convert as many documents as you want serially (but not concurrently).
# Substitute whichever documents you wish.
"$OOOPYTHON" DocumentConverter.py sample.ppt sample.swf
"$OOOPYTHON" DocumentConverter.py sample.ppt sample.pdf
# Close OpenOffice.org.
killall -u `whoami` soffice
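
The fixed sleep in the script above is a guess at how long OpenOffice.org needs to start. A minimal sketch of an alternative is to poll TCP port 8100 until it accepts a connection. The function name `wait_for_port` and its defaults are hypothetical, and the sketch assumes it runs under a system Python 3 rather than the Python bundled with OpenOffice.org:

```python
import socket
import time


def wait_for_port(host="localhost", port=8100, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet: give up at the deadline, otherwise retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A listening port only shows that the process has opened its socket, so a conversion can still fail right after startup. Checking the converter's exit status remains worthwhile.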

Xvfb vs -headless

To use the -headless command line parameter, you must use OpenOffice.org 2.3.0 or later with the RPM package openoffice.org-headless-2.3.1-9238.i586.rpm. If you use an older OpenOffice.org version, you will need Xvfb to simulate an X server where one is not available:

#!/bin/bash
# Set DISPLAY to something besides :0 (because :0 is the standard display).
# Export it so the conversion script started below inherits it.
export DISPLAY=:1000
# Kill any existing virtual framebuffers.
killall -u `whoami` Xvfb
# Start the framebuffer.
Xvfb $DISPLAY -screen 0 800x600x24 &
# Run the OpenOffice.org conversion script above (saved here as ooo-convert.sh).
./ooo-convert.sh
# Clean up
killall -u `whoami` Xvfb

On Windows

On Windows, first download PyODConverter, then convert documents as follows. Adjust the OpenOffice.org pathnames to match the version of OpenOffice.org you have installed.

Windows users don't have to worry about Xvfb or the headless package.

"C:\Program Files\OpenOffice.org2.4\program\soffice.exe" -headless -nologo -norestore -accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager
"C:\Program Files\OpenOffice.org2.4\program\python" DocumentConverter.py test.odt test.pdf

UPDATE: On Windows XP in OpenOffice.org 2.4, you may need to switch the socket to a pipe if you get the following error:

__main__.com.sun.star.connection.NoConnectException: Connector : couldn't connect to socket (WSANOTINITIALISED, WSAStartup() has not been called)
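
As a rough sketch of what switching from a socket to a pipe means for the command line, the helper below builds the -accept argument either way. It assumes the usual UNO connection-string form `pipe,name=<name>`. The function name and the pipe name `pyod` are hypothetical, and DocumentConverter.py would need a matching change to its own connection string, which is not shown here:

```python
def accept_argument(pipe_name=None, host="localhost", port=8100):
    """Build the -accept argument for soffice, using a named pipe if one is given."""
    if pipe_name:
        # Assumption: standard UNO pipe syntax "pipe,name=<name>".
        transport = "pipe,name=%s" % pipe_name
    else:
        # Same socket form as the command lines in this post.
        transport = "socket,host=%s,port=%d" % (host, port)
    return "-accept=%s;urp;StarOffice.ServiceManager" % transport


print(accept_argument())
print(accept_argument(pipe_name="pyod"))
```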

Supported formats

PyODConverter supports all of the formats that you can import or export manually in OpenOffice.org. They include:

odt OpenDocument (ODF) Text
sxw OpenOffice.org version 1 Text
doc Microsoft Word 97/2000/XP
rtf Rich Text Format
txt Plain text
wpd WordPerfect
html Web page
ods OpenDocument (ODF) Spreadsheet
sxc OpenOffice.org version 1 Spreadsheet
xls Microsoft Excel 97/2000/XP
odp OpenDocument (ODF) Presentation
sxi OpenOffice.org version 1 Presentation
ppt Microsoft PowerPoint 97/2000/XP
swf Adobe Flash

To convert .docx, .xlsx, .pptx, substitute OxygenOffice for OpenOffice.org.

To export to .csv, use JODConverter.
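
The extension list above can also serve as a filter when batch converting a whole directory. The following is a minimal sketch, not part of PyODConverter. The names `INPUT_EXTENSIONS` and `plan_conversions` are hypothetical, and swf is left out of the inputs on the assumption that it is used only as an output format:

```python
import os

# Input extensions taken from the list in this post.
# Assumption: swf is treated as output-only, so it is not listed here.
INPUT_EXTENSIONS = {"odt", "sxw", "doc", "rtf", "txt", "wpd", "html",
                    "ods", "sxc", "xls", "odp", "sxi", "ppt"}


def plan_conversions(directory, target_ext):
    """Return (source, destination) path pairs for convertible files in directory."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        ext = ext.lstrip(".").lower()
        # Skip unsupported files and files already in the target format.
        if ext in INPUT_EXTENSIONS and ext != target_ext:
            source = os.path.join(directory, name)
            destination = os.path.join(directory, base + "." + target_ext)
            pairs.append((source, destination))
    return pairs
```

Each pair can then be passed to DocumentConverter.py as the source and destination arguments, one conversion at a time.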

Reliability

When batch converting many documents, keep in mind that the OpenOffice.org process may crash. If it does, the conversion script will fail too. A reliable method is to start OpenOffice.org, convert one document, tear down the process, and repeat; however, this is slow. A compromise is to reuse the OpenOffice.org process for, say, 10 conversions, while using error checking to determine whether the process needs to be restarted sooner.
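
The reuse-with-restart strategy can be sketched as follows. This is only an illustration: `convert_all` and its parameters are hypothetical names, and the three callables stand in for whatever starts OpenOffice.org, runs DocumentConverter.py on one job, and kills the process on your system:

```python
def convert_all(jobs, convert, start_office, stop_office, reuse_limit=10):
    """Convert jobs serially, restarting the office process every reuse_limit
    conversions or right after a failure. Returns the list of failed jobs."""
    failed = []
    used = 0          # conversions attempted with the current office process
    running = False
    for job in jobs:
        if not running:
            start_office()
            running, used = True, 0
        try:
            convert(job)
            ok = True
        except Exception:
            # Treat any error as a sign that the process may be unhealthy.
            ok = False
            failed.append(job)
        used += 1
        if not ok or used >= reuse_limit:
            stop_office()
            running = False
    if running:
        stop_office()
    return failed
```

Failed jobs are collected rather than retried, so the caller can decide whether to run them again with a fresh process.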

JODConverter 3.0 (currently in development) will automatically restart OpenOffice.org in the event of a crash.

PDF printer method

If you just want to generate PDF files, you don't need a Python script, a Basic macro, Java code, or any other kind of programming. Just install a PDF printer such as PDFCreator (Windows) or CUPS-PDF. Then, use the -pt option with the first argument as the printer name and the second argument as the source document.

Linux:

openoffice.org2.4 -norestore -nofirststartwizard -nologo -headless -pt Cups-PDF sample.ppt

Windows:

"C:\Program Files\OpenOffice.org2.4\program\soffice" -norestore -nofirststartwizard -nologo -headless -pt PDFCreator sample.ppt

Thanks to bikram for this tip.

Converting to image formats

Do you need to convert to .png, .jpg, or .tif? Install ImageMagick. First convert the source document to .pdf using any method above, then convert the .pdf to the image format like this:

convert sample.pdf sample.png
convert sample.pdf sample.jpg
convert sample.pdf sample.tif

If the document has multiple pages, ImageMagick creates filenames like sample-0.png.
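
For many PDF files, the same convert invocations can be generated from a script. This is a small sketch with hypothetical function names. It only builds and runs the commands shown above and assumes ImageMagick's convert is on the PATH:

```python
import os
import subprocess


def image_commands(pdf_path, extensions=("png", "jpg", "tif")):
    """Build one ImageMagick `convert` argument list per target image format."""
    base, _ = os.path.splitext(pdf_path)
    return [["convert", pdf_path, base + "." + ext] for ext in extensions]


def run_image_commands(pdf_path):
    """Run each command in turn, raising CalledProcessError if one fails."""
    for command in image_commands(pdf_path):
        subprocess.check_call(command)
```

Calling `image_commands("sample.pdf")` yields the three command lines listed above as argument lists.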

If you prefer a more direct approach, use a script such as unoconv.

Other methods

There are many ways to batch convert files with OpenOffice.org:

OxygenOffice (a Word 2007 .docx importer, for converting between OpenXML formats such as .docx, .pptx, and .xlsx and other formats supported by OpenOffice.org)
Windows: Convert Office Open XML files (.docx, .xlsx, .pptx) to OpenDocument (.odt, .ods, .odp) using command line
Linux: Convert Office Open XML files (.docx, .xlsx, .pptx) to OpenDocument (.odt, .ods, .odp) using command line
Windows: Convert Office Open XML files (.docx, .xlsx, .pptx) to Microsoft Office binary files (.doc, .xls, .ppt)
JODConverter (robust method using Java)
Moving to OpenOffice: Batch Converting Legacy Documents (uses a Basic macro)
ooo2any (another Python script)

54 comments:

Unknown said...: Great blog. But in "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststart -nologo -headless &
it's -nofirststartwizard and not -nofirststart; April 9, 2008 at 2:42 AM
Andrew Z said...: Aidan PangChi: Nice catch! I fixed that and made small updates.; April 12, 2008 at 10:33 PM
Anonymous said...: Has anyone an idea how to convert something like HTML into MediaWiki text?

Thank you very much!; April 17, 2008 at 5:39 AM
Andrew Z said...: Anonymous: Try HTML::WikiConverter which I use online, but you can download it too.; April 17, 2008 at 7:02 AM
Anonymous said...: Hello,
thx for the solution, but I don't want to use the online tool. I want to use my own server. Has anyone an idea?

Thank you !; April 18, 2008 at 9:41 AM
Andrew Z said...: Are you the same Anonymous? :) If so, the answer is you can download HTML::WikiConverter for use on your own server.; April 19, 2008 at 8:01 AM
Anonymous said...: first of all - thank you a lot for the very helpful script - we appreciate it a lot!
When we start OOffice as headless server on a linux machine and use your script on another host (mainly ppt to html) the OO-instance grows endless. Limiting it with ulimit to say 1GByte leads to a crash (alloc error) after a while. Has anyone a hint how we could get around that?
Kindly Your's f.woeckener; May 11, 2008 at 2:36 PM
J.Naveen said...: This is very nice, your information helped me, but I want a converter program which can convert a batch file into a shell script file. Please tell me a way to do that.; May 14, 2008 at 4:56 AM
Anonymous said...: Can use the same script in BROffice ?!; July 2, 2008 at 3:26 PM
Andrew Z said...: Anonymous: Basically, yes, you can use it in BROffice. You may need to adjust the command line slightly to account for any difference in BROffice's different pathnames.; July 2, 2008 at 6:43 PM
J.Naveen said...: Please any body tell me a way to extract Text data from the pdf file with java program.; July 9, 2008 at 10:05 PM
Unknown said...: FlexiDoc Server is a product that combines OpenOffice.org, Apache and Wiki in a single. You may give it a try; look for more info here: www.flexidocserver.com; August 25, 2008 at 8:49 AM
Anonymous said...: I am getting following error when using PyODConverter:
Traceback (most recent call last):
  File "./DocumentConverter.py", line 144, in <module>
    converter.convert(argv[1], argv[2])
  File "./DocumentConverter.py", line 76, in convert
    document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, self._toProperties(Hidden=True))
__main__.IllegalArgumentException: URL seems to be an unsupported one.

Any clues?; September 27, 2008 at 3:31 PM
Paolo Benvenuto said...: Does it work with OO.o 3.0?

I can't get it working, i get an IO Exception; October 21, 2008 at 7:02 AM
Anonymous said...: I have successfully used this to convert a ppt to pdf, however when I try to convert to html I get "ERROR! ErrorCodeIOException 2074". Has anyone used this to convert ppt to html? Do I need to do something different when specifying the "output file" since it should create a number of files to render the ppt in html?; November 3, 2008 at 3:55 PM
Anonymous said...: As a followup to my previous post about converting ppt to html - boy am I feeling foolish! I haven't fully backtracked to see what exactly I did to get things working, but I have had success. I think it may be because I didn't have necessary additional OpenOffice packages installed (I added the SDK, draw, and graphicfilter to my installation).; November 3, 2008 at 4:51 PM
Anonymous said...: Hello

Anyone can help me with following errors?

/usr/lib/openoffice.org/program/soffice.bin X11 error: Can't open display:
Set DISPLAY environment variable, use -display option
or check permissions of your X-Server
(See "man X" resp. "man xhost" for details)
error opening security policy file /usr/lib/xserver/SecurityPolicy
Traceback (most recent call last):
File "DocumentConverter.py", line 13, in ?
import uno
ImportError: No module named uno
Traceback (most recent call last):
File "DocumentConverter.py", line 13, in ?
import uno
ImportError: No module named uno
soffice: no process killed
Could not init font path element unix/:7100, removing from list!
FreeFontPath: FPE "built-ins" refcount is 2, should be 1; fixing.; December 31, 2008 at 11:04 AM
Andrew Z said...: Crirus: The first error about DISPLAY means you either need to use Xvfb or use -headless; December 31, 2008 at 11:09 AM
Anonymous said...: I am ..I spent three days figuring this out

I use OO 2.3

Whatever I try to launch soffice it keeps bugging me about X Display

Any more thoughts?; January 3, 2009 at 8:47 AM
Anonymous said...: Can anyone point me to a clean installation of OOo 2.4 headless on CentOS 5?

I spent enough figuring out on bad versions...

Thanks; January 3, 2009 at 10:43 AM
Andrew Z said...: Crirus: Clean copies of OpenOffice.org 2.4 are available from http://download.openoffice.org/2.4.2/index.html; January 3, 2009 at 4:12 PM
Mike said...: This is great. I'm having an issue though, xls files with multiple worksheets only convert the first one. Any idea how to get around that?; May 20, 2009 at 3:42 PM
Andrew Z said...: MScappa: Sorry, I am not sure. Try asking the author of PyODConverter at his web site.; June 8, 2009 at 10:04 PM
Anonymous said...: You may be missing the package of "openoffice.org-pyuno"; July 14, 2009 at 1:57 PM
Yargon said...: I am getting the following error when running the script

Traceback (most recent call last):
File "DocumentConverter.py", line 13, in <module>
import uno
ImportError: No module named uno

any idea how to fix this?; July 21, 2009 at 1:44 AM
Andrew Z said...: Yargon: Are you on Ubuntu? Try 'apt-get install openoffice.org-uno' (or a similar name, I am recalling this from memory); July 21, 2009 at 7:18 AM
Yargon said...: I'm using gentoo, I couldn't find any ebuild with uno in it. (no keywords either).

:S; July 23, 2009 at 6:21 AM
Andrew Z said...: Yargon: You need something to install UNO.py. Maybe you can search for that filename. If you can't build it, you can download the vanilla RPMs and "unzip" them using this command:
for i in *.rpm; do rpm2cpio "$i" | cpio -ivd; done; July 23, 2009 at 9:12 AM
Anonymous said...: Does this only work with openoffice running as a service? Would it be possible to manually create the openoffice process and do the conversion all in one command?; July 30, 2009 at 3:40 AM
Andrew Z said...: Anonymous: OpenOffice.org doesn't naturally run as a service, and the first script shows you how to do it all in "one command" (considering the script itself as one command).; July 30, 2009 at 8:25 AM
Anonymous said...: Small diff to allow ODS to CSV conversion :

--- /home/alex/téléchargements/DocumentConverter.py 2008-05-05 19:09:05.000000000 +0200
+++ DocumentConverter.py 2009-08-05 08:22:45.259555899 +0200
@@ -41,7 +41,8 @@
"xls": { FAMILY_SPREADSHEET: "MS Excel 97" },
"odp": { FAMILY_PRESENTATION: "impress8" },
"ppt": { FAMILY_PRESENTATION: "MS PowerPoint 97" },
- "swf": { FAMILY_PRESENTATION: "impress_flash_Export" }
+ "swf": { FAMILY_PRESENTATION: "impress_flash_Export" },
+ "csv": { FAMILY_SPREADSHEET: "Text - txt - csv (StarCalc)" }
}
# see http://wiki.services.openoffice.org/wiki/Framework/Article/Filter
# for more available filters; August 5, 2009 at 1:51 AM
Anonymous said...: hi all
I'm using PyODConverter with OpenOffice 3 to convert ppt to swf from the command line. The problem is that in the resulting swf, each ppt frame is put on 2 frames. It means that you'll find frame 1 of the ppt on frames 1 and 2 of the swf. I need 1 frame per image. Any clue?; September 16, 2009 at 10:08 PM
Brian Fenton said...: Hi, thanks for this amazing piece of work. One thing I'm wondering is have you ever looked into isolating the part of OpenOffice that does the conversion, so that you could avoid installing the entire OpenOffice suite? It must be a small component inside there somewhere! :-)

cheers; October 7, 2009 at 11:21 AM
Andrew Z said...: Brian: You can't really save space by installing part of OpenOffice.org. OpenOffice.org shares most of its code. Well, you probably could break it up ("anything is possible"), but it would take an impractically-large effort.; October 7, 2009 at 2:15 PM
Yousef Sawalha said...: Dear all,

I am using PyODConverter with OpenOffice 3 to convert doc, docx and rtf documents to HTML. It's working fine for me, but I still have a problem with embedded images.

Is there any way to ignore embedded images? I don't want to have them in the exported HTML file.

Please Help :(; October 29, 2009 at 6:22 AM
Sublime1 said...: Hi Andrew,

you've mentioned exporting to CSV but what about importing CSVs? I've tried but had no success. I tried adding this line but it didn't work
"csv": { FAMILY_SPREADSHEET: "MS Excel 97" },

Any suggestions?
thanks
Brian; November 2, 2009 at 12:53 PM
Yousef Sawalha said...: Dear Andrew,

I am using PyODConverter with OpenOffice 3 to convert doc, docx and rtf documents to HTML. It's working fine for me, but I still have a problem with embedded images.

Is there any way to ignore embedded images? I don't want to have them in the exported HTML file.

Please Help :(; November 16, 2009 at 7:03 AM
Bob Coret said...: Is it possible to use PyODConverter to convert PDF doc's to PDF/A?; January 19, 2010 at 7:50 AM
Unknown said...: Hello Guys,

I have a quick question. I am using DocumentConverter.py with OpenOffice 3 in a windows environment. When i used command prompt to convert documents everything works perfectly, with no problem. However, when I attempt to convert from PHP on a webserver via command line, using exec() - see below, everything goes through but no conversion occurs. So, for example, the soffice.exe is started (soffice.bin & soffice.exe) with NO problem. However, in the second line, no document conversion happens. I am hoping that this is something simple. I have tried everything but to no avail. I have tried all sorts of variations but with no success. I am on Windows XP.

//Convert source to PDF...
//Start Service
exec('"c:\\Program Files\\OpenOffice.org 3\\program\\soffice.exe" -headless -nologo -norestore -accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager'); ----->THIS WORKS FINE

//Convert File
exec('"c:\\Program Files\\OpenOffice.org 3\\program\\python" c:/wamp/www/DocumentConverter.py xr2.docx xr2.pdf'); -------->THIS DOES NOT WORK: GIVES NO ERRORS BUT NO CONVERSION OCCURS; January 19, 2010 at 8:24 AM
Sublime1 said...: Hello Ethan-Anthony, it's probably just your exec call. I notice you're using a mixture of forward and back slashes. Play around with making exec calls to other Windows programs that accept parameters until you get it right.

Brian; January 20, 2010 at 2:42 AM
Luis said...: Hello, I am getting this error when trying to use the script. I'm using Ubuntu, and have installed OpenOffice 2.4 with the headless library and Python 2.3 with the uno library. Can anyone help me, please?

Traceback (most recent call last):
  File "DocumentConverter.py", line 224, in <module>
    converter.convert(argv[1], argv[2])
  File "DocumentConverter.py", line 143, in convert
    document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, self._toProperties(loadProperties))
__main__.IllegalArgumentException: URL seems to be an unsupported one.; January 26, 2010 at 3:14 PM
Luis said...: I'm sorry, it's Python 2.5, not 2.3; January 26, 2010 at 3:39 PM
Luis said...: OK, after looking around I found that the openoffice.org-writer package was missing from my installation of OpenOffice. Another tip: check the name of the process for your OpenOffice application. On my system it's ooffice.bin instead of soffice, so be sure you are killing the right process. Then I restarted the ooffice service and the conversion was made.; January 26, 2010 at 6:28 PM
inder said...: Can i convert .doc file to .ppt file in such a way that i can get full functionality of ppt files. reply me soon this is my project.; June 2, 2010 at 12:08 PM
howtomakemoney said...: hey good info... but how can i use it in web scripts like php,jsp etc... i want to create a site for converting documents from one form to another.. .plz help me guys... plz help me with scripts...; August 6, 2010 at 10:36 AM
e_metib said...: I too am interested in scripting PHP to convert uploads to pdf to swf.

There is little to no information I have googled that has been of any help.; November 15, 2010 at 5:26 AM
e_metib said...: Wish to PHP script uploads from ppt to pdf conversion and pdf to swf conversion plus audio addition via a php script.

I have googled this for months and so far no luck. Please contact me or point me to a site which has open source scripts for this.; November 15, 2010 at 5:38 AM
Reynold P J said...: Hai,

First of all thanks for this wonderful article:)

I can't convert doc format to csv or excel. Is this a bug in openoffice?
==========
# unoconv -vvv -f csv test.doc
Verbosity set to level 2
Connection type: socket,host=localhost,port=2002;urp;StarOffice.ComponentContext
Existing listener not found.
Connector : couldn't connect to socket (Success)
Launching our own listener using /usr/lib/openoffice.org3/program/soffice.bin.
OpenOffice listener successfully started. (pid=24575)
Input file: test.doc
Selected output format: Text CSV [.csv]
Selected ooffice filter: Text - txt - csv (StarCalc)
Used doctype: spreadsheet
unoconv: UnoException during conversion in :
ERROR: The provided document cannot be converted to the desired format. (code: 2074)
[root@convertmagic unoconv-0.4]#
==========

Thanks in advance.; November 26, 2010 at 11:31 AM
Unknown said...: I am getting following error when using PyODConverter:
Traceback (most recent call last):
  File "./DocumentConverter.py", line 144, in <module>
    converter.convert(argv[1], argv[2])
  File "./DocumentConverter.py", line 76, in convert
    document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, self._toProperties(Hidden=True))
__main__.IllegalArgumentException: URL seems to be an unsupported one.

sudo apt-get install openoffice.org-writer

solved the problem; April 9, 2011 at 11:20 PM
Anonymous said...: using batch via cups-pdf printer on xubuntu, the printer was not named "Cups-Pdf", but plain "PDF":
openoffice.org -norestore -nofirststartwizard -nologo -headless -pt PDF mydoc.odt

this, anyway, stores PDF's under ~/PDF
:-S

have to investigate..

cheers
:m); September 12, 2011 at 6:28 AM
sleeping8 said...: hi andrew - merry christmas and happy new year - I don't know what i have gotten into. life has become harder by becoming easier! I got myself python!

I want lines of text from .txt file into separate spreadsheets.

I:\spreadsheets-pages>"h:\Program Files\LibreOffice 3.4\program\soffice" -accept="socket,port=8100;urp;

I:\spreadsheets-pages>"h:\Program Files\LibreOffice 3.4\program\python" DocumentConverter.py test.txt test.ods
DocumentConverter.py:116: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
self.message = message
DocumentConverter.py:119: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
return self.message
ERROR! unsupported conversion: from 'Text' to 'ods'

that was sad! I tried test.txt to test.odt - works fine .odt to .ods does not work.

renamed test.txt to test.csv and then pyod it for test.odt.

now I have spreadsheets ok but have to merge the row!!.

What I want is a line of text in single cells for many rows!

do you know a way out? thanks; December 28, 2011 at 6:34 PM
ADS said...: Hi there,
The writeup mentions using the program as a python module (ie not calling it from the command prompt) but doesn't give any instructions or details. Maybe it's obvious, but could someone explain how to do this to me?

Thanks a lot,
Alex; January 27, 2012 at 1:36 PM
Anonymous said...: Hi Guys,

I am converting ppt to html file using unoconv library. It convert total slide into text format or image format but I want html page having text with image.

Anyone help me to use unoconv library to get html page as per required. If you have any more library for the same? please guide.

Thanks in advance,
Yash; May 22, 2012 at 3:51 AM