<h2>Batch command line file conversion with PyODConverter</h2>
<p>Published 2008-02-27; updated 2008-04-12.</p>
<p>You want to batch convert .doc to .pdf using the command line on a server without a GUI? Or you need automated .ppt to .swf conversion through cron, a sysvinit service, or a remote web server? Online conversion services such as <a href="http://www.oooninja.com/2008/02/zamzarcom-docx-odt-converter-review.html">Zamzar.com</a> and Media-convert.com not working for you? Whichever formats you need to batch convert, <a href="http://www.artofsolving.com/opensource/pyodconverter">PyODConverter</a> is a simple Python script for just this purpose.</p>
<h3>On Linux</h3>
<p>Make sure you have the headless package installed. For the <a href="http://katana.oooninja.com/w/editions_of_openoffice.org">vanilla</a> OpenOffice.org 2.4, the package is named openoffice.org-headless-2.4.0-9286.i586.rpm.</p>
<p>This shell script demonstrates the full use of PyODConverter for Linux:</p>
<code>#!/bin/bash
# Try to autodetect OOFFICE and OOOPYTHON.
OOFFICE=$(ls /usr/bin/openoffice.org2.4 /usr/bin/ooffice /usr/lib/openoffice/program/soffice 2>/dev/null | head -n 1)
OOOPYTHON=$(ls /opt/openoffice.org*/program/python /usr/bin/python 2>/dev/null | head -n 1)
if [ ! -x "$OOFFICE" ]
then
echo "Could not auto-detect OpenOffice.org binary" >&2
exit 1
fi
if [ ! -x "$OOOPYTHON" ]
then
echo "Could not auto-detect OpenOffice.org Python" >&2
exit 1
fi
echo "Detected OpenOffice.org binary: $OOFFICE"
echo "Detected OpenOffice.org python: $OOOPYTHON"
# Reference: http://wiki.services.openoffice.org/wiki/Using_Python_on_Linux
# If you use the OpenOffice.org that comes with Fedora or Ubuntu, uncomment the following line:
# export PYTHONPATH="/usr/lib/openoffice.org/program"
# If you want to simulate for testing that there is no X server, uncomment the next line.
#unset DISPLAY
# Kill any running OpenOffice.org processes.
killall -u `whoami` -q soffice
# Download the converter script if necessary.
test -f DocumentConverter.py || wget http://www.artofsolving.com/files/DocumentConverter.py
# Start OpenOffice.org in listening mode on TCP port 8100.
$OOFFICE "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &
# Wait a few seconds to be sure it has started.
sleep 5s
# Convert as many documents as you want serially (but not concurrently).
# Substitute whichever documents you wish.
$OOOPYTHON DocumentConverter.py sample.ppt sample.swf
$OOOPYTHON DocumentConverter.py sample.ppt sample.pdf
# Close OpenOffice.org.
killall -u `whoami` soffice</code>
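<p>The fixed <code>sleep 5s</code> in the script above is only a guess at how long OpenOffice.org takes to start. As a hedged alternative, here is a minimal sketch (assuming Python 3) that polls the TCP port until it accepts a connection. The name <code>wait_for_port</code> and its default values are hypothetical and are not part of PyODConverter.</p>

```python
import socket
import time


def wait_for_port(host="localhost", port=8100, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout` seconds.
    Hypothetical helper; not part of PyODConverter.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

<p>A wrapper could call <code>wait_for_port()</code> after launching OpenOffice.org and give up with an error if it returns False, rather than running the converter against a process that never came up.</p>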
<h3>Xvfb vs -headless</h3>
<p>To use the <a rel="external" href="http://blogs.linux.ie/caolan/2007/05/04/headless-ooo/">-headless</a> command line parameter, you must use OpenOffice.org 2.3.0 or later with the RPM package openoffice.org-headless-2.3.1-9238.i586.rpm. If you use an older OpenOffice.org version, you will need Xvfb to simulate an X server where one is not available:</p>
<code>#!/bin/bash
# Set DISPLAY to something besides :0 (because :0 is the standard display),
# and export it so that the conversion script inherits it.
export DISPLAY=:1000
# Kill any existing virtual framebuffers.
killall -u `whoami` Xvfb
# Start the framebuffer.
Xvfb $DISPLAY -screen 0 800x600x24 &
# Run the OpenOffice.org conversion script above (saved here as ooo-convert.sh).
ooo-convert.sh
# Clean up
killall -u `whoami` Xvfb</code>
<h3>On Windows</h3>
<p>On Windows, convert documents as follows. First download <a href="http://www.artofsolving.com/opensource/pyodconverter">PyODConverter</a>, then adjust the OpenOffice.org pathnames to match the version of OpenOffice.org you have installed.</p>
<p>Windows users don't have to worry about Xvfb or the headless package.</p>
<code>"C:\Program Files\OpenOffice.org2.4\program\soffice.exe" -headless -nologo -norestore -accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager
"C:\Program Files\OpenOffice.org2.4\program\python" DocumentConverter.py test.odt test.pdf</code>
<p>UPDATE: On Windows XP in OpenOffice.org 2.4, you may need to switch the socket <a rel="external nofollow" href="http://www.oooforum.org/forum/viewtopic.phtml?t=70223">to a pipe</a> if you get the following error:</p>
<blockquote>__main__.com.sun.star.connection.NoConnectException: Connector : couldn't connect to socket (WSANOTINITIALISED, WSAStartup() has not been called)</blockquote>
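<p>To make the socket-versus-pipe switch concrete, the sketch below builds the matching pair of strings for each transport: the <code>-accept</code> argument given to soffice and the UNO URL a client connects to. The function names and the pipe name <code>pyodconverter</code> are hypothetical. It is also an assumption that your copy of DocumentConverter.py builds a <code>uno:socket,...</code> URL internally, in which case that line would have to be edited by hand to use the pipe form.</p>

```python
def _urls(connection):
    """Return (soffice -accept argument, client UNO URL) for a connection string."""
    accept = "-accept=%s;urp;StarOffice.ServiceManager" % connection
    client = "uno:%s;urp;StarOffice.ComponentContext" % connection
    return accept, client


def socket_urls(host="localhost", port=8100):
    """TCP transport, as used in the commands above."""
    return _urls("socket,host=%s,port=%d" % (host, port))


def pipe_urls(name="pyodconverter"):
    """Named-pipe transport; the pipe name is an arbitrary, hypothetical label."""
    return _urls("pipe,name=%s" % name)
```

<p>Both sides must use the same transport and the same name or port; mixing a pipe on one side with a socket on the other fails to connect.</p>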
<h3>Supported formats</h3>
<p>PyODConverter supports all of the formats that you can import or export manually in OpenOffice.org. They include:</p>
<ul>
<li>odt OpenDocument (ODF) Text</li>
<li>sxw OpenOffice.org version 1 Text</li>
<li>doc Microsoft Word 97/2000/XP</li>
<li>rtf Rich Text Format</li>
<li>txt Plain text</li>
<li>wpd WordPerfect</li>
<li>html Web page</li>
<li>ods OpenDocument (ODF) Spreadsheet</li>
<li>sxc OpenOffice.org version 1 Spreadsheet</li>
<li>xls Microsoft Excel 97/2000/XP</li>
<li>odp OpenDocument (ODF) Presentation</li>
<li>sxi OpenOffice.org version 1 Presentation</li>
<li>ppt Microsoft PowerPoint 97/2000/XP</li>
<li>swf Adobe Flash</li>
</ul>
<p>To convert .docx, .xlsx, .pptx, substitute <a href="http://www.oooninja.com/2008/02/word-2007-docx-converter-oxygenoffice.html">OxygenOffice</a> for OpenOffice.org.</p>
<p>To export to .csv, use <a href="http://www.artofsolving.com/opensource/jodconverter">JODConverter</a>.</p>
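<p>For a real batch you will usually want to turn a list of filenames into one converter command per document. The following is a minimal sketch of that step, not part of PyODConverter: <code>plan_jobs</code>, <code>SOURCE_EXTENSIONS</code>, and the default interpreter name are hypothetical, and the extension set simply mirrors the import formats listed above (.swf is left out because it is an export target).</p>

```python
import os

# Extensions from the list above that can serve as conversion sources.
SOURCE_EXTENSIONS = {
    ".odt", ".sxw", ".doc", ".rtf", ".txt", ".wpd", ".html",
    ".ods", ".sxc", ".xls", ".odp", ".sxi", ".ppt",
}


def plan_jobs(filenames, target_ext, python="python", script="DocumentConverter.py"):
    """Return one command (as an argument list) per convertible file.

    Files with unknown extensions, or already in the target format, are skipped.
    """
    jobs = []
    for name in sorted(filenames):
        base, ext = os.path.splitext(name)
        ext = ext.lower()
        if ext not in SOURCE_EXTENSIONS or ext == target_ext.lower():
            continue
        jobs.append([python, script, name, base + target_ext])
    return jobs
```

<p>Each returned list can be handed to <code>subprocess.call</code> one at a time, which keeps the conversions serial as the shell script requires. Pass the OpenOffice.org Python path as <code>python</code> if the system interpreter lacks the UNO bindings.</p>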
<h3>Reliability</h3>
<p>When batch converting many documents, plan for the OpenOffice.org process to crash. If it does, the conversion script will fail too. The most reliable method is to start OpenOffice.org, convert one document, tear down the process, and repeat; however, this is slow. A compromise is to reuse each OpenOffice.org process for a fixed number of conversions, say 10, while checking for errors to determine whether the process needs to be restarted early.</p>
<p><a href="http://www.artofsolving.com/opensource/jodconverter">JODConverter 3.0</a> (currently in development) will automatically restart OpenOffice.org in the event of a crash.</p>
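<p>The reuse-then-restart strategy described in this section can be sketched as a small loop. This is an illustration under stated assumptions, not PyODConverter code: <code>convert_batch</code> and its parameters are hypothetical, and the three callables are supplied by you (for example, <code>convert</code> could run DocumentConverter.py with <code>subprocess.call</code> and treat a nonzero exit status as failure, assuming your copy of the script reports errors that way).</p>

```python
def convert_batch(jobs, start_office, stop_office, convert,
                  max_uses=10, max_attempts=2):
    """Run convert(job) for each job, recycling the office process.

    The process is restarted after `max_uses` conversions, and immediately
    after any failed conversion. Each job is tried up to `max_attempts`
    times. Returns the list of jobs that never succeeded.
    """
    failed = []
    running = False
    uses = 0
    try:
        for job in jobs:
            for _attempt in range(max_attempts):
                if not running:
                    start_office()
                    running = True
                    uses = 0
                ok = convert(job)
                uses += 1
                # Tear down on failure (possible crash) or when worn out.
                if not ok or uses >= max_uses:
                    stop_office()
                    running = False
                if ok:
                    break
            else:
                failed.append(job)
    finally:
        if running:
            stop_office()
    return failed
```

<p>Setting <code>max_uses=1</code> gives the slow but most reliable one-process-per-document behaviour; larger values trade some safety for speed.</p>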
<h3>PDF printer method</h3>
<p>If you just want to generate PDF files, you don't need a Python script, a Basic macro, Java code, or any other kind of programming. Just install a PDF printer such as <a rel="external nofollow" href="http://www.pdfforge.org/products/pdfcreator">PDFCreator</a> (Windows) or <a rel="external nofollow" href="http://cip.physik.uni-wuerzburg.de/~vrbehr/cups-pdf">CUPS-PDF</a>. Then, use the <b>-pt</b> option with the printer name as the first argument and the source document as the second argument.</p>
<p>Linux:</p>
<code>openoffice.org2.4 -norestore -nofirststartwizard -nologo -headless -pt Cups-PDF sample.ppt</code>
<p>Windows:</p>
<code>"C:\Program Files\OpenOffice.org2.4\program\soffice" -norestore -nofirststartwizard -nologo -headless -pt PDFCreator sample.ppt</code>
<p>Thanks to <a rel="nofollow external" href="http://www.oooforum.org/forum/viewtopic.phtml?p=192602#192602">bikram</a> for this tip.</p>
<h3>Converting to image formats</h3>
<p>Do you need to convert to .png, .jpg, or .tif? Install ImageMagick. Then, convert the source document to .pdf with any of the methods above. Finally, convert the .pdf to the image format like this:</p>
<code>convert sample.pdf sample.png
convert sample.pdf sample.jpg
convert sample.pdf sample.tif</code>
<p>If the document has multiple pages, ImageMagick creates filenames like sample-0.png.</p>
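<p>One practical consequence of the numbered filenames: a plain alphabetical sort puts sample-10.png before sample-2.png. The sketch below orders the per-page files by page number instead. The function name <code>sort_pages</code> is hypothetical, and it assumes the <code>name-N.ext</code> pattern described above.</p>

```python
import os
import re


def sort_pages(filenames, output_name):
    """Return the per-page files for `output_name`, ordered by page number.

    For output_name "sample.png" this matches sample-0.png, sample-1.png, ...
    and ignores every other filename.
    """
    base, ext = os.path.splitext(output_name)
    pattern = re.compile(re.escape(base) + r"-(\d+)" + re.escape(ext))
    pages = []
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            pages.append((int(match.group(1)), name))
    return [name for _number, name in sorted(pages)]
```

<p>Feed it the result of <code>os.listdir()</code> on the output directory to get the pages in reading order.</p>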
<p>If you prefer a more direct approach, use a script such as <a href="http://dag.wieers.com/home-made/unoconv/">unoconv</a>.</p>
<h3>Other methods</h3>
<p>There are many ways to batch convert files with OpenOffice.org:</p>
<ul>
<li><a href="http://www.oooninja.com/2008/02/word-2007-docx-converter-oxygenoffice.html">OxygenOffice as a Word 2007 (.docx) importer</a> if you want to convert between OpenXML formats (.docx, .pptx, and .xlsx) and other formats supported by OpenOffice.org</li>
<li><a href="http://www.oooninja.com/2008/01/convert-openxml-documents-in-windows.html">Windows: Convert Office Open XML files (.docx, .xlsx, .pptx) to OpenDocument (.odt, .ods, .odp) using command line</a></li>
<li><a href="http://www.oooninja.com/2008/01/convert-openxml-docx-etc-in-linux-using.html">Linux: Convert Office Open XML files (.docx, .xlsx, .pptx) to OpenDocument (.odt, .ods, .odp) using command line</a></li>
<li><a href="http://www.oooninja.com/2008/02/office-compatibility-pack-review.html">Windows: Convert Office Open XML files (.docx, .xlsx, .pptx) to Microsoft Office binary files (.doc, .xls, .ppt)</a></li>
<li><a rel="external nofollow" href="http://www.artofsolving.com/opensource/jodconverter">JODConverter</a> robust method using Java</li>
<li><a rel="external nofollow" href="http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html">Moving to OpenOffice: Batch Converting Legacy Documents</a> uses a Basic macro</li>
<li><a rel="external nofollow" href="http://mail.python.org/pipermail/python-announce-list/2006-May/004951.html">ooo2any</a> another Python script</li>
</ul>
<p>Andrew Z</p>