tag:blogger.com,1999:blog-8544609315733972726.post-3385375692343761556
2008-02-27T14:45:00.016-07:00
2008-04-12T22:30:53.082-06:00
2008-04-12T22:30:53.082-06:00
Batch command line file conversion with PyODConverter
<p>Do you want to batch convert .doc to .pdf from the command line on a server without a GUI? Do you need automated .ppt to .swf conversion through cron, a sysvinit service, or a remote web server? Are online conversion services such as <a href="http://www.oooninja.com/2008/02/zamzarcom-docx-odt-converter-review.html">Zamzar.com</a> and Media-convert.com not working for you? Whichever formats you need to batch convert, <a href="http://www.artofsolving.com/opensource/pyodconverter">PyODConverter</a> is a simple Python script for just this purpose.</p> <h3>On Linux</h3> <p>Make sure you have the headless package installed. For the <a href="http://katana.oooninja.com/w/editions_of_openoffice.org">vanilla</a> OpenOffice.org 2.4, the package is named openoffice.org-headless-2.4.0-9286.i586.rpm.</p> <p>This shell script demonstrates the full use of PyODConverter for Linux:</p> <code>#!/bin/bash

# Try to autodetect OOFFICE and OOOPYTHON.
OOFFICE=`ls /usr/bin/openoffice.org2.4 /usr/bin/ooffice /usr/lib/openoffice/program/soffice | head -n 1`
OOOPYTHON=`ls /opt/openoffice.org*/program/python /usr/bin/python | head -n 1`
if [ ! -x "$OOFFICE" ]
then
    echo "Could not auto-detect OpenOffice.org binary"
    exit 1
fi
if [ ! -x "$OOOPYTHON" ]
then
    echo "Could not auto-detect OpenOffice.org Python"
    exit 1
fi
echo "Detected OpenOffice.org binary: $OOFFICE"
echo "Detected OpenOffice.org python: $OOOPYTHON"

# Reference: http://wiki.services.openoffice.org/wiki/Using_Python_on_Linux
# If you use the OpenOffice.org that comes with Fedora or Ubuntu, uncomment the following line:
# export PYTHONPATH="/usr/lib/openoffice.org/program"

# If you want to simulate for testing that there is no X server, uncomment the next line.
#unset DISPLAY

# Kill any running OpenOffice.org processes.
killall -u `whoami` -q soffice

# Download the converter script if necessary.
test -f DocumentConverter.py || wget http://www.artofsolving.com/files/DocumentConverter.py

# Start OpenOffice.org in listening mode on TCP port 8100.
$OOFFICE "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &

# Wait a few seconds to be sure it has started.
sleep 5s

# Convert as many documents as you want serially (but not concurrently).
# Substitute whichever documents you wish.
$OOOPYTHON DocumentConverter.py sample.ppt sample.swf
$OOOPYTHON DocumentConverter.py sample.ppt sample.pdf

# Close OpenOffice.org.
killall -u `whoami` soffice</code> <h3>Xvfb vs -headless</h3> <p>To use the <a rel="external" href="http://blogs.linux.ie/caolan/2007/05/04/headless-ooo/">-headless</a> command line parameter, you must use OpenOffice.org 2.3.0 or later together with the matching headless RPM package (for example, openoffice.org-headless-2.3.1-9238.i586.rpm). If you use an older OpenOffice.org version, you will need Xvfb to simulate an X server where one is not available:</p> <code>#!/bin/bash

# Set DISPLAY to something besides :0 (because :0 is usually the standard display).
# Export it so that the conversion script and OpenOffice.org inherit it.
export DISPLAY=:1000

# Kill any existing virtual framebuffers.
killall -u `whoami` Xvfb

# Start the framebuffer.
Xvfb $DISPLAY -screen 0 800x600x24 &

# Run the OpenOffice.org conversion script above (saved in the current directory as ooo-convert.sh).
./ooo-convert.sh

# Clean up.
killall -u `whoami` Xvfb</code> <h3>On Windows</h3> <p>On Windows, convert documents as follows. 
First, download <a href="http://www.artofsolving.com/opensource/pyodconverter">PyODConverter</a> and adjust the OpenOffice.org pathnames to match the version of OpenOffice.org you have installed.</p> <p>Windows users don't have to worry about Xvfb or the headless package.</p> <code>"C:\Program Files\OpenOffice.org2.4\program\soffice.exe" -headless -nologo -norestore -accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager
"C:\Program Files\OpenOffice.org2.4\program\python" DocumentConverter.py test.odt test.pdf</code> <p>UPDATE: On Windows XP with OpenOffice.org 2.4, you may need to switch the socket <a rel="external nofofollow" href="http://www.oooforum.org/forum/viewtopic.phtml?t=70223">to a pipe</a> if you get the following error:</p> <blockquote>__main__.com.sun.star.connection.NoConnectException: Connector : couldn't connect to socket (WSANOTINITIALISED, WSAStartup() has not been called)</blockquote> <h3>Supported formats</h3> <p>PyODConverter supports all of the formats that you can import or export manually in OpenOffice.org. 
They include:</p> <ul> <li>odt OpenDocument (ODF) Text</li> <li>sxw OpenOffice.org version 1 Text</li> <li>doc Microsoft Word 97/2000/XP</li> <li>rtf Rich Text Format</li> <li>txt Plain text</li> <li>wpd WordPerfect</li> <li>html Web page</li> <li>ods OpenDocument (ODF) Spreadsheet</li> <li>sxc OpenOffice.org version 1 Spreadsheet</li> <li>xls Microsoft Excel 97/2000/XP</li> <li>odp OpenDocument (ODF) Presentation</li> <li>sxi OpenOffice.org version 1 Presentation</li> <li>ppt Microsoft PowerPoint 97/2000/XP</li> <li>swf Adobe Flash</li> </ul> <p>To convert .docx, .xlsx, .pptx, substitute <a href="http://www.oooninja.com/2008/02/word-2007-docx-converter-oxygenoffice.html">OxygenOffice</a> for OpenOffice.org.</p> <p>To export to .csv, use <a href="http://www.artofsolving.com/opensource/jodconverter">JODConverter</a>.</p> <h3>Reliability</h3> <p>Especially when batch converting many documents, keep in mind that the OpenOffice.org process may crash. If it does, the conversion script will fail too. A reliable method is to start OpenOffice.org, convert one document, tear down the process, and repeat; however, this is slow. A faster compromise is to reuse the OpenOffice.org process for a fixed number of conversions, say 10, while checking each conversion for errors to determine whether the process needs to be restarted sooner.</p> <p><a href="http://www.artofsolving.com/opensource/jodconverter">JODConverter 3.0 (currently in development)</a> will automatically restart OpenOffice.org in the event of a crash.</p> <h3>PDF printer method</h3> <p>If you just want to generate PDF files, you don't need a Python script, a Basic macro, Java code, or any other kind of programming. Just install a PDF printer such as <a rel="external nofollow" href="http://www.pdfforge.org/products/pdfcreator">PDFCreator</a> (Windows) or <a rel="external nofollow" href="http://cip.physik.uni-wuerzburg.de/~vrbehr/cups-pdf">CUPS-PDF</a> (Linux). 
Then, use the <b>-pt</b> parameter with the printer name as the first argument and the source document as the second argument.</p> <p>Linux:</p> <code>openoffice.org2.4 -norestore -nofirststartwizard -nologo -headless -pt Cups-PDF sample.ppt</code> <p>Windows:</p> <code>"C:\Program Files\OpenOffice.org2.4\program\soffice" -norestore -nofirststartwizard -nologo -headless -pt PDFCreator sample.ppt</code> <p>Thanks to <a rel="nofollow external" href="http://www.oooforum.org/forum/viewtopic.phtml?p=192602#192602">bikram</a> for this tip.</p> <h3>Converting to image formats</h3> <p>Do you need to convert to .png, .jpg, or .tif? Install ImageMagick. First convert the source document to .pdf, then convert the .pdf to the image format like this:</p> <code>convert sample.pdf sample.png
convert sample.pdf sample.jpg
convert sample.pdf sample.tif</code> <p>If the document has multiple pages, ImageMagick creates one file per page, with filenames like sample-0.png.</p> <p>If you prefer a more direct approach, use a script such as <a href="http://dag.wieers.com/home-made/unoconv/">unoconv</a>.</p> <h3>Other methods</h3> <p>There are many ways to batch convert files with OpenOffice.org:</p> <ul> <li><a href="http://www.oooninja.com/2008/02/word-2007-docx-converter-oxygenoffice.html">OxygenOffice as a Word 2007 (.docx) importer</a> if you want to convert between OpenXML formats (.docx, .pptx, and .xlsx) and other formats supported by OpenOffice.org</li> <li><a href="http://www.oooninja.com/2008/01/convert-openxml-documents-in-windows.html">Windows: Convert Office Open XML files (.docx, .xlsx, .pptx) to OpenDocument (.odt, .ods, .odp) using command line</a></li> <li><a href="http://www.oooninja.com/2008/01/convert-openxml-docx-etc-in-linux-using.html">Linux: Convert Office Open XML files (.docx, .xlsx, .pptx) to OpenDocument (.odt, .ods, .odp) using command line</a></li> <li><a href="http://www.oooninja.com/2008/02/office-compatibility-pack-review.html">Windows: Convert Office Open XML files (.docx, 
.xlsx, .pptx) to Microsoft Office binary files (.doc, .xls, .ppt)</a></li> <li><a rel="external nofollow" href="http://www.artofsolving.com/opensource/jodconverter">JODConverter</a> robust method using Java</li> <li><a rel="external nofollow" href="http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html">Moving to OpenOffice: Batch Converting Legacy Documents</a> uses a Basic macro</li> <li><a rel="external nofollow" href="http://mail.python.org/pipermail/python-announce-list/2006-May/004951.html">ooo2any</a> another Python script</li> </ul>
Andrew Z http://www.blogger.com/profile/10108637160465346326 noreply@blogger.com
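<p>The restart strategy described under Reliability could be sketched in Python roughly as follows. This is only an illustration and not part of PyODConverter. The function names (convert_batch, start_office, stop_office, convert_one) are hypothetical, the paths in OOFFICE and OOOPYTHON are placeholders to adjust, and the sketch assumes that DocumentConverter.py exits with a non-zero status when a conversion fails. It is meant to run under a current Python 3 interpreter that drives OpenOffice.org and its bundled Python as child processes.</p>

```python
import getpass
import subprocess
import time

# Assumed paths: adjust them to match your installation.
OOFFICE = "/usr/bin/ooffice"
OOOPYTHON = "/usr/bin/python"
ACCEPT = "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager"


def start_office():
    """Start OpenOffice.org in listening mode, as in the shell script."""
    subprocess.Popen([OOFFICE, ACCEPT, "-norestore",
                      "-nofirststartwizard", "-nologo", "-headless"])
    time.sleep(5)  # give it a few seconds to start listening


def stop_office():
    """Kill this user's OpenOffice.org processes, as in the shell script."""
    subprocess.call(["killall", "-u", getpass.getuser(), "-q", "soffice"])


def convert_one(src, dst):
    """Run one conversion; assumes a non-zero exit status means failure."""
    return subprocess.call([OOOPYTHON, "DocumentConverter.py", src, dst]) == 0


def convert_batch(jobs, start=start_office, stop=stop_office,
                  convert=convert_one, max_reuse=10):
    """Convert (src, dst) pairs serially.

    The office process is restarted after max_reuse conversions, and also
    after a failed conversion, which is then retried once on the fresh
    process. Returns the list of jobs that failed both attempts.
    """
    failed = []
    used = 0
    start()
    try:
        for src, dst in jobs:
            if used >= max_reuse:
                # Scheduled restart after max_reuse conversions.
                stop()
                start()
                used = 0
            ok = convert(src, dst)
            used += 1
            if not ok:
                # The process may have crashed: restart and retry once.
                stop()
                start()
                used = 0
                ok = convert(src, dst)
                used += 1
                if not ok:
                    failed.append((src, dst))
    finally:
        stop()
    return failed
```

<p>A call such as convert_batch([("sample.ppt", "sample.pdf"), ("sample.ppt", "sample.swf")]) would then convert both files on one OpenOffice.org process. The start, stop, and convert callables are passed in as parameters so that the restart logic can be tried out with stand-in functions before pointing it at a real installation.</p>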