Tim Golden
2008-08-19 15:58:54 UTC
Hi, all !
I'm a totally newbie huh:)
I want to convert MS WORD docs to HTML, I found python windows
extension win32com can make this. But I can't find the method, and I
can't find any document helpful.
You have broadly two approaches here, both
involving automating Word (ie using the
COM object model it exposes, referred to
in another post in this thread).
1) Use the COM model to have Word load your
doc, and SaveAs it in HTML format. Advantage:
it's relatively straightforward. Disadvantage:
you're at the mercy of whatever HTML Word emits.
2) Use the COM model to iterate over the paragraphs
in your document, emitting your own HTML. Advantage:
you get control. Disadvantage: the more complex your
doc, the more work you have to do. (What do you do with
images, for example? Internal links?)
To do the first, just record a macro in Word to
do what you want and then reproduce the macro
in Python. Something like this:
<code>
import win32com.client
doc = win32com.client.GetObject ("c:/data/temp/songs.doc")
# FileFormat=8 is the value of Word's wdFormatHTML constant
doc.SaveAs (FileName="c:/data/temp/songs.html", FileFormat=8)
doc.Close ()
</code>
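If you'd rather drive Word explicitly than rely on GetObject, something along these lines might serve as a starting point. It's only a sketch: the names html_target and save_as_html are made up for illustration, it assumes pywin32 and Word are installed, and it assumes the script owns the Word instance it quits at the end. The value 10 (wdFormatFilteredHTML) is an alternative format which tends to emit less Office-specific markup.

```python
from pathlib import PureWindowsPath

WD_FORMAT_HTML = 8            # Word's wdFormatHTML
WD_FORMAT_FILTERED_HTML = 10  # Word's wdFormatFilteredHTML


def html_target(doc_path):
    """Return the .html path sitting next to doc_path (hypothetical helper)."""
    return str(PureWindowsPath(doc_path).with_suffix(".html"))


def save_as_html(doc_path, file_format=WD_FORMAT_HTML):
    """Have Word open doc_path and save it as HTML; returns the new path."""
    # Imported here so html_target can be used without pywin32 installed
    import win32com.client

    target = html_target(doc_path)
    word = win32com.client.Dispatch("Word.Application")
    try:
        doc = word.Documents.Open(doc_path)
        try:
            doc.SaveAs(FileName=target, FileFormat=file_format)
        finally:
            doc.Close(0)  # 0 = wdDoNotSaveChanges
    finally:
        # Assumption: nobody else is using this Word instance
        word.Quit()
    return target
```

You'd call it as save_as_html ("c:/data/temp/songs.doc"), passing WD_FORMAT_FILTERED_HTML as the second argument if the default output is too noisy.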
To do the second, you have to roll your own html
doc. Crudely, this would do it:
<code>
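import codecs
import html
import win32com.client

doc = win32com.client.GetObject ("c:/data/temp/songs.doc")
with codecs.open ("c:/data/temp/s2.html", "w", encoding="utf8") as f:
    f.write ("<html><body>\n")
    for para in doc.Paragraphs:
        # Range.Text ends with the paragraph mark ("\r"); drop it,
        # then escape &, < and > so the text is valid HTML
        text = html.escape (para.Range.Text.rstrip ("\r"))
        style = html.escape (para.Style.NameLocal)
        f.write ('<p class="%s">%s</p>\n' % (style, text))
    f.write ("</body></html>\n")
doc.Close ()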
</code>
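Since the point of rolling your own is control over the output, it may help to keep the HTML generation separate from the COM calls so you can try it out without Word running. The sketch below takes plain (style name, text) pairs and makes a few assumptions of its own: the function names are invented for illustration, styles called "Heading 1" to "Heading 6" (the English-language names) become h1 to h6, and empty paragraphs are skipped.

```python
import html
import re


def css_class(style_name):
    """Turn a Word style name such as 'Heading 1' into a single CSS class."""
    return re.sub(r"[^0-9A-Za-z]+", "-", style_name).strip("-").lower()


def tag_for(style_name):
    """Map 'Heading 1'..'Heading 6' to h1..h6; anything else becomes p."""
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    return "h" + match.group(1) if match else "p"


def paragraphs_to_html(paragraphs):
    """Render an iterable of (style_name, text) pairs as a small HTML page."""
    lines = ["<html><body>"]
    for style_name, text in paragraphs:
        text = text.rstrip("\r\n")  # drop Word's trailing paragraph mark
        if not text:
            continue  # skip empty paragraphs
        tag = tag_for(style_name)
        lines.append(
            '<%s class="%s">%s</%s>'
            % (tag, css_class(style_name), html.escape(text), tag)
        )
    lines.append("</body></html>")
    return "\n".join(lines)
```

To feed it from Word you'd pass something like ((p.Style.NameLocal, p.Range.Text) for p in doc.Paragraphs). Images, tables and internal links still need handling of their own.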
TJG