Discussion:
clean up html document created by Word
Peter Otten
2007-03-30 17:29:19 UTC
Permalink
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
The non-python solution:

http://www.w3.org/People/Raggett/tidy/

Peter
jd
2007-03-31 01:17:19 UTC
Permalink
Wow, thanks for all the great responses!

Here's my summary:

- demoronizer (from John Walker) is designed to solve some very
particular problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/


- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style sheet and so the
rendering isn't as nice; the latter does include the style sheet but
it's got slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p>&nbsp;</p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN

BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.

- Tidy with Win2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use a more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/

-- jeff
Peter Otten
2007-03-30 17:40:57 UTC
Permalink
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.
"""
Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word
bulks out HTML files with stuff for round-tripping presentation between
HTML and Word. If you are more concerned about using HTML on the Web, check
out Tidy's "Word-2000" config option! Of course Tidy does a good job on
Word'97 files as well!
"""

Peter
kyosohma
2007-03-30 17:24:45 UTC
Permalink
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
Thanks...
-- jeff
You could try Beautiful Soup at http://www.crummy.com/software/BeautifulSoup/documentation.html

Python is good for parsing HTML/XML, so you could also try googling
Python parsing as well.

Mike
bearophileHUGS
2007-03-30 17:52:23 UTC
Permalink
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
It's not an easy job, and it may require some manual editing, because
that html is the worst I have seen. You can use Tidy, there is a GUI
too, and you can use its suggestions to manually remove the offending
things, at the end Tidy is able to digest it, and return a cleaned up
html. But then you have just started, you need to process it even
more.

A solution is to avoid creating the Html in the first place, or to use
something more like Word 97 to create it. Dreamweaver too is able to
help with Word2000+ trashy html, but usually not enough.

If the structure of the Html document is simple enough, and assuming
you are using Windows, you can open it with Word, save it as RTF,
reopen it with Wordpad, save it again to remove some trash, and then
use something else (like Word 97, or maybe even Aracnophobia, etc) to
convert it to Html. Generally I've never found a really good way to
convert Rtf to a very good Html.

Bye,
bearophile
Laurent Pointal
2007-03-30 21:35:02 UTC
Permalink
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
Thanks...
-- jeff
demoroniser - correct moronic and gratuitously incompatible HTML generated
by Microsoft applications

http://www.fourmilab.ch/webtools/demoroniser/

Unless you want to write your own with Python.
Claudio Grondi
2007-03-30 23:53:18 UTC
Permalink
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
Thanks...
-- jeff
There is a Microsoft add-on for Word which helps to reduce the mess
called 'HTML filter'. Go for it here:

http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN

run it and then use afterwards the other in this thread suggested
'cleaning' methods.

Claudio
jkn
2007-03-30 17:37:13 UTC
Permalink
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm intererested!

jon N
Shane Geiger
2007-03-30 17:44:22 UTC
Permalink
Tidy can now perform wonders on HTML saved from Microsoft Word 2000!
Word bulks out HTML files with stuff for round-tripping presentation
between HTML and Word. If you are more concerned about using HTML on the
Web, check out Tidy's "Word-2000"
<http://www.w3.org/People/Raggett/tidy/#word2000> config option! Of
course Tidy does a good job on Word'97 files as well!
-- source: http://www.w3.org/People/Raggett/tidy/
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.
If Beautiful Soup does, then I'm intererested!
jon N
--
Shane Geiger
IT Director
National Council on Economic Education
sgeiger at ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sgeiger.vcf
Type: text/x-vcard
Size: 310 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20070330/5c2d97e4/attachment.vcf>
jd
2007-03-30 17:20:49 UTC
Permalink
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
Continue reading on narkive:
Loading...