For a taste of Word HTML, see the section following this article. If you write HTML, a little part of you will die inside every time you see MS Word HTML.
One of the hard parts is extracting ordered and unordered lists, especially since the markup for them varies so much between browsers.
For this PHP project, I needed to accept HTML from a WYSIWYG editor and save it into a CMS. The HTML needed to be quite restricted; for example, changing font faces and font colors had to be disallowed. Anyway, here is the algorithm I ended up with:
- Find and replace common entities that seem to get mangled in a normal DOMDocument conversion
- Remove comments
- Remove nested FONT tags since DOMDocument discards content inside nested FONT tags
- Extract ordered and unordered lists with regular expressions
- Convert B and I tags to STRONG and EM respectively, and disallow any attributes on any of those
- Replace each opening TABLE tag with a simple TABLE tag containing only a CSS class and a cellspacing of 0
- Load into DOMDocument
- Remove any tags not on the whitelist
- Remove any attributes not on the whitelist
- Disable href and src attributes that do not contain an http, https or relative link
- Strip out disallowed CSS styles from the style attribute
- Remove CSS classes that Word commonly uses
- Dump back to HTML
- Remove DOCTYPE, HTML and BODY tags that DOMDocument automatically adds
- Use Tidy to clean up any additional problems, add P tags to plain text and remove duplicate white space
- Remove the linefeeds that Tidy leaves (we are sending content to Adobe Flex which parses HTML but interprets all white space literally)
- Remove spans that have no attributes and thus no effect
- Remove empty paragraphs
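
A rough sketch of the DOMDocument steps in the middle of that list (load, whitelist tags, whitelist attributes, check href and src, dump back without the wrapper) might look something like this. It is not the project's actual code: the names `sanitizeFragment` and `isAllowedUrl` and both whitelists are made up for illustration. It also assumes that "removing" a tag means unwrapping it while keeping its contents, and that "disabling" a link means dropping the attribute. Filtering the style attribute is left out.

```php
<?php
// Hypothetical whitelists; the real ones depend on what the CMS allows.
const ALLOWED_TAGS  = ['p', 'br', 'strong', 'em', 'ul', 'ol', 'li', 'a', 'img', 'span', 'table', 'tr', 'td'];
const ALLOWED_ATTRS = ['href', 'src', 'class'];

// True for http, https, or scheme-less (relative) URLs.
function isAllowedUrl(string $url): bool
{
    // Browsers ignore whitespace and control characters inside a scheme,
    // so strip them before inspecting it.
    $url = preg_replace('/[\x00-\x20]+/', '', $url);
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }
    return true; // no scheme: treat as a relative link
}

function sanitizeFragment(string $html): string
{
    // Wrap the fragment ourselves so we know exactly what to peel off later,
    // and declare UTF-8 so DOMDocument does not guess the encoding.
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
             . '<body>' . $html . '</body></html>';

    $doc  = new DOMDocument();
    $prev = libxml_use_internal_errors(true); // Word markup triggers many parser warnings
    $doc->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $body  = $doc->getElementsByTagName('body')->item(0);
    $xpath = new DOMXPath($doc);

    // XPath results are a snapshot, so the tree can be modified while looping.
    foreach ($xpath->query('.//*', $body) as $node) {
        if (!in_array($node->nodeName, ALLOWED_TAGS, true)) {
            // Unwrap: move the children up, then drop the tag itself.
            // (A stricter version would probably delete SCRIPT and STYLE with their contents.)
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
            continue;
        }

        // Copy the attribute list first; it changes as attributes are removed.
        foreach (iterator_to_array($node->attributes) as $attr) {
            $name = $attr->nodeName;
            if (!in_array($name, ALLOWED_ATTRS, true)) {
                $node->removeAttribute($name);
            } elseif (($name === 'href' || $name === 'src') && !isAllowedUrl($attr->nodeValue)) {
                $node->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of BODY, which skips the DOCTYPE, HTML and BODY wrappers.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling `sanitizeFragment()` on a pasted paragraph should strip things like FONT tags, onclick handlers and `javascript:` links while leaving the text and any relative or http(s) links in place. The regex-based steps before it and the Tidy pass after it would still be needed around this.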
It feels good to get that worked out!