Wrangling MS Word’s HTML

For a taste of Word HTML, see the section following this article. If you write HTML, a little part of you will die inside every time you see MS Word HTML.

One of the hard parts is extracting ordered and unordered lists, especially since the markup Word produces for them varies so much between browsers.

For this PHP project, I needed to accept HTML from a WYSIWYG editor and save it into a CMS. The HTML needed to be quite restricted; for example, changing font faces and font colors had to be disallowed. Anyway, here is the algorithm I ended up with:

  1. Find and replace common entities that seem to get mangled in a normal DOMDocument conversion
  2. Remove comments
  3. Remove nested FONT tags since DOMDocument discards content inside nested FONT tags
  4. Extract ordered and unordered lists with regular expressions
  5. Convert B and I tags to STRONG and EM respectively, and disallow any attributes on any of those
  6. Replace TABLE tags with a simple TABLE tag containing only a CSS class and a cellspacing of 0
  7. Load into DOMDocument
  8. Remove any tags not on the whitelist
  9. Remove any attributes not on the whitelist
  10. Disable href and src attributes that do not point to an http, https or relative URL
  11. Strip out disallowed CSS styles from the style attribute
  12. Remove CSS classes that Word commonly uses
  13. Dump back to HTML
  14. Remove DOCTYPE, HTML and BODY tags that DOMDocument automatically adds
  15. Use Tidy to clean up any additional problems, add P tags to plain text and remove duplicate white space
  16. Remove the linefeeds that Tidy leaves (we are sending content to Adobe Flex which parses HTML but interprets all white space literally)
  17. Remove spans that have no attributes and thus no effect
  18. Remove empty paragraphs
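
To make the DOMDocument part of that list a bit more concrete, here is a minimal sketch of steps 7 to 10 and 13 to 14. It is not the code from the project: the whitelists, the function names `isSafeUrl` and `sanitizeFragment`, and the choice to unwrap a disallowed tag (keeping its children) rather than delete it outright are all assumptions made for illustration.

```php
<?php
// Hypothetical whitelists for illustration; the real lists are not given in the article.
const ALLOWED_TAGS  = ['p', 'br', 'em', 'strong', 'a', 'img', 'ul', 'ol', 'li', 'span'];
const ALLOWED_ATTRS = ['href', 'src', 'class'];

// Step 10: accept http, https and scheme-less (relative) URLs only.
function isSafeUrl(string $url): bool
{
    // Drop whitespace and control characters that browsers ignore inside a scheme.
    $url = preg_replace('/[\x00-\x20]+/', '', $url);
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }
    return true; // no scheme, so treat it as a relative link
}

function sanitizeFragment(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // Word HTML produces many parser warnings
    // Step 7: wrap the fragment so the encoding is declared and BODY is easy to find.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body  = $doc->getElementsByTagName('body')->item(0);
    $xpath = new DOMXPath($doc);
    // Copy the result into an array so removing nodes does not disturb the iteration.
    foreach (iterator_to_array($xpath->query('.//*', $body)) as $node) {
        if (!in_array(strtolower($node->nodeName), ALLOWED_TAGS, true)) {
            // Step 8 (assumption): unwrap the tag but keep its children.
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
            continue;
        }
        foreach (iterator_to_array($node->attributes) as $attr) {
            $name  = strtolower($attr->nodeName);
            $isUrl = in_array($name, ['href', 'src'], true);
            // Steps 9 and 10: drop unlisted attributes and unsafe URLs.
            if (!in_array($name, ALLOWED_ATTRS, true) || ($isUrl && !isSafeUrl($attr->nodeValue))) {
                $node->removeAttribute($attr->nodeName);
            }
        }
    }

    // Steps 13 and 14: serialize only BODY's children, which skips DOCTYPE, HTML and BODY.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling `sanitizeFragment()` on a paragraph that contains a FONT tag, a style attribute and a `javascript:` link should give back the paragraph with the FONT wrapper gone, the style attribute stripped and the bad href removed. Unwrapping is probably too lenient for SCRIPT and STYLE elements, whose contents you would more likely want to drop entirely.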

It feels good to get that worked out!
