Oh boy.
Handling a paste from a Word document into a browser-based WYSIWYG editor is a pain in the butt. I'm using CKEditor, which does have a built-in tool for stripping Word's nasty HTML, but it doesn't work well. I also had no success using HTML Purifier, htmLawed or Tidy alone.
For a taste of Word HTML, see the section following this article. If you write HTML, a little part of you will die inside every time you see MS Word HTML.
One of the hard parts is extracting ordered and unordered lists, especially since the pasted list markup varies so much between browsers.
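To make the list problem concrete: in the IE8 sample at the end of this post, Word emits each list item as a P tag whose style contains mso-list, with the visible bullet or number tucked into a span marked "mso-list: ignore". The following is a minimal sketch of regex-based extraction for that one shape, not the code used in this project. The function name wordListsToHtml is hypothetical, and the patterns assume the marker span wraps exactly one nested spacer span, as it does in that sample.

```php
<?php
// Hypothetical helper (not the author's actual code): rewrites runs of Word
// "list paragraphs" as UL/OL markup before anything else touches the HTML.
function wordListsToHtml(string $html): string
{
    // A run of consecutive <p style="... mso-list: ..."> paragraphs.
    $run = '~(?:<p\b[^>]*\bmso-list\s*:[^>]*>.*?</p>\s*)+~is';

    $result = preg_replace_callback($run, function (array $m): string {
        // Split the run back into individual paragraphs.
        preg_match_all('~<p\b[^>]*>(.*?)</p>~is', $m[0], $items);

        $tag = 'ul';
        $out = '';
        foreach ($items[1] as $i => $body) {
            // Pull out the marker span ("mso-list: ignore") and remember its text.
            // Assumption: it contains one nested spacer span, as in the IE8 sample.
            $marker = '';
            $body = preg_replace_callback(
                '~<span\b[^>]*mso-list\s*:\s*ignore[^>]*>(.*?)</span>\s*</span>~is',
                function (array $mm) use (&$marker): string {
                    $marker = trim(strip_tags($mm[1]));
                    return '';
                },
                $body,
                1
            );
            // The first marker decides the list type: "1.", "a)", "iv." => ordered.
            if ($i === 0 && preg_match('~^(?:\d+|[a-z]+)[.)]~i', $marker)) {
                $tag = 'ol';
            }
            // Keep only a little inline markup inside each item.
            $text = trim(strip_tags((string) $body, '<a><strong><em><b><i>'));
            $out .= '<li>' . $text . '</li>';
        }
        return '<' . $tag . '>' . $out . '</' . $tag . '>';
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

Two limitations of this sketch: a bulleted list immediately followed by a numbered list is treated as a single run, and nested list levels (level2, level3) are flattened. Other browsers produce different markup for the same paste, so each would likely need its own patterns.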
For this PHP project, I needed to accept HTML from a WYSIWYG editor and save it into a CMS. The HTML needed to be quite restricted, e.g. no changing font faces or font colors. Anyway, here is the algorithm I ended up with:
- Find and replace common entities that seem to get mangled in a normal DOMDocument conversion
- Remove comments
- Remove nested FONT tags since DOMDocument discards content inside nested FONT tags
- Extract ordered and unordered lists with regular expressions
- Convert B and I tags to STRONG and EM, respectively, and disallow attributes on any of them
- Replace each TABLE tag with a plain TABLE tag that has only a CSS class and a cellspacing of 0
- Load into DOMDocument
- Remove any tags not on the whitelist
- Remove any attributes not on the whitelist
- Disable href and src attributes that do not contain an http, https or relative link
- Strip out disallowed CSS styles from the style attribute
- Remove CSS classes that Word commonly uses
- Dump back to HTML
- Remove DOCTYPE, HTML and BODY tags that DOMDocument automatically adds
- Use Tidy to clean up any additional problems, add P tags to plain text and remove duplicate white space
- Remove the linefeeds that Tidy leaves (we are sending content to Adobe Flex which parses HTML but interprets all white space literally)
- Remove spans that have no attributes and thus no effect
- Remove empty paragraphs
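The DOMDocument steps in the middle of that list (load, drop tags and attributes not on the whitelist, neutralize suspicious href and src values, dump back to HTML) can be sketched roughly as below. This is an illustration under assumptions, not the script from this project: the function name whitelistHtml and both whitelists are hypothetical, disallowed tags are unwrapped so their text survives, and offending href/src attributes are simply removed. Wrapping the fragment in a known container and serializing only its children is one way to avoid having to strip the DOCTYPE, HTML and BODY tags afterwards.

```php
<?php
// Hypothetical whitelists for illustration; the real ones are not given in the post.
const ALLOWED_TAGS  = ['p', 'a', 'strong', 'em', 'ul', 'ol', 'li', 'sub', 'sup', 's',
                       'table', 'tbody', 'tr', 'td', 'span'];
const ALLOWED_ATTRS = ['href', 'src', 'class', 'style', 'valign', 'width', 'cellspacing'];

function whitelistHtml(string $html): string
{
    // Wrap the fragment in a known container and declare UTF-8 for the parser.
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
             . '</head><body><div id="root">' . $html . '</div></body></html>';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // Word tags such as <o:p> raise parser warnings
    $doc->loadHTML($wrapped);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="root"]')->item(0);

    // Snapshot every element under the container, then clean each one.
    foreach (iterator_to_array($xpath->query('.//*', $root)) as $el) {
        if (!in_array(strtolower($el->nodeName), ALLOWED_TAGS, true)) {
            // Unwrap: move the children up, then drop the tag itself.
            while ($el->firstChild) {
                $el->parentNode->insertBefore($el->firstChild, $el);
            }
            $el->parentNode->removeChild($el);
            continue;
        }
        foreach (iterator_to_array($el->attributes) as $attr) {
            $name = strtolower($attr->nodeName);
            if (!in_array($name, ALLOWED_ATTRS, true)) {
                $el->removeAttribute($attr->nodeName);
                continue;
            }
            if ($name === 'href' || $name === 'src') {
                // Ignore whitespace and control characters, then look for a scheme.
                $url = preg_replace('~[\x00-\x20]+~', '', $attr->nodeValue);
                if (preg_match('~^([a-z][a-z0-9+.\-]*):~i', (string) $url, $m)
                    && !in_array(strtolower($m[1]), ['http', 'https'], true)) {
                    $el->removeAttribute($attr->nodeName);   // e.g. javascript:
                }
            }
        }
    }

    // Serialize only the container's children, so no DOCTYPE/HTML/BODY is emitted.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Filtering the contents of the style attribute, the Tidy pass and the final cleanup of empty spans and paragraphs are left out here. The URL check is deliberately conservative (any scheme other than http or https is rejected) and should not be read as a complete XSS defense.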
It feels good to get that worked out!
Pasted from Word 2007 with Ctrl+V into CKEditor running on IE8:
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font color="#000000" size="3">“</font></span><span class="Heading1Char"><span style="font-size: 14pt"><strong><font color="#365f91" face="Cambria">Heading</font></strong></span></span><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font color="#000000"><font size="3">”<o:p></o:p></font></font></span></p>
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font color="#000000" size="3">Normal text </font></span><a href="http://fusionio.com"><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3">with a link</font></span></a><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3"><font color="#000000">. Special chars: “ ” ‘ ’ — – ™ © ® « » ° µ<o:p></o:p></font></font></span></p>
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3"><font color="#000000">That is <span style="background: yellow; mso-highlight: yellow">highlighted text</span>.<o:p></o:p></font></font></span></p>
<p style="text-indent: -0.25in; margin-left: 0.5in; mso-list: l0 level1 lfo1">
<font color="#000000"><span style="font-family: symbol; mso-fareast-font-family: symbol; mso-bidi-font-family: symbol"><span style="mso-list: ignore"><font size="3">·</font><span style="font: 7pt 'times new roman'"> </span></span></span><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3">bullet1<o:p></o:p></font></span></font></p>
<p style="text-indent: -0.25in; margin-left: 0.5in; mso-list: l0 level1 lfo1">
<font color="#000000"><span style="font-family: symbol; mso-fareast-font-family: symbol; mso-bidi-font-family: symbol"><span style="mso-list: ignore"><font size="3">·</font><span style="font: 7pt 'times new roman'"> </span></span></span><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3">bullet2<o:p></o:p></font></span></font></p>
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-bottom: medium none; border-left: medium none; border-collapse: collapse; border-top: medium none; border-right: medium none; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-yfti-tbllook: 1184; mso-padding-alt: 0in 5.4pt 0in 5.4pt">
<tbody>
<tr style="mso-yfti-irow: 0; mso-yfti-firstrow: yes">
<td style="border-bottom: black 1pt solid; border-left: black 1pt solid; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: black 1pt solid; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3"><font color="#000000">specs<o:p></o:p></font></font></span></p>
</td>
<td style="border-bottom: black 1pt solid; border-left: #000000; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: black 1pt solid; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-left-alt: solid black .5pt; mso-border-left-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3"><font color="#000000">b<o:p></o:p></font></font></span></p>
</td>
<td style="border-bottom: black 1pt solid; border-left: #000000; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: black 1pt solid; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-left-alt: solid black .5pt; mso-border-left-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3"><font color="#000000">c<o:p></o:p></font></font></span></p>
</td>
</tr>
<tr style="mso-yfti-irow: 1">
<td style="border-bottom: black 1pt solid; border-left: black 1pt solid; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: #000000; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-top-alt: solid black .5pt; mso-border-top-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3"><font color="#000000">a1<o:p></o:p></font></font></span></p>
</td>
<td style="border-bottom: black 1pt solid; border-left: #000000; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: #000000; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-left-alt: solid black .5pt; mso-border-left-themecolor: text1; mso-border-top-alt: solid black .5pt; mso-border-top-themecolor: text1; mso-border-bottom-themecolor: text1; mso-border-right-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'arial black', 'sans-serif'; font-size: 9pt"><font color="#000000">b1<o:p></o:p></font></span></p>
</td>
<td style="border-bottom: black 1pt solid; border-left: #000000; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: #000000; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-left-alt: solid black .5pt; mso-border-left-themecolor: text1; mso-border-top-alt: solid black .5pt; mso-border-top-themecolor: text1; mso-border-bottom-themecolor: text1; mso-border-right-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'arial black', 'sans-serif'; font-size: 9pt"><font color="#000000">c1<o:p></o:p></font></span></p>
</td>
</tr>
<tr style="mso-yfti-irow: 2; mso-yfti-lastrow: yes">
<td style="border-bottom: black 1pt solid; border-left: black 1pt solid; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: #000000; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-top-alt: solid black .5pt; mso-border-top-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font color="#000000">a2<o:p></o:p></font></span></p>
</td>
<td style="border-bottom: black 1pt solid; border-left: #000000; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: #000000; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-left-alt: solid black .5pt; mso-border-left-themecolor: text1; mso-border-top-alt: solid black .5pt; mso-border-top-themecolor: text1; mso-border-bottom-themecolor: text1; mso-border-right-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'arial black', 'sans-serif'; font-size: 9pt"><font color="#000000">b2<o:p></o:p></font></span></p>
</td>
<td style="border-bottom: black 1pt solid; border-left: #000000; padding-bottom: 0in; background-color: transparent; padding-left: 5.4pt; width: 159.6pt; padding-right: 5.4pt; border-top: #000000; border-right: black 1pt solid; padding-top: 0in; mso-border-alt: solid black .5pt; mso-border-themecolor: text1; mso-border-left-alt: solid black .5pt; mso-border-left-themecolor: text1; mso-border-top-alt: solid black .5pt; mso-border-top-themecolor: text1; mso-border-bottom-themecolor: text1; mso-border-right-themecolor: text1" valign="top" width="213">
<p>
<span style="font-family: 'arial black', 'sans-serif'; font-size: 9pt"><font color="#000000">c2<o:p></o:p></font></span></p>
</td>
</tr>
</tbody>
</table>
<p style="text-indent: -0.25in; margin-left: 0.5in; mso-list: l1 level1 lfo2">
<font color="#000000"><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-fareast-font-family: calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: calibri; mso-bidi-theme-font: minor-latin"><span style="mso-list: ignore"><font size="3">1.</font><span style="font: 7pt 'times new roman'"> </span></span></span><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3">list item a<o:p></o:p></font></span></font></p>
<p style="text-indent: -0.25in; margin-left: 0.5in; mso-list: l1 level1 lfo2">
<font color="#000000"><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-fareast-font-family: calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: calibri; mso-bidi-theme-font: minor-latin"><span style="mso-list: ignore"><font size="3">2.</font><span style="font: 7pt 'times new roman'"> </span></span></span><span style="font-family: 'calibri', 'sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"><font size="3">list item b<o:p></o:p></font></span></font></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<font size="3"><font color="#000000"><font face="Calibri"><b style="mso-bidi-font-weight: normal">Bold</b> text. Another </font><span style="font-family: 'tahoma', 'sans-serif'">font</span><font face="Calibri">. </font></font><font face="Calibri"><span style="color: #c00000">Red</span><font color="#000000"> Text. </font></font></font><font color="#000000"><font face="Calibri"><span style="line-height: 105%; font-size: 14pt; mso-bidi-font-size: 11.0pt">Larger </span><font size="3">text. some </font><sub><font size="2">subscript</font></sub><font size="3">. some </font><sup><font size="2">superscript</font></sup><font size="3">. some <s>strikethrough</s>. some <i style="mso-bidi-font-style: normal">italic</i>.</font></font></font></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<o:p><font color="#000000" face="Calibri" size="3"> </font></o:p></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<o:p><font color="#000000" face="Calibri" size="3"> </font></o:p></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt 0.5in">
<font color="#000000" face="Calibri" size="3">Weird, indented text.</font></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<o:p><font color="#000000" face="Calibri" size="3"> </font></o:p></p>
<p class="MsoNormal" style="margin: 0in 0in 3pt 0.25in">
<font color="#000000" face="Calibri" size="3">hanging indent.</font></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<o:p><font color="#000000" face="Calibri" size="3"> </font></o:p></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<font color="#000000" face="Calibri" size="3">justified text right here baby. but it needs to be long enough that it will actually wrap to the next line. a little bit more text please.</font></p>
<p class="MsoNormal" style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">
<font color="#000000" face="Calibri" size="3">tab<span style="mso-tab-count: 1"> </span>tab<span style="mso-tab-count: 1"> </span>taberoo</font></p>
<p class="MsoNormal" style="margin: 0in 0in 3pt 0.75in">
<o:p><font color="#000000" face="Calibri" size="3"> </font></o:p></p>
After my cleanup script:
<p>“<strong>Heading</strong>”</p><p>Normal text <a>with a link</a>. Special chars: “ ” ‘ ’ — – ™ © ® « » ° µ</p><p>That is highlighted text.</p><ul><li>bullet1</li><li>bullet2</li></ul><table cellspacing="0" class="content-info"><tbody><tr><td valign="top" width="213"><p>specs</p></td><td valign="top" width="213"><p>b</p></td><td valign="top" width="213"><p>c</p></td></tr><tr><td valign="top" width="213"><p>a1</p></td><td valign="top" width="213"><p>b1</p></td><td valign="top" width="213"><p>c1</p></td></tr><tr><td valign="top" width="213"><p>a2</p></td><td valign="top" width="213"><p>b2</p></td><td valign="top" width="213"><p>c2</p></td></tr></tbody></table><ol><li>list item a</li><li>list item b</li></ol><p style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt"><strong>Bold</strong> text. Another font. Red Text. Larger text. some <sub>subscript</sub>. some <sup>superscript</sup>. some <s>strikethrough</s>. some <em>italic</em>.</p><p style="text-indent: 0in; margin: 0in 0in 3pt 0.5in">Weird, indented text.</p><p style="margin: 0in 0in 3pt 0.25in">hanging indent.</p><p style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">justified text right here baby. but it needs to be long enough that it will actually wrap to the next line. a little bit more text please.</p><p style="text-indent: 0in; margin: 0in 0in 3pt -4.5pt">tab tab taberoo</p>
Looks great, would you be willing to share your code?
~ Ran