Monthly Archives: October 2009

Wrangling MS Word’s HTML

For a taste of Word HTML, see the section following this article. If you write HTML, a little part of you will die inside every time you see MS Word HTML.

One of the hard parts is extracting ordered and unordered lists, especially since the markup for them varies so much between browsers.

For this PHP project, I needed to accept HTML from a WYSIWYG editor and save it into a CMS. The HTML needed to be quite restricted, e.g. no changing font faces or font colors. Anyway, here is the algorithm I ended up with:

  1. Find and replace common entities that seem to get mangled in a normal DOMDocument conversion
  2. Remove comments
  3. Remove nested FONT tags since DOMDocument discards content inside nested FONT tags
  4. Extract ordered and unordered lists with regular expressions
  5. Convert B and I tags to EM and STRONG and disallow any attributes in any of those
  6. Replace TABLE tags with a simple tag containing a css class and cellspacing of 0
  7. Load into DOMDocument
  8. Remove any tags not on the whitelist
  9. Remove any attributes not on the whitelist
  10. Disable href and src attributes that do not contain an http, https or relative link
  11. Strip out disallowed css styles from the style attribute
  12. Remove css classes that Word commonly uses
  13. Dump back to HTML
  14. Remove DOCTYPE, HTML and BODY tags that DOMDocument automatically adds
  15. Use Tidy to clean up any additional problems, add P tags to plain text and remove duplicate white space
  16. Remove the linefeeds that Tidy leaves (we are sending content to Adobe Flex which parses HTML but interprets all white space literally)
  17. Remove spans that have no attributes and thus no effect
  18. Remove empty paragraphs
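
As a rough illustration of steps 7 to 10 and 13 to 14, here is a minimal sketch using DOMDocument. The whitelists, the function names sanitizeFragment and isAllowedUrl, and the sample markup are assumptions made up for this example, not the lists or code from the actual project. It also assumes "remove a tag" means unwrapping it while keeping its content, and "disable" an href or src means dropping the attribute.

```php
<?php
// Hypothetical whitelists for illustration; the real lists are not given above.
$allowedTags  = array('p', 'br', 'em', 'strong', 'ul', 'ol', 'li', 'a', 'span');
$allowedAttrs = array('href', 'src', 'class');

// Step 10: accept http, https, or links with no scheme at all (relative).
function isAllowedUrl($url) {
  $url = trim($url);
  if (preg_match('#^https?://#i', $url)) {
    return true;
  }
  return !preg_match('#^[a-z][a-z0-9+.-]*:#i', $url);
}

function sanitizeFragment($html, array $allowedTags, array $allowedAttrs) {
  $previous = libxml_use_internal_errors(true); // sloppy markup: no warnings
  $doc = new DOMDocument();
  // Step 7: wrap in a DIV so the fragment is easy to find again afterwards.
  $doc->loadHTML('<div>' . $html . '</div>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  $root = $doc->getElementsByTagName('div')->item(0);

  // Copy the live node list first, because the loop below edits the tree.
  $nodes = array();
  foreach ($root->getElementsByTagName('*') as $node) {
    $nodes[] = $node;
  }
  foreach ($nodes as $node) {
    // Step 8: unwrap tags that are not on the whitelist, keeping their content.
    if (!in_array(strtolower($node->nodeName), $allowedTags, true)) {
      while ($node->firstChild) {
        $node->parentNode->insertBefore($node->firstChild, $node);
      }
      $node->parentNode->removeChild($node);
      continue;
    }
    // Step 9: collect attribute names first, then drop the unwanted ones.
    $names = array();
    foreach ($node->attributes as $attr) {
      $names[] = $attr->nodeName;
    }
    foreach ($names as $name) {
      $isUrl = ($name === 'href' || $name === 'src');
      if (!in_array($name, $allowedAttrs, true)
          || ($isUrl && !isAllowedUrl($node->getAttribute($name)))) {
        $node->removeAttribute($name);
      }
    }
  }
  // Steps 13-14: dump only the wrapper's children, so no DOCTYPE/HTML/BODY.
  $out = '';
  foreach ($root->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}

$dirty = '<p class="intro" style="color:red"><font face="Arial"><strong>Hi</strong> there</font>'
       . ' <a href="javascript:alert(1)" onclick="x()">link</a></p>';
echo sanitizeFragment($dirty, $allowedTags, $allowedAttrs), "\n";
```

With this input the FONT tag, the style and onclick attributes, and the javascript: link should all be gone, while the paragraph, its class, and the STRONG tag survive. Note that this simple check treats protocol-relative links such as //host/path as relative, which a stricter filter might want to reject.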

It feels good to get that worked out!


Flash’s Über Compatibility

Here is the HTML that Adobe recommends to embed a Flash file:

<OBJECT classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase=",0,40,0" WIDTH="550" HEIGHT="400" id="myMovieName"><PARAM NAME=movie VALUE="myFlashMovie.swf"><PARAM NAME=quality VALUE=high><PARAM NAME=bgcolor VALUE=#FFFFFF><EMBED href="/support/flash/ts/documents/myFlashMovie.swf" quality=high bgcolor=#FFFFFF WIDTH="550" HEIGHT="400" NAME="myMovieName" ALIGN="" TYPE="application/x-shockwave-flash" PLUGINSPAGE=""></EMBED></OBJECT>

It is compatible with IE3 and Netscape 2. Really? Do we need that level of compatibility? I mean really!

And if you want something more “automated” you can use Adobe’s AC_RunActiveContent.js JavaScript file. Oh boy!

So I looked at the popular video sites and their “embed source” links. All are shorter and simpler. Here is the shortest one I found:

<embed src="" type="application/x-shockwave-flash" width="480" height="300" allowscriptaccess="always" allowfullscreen="true"></embed>

More examples are below.


<embed id=VideoPlayback src= style=width:400px;height:326px allowFullScreen=true allowScriptAccess=always type=application/x-shockwave-flash> </embed>

<object width="560" height="340"><param name="movie" value=""></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"></embed></object>

<div><object width="512" height="322"><param name="movie" value="" /><param name="allowFullScreen" value="true" /><param name="AllowScriptAccess" VALUE="always" /><param name="bgcolor" value="#000000" /><param name="flashVars" value="id=10032117&vid=3641591&lang=en-us&intl=us&thumbUrl=http%3A//" /><embed src="" type="application/x-shockwave-flash" width="512" height="322" allowFullScreen="true" AllowScriptAccess="always" bgcolor="#000000" flashVars="id=10032117&vid=3641591&lang=en-us&intl=us&thumbUrl=http%3A//" ></embed></object><br /><a href="">E for All 2008 - Interview with Fusion-io</a> @ <a href="" >Yahoo! Video</a></div>

<font face="Verdana" size="1" color="#999999"><br/><a style="font: Verdana" href="">E for All 2008 - Interview with Fusion-io</a><br/><object width="425px" height="360px" ><param name="allowFullScreen" value="true"/><param name="wmode" value="transparent"/><param name="movie" value=",t=1,mt=video"/><embed src=",t=1,mt=video" width="425" height="360" allowFullScreen="true" type="application/x-shockwave-flash" wmode="transparent"></embed></object><br/><a style="font: Verdana" href="">Stuff We Like</a> | <a style="font: Verdana" href="">MySpace Video</a></font>



I developed the following PHP function after writing trim($path,'/') too many times. It took me a lot of iterations to pass all the unit tests, but it works with URIs and file paths on all OSs. It even accounts for the strange possibility of a path containing an escaped slash. It runs pretty quickly: less than twice as long as a simple join('/',$parts).

Much of the time, simply joining with a slash is acceptable, since file systems and web servers treat consecutive slashes as one. But when comparing two paths for equality, spurious slashes are a problem. And when running rewrite rules, an extra slash may throw the application into an entirely different path.
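
To make the problem concrete, here is a small sketch of what a plain join does when the parts carry their own slashes. The example paths are made up for illustration.

```php
<?php
// Parts as they often arrive: some with leading or trailing slashes.
$parts = array('/one/', '/two/', 'three.php');

// A plain join keeps every slash, so they pile up between the parts.
$naive = join('/', $parts);
echo $naive, "\n"; // /one///two//three.php

// The same location written cleanly no longer compares equal.
$clean = '/one/two/three.php';
var_dump($naive === $clean); // bool(false)
```

Collapsing runs of slashes with a regular expression would fix this particular case, but it would also mangle the double slash in something like http://, which is one reason a dedicated function is handy.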

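function pathConcat() {
  $parts = func_get_args();
  $base = array_shift($parts);
  // Hide escaped slashes (\/) behind a placeholder so they are not trimmed
  $base = str_replace('\\/', "\x01", $base);
  // Keep a leading slash or scheme on the base; drop only trailing slashes
  $base = rtrim($base, '/');
  $paths = array();
  foreach ($parts as $part) {
    $part = str_replace('\\/', "\x01", $part);
    $part = trim($part, '/');
    // Skip parts that were empty or only slashes
    if (strlen($part)) {
      $paths[] = $part;
    }
  }
  $fullpath = join('/', $paths);
  $fullpath = $base . '/' . $fullpath;
  // Restore the escaped slashes
  $fullpath = str_replace("\x01", '\\/', $fullpath);
  return $fullpath;
}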


pathConcat('one','two','three/'); // 'one/two/three'
pathConcat('one','','two','/','three/'); // 'one/two/three'
pathConcat('http://one/','/two/','//three/'); // 'http://one/two/three'
pathConcat('/one','/two/','//three.php'); // '/one/two/three.php'
pathConcat('c:/one/two/','/../three/'); // 'c:/one/two/../three'
pathConcat("bats\\/","like","thedark"); // "bats\\//like/thedark"
pathConcat("bats\\//like","caves"); // "bats\\//like/caves"

Multiple Codebases on localhost

In httpd.conf:

<VirtualHost *:80>
   DocumentRoot "c:/wamp/www"
   ServerName localhost
</VirtualHost>

<VirtualHost *:80>
   DocumentRoot "c:/wamp/www/app1/"
   ServerName app1
</VirtualHost>

<VirtualHost *:80>
   DocumentRoot "c:/wamp/www/app2"
   ServerName app2
</VirtualHost>

In the hosts file (c:/WINDOWS/system32/drivers/etc/hosts), map the names localhost, app1 and app2 to the local machine's loopback address.

Access each codebase in the browser using its ServerName as the host name.


It can’t get much easier.

Multitasking with Apache RewriteCond


But one thing I run into a lot is the fact that many applications reference images (or css or js) from the root only. For example, the css files don’t know that I’m running out of /app-1/ and they reference /images/photo.jpg instead of /app-1/images/photo.jpg. Usually those types of path adjustments are best done in the application; but not all apps are enterprise grade.

I stumbled on one way to solve a lot of the problem with Apache’s RewriteCond:

RewriteCond %{REQUEST_URI} ^/(images|js|css)
RewriteCond %{DOCUMENT_ROOT}app-1%{REQUEST_URI} -f
RewriteRule ^(.+)$ app-1/$1 [L,QSA]

RewriteCond %{REQUEST_URI} ^/(images|js|css)
RewriteCond %{DOCUMENT_ROOT}app-2%{REQUEST_URI} -f
RewriteRule ^(.+)$ app-2/$1 [L,QSA]

Basically, it says that if the request starts with “/images”, “/js”, or “/css”, look to see if a file with that name exists within the /app-1/ folder, then within the /app-2/ folder. If a file with that name does exist, rewrite the URL to use that subpath off of localhost.

One downside is if app-1 and app-2 have an image with the same file name, I will see the app-1 version in app-2. But it is a great little hack since I’m usually working with only 2 or 3 codebases in one day.
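
The lookup order, and the downside just mentioned, can be sketched in PHP. This is only an emulation of the logic for illustration, not how Apache evaluates the rules internally. The function name resolveSharedAsset, the temp-directory layout, and the file names are all made up for this example.

```php
<?php
// Hypothetical helper mirroring the RewriteCond/RewriteRule pairs above.
// $apps is checked in order, like the order of the rule blocks.
function resolveSharedAsset($docRoot, $requestUri, array $apps) {
  // First RewriteCond: only root-level /images, /js and /css requests qualify.
  if (!preg_match('#^/(images|js|css)#', $requestUri)) {
    return null;
  }
  foreach ($apps as $app) {
    // Second RewriteCond (-f): does a regular file exist under this app?
    $candidate = rtrim($docRoot, '/') . '/' . $app . $requestUri;
    if (is_file($candidate)) {
      // RewriteRule: serve the request from the app's subfolder.
      return '/' . $app . $requestUri;
    }
  }
  return null; // nothing matched; the request passes through unchanged
}

// Throwaway demo tree in the temp directory (paths are made up).
$root = sys_get_temp_dir() . '/rewrite-demo-' . uniqid();
mkdir($root . '/app-1/images', 0777, true);
mkdir($root . '/app-2/images', 0777, true);
mkdir($root . '/app-2/css', 0777, true);
file_put_contents($root . '/app-1/images/logo.png', 'one');
file_put_contents($root . '/app-2/images/logo.png', 'two');
file_put_contents($root . '/app-2/css/site.css', 'body{}');

$apps = array('app-1', 'app-2');
var_dump(resolveSharedAsset($root, '/css/site.css', $apps));    // found only in app-2
var_dump(resolveSharedAsset($root, '/images/logo.png', $apps)); // app-1 wins the name clash
var_dump(resolveSharedAsset($root, '/index.php', $apps));       // not an asset path: NULL
```

The second call shows the name clash: both apps have images/logo.png, and the first app in the list always wins.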