Beauty’s where you find it
10 Jul
We have a client who is having issues with character encoding. They recently had a massive content migration and along with the migration came numerous character issues. They moved from a character set of iso-8859-1 to UTF-8, and with multiple content editors copying/pasting from many different resources (pdf’s, Word, Web Pages, etc) they now have some cleanup to do. We were recently asked if we could provide a report on all the pages where certain characters resided. No problem. We ran some SQL queries (most content resides in the database thankfully) that looked similar to the following:
SELECT p.id, p.title FROM `pages` p WHERE p.content REGEXP '(¢|â|Ã)';
In the case of the above query, we’re just taking advantage of MySQL’s native REGEXP operator to check for the existence of any of the three characters (simplified for illustration) in a content block. After running our SQL statement we quickly realized that the work necessary to clean this up would likely amount to at least a few weeks for our client. With an additional few hours of work (after backing things up and testing, testing, testing) we were able to provide a solution that will take a matter of minutes to implement (if the client chooses to hire us to do so). Here is an abbreviated version of what we came up with:
UPDATE `pages` p SET p.content = replace(p.content,'â€Â',''); UPDATE `pages` p SET p.content = replace(p.content,'â€Å\“',''); UPDATE `pages` p SET p.content = replace(p.content,'¢â‚',''); ...etc
There isn’t anything earth shattering here. These are simple queries, but they illustrate the inherent abilities of the technology employed with our client’s application. These inherent abilities, if known, have the potential to save dozens of hours. In the case of the above example, I truly believe we will likely save our client at least 30 man hours.
Recent Comments