SniptoolsSniptools | Design & Technology Observations

RSS

Finding non-UTF8 values in PostgreSQL

Jul 23rd 2009
One Comment

Respond
Trackback

This regexp worked for me.

1
2
3
4
5
6
7
8
9
10
11
12
SELECT * FROM TABLE
WHERE NOT COLUMN ~ ( '^('||
 $$[\09\0A\0D\x20-\x7E]|$$||               -- ASCII
 $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
  $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
 $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
  $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
  $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
 $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
  $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
 ')*$' )
;

Or, if you have iconv on your system, as most UNIX variants do, you can pg_dumpall a database and run the following command on it:

if (@iconv($input, 'UTF-8', 'UTF-8') == $input) echo "Good UTF-8!"; else echo "Nope."

Then take a plain text pg_dump of the database in UTF-8. Let’s call it “db.orig.dmp”. Strip all problem characters:

iconv -f UTF-8 -t UTF-8 -c <db .orig.dmp >db.stripped.sql
</db>

Compare both files:

diff -u db.orig.dmp db.stripped.sql

That output will show all lines containing a problem character. Now the tedious part: Use “db.orig.dmp” to find out which tables they belong to, locate the records in the database by primary key and fix them. Alternatively, you can use iconv’s auto-repair if you know which bytes give you a problem.

For example, if you know that the trouble stems only from 0x80 bytes that should be Euro symbols, you could:

iconv -f UTF-8 -t UTF-8 --byte-subst="&lt;0x%x>" < db.orig.sql | sed -e 's/&lt;0x80>/EUR/g' >db.fixed.sql

The resulting “db.fixed.sql” could then be loaded into the new database.




This post is tagged , ,

One Comment

  1. Thank you so much for your instruction on how to delete Saft. It was driving me mad. All I wanted it for is the option in Safari tabs to have the new tab open next to the currently active one. By chance to you have a fix for that?

    Thanks again.

Incoming Links