This regexp worked for me.
1 2 3 4 5 6 7 8 9 10 11 12 | SELECT * FROM TABLE WHERE NOT COLUMN ~ ( '^('|| $$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII $$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte $$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte $$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3 $$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15 $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16 ')*$' ) ; |
Or, if you have iconv on your system, as most UNIX variants do, you can pg_dumpall a database and run the following command on it:
if (@iconv($input, 'UTF-8', 'UTF-8') == $input) echo "Good UTF-8!"; else echo "Nope."
Then take a plain text pg_dump of the database in UTF-8. Let’s call it “db.orig.dmp”. Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c <db .orig.dmp >db.stripped.sql </db>
Compare both files:
diff -u db.orig.dmp db.stripped.sql
That output will show all lines containing a problem character. Now the tedious part: Use “db.orig.dmp” to find out which tables they belong to, locate the records in the database by primary key and fix them. Alternatively, you can use iconv’s auto-repair if you know which bytes give you a problem.
For example, if you know that the trouble stems only from 0x80 bytes that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" < db.orig.sql | sed -e 's/<0x80>/EUR/g' >db.fixed.sql
The resulting “db.fixed.sql” could then be loaded into the new database.
This post is tagged Databases, postgresql, sql

One Comment
Thank you so much for your instruction on how to delete Saft. It was driving me mad. All I wanted it for is the option in Safari tabs to have the new tab open next to the currently active one. By chance to you have a fix for that?
Thanks again.
Incoming Links