This regexp worked for me.
SELECT * FROM TABLE WHERE NOT COLUMN ~ ( '^('||
  $$[\09\0A\0D\x20-\x7E]|$$||              -- ASCII
  $$[\xC2-\xDF][\x80-\xBF]|$$||            -- non-overlong 2-byte
  $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
  $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
  $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
  $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
  $$[\xF1-\xF3][\x80-\xBF]{3}|$$||         -- planes 4-15
  $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
  ')*$' );
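If you want to verify data outside the database first, the same byte classes can be transcribed into a Python regex. This is a sketch of my own, not part of the original recipe; it applies the pattern to raw bytes, just as the SQL version does:

```python
import re

# Byte-level pattern mirroring the SQL regex above: data is valid UTF-8
# iff it is a sequence of these alternatives, applied to the raw bytes.
UTF8_PATTERN = re.compile(
    b"""^(?:
      [\x09\x0A\x0D\x20-\x7E]             # ASCII
    | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*$""",
    re.VERBOSE,
)

def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string is well-formed UTF-8 per the pattern above."""
    return UTF8_PATTERN.match(data) is not None
```

Handy for spot-checking individual values pulled out of the dump before deciding how to repair them.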
Or, if you have iconv on your system, as most UNIX variants do, you can pg_dumpall the database and repair the dump instead. (The same round-trip idea works per string in PHP: converting from UTF-8 to UTF-8 fails on invalid input.)

if (@iconv($input, 'UTF-8', 'UTF-8') === $input) echo "Good UTF-8!"; else echo "Nope.";
Then take a plain text pg_dump of the database in UTF-8. Let's call it "db.orig.dmp". Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c < db.orig.dmp > db.stripped.sql
Compare both files:
diff -u db.orig.dmp db.stripped.sql
That output will show every line containing a problem character. Now the tedious part: use "db.orig.dmp" to find out which tables those lines belong to, locate the records in the database by primary key, and fix them. Alternatively, if you know exactly which bytes are causing trouble, you can have iconv substitute them for you.
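When the diff gets unwieldy, the lookup step can be scripted. A minimal Python sketch (the function name is mine, not part of the original workflow) that lists which dump lines contain invalid UTF-8:

```python
def find_bad_lines(path):
    """Return (line_number, raw_line) pairs from a dump file whose bytes
    do not decode as UTF-8 -- the same lines diff(1) would flag."""
    bad = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, line))
    return bad
```

The line numbers map straight back to "db.orig.dmp", so you can jump to the surrounding COPY block and read off the table name and primary key.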
For example, if you know that the trouble stems only from 0x80 bytes that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" < db.orig.dmp | sed -e 's/<0x80>/€/g' > db.fixed.sql
The resulting "db.fixed.sql" could then be loaded into the new database.
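The same targeted repair can be written in Python with a custom decode error handler (a sketch under the same assumption that stray 0x80 bytes are the only problem; the handler name is my own). Unlike a blind byte replace, the handler only sees bytes that actually fail to decode, so legitimate 0x80 continuation bytes inside valid multi-byte sequences are left alone:

```python
import codecs

def _euro_subst(err):
    # Called only for byte runs that fail UTF-8 decoding: map lone 0x80
    # bytes to the Euro sign, and expose anything else as <0xNN> so it
    # can still be found and fixed by hand, like iconv --byte-subst.
    bad = err.object[err.start:err.end]
    repl = "".join("€" if b == 0x80 else f"<0x{b:02x}>" for b in bad)
    return repl, err.end

codecs.register_error("euro-subst", _euro_subst)

def fix_dump(raw: bytes) -> str:
    """Decode dump bytes, repairing stray 0x80 bytes as Euro signs."""
    return raw.decode("utf-8", errors="euro-subst")
```

Run the dump file through fix_dump and write the result back out in UTF-8, then load it as before.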