Finding non-UTF8 values in PostgreSQL
This regexp worked for me.
1 2 3 4 5 6 7 8 9 10 11 12 | SELECT * FROM TABLE WHERE NOT COLUMN ~ ( '^('|| $$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII $$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte $$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte $$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3 $$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15 $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16 ')*$' ) ; |
Or, if you have iconv on your system, as most UNIX variants do, you can pg_dumpall a database and run the following command on it:
if (@iconv($input, 'UTF-8', 'UTF-8') == $input) echo "Good UTF-8!"; else echo "Nope."
Then take a plain text pg_dump of the database in UTF-8. Let's call it "db.orig.dmp". Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c <db .orig.dmp >db.stripped.sql </db>
Compare both files:
diff -u db.orig.dmp db.stripped.sql
That output will show all lines containing a problem character. Now the tedious part: Use "db.orig.dmp" to find out which tables they belong to, locate the records in the database by primary key and fix them. Alternatively, you can use iconv's auto-repair if you know which bytes give you a problem.
For example, if you know that the trouble stems only from 0×80 bytes that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" < db.orig.sql | sed -e 's/<0x80>/EUR/g' >db.fixed.sql
The resulting "db.fixed.sql" could then be loaded into the new database.
Thank you so much for your instruction on how to delete Saft. It was driving me mad. All I wanted it for is the option in Safari tabs to have the new tab open next to the currently active one. By chance to you have a fix for that?
Thanks again.
Thanks a lot for this post, the last command with the idea of using iconv to create intermediary representations of problematic characters and manually fixing them with sed solved the problem I've been trying to crack purely with sed for way too many hours!
I ended up just converting to numeric HTML entities and only fixing the ones which are not being displayed (using –byte-subst="&#x%x;")