Finding non-UTF8 values in PostgreSQL

This reg­exp worked for me.

1
2
3
4
5
6
7
8
9
10
11
12
SELECT * FROM TABLE
WHERE NOT COLUMN ~ ( '^('||
$$[\09\0A\0D\x20-\x7E]|$$||               -- ASCII
$$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
$$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
$$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
$$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
$$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
$$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
$$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
')*$' )
;

Or, if you have iconv on your sys­tem, as most UNIX vari­ants do, you can pg_dumpall a data­base and run the fol­low­ing com­mand on it:

if (@iconv($input, 'UTF-8', 'UTF-8') == $input) echo "Good UTF-8!"; else echo "Nope."

Then take a plain text pg_dump of the data­base in UTF-8. Let's call it "db.orig.dmp". Strip all prob­lem characters:

iconv -f UTF-8 -t UTF-8 -c <db .orig.dmp >db.stripped.sql
</db>

Com­pare both files:

diff -u db.orig.dmp db.stripped.sql

That out­put will show all lines con­tain­ing a prob­lem char­ac­ter. Now the tedious part: Use "db.orig.dmp" to find out which tables they belong to, locate the records in the data­base by pri­mary key and fix them. Alter­na­tively, you can use iconv's auto-repair if you know which bytes give you a problem.

For exam­ple, if you know that the trou­ble stems only from 0×80 bytes that should be Euro sym­bols, you could:

iconv -f UTF-8 -t UTF-8 --byte-subst="&lt;0x%x>" < db.orig.sql | sed -e 's/&lt;0x80>/EUR/g' >db.fixed.sql

The result­ing "db.fixed.sql" could then be loaded into the new database.

  • http://clientpractice.com/ Monty Sher

    Thank you so much for your instruc­tion on how to delete Saft. It was dri­ving me mad. All I wanted it for is the option in Safari tabs to have the new tab open next to the cur­rently active one. By chance to you have a fix for that?

    Thanks again.

  • http://danieldemmel.me Daniel Dem­mel

    Thanks a lot for this post, the last com­mand with the idea of using iconv to cre­ate inter­me­di­ary rep­re­sen­ta­tions of prob­lem­atic char­ac­ters and man­u­ally fix­ing them with sed solved the prob­lem I've been try­ing to crack purely with sed for way too many hours!

    I ended up just con­vert­ing to numeric HTML enti­ties and only fix­ing the ones which are not being dis­played (using –byte-subst="&#x%x;")

  • Luma Macagnan

    Thank you so much for this post. It was really helpfull.