Archive
Tag "sql"

This reg­exp worked for me.

1
2
3
4
5
6
7
8
9
10
11
12
SELECT * FROM TABLE
WHERE NOT COLUMN ~ ( '^('||
$$[\09\0A\0D\x20-\x7E]|$$||               -- ASCII
$$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
$$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
$$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
$$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
$$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
$$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
$$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
')*$' )
;

Or, if you have iconv on your sys­tem, as most UNIX vari­ants do, you can pg_dumpall a data­base and run the fol­low­ing com­mand on it:

if (@iconv($input, 'UTF-8', 'UTF-8') == $input) echo "Good UTF-8!"; else echo "Nope."

Then take a plain text pg_dump of the data­base in UTF-8. Let's call it "db.orig.dmp". Strip all prob­lem characters:

iconv -f UTF-8 -t UTF-8 -c <db .orig.dmp >db.stripped.sql
</db>

Com­pare both files:

diff -u db.orig.dmp db.stripped.sql

That out­put will show all lines con­tain­ing a prob­lem char­ac­ter. Now the tedious part: Use "db.orig.dmp" to find out which tables they belong to, locate the records in the data­base by pri­mary key and fix them. Alter­na­tively, you can use iconv's auto-repair if you know which bytes give you a problem.

For exam­ple, if you know that the trou­ble stems only from 0×80 bytes that should be Euro sym­bols, you could:

iconv -f UTF-8 -t UTF-8 --byte-subst="&lt;0x%x>" < db.orig.sql | sed -e 's/&lt;0x80>/EUR/g' >db.fixed.sql

The result­ing "db.fixed.sql" could then be loaded into the new database.

Read More