
Characters beyond ASCII in database ( US7ASCII and UTF8 character sets )
Oracle 8.1.7.x.x on Solaris 8.
I am currently testing a conversion of an instance from the US7ASCII
character set to a UTF8 character set. The client applications are all Java
( which by default uses UCS2 ) via the JDBC 1.2 thin drivers.
What I did was export the data from the production database using the
US7ASCII character set and import it into a new database, also US7ASCII,
thereby giving me a duplicate of the existing production database. I then
changed the character set of that new database to UTF8, and also changed the
national character set, as stated in the Oracle docs.
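( For the record, the character set change itself boiled down to the two
ALTER DATABASE statements from the 8i docs, run with the instance open in
restricted mode -- I am only sketching it here, the docs list a few more
preconditions: )

-- with the instance started in restricted mode ( STARTUP RESTRICT )
ALTER DATABASE CHARACTER SET UTF8;
ALTER DATABASE NATIONAL CHARACTER SET UTF8;
-- then a clean shutdown and a normal restart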
So far, the conversion of the character set was fine.
However ....
... just today, after trying to find out why I was getting a
"java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv"
for one particular row, I narrowed it down to a single row where one column
had a single character beyond the ASCII range ( an E with an acute accent,
i.e. É ). It was a plain VARCHAR2(15) column. ( This data comes in from SMS
messages. )
I looked at the production database ( US7ASCII ) to see if it had been like
that before, and sure enough, it was.
Here is how it looks in the US7ASCII database:
SQL> ;
1* select nickname, length(nickname), dump(nickname) from player where
player_id = 48589
SQL> /
NICKNAME        LENGTH(NICKNAME) DUMP(NICKNAME)
--------------- ---------------- ----------------------------------------
DÉVIL13                        7 Typ=1 Len=7: 68,201,86,73,76,49,51
Here is how it looks in the UTF8 database:
SQL> column dump(nickname) format a40
SQL> select nickname, length(nickname), dump(nickname) from player where
player_id = 48589
2
SQL> /
NICKNAME        LENGTH(NICKNAME) DUMP(NICKNAME)
--------------- ---------------- ----------------------------------------
D?IL13                         6 Typ=1 Len=7: 68,201,86,73,76,49,51
As you can see, they both have the same byte sequence, but Oracle treats
them as different strings: the US7ASCII database counts 7 characters, while
the UTF8 database counts only 6, because byte 201 is now taken as the lead
byte of a two-byte UTF8 character and swallows the 86 that follows it.
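( I suppose comparing LENGTH with LENGTHB would show the same thing -- the
byte count is unchanged, only the character count is reinterpreted.
Something like: )

-- presumably CHAR_COUNT = 6 and BYTE_COUNT = 7 in the UTF8 database,
-- and 7 / 7 in the US7ASCII copy
select length(nickname)  as char_count,
       lengthb(nickname) as byte_count,
       dump(nickname)
  from player
 where player_id = 48589;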
Of course, the original problem is that the non-ASCII character got inserted
into the database in the first place, while the character set was still
US7ASCII. Now that the database is tagged as UTF8, the JDBC driver receives
( 201 ) and ( 86 ) as one multibyte character, and fails because 201 ( 0xC9 )
is a UTF8 lead byte that must be followed by a continuation byte in the
range 128-191, and 86 ( 'V' ) is not.
If I then try to insert the same character into the UTF8 database, the byte
sequence that I get is:
SQL> ;
1* select text, length(text), dump(text) from messages where message_id =
6023026
SQL> /
TEXT                 LENGTH(TEXT) DUMP(TEXT)
-------------------- ------------ --------------------
É                               1 Typ=1 Len=2: 195,137
.... which is how the character would be saved in the database IF the DB
character set had been UTF8 all along ( 195,137 is 0xC3 0x89, the proper
two-byte UTF8 encoding of É, code point 0x00C9 ). Retrieving this row from
the database using JDBC does NOT throw the exception noted above.
So my questions are:
1) Shouldn't Oracle have complained in the first place when someone tried to
insert a character beyond the ASCII range while the DB character set was
US7ASCII?
2) Does anybody know of a way to look for non-ASCII characters in a
specified set of tables ... and then convert them into the proper UTF8
encoding?? ( Something along the lines of the sketch below is what I have in
mind. )
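( Roughly this is what I mean -- a query to flag suspect rows and a
CONVERT-based fix-up, using my PLAYER.NICKNAME example. It assumes the stray
bytes really are Latin-1 ( WE8ISO8859P1 ), which matches what I am seeing
here, but I have not verified that for every row: )

-- flag rows whose NICKNAME changes when forced down to US7ASCII;
-- pure-ASCII values convert to themselves, anything with a byte above
-- 127 comes back with a replacement character and so compares unequal
select player_id, nickname, dump(nickname)
  from player
 where nickname <> convert(nickname, 'US7ASCII', 'WE8ISO8859P1');

-- re-encode the stray Latin-1 bytes as proper UTF8 in place
-- ( CONVERT( value, dest_charset, source_charset ) reinterprets the
--   stored bytes as WE8ISO8859P1 and writes UTF8 back -- only safe if
--   the flagged rows really are Latin-1 )
update player
   set nickname = convert(nickname, 'UTF8', 'WE8ISO8859P1')
 where nickname <> convert(nickname, 'US7ASCII', 'WE8ISO8859P1');
commit;

( The same SELECT could presumably be run against the production US7ASCII
database beforehand, to get an idea of how many rows are affected. )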
I could have created a separate UTF8 database from the outset and then
imported the data exported from the US7ASCII database, and presumably the
import would then have converted that non-ASCII character to the proper UTF8
encoding ... would it?
Anyway, I did what I did above ( i.e.: create a separate US7ASCII db,
import, then convert to UTF8 ) as a dry run of what will be done later to
the production database ... and to minimise the downtime required.
Regards,