Characters beyond ASCII in database ( US7ASCII and UTF8 character set ) 
 Characters beyond ASCII in database ( US7ASCII and UTF8 character set )

Oracle 8.1.7.x.x on Solaris 8.

I am currently testing the conversion of an instance from the US7ASCII
character set to the UTF8 character set. The client applications are all Java
( which uses UCS2 internally for strings ), connecting via the JDBC 1.2 thin
drivers.

What I did was export the data from the production database using the
US7ASCII character set and import it into a new database, also US7ASCII,
thereby creating a duplicate of the existing production database. I then
changed the character set of that new database to UTF8, and also changed the
national character set, as described in the Oracle docs.
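
( The resulting settings can be double-checked with the standard dictionary
query against nls_database_parameters: )

SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ( 'NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' );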

So far, the conversion of the character set was fine.

However ....

... just today, while trying to find out why I was getting a
"java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv"
for one particular row, I narrowed it down to a single row where one column
had a single character beyond the ASCII set ( an e with an acute accent on
top ). It was a plain VARCHAR2(15) column. ( This data comes from SMS
messages. )

I looked at the production database ( US7ASCII ) to see if it had been like
that before, and true enough, it was.

Here is how it looks in the US7ASCII database:

SQL> ;
  1* select nickname, length(nickname), dump(nickname) from player where
player_id = 48589
SQL> /

NICKNAME        LENGTH(NICKNAME) DUMP(NICKNAME)
--------------- ---------------- ----------------------------------------
DÉVIL13                        7 Typ=1 Len=7: 68,201,86,73,76,49,51

Here is how it looks in the UTF8 database:

SQL> column dump(nickname) format a40
SQL> select nickname, length(nickname), dump(nickname) from player where
player_id = 48589
  2
SQL> /

NICKNAME        LENGTH(NICKNAME) DUMP(NICKNAME)
--------------- ---------------- ----------------------------------------
D?IL13                         6 Typ=1 Len=7: 68,201,86,73,76,49,51

As you can see, both rows contain the same byte sequence, but Oracle treats
them as different strings ( one is 7 characters long, the other 6 ).
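
( That length discrepancy suggests one way of hunting for the bad rows after
the switch: in tables whose data should still be pure ASCII, any row where
the character length differs from the byte length is suspect. Something
like: )

SELECT player_id, nickname, DUMP( nickname )
  FROM player
 WHERE LENGTH( nickname ) <> LENGTHB( nickname );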

Of course, the original problem is that the non-ASCII character got inserted
into the database in the first place, back when the character set was still
US7ASCII. The JDBC driver, told that the database character set is now UTF8,
sees byte 201 ( 0xC9, bit pattern 110xxxxx ) as the lead byte of a two-byte
UTF8 sequence and tries to combine it with the following byte 86; since 86
( 0x56 ) lacks the 10xxxxxx continuation-byte pattern, the conversion fails.

If I then try to insert the same character into the UTF8 database, the byte
sequence that I get is:

SQL> ;
  1* select text, length(text), dump(text) from messages where message_id =
6023026
SQL> /

TEXT                 LENGTH(TEXT) DUMP(TEXT)
-------------------- ------------ --------------------
É                               1 Typ=1 Len=2: 195,137

.... which is how the character would be stored in the database IF the DB
character set had been UTF8 all along. Retrieving this row from the database
using JDBC does NOT throw the exception noted above.
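
( Incidentally, the same two-byte sequence can be produced from the stray
byte with CONVERT, assuming that byte 201 really is ISO-8859-1 E-acute: )

SELECT DUMP( CONVERT( CHR( 201 ), 'UTF8', 'WE8ISO8859P1' ) ) AS utf8_bytes
  FROM dual;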

So my questions are:

1) Shouldn't Oracle have complained in the first place when someone tried to
insert a character beyond ASCII while the DB character set was US7ASCII?

2) Does anybody know of a way to look for non-ASCII characters in a
specified set of tables ... and then convert them into the proper UTF8
encoding? ( A sketch of what I mean follows. )
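
Something along these lines is what I have in mind; has_non_ascii is just a
hypothetical helper, and it assumes it is run while the character set is
still single-byte, so that ASCII() returns the raw value of each byte:

CREATE OR REPLACE FUNCTION has_non_ascii ( p_val IN VARCHAR2 )
  RETURN NUMBER
IS
BEGIN
  -- One byte = one character in a single-byte character set,
  -- so anything ASCII() reports above 127 is outside US7ASCII.
  FOR i IN 1 .. NVL( LENGTH( p_val ), 0 ) LOOP
    IF ASCII( SUBSTR( p_val, i, 1 ) ) > 127 THEN
      RETURN 1;
    END IF;
  END LOOP;
  RETURN 0;
END;
/

-- Scan a table for offending rows ( repeat per column of interest ):
SELECT player_id, nickname, DUMP( nickname )
  FROM player
 WHERE has_non_ascii( nickname ) = 1;

-- Then, after the switch to UTF8, re-encode just those rows,
-- assuming the stray bytes really are Latin-1:
UPDATE player
   SET nickname = CONVERT( nickname, 'UTF8', 'WE8ISO8859P1' )
 WHERE player_id = 48589;  -- key(s) found by the scan above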

I could have created a separate UTF8 database from the outset and then
imported the data exported from the US7ASCII database; presumably the import
would then have converted that non-ASCII character to the proper UTF8
encoding ... would it?

Anyway, I did it the way described above ( i.e.: create a separate US7ASCII
db, import, then convert the character set to UTF8 ) as a dry run of what
will be done later to the production database ... and to minimise the
downtime required.

Regards,



Sun, 23 May 2004 12:58:27 GMT
 Characters beyond ASCII in database ( US7ASCII and UTF8 character set )

I just tried exporting that table from the US7ASCII database again, this
time with NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8 set, and then
imported the data into the UTF8 database, also with
NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8 set during the import.

That non-ASCII character, the e with an acute, became the letter I.
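
( That looks like the high bit simply being stripped somewhere along the
way: 201 - 128 = 73, and 73 is the ASCII code for 'I'. Easy to verify: )

SELECT CHR( 201 - 128 ) AS stripped FROM dual;   -- returns 'I'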

So my alternative of exporting the US7ASCII data as UTF8 will not work. I
also tried exporting the US7ASCII data as US7ASCII and then importing it as
UTF8, but it would not let me ( I got an IMP error ).

Any other ideas?

Going back to my original questions:

1) Shouldn't Oracle have complained in the first place when someone tried to
insert a character beyond ASCII while the DB character set was US7ASCII?

2) Does anybody know of a way to look for non-ASCII characters in a
specified set of tables ... and then convert them into the proper UTF8
encoding?




Sun, 23 May 2004 13:33:55 GMT
 Characters beyond ASCII in database ( US7ASCII and UTF8 character set )
On Wed, 05 Dec 2001 05:33:55 GMT, Jesus (John) Salvo Jr. wrote:

Quote:
>I just tried exporting that table from the US7ASCII again, but having my
>NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8 ... and then import that
>data into the UTF8 database, also having NLS_LANG=English_Australia.UTF8 and
>NLS_CHAR=UTF8 during the import.

>That non-ASCII character, the e with an acute, became the letter I.

>So my alternative of exporting the US7ASCII data as UTF8 will not work. I
>tried exporting the US7ASCII data as US7ASCII and then importing it as UTF8
>but it does not allow me to ( I got some IMP error ).

We had a similar problem. Our first database was ( badly ) set to US7ASCII,
but accepted inserts of accented text ...
A second database was set to ISO Latin-1, and all the 'é' characters became
'i'.

A workaround was given by Oracle support: we had to modify the third byte in
our dmp file to set that binary value to the correct character set id
( it seems the characters were stored as 8-bit values following the
client-side encoding, without any conversion in the database ).
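
( If you need the numeric id to patch in, NLS_CHARSET_ID will return it; I
believe US7ASCII is id 1, but check it on your own instance: )

SELECT NLS_CHARSET_ID( 'US7ASCII' ) AS us7ascii_id,
       NLS_CHARSET_ID( 'UTF8' )     AS utf8_id
  FROM dual;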

--
Damien



Mon, 24 May 2004 06:33:51 GMT
 Characters beyond ASCII in database ( US7ASCII and UTF8 character set )

Can you give me more info on what you did? In the case of a US7ASCII
database:

1) When you export data from the US7ASCII database, what NLS_LANG and
NLS_CHAR setting did you use?
2) What is the mapping of the third byte in the export file to a charset?
3) When you imported data, what NLS_LANG and NLS_CHAR setting did you use?

So far, I have done the following:

1)
Export Server:   US7ASCII
Export Client:   did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Client:   did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Server:   UTF8

RESULT: Acute E became I

2)
Export Server:   US7ASCII
Export Client:   did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Client:   NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8
Import Server:   UTF8

RESULT: IMP-00016 error

3)
Export Server:   US7ASCII
Export Client:   NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8
Import Client:   did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Server:   UTF8

RESULT: Acute E became I

4)
Export Server:   US7ASCII
Export Client:   NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8
Import Client:   NLS_LANG=English_Australia.UTF8 and NLS_CHAR=UTF8
Import Server:   UTF8

RESULT: Acute E became I

I would have thought that setting NLS_LANG / NLS_CHAR during the export
would have set the "third byte" in the export file.

Thanks,

John


Quote:

> We had a similar problem. Our first database was ( badly ) set to US7ASCII,
> but accepted inserts of accented text ...
> A second database was set to ISO Latin-1, and all the 'é' characters became
> 'i'.

> A workaround was given by Oracle support: we had to modify the third byte in
> our dmp file to set that binary value to the correct character set id
> ( it seems the characters were stored as 8-bit values following the
> client-side encoding, without any conversion in the database ).

> --
> Damien



Mon, 24 May 2004 07:07:24 GMT
 


 