COPY BINARY file format proposal 
Author Message
 COPY BINARY file format proposal

Grumble, I forgot about COPY WITH OIDS.  Amend that proposal as follows:

... We should use two different
magic numbers depending on whether OIDs are included in the dump or not.

If OIDs are included in the dump, the OID field immediately follows the
field-count word.  It is a normal field except that it's not included
in the field-count.  In particular it has a typlen --- this will allow
handling of 4-byte vs 8-byte OIDs without too much pain, and will allow
OIDs to be shown as NULL if we someday allow OIDs to be optional.

                        regards, tom lane



Mon, 26 May 2003 08:58:34 GMT
 COPY BINARY file format proposal

Quote:

> I'd prefer to see a single magic number for all binary COPY output, then a
> few bytes of header including a version number, and flags to indicate
> endianness, OIDs etc. It seems a lot cleaner than overloading the magic
> number.

OK, we can do it that way.  I'm still going to pick a magic number that
looks different depending on endianness, however ;-).

What might we need in the header besides a version indicator and a
has-OIDs flag?

Quote:
> Also, IIRC part of the problem with text-based COPY is that we can't
> specify field order (I think this affectes dumping the regression DB).
> Would it be possible to add the ability to (a) specify field order, and (b)
> dump a subset of fields?

This is not an issue for the file format, but for the COPY command itself.
And considering we're in beta now (or as soon as Marc gets the tarball
made, anyway) I'm going to call that a new feature and say it should
wait for 7.2.

                        regards, tom lane



Mon, 26 May 2003 10:38:11 GMT
 COPY BINARY file format proposal

Quote:

>> I'd prefer to see a single magic number for all binary COPY output, then a
>> few bytes of header including a version number, and flags to indicate
>> endianness, OIDs etc. It seems a lot cleaner than overloading the magic
>> number.

>OK, we can do it that way.  I'm still going to pick a magic number that
>looks different depending on endianness, however ;-).

What does the smiley mean in this context? I hope you're not serious...or
if you are, I'd be interested to know why.

Quote:
>What might we need in the header besides a version indicator and a
>has-OIDs flag?

Just of the top of my head, some things that could be there in the future:

- floating point representation (for portability)

- flag for compressed or uncompressed toast fields (I assume you dump them
uncompressed?)

- version number may be important if we dump a subset of fields (ie. we'll
need to store the field names somewhere).

I really have no idea what might be there, but it seems prudent to do it
this way.

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \

Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/



Mon, 26 May 2003 10:43:15 GMT
 COPY BINARY file format proposal

Quote:

>> OK, we can do it that way.  I'm still going to pick a magic number that
>> looks different depending on endianness, however ;-).
> What does the smiley mean in this context?

Just thinking that the only way an endianness flag inside the header
would be useful is if we pick a magic number that's a bytewise
palindrome.

Quote:
> - floating point representation (for portability)

Specified how?  (For that matter, determined how?)

Quote:
> - flag for compressed or uncompressed toast fields (I assume you dump them
> uncompressed?)

Yes, I want COPY to force 'em to uncompressed so as to avoid problems
with cross-version changes of compression algorithm.  (Right at the
moment it gets that wrong.)

Quote:
> - version number may be important if we dump a subset of fields (ie. we'll
> need to store the field names somewhere).

No we don't.  ASCII COPY format doesn't store field names either ... at
least not as part of the data stream ... and should not IMHO.  Don't you
want to be able to reload into a table that you've changed the column
names of?

                        regards, tom lane



Mon, 26 May 2003 10:46:19 GMT
 COPY BINARY file format proposal

Quote:

>>> OK, we can do it that way.  I'm still going to pick a magic number that
>>> looks different depending on endianness, however ;-).

>> What does the smiley mean in this context?

>Just thinking that the only way an endianness flag inside the header
>would be useful is if we pick a magic number that's a bytewise
>palindrome.

You could just read the 1st, 2nd, 3rd, etc bytes and require that they be
'P', 'G', 'C', 'P', 'Y' or some such. I *think* reading five bytes and
doing a strcmp works...ie. don't rely on the integer value, use a string.

Quote:
>> - floating point representation (for portability)

>Specified how?  (For that matter, determined how?)

I'd recommend a crystal ball. You did ask a question about the future ;-}.

Quote:
>> - flag for compressed or uncompressed toast fields (I assume you dump them
>> uncompressed?)

>Yes, I want COPY to force 'em to uncompressed so as to avoid problems
>with cross-version changes of compression algorithm.  (Right at the
>moment it gets that wrong.)

Sounds reasonable, but there could be an advantage in allowing a binary
compressed dump for short-term work.

Quote:
>> - version number may be important if we dump a subset of fields (ie. we'll
>> need to store the field names somewhere).

>No we don't.  ASCII COPY format doesn't store field names either ... at
>least not as part of the data stream ... and should not IMHO.  Don't you
>want to be able to reload into a table that you've changed the column
>names of?

This is essential if we ever allow subsets of columns - even if it is only
for displaying information to the user. If I dump 5 out of 7 columns then
rename half of them, I'd say I'm asking for trouble. At least with the
names available, you have a chance of working out what goes where. But
again, without copy-a-subset-of-columns, this also requires a crystal ball.

It all gets back to whether it's a good idea to overload a magic number.

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \

Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/



Mon, 26 May 2003 10:48:25 GMT
 COPY BINARY file format proposal

Quote:
>Grumble, I forgot about COPY WITH OIDS.  Amend that proposal as follows:

>... We should use two different
>magic numbers depending on whether OIDs are included in the dump or not.

I'd prefer to see a single magic number for all binary COPY output, then a
few bytes of header including a version number, and flags to indicate
endianness, OIDs etc. It seems a lot cleaner than overloading the magic
number.

Also, IIRC part of the problem with text-based COPY is that we can't
specify field order (I think this affectes dumping the regression DB).
Would it be possible to add the ability to (a) specify field order, and (b)
dump a subset of fields?

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \

Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/



Mon, 26 May 2003 10:49:19 GMT
 COPY BINARY file format proposal

Quote:

>> Just thinking that the only way an endianness flag inside the header
>> would be useful is if we pick a magic number that's a bytewise
>> palindrome.
> You could just read the 1st, 2nd, 3rd, etc bytes and require that they be
> 'P', 'G', 'C', 'P', 'Y' or some such. I *think* reading five bytes and
> doing a strcmp works...ie. don't rely on the integer value, use a string.

Oh.  We could use a string instead of an integer, I suppose, although
I'm not sure I see the point for what's basically a binary format.

Given all that, here is a proposed spec for the header:

First 8 bytes: signature, ASCII "PGBCOPY\0" --- note that the null is a
required part of the signature.  (This is to catch files that have been
munged by a non-8-bit-clean transfer.)

Next 4 bytes: integer layout field.  This consists of the int32 constant
0x0A820D0A expressed in the source machine's endianness.  (Again, value
chosen with malice aforethought, to catch files munged by things like
DOS/Unix newline conversion or high-bit-stripping.)  Potentially, a
reader could engage in byte-flipping of subsequent fields if the wrong
byte order is detected here.

Next 4 bytes: version number, currently 1 (expressed in source machine's
endianness, as are all subsequent integer fields).  A reader should
abort if it does not recognize the version number.

Next 4 bytes: length of remainder of header, not including self.  In
the initial version this will be zero, and the first tuple follows
immediately.  Future changes to the format might allow additional data
to be present in the header.  A reader should silently ignore any header
extension data it does not know what to do with.

This allows for both backwards-compatible header additions (extend the
header without changing the version number) and non-backwards-compatible
changes (bump the version number).

Since we don't yet know what we might do about the issue of
floating-point format, I left that out of the spec.  It can be added to
the header extension area when and if we figure out how to do it.

Likewise, addons such as column names are also punted until later.

Comments?

                        regards, tom lane



Tue, 27 May 2003 03:40:00 GMT
 COPY BINARY file format proposal

Quote:
> I will not complain about sticking a "version 1.0" field into a format
> when there is no real intention of changing it in the future ... but
> assigning deep significance to major/minor numbers, or something like

I assume the version would be the COPY format version, not the
PostgreSQL version.

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Fri, 30 May 2003 07:51:00 GMT
 COPY BINARY file format proposal

Quote:
> Also, IIRC part of the problem with text-based COPY is that we can't
> specify field order (I think this affectes dumping the regression DB).
> Would it be possible to add the ability to (a) specify field order, and (b)
> dump a subset of fields?

Informix does this nicely:

        UNLOAD TO "file"
        SELECT *
        FROM tab

Merging COPY and SELECT has some real advantages.  You can specify
columns, parts of a table using WHERE, and even joins.  Very flexible.

Perhaps, if the table name is missing from COPY, we can allow a SELECT:

        COPY TO 'file'
        SELECT *
        FROM tab

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Fri, 30 May 2003 07:52:26 GMT
 
 [ 9 post ] 

 Relevant Pages 

1. COPY BINARY file format proposal

2. COPY BINARY file format proposal

3. Postgresql binary copy file format

4. Decoding the binary format of Ingres copy-out data

5. Informix v2.x binary file format

6. DLL to copy binary files into SQL Server as a stored procedure

7. HLP: Copying binary file to field

8. help info for binary file format for sql micrisoft

9. Oracle export file binary format

10. store chr(0) in database or microsoft binary file format

11. VB Binary File Format


 
Powered by phpBB® Forum Software