COPY BINARY file format proposal 

Well, no one seemed very unhappy at the idea of changing the file format
for binary COPY, so here is a proposal.

The objectives of this change are:

1. Get rid of the tuple count at the front of the file.  Computing the
count requires an extra pass over the relation, which is a lot more
trouble than the count is worth.  Use an explicit EOF marker instead.
2. Send fields of a tuple individually, instead of dumping out raw tuples
(complete with alignment padding and so forth) as is currently done.
This is mainly to simplify TOAST-related processing.
3. Make the format somewhat self-identifying, so that the reader has at
least some chance of detecting it when the data doesn't match the table
it's supposed to be loaded into.

The proposed format consists of a file header, zero or more tuples, and a
file trailer.

The file header will just be a 32-bit magic number; it's present so that a
reader can reject non-COPY-binary input data, as well as detect problems
like incompatible endianness.  (We could also use changes in the magic
number as a flag for future format changes.)
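
To illustrate how a reader might use the header, here is a minimal sketch
in Python.  The magic value 0x01020304 and the name check_header are
assumptions made for this example only; the proposal does not fix a
value.  The sketch assumes the writer emits the magic number in its own
native byte order, so a byte-swapped match suggests the file came from a
machine of the opposite endianness.

```python
import struct

# Hypothetical magic value; the proposal does not specify one.
MAGIC = 0x01020304


def check_header(data: bytes) -> str:
    """Classify the first four bytes of a candidate COPY BINARY file."""
    if len(data) < 4:
        return "truncated"
    # "=I" reads a 32-bit unsigned int in the reader's native byte order.
    (native,) = struct.unpack("=I", data[:4])
    if native == MAGIC:
        return "ok"
    # Reverse the four bytes to see what an opposite-endian writer produced.
    (swapped,) = struct.unpack("=I", data[:4][::-1])
    if swapped == MAGIC:
        return "wrong-endianness"
    return "not-copy-binary"
```

Picking a magic number whose bytes are not a palindrome is what makes the
byte-swapped case distinguishable from random input.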

Each tuple begins with an int16 count of the number of fields in the
tuple.  (Presently, all tuples in a table will have the same count, but
that might not always be true.)  Then, repeated for each field in the
tuple, there is an int16 typlen word possibly followed by field data.
The typlen field is interpreted thus:

        Zero            Field is NULL.  No data follows.

        > 0             Field is a fixed-length datatype.  Exactly N
                        bytes of data follow the typlen word, where N
                        is the typlen value.

        -1              Field is a varlena datatype.  The next four
                        bytes are the varlena header, which contains
                        the total value length including itself.

        < -1            Reserved for future use.

For non-NULL fields, the reader can check that the typlen matches the
expected typlen for the destination column.  This provides a simple
but very useful check that the data is as expected.
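
The typlen rules above can be sketched in Python as follows.  This is an
illustration under stated assumptions, not the backend code: the names
encode_field and decode_field are hypothetical, values are handled as raw
bytes, and all integers are written in the machine's native byte order
with no padding.

```python
import struct


def encode_field(typlen, value):
    """Encode one field. typlen is the column's typlen; value is bytes or None."""
    if value is None:
        return struct.pack("=h", 0)              # zero typlen word means NULL
    if typlen > 0:
        if len(value) != typlen:
            raise ValueError("fixed-length value has wrong size")
        return struct.pack("=h", typlen) + value
    if typlen == -1:
        # The varlena header holds the total length, including its own 4 bytes.
        return struct.pack("=hi", -1, len(value) + 4) + value
    raise ValueError("typlen values below -1 are reserved")


def decode_field(buf, pos, expected_typlen):
    """Decode one field at buf[pos:]; return (value or None, new position)."""
    (typlen,) = struct.unpack_from("=h", buf, pos)
    pos += 2
    if typlen == 0:
        return None, pos                         # NULL, no data follows
    if typlen < -1:
        raise ValueError("reserved typlen value")
    if typlen != expected_typlen:
        raise ValueError("typlen does not match destination column")
    if typlen > 0:
        return buf[pos:pos + typlen], pos + typlen
    (total,) = struct.unpack_from("=i", buf, pos)
    if total < 4:
        raise ValueError("corrupt varlena header")
    return buf[pos + 4:pos + total], pos + total
```

Note that the typlen comparison is skipped for NULL fields, since a NULL
carries no information about the column's datatype.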

There is no alignment padding or any other extra data between fields.
Note also that the format does not distinguish whether a datatype is
pass-by-reference or pass-by-value.  Both of these provisions are
deliberate: they might help improve portability of the files (although
of course endianness and floating-point-format issues can still keep
you from moving a binary file across machines).

The file trailer consists of an int16 word containing -1.  This is
easily distinguished from a tuple's field-count word, which is never
negative.

A reader should report an error if a field-count word is neither -1
nor the expected number of columns.  This provides a pretty strong
check against somehow getting out of sync with the data.
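
A reader loop that applies these checks might look like the Python sketch
below.  The name read_tuples is hypothetical, the input is assumed to be
the bytes that follow the file header, values are returned as raw bytes,
and integers are assumed to be in native byte order.

```python
import struct


def read_tuples(body, typlens):
    """Read tuples from body until the -1 trailer.

    typlens lists the expected typlen of each destination column.
    """
    rows, pos = [], 0
    while True:
        if pos + 2 > len(body):
            raise ValueError("input ended without a trailer")
        (nfields,) = struct.unpack_from("=h", body, pos)
        pos += 2
        if nfields == -1:
            return rows                          # file trailer reached
        if nfields != len(typlens):
            raise ValueError("unexpected field count %d" % nfields)
        row = []
        for expected in typlens:
            (typlen,) = struct.unpack_from("=h", body, pos)
            pos += 2
            if typlen == 0:
                row.append(None)                 # NULL field
            elif typlen != expected:
                raise ValueError("typlen mismatch")
            elif typlen > 0:
                row.append(body[pos:pos + typlen])
                pos += typlen
            else:
                # Varlena: the 4-byte header counts itself in the total.
                (total,) = struct.unpack_from("=i", body, pos)
                row.append(body[pos + 4:pos + total])
                pos += total
        rows.append(row)
```

Because every tuple restates its field count, a reader that has drifted
out of sync will almost always hit a wrong count or a typlen mismatch
within a tuple or two rather than silently loading garbage.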

Comments?

                        regards, tom lane



Mon, 26 May 2003 08:17:49 GMT
 