Discussion:
zlib, gzip and HTTP compression.
Alan Kennedy
2002-01-11 19:38:29 UTC
Permalink
don't use zlib;
HTTP requires data in gzip format, and zlib.compress returns quite a
different structure.
so import gzip ...
Thanks JJ.

OK, I understand the problem better now. I read RFC 1952, which
explains the structure of gzip files. And looking at
python/Lib/gzip.py, it appears to construct exactly the structure
required.

So I reworked my code to use gzip, and I'm almost there. The first
~200 bytes of the HTML file now appear exactly as they should, but
then it corrupts after that. Obviously the mechanism for communicating
from the server to the client that I am sending gzipped data is
working, but it looks like I'm sending gzipped data that is slightly
corrupt, or I'm telling the client the wrong length, or some such.
Close, but no cigar!

There is obviously some small detail that I am missing, such as
character translation during the print statement(?), one extra byte
needed somewhere, etc?

Any hints anyone?

The new code is presented below.

TIA,

Alan.

------------------------------------------
#! C:/python21/python.exe

import string
import os
import gzip
import StringIO

# Use any old HTML file (which displays fine standalone)

f = open("test.html")
buf = f.read()
f.close()

def compressBuf(buf):
    zbuf = StringIO.StringIO()
    zfile = gzip.GzipFile(None, 'wb', 9, zbuf)
    zfile.write(buf)
    zfile.close()
    return zbuf.getvalue()

acceptsGzip = 0
if string.find(os.environ["HTTP_ACCEPT_ENCODING"], "gzip") != -1:
    acceptsGzip = 1
    zbuf = compressBuf(buf)

print "Content-type: text/html"
if acceptsGzip:
    print "Content-Encoding: gzip"
    print "Content-Length: %d" % (len(zbuf))
    print # end of headers
    print zbuf # and then the buffer
else:
    print # end of headers
    print buf # and then the buffer

# end of script
------------------------------------------------------
Michael Ströder
2002-01-12 15:25:18 UTC
Permalink
rmunn at pobox.com (Robin Munn) wrote in message
I have no idea if this might be causing your problem, but print adds a
newline after whatever it prints. What happens if you use
sys.stdout.write() instead?
I tried that, but unfortunately to no avail.
See my other posting for the compression issues. But IMHO you should
never ever use print statements in web applications. It makes life
easier to use file objects from the very beginning (think of writing
simple wrappers for CGI-BIN, SimpleHTTPServer, FastCGI, mod_python,
etc.). It also makes it easier to pipe the output through other file
objects like gzip.GzipFile.
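As a rough illustration of that point (a minimal sketch, not taken
from any real package; run_app and the HTML string are made up):
-------------------
import sys, gzip

def run_app(out):
    # The application only ever writes to the file-like object it is handed.
    out.write("<html><body>Hello</body></html>")

# The wrapper then decides what that object really is: sys.stdout itself,
# or a gzip.GzipFile layered on top of it. (HTTP headers and binary stdout
# mode still have to be handled by the wrapper before this point.)
gz = gzip.GzipFile(None, 'wb', 9, sys.stdout)
run_app(gz)
gz.close()
-------------------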
When outputting the HTTP headers, they should be separated by CR-LF,
or "\r\n" in a python string, so print is definitely doing the right
thing there.
IMHO this depends on the underlying OS. You should output the line
separator '\r\n' manually in your application.

Ciao, Michael.
Alan Kennedy
2002-01-12 13:00:29 UTC
Permalink
rmunn at pobox.com (Robin Munn) wrote in message
I have no idea if this might be causing your problem, but print adds a
newline after whatever it prints. What happens if you use
sys.stdout.write() instead?
I tried that, but unfortunately to no avail.

When outputting the HTTP headers, they should be separated by CR-LF,
or "\r\n" in a python string, so print is definitely doing the right
thing there. When it comes to the HTTP body, I don't think print does
any translation, although things might get confused if it added a
CRLF after the HTTP body.

But to be sure to be sure, I changed

-------------------
print "Content-Encoding: gzip"
print "Content-Length: %d" % (len(zbuf))
print # end of headers
print zbuf # and then the buffer
-------------------

to

-------------------
sys.stdout.write("Content-Encoding: gzip\r\n")
sys.stdout.write("Content-Length: %d\r\n" % (len(zbuf)))
sys.stdout.write("\r\n") # end of headers
sys.stdout.write(zbuf) # and then the buffer
-------------------

But I got exactly the same results with both.

It was definitely worth a try.

I'm going to get to the bottom of this!

optimistical-ly yrs :-)

Alan.
Neil Schemenauer
2002-01-12 19:23:01 UTC
Permalink
Post by Alan Kennedy
I'm going to get to the bottom of this!
It might help to look at the Quixote source. It supports gzip
compression. See:

http://www.mems-exchange.org/software/quixote/

You want to look at the compress_output method in publish.py. Here is
the main piece of code in case you don't feel like downloading it:

_gzip_header = ("\037\213"          # magic
                "\010"              # compression method
                "\000"              # flags
                "\000\000\000\000"  # time, who cares?
                "\002"
                "\377")

def compress_output (self, request, output):
    import zlib, struct

    encoding = request.get_encoding(["gzip", "x-gzip"])
    if encoding:
        co = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS,
                              zlib.DEF_MEM_LEVEL, 0)
        chunks = [self._gzip_header,
                  co.compress(output),
                  co.flush(),
                  struct.pack("<ll", zlib.crc32(output), len(output))]
        output = "".join(chunks)
        request.response.set_header("Content-Encoding", encoding)
    return output
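
A quick way to sanity-check a stream built this way (a throwaway
sketch, not part of Quixote; the helper name is made up) is to feed it
back through the gzip module and compare:

import gzip, StringIO

def is_valid_gzip(data, original):
    # Decompress the candidate gzip stream and check that it round-trips
    # back to the original bytes.
    f = gzip.GzipFile(None, 'rb', 9, StringIO.StringIO(data))
    ok = (f.read() == original)
    f.close()
    return ok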

HTH,

Neil
Alan Kennedy
2002-01-11 11:37:27 UTC
Permalink
I thought the tag was Content-transfer-encoding but I'm not sure and
don't feel like looking it up right now.
No, "Content-transfer-encoding" is a MIME header, not a HTTP header.
The HTTP headers are "Content-encoding" and "Transfer-encoding",
according to RFC2616.
You -could- find the apache mod_gzip module linked from apache.org
and compare it to your cgi.
I read extensively on the mod_gzip mailing lists, and everything that
I'm doing seems to agree with any information there.

I was hoping that someone here might have done this before(and maybe
even got it working ;-). That might save me having to figure out the
algorithms from the 11000+ lines of C source that makes up
mod_gzip.......

The only thing I can think of is that python zlib.compress is not
compatible with gzip compression as used in HTTP. Can anyone confirm
or deny that?

TIA,

Alan.
Paul Rubin
2002-01-11 22:59:57 UTC
Permalink
Post by Alan Kennedy
The only thing I can think of is that python zlib.compress is not
compatible with gzip compression as used in HTTP. Can anyone confirm
or deny that?
I think someone already answered: they are not the same.
Alan Kennedy
2002-01-12 13:49:24 UTC
Permalink
Content-Length must be the length of the original, before compression.
So try
print "Content-Length: %d" % os.path.getsize("test.html")
Thanks again for the suggestion JJ.

However, my reading of RFC2616 tells me otherwise. Though I could be
wrong, since there is much room for interpretation in HTTP specs.

http://www.ietf.org/rfc/rfc2616.txt
From Section: 7.2.2 Entity Length
The entity-length of a message is the length of the message-body
before any transfer-codings have been applied. Section 4.4 defines
how the transfer-length of a message-body is determined.

AK> So the "entity-length" is the uncompressed length of the file
From Section: 4.4 Message Length
The transfer-length of a message is the length of the message-body
as it appears in the message; that is, after any transfer-codings
have been applied. When a message-body is included with a message,
the transfer-length of that body is determined by one of the
following (in order of precedence):

AK> And the "transfer-length" is the compressed length of the file,
AK> i.e. the length as it appears in the message

1.Any response message which "MUST NOT" include a message-body
(such as the 1xx, 204, and 304 responses
[elided, not relevant]

2.If a Transfer-Encoding header field (section 14.41) is present
[elided, not relevant]

3.If a Content-Length header field (section 14.13) is present,
its decimal value in OCTETs represents both the entity-length
and the transfer-length. The Content-Length header field MUST NOT
be sent if these two lengths are different (i.e., if a
Transfer-Encoding header field is present). If a message is
received with both a Transfer-Encoding header field and a
Content-Length header field, the latter MUST be ignored.

AK> So according to this, since the "entity-length" and the
AK> "transfer-length" are different in this case, I shouldn't be
AK> sending a "Content-length" at all! (Which I tried, and it didn't
AK> work)

4.If the message uses the media type "multipart/byteranges",
[elided, not relevant]

5.By the server closing the connection. (Closing the conn...
[elided, not relevant]

For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
containing a message-body MUST include a valid Content-Length
[elided, relevant to requests only, not responses]

All HTTP/1.1 applications that receive entities MUST accept the
"chunked" transfer-coding (section 3.6), thus allowing this
mechanism to be used for messages when the message length cannot
be determined in advance.

Messages MUST NOT include both a Content-Length header field
and a non-identity transfer-coding. If the message does include
a non-identity transfer-coding, the Content-Length MUST be ignored.

AK> This appears to agree with point 3 above about not sending a
AK> Content-length when the "entity-length" and the "transfer-length"
AK> are different.

When a Content-Length is given in a message where a message-body
is allowed, its field value MUST exactly match the number of OCTETs
in the message-body. HTTP/1.1 user agents MUST notify the user when
an invalid length is received and detected.

AK> This seems to me the conclusive statement. If "Content-length"
AK> is present, it MUST represent the length of the message body, i.e.
AK> the compressed length of the file.

I think that there is confusion around the interpretation of
"Content-length" because of a lack of clarity about the difference
between "Content-encoding" and "Transfer-encoding".

"Content-encoding" is supposed to represent the inherent encoding
of the entity (i.e. file) being transferred. Most likely it was
intended to communicate that a file was being sent which was
compressed in some way, and which exists permanently in
compressed format.

"Transfer-encoding" is supposed to be a transient thing, lasting
only for the duration of the tx/rx of the HTTP message. That is,
it is a mechanism for temporarily encoding (compressing) a file
purely so that it can be transmitted safely or using less
bandwidth.

The difficulty comes in deciding when to use Transfer-encoding
or Content-encoding. For example, if I dynamically generate an HTML
"file" and want to send it compressed, is the compression
inherent to the nature of the "file", or is it merely a transient
thing, intended purely to save bandwidth in transmitting the file?

Of course, the answer to this question is decided by the actual
interpretation that people have made in writing their software.
The general consensus seems to be that "Content-encoding" is the
way to go. "Transfer-encoding" seems not to be used.
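
To put that consensus concretely, the headers for a gzipped response
would then look something like this (the length value is illustrative,
and is the size of the gzipped body, not of the original file):

-------------------
Content-Type: text/html
Content-Encoding: gzip
Content-Length: 2417

<gzipped message-body>
-------------------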

Or I could be wrong :-)

But of course, none of this helps me with my current problem, since
I have tried all possible combinations of Content-encoding,
Transfer-encoding, Content-length, no Content-length, etc,
etc, etc, etc.

I'm not quite sure what to try next.

I think I am either

1. Sending the wrong compressed data
2. Sending correct compressed data in a way that results in
corruption
3. Missing out on some further processing that must be conducted on
the message.

I will get to the bottom of this.

And I will document it so no-one will have to go through this hassle
again.

Regards,

Alan.
Robin Munn
2002-01-11 21:53:25 UTC
Permalink
Post by Alan Kennedy
don't use zlib;
HTTP requires data in gzip format, and zlib.compress returns quite a
different structure.
so import gzip ...
Thanks JJ.
OK, I understand the problem better now. I read RFC 1952, which
explains the structure of gzip files. And looking at
python/Lib/gzip.py, it appears to construct exactly the structure
required.
So I reworked my code to use gzip, and I'm almost there. The first
~200 bytes of the HTML file now appear exactly as they should, but
then it corrupts after that. Obviously the mechanism for communicating
from the server to the client that I am sending gzipped data is
working, but it looks like I'm sending gzipped data that is slightly
corrupt, or I'm telling the client the wrong length, or some such.
Close, but no cigar!
There is obviously some small detail that I am missing, such as
character translation during the print statement(?), one extra byte
needed somewhere, etc?
Any hints anyone?
The new code is presented below.
TIA,
Alan.
[snip code]

I have no idea if this might be causing your problem, but print adds a
newline after whatever it prints. What happens if you use
sys.stdout.write() instead?
--
Robin Munn
rmunn at pobox.com
jj
2002-01-11 09:49:31 UTC
Permalink
don't use zlib;
HTTP requires data in gzip format, and zlib.compress returns quite a
different structure.
so import gzip ...

JJ


On 10 Jan 2002 12:27:42 -0800, alanmk at hotmail.com (Alan Kennedy)
Greetings all,
I'm trying to send compressed HTML over a HTTP 1.1 connection, and I
just cannot get it to work. I'm trying to get it working with
mod_python, but the same thing happens with CGI, so I've reproduced it
with a CGI script, to simplify the problem.
I wonder if anyone might be able to indicate to me what is wrong with
the CGI code below?
------------------------------------
#! C:/python21/python.exe
import string
import os
import zlib
# Use any old HTML file (which displays fine standalone)
f = open("test.html")
buf = f.read()
f.close()
zbuf = zlib.compress(buf)
acceptsGzip = 0
if string.find(os.environ["HTTP_ACCEPT_ENCODING"], "gzip") != -1:
    acceptsGzip = 1
print "Content-type: text/html"
if acceptsGzip:
    print "Content-encoding: gzip"
    print "Content-length: %d" % (len(zbuf))
    print # end of headers
    print zbuf # and then the buffer
else:
    print # end of headers
    print buf # and then the buffer
# End of script
------------------------------------
So far I've tried
1. Using "Content-encoding: x-gzip"
2. Using "Transfer-encoding: gzip"
3. Using "Transfer-encoding: x-gzip"
4. Setting the "Content-length" to the compressed length of the buffer
5. Setting the "Content-length" to the uncompressed length of the
buffer
6. All possibilities of case, i.e. "Content-Encoding",
"CONTENT-ENCODING", "content-encoding", etc (which shouldn't matter,
it should be case insensitive)
and all combinations of the above, but to no avail.
IE5.0, which is a HTTP 1.1 client, displays an empty HTML page with
the above code. When I use "Content-encoding: x-gzip",
"Transfer-encoding: gzip" or "Transfer-encoding: x-gzip", IE5.0
displays the compressed content (i.e. gobbledegook).
NN4.5, which is a HTTP 1.0 client which accepts gzip encoding, tells
me "A communications error has occurred. Please Try again".
It's getting maddening now :-( It should be simple, shouldn't it?
Is python zlib.compress not compatible with gzip? If not, are there
any Python library routines that are?
I'm using Apache 1.3.22 and Python 2.1.1
TIA for any help offered.
Sincerely,
Alan.
Michael Ströder
2002-01-12 15:27:01 UTC
Permalink
Content-Length must be the length of the original, before compression.
So try
print "Content-Length: %d" % os.path.getsize("test.html")
Note that HTTP 1.1 keep-alives (which require a Content-Length header)
make unbuffered streaming output of unknown length impossible,
because you have to know the length in advance to produce the
correct header value.

Ciao, Michael.
Alan Kennedy
2002-01-12 23:08:16 UTC
Permalink
Andrew,

Yes, that was exactly the problem. I am running on win2k and winnt4.
By adding the lines

import msvcrt
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

The entire problem went away!

I never realised that stdout was doing character translation. So using

sys.stdout.write(compressedBuf)

thinking that this was giving me a raw binary write, and bypassing
possible "print" related character translation, was the wrong
approach.
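
For the record, a minimal sketch of how the fix slots in (assuming
zbuf already holds the gzip-compressed body produced by the
compressBuf() function from the earlier script):

-------------------
import sys, os, msvcrt

# Put stdout into binary mode so that LF bytes inside the gzipped body
# are not translated to CRLF by the C runtime (Windows only).
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

sys.stdout.write("Content-Type: text/html\r\n")
sys.stdout.write("Content-Encoding: gzip\r\n")
sys.stdout.write("Content-Length: %d\r\n" % len(zbuf))
sys.stdout.write("\r\n")  # end of headers
sys.stdout.write(zbuf)    # raw gzipped bytes, now written untranslated
-------------------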

So the character translation was happening on the file descriptor
itself? Is that standard C behaviour? Has it been that long since I
coded in C? :-) Even then, I never did any stdio coding on dos/win32.

After many hours of struggling with this, I hadn't really gotten any
further. I had discovered that some files compressed and transmitted
correctly with the code that I had, whereas others did not. I'm sure
now that if I go back, I'll find that the gzipped files that didn't
work contained 0x0A.

Whew! That's a relief!

Thanks to all who helped out with hints, tips, suggestions and
solutions.

Sincerely,

Alan.
Post by Alan Kennedy
There is obviously some small detail that I am missing, such as
character translation during the print statement(?), one extra byte
needed somewhere, etc?
...
Post by Alan Kennedy
------------------------------------------
#! C:/python21/python.exe
...
From the above I'd guess that you're using Cygwin. On Windows (MSVC or
Cygwin) or OS/2 (VACPP or EMX/gcc), stdout (which is what print uses) will
be in text mode - ie newline characters get translated to CRLF. If your
compressed data has a newline character (LF) in it, this will result in a
corrupted output stream.
You'll need to find a way to change stdout to binary mode ('wb'). At the
moment I don't know of a practical way to do this with either OS/2 port,
and have no information about the Win32 ports.
Martin v. Loewis
2002-01-13 21:14:58 UTC
Permalink
Post by Alan Kennedy
So the character translation was happening on the file descriptor
itself? Is that standard C behaviour?
Standard C says that stdout is a text stream, see 7.19.3p7:

# At program startup, three text streams are predefined and need not
# be opened explicitly: standard input (for reading conventional
# input), standard output (for writing conventional output), and
# standard error (for writing diagnostic output).

That, in turn, means that system-specific things may happen to it
(7.19.2p2):

# A text stream is an ordered sequence of characters composed into
# lines, each line consisting of zero or more characters plus a
# terminating new-line character. Whether the last line requires a
# terminating new-line character is implementation-defined. Characters
# may have to be added, altered, or deleted on input and output to
# conform to differing conventions for representing text in the host
# environment. Thus, there need not be a one-to-one correspondence
# between the characters in a stream and those in the external
# representation. Data read in from a text stream will necessarily
# compare equal to the data that were earlier written out to that
# stream only if: the data consist only of printing characters and the
# control characters horizontal tab and new-line; no new-line
# character is immediately preceded by space characters; and the last
# character is a new-line character. Whether space characters that are
# written out immediately before a new-line character appear when read
# in is implementation-defined.
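
A tiny Python illustration of that translation on Windows (the file
name here is just for the example):

-------------------
# In text mode, "\n" written from Python becomes "\r\n" on disk.
f = open("newline_demo.txt", "w")   # text mode
f.write("a\nb")
f.close()

f = open("newline_demo.txt", "rb")  # binary mode shows what was really written
print repr(f.read())                # prints 'a\r\nb'
f.close()
-------------------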

Regards,
Martin

Andrew MacIntyre
2002-01-14 10:23:58 UTC
Permalink
Post by Alan Kennedy
Yes, that was exactly the problem. I am running on win2k and winnt4.
By adding the lines
import msvcrt
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
The entire problem went away!
;-) glad you found a way to deal with this.

....
Post by Alan Kennedy
So the character translation was happening on the file descriptor
itself? Is that standard C behaviour? Has it been that long since I
coded in C? :-) Even then, I never did any stdio coding on dos/win32.
It's been that way since the earliest days of C compilers (or more
correctly, their libraries) on DOS, AFAIK, not to mention other non-Unix
OSes.

--
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: andymac at bullseye.apana.org.au | Snail: PO Box 370
andymac at pcug.org.au | Belconnen ACT 2616
Web: http://www.andymac.org/ | Australia
Alan Kennedy
2002-01-10 20:27:42 UTC
Permalink
Greetings all,

I'm trying to send compressed HTML over a HTTP 1.1 connection, and I
just cannot get it to work. I'm trying to get it working with
mod_python, but the same thing happens with CGI, so I've reproduced it
with a CGI script, to simplify the problem.

I wonder if anyone might be able to indicate to me what is wrong with
the CGI code below?

------------------------------------
#! C:/python21/python.exe

import string
import os
import zlib

# Use any old HTML file (which displays fine standalone)

f = open("test.html")
buf = f.read()
f.close()
zbuf = zlib.compress(buf)

acceptsGzip = 0
if string.find(os.environ["HTTP_ACCEPT_ENCODING"], "gzip") != -1:
    acceptsGzip = 1

print "Content-type: text/html"
if acceptsGzip:
    print "Content-encoding: gzip"
    print "Content-length: %d" % (len(zbuf))
    print # end of headers
    print zbuf # and then the buffer
else:
    print # end of headers
    print buf # and then the buffer

# End of script
------------------------------------


So far I've tried

1. Using "Content-encoding: x-gzip"
2. Using "Transfer-encoding: gzip"
3. Using "Transfer-encoding: x-gzip"
4. Setting the "Content-length" to the compressed length of the buffer
5. Setting the "Content-length" to the uncompressed length of the
buffer
6. All possibilities of case, i.e. "Content-Encoding",
"CONTENT-ENCODING", "content-encoding", etc (which shouldn't matter,
it should be case insensitive)

and all combinations of the above, but to no avail.

IE5.0, which is a HTTP 1.1 client, displays an empty HTML page with
the above code. When I use "Content-encoding: x-gzip",
"Transfer-encoding: gzip" or "Transfer-encoding: x-gzip", IE5.0
displays the compressed content (i.e. gobbledegook).

NN4.5, which is a HTTP 1.0 client which accepts gzip encoding, tells
me "A communications error has occurred. Please Try again".

It's getting maddening now :-( It should be simple, shouldn't it?

Is python zlib.compress not compatible with gzip? If not, are there
any Python library routines that are?

I'm using Apache 1.3.22 and Python 2.1.1

TIA for any help offered.

Sincerely,

Alan.
Paul Rubin
2002-01-11 06:06:54 UTC
Permalink
I thought the tag was Content-transfer-encoding but I'm not sure and
don't feel like looking it up right now.

You -could- find the apache mod_gzip module linked from apache.org
and compare it to your cgi.
Michael Ströder
2002-01-12 15:00:46 UTC
Permalink
I'm trying to send compressed HTML over a HTTP 1.1 connection, and I
just cannot get it to work.
Well, it also took me quite a while to figure out the following
issues:
1. Make sure the HTTP header, including the empty line separating
the header from the body, is not compressed.
2. Make sure the compression module you use does not write any
compression header to the output stream before the HTTP body starts.
The compression header MUST start exactly at the first byte of the
body.
3. Use the right HTTP header lines:
Content-Encoding: gzip
Vary: Accept-Encoding
4. If you really use HTTP 1.1 features with HTTP keep-alives you have
to set the Content-Length: header to an appropriate value. (A rough
sketch pulling these four points together follows this list.)
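
A rough sketch of a CGI script covering points 1-4 (illustrative only;
the HTML body and the exact header set are just examples, and binary
stdout mode is assumed where the platform needs it):

-------------------
import os, sys, gzip, StringIO

body = "<html><body>Hello, world</body></html>"

if os.environ.get("HTTP_ACCEPT_ENCODING", "").find("gzip") != -1:
    # Compress only the body (point 1); gzip.GzipFile writes its gzip
    # header at the very first byte of what becomes the HTTP body (point 2).
    sio = StringIO.StringIO()
    gz = gzip.GzipFile(None, 'wb', 9, sio)
    gz.write(body)
    gz.close()
    body = sio.getvalue()
    sys.stdout.write("Content-Encoding: gzip\r\n")       # point 3
sys.stdout.write("Vary: Accept-Encoding\r\n")            # point 3
sys.stdout.write("Content-Type: text/html\r\n")
sys.stdout.write("Content-Length: %d\r\n" % len(body))   # point 4
sys.stdout.write("\r\n")  # the empty separator line is never compressed (point 1)
sys.stdout.write(body)
-------------------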

BTW http://web2ldap.de has this feature. I'm using a class derived
from gzip.GzipFile in module msgzip as a file object for streamed
output. The headers are set by pyweblib.httphelper.SendHeader()
(Note PyWebLib is a separate package. Hmm, I should clean up some
things...).

You might also wanna look into a multi-language package called
cgibuffer or something, which inspired my activities in this regard.
Note that it buffers the whole HTTP body before compressing it en
bloc, and therefore does not have the problem of a compression header
being sent somewhere in the middle of the stream.

However, you will find that it does not make sense to use this in
all cases, and that some browsers still choke on some data (e.g.
Netscape Navigator 4.x when delivering a JPEG image with
Content-Encoding: gzip).

Ciao, Michael.
Andrew MacIntyre
2002-01-12 07:01:53 UTC
Permalink
Post by Alan Kennedy
There is obviously some small detail that I am missing, such as
character translation during the print statement(?), one extra byte
needed somewhere, etc?
...
Post by Alan Kennedy
------------------------------------------
#! C:/python21/python.exe
...
From the above I'd guess that you're using Cygwin. On Windows (MSVC or
Cygwin) or OS/2 (VACPP or EMX/gcc), stdout (which is what print uses) will
be in text mode - ie newline characters get translated to CRLF. If your
compressed data has a newline character (LF) in it, this will result in a
corrupted output stream.

You'll need to find a way to change stdout to binary mode ('wb'). At the
moment I don't know of a practical way to do this with either OS/2 port,
and have no information about the Win32 ports.

--
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: andymac at bullseye.apana.org.au | Snail: PO Box 370
andymac at pcug.org.au | Belconnen ACT 2616
Web: http://www.andymac.org/ | Australia
jj
2002-01-11 22:55:01 UTC
Permalink
Content-Length must be the length of the original, before compression.
So try

print "Content-Length: %d" % os.path.getsize("test.html")

JJ
Post by Alan Kennedy
don't use zlib;
HTTP requires data in gzip format, and zlib.compress returns quite a
different structure.
so import gzip ...
Thanks JJ.
OK, I understand the problem better now. I read RFC 1952, which
explains the structure of gzip files. And looking at
python/Lib/gzip.py, it appears to construct exactly the structure
required.
So I reworked my code to use gzip, and I'm almost there. The first
~200 bytes of the HTML file now appear exactly as they should, but
then it corrupts after that. Obviously the mechanism for communicating
from the server to the client that I am sending gzipped data is
working, but it looks like I'm sending gzipped data that is slightly
corrupt, or I'm telling the client the wrong length, or some such.
Close, but no cigar!
There is obviously some small detail that I am missing, such as
character translation during the print statement(?), one extra byte
needed somewhere, etc?
Any hints anyone?
The new code is presented below.
TIA,
Alan.
------------------------------------------
#! C:/python21/python.exe
import string
import os
import gzip
import StringIO
# Use any old HTML file (which displays fine standalone)
f = open("test.html")
buf = f.read()
f.close()
def compressBuf(buf):
    zbuf = StringIO.StringIO()
    zfile = gzip.GzipFile(None, 'wb', 9, zbuf)
    zfile.write(buf)
    zfile.close()
    return zbuf.getvalue()
acceptsGzip = 0
if string.find(os.environ["HTTP_ACCEPT_ENCODING"], "gzip") != -1:
    acceptsGzip = 1
    zbuf = compressBuf(buf)
print "Content-type: text/html"
if acceptsGzip:
    print "Content-Encoding: gzip"
    print "Content-Length: %d" % (len(zbuf))
    print # end of headers
    print zbuf # and then the buffer
else:
    print # end of headers
    print buf # and then the buffer
# end of script
------------------------------------------------------