Discussion:
BadZipfile "file is not a zip file"
webcomm
2009-01-09 00:47:39 UTC
The error...
file = zipfile.ZipFile('data.zip', "r")
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    file = zipfile.ZipFile('data.zip', "r")
  File "C:\Python25\lib\zipfile.py", line 346, in __init__
    self._GetContents()
  File "C:\Python25\lib\zipfile.py", line 366, in _GetContents
    self._RealGetContents()
  File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file

When I look at data.zip in Windows, it appears to be a valid zip
file. I am able to uncompress it in Windows XP, and can also
uncompress it with 7-Zip. It looks like zipfile is not able to read a
"table of contents" in the zip file. That's not a concept I'm
familiar with.

data.zip is created in this script...

decoded = base64.b64decode(datum)
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
file = zipfile.ZipFile('data.zip', "r")

datum is a base64 encoded zip file. Again, I am able to open data.zip
as if it's a valid zip file. Maybe there is something wrong with the
approach I've taken to writing the data to data.zip? I'm not sure if
it matters, but the zipped data is Unicode.

What would cause a zip file to not have a table of contents? Is there
some way I can add a table of contents to a zip file using python?
Maybe there is some more fundamental problem with the data that is
making it seem like there is no table of contents?

Thanks in advance for your help.
Ryan
MRAB
2009-01-09 01:02:44 UTC
Post by webcomm
The error...
file = zipfile.ZipFile('data.zip', "r")
File "<pyshell#23>", line 1, in <module>
file = zipfile.ZipFile('data.zip', "r")
File "C:\Python25\lib\zipfile.py", line 346, in __init__
self._GetContents()
File "C:\Python25\lib\zipfile.py", line 366, in _GetContents
self._RealGetContents()
File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents
raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
When I look at data.zip in Windows, it appears to be a valid zip
file. I am able to uncompress it in Windows XP, and can also
uncompress it with 7-Zip. It looks like zipfile is not able to read a
"table of contents" in the zip file. That's not a concept I'm
familiar with.
data.zip is created in this script...
decoded = base64.b64decode(datum)
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
file = zipfile.ZipFile('data.zip', "r")
datum is a base64 encoded zip file. Again, I am able to open data.zip
as if it's a valid zip file. Maybe there is something wrong with the
approach I've taken to writing the data to data.zip? I'm not sure if
it matters, but the zipped data is Unicode.
What would cause a zip file to not have a table of contents? Is there
some way I can add a table of contents to a zip file using python?
Maybe there is some more fundamental problem with the data that is
making it seem like there is no table of contents?
You're just creating a file called "data.zip". That doesn't make it a
zip file. A zip file has a specific format. If the file doesn't have
that format then the zipfile module will complain.
webcomm
2009-01-09 01:28:17 UTC
Post by MRAB
You're just creating a file called "data.zip". That doesn't make it a
zip file. A zip file has a specific format. If the file doesn't have
that format then the zipfile module will complain.
Hmm. When I open it in Windows or with 7-Zip, it contains a text file
that has the data I would expect it to have. I guess that alone
doesn't necessarily prove it's a zip file?

datum is something I'm downloading via a web service. The providers
of the service say it's a zip file, and have provided a code sample in
C# (which I know nothing about) that shows how to deal with it. In
the code sample, the file is base64 decoded and then unzipped. I'm
trying to write something in Python to decode and unzip the file.

I checked the file for comments and it has none. At least, when I
view the properties in Windows, there are no comments.
James Mills
2009-01-09 01:39:50 UTC
Post by webcomm
Hmm. When I open it in Windows or with 7-Zip, it contains a text file
that has the data I would expect it to have. I guess that alone
doesn't necessarily prove it's a zip file?
datum is something I'm downloading via a web service. The providers
of the service say it's a zip file, and have provided a code sample in
C# (which I know nothing about) that shows how to deal with it. In
the code sample, the file is base64 decoded and then unzipped. I'm
trying to write something in Python to decode and unzip the file.
Send us a sample of this file in question...

cheers
James
webcomm
2009-01-09 01:44:45 UTC
Post by James Mills
Send us a sample of this file in question...
It contains data that I can't share publicly. I could ask the
providers of the service if they have a dummy file I could use that
doesn't contain any real data, but I don't know how responsive they'll
be. It's an event registration service called RegOnline.
MRAB
2009-01-09 01:54:54 UTC
Post by webcomm
Post by MRAB
You're just creating a file called "data.zip". That doesn't make it
a zip file. A zip file has a specific format. If the file doesn't
have that format then the zipfile module will complain.
Hmm. When I open it in Windows or with 7-Zip, it contains a text
file that has the data I would expect it to have. I guess that alone
doesn't necessarily prove it's a zip file?
datum is something I'm downloading via a web service. The providers
of the service say it's a zip file, and have provided a code sample
in C# (which I know nothing about) that shows how to deal with it.
In the code sample, the file is base64 decoded and then unzipped.
I'm trying to write something in Python to decode and unzip the file.
I checked the file for comments and it has none. At least, when I
view the properties in Windows, there are no comments.
Ah, OK. You didn't explicitly say in your original posting that the
decoded data was definitely zipfile data. There was a thread a month ago
about gzip Unix commands which could also handle non-gzipped files and I
was wondering whether this problem was something like that. Have you
tried gzip instead?
webcomm
2009-01-09 02:56:14 UTC
Have you tried gzip instead?
There's no option to download the data in a gzipped format. The files
are .zip archives.
webcomm
2009-01-09 22:00:36 UTC
Post by James Mills
Send us a sample of this file in question...
Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using Scott's function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
webcomm
2009-01-09 22:04:15 UTC
Post by webcomm
If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
trying again to post the link re: FFFD characters...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
MRAB
2009-01-09 23:13:20 UTC
Post by webcomm
Post by James Mills
Send us a sample of this file in question...
http://webcomm.webfactional.com/htdocs/data.zip
That's the zip created in this line of my code...
f = open('data.zip', 'wb')
If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character
If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using Scott's function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
I can unzip it in Windows XP. The file within it (called "data") is XML
encoded as UTF-16LE (2 bytes per character, low byte first), but without
the initial byte order mark. Python's zipfile module says "BadZipfile:
File is not a zip file".
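
A small illustration of why UTF-16LE text shows up with a \x00 after every character when it is treated as single-byte text. This is only a sketch in Python 3 (the thread uses Python 2.5); the helper name decode_utf16le and the sample string are made up for the example and are not taken from the real file.

```python
def decode_utf16le(raw):
    """Decode bytes known to be UTF-16LE; drop a BOM if one happens to be present."""
    text = raw.decode("utf-16-le")
    return text[1:] if text.startswith("\ufeff") else text


# Hypothetical stand-in for the "data" member inside the archive.
sample = '<row id="1">ok</row>'
raw = sample.encode("utf-16-le")      # 2 bytes per character, low byte first, no BOM

# Read as one byte per character, every second character is a NUL.
print(repr(raw.decode("latin-1")[:8]))

# Decoded with the matching codec, the text comes back unchanged.
print(decode_utf16le(raw) == sample)
```

Because the file carries no byte order mark, nothing in the data announces the encoding, so the codec has to be named explicitly.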
MRAB
2009-01-09 23:52:02 UTC
Post by MRAB
Post by webcomm
Post by James Mills
Send us a sample of this file in question...
http://webcomm.webfactional.com/htdocs/data.zip
That's the zip created in this line of my code...
f = open('data.zip', 'wb')
If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character
If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using Scott's function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
I can unzip it in Windows XP. The file within it (called "data") is XML
encoded as UTF-16LE (2 bytes per character, low byte first), but without
the initial byte order mark. Python's zipfile module says "BadZipfile:
File is not a zip file".
If I strip off all but the last 4 zero-bytes then the zipfile module can
open it:

decoded = base64.b64decode(datum)
five_zeros = chr(0) * 5
while decoded.endswith(five_zeros):
    decoded = decoded[:-1]
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')
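
A variation on the same idea, as a hedged sketch in Python 3: instead of counting zero bytes, find the end-of-archive record ("PK\005\006", 22 bytes plus a declared comment length) and cut the data off right after it. The function names are invented for this example. It assumes the last occurrence of the signature really is the end-of-archive record, which could be wrong if the trailing junk or the comment happens to contain those four bytes. Later versions of the zipfile module appear to be more tolerant of trailing data, so this may not be needed there.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"   # end-of-central-directory record
EOCD_SIZE = 22                   # size of the fixed part of that record


def trim_trailing_bytes(data):
    """Return data cut off right after the end-of-archive record and its comment."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no end-of-central-directory record found")
    # The last 2-byte field of the record is the declared comment length.
    (comment_size,) = struct.unpack("<H", data[pos + 20:pos + 22])
    return data[:pos + EOCD_SIZE + comment_size]


def open_padded_zip(data):
    """Open zip bytes that may carry extra padding after the archive."""
    return zipfile.ZipFile(io.BytesIO(trim_trailing_bytes(data)))
```

Used on the decoded web-service payload, this would look like open_padded_zip(base64.b64decode(datum)), with no temporary file on disk.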
Steven D'Aprano
2009-01-09 08:16:17 UTC
Post by webcomm
The error...
...
Post by webcomm
BadZipfile: File is not a zip file
When I look at data.zip in Windows, it appears to be a valid zip file.
I am able to uncompress it in Windows XP, and can also uncompress it
with 7-Zip. It looks like zipfile is not able to read a "table of
contents" in the zip file. That's not a concept I'm familiar with.
No, ZipFile can read table of contents:

Help on method printdir in module zipfile:

printdir(self) unbound zipfile.ZipFile method
Print a table of contents for the zip file.



In my experience, zip files originating from Windows sometimes have
garbage at the end of the file. WinZip just ignores the garbage, but
other tools sometimes don't -- if I recall correctly, Linux unzip
successfully unzips the file but then complains that the file was
corrupt. It's possible that you're running into a similar problem.
Post by webcomm
data.zip is created in this script...
decoded = base64.b64decode(datum)
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
file = zipfile.ZipFile('data.zip', "r")
datum is a base64 encoded zip file. Again, I am able to open data.zip
as if it's a valid zip file. Maybe there is something wrong with the
approach I've taken to writing the data to data.zip? I'm not sure if it
matters, but the zipped data is Unicode.
The full signature of ZipFile is:

ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True)

Try passing compression=zipfile.ZIP_DEFLATED and/or allowZip64=False and
see if that makes any difference.

The zip format does support alternative compression methods, it's
possible that this particular file uses a different sort of compression
which Python doesn't deal with.
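
One way to check the compression-method theory without opening the archive through zipfile is to read the method code straight from the first local file header. The following is a minimal Python 3 sketch; first_entry_method is a hypothetical helper, and it assumes the data begins with a local file header, as the sample discussed in this thread does. Note that, per the zipfile documentation, the compression argument to ZipFile selects the method used when writing, so it is unlikely to change how an existing archive is read.

```python
import struct

# Method codes from the zip format; only the two common ones are named here.
METHOD_NAMES = {0: "stored", 8: "deflated"}


def first_entry_method(data):
    """Return the compression method code from the first local file header."""
    if data[:4] != b"PK\x03\x04":
        raise ValueError("data does not start with a local file header")
    # signature (4s), version needed (H), flags (H), compression method (H)
    _sig, _version, _flags, method = struct.unpack("<4sHHH", data[:10])
    return method


def describe_method(data):
    """Human-readable name for the first entry's compression method."""
    method = first_entry_method(data)
    return METHOD_NAMES.get(method, "other (code %d)" % method)
```

A result other than "stored" or "deflated" would point towards a method the module may not support; either of those two would rule this explanation out.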
Post by webcomm
What would cause a zip file to not have a table of contents?
What makes you think it doesn't have one?
--
Steven
Carl Banks
2009-01-09 08:46:27 UTC
On Jan 9, 2:16 am, Steven D'Aprano <st... at REMOVE-THIS-
Post by webcomm
The error...
...
Post by webcomm
BadZipfile: File is not a zip file
When I look at data.zip in Windows, it appears to be a valid zip file.
I am able to uncompress it in Windows XP, and can also uncompress it
with 7-Zip. It looks like zipfile is not able to read a "table of
contents" in the zip file. That's not a concept I'm familiar with.
    printdir(self) unbound zipfile.ZipFile method
        Print a table of contents for the zip file.
In my experience, zip files originating from Windows sometimes have
garbage at the end of the file. WinZip just ignores the garbage, but
other tools sometimes don't -- if I recall correctly, Linux unzip
successfully unzips the file but then complains that the file was
corrupt. It's possible that you're running into a similar problem.
The zipfile format is kind of brain dead, you can't tell where the end
of the file is supposed to be by looking at the header. If the end of
file hasn't yet been reached there could be more data. To make
matters worse, somehow zip files came to have text comments simply
appended to the end of them. (Probably this was for the benefit of
people who would cat them to the terminal.)

Anyway, if you see something that doesn't adhere to the zipfile
format, you don't have any foolproof way to know if it's because the
file is corrupted or if it's just an appended comment.

Most zipfile readers use a heuristic to distinguish. Python's zipfile
module just assumes it's corrupted.

The following post from a while back gives a solution that tries to
snip the comment off so that zipfile module can handle it. It might
help you out.

http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543


Carl Banks
Steven D'Aprano
2009-01-09 10:09:58 UTC
Post by Carl Banks
On Jan 9, 2:16 am, Steven D'Aprano <st... at REMOVE-THIS-
Post by webcomm
The error...
...
Post by webcomm
BadZipfile: File is not a zip file
When I look at data.zip in Windows, it appears to be a valid zip
file. I am able to uncompress it in Windows XP, and can also
uncompress it with 7-Zip. It looks like zipfile is not able to read
a "table of contents" in the zip file. That's not a concept I'm
familiar with.
    printdir(self) unbound zipfile.ZipFile method
        Print a table of contents for the zip file.
In my experience, zip files originating from Windows sometimes have
garbage at the end of the file. WinZip just ignores the garbage, but
other tools sometimes don't -- if I recall correctly, Linux unzip
successfully unzips the file but then complains that the file was
corrupt. It's possible that you're running into a similar problem.
The zipfile format is kind of brain dead, you can't tell where the end
of the file is supposed to be by looking at the header. If the end of
file hasn't yet been reached there could be more data. To make matters
worse, somehow zip files came to have text comments simply appended to
the end of them. (Probably this was for the benefit of people who would
cat them to the terminal.)
Anyway, if you see something that doesn't adhere to the zipfile format,
you don't have any foolproof way to know if it's because the file is
corrupted or if it's just an appended comment.
Yes, this has led to a nice little attack vector, using hostile Java
classes inside JAR files (a variant of ZIP).

http://www.infoworld.com/article/08/08/01/A_photo_that_can_steal_your_online_credentials_1.html

or http://snipurl.com/9oh0e
--
Steven
John Machin
2009-01-09 10:42:32 UTC
Post by Carl Banks
On Jan 9, 2:16 am, Steven D'Aprano <st... at REMOVE-THIS-
Post by webcomm
The error...
...
Post by webcomm
BadZipfile: File is not a zip file
When I look at data.zip in Windows, it appears to be a valid zip file.
I am able to uncompress it in Windows XP, and can also uncompress it
with 7-Zip. It looks like zipfile is not able to read a "table of
contents" in the zip file. That's not a concept I'm familiar with.
    printdir(self) unbound zipfile.ZipFile method
        Print a table of contents for the zip file.
In my experience, zip files originating from Windows sometimes have
garbage at the end of the file. WinZip just ignores the garbage, but
other tools sometimes don't -- if I recall correctly, Linux unzip
successfully unzips the file but then complains that the file was
corrupt. It's possible that you're running into a similar problem.
The zipfile format is kind of brain dead, you can't tell where the end
of the file is supposed to be by looking at the header. If the end of
file hasn't yet been reached there could be more data. To make
matters worse, somehow zip files came to have text comments simply
appended to the end of them. (Probably this was for the benefit of
people who would cat them to the terminal.)
Anyway, if you see something that doesn't adhere to the zipfile
format, you don't have any foolproof way to know if it's because the
file is corrupted or if it's just an appended comment.
Most zipfile readers use a heuristic to distinguish. Python's zipfile
module just assumes it's corrupted.
The following post from a while back gives a solution that tries to
snip the comment off so that zipfile module can handle it. It might
help you out.
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
And here's a little gadget that might help the diagnostic effort; it
shows the archive size and the position of all the "magic" PKnn
markers. In a "normal" uncommented archive, EndArchive_pos + 22 ==
archive_size.
8<---
# usage: python zip_susser.py name_of_archive.zip
import sys
grimoire = [
    ("FileHeader", "PK\003\004"), # magic number for file header
    ("CentralDir", "PK\001\002"), # magic number for central directory
    ("EndArchive", "PK\005\006"), # magic number for end of archive record
    ("EndArchive64", "PK\x06\x06"), # magic token for Zip64 header
    ("EndArchive64Locator", "PK\x06\x07"), # magic token for locator header
    ]
f = open(sys.argv[1], 'rb')
buff = f.read()
f.close()
blen = len(buff)
print "archive size is", blen
for magic_name, magic in grimoire:
    pos = 0
    while pos < blen:
        pos = buff.find(magic, pos)
        if pos < 0:
            break
        print "%s at %d" % (magic_name, pos)
        pos += 4
8<---

HTH,
John
webcomm
2009-01-09 15:22:16 UTC
Post by John Machin
And here's a little gadget that might help the diagnostic effort; it
shows the archive size and the position of all the "magic" PKnn
markers. In a "normal" uncommented archive, EndArchive_pos + 22 ==
archive_size.
I ran the diagnostic gadget...

archive size is 69888
FileHeader at 0
CentralDir at 43796
EndArchive at 43846
John Machin
2009-01-09 22:21:55 UTC
Post by webcomm
Post by John Machin
And here's a little gadget that might help the diagnostic effort; it
shows the archive size and the position of all the "magic" PKnn
markers. In a "normal" uncommented archive, EndArchive_pos + 22 ==
archive_size.
I ran the diagnostic gadget...
archive size is 69888
FileHeader at 0
CentralDir at 43796
EndArchive at 43846
Thanks. Would you mind spending a few minutes more on this so that we
can see if it's a problem that can be fixed easily, like the one that
Chris Mellon reported?

The above output says that there are 43868 (43846 + 22) bytes of
useable data. That leaves 69888 - 43868 = 26020 bytes of "comment" ...
rather large for a comment. Have you run a virus scanner over this
file?
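
The arithmetic above, written out as a tiny Python 3 sketch. The function name is made up, and the declared comment size is assumed to be 0 for the first archive, since the first gadget does not report it.

```python
EOCD_SIZE = 22  # fixed size of the end-of-archive record, in bytes


def trailing_byte_count(archive_size, eocd_pos, declared_comment_size=0):
    """Bytes left over after the end-of-archive record and its declared comment."""
    return archive_size - (eocd_pos + EOCD_SIZE + declared_comment_size)


# Figures reported by the first gadget for the real-data archive.
print(trailing_byte_count(69888, 43846))  # 26020
```

In a normal archive with no comment, the result would be 0.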

At the end is an updated version of the diagnostic gadget. It explores
the "EndArchive" structure and the comment at the end, with a special
check for all '\0' (as per Chris's bug report) and another for all
blank. Please run it over your file and show us the results. Note: you
may want to suppress the display of the first 100 bytes of comment if
it turns out to be private data.

Cheers,
John

8<---
# zip_susser_v2.py
import sys
grimoire = [
    ("FileHeader", "PK\003\004"), # magic number for file header
    ("DataDescriptor", "PK\x07\x08"), # see PKZIP APPNOTE (V) (C)
    ("CentralDir", "PK\001\002"), # magic number for central directory
    ("EndArchive", "PK\005\006"), # magic number for end of archive record
    ("EndArchive64", "PK\x06\x06"), # magic token for Zip64 header
    ("EndArchive64Locator", "PK\x06\x07"), # magic token for locator header
    ("ArchiveExtraData", "PK\x06\x08"), # APPNOTE (V) (E)
    ("DigitalSignature", "PK\x05\x05"), # APPNOTE (V) (F)
    ]
f = open(sys.argv[1], 'rb')
buff = f.read()
f.close()
blen = len(buff)
print "archive size is", blen
for magic_name, magic in grimoire:
    pos = 0
    while pos < blen:
        pos = buff.find(magic, pos)
        if pos < 0:
            break
        print "%s at %d" % (magic_name, pos)
        pos += 4
#
# find what is in the EndArchive struct
#
structEndArchive = "<4s4H2LH" # 9 [sic] items, end of archive, 22 bytes
import struct
posEndArchive = buff.find("PK\005\006")
print "using posEndArchive =", posEndArchive
assert 0 < posEndArchive < blen
endArchive = struct.unpack(structEndArchive,
                           buff[posEndArchive:posEndArchive+22])
print "endArchive:", repr(endArchive)
endArchiveFieldNames = """
    signature
    this_disk_num
    central_dir_disk_num
    central_dir_this_disk_num_entries
    central_dir_overall_num_entries
    central_dir_size
    central_dir_offset
    comment_size
""".split()
for name, value in zip(endArchiveFieldNames, endArchive):
    print "%33s : %r" % (name, value)
#
# inspect the comment
#
actual_comment_size = blen - 22 - posEndArchive
expected_comment_size = endArchive[7]
comment = buff[posEndArchive + 22:]
print
print "expected_comment_size:", expected_comment_size
print "actual_comment_size:", actual_comment_size
print "comment is all spaces:", comment == ' ' * actual_comment_size
print "comment is all '\\0':", comment == '\0' * actual_comment_size
print "comment (first 100 bytes):", repr(comment[:100])
8<---
webcomm
2009-01-09 22:52:24 UTC
Post by John Machin
Thanks. Would you mind spending a few minutes more on this so that we
can see if it's a problem that can be fixed easily, like the one that
Chris Mellon reported?
Don't mind at all. I'm now working with a zip file with some dummy
data I downloaded from the web service. You'll notice it's a smaller
archive than the one I was working with when I ran zip_susser.py, but
it has the same problem (whatever the problem is). It's the one I
uploaded to http://webcomm.webfactional.com/htdocs/data.zip

Here's what I get when I run zip_susser_v2.py...

archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
signature : 'PK\x05\x06'
this_disk_num : 0
central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
central_dir_overall_num_entries : 1
central_dir_size : 50
central_dir_offset : 844
comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'

Not sure if you've seen this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#

Thanks,
Ryan
John Machin
2009-01-10 00:33:11 UTC
Post by John Machin
Thanks. Would you mind spending a few minutes more on this so that we
can see if it's a problem that can be fixed easily, like the one that
Chris Mellon reported?
Don't mind at all. I'm now working with a zip file with some dummy
data I downloaded from the web service. You'll notice it's a smaller
archive than the one I was working with when I ran zip_susser.py, but
it has the same problem (whatever the problem is).
You mean it produces the same symptom. The zipfile.py has several
paths to the symptom i.e. the uninformative "bad zipfile" exception;
we don't know which path, yet. That's why Martin was suggesting that
you debug the sucker; that's why I'm trying to do it for you by remote
control. It is not impossible for a file with dummy data to have been
handcrafted or otherwise produced by a process different to that used
for a real-data file. Please run v2 of the gadget on the real-data zip
and report the results.
It's the one I
uploaded to http://webcomm.webfactional.com/htdocs/data.zip
Here's what I get when I run zip_susser_v2.py...
archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
                        signature : 'PK\x05\x06'
                    this_disk_num : 0
             central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
  central_dir_overall_num_entries : 1
                 central_dir_size : 50
               central_dir_offset : 844
                     comment_size : 0
expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0?0\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0?0\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0?0\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0?0\x00
\x00\x00\x00\x00\x00\x00\x00'
Not sure if you've seen this thread...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
with one thread ...
webcomm
2009-01-10 19:32:01 UTC
Post by John Machin
It is not impossible for a file with dummy data to have been
handcrafted or otherwise produced by a process different to that used
for a real-data file.
I knew it was produced by the same process, or I wouldn't have shared
it. : )
But you couldn't have known that.
Post by John Machin
Not sure if you've seen this thread...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
with one thread ...
Thanks... I thought I was posting about separate issues and would
annoy people who were only interested in one of the issues if I put
them both in the same thread. I guess all posts re: the same script
should go in one thread, even if the questions posed may be unrelated
and may be separate issues. There are grey areas.

Problem solved in John Machin's post at
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864/03b8341539d87989?hl=en&lnk=raot#03b8341539d87989

I'll post the final code when it's prettier.

-Ryan
webcomm
2009-01-12 16:28:50 UTC
If anyone's interested, here are my django views...


from django.shortcuts import render_to_response
from django.http import HttpResponse
from xml.etree.ElementTree import ElementTree
import urllib, base64, subprocess

def get_data(request):
    service_url = 'http://www.something.com/webservices/someservice/etc?user=etc&pass=etc'
    xml = urllib.urlopen(service_url)
    # the base64-encoded string is in a one-element xml doc...
    tree = ElementTree()
    xml_doc = tree.parse(xml)
    datum = ""
    for node in xml_doc.getiterator():
        datum = "%s" % (node.text)
    decoded = base64.b64decode(datum)

    dir = '/path/to/data/'
    f = open(dir+'data.zip', 'wb')
    f.write(decoded)
    f.close()

    file = subprocess.call('unzip '+dir+'data.zip -d '+dir, shell=True)
    file = open(dir+'data', 'rb').read()
    txt = file.decode('utf_16_le')

    return render_to_response('output.html',{
        'output' : txt
    })

def read_xml(request):
    # page using the get_data view
    xml = urllib.urlopen('http://www.something.org/get_data/')
    xml = xml.read()
    xml = unicode(xml)
    xml = '<?xml version="1.0" encoding="UTF-8"?>\n<stuff>'+xml+'</stuff>'

    f = open('/path/to/temp.txt','w')
    f.write(xml)
    f.close()

    tree = ElementTree()
    xml_doc = tree.parse('/path/to/temp.txt')
    datum = ""
    for node in xml_doc.getiterator():
        datum = "%s<br />%s - %s" % (datum, node.tag, node.text)

    return render_to_response('output.html',{
        'output' : datum
    })
Chris Mellon
2009-01-12 16:53:26 UTC
Post by webcomm
Post by John Machin
It is not impossible for a file with dummy data to have been
handcrafted or otherwise produced by a process different to that used
for a real-data file.
I knew it was produced by the same process, or I wouldn't have shared
it. : )
But you couldn't have known that.
Post by John Machin
Not sure if you've seen this thread...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
with one thread ...
Thanks... I thought I was posting about separate issues and would
annoy people who were only interested in one of the issues if I put
them both in the same thread. I guess all posts re: the same script
should go in one thread, even if the questions posed may be unrelated
and may be separate issues. There are grey areas.
Problem solved in John Machin's post at
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864/03b8341539d87989?hl=en&lnk=raot#03b8341539d87989
It's worth pointing out (although the provider probably doesn't care)
that this isn't really an XML document and this was a bad way of them
to distribute the data. If they'd used a correctly formatted XML
document (with the prelude and everything) with the correct encoding
information, existing XML parsers should have just Done The Right
Thing with the data, instead of you needing to know the encoding a
priori to extract an XML fragment.
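
To illustrate the point, here is a minimal Python 3 sketch of what a well-formed document would allow. The document content and the helper name parse_declared_xml are invented for the example; the payload carries both an XML declaration naming the encoding and a byte order mark, and the parser is handed raw bytes without being told the encoding.

```python
import xml.etree.ElementTree as ET


def parse_declared_xml(payload):
    """Parse raw bytes, letting the parser work out the encoding by itself."""
    return ET.fromstring(payload)


# Hypothetical document; the "utf-16" codec writes a byte order mark first.
document = '<?xml version="1.0" encoding="UTF-16"?><stuff><item>caf\u00e9</item></stuff>'
payload = document.encode("utf-16")

root = parse_declared_xml(payload)
print(root.find("item").text)
```

With a bare UTF-16LE fragment and no declaration or byte order mark, as in the file discussed here, the caller has to decode the bytes first and wrap the fragment in a root element.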
webcomm
2009-01-12 20:27:42 UTC
Post by Chris Mellon
Post by webcomm
Post by John Machin
It is not impossible for a file with dummy data to have been
handcrafted or otherwise produced by a process different to that used
for a real-data file.
I knew it was produced by the same process, or I wouldn't have shared
it. : )
But you couldn't have known that.
Post by John Machin
Not sure if you've seen this thread...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
with one thread ...
Thanks... I thought I was posting about separate issues and would
annoy people who were only interested in one of the issues if I put
them both in the same thread. I guess all posts re: the same script
should go in one thread, even if the questions posed may be unrelated
and may be separate issues. There are grey areas.
Problem solved in John Machin's post at
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
It's worth pointing out (although the provider probably doesn't care)
that this isn't really an XML document and this was a bad way of them
to distribute the data. If they'd used a correctly formatted XML
document (with the prelude and everything) with the correct encoding
information, existing XML parsers should have just Done The Right
Thing with the data, instead of you needing to know the encoding a
priori to extract an XML fragment.
Agreed. I can't say I understand their rationale for doing it this way.
Steve Holden
2009-01-13 06:14:57 UTC
[file distribution horror story ...]
Post by webcomm
Post by Chris Mellon
It's worth pointing out (although the provider probably doesn't care)
that this isn't really an XML document and this was a bad way of them
to distribute the data. If they'd used a correctly formatted XML
document (with the prelude and everything) with the correct encoding
information, existing XML parsers should have just Done The Right
Thing with the data, instead of you needing to know the encoding a
priori to extract an XML fragment.
Agreed. I can't say I understand their rationale for doing it this way.
Sadly their rationale is irrelevant to the business of making sense of
the data, which we all hope you have eventually managed to do.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
webcomm
2009-01-09 15:05:25 UTC
Permalink
Post by Carl Banks
The zipfile format is kind of brain dead, you can't tell where the end
of the file is supposed to be by looking at the header. If the end of
file hasn't yet been reached there could be more data. To make
matters worse, somehow zip files came to have text comments simply
appended to the end of them. (Probably this was for the benefit of
people who would cat them to the terminal.)
Anyway, if you see something that doesn't adhere to the zipfile
format, you don't have any foolproof way to know if it's because the
file is corrupted or if it's just an appended comment.
Most zipfile readers use a heuristic to distinguish. Python's zipfile
module just assumes it's corrupted.
The following post from a while back gives a solution that tries to
snip the comment off so that zipfile module can handle it. It might
help you out.
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
Carl Banks
Thanks Carl. I tried Scott's getzip() function yesterday... I
stumbled upon it in my searches. It didn't seem to help in my case,
though it did produce a different error: ValueError, substring not
found. Not sure what that means.
Chris Mellon
2009-01-09 15:14:35 UTC
Permalink
Post by webcomm
Post by Carl Banks
The zipfile format is kind of brain dead, you can't tell where the end
of the file is supposed to be by looking at the header. If the end of
file hasn't yet been reached there could be more data. To make
matters worse, somehow zip files came to have text comments simply
appended to the end of them. (Probably this was for the benefit of
people who would cat them to the terminal.)
Anyway, if you see something that doesn't adhere to the zipfile
format, you don't have any foolproof way to know if it's because the
file is corrupted or if it's just an appended comment.
Most zipfile readers use a heuristic to distinguish. Python's zipfile
module just assumes it's corrupted.
The following post from a while back gives a solution that tries to
snip the comment off so that zipfile module can handle it. It might
help you out.
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
Carl Banks
Thanks Carl. I tried Scott's getzip() function yesterday... I
stumbled upon it in my searches. It didn't seem to help in my case,
though it did produce a different error: ValueError, substring not
found. Not sure what that means.
--
http://mail.python.org/mailman/listinfo/python-list
This is a ticket about another issue or 2 with invalid zipfiles that
the zipfile module won't load, but that other tools will compensate
for:

http://bugs.python.org/issue1757072
webcomm
2009-01-09 15:29:05 UTC
Permalink
Post by Chris Mellon
This is a ticket about another issue or 2 with invalid zipfiles that
the zipfile module won't load, but that other tools will compensate
http://bugs.python.org/issue1757072
Hmm. That's interesting. Are there other tools I can use in a python
script that are more forgiving? I am using the zipfile module only
because it seems to be the most widely used. Are other options in
python likely to be just as unforgiving? Guess I'll look and see...
webcomm
2009-01-09 16:58:14 UTC
Permalink
Post by Chris Mellon
This is a ticket about another issue or 2 with invalid zipfiles that
the zipfile module won't load, but that other tools will compensate
http://bugs.python.org/issue1757072
Looks like I just need to do this to unzip with unix...

from os import popen
popen("unzip data.zip")

That works for me. No idea why I didn't think of that earlier. I'm
new to python but should have realized I could run unix commands with
python. I had blinders on. Now I just need to get rid of some bad
characters in the unzipped file. I'll start a new thread if I need
help with that...
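For anyone following along: as far as I know, os.popen does not wait for the command to finish or tell you whether it failed unless you read from and close the pipe. Below is a minimal sketch using the standard subprocess module instead. It assumes Python 3.7+ and an Info-ZIP style unzip program on PATH; the helper name unzip_with_cli is made up for illustration.

```python
import shutil
import subprocess


def unzip_with_cli(archive, dest="."):
    """Extract `archive` into `dest` using the external `unzip` program.

    Returns (returncode, stdout, stderr) so the caller can decide how to
    treat warnings, e.g. about extra bytes at the end of the archive.
    """
    exe = shutil.which("unzip")
    if exe is None:
        raise RuntimeError("no 'unzip' program found on PATH")
    # -o: overwrite existing files without prompting
    # -d: directory to extract into
    result = subprocess.run(
        [exe, "-o", archive, "-d", dest],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

Something like `rc, out, err = unzip_with_cli("data.zip", "unzipped")` then lets you inspect `rc` and `err` rather than assuming the extraction went cleanly.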
Wesley Brooks
2009-01-09 17:06:28 UTC
Permalink
I missed the beginning of this thread and so apologise if I'm repeating what
someone else has said!

I had a very similar problem with this error and it turned out it was due to
me moving a file across a socket connection and either not reading it or
writing it in the binary mode, i.e. open(filename, 'rb') or open(filename,
'wb'). This didn't make any difference on the Linux machines (where the
error didn't occur) but did on the Windows machines and fixed the problem.

Cheers,

Wes
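A small sketch of the binary-mode point, assuming Python 3. Text mode may translate line endings on Windows, which silently changes the bytes of an archive; opening with 'wb' and 'rb' leaves them alone. The payload and the roundtrip helper are invented for the demonstration.

```python
import os
import tempfile

# Bytes that text-mode handling could alter: CR/LF pairs, Ctrl-Z, and
# every possible byte value, prefixed with a zip-style signature.
payload = b"PK\x05\x06" + b"\r\n\x1a\n" + bytes(range(256))


def roundtrip(data, path):
    """Write `data` and read it back, both in binary mode."""
    with open(path, "wb") as out:   # 'wb': no newline translation
        out.write(data)
    with open(path, "rb") as src:   # 'rb': bytes come back unchanged
        return src.read()


path = os.path.join(tempfile.mkdtemp(), "blob.bin")
assert roundtrip(payload, path) == payload
```

The same applies to data received over a socket: keep it as bytes from recv() all the way to the file.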
Post by webcomm
Post by Chris Mellon
This is a ticket about another issue or 2 with invalid zipfiles that
the zipfile module won't load, but that other tools will compensate
http://bugs.python.org/issue1757072
Looks like I just need to do this to unzip with unix...
from os import popen
popen("unzip data.zip")
That works for me. No idea why I didn't think of that earlier. I'm
new to python but should have realized I could run unix commands with
python. I had blinders on. Now I just need to get rid of some bad
characters in the unzipped file. I'll start a new thread if I need
help with that...
Scott David Daniels
2009-01-09 18:32:54 UTC
Permalink
webcomm
2009-01-09 20:18:35 UTC
Permalink
Post by Scott David Daniels
I'd certainly try to figure out if the archive was mis-handled
somewhere along the way.
Quite possible that I'm mishandling something, or the service provider
is mishandling something. Probably the former. Please see this more
recent thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#
John Machin
2009-01-09 20:48:57 UTC
Permalink
.... I tried Scott's getzip() function yesterday... I
stumbled upon it in my searches. It didn't seem to help in my case,
though it did produce a different error: ValueError, substring not
found. Not sure what that means.
> I ran the diagnostic gadget...
>
> archive size is 69888
> FileHeader at 0
> CentralDir at 43796
> EndArchive at 43846
This is telling you that the archive ends at 43846,
Not quite. """In a "normal" uncommented archive, EndArchive_pos + 22 ==
archive_size."""
but the file
is 69888 bytes long (69888 - 43846 = 26042 post-archive bytes).
Have you tried calling getzip(filename, ignoreable=30000)?
The whole point of the function is to ignore the nasty stuff at the
end, but if _I_ had a file with more than 25K of post-archive bytes,
I'd certainly try to figure out if the archive was mis-handled
somewhere along the way.
Me too. Further, if I wasn't "ever diplomatic" :-), I wouldn't be
calling software (or people!) that blithely ignored 25kb of
unexplained data "forgiving" ... some other f-words, perhaps.
By the way, one reason you cannot find
the archive by looking at the start of the file is that the zip file
format is meant to allow you to append a zip file to another file (such
as an executable) and treat the combination as an archive.
--Scott David Daniels
Scott.Dani... at Acm.Org
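To make the arithmetic above concrete: if the archive has no comment, the end record at 43846 plus its 22 bytes gives an expected size of 43868, which leaves 69888 - 43868 = 26020 unexplained bytes. The sketch below is not Scott's getzip(); it is a rough stand-in, assuming Python 3, that locates the last end-of-archive record, reports how many bytes follow it, and opens only the part before them. The function name is hypothetical, and it will be fooled if the trailing junk itself contains the signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # signature of the end-of-archive record
EOCD_SIZE = 22             # size of that record without its comment


def open_ignoring_trailing_bytes(path):
    """Return (ZipFile, trailing_byte_count) for an archive with junk after it."""
    with open(path, "rb") as f:
        data = f.read()
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise zipfile.BadZipFile("no end-of-archive record found")
    # The last two bytes of the fixed record hold the comment length.
    (comment_len,) = struct.unpack("<H", data[pos + 20:pos + 22])
    end = min(pos + EOCD_SIZE + comment_len, len(data))
    trailing = len(data) - end
    # Hand zipfile only the bytes up to the end of the archive proper.
    return zipfile.ZipFile(io.BytesIO(data[:end])), trailing
```

If `trailing` comes back as something like 26020, that is a strong hint the file was padded or mishandled upstream, and worth chasing before relying on the workaround.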
John Machin
2009-01-09 09:37:50 UTC
Permalink
On Jan 9, 7:16 pm, Steven D'Aprano <st... at REMOVE-THIS-
Post by webcomm
The error...
...
Post by webcomm
BadZipfile: File is not a zip file
When I look at data.zip in Windows, it appears to be a valid zip file.
I am able to uncompress it in Windows XP, and can also uncompress it
with 7-Zip. It looks like zipfile is not able to read a "table of
contents" in the zip file. That's not a concept I'm familiar with.
    printdir(self) unbound zipfile.ZipFile method
        Print a table of contents for the zip file.
In my experience, zip files originating from Windows sometimes have
garbage at the end of the file. WinZip just ignores the garbage, but
other tools sometimes don't -- if I recall correctly, Linux unzip
successfully unzips the file but then complains that the file was
corrupt. It's possible that you're running into a similar problem.
Post by webcomm
data.zip is created in this script...
    decoded = base64.b64decode(datum)
    f = open('data.zip', 'wb')
    f.write(decoded)
    f.close()
    file = zipfile.ZipFile('data.zip', "r")
datum is a base64 encoded zip file. Again, I am able to open data.zip
as if it's a valid zip file. Maybe there is something wrong with the
approach I've taken to writing the data to data.zip? I'm not sure if it
matters, but the zipped data is Unicode.
ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True)
Try passing compression=zipfile.ZIP_DEFLATED and/or allowZip64=False and
see if that makes any difference.
"compression" is irrelevant when reading. The compression method used
is stored on a per-file basis, not on a per-archive basis, and it
hasn't got anywhere near per-file details when that exception is
raised. "allowZip64" has not been used either.
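A quick way to see that the compression method is recorded per member rather than per archive, assuming Python 3 with zlib available. The archive is built in memory and the member names are invented; nothing here touches the poster's data.zip.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Two members in one archive, each with its own compression method.
    zf.writestr("stored.txt", "a" * 100, compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", "a" * 100, compress_type=zipfile.ZIP_DEFLATED)

# No compression argument is given when reading; each ZipInfo entry
# carries the method that was used for that member.
with zipfile.ZipFile(buf) as zf:
    methods = {info.filename: info.compress_type for info in zf.infolist()}

assert methods["stored.txt"] == zipfile.ZIP_STORED
assert methods["deflated.txt"] == zipfile.ZIP_DEFLATED
```

Since these details live in the per-member headers, they only come into play after the end-of-archive record has been found, which is the step that fails here.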
The zip format does support alternative compression methods, it's
possible that this particular file uses a different sort of compression
which Python doesn't deal with.
Post by webcomm
What would cause a zip file to not have a table of contents?
What makes you think it doesn't have one?
--
Steven
webcomm
2009-01-09 14:59:36 UTC
Permalink
On Jan 9, 3:16 am, Steven D'Aprano <st... at REMOVE-THIS-
Post by Steven D'Aprano
ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True)
Try passing compression=zipfile.ZIP_DEFLATED and/or allowZip64=False and
see if that makes any difference.
Those arguments didn't make a difference in my case.
Post by Steven D'Aprano
The zip format does support alternative compression methods, it's
possible that this particular file uses a different sort of compression
which Python doesn't deal with.
Post by webcomm
What would cause a zip file to not have a table of contents?
What makes you think it doesn't have one?
Because when I search for the "file is not a zip file" error in
zipfile.py, there is a function that checks for a table of contents.
Tho it looks like there are other ideas in this thread about what
might cause that error... I'll keep reading...
"Martin v. Löwis"
2009-01-09 08:30:17 UTC
Permalink
Post by webcomm
What would cause a zip file to not have a table of contents?
AFAICT, _EndRecData is failing to find the "end of zipfile" structure in
the file. You might want to debug through it to see where it looks, and how
it decides that this structure is not present in the file. Towards
22 bytes before the end of the file, the bytes PK\005\006 should appear.
If they don't appear, you don't have a zipfile. If they appear, but
elsewhere towards the end of the file, there might be a bug in the
zip file module (or, more likely, the zip file uses an optional zip
feature which the module doesn't implement).

Regards,
Martin
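The check described above can be done by hand without stepping through zipfile.py. This is a rough sketch, assuming Python 3; find_end_record is a hypothetical helper, not part of the zipfile module. It scans only the tail of the file, where the record may sit if a comment of up to 65535 bytes follows it.

```python
import os

EOCD_SIG = b"PK\x05\x06"   # the bytes written as PK\005\006 above
EOCD_SIZE = 22             # fixed size of the end-of-archive record
MAX_COMMENT = 0xFFFF       # longest comment the record can describe


def find_end_record(path):
    """Return the file offset of the last end-of-archive signature, or None."""
    size = os.path.getsize(path)
    window = min(size, EOCD_SIZE + MAX_COMMENT)
    with open(path, "rb") as f:
        f.seek(size - window)
        tail = f.read(window)
    idx = tail.rfind(EOCD_SIG)
    if idx < 0:
        return None
    return size - window + idx
```

For a plain archive with no comment, the result should equal `os.path.getsize(path) - 22`. A smaller offset means extra bytes follow the record, and None means the signature is not in the tail at all.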