Discussion:
CRC-checksum failed in gzip
andrea crotti
2012-08-01 10:39:57 UTC
We're having some really obscure problems with gzip.
There is a program running with python2.7 on a 2.6.18-128.el5xen (red
hat I think) kernel.

Now this program does the following:
if filename == 'out2.txt':
    out2 = open('out2.txt')
elif filename == 'out2.txt.gz':
    out2 = open('out2.txt.gz')

text = out2.read()

out2.close()

Very simple, right? But sometimes we get a checksum error.
Reading the code, I understood the following:

- the CRC is at the end of the file (the last 8 bytes) and is computed
over the whole file
- after the CRC there is the \0000 marker for the EOF
- readline() doesn't trigger the checksum verification at the
beginning, but only when the EOF is reached
- until a file is flushed or closed you can't read the new content in it

but the problem is that we can't reproduce it: doing the same thing
manually on the same files works perfectly,
and the same file sometimes works and sometimes doesn't.

The files are on a shared NFS drive, so I'm starting to think that it's a
network/filesystem problem which might truncate the file,
adding an EOF before the real end and thus making the checksum fail..
But is that possible?
Or what else could it be?
Laszlo Nagy
2012-08-01 10:47:00 UTC
Post by andrea crotti
We're having some really obscure problems with gzip.
There is a program running with python2.7 on a 2.6.18-128.el5xen (red
hat I think) kernel.
out2 = open('out2.txt')
elif filename == 'out2.txt.gz'
out2 = open('out2.txt.gz')
A gzip file is binary. You should open it in binary mode.

out2 = open('out2.txt.gz', 'rb')

Otherwise carriage return and newline characters will be converted (depending on the platform).
andrea crotti
2012-08-01 10:58:10 UTC
Post by Laszlo Nagy
Post by andrea crotti
We're having some really obscure problems with gzip.
There is a program running with python2.7 on a 2.6.18-128.el5xen (red
hat I think) kernel.
out2 = open('out2.txt')
elif filename == 'out2.txt.gz'
out2 = open('out2.txt.gz')
A gzip file is binary. You should open it in binary mode.
out2 = open('out2.txt.gz', 'rb')
Otherwise carriage return and newline characters will be converted
(depending on the platform).
--
http://mail.python.org/mailman/listinfo/python-list
Ah no, sorry, I just wrote that part of the code wrong; it was
out2 = gzip.open('out2.txt.gz'), because otherwise nothing would possibly work..
Laszlo Nagy
2012-08-01 11:11:18 UTC
Post by andrea crotti
very simple right? But sometimes we get a checksum error.
Do you have a traceback showing the actual error?
Post by andrea crotti
- CRC is at the end of the file and is computed against the whole
file (last 8 bytes)
- after the CRC there is the \0000 marker for the EOF
- readline() doesn't trigger the checksum generation in the
beginning, but only when the EOF is reached
- until a file is flushed or closed you can't read the new content in it
How do you write the file? Is it written from another Python program?
Can we see the source code of that?
Post by andrea crotti
but the problem is that we can't reproduce it, because doing it
manually on the same files it works perfectly,
and the same files some time work some time don't work.
The problem might be with the saved file. Once you get an error for a
given file, can you reproduce the error using the same file?
Post by andrea crotti
The files are on a shared NFS drive, I'm starting to think that it's a
network/fs problem, which might truncate the file
adding an EOF before the end and thus making the checksum fail..
But is it possible?
Or what else could it be?
Can your try to run the same program on a local drive?
andrea crotti
2012-08-01 13:01:45 UTC
Full traceback:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/user/sim/python/lib/python2.7/threading.py", line 530, in __bootstrap_inner
    self.run()
  File "/user/sim/tests/llif/AutoTester/src/AutoTester2.py", line 67, in run
    self.processJobData(jobData, logger)
  File "/user/sim/tests/llif/AutoTester/src/AutoTester2.py", line 204, in processJobData
    self.run_simulator(area, jobData[1] ,log)
  File "/user/sim/tests/llif/AutoTester/src/AutoTester2.py", line 142, in run_simulator
    report_file, percentage, body_text = SimResults.copy_test_batch(log, area)
  File "/user/sim/tests/llif/AutoTester/src/SimResults.py", line 274, in copy_test_batch
    out2_lines = out2.read()
  File "/user/sim/python/lib/python2.7/gzip.py", line 245, in read
    self._read(readsize)
  File "/user/sim/python/lib/python2.7/gzip.py", line 316, in _read
    self._read_eof()
  File "/user/sim/python/lib/python2.7/gzip.py", line 338, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x4f675fba != 0xa9e45aL


- The file is written with the Linux gzip program.
- No, I can't reproduce the error with the exact same file that
failed, and that's what is really puzzling:
there seems to be no clear pattern, it just randomly fails. The file
is also only opened for reading by this program,
so in theory there is no way it can get corrupted.

I also checked with lsof whether any processes have it open, but
nothing shows up..

- We can't really try on the local disk, it might take ages unfortunately
(we are rewriting this system from scratch anyway)
Laszlo Nagy
2012-08-01 13:27:26 UTC
- The file is written with the Linux gzip program.
- No, I can't reproduce the error with the exact same file that
failed, and that's what is really puzzling:

How do you make sure that no process is reading the file before it is
fully flushed to disk?

Possible way of testing for this kind of error: before you open a file,
use os.stat to determine its size, and write out the size and the file
path into a log file. Whenever an error occurs, compare the actual size
of the file with the logged value. If they are different, then you have
tried to read from a file that was growing at that time.
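That check might be sketched like this (an illustrative sketch in modern Python; the function and logger name are made up, not from the thread):

```python
import gzip
import logging
import os

log = logging.getLogger("gzip-reads")

def read_with_size_check(path):
    # Record the size before opening; if a CRC error is later reported for
    # this path and the size has changed, the file was growing while we read.
    size_before = os.path.getsize(path)
    log.info("opening %s size=%d", path, size_before)
    with gzip.open(path, "rb") as f:
        data = f.read()
    size_after = os.path.getsize(path)
    if size_after != size_before:
        log.warning("%s changed while being read: %d -> %d bytes",
                    path, size_before, size_after)
    return data
```

Comparing the logged size against the size at error time tells you whether a writer was still active.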

Suggestion: from the other process, write the file into a different file
(for example, "file.gz.tmp"). Once the file is flushed and closed, use
os.rename() to give its final name. On POSIX systems, the rename()
operation is atomic.
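That write-then-rename pattern might be sketched as follows (illustrative; the helper name is made up):

```python
import gzip
import os

def write_gzip_atomically(final_path, payload):
    # Write under a temporary name; readers never see a half-written file.
    tmp_path = final_path + ".tmp"
    with gzip.open(tmp_path, "wb") as f:
        f.write(payload)
    # make sure the bytes are on disk before the rename publishes them
    with open(tmp_path, "rb") as f:
        os.fsync(f.fileno())
    os.rename(tmp_path, final_path)  # atomic on POSIX: old name or new, never partial
```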
Post by andrea crotti
there seems to be no clear pattern and it just randomly fails. The file
is also just open for read from this program,
so in theory no way that it can be corrupted.
Yes, there is. Gzip stores CRC for compressed *blocks*. So if the file
is not flushed to the disk, then you can only read a fragment of the
block, and that changes the CRC.
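That failure mode is easy to demonstrate: write a valid gzip file and chop a few bytes off the end, the way an unflushed or interrupted write would (a sketch, not code from the thread; on modern Python 3 a truncated stream surfaces as EOFError, where Python 2.7 raised the IOError: CRC check failed seen above):

```python
import gzip
import os
import tempfile

def read_truncated_gzip(payload):
    # Write a valid gzip file, truncate its tail, then try to read it back.
    # Returns the name of the exception raised, or None if the read succeeded.
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        with gzip.open(path, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            blob = f.read()
        with open(path, "wb") as f:
            f.write(blob[:-5])  # drop most of the 8-byte CRC/length trailer
        try:
            with gzip.open(path, "rb") as f:
                f.read()
        except (EOFError, IOError) as exc:  # IOError on Python 2.7, EOFError on 3.x
            return type(exc).__name__
        return None
    finally:
        os.remove(path)
```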
Post by andrea crotti
I also checked with lsof if there are processes that opened it but
nothing appears..
lsof doesn't work very well over NFS. You can have other processes on
different computers (!) writing the file. lsof only lists the processes
on the system it is executed on.
Post by andrea crotti
- can't really try on the local disk, might take ages unfortunately
(we are rewriting this system from scratch anyway)
andrea crotti
2012-08-01 13:52:59 UTC
Post by andrea crotti
there seems to be no clear pattern and it just randomly fails. The file
is also just open for read from this program,
so in theory no way that it can be corrupted.
Yes, there is. Gzip stores CRC for compressed *blocks*. So if the file is
not flushed to the disk, then you can only read a fragment of the block, and
that changes the CRC.
Post by andrea crotti
I also checked with lsof if there are processes that opened it but
nothing appears..
lsof doesn't work very well over NFS. You can have other processes on
different computers (!) writing the file. lsof only lists the processes on
the system it is executed on.
Post by andrea crotti
- can't really try on the local disk, might take ages unfortunately
(we are rewriting this system from scratch anyway)
Thanks a lot, someone writing to the file while we read it might be
an explanation; the problem is that everyone claims they are only
reading the file.

Apparently this file is generated once and, a long time after, only read
by two different tools (in sequence), so this shouldn't be possible
either, in theory.. I'll try to investigate more in this direction, since
it's the only reasonable explanation I've found so far.
Laszlo Nagy
2012-08-01 14:17:25 UTC
Post by andrea crotti
Thanks a lotl, someone that writes on the file while reading might be
an explanation, the problem is that everyone claims that they are only
reading the file.
If that is true, then make that file system read only. Soon it will turn
out who is writing them. ;-)
Post by andrea crotti
Apparently this file is generated once and a long time after only read
by two different tools (in sequence), so this could not be possible
either in theory.. I'll try to investigate more in this sense since
it's the only reasonable explation I've found so far.
A safe solution would be to develop a system where files go through
"states" in a predefined order:

* allow programs to write into files with an .incomplete extension
* allow them to rename the file to .complete
* create a single program that renames .complete files to .gz files
AFTER making them read-only for everybody else
* readers only ever read .gz files
* .gz files are then guaranteed to be complete
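The last two steps of that pipeline might be sketched like this, assuming writers name a finished file e.g. out2.txt.gz.complete (the function name is made up):

```python
import os
import stat

def promote_complete_files(directory):
    # The single "promoter": make each finished file read-only, then rename
    # it to its final .gz name in one atomic step, so readers only ever
    # see complete, immutable .gz files.
    for name in os.listdir(directory):
        if not name.endswith(".complete"):
            continue
        src = os.path.join(directory, name)
        os.chmod(src, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only first
        os.rename(src, src[:-len(".complete")])  # e.g. out2.txt.gz.complete -> out2.txt.gz
```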
Steven D'Aprano
2012-08-01 16:17:44 UTC
"DANGER DANGER DANGER WILL ROBINSON!!!"

Why didn't you say that there were threads involved? That puts a
completely different perspective on the problem.

I *was* going to write back and say that you probably had either file
system corruption, or network errors. But now that I can see that you
have threads, I will revise that and say that you probably have a bug in
your thread handling code.

I must say, Andrea, your initial post asking for help was EXTREMELY
misleading. You over-simplified the problem to the point that it no
longer has any connection to the reality of the code you are running.
Please don't send us on wild goose chases after bugs in code that you
aren't actually running.
Post by andrea crotti
there seems to be no clear pattern and it just randomly fails.
When you start using threads, you have to expect these sorts of
intermittent bugs unless you are very careful.

My guess is that you have a bug where two threads read from the same file
at the same time. Since each read shares state (the position of the file
pointer), you're going to get corruption. Because it depends on timing
details of which threads do what at exactly which microsecond, the effect
might as well be random.

Example: suppose the file contains three blocks A B and C, and a
checksum. Thread 8 starts reading the file, and gets block A and B. Then
thread 2 starts reading it as well, and gets half of block C. Thread 8
gets the rest of block C, calculates the checksum, and it doesn't match.

I recommend that you run a file system check on the remote disk. If it
passes, you can eliminate file system corruption. Also, run some network
diagnostics, to eliminate corruption introduced in the network layer. But
I expect that you won't find anything there, and the problem is a simple
thread bug. Simple, but really, really hard to find.

Good luck.
--
Steven
andrea crotti
2012-08-01 16:38:56 UTC
Post by Steven D'Aprano
"DANGER DANGER DANGER WILL ROBINSON!!!"
Why didn't you say that there were threads involved? That puts a
completely different perspective on the problem.
I *was* going to write back and say that you probably had either file
system corruption, or network errors. But now that I can see that you
have threads, I will revise that and say that you probably have a bug in
your thread handling code.
I must say, Andrea, your initial post asking for help was EXTREMELY
misleading. You over-simplified the problem to the point that it no
longer has any connection to the reality of the code you are running.
Please don't send us on wild goose chases after bugs in code that you
aren't actually running.
Post by andrea crotti
there seems to be no clear pattern and it just randomly fails.
When you start using threads, you have to expect these sorts of
intermittent bugs unless you are very careful.
My guess is that you have a bug where two threads read from the same file
at the same time. Since each read shares state (the position of the file
pointer), you're going to get corruption. Because it depends on timing
details of which threads do what at exactly which microsecond, the effect
might as well be random.
Example: suppose the file contains three blocks A B and C, and a
checksum. Thread 8 starts reading the file, and gets block A and B. Then
thread 2 starts reading it as well, and gets half of block C. Thread 8
gets the rest of block C, calculates the checksum, and it doesn't match.
I recommend that you run a file system check on the remote disk. If it
passes, you can eliminate file system corruption. Also, run some network
diagnostics, to eliminate corruption introduced in the network layer. But
I expect that you won't find anything there, and the problem is a simple
thread bug. Simple, but really, really hard to find.
Good luck.
Thanks a lot, that makes a lot of sense.. I hadn't given this detail
before because I didn't write this code, and I completely forgot that
there were threads involved; I'm just trying to help fix this bug.

Your explanation makes a lot of sense, but it's still surprising that
even just reading files, without ever writing them, can cause trouble
when using threads :/
Laszlo Nagy
2012-08-01 17:05:11 UTC
Post by andrea crotti
Thanks a lot, that makes a lot of sense.. I haven't given this detail
before because I didn't write this code, and I forgot that there were
threads involved completely, I'm just trying to help to fix this bug.
Your explanation makes a lot of sense, but it's still surprising that
even just reading files without ever writing them can cause troubles
using threads :/
Make sure that file objects are not shared between threads, if that is
possible. It will probably solve the problem (if it is related to
threads).
andrea crotti
2012-08-01 17:17:56 UTC
Post by Laszlo Nagy
Post by andrea crotti
Thanks a lot, that makes a lot of sense.. I haven't given this detail
before because I didn't write this code, and I forgot that there were
threads involved completely, I'm just trying to help to fix this bug.
Your explanation makes a lot of sense, but it's still surprising that
even just reading files without ever writing them can cause troubles
using threads :/
Make sure that file objects are not shared between threads. If that is
possible. It will probably solve the problem (if that is related to
threads).
Well I just have to create a lock I guess right?
with lock:
    # open file
    # read content
Laszlo Nagy
2012-08-01 17:57:19 UTC
Post by andrea crotti
Post by Laszlo Nagy
Make sure that file objects are not shared between threads. If that is
possible. It will probably solve the problem (if that is related to
threads).
Well I just have to create a lock I guess right?
That is also a solution. You need to call file.read() inside an acquired
lock.
Post by andrea crotti
# open file
# read content
But not that way! Your example will keep the lock acquired for the
lifetime of the file, so it cannot be shared between threads.

More likely:

## Open file
lock = threading.Lock()
fin = gzip.open(file_path...)
# Now you can share the file object between threads.

# and do this inside any thread:
## data needed. block until the file object becomes usable.
with lock:
    data = fin.read(....) # other threads are blocked while I'm reading
## use your data here, meanwhile other threads can read
Ulrich Eckhardt
2012-08-02 08:49:07 UTC
Post by Laszlo Nagy
## Open file
lock = threading.Lock()
fin = gzip.open(file_path...)
# Now you can share the file object between threads.
## data needed. block until the file object becomes usable.
data = fin.read(....) # other threads are blocked while I'm reading
## use your data here, meanwhile other threads can read
Technically, that is correct, but IMHO it's complete nonsense to share
the file object between threads in the first place. If you need the data
in two threads, just read the file once and then share the read-only,
immutable content. If the file is small or too large to be held in
memory at once, just open and read it on demand. This also saves you
from having to rewind the file every time you read it.

Am I missing something?

Uli
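Ulrich's suggestion, decompress once and share the immutable content, might be sketched like this (the function and worker are illustrative, not code from the thread):

```python
import gzip
import threading

def read_once_share_everywhere(path, nworkers=4):
    # Decompress the file exactly once, then hand the same immutable bytes
    # object to every worker thread: no shared file handle, no file-pointer
    # races, and no lock needed for the reads themselves.
    with gzip.open(path, "rb") as f:
        content = f.read()

    seen = []
    seen_lock = threading.Lock()  # protects the results list, not the content

    def worker():
        # `content` is immutable, so concurrent access is safe by construction
        with seen_lock:
            seen.append(len(content))

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return content, seen
```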
Laszlo Nagy
2012-08-02 10:14:14 UTC
Post by Ulrich Eckhardt
Technically, that is correct, but IMHO it's complete nonsense to share
the file object between threads in the first place. If you need the
data in two threads, just read the file once and then share the
read-only, immutable content. If the file is small or too large to be
held in memory at once, just open and read it on demand. This also
saves you from having to rewind the file every time you read it.
Am I missing something?
We suspect that his program reads the same file object from different
threads. At least this would explain his problem. I agree with you -
usually it is not a good idea to share a file object between threads.
This is what I told him the first time. But it is not in our hands - he
already has a program that needs to be fixed. It might be easier for him
to protect read() calls with a lock, because it can be done
automatically, without thinking too much.
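Such a mechanical fix might look like this minimal wrapper (an illustrative sketch, not code from the thread):

```python
import threading

class LockedReader:
    # Serialise every read() on a shared file-like object, so concurrent
    # threads cannot interleave reads and corrupt the stream position.
    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._lock = threading.Lock()

    def read(self, size=-1):
        with self._lock:
            return self._fileobj.read(size)
```

Wrapping the one shared file object in LockedReader fixes every call site at once, without restructuring the threads.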
andrea crotti
2012-08-02 09:26:51 UTC
Post by Steven D'Aprano
When you start using threads, you have to expect these sorts of
intermittent bugs unless you are very careful.
My guess is that you have a bug where two threads read from the same file
at the same time. Since each read shares state (the position of the file
pointer), you're going to get corruption. Because it depends on timing
details of which threads do what at exactly which microsecond, the effect
might as well be random.
Example: suppose the file contains three blocks A B and C, and a
checksum. Thread 8 starts reading the file, and gets block A and B. Then
thread 2 starts reading it as well, and gets half of block C. Thread 8
gets the rest of block C, calculates the checksum, and it doesn't match.
I recommend that you run a file system check on the remote disk. If it
passes, you can eliminate file system corruption. Also, run some network
diagnostics, to eliminate corruption introduced in the network layer. But
I expect that you won't find anything there, and the problem is a simple
thread bug. Simple, but really, really hard to find.
Good luck.
One last thing I would like to do before I add this fix is to actually
be able to reproduce this behaviour, and I thought I could just do the
following:

import gzip
import threading


class OpenAndRead(threading.Thread):
    def run(self):
        fz = gzip.open('out2.txt.gz')
        fz.read()
        fz.close()


if __name__ == '__main__':
    for i in range(100):
        OpenAndRead().start()


But no matter how many threads I start, I can't reproduce the CRC
error; any idea how I can help it happen?

The code in run should be shared by all the threads since there are no
locks, right?
Laszlo Nagy
2012-08-02 10:21:24 UTC
Post by andrea crotti
One last thing I would like to do before I add this fix is to actually
be able to reproduce this behaviour, and I thought I could just do the
import gzip
import threading
fz = gzip.open('out2.txt.gz')
fz.read()
fz.close()
OpenAndRead().start()
But no matter how many threads I start, I can't reproduce the CRC
error, any idea how I can try to help it happening?
Your example did not share the file object between threads. Here an
example that does that:

class OpenAndRead(threading.Thread):
    def run(self):
        global fz
        fz.read(100)

if __name__ == '__main__':
    fz = gzip.open('out2.txt.gz')
    for i in range(10):
        OpenAndRead().start()

Try this with a huge file. And here is the one that should never throw
CRC error, because the file object is protected by a lock:

class OpenAndRead(threading.Thread):
    def run(self):
        global fz
        global fl
        with fl:
            fz.read(100)

if __name__ == '__main__':
    fz = gzip.open('out2.txt.gz')
    fl = threading.Lock()
    for i in range(2):
        OpenAndRead().start()
Post by andrea crotti
The code in run should be shared by all the threads since there are no
locks, right?
The code is shared but the file object is not. In your example, a new
file object is created every time a thread is started.
andrea crotti
2012-08-02 10:57:06 UTC
Your example did not share the file object between threads. Here an example
global fz
fz.read(100)
fz = gzip.open('out2.txt.gz')
OpenAndRead().start()
Try this with a huge file. And here is the one that should never throw CRC
global fz
global fl
fz.read(100)
fz = gzip.open('out2.txt.gz')
fl = threading.Lock()
OpenAndRead().start()
Post by andrea crotti
The code in run should be shared by all the threads since there are no
locks, right?
The code is shared but the file object is not. In your example, a new file
object is created, every time a thread is started.
Ok, sure, that makes sense, but then this explanation is maybe not right
anymore, because I'm quite sure that the file object is *not* shared
between threads; everything happens inside a thread..

I managed to get some errors doing this with a big file:

class OpenAndRead(threading.Thread):
    def run(self):
        global fz
        fz.read(100)

if __name__ == '__main__':

    fz = gzip.open('bigfile.avi.gz')
    for i in range(20):
        OpenAndRead().start()

and it doesn't fail without the *global*, but this is definitely not
what the code does, because every thread gets a new file object; it's
not shared..

Anyway we'll read once for all the threads or add the lock, and
hopefully that should solve the problem, even if I'm not yet convinced
that this was the cause.
andrea crotti
2012-08-02 10:59:28 UTC
Post by andrea crotti
Ok sure that makes sense, but then this explanation is maybe not right
anymore, because I'm quite sure that the file object is *not* shared
between threads, everything happens inside a thread..
I managed to get some errors doing this with a big file
global fz
fz.read(100)
fz = gzip.open('bigfile.avi.gz')
OpenAndRead().start()
and it doesn't fail without the *global*, but this is definitively not
what the code does, because every thread gets a new file object, it's
not shared..
Anyway we'll read once for all the threads or add the lock, and
hopefully it should solve the problem, even if I'm not convinced yet
that it was this.
Just for completeness as suggested this also does not fail:

class OpenAndRead(threading.Thread):
    def __init__(self, lock):
        threading.Thread.__init__(self)
        self.lock = lock

    def run(self):
        global fz
        with self.lock:
            fz.read(100)

if __name__ == '__main__':
    lock = threading.Lock()
    fz = gzip.open('bigfile.avi.gz')
    for i in range(20):
        OpenAndRead(lock).start()
