Discussion:
Iterating over files of a huge directory
Gilles Lenfant
2012-12-17 15:28:46 UTC
Permalink
Hi,

I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.

So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.

i.e :

for filename in enumerate_files(some_directory):
    # My cooking...

Many thanks in advance.
--
Gilles Lenfant
Chris Angelico
2012-12-17 15:41:42 UTC
Permalink
On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
Post by Gilles Lenfant
Hi,
I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.
So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.
Sounds like you want os.walk. But... a hundred thousand files? I know
the Zen of Python says that flat is better than nested, but surely
there's some kind of directory structure that would make this
marginally manageable?

http://docs.python.org/3.3/library/os.html#os.walk
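For reference, a minimal sketch of the os.walk pattern (the path is a placeholder; note that in 3.3 os.walk still calls os.listdir internally, so each directory's name list is built in memory anyway):

```python
import os

# Walk a tree, handling one directory's worth of names at a time.
# Caveat: os.walk materialises each directory listing as a list first,
# so a single flat directory of 100000+ names gains nothing from this.
for dirpath, dirnames, filenames in os.walk("/path/to/huge_dir"):
    for filename in filenames:
        path = os.path.join(dirpath, filename)
        # ... process path ...
```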

ChrisA
Tim Golden
2012-12-17 15:48:19 UTC
Permalink
Post by Chris Angelico
On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
Post by Gilles Lenfant
Hi,
I have googled but did not find an efficient solution to my
problem. My customer provides a directory with a huuuuge list of
files (flat, potentially 100000+) and I cannot reasonably use
os.listdir(this_path) unless creating a big memory footprint.
So I'm looking for an iterator that yields the file names of a
directory and does not make a giant list of what's in.
Sounds like you want os.walk. But... a hundred thousand files? I
know the Zen of Python says that flat is better than nested, but
surely there's some kind of directory structure that would make this
marginally manageable?
http://docs.python.org/3.3/library/os.html#os.walk
Unfortunately all of the built-in functions (os.walk, glob.glob,
os.listdir) rely on the os.listdir functionality which produces a list
first even if (as in glob.iglob) it later iterates over it.

There are external functions to iterate over large directories in both
Windows & Linux. I *think* the OP is on *nix from his previous posts, in
which case someone else will have to produce the Linux-speak for this.
If it's Windows, you can use the FindFilesIterator in the pywin32 package.
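As a rough stab at the Linux-speak: a ctypes sketch over opendir/readdir could look like the following. The struct dirent layout below is an assumption for x86-64 glibc (other platforms lay it out differently), so treat this as a starting point, not a portable implementation.

```python
import ctypes
import ctypes.util

class Dirent(ctypes.Structure):
    # struct dirent as laid out on x86-64 Linux/glibc -- an assumption;
    # field sizes and order differ on other platforms.
    _fields_ = [
        ("d_ino", ctypes.c_ulong),
        ("d_off", ctypes.c_long),
        ("d_reclen", ctypes.c_ushort),
        ("d_type", ctypes.c_ubyte),
        ("d_name", ctypes.c_char * 256),
    ]

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.opendir.restype = ctypes.c_void_p          # DIR *
_libc.opendir.argtypes = [ctypes.c_char_p]
_libc.readdir.restype = ctypes.POINTER(Dirent)
_libc.readdir.argtypes = [ctypes.c_void_p]
_libc.closedir.argtypes = [ctypes.c_void_p]

def iterdir(path):
    """Yield directory entry names one at a time, never building a list."""
    dirp = _libc.opendir(path.encode())
    if not dirp:
        raise OSError(ctypes.get_errno(), "opendir failed: %r" % path)
    try:
        while True:
            entry = _libc.readdir(dirp)   # NULL pointer means end of directory
            if not entry:
                break
            name = entry.contents.d_name.decode()
            if name not in (".", ".."):
                yield name
    finally:
        _libc.closedir(dirp)
```

Memory use stays constant regardless of directory size, since only one dirent is held at a time.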

TJG
Oscar Benjamin
2012-12-17 15:52:19 UTC
Permalink
Post by Gilles Lenfant
I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.
So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.
# My cooking...
In the last couple of months there has been a lot of discussion (on
python-list or python-dev - not sure) about creating a library to more
efficiently iterate over the files in a directory. The result so far
is this library on github:
https://github.com/benhoyt/betterwalk

It says there that
"""
Somewhat relatedly, many people have also asked for a version of
os.listdir() that yields filenames as it iterates instead of returning
them as one big list.

So as well as a faster walk(), BetterWalk adds iterdir_stat() and
iterdir(). They're pretty easy to use, but see below for the full API
docs.
"""

Does that code work for you? If so, I imagine the author would be
interested to get some feedback on how well it works.

Alternatively, perhaps consider calling an external utility.


Oscar
Evan Driscoll
2012-12-17 18:40:49 UTC
Permalink
Post by Oscar Benjamin
In the last couple of months there has been a lot of discussion (on
python-list or python-dev - not sure) about creating a library to more
efficiently iterate over the files in a directory. The result so far
https://github.com/benhoyt/betterwalk
This is very useful to know about; thanks.

I actually wrote something very similar on my own (I wanted to get
information about whether each directory entry was a file, directory,
symlink, etc. without separate stat() calls). I'm guessing that the
library you linked is more mature than mine (I only have a Linux
implementation at present, for instance) so I'm happy to see that I
could probably switch to something better... and even happier that it
sounds like it's aiming for inclusion in the standard library.


(Also just for the record and anyone looking for other posts, I'd guess
said discussion was on Python-dev. I don't look at even remotely
everything on python-list (there's just too much), but I do skim most
subject lines and I haven't noticed any discussion on it before now.)

Evan



Oscar Benjamin
2012-12-17 19:50:53 UTC
Permalink
Post by Evan Driscoll
Post by Oscar Benjamin
https://github.com/benhoyt/betterwalk
This is very useful to know about; thanks.
I actually wrote something very similar on my own (I wanted to get
information about whether each directory entry was a file, directory,
symlink, etc. without separate stat() calls).
The initial goal of betterwalk seemed to be the ability to do os.walk
with fewer stat calls. I think the information you want is part of
what betterwalk finds "for free" from the underlying OS iteration
(without the need to call stat()) but I'm not sure.
Post by Evan Driscoll
(Also just for the record and anyone looking for other posts, I'd guess
said discussion was on Python-dev. I don't look at even remotely
everything on python-list (there's just too much), but I do skim most
subject lines and I haven't noticed any discussion on it before now.)
Actually, it was python-ideas:
http://thread.gmane.org/gmane.comp.python.ideas/17932
http://thread.gmane.org/gmane.comp.python.ideas/17757
Evan Driscoll
2012-12-17 20:09:34 UTC
Permalink
Post by Oscar Benjamin
Post by Evan Driscoll
Post by Oscar Benjamin
https://github.com/benhoyt/betterwalk
This is very useful to know about; thanks.
I actually wrote something very similar on my own (I wanted to get
information about whether each directory entry was a file, directory,
symlink, etc. without separate stat() calls).
The initial goal of betterwalk seemed to be the ability to do os.walk
with fewer stat calls. I think the information you want is part of
what betterwalk finds "for free" from the underlying OS iteration
(without the need to call stat()) but I'm not sure.
Yes, that's my impression as well.
Post by Oscar Benjamin
Post by Evan Driscoll
(Also just for the record and anyone looking for other posts, I'd guess
said discussion was on Python-dev. I don't look at even remotely
everything on python-list (there's just too much), but I do skim most
subject lines and I haven't noticed any discussion on it before now.)
http://thread.gmane.org/gmane.comp.python.ideas/17932
http://thread.gmane.org/gmane.comp.python.ideas/17757
Thanks again for the pointers; I'll have to go through that thread. It's
possible I can contribute something; it sounds like at least at one
point the implementation was ctypes-based and is sometimes slower, and I
have both a (now-defunct) C implementation and my current Cython module.
Ironically I haven't actually benchmarked mine. :-)

Evan

marduk
2012-12-17 15:50:21 UTC
Permalink
Post by Gilles Lenfant
Hi,
I have googled but did not find an efficient solution to my problem. My
customer provides a directory with a huuuuge list of files (flat,
potentially 100000+) and I cannot reasonably use os.listdir(this_path)
unless creating a big memory footprint.
So I'm looking for an iterator that yields the file names of a directory
and does not make a giant list of what's in.
# My cooking...
You could try using opendir[1] which is a binding to the posix call. I
believe that it returns an iterator (file-like) of the entries in the
directory.

[1] http://pypi.python.org/pypi/opendir/
Gilles Lenfant
2012-12-17 16:06:28 UTC
Permalink
Post by Oscar Benjamin
In the last couple of months there has been a lot of discussion (on
python-list or python-dev - not sure) about creating a library to more
efficiently iterate over the files in a directory. The result so far
https://github.com/benhoyt/betterwalk
It says there that
"""
Somewhat relatedly, many people have also asked for a version of
os.listdir() that yields filenames as it iterates instead of returning
them as one big list.
So as well as a faster walk(), BetterWalk adds iterdir_stat() and
iterdir(). They're pretty easy to use, but see below for the full API
docs.
"""
Does that code work for you? If so, I imagine the author would be
interested to get some feedback on how well it works.
Alternatively, perhaps consider calling an external utility.
Many thanks for this pointer Oscar.

"betterwalk" is exactly what I was looking for. More particularly iterdir(...) and iterdir_stat(...)
I'll get a deeper look at betterwalk and provide (hopefully successful) feedback.

Cheers
--
Gilles Lenfant
Paul Rudin
2012-12-17 17:27:13 UTC
Permalink
Post by Chris Angelico
On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
Post by Gilles Lenfant
Hi,
I have googled but did not find an efficient solution to my
problem. My customer provides a directory with a huuuuge list of
files (flat, potentially 100000+) and I cannot reasonably use
os.listdir(this_path) unless creating a big memory footprint.
So I'm looking for an iterator that yields the file names of a
directory and does not make a giant list of what's in.
Sounds like you want os.walk.
But doesn't os.walk call listdir() and that creates a list of the
contents of a directory, which is exactly the initial problem?
Post by Chris Angelico
But... a hundred thousand files? I know the Zen of Python says that
flat is better than nested, but surely there's some kind of directory
structure that would make this marginally manageable?
Sometimes you have to deal with things other people have designed, so
the directory structure is not something you can control. I've run up
against exactly the same problem and made something in C that
implemented an iterator.

It would probably be better if listdir() made an iterator rather than a
list.
MRAB
2012-12-17 18:29:46 UTC
Permalink
Post by Paul Rudin
Post by Chris Angelico
On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
Post by Gilles Lenfant
Hi,
I have googled but did not find an efficient solution to my
problem. My customer provides a directory with a huuuuge list of
files (flat, potentially 100000+) and I cannot reasonably use
os.listdir(this_path) unless creating a big memory footprint.
So I'm looking for an iterator that yields the file names of a
directory and does not make a giant list of what's in.
Sounds like you want os.walk.
But doesn't os.walk call listdir() and that creates a list of the
contents of a directory, which is exactly the initial problem?
Post by Chris Angelico
But... a hundred thousand files? I know the Zen of Python says that
flat is better than nested, but surely there's some kind of directory
structure that would make this marginally manageable?
Sometimes you have to deal with things other people have designed, so
the directory structure is not something you can control. I've run up
against exactly the same problem and made something in C that
implemented an iterator.
<Off topic>
Years ago I had to deal with an in-house application that was written
using a certain database package. The package stored each predefined
query in a separate file in the same directory.

I found that if I packed all the predefined queries into a single file
and then called an external utility to extract the desired query from
the file every time it was needed into a file for the package to use,
not only did it save a significant amount of disk space (hard disks
were a lot smaller then), I also got a significant speed-up!

It wasn't as bad as 100000 in one directory, but it was certainly too
many...
</Off topic>
Post by Paul Rudin
It would probably be better if listdir() made an iterator rather than a
list.
Chris Angelico
2012-12-17 21:10:26 UTC
Permalink
Post by MRAB
<Off topic>
Years ago I had to deal with an in-house application that was written
using a certain database package. The package stored each predefined
query in a separate file in the same directory.
I found that if I packed all the predefined queries into a single file
and then called an external utility to extract the desired query from
the file every time it was needed into a file for the package to use,
not only did it save a significant amount of disk space (hard disks
were a lot smaller then), I also got a significant speed-up!
It wasn't as bad as 100000 in one directory, but it was certainly too
many...
</Off topic>
Smart Cache, a web cache that we used to use on our network a while
ago, could potentially make a ridiculous number of subdirectories (one
for each domain you go to). Its solution: Hash the domain, then put it
into partitioning directories - by default, 4x4 of them, meaning that
there were four directories /0/ /1/ /2/ /3/ and the same inside each
of them, so the "real content" was divided sixteen ways. I don't know
if PC file systems are better at it now than they were back in the
mid-90s, but definitely back then, storing too much in one directory
would give a pretty serious performance penalty.
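That kind of hash partitioning is simple to reproduce. A rough sketch (the function name and the 4x4 fanout are illustrative, not Smart Cache's actual algorithm):

```python
import hashlib
import os

def partitioned_path(root, key, levels=2, fanout=4):
    """Spread keys across fanout**levels subdirectories via a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    # Take one hex digit per level and reduce it to a small directory name.
    parts = [str(int(digest[i], 16) % fanout) for i in range(levels)]
    return os.path.join(root, *(parts + [key]))
```

Because the hash is stable, every lookup recomputes the same path, while no single directory ever holds more than roughly 1/16th of the entries.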

ChrisA
Terry Reedy
2012-12-17 21:27:22 UTC
Permalink
Post by Gilles Lenfant
Hi,
I have googled but did not find an efficient solution to my problem.
My customer provides a directory with a huuuuge list of files (flat,
potentially 100000+) and I cannot reasonably use
os.listdir(this_path) unless creating a big memory footprint.
Is it really big enough to be a real problem? See below.
Post by Gilles Lenfant
So I'm looking for an iterator that yields the file names of a
directory and does not make a giant list of what's in.
for filename in enumerate_files(some_directory):
    # My cooking...
See http://bugs.python.org/issue11406
As I said there, I personally think (and still do) that listdir should
have been changed in 3.0 to return an iterator rather than a list.
Developers who count more than me disagree on the basis that no
application has the millions of directory entries needed to make space a
real issue. They also claim that time is a wash either way.

As for space, 100000 entries x 100 bytes/entry (generous guess at
average) = 10,000,000 bytes, no big deal with gigabyte memories. So the
logic goes. A smaller example from my machine with 3.3.

from sys import getsizeof

def seqsize(seq):
    "Get size of flat sequence and contents"
    return sum((getsizeof(item) for item in seq), getsizeof(seq))

import os
d = os.listdir()
print(seqsize([1,2,3]), len(d), seqsize(d))
# 172 45 3128

The size per entry is relatively short because the two-level directory
prefix for each path is only about 15 bytes. By using 3.3 rather than
3.0-3.2, the all-ascii-char unicode paths only take 1 byte per char
rather than 2 or 4.

If you disagree with the responses on the issue, after reading them,
post one yourself with real numbers.
--
Terry Jan Reedy