Idiomatic portable way to strip line endings?

Discussion:

Tim Hammerquist

2001-12-16 15:49:37 UTC

"Tim Hammerquist" <tim at vegeta.ath.cx> wrote in message

I've been trying to figure out the canonical way to strip the line
endings from a text file.

In general, I can use:

    line = line[:-1]
or
    del line[-1:]

to strip the last character off the line, but this only works on
operating systems that have a one-byte line separator like Unix/Linux
('\n'). The Win32 line separator is 2-bytes ('\r\n'), so this
solution is not portable.

How about:

    line.endswith(os.linesep)

What about it? I used it in code in the original post. I know it
exists and is my preferred method of _detecting_if_ a line ends in
the designated linesep for the host OS.

However, the rest of my original post raises questions regarding
actually _stripping_ the linesep.

Tim Hammerquist

--
"This is your life ... and it's ending one minute at a time."
-- Jack, "Fight Club"

Tim Hammerquist

2001-12-16 16:17:49 UTC

"Tim Hammerquist" <tim at vegeta.ath.cx> wrote in message

Huh? This works fine on Windows. The C library (which is
what Python is based on, after all) is defined to convert the
operating system dependent line end sequence to '\n'
on input, and back on output.

Ah. So I was doing my normal over-thinking again. =)

Thanks!

Tim Hammerquist

--
How do I type "for i in *.dvi do xdvi i done" in a GUI?
-- discussion in comp.os.linux.misc

Martin Bless

2001-12-16 18:03:52 UTC

Post by Tim Hammerquist
Ah. So I was doing my normal over-thinking again. =)

Yes - and it's totally appropriate. Danger is not over.

The line[:-1] idiom works fine on Windows, as long as you can guarantee
that files are opened without the "b" (binary) mode switch.
If the file is opened in binary mode, as in

    fp = open(fname, 'rb')

no input conversion will be done by the lower-level routines. This
means your line endings really are '\r\n' on a Windows system.
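
For comparison, here is a minimal sketch of the text-versus-binary difference. It assumes a modern Python 3 interpreter, where text mode translates '\r\n' and '\r' to '\n' on every platform (the 2001-era interpreter discussed in this thread relied on the C library instead). The helper name read_both_ways is made up for illustration.

```python
import os
import tempfile


def read_both_ways(raw):
    """Write raw bytes to a temp file, then read it in binary and in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Binary mode: bytes come back untouched, '\r\n' included.
        with open(path, "rb") as f:
            binary_lines = f.readlines()
        # Text mode (default newline=None): line endings are translated to '\n'.
        with open(path, "r", encoding="ascii") as f:
            text_lines = f.readlines()
    finally:
        os.remove(path)
    return binary_lines, text_lines


binary_lines, text_lines = read_both_ways(b"one\r\ntwo\r\n")
print(binary_lines)
print(text_lines)
```

With text mode, line[:-1] removes the whole line ending; with binary mode it would leave a stray '\r' behind.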

To be on the safe side here's what I sometimes use:

    def removecrlf(line):
        """Remove \r\n or \n from end of string."""
        if line[-1:] == '\n':
            if line[-2:-1] == '\r':
                return line[:-2]
            else:
                return line[:-1]
        else:
            return line

Martin

John Roth

2001-12-16 20:54:39 UTC

"Martin Bless" <m.bless at gmx.de> wrote in message

Post by Martin Bless

Post by Tim Hammerquist
Ah. So I was doing my normal over-thinking again. =)

This is quite true, but if you're opening it in binary mode,
it's questionable whether there's anything that corresponds to
a "line" in the first place. This is, in fact, the standard distinction
between a 'text' and a 'binary' file: text files are organized into
lines, and binary files have some other organization, which is
strictly dependent on the kind of file.

John Roth

Skip Montanaro

2001-12-16 18:53:01 UTC

John> Huh? This works fine on Windows. The C library (which is what
John> Python is based on, after all) is defined to convert the operating
John> system dependent line end sequence to '\n' on input, and back on
John> output.

Even if you get the file from an operating system that uses another line
ending convention?

--
Skip Montanaro (skip at pobox.com - http://www.mojam.com/)

Chris Barker

2001-12-18 20:20:55 UTC

I know Guido, Jack Jansen and I had some discussion about it, and I
think Jack put a little work into it, but I'm not sure where it
sits.

http://sourceforge.net/tracker/index.php?func=detail&aid=476814&group_id=5470&atid=305470

This looks Great! I'll have to test it out! Thanks, Jack

-Chris
--
Christopher Barker,
Ph.D.
ChrisHBarker at attbi.net --- --- ---
---@@ -----@@ -----@@
------@@@ ------@@@ ------@@@
Oil Spill Modeling ------ @ ------ @ ------ @
Water Resources Engineering ------- --------- --------
Coastal and Fluvial Hydrodynamics --------------------------------------
------------------------------------------------------------------------

Tim Hammerquist

2001-12-16 16:22:43 UTC

line = line[:-1]
or
del line[-1:]

I think you misunderstood my question.

Suppose:

s = "a string\n"

I want:

s == "a string"

OTOH, when I write filters, yours is the exact idiom I use.

The string module also supports rstrip, strip and lstrip, which
remove whitespace (from the right side, both sides, and the left
side, respectively).

This breaks on strings like:

Start:       " string bookended by whitespace \n"
Target:      " string bookended by whitespace "

w/ .rstrip:  " string bookended by whitespace"
w/ .strip:   "string bookended by whitespace"

Often the (l|r)?strip methods work fine, but sometimes I need to
preserve " " and "\t" in the string.
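
One way to keep the spaces and tabs is to pass rstrip() an explicit set of characters to remove. This is a sketch that assumes a current Python, where str.rstrip() accepts that argument. The function name strip_eol is hypothetical. Note that it removes every trailing '\r' and '\n', not just one line ending.

```python
def strip_eol(line):
    """Remove trailing '\r' and '\n' characters, leaving other whitespace alone."""
    # The argument is a set of characters, not a suffix: any run of
    # '\r' and '\n' at the end of the string is stripped.
    return line.rstrip("\r\n")


start = " string bookended by whitespace \n"
print(repr(strip_eol(start)))
```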

Thanks for your help, tho. =)

Tim Hammerquist

--
Microsoft's biggest and most dangerous contribution to the software
industry may be the degree to which it has lowered user expectations.

John Roth

2001-12-16 21:04:01 UTC

"Jason Orendorff" <jason at jorendorff.com> wrote in message

I've been trying to figure out the canonical way to strip
the line endings from a text file.

There isn't one. Almost always, either rstrip() is sufficient,
*or* you're doing text slinging, in which case you can leave the
newline characters on there, do the regex stuff, and file.write()
the lines out the same way they came in.
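
That said, the function you want is:

    def chomp(line):
        if line.endswith('\r\n'):
            return line[:-2]
        elif line.endswith('\r') or line.endswith('\n'):
            return line[:-1]
        else:
            return line

If you're getting these strings from a text file, you could:

    for unstripped_line in file:
        for line in unstripped_line.splitlines():
            ...process line...
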
Why is this necessary? Unfortunately readline() doesn't
interpret a bare '\r' as a line ending on Windows or Unix.
So if the file contains bare '\r's, then the above code
will read the entire file into the unstripped_line
variable, then break it into lines with splitlines().

Unfortunately, this won't work in all cases, either. Let's
go back to basics. If your file has the line ending convention
defined for the operating system you're running on, then
everything works nicely. If it doesn't, then you're in a
great deal of difficulty, because there are cases where
readline() and readlines() will not parse the file into
lines for you, or if they do, you get extra characters
at the end. You need to read it in without having the
system do the parse, and do it yourself.
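
A minimal sketch of "do it yourself", assuming Python 3: take the raw bytes (for example from a file opened in 'rb' mode) and split them without relying on the platform's convention. bytes.splitlines() breaks on '\n', '\r' and '\r\n'. The name split_any_eol is made up for illustration.

```python
def split_any_eol(data):
    """Split raw bytes into lines on '\\n', '\\r' or '\\r\\n', dropping the endings."""
    return data.splitlines()


# Mixed DOS, old Mac and Unix endings in one buffer.
raw = b"a\r\nb\rc\nd"
print(split_any_eol(raw))
```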

These are two different cases, although it does come
up in practice if you're importing files from the internet.
Some browsers will fix the line endings, and some won't.
I had a lot of files with Unix line endings that I had to
convert to Windows line endings because Notepad
will not handle Unix line endings. At all.

John Roth

Chris Barker

2001-12-17 20:33:46 UTC

Post by John Roth
These are two different cases, although it does come
up in practice if you're importing files from the internet.
Some browsers will fix the line endings, and some won't.
I had a lot of files with Unix line endings that I had to
convert to Windows line endings because Notepad
will not handle Unix line endings. At all.

It can be worse than that. If someone got a file off the web with the
line endings that were not native to their system and then edited it in
an editor that is not very smart, you could have mixed line endings.

To deal with all these problems (which become perhaps more common on Mac
OS-X), there was discussion on python-dev about creating a "Universal
Text File" type that would read files with any mixed combination of
*nix, DOS and Mac line endings (and maybe do something smart on VMS as
well). The goal would be to have it work with Python code as well
as files opened with open(), and also on pipes and other text streams.
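
As a rough illustration of what such a file type would do, here is a sketch that assumes a modern Python 3, where text mode with newline=None accepts any mix of the three conventions and the file object's newlines attribute reports which ones were seen. The helper name read_universal is hypothetical, and this is not the implementation discussed on python-dev.

```python
import os
import tempfile


def read_universal(path):
    """Read a text file with mixed line endings; return (lines, endings seen)."""
    with open(path, "r", newline=None, encoding="ascii") as f:
        lines = f.readlines()   # every ending comes back as '\n'
        seen = f.newlines       # str, tuple of str, or None
    return lines, seen


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"unix\ndos\r\nmac\rlast")
    lines, seen = read_universal(path)
finally:
    os.remove(path)

print(lines)
print(seen)
```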

I know Guido, Jack Jansen and I had some discussion about it, and I
think Jack put a little work into it, but I'm not sure where it sits. I
don't have the C skills to make it happen myself. My solution was to
write a module in Python to handle this. It's not all that fast, not
that well tested, and doesn't (yet) support xreadlines(), but it is
working just fine for me in production code. My web page got killed when
@home went out of business, but it's not too big so I'll include it
here. Comments, bug reports, and especially improvements and fixes
welcome.

-Chris

module TextFile.py:

#!/usr/bin/env python

"""

TextFile.py : a module that provides a UniversalTextFile class, and a
replacement for the native python "open" command that provides an
interface to that class.

It would usually be used as:

    from TextFile import open

then you can use the new open just like the old one (with some added
flags and arguments)

or

    import TextFile

    file = TextFile.open(filename, flags, [bufsize], [LineEndingType],
                         [LineBufferSize])

please send bug reports, helpful hints, and/or feature requests to:

Chris Barker

ChrisHBarker at attbi.com

Copyright/license is the same as whatever version of python you are
running.

"""
import os

## Re-map the open function
_OrigOpen = open

def open(filename, flags = "r", bufsize = -1, LineEndingType = "native",
         LineBufferSize = ""):
    r"""

    A new open function, that returns a regular python file object for
    the old calls, and returns a new nifty universal text file when
    required.

    This works just like the regular open command, except that a new
    flag and two new parameters have been added.

    Call:

    file = open(filename, flags = "r", bufsize = -1,
                LineEndingType = "native", LineBufferSize = ""):
    - filename is the name of the file to be opened
    - flags is a string of one letter flags, the same as the standard
      open command, plus a "t" for universal text file.
    - - "b" means binary file, this returns the standard binary file
        object
    - - "t" means universal text file
    - - "r" for read only
    - - "w" for write. If there is both "w" and "t" then the user can
        specify a line ending type to be used with the LineEndingType
        parameter.
    - - "a" means append to existing file

    - bufsize specifies the buffer size to be used by the system. Same
      as the regular open function

    - LineEndingType is used only for writing (and appending) files, to
      specify a non-native line ending to be written.
    - - The options are: "native", "DOS", "Posix", "Unix", "Mac", or the
        characters themselves ("\r\n", etc.). "native" will result in
        using the standard file object, which uses whatever is native
        for the system that python is running on.

    - LineBufferSize is the size of the buffer used to read data in
      a readline() operation. The default is currently set to 100
      characters. If you will be reading files with many lines over 100
      characters long, you should set this number to the largest expected
      line length.

    """

    if "t" in flags: # this is a universal text file
        if ("w" in flags or "a" in flags) and LineEndingType == "native":
            return _OrigOpen(filename, flags.replace("t",""), bufsize)
        return UniversalTextFile(filename, flags, LineEndingType, LineBufferSize)
    else: # this is a regular old file
        return _OrigOpen(filename, flags, bufsize)

class UniversalTextFile:
    r"""

    A class that acts just like a python file object, but has a mode
    that allows the reading of arbitrarily formatted text files, i.e. with
    either Unix, DOS or Mac line endings. [\n , \r\n, or \r]

    To keep it truly universal, it checks for each of these line ending
    possibilities at every line, so it should work on a file with mixed
    endings as well.

    """
    def __init__(self, filename, flags = "r", LineEndingType = "native",
                 LineBufferSize = ""):
        # always open the underlying file in binary mode, so that the
        # C library does no line ending translation of its own
        self._file = _OrigOpen(filename, flags.replace("t","")+"b")

        LineEndingType = LineEndingType.lower()
        if LineEndingType == "native":
            self.LineSep = os.linesep   # a string attribute, not a function
        elif LineEndingType == "dos":
            self.LineSep = "\r\n"
        elif LineEndingType == "posix" or LineEndingType == "unix":
            self.LineSep = "\n"
        elif LineEndingType == "mac":
            self.LineSep = "\r"
        else:
            self.LineSep = LineEndingType

        ## some attributes
        self.closed = 0
        self.mode = flags
        self.softspace = 0
        if LineBufferSize:
            self._BufferSize = LineBufferSize
        else:
            self._BufferSize = 100

    def readline(self):
        start_pos = self._file.tell()
        ##print "Current file position is:", start_pos
        line = ""
        TotalBytes = 0
        Buffer = self._file.read(self._BufferSize)
        while Buffer:
            ##print "Buffer = ",repr(Buffer)
            if Buffer[-1:] == "\r":
                # a "\r" at the buffer edge may be the first half of a
                # "\r\n", so pull in one more character before deciding
                Buffer = Buffer + self._file.read(1)
            newline_pos = Buffer.find("\n")
            return_pos = Buffer.find("\r")
            if return_pos == newline_pos-1 and return_pos >= 0:
                # we have a DOS line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = newline_pos+1
                break
            elif ((return_pos < newline_pos) or newline_pos < 0) and return_pos >= 0:
                # we have a Mac line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = return_pos+1
                break
            elif newline_pos >= 0: # we have a Posix line
                line = Buffer[:newline_pos] + "\n"
                TotalBytes = newline_pos+1
                break
            else: # we need a larger buffer
                NewBuffer = self._file.read(self._BufferSize)
                if NewBuffer:
                    Buffer = Buffer + NewBuffer
                else:
                    # we are at the end of the file, without a line ending.
                    self._file.seek(start_pos + len(Buffer))
                    return Buffer

        self._file.seek(start_pos + TotalBytes)
        return line

    def readlines(self, sizehint = None):
        r"""

        readlines acts like the regular readlines, except that it
        understands any of the standard text file line endings ("\r\n",
        "\n", "\r").

        If sizehint is used, it will read a maximum of that many
        bytes. It will never round up, as the regular readlines sometimes
        does. This means that if sizehint is less than the
        length of the next line, you'll get an empty list, which could
        incorrectly be interpreted as the end of the file.

        """

        if sizehint:
            Data = self._file.read(sizehint)
        else:
            Data = self._file.read()

        if len(Data) == sizehint:
            #print "The buffer is full"
            FullBuffer = 1
        else:
            FullBuffer = 0
        Data = Data.replace("\r\n","\n").replace("\r","\n")
        Lines = [line + "\n" for line in Data.split('\n')]
        ## If the last line is only a linefeed it is an extra line
        if Lines[-1] == "\n":
            del Lines[-1]
        else:
            ## if it isn't then the last line didn't have a linefeed, so
            ## we need to remove the one we put on,
            ## or it's the end of the buffer
            if FullBuffer:
                # reset the file position to the start of the partial line
                self._file.seek(-(len(Lines[-1])-1),1)
                del(Lines[-1])
            else:
                Lines[-1] = Lines[-1][:-1]
        return Lines

    def readnumlines(self, NumLines = 1):
        """

        readnumlines is an extension to the standard file object. It
        returns a list containing the number of lines that are
        requested. I have found this to be very useful, and allows me
        to avoid the many loops like:

            lines = []
            for i in range(N):
                lines.append(file.readline())

        Also, If I ever get around to writing this in C, it will provide
        a speed improvement.

        """
        Lines = []
        while len(Lines) < NumLines:
            Lines.append(self.readline())
        return Lines

    def read(self, size = None):
        r"""

        read acts like the regular read, except that it translates any of
        the standard text file line endings ("\r\n", "\n", "\r") into a
        "\n"

        If size is used, it will read a maximum of that many bytes,
        before translation. This means that if the line endings have
        more than one character, the size returned will be smaller. This
        could be fixed, but it didn't seem worth it. If you want that
        much control, use a binary file.

        """

        if size:
            Data = self._file.read(size)
        else:
            Data = self._file.read()

        return Data.replace("\r\n","\n").replace("\r","\n")

    def write(self, string):
        """

        write is just like the regular one, except that it uses the line
        separator specified when the file was opened for writing or
        appending.

        """
        self._file.write(string.replace("\n",self.LineSep))

    def writelines(self, list):
        for line in list:
            self.write(line)

    # The rest of the standard file methods mapped
    def close(self):
        self._file.close()
        self.closed = 1
    def flush(self):
        self._file.flush()
    def fileno(self):
        return self._file.fileno()
    def seek(self, offset, whence = 0):
        self._file.seek(offset, whence)
    def tell(self):
        return self._file.tell()

Guido van Rossum

2001-12-18 03:55:38 UTC

Post by Chris Barker
To deal with all these problems (which become perhaps more common on
Mac OS-X), there was discussion on python-dev about creating a
"Universal Text File" type that would read files with any mixed
combination of *nix, DOS and Mac line endings (and maybe do
something smart on VMS as well). The goal would be to have it
work with Python code as well as files opened with open(), and also
on pipes and other text streams.
I know Guido, Jack Jansen and I had some discussion about it, and I
think Jack put a little work into it, but I'm not sure where it
sits.

http://sourceforge.net/tracker/index.php?func=detail&aid=476814&group_id=5470&atid=305470

--Guido van Rossum (home page: http://www.python.org/~guido/)

Erik Max Francis

2001-12-16 19:07:43 UTC

Not sure what the issue here is. When a file is opened in text mode, as
I'd expect, looks like line endings on different platforms are
normalized to just one newline (presumably because Python is implemented
in C where this is true as well).

I just tested

sys.stdin.readline()

and it indeed does _not_ have a trailing CR LF, just a trailing LF as
I'd expect. So what's the issue here?

--
Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/ \ Laws are silent in time of war.
\__/ Cicero
Esperanto reference / http://www.alcyone.com/max/lang/esperanto/
An Esperanto reference for English speakers.

Chris Perkins

2001-12-16 21:23:31 UTC

line = line[:-1]
or
del line[-1:]
to strip the last character off the line, but this only works on
operating systems that have a one-byte line separator like Unix/Linux
('\n'). The Win32 line separator is 2-bytes ('\r\n'), so this
solution is not portable.

Actually, line[:-1] works just fine on Windows, as long as you read
the file in non-binary mode, because the '\r\n' line-endings are all
turned into '\n' when the file is read. They're also turned back into
'\r\n' when you write the file, which has the interesting effect that
you can turn Unix text-files into Windows text-files like this:

t = open(filename,'r').read()
open(filename,'w').write(t)

This is handy, but also somewhat disturbing - the file is changed by
writing its contents back to it seemingly unaltered.

    line.replace(os.linesep, '')

The above works when iterating over text files, but fails if only a
final linesep should be stripped. OTOH, I've seen the non-portable:

    line = line[:-1]

in production code.

I've found that os.linesep causes more trouble than it solves when
trying to write portable, file-writing code. Seemingly innocent code
like this:

    f = open(filename,'w')
    f.write(os.linesep.join(['One','Two','Three']))
    f.close()

actually writes a file with screwed-up line-endings on Windows -
'\r\r\n' to be precise (which, I have discovered, shows up
single-spaced in some Windows text-editors and double-spaced in
others).
f.write('\n'.join(['One','Two','Three'])), on the other hand, does the
right thing on both Windows and Linux.
So unless I'm using binary mode for reading and writing files, I use
'\n' and line[:-1] everywhere. It's odd that hard-coding '\n' is more
portable than using os.linesep, but I've just gotten used to it.
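
The doubled '\r' can be reproduced on any platform with a small sketch, assuming Python 3, where open() takes a newline argument: passing newline="\r\n" makes text-mode writes behave the way they do on Windows. The function name write_joined is made up for illustration.

```python
import os
import tempfile


def write_joined(separator, newline):
    """Join three words with `separator`, write them in text mode, return raw bytes."""
    fd, path = tempfile.mkstemp()
    try:
        # newline="\r\n" makes the writer translate every '\n' to '\r\n',
        # which mimics text-mode output on Windows.
        with os.fdopen(fd, "w", newline=newline, encoding="ascii") as f:
            f.write(separator.join(["One", "Two", "Three"]))
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# Joining with the Windows value of os.linesep gets translated a second time.
print(write_joined("\r\n", "\r\n"))
# Joining with a plain '\n' comes out right.
print(write_joined("\n", "\r\n"))
```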

Chris Perkins

Erik Max Francis

2001-12-16 19:14:57 UTC

Post by Skip Montanaro
Even if you get the file from an operating system that uses another
line ending convention?

Then you should probably handle it on a case-by-case basis. After all,
you might be getting an arbitrary binary file, or a "text" file that has
an end-of-line convention that you've never seen before.

John Roth

2001-12-16 16:11:49 UTC

"Tim Hammerquist" <tim at vegeta.ath.cx> wrote in message

Tim Hammerquist

2001-12-16 14:21:34 UTC

I've been trying to figure out the canonical way to strip the line
endings from a text file.

In general, I can use:

line = line[:-1]
or
del line[-1:]

to strip the last character off the line, but this only works on
operating systems that have a one-byte line separator like Unix/Linux
('\n'). The Win32 line separator is 2-bytes ('\r\n'), so this
solution is not portable.

Perl (the dreaded word!) had a chomp() function that worked similarly
to the following. (The original modified its argument and returned the
number of bytes chomped.)

    import os

    def chomp(line):
        if line.endswith(os.linesep):
            return line[:line.rindex(os.linesep)]
        return line

I've seen several other solutions, including:

line.replace(os.linesep, '')

The above works when iterating over text files, but fails if only a
final linesep should be stripped. OTOH, I've seen the non-portable:

line = line[:-1]

in production code.

Is there a common idiom for this procedure? What do you use with
portability in mind?

TIA,
Tim Hammerquist

--
Did I listen to pop music because I was miserable...
Or was I miserable because I listened to pop music?
-- John Cusack, "High Fidelity"

Emile van Sebille

2001-12-16 15:23:25 UTC

"Tim Hammerquist" <tim at vegeta.ath.cx> wrote in message

How about:

line.endswith(os.linesep)

--

Emile van Sebille
emile at fenx.com

---------

Ulf Magnusson

2001-12-16 15:56:27 UTC

line = line[:-1]
or
del line[-1:]
snip...
TIA,

There is an easy way to print them without the newline:

    print "hello world",

Use a comma attached at the end. This of course doesn't modify the
string (strings are immutable), but it fixes the print output.

The string module also supports rstrip, strip and lstrip, which
remove whitespace (from the right side, both sides, and the left
side, respectively).

Cheers

/U. Magnusson

Skip Montanaro

2001-12-16 18:50:45 UTC

I've been trying to figure out the canonical way to strip the line
endings from a text file.

I suspect the high frequency of

line = line[:-1]

simply means that most people haven't worried too much about their code
running across multiple platforms. I know for most of the code I've written
I've never worried too much about it. Still, how about:

line = re.sub(r'(\r\n|\r|\n)$', '', line)

? In a little filter script I use to convert to canonical line endings I
use

sys.stdout.write(re.sub(r'\r\n|\r', r'\n', sys.stdin.read()))

(This for a set of filters that processes web pages in various ways.)
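
Wrapped up as functions, the two expressions above might look like the sketch below. The names chomp_re and normalize_eol are made up, and the pattern uses \Z in place of $: \Z matches only at the very end of the string, whereas $ can also match just before a final newline.

```python
import re

# One line ending ('\r\n', '\r' or '\n') at the very end of the string.
_EOL_AT_END = re.compile(r"(\r\n|\r|\n)\Z")


def chomp_re(line):
    """Remove a single trailing line ending of any of the three kinds."""
    return _EOL_AT_END.sub("", line)


def normalize_eol(text):
    """Convert DOS and old Mac line endings to '\\n'."""
    return re.sub(r"\r\n|\r", "\n", text)


print(repr(chomp_re(" keeps spaces \r\n")))
print(repr(normalize_eol("a\r\nb\rc\n")))
```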

--
Skip Montanaro (skip at pobox.com - http://www.mojam.com/)

Sean 'Shaleh' Perry

2001-12-16 18:21:03 UTC

Post by Martin Bless

    def removecrlf(line):
        """Remove \r\n or \n from end of string."""
        if line[-1:] == '\n':
            if line[-2:-1] == '\r':
                return line[:-2]
            else:
                return line[:-1]
        else:
            return line

So in the worst case you create 3 strings to return one. Ick.

Jason Orendorff

2001-12-16 19:53:19 UTC

I've been trying to figure out the canonical way to strip
the line endings from a text file.

There isn't one. Almost always, either rstrip() is sufficient,
*or* you're doing text slinging, in which case you can leave the
newline characters on there, do the regex stuff, and file.write()
the lines out the same way they came in.

That said, the function you want is:

    def chomp(line):
        if line.endswith('\r\n'):
            return line[:-2]
        elif line.endswith('\r') or line.endswith('\n'):
            return line[:-1]
        else:
            return line

If you're getting these strings from a text file, you could:

    for unstripped_line in file:
        for line in unstripped_line.splitlines():
            ...process line...

Why is this necessary? Unfortunately readline() doesn't
interpret a bare '\r' as a line ending on Windows or Unix.
So if the file contains bare '\r's, then the above code
will read the entire file into the unstripped_line
variable, then break it into lines with splitlines().
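
To show what splitlines() does with each kind of ending, here is a short sketch that should run on a current Python 3. The sample string is invented. Note that on str, splitlines() also breaks on a few other characters, such as form feed, so it is a little broader than the three endings discussed here.

```python
sample = "one\r\ntwo\rthree\n"

# Default: the line endings are dropped.
stripped = sample.splitlines()

# keepends=True: each line keeps whatever ending it had.
kept = sample.splitlines(keepends=True)

print(stripped)
print(kept)
```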

--
Jason Orendorff http://www.jorendorff.com/