Discussion:
xml parsing with sax
Alex Martelli
2001-04-29 21:38:18 UTC
Permalink
"Harald Kirsch" <kirschh at lionbioscience.com> wrote in message
news:yv2y9sm469d.fsf at lionsp093.lion-ag.de...
[snip]
Fortunately, the parser just calls .read(N) on its
*That* was the information I was missing. The rest is then more or
less straighforward, as your example code demonstrates.
This is not a formally documented constraint on the parser (alas,
until and unless interfaces or protocols do get formalized in Python,
we'll have to live with lots of informal mentions of [e.g.] "file-like
objects" that never clarify WHICH file-object methods do need to
be implemented...), but for the current implementation it can be
determined experimentally. Code inspection on the library sources
AND running a few experiments on an instrumented filelike object
(that delegates all to a file but also prints out what is being called)
to confirm one's understanding gives you that with a few minutes'
investment.


Alex
Alex Martelli
2001-04-27 11:26:58 UTC
Permalink
"Harald Kirsch" <kirschh at lionbioscience.com> wrote in message
news:yv28zkm63gi.fsf at lionsp093.lion-ag.de...
[snip]
it possible to push a pseudo root element in front of a stream parsed
with
xml.sax.parse(sys.stdin, ...)
If the stream fits in memory, it's clearly easy:


import xml.sax, xml.sax.handler, sys, cStringIO

class Handler(xml.sax.handler.ContentHandler):
def startElement(self, name, attrs):
print "Element",name

def wrapWith(tag, fileob):
result = cStringIO.StringIO()
result.write("<%s>\n"%tag)
result.write(fileob.read())
result.write("</%s>\n"%tag)
result.seek(0)
return result

xml.sax.parse(wrapWith("fakedoc",sys.stdin), Handler())


If you have to cater for large streams, such that
fileob.read() could blow up, then wrapWith() will
need to become a factory callable for an appropriate
object -- not quite as easy but of course still OK;
somewhat in the same spirit as cookbook recipe
http://www.activestate.com/ASPN/Python/Cookbook/Recipe/52295
we could have:

import xml.sax, xml.sax.handler, sys, cStringIO

class Handler(xml.sax.handler.ContentHandler):
def startElement(self, name, attrs):
print "Element",name

class FileWrappedWithTag:
def __init__(self, tagname, file):
self.tagname = tagname
self.file = file
self.state = 0

def __getattr__(self, attr):
return getattr(self.file, attr)

def read(self, size=-1):
if size<0:
self.state = 2
return "<%s>\n%s</%s>\n" % (
self.tagname, self.file.read(), self.tagname)
elif self.state == 0:
self.state = 1
return "<%s>\n%s" % (
self.tagname, self.file.read(size-len(self.tagname)-3))
elif self.state == 1:
result = self.file.read(size)
if result: return result
self.state = 2
return "</%s>\n" % self.tagname
elif self.state == 2:
return ""

def wrapWith(tag, fileob):
return FileWrappedWithTag(tag, fileob)

xml.sax.parse(wrapWith("fakedoc",sys.stdin), Handler())


Fortunately, the parser just calls .read(N) on its
file-object argument -- it does *NOT* type-test,
and a GOOD thing it is that it doesn't! -- so that
Python's typical signature-based polymorphism lets
us do the job in a reasonably clean and easy way.


Alex
Harald Kirsch
2001-04-27 17:11:42 UTC
Permalink
Post by Alex Martelli
"Harald Kirsch" <kirschh at lionbioscience.com> wrote in message
[snip]
it possible to push a pseudo root element in front of a stream parsed
with
xml.sax.parse(sys.stdin, ...)
No, it does not fit.

[snip]
Post by Alex Martelli
return FileWrappedWithTag(tag, fileob)
xml.sax.parse(wrapWith("fakedoc",sys.stdin), Handler())
Fortunately, the parser just calls .read(N) on its
*That* was the information I was missing. The rest is then more or
less straighforward, as your example code demonstrates.

Thanks,
Harald Kirsch
--
----------------+------------------------------------------------------
Harald Kirsch | kirschh at lionbioscience.com | "How old is the epsilon?"
LION bioscience | +49 6221 4038 172 | -- Paul Erd?s
*** Please do not send me copies of your posts. ***
Harald Kirsch
2001-04-27 17:07:19 UTC
Permalink
Post by Alex Martelli
"Harald Kirsch" <kirschh at lionbioscience.com> wrote in message
I have more (or less) XML data, however it is missing a root element
or, put another way, it is like a sequence of individual XML documents
similar to
<bla>...</bla>
<bla>...</bla>
<bla>...</bla>
<bla>...</bla>
Trying to parse this with python's inbuilt sax results in an error
message about 'junk after document element', i.e. after the first
</bla>. In a way this is correct because a surrounding root element is
missing. However, is there a way to instruct sax to keep going. Or is
it possible to push a pseudo root element in front of a stream parsed
with
xml.sax.parse(sys.stdin, ...)
Unless your XML is huge (always a possiblity...) you could use something
It *is* huuuugge. Well, and in particular it is generated in a pipe
and I want to see some output as soon as possible and not after
several megabytes are digested.

Harald Kirsch
--
----------------+------------------------------------------------------
Harald Kirsch | kirschh at lionbioscience.com | "How old is the epsilon?"
LION bioscience | +49 6221 4038 172 | -- Paul Erd?s
*** Please do not send me copies of your posts. ***
Harald Kirsch
2001-04-27 10:29:17 UTC
Permalink
I have more (or less) XML data, however it is missing a root element
or, put another way, it is like a sequence of individual XML documents
similar to

<bla>...</bla>
<bla>...</bla>
<bla>...</bla>
<bla>...</bla>

Trying to parse this with python's inbuilt sax results in an error
message about 'junk after document element', i.e. after the first
</bla>. In a way this is correct because a surrounding root element is
missing. However, is there a way to instruct sax to keep going. Or is
it possible to push a pseudo root element in front of a stream parsed
with

xml.sax.parse(sys.stdin, ...)

Harald Kirsch
--
----------------+------------------------------------------------------
Harald Kirsch | kirschh at lionbioscience.com | "How old is the epsilon?"
LION bioscience | +49 6221 4038 172 | -- Paul Erd?s
Steve Holden
2001-04-27 11:55:53 UTC
Permalink
"Harald Kirsch" <kirschh at lionbioscience.com> wrote in message
I have more (or less) XML data, however it is missing a root element
or, put another way, it is like a sequence of individual XML documents
similar to
<bla>...</bla>
<bla>...</bla>
<bla>...</bla>
<bla>...</bla>
Trying to parse this with python's inbuilt sax results in an error
message about 'junk after document element', i.e. after the first
</bla>. In a way this is correct because a surrounding root element is
missing. However, is there a way to instruct sax to keep going. Or is
it possible to push a pseudo root element in front of a stream parsed
with
xml.sax.parse(sys.stdin, ...)
Unless your XML is huge (always a possiblity...) you could use something
like:

myXML = sys.stdin.read()
xml.sax.parseString("<root>\n"+myXML+"</root>\n")

As you can see, this corrects the structural deformity. Obviously, you could
use more complex content around the <bla> ... </bla> if you needed it.

regards
steVE

Loading...