ElementTree: can't figure out a mismatched-tag error

Discussion:

F.R.

2013-07-11 08:59:16 UTC

Hi all,

I haven't been able to get up to speed with XML. I do examples from the
tutorials and experiment with variations. Time and time again I fail
with error messages I can't make sense of. Here's the latest one. The
URL is "http://finance.yahoo.com/q?s=XIDEQ&ql=0". Ubuntu 12.04 LTS,
Python 2.7.3 (default, Aug 1 2012, 05:16:07) [GCC 4.6.3]

import xml.etree.ElementTree as ET
tree = ET.parse('q?s=XIDEQ') # output of wget http://finance.yahoo.com/q?s=XIDEQ&ql=0

Traceback (most recent call last):
  File "<pyshell#69>", line 1, in <module>
    tree = ET.parse('q?s=XIDEQ')
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
    raise err
ParseError: mismatched tag: line 9, column 2

Below are the first nine lines. The line numbers and the following space
are hand-edited in. Three dots stand for sections cut out to fit long lines.
Line 6 is a bunch of "meta" statements, all of which I show on a
separate line each in order to preserve the angled brackets. On all
lines the angled brackets have been preserved. The mismatched character
is the slash of the closing tag </head>. What could be wrong with it?
And if something is wrong with it, what about fault tolerance?

1 <!DOCTYPE html PUBLIC "-//W3C//DTD . . . /strict.dtd">
2 <html lang="en-US">
3 <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
4 <title>XIDEQ: Summary for EXIDE TECH NEW- Yahoo! Finance</title>
5 <meta name="description" xml:space="default" content="View the basic
XIDEQ . . .
6 . . . other companies."><meta name="keywords" content="XIDEQ, EXIDE
TECH . . .">
<meta property="fb:app_id" content="118155468215844">
<meta property="fb:admins" content="503762770,100001149693905">
<meta property="og:type" content="company">
<meta property="og:site_name" content="Yahoo! Finance">
<meta property="og:title" content="Exide Technologies">
<meta property="og:image"
content="http://l.yimg.com/a/p/fi/31/09/00.jpg">
<meta property="og:url" content="http://finance.yahoo.com/q?s=XIDEQ">
<meta property="og:description" content="View the basic XIDEQ . . .
7 other companies."><link rel="canonical"
href="http://finance.yahoo.com/q?s=XIDEQ">
8 <link rel="stylesheet" href="http://l.yimg.com/zz/ . . . type="text/css">
9 </head>
^
Mismatch!

Thanks for suggestions

Frederic

Fábio Santos

2013-07-11 09:08:04 UTC

That is not XML. It is HTML. You get a mismatched tag because tags like
<meta> and <link> don't need closing in HTML, but in XML every single tag
needs closing. The parser reaches </head> while it is still waiting for
</link>.

Use an HTML parser. I strongly recommend BeautifulSoup but I think etree
has an HTML parser too. I am not sure.
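
The mismatch can be reproduced without the Yahoo page. The following is a minimal sketch, not the original file: the two one-line documents and the helper name try_parse are made up for illustration. It feeds ElementTree the same head section once with an unclosed <link> and once with a self-closed one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: identical except for how <link> ends.
html_like = "<html><head><link rel='stylesheet' href='x.css'></head></html>"
xml_like = "<html><head><link rel='stylesheet' href='x.css'/></head></html>"


def try_parse(text):
    """Return the root element, or the ParseError if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        return err


result_html = try_parse(html_like)  # </head> arrives while <link> is still open
result_xml = try_parse(xml_like)    # <link .../> closes itself, so this parses

print("HTML-style input: %s" % result_html)
print("XML-style input parsed, root tag: %s" % result_xml.tag)
```

Only the XML-style string parses. The HTML-style one fails at </head>, which is the same situation as line 9 of the excerpt.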

fronagzen

2013-07-11 09:19:58 UTC

Actually, I don't think etree has a HTML parser. And I would counter-recommend lxml if speed is an issue: BeautifulSoup takes a looooong time to parse a large document.
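
For completeness, the standard library does ship a tolerant HTML tokenizer outside of etree, and it is neither BeautifulSoup nor lxml. The sketch below assumes Python 3, where the module is html.parser (in the Python 2.7 used in this thread it is called HTMLParser). The class name MetaCollector and the sample page string are invented for illustration. It only collects <meta> attributes and does not build a tree.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag, whether it is closed or not."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "meta":
            self.metas.append(dict(attrs))


# Hypothetical excerpt modelled on the head section quoted in the thread.
page = (
    '<html lang="en-US"><head>'
    '<meta property="og:type" content="company">'
    '<meta property="og:title" content="Exide Technologies">'
    '<link rel="canonical" href="http://finance.yahoo.com/q?s=XIDEQ">'
    '</head></html>'
)

collector = MetaCollector()
collector.feed(page)
collector.close()

# Pick out one property from the collected <meta> tags.
titles = [m["content"] for m in collector.metas if m.get("property") == "og:title"]
print(titles)
```

This is event-based rather than tree-based, so it suits simple extraction. For navigating the whole document, the BeautifulSoup and lxml suggestions in this thread are the better fit.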

Fábio Santos

2013-07-11 10:27:02 UTC

Post by fronagzen
Actually, I don't think etree has a HTML parser. And I would
counter-recommend lxml if speed is an issue: BeautifulSoup takes a looooong
time to parse a large document.

Post by Fábio Santos
Use an HTML parser. I strongly recommend BeautifulSoup but I think
etree has an HTML parser too. I am not sure..

I meant lxml. My apologies.

F.R.

2013-07-11 12:25:13 UTC

Thank you all!

I was a little apprehensive it could be a silly mistake. And so it was.
I have BeautifulSoup somewhere. Having had no urgent need for it I
remember shirking the learning curve.

lxml seems to be a package with these components (from help(lxml)):

PACKAGE CONTENTS
ElementInclude
_elementpath
builder
cssselect
doctestcompare
etree
html (package)
isoschematron (package)
objectify
pyclasslookup
sax
usedoctest

I would start with "from lxml import html" and see what comes out.

Break time now. Thanks again!

Frederic

fronagzen

2013-07-11 12:49:56 UTC

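from lxml.html import parse

# URL of the page from the original post
target_url = "http://finance.yahoo.com/q?s=XIDEQ&ql=0"
# parse() fetches and parses the page; getroot() returns the <html> element
root = parse(target_url).getroot()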

This will get you the root node of the element tree parsed from the URL. Conveniently, the lxml HTML parser can also fetch the web page itself. If you want to control things like the socket timeout, though, you'll have to use urllib to request the URL and then feed the response to the parser.
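
The timeout variant could look roughly like this. It is a sketch under assumptions: it uses Python 3's urllib.request rather than the Python 2.7 modules from this thread, and the helper names fetch_root and parse_html_bytes, the 10-second default, and the sample bytes are made up for illustration. The sample is parsed locally so the sketch runs without network access.

```python
import io
import urllib.request

from lxml import html


def parse_html_bytes(data):
    """Parse raw HTML bytes with lxml's HTML parser and return the root element."""
    return html.parse(io.BytesIO(data)).getroot()


def fetch_root(url, timeout=10.0):
    """Download url with an explicit socket timeout, then parse the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_html_bytes(response.read())


# Local stand-in for a downloaded page; note the unclosed <link> tag.
sample = (
    b"<html><head><title>Demo</title>"
    b"<link rel='stylesheet' href='x.css'></head>"
    b"<body><p>hi</p></body></html>"
)
root = parse_html_bytes(sample)
title = root.findtext(".//title")
print(root.tag, title)
```

Calling fetch_root("http://finance.yahoo.com/q?s=XIDEQ&ql=0", timeout=5) would then raise an error if the server does not answer in time, instead of blocking.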