Monday, December 20, 2004

Python scraping follow-up

So c.l.python was helpful as usual. What I was missing was a prefix for the default namespace in my xpath expression - I still have to do some reading up on how xpath and XML namespaces interact. In the meantime I asked them what would happen if I simply yanked the namespace declaration from the file for now, to make the xpath-ing easier. Their concern was that a document may use several namespaces, which is not a problem for me since Tidy does not produce more than one anyway. So I got xpath to play along after I removed the default namespace.
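To illustrate the namespace issue, here is a minimal sketch of both options: binding a prefix to the default namespace, and stripping the namespace altogether. It uses the standard library's ElementTree rather than the libxml2 bindings so that it runs without extra packages, so treat it as an approximation of the idea and not the code I actually ran. The sample document, the prefix name `x` and the helper `strip_namespace` are all made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the XHTML that Tidy produces.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>first</p><p>second</p></body>"
    "</html>"
)

root = ET.fromstring(DOC)

# An unprefixed name only matches elements in NO namespace,
# so this finds nothing in a document with a default namespace.
unprefixed = root.findall(".//p")

# Option 1: bind a prefix (the name "x" is arbitrary) to the namespace.
prefixed = root.findall(".//x:p", {"x": XHTML_NS})


def strip_namespace(elem):
    """Remove the '{namespace}' part from every element tag, in place.

    Assumes a single namespace, and ignores namespaced attributes.
    """
    for node in elem.iter():
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return elem


# Option 2: yank the namespace, after which plain names match again.
stripped = strip_namespace(ET.fromstring(DOC)).findall(".//p")

print(len(unprefixed), len(prefixed), [p.text for p in stripped])
```

Stripping is the quick hack; the prefix mapping is the one that keeps working if a second namespace ever shows up.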
On to the next step, finding the right xpath expression to get the parts of the document I need. BTW the library I have been using is libxml2, which currently seems to have a lot of momentum. I used a couple of editors to look at the document's tree, and the nesting was pretty horrendous. If I were an xpath expert perhaps I would be able to crack this easily, but I felt a need for some kind of automation.
I thought it would be cool to find an editor with 'xpath reverse-engineering', which would allow you to point to a node graphically and have an xpath suggested to you. Of course the theoretical problem with this is that many different xpaths can point to the same node, but I had a feeling that some 'reasonable' solution was possible here. I did some googling around and came up with nothing. Admittedly I have not tried XmlSpy, which I am told is The Tool for XML.
Then I remembered reading about a tool that allows you to navigate XML like a file system from the command line. That sounded neat, even though it did not solve my problem directly. I quickly 'remembered' what the tool was (using google) and fired it up - 'xmllint --shell input.xhtml'. This gives you a prompt, and a lot of the standard shell navigation works like you would expect: ls lists the current node's children, cd allows you to jump to another node, pwd gives you the current path.
I wanted to see the rest of the commands, asked the shell for help, and (drum roll...) there it was - a 'grep' command! I immediately realised that this was what I was looking for - you can 'grep' for a string in the nodes' text and get the xpath to each match. The answer to the multiple potential xpaths was also right there - there is a 'simple' xpath to a node, which can be obtained by navigating from the root without any wildcards or attribute matching stuff. So now I am almost there.
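The same 'grep for text, get a simple path back' idea can be sketched in a few lines of Python. This is not how xmllint implements it and it does not reproduce xmllint's output format; it is just an illustration using the standard library's ElementTree, with a made-up document and a hypothetical helper called `grep_paths`. It only looks at the text directly inside an element (not text that follows a child element) and assumes the namespace has already been removed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for a deeply nested page.
DOC = """<html><body>
<div><p>intro</p></div>
<div><table><tr><td>price: 10</td><td>other</td></tr>
<tr><td>price: 20</td></tr></table></div>
</body></html>"""


def grep_paths(root, needle):
    """Return a 'simple' path for each element whose own text contains needle.

    A simple path walks down from the root using only element names and
    positions, e.g. /html/body[1]/div[2], with no wildcards or attributes.
    """
    hits = []

    def walk(elem, path):
        if elem.text and needle in elem.text:
            hits.append(path)
        seen = {}  # how many children with each tag name we have passed
        for child in elem:
            if not isinstance(child.tag, str):
                continue  # skip comments and processing instructions
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, "%s/%s[%d]" % (path, child.tag, seen[child.tag]))

    walk(root, "/" + root.tag)
    return hits


root = ET.fromstring(DOC)
paths = grep_paths(root, "price")
for p in paths:
    print(p)
```

For the sample document the two matches differ only in the index on `tr`, which already hints at what a more general expression for the whole list would look like.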
What remains is finding a slightly more abstract xpath that selects ALL the nodes I am interested in, so that I can process the whole list of them in a loop. To be continued...
