[arin-tech-discuss] bulkwhois libxml2
Wes Young
wes at ren-isac.net
Sat Sep 25 21:15:14 EDT 2010
Gak. OK, I was misreading the XML::LibXML::Reader implementation. As
a follow-up, here's what I came up with, which seems to work (it's
missing a few things, but people will get the idea).
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::Reader;
use XML::LibXML::XPathContext;

my $cache = "/tmp/arin_db.xml";
my $reader = XML::LibXML::Reader->new(
    location     => $cache,
    load_ext_dtd => 0,
);
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs('arin', 'http://www.arin.net/bulkwhois/core/v1');

# prime the pump: advance the reader to the first <asn> element
while ($reader->read()) {
    last if $reader->name() eq 'asn';
}

do {
    # copy the node into memory (as a DOM for searching; probably
    # leaky, but it works)
    my $node = $reader->copyCurrentNode(1);
    for (lc($reader->name())) {
        # select/case for the various top-level nodes found within
        # bulkwhois/core/v1
        if (/^asn$/) {
            # need to search via the namespace...
            print $xpc->find('./arin:startAsNumber', $node) . "\n";
        }
        if (/^....$/) { ... }
        last;
    }
    # jump to the next top-level node, skipping children
} while ($reader->nextSibling());
Some stats (within a vm):
$> time perl arin_parse.pl
real 3m28.060s
user 3m9.504s
sys 0m15.633s
total mem wasted: about 450 MB (mostly because it's Perl; the leaking
is easy to mitigate by using threads)
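For anyone who wants to try the same loop without downloading the full dump, here is a minimal sketch that runs it against a tiny in-memory document. The sample XML and the collect_start_asns() helper are made up for illustration; only the namespace URI and the element names come from the script above. Note that the sample has no line break between </asn> and <asn>, and that the loop checks nextSibling() against 1 explicitly, since the documentation says it can also return -1 on error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::Reader;
use XML::LibXML::XPathContext;

# Hypothetical sample data: two <asn> records back to back, with no
# line break between </asn> and <asn>.
my $sample = '<bulkwhois xmlns="http://www.arin.net/bulkwhois/core/v1">'
           . '<asn><startAsNumber>1</startAsNumber></asn>'
           . '<asn><startAsNumber>5</startAsNumber></asn>'
           . '</bulkwhois>';

# Hypothetical helper: returns the startAsNumber of every top-level <asn>.
sub collect_start_asns {
    my ($xml) = @_;

    my $reader = XML::LibXML::Reader->new(
        string       => $xml,
        load_ext_dtd => 0,
    );
    my $xpc = XML::LibXML::XPathContext->new();
    $xpc->registerNs('arin', 'http://www.arin.net/bulkwhois/core/v1');

    my @found;

    # advance to the first <asn> start tag, if there is one
    my $primed = 0;
    while ($reader->read()) {
        if ($reader->nodeType() == XML_READER_TYPE_ELEMENT
            && $reader->localName() eq 'asn') {
            $primed = 1;
            last;
        }
    }
    return @found unless $primed;

    do {
        # only element nodes are of interest; whitespace between
        # records shows up as text nodes and is skipped here
        if ($reader->nodeType() == XML_READER_TYPE_ELEMENT
            && $reader->localName() eq 'asn') {
            my $node = $reader->copyCurrentNode(1);
            push @found, $xpc->findvalue('./arin:startAsNumber', $node);
        }
        # stop on 0 (no more siblings) and on -1 (error)
    } while ($reader->nextSibling() == 1);

    return @found;
}

print join(",", collect_start_asns($sample)), "\n";
```

Each copied node goes out of scope at the end of its loop iteration, so nothing here holds on to more than one record at a time; whether that helps with the memory growth on the full file is untested.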
cheers,
On Sep 22, 2010, at 9:51 AM, Andy Newton wrote:
>
> On Sep 21, 2010, at 3:37 PM, Wes Young wrote:
>
>> Figured this was a good place to start...
>>
>> I'm debugging an issue (or some "functionality") with libxml2 and
>> perl (using the XML::LibXML::Reader interface), where it seems to
>> be stumbling on the series of:
>>
>> </asn><asn>\n
>>
>> if I insert a linebreak; (</asn>\n<asn>\n) the libxml2 reader
>> function rips through it no problem, if there's no line break, it
>> views the <asn> as a blank element and then reads the rest of the
>> file as garbage data (trying to do this stream like instead of DOM
>> like).
>>
>> I'm assuming there isn't anything wrong with the way it's outputted
>> (guessing most people are just java-nuts and it works like that),
>> but i'm curious if anyone has gotten around this issue with libxml2
>> (or alike) by setting some sort of parsing flag, etc.
>>
>> I've tested it a few times with the first few set of asn elements
>> you'll find in the data; and the line break pretty much makes it
>> reproducible... jw if anyone has worked around that from the perl
>> side...
>
> Wes,
>
> The line break shouldn't be necessary. If you have validation turned
> on, perhaps that is causing some sort of problem. Also check other
> parsing options.
>
> I just ran last night's bulk whois through xmllint with the --stream
> option and it worked fine. xmllint uses libxml.
>
> Sorry I can't be of further help here.
>
> -andy
>
--
Wes
http://claimid.com/wesyoung