[arin-tech-discuss] bulkwhois libxml2

Wes Young wes at ren-isac.net
Sat Sep 25 21:15:14 EDT 2010


gak. OK, I was misreading the XML::LibXML::Reader implementation. As
a follow-up, here's what I came up with, which seems to work (it's
missing a few things, but people will get the idea).

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML::Reader;
use XML::LibXML::XPathContext;
use XML::LibXML;

my $cache = "/tmp/arin_db.xml";

my $reader = XML::LibXML::Reader->new('location'   => $cache,
                                      load_ext_dtd => 0);
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs('arin','http://www.arin.net/bulkwhois/core/v1');

my $x = 0;

while ($reader->read()){
     # prime the pump...
     next unless($reader->name() eq 'asn');
     last;
}
do {
     # copy the node into memory (as a DOM for searching; probably
     # leaky, but it works)
     my $node = $reader->copyCurrentNode(1);
     for(lc($reader->name())){
         # select case for the various top-level nodes found within
         # bulkwhois/core/v1
         if(/^asn$/){
             # need to search via the namespace....
             print $xpc->find('./arin:startAsNumber',$node)."\n";
         }
         if(/^....$/) { ... }
         last;
     }
   # jump top level nodes, skipping children
} while ($reader->nextSibling());
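
For what it's worth, here is a rough variation on the loop above, not
what I actually ran against the dump. It assumes XML::LibXML::Reader's
nextElement(localname, nsURI) can do the "find the next <asn>" step by
itself, so the prime-the-pump loop goes away. The sub name
asn_start_numbers, the <bulkwhois> root element, and the sample data are
made up for illustration; the sample deliberately has </asn><asn> with
no line break in between.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::Reader;
use XML::LibXML::XPathContext;

my $NS = 'http://www.arin.net/bulkwhois/core/v1';

# Hypothetical helper: returns the startAsNumber of every <asn> element
# found in the XML string passed in.
sub asn_start_numbers {
    my ($xml) = @_;

    my $reader = XML::LibXML::Reader->new(string => $xml, load_ext_dtd => 0);
    my $xpc    = XML::LibXML::XPathContext->new();
    $xpc->registerNs('arin', $NS);

    my @starts;

    # nextElement() returns 1 each time it lands on a matching element
    # and 0 once the end of the document is reached
    while ($reader->nextElement('asn', $NS) == 1) {
        # deep-copy the current <asn> so it can be searched with XPath
        my $node = $reader->copyCurrentNode(1);
        push @starts, $xpc->findvalue('./arin:startAsNumber', $node);
    }
    return @starts;
}

# Made-up sample; the root element name is an assumption.
my $sample = join '',
    qq{<bulkwhois xmlns="$NS">},
    '<asn><startAsNumber>1</startAsNumber></asn>',
    '<asn><startAsNumber>7</startAsNumber></asn>',
    "\n</bulkwhois>";

print "$_\n" for asn_start_numbers($sample);
```

For the real file, swapping string => $xml for location => $cache should
be the only change needed, but I haven't timed this version.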


Some stats (within a vm):

$> time perl arin_parse.pl
real	3m28.060s
user	3m9.504s
sys	0m15.633s

total mem wasted:	450 MB (mostly because it's Perl; the leak is easy to
mitigate by using threads)


cheers,

On Sep 22, 2010, at 9:51 AM, Andy Newton wrote:

>
> On Sep 21, 2010, at 3:37 PM, Wes Young wrote:
>
>> Figured this was a good place to start...
>>
>> I'm debugging an issue (or some "functionality") with libxml2 and
>> Perl (using the XML::LibXML::Reader interface), where it seems to
>> be stumbling on the series of:
>>
>> </asn><asn>\n
>>
>> If I insert a line break (</asn>\n<asn>\n), the libxml2 reader
>> function rips through it no problem. If there's no line break, it
>> views the <asn> as a blank element and then reads the rest of the
>> file as garbage data (I'm trying to do this stream-like instead of
>> DOM-like).
>>
>> I'm assuming there isn't anything wrong with the way it's outputted  
>> (guessing most people are just java-nuts and it works like that),  
>> but i'm curious if anyone has gotten around this issue with libxml2  
>> (or alike) by setting some sort of parsing flag, etc.
>>
>> I've tested it a few times with the first few sets of asn elements
>> you'll find in the data, and the line break pretty much makes it
>> reproducible... just wondering if anyone has worked around that from
>> the Perl side...
>
> Wes,
>
> The line break shouldn't be necessary. If you have validation turned  
> on, perhaps that is causing some sort of problem. Also check other  
> parsing options.
>
> I just ran last night's bulk whois through xmllint with the --stream  
> option and it worked fine. xmllint uses libxml.
>
> Sorry I can't be of further help here.
>
> -andy
>

--
Wes
http://claimid.com/wesyoung
