
dmoz-parser's Introduction

Dmoz

Dmoz is an open directory that lists web pages and groups them into categories (directories). Its data is publicly available, but it is provided as an RDF file: a huge, funny XML file.

Dmoz Parser

This is a really simple Python implementation of a Dmoz RDF parser. It does not try to be smart and process the parsed XML for you. You have to provide a handler implementation where YOU decide what to do with the data (store it in a file or database, print it, etc.).

This parser assumes that the last entity in each Dmoz page is the topic element:

 <ExternalPage about="http://www.awn.com/">
   <d:Title>Animation World Network</d:Title>
   <d:Description>Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.</d:Description>
   <priority>1</priority>
   <topic>Top/Arts/Animation</topic>
 </ExternalPage>

This assumption is strictly checked, and processing will abort if it is violated.
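
To make the assumption concrete, here is a minimal sketch of such a check using the standard library's xml.etree.ElementTree. It is not the parser's own code, which may well check this differently. The helper name topic_is_last, the wrapping RDF element, and the namespace URI bound to the d: prefix are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

# Shortened sample page. The xmlns:d URI is a placeholder for this sketch,
# not necessarily the namespace used in the real Dmoz dump.
SAMPLE = """
<RDF xmlns:d="http://example.org/d#">
  <ExternalPage about="http://www.awn.com/">
    <d:Title>Animation World Network</d:Title>
    <priority>1</priority>
    <topic>Top/Arts/Animation</topic>
  </ExternalPage>
</RDF>
"""


def topic_is_last(external_page):
    """Return True if the last child element of the page is <topic>."""
    children = list(external_page)
    return bool(children) and children[-1].tag == "topic"


root = ET.fromstring(SAMPLE)
for page in root.iter("ExternalPage"):
    if not topic_is_last(page):
        # Mirror the strict behaviour described above: abort on violation.
        raise ValueError("topic is not the last entity in %s" % page.get("about"))
```

Loading the whole tree like this only works for small samples. For the full dump a streaming approach would be needed.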

The RDF file needs to be downloaded and unpacked before running the parser. You can download the RDF from the Dmoz site and should gunzip it into this directory.

The RDF is pretty large (over 2 GB unpacked) and parsing it takes some time, so there is a progress indicator.

Warnings

This parser does not check for links between topics in the hierarchy, nor does it do any sophisticated parsing of the hierarchy.

The same URL might appear in multiple locations in the hierarchy.
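
If duplicate URLs are a problem for your use case, one option is to filter them in a handler of your own. The sketch below wraps another handler and forwards only the first occurrence of each URL. It relies on the two-method handler interface described under Handlers. FirstSeenFilter and PrintHandler are hypothetical names and not part of this project. Note that the set of seen URLs is kept in memory, which may be significant for the full dump.

```python
class FirstSeenFilter:
    """Hypothetical wrapper: pass each URL to the inner handler only once."""

    def __init__(self, inner):
        self.inner = inner
        self.seen = set()

    def page(self, page, content):
        # 'page' is the URL; later occurrences of the same URL are dropped.
        if page in self.seen:
            return
        self.seen.add(page)
        self.inner.page(page, content)

    def finish(self):
        # Let the wrapped handler do its own clean-up.
        self.inner.finish()


class PrintHandler:
    """Hypothetical inner handler used only for this demonstration."""

    def page(self, page, content):
        print(page)

    def finish(self):
        print("done")


demo = FirstSeenFilter(PrintHandler())
demo.page("http://www.awn.com/", {})
demo.page("http://www.awn.com/", {})  # duplicate, not forwarded
demo.finish()
```

The trade-off is that only the first topic for each URL reaches the inner handler, so any later topics for the same URL are lost.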

Usage

Instantiate the parser, provide the handler and run.

#!/usr/bin/env python

from parser import DmozParser
from handlers import JSONWriter

parser = DmozParser()
parser.add_handler(JSONWriter('output.json'))
parser.run()

JSONWriter is the built-in handler that outputs the pages, one JSON object per line. (Note: this is not the same as the entire file being one large JSON list.)
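
Because each line is a separate JSON object, the output has to be read line by line rather than with a single json.load call. A minimal sketch of a reader, assuming UTF-8 output and making no assumption about which fields each object contains (iter_json_lines is a hypothetical helper, not part of this project):

```python
import json


def iter_json_lines(path):
    """Yield one decoded object per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Iterating lazily like this keeps memory use flat, which matters for a file of this size. Typical use would be a loop such as `for record in iter_json_lines('output.json'): ...`.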

Requirements

simplejson is necessary for writing JSON output.

Built-in handlers

There are two built-in handlers so far: JSONWriter and CSVWriter. CSVWriter is buggy (see "handlers.py" to understand why), and we recommend JSONWriter.

Handlers

A handler must implement two methods:

def page(self, page, content)

This method is called every time a new page is extracted from the RDF. The page argument contains the URL of the page, and content contains a dictionary of the page content.

def finish(self)

The finish method will be called after the parsing is done. You may want to clean up here, close the files, etc.
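
As an illustration of the two methods, here is a sketch of a custom handler that counts pages per top-level category. TopicCounter is a hypothetical name. The sketch also assumes that the content dictionary carries the category path under a "topic" key, which you should verify against the real parser output.

```python
from collections import Counter


class TopicCounter:
    """Hypothetical handler that tallies pages per top-level category."""

    def __init__(self):
        self.counts = Counter()

    def page(self, page, content):
        # Assumption: content["topic"] looks like "Top/Arts/Animation".
        topic = content.get("topic", "")
        parts = topic.split("/")
        # Use the component after the leading "Top" when there is one.
        top_level = parts[1] if len(parts) > 1 else "(unknown)"
        self.counts[top_level] += 1

    def finish(self):
        # Report the tallies once parsing is done.
        for name, count in self.counts.most_common():
            print("%s\t%d" % (name, count))


# Stand-alone demonstration with a hand-built content dictionary.
handler = TopicCounter()
handler.page("http://www.awn.com/", {"topic": "Top/Arts/Animation"})
handler.finish()
```

With the real parser, such a handler would be registered the same way as JSONWriter in the Usage section, through parser.add_handler(...).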

dmoz-parser's People

Contributors

turian, kremso
