
vor-knowledge-graph's Introduction

Project vör : Open Knowledge modeling




Synopsis

The project started as a dirty hack for crawling and modelling the large volume of open knowledge available on Wikipedia. The result is a "nearly" complete graph of that knowledge, along with the ability to traverse the relations between knowledge topics.


Infrastructure / Prerequisites

To build and run the knowledge graph engine with vör, you need the following software for the infrastructure:

  • Python 3.x
  • Node.js (only for the graph visualiser)
  • MongoDB (stores the raw crawled Wikipedia pages)
  • OrientDB (stores the knowledge graph)


Setup

Install the Python 3.x requirements:

  $ pip3 install -U -r requirements.txt

Install the Node.js modules required by the graph visualiser. You may skip this step if you are not interested in visualisation.

  $ npm install

Other than the registered npm packages, you also need to install Sigma.js for visualisation. The module is not bundled with this repository.
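If you prefer to pull it in via npm rather than downloading it manually, Sigma.js is published on npm as sigma; note that the exact version the visualiser expects is an assumption here:

  $ npm install sigma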


1) Download (crawl) Wikipedia pages

Execute:

  $ python3 crawl_wiki.py --verbose 

The script continuously crawls knowledge topics from Wikipedia, starting from the seed page. You may change the initial topic within the script to whatever suits you best. To stop the process, simply terminate it; it doesn't leave anything in a dirty state, so you can re-execute the script at any time.

[NOTE] The script keeps crawling and downloading related knowledge through link traversal. It never ends unless you terminate it.
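The crawled pages end up in MongoDB (they are imported into OrientDB in step 2). If you want to check how far the crawl has progressed, you can query the crawl collection directly. A minimal sketch, assuming a local MongoDB on the default port; the database and collection names here are guesses, so check pylib/knowledge/datasource.py for the actual ones:

  from pymongo import MongoClient  # pip3 install pymongo

  client = MongoClient("localhost", 27017)
  # Database / collection names are assumptions; see pylib/knowledge/datasource.py
  crawl = client["vor"]["crawl"]

  # Count the pages the crawler has already downloaded (requires pymongo >= 3.7)
  print(crawl.count_documents({"downloaded": True}), "pages downloaded so far")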


2) Build the knowledge graph

Execute:

  $ python3 build_knowledge.py --verbose --root {PASSWORD} --limit {NUM}

Where {PASSWORD} is your OrientDB root password and {NUM} is the number of Wikipedia topics to process.

The script simply imports the entire raw text knowledge from MongoDB into OrientDB as one big graph. The output graph in OrientDB is built from the following components:

  • [1] Vertices : Represent topics / keywords.
  • [2] Edges : Represent relations between topic-keyword or keyword-keyword pairs.

[NOTE] The script processes all the data in the collection, from the beginning to the end. This can take a long time if your collection is large.
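Once the import has finished, you can inspect the resulting graph with any OrientDB client. A minimal sketch using pyorient; the database name vor and the connection details are assumptions, and since the vertex/edge class names created by build_knowledge.py are not documented here, the query just walks the generic V class:

  import pyorient  # pip3 install pyorient

  client = pyorient.OrientDB("localhost", 2424)
  client.connect("root", "{PASSWORD}")
  client.db_open("vor", "root", "{PASSWORD}")

  # Peek at a few vertices (topics / keywords) of the imported graph
  for record in client.command("SELECT FROM V LIMIT 5"):
      print(record.oRecordData)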


3) Visualise the knowledge graph

Execute:

  $ node visualise {PASSWORD}

Where {PASSWORD} is your OrientDB root password. The script downloads the graph data from OrientDB and renders it as visual figures. Once it's done, you can view the graphs as follows.

  • [1] Universe of topics graph [html/graph-universe.html]
  • [2] Index graph [html/graph-index.html]

4) Build Word2Vec model over the crawled data

Execute:

  $ python3 build_wordvec.py --limit {LIMIT} --out {PATH_TO_MODEL}

There should already be a sufficient amount of downloaded Wikipedia data in MongoDB, produced by running crawl_wiki.py. The output is a binary Word2Vec model file written to {PATH_TO_MODEL}.
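To sanity-check the trained model, you can load it back and query for similar words. A minimal sketch, assuming build_wordvec.py uses gensim and writes the standard word2vec binary format (if it uses gensim's native save() instead, load it with Word2Vec.load):

  from gensim.models import KeyedVectors

  # PATH_TO_MODEL is the --out path passed to build_wordvec.py
  vectors = KeyedVectors.load_word2vec_format("PATH_TO_MODEL", binary=True)

  # Find the five words closest to a topic word in the embedding space
  print(vectors.most_similar("knowledge", topn=5))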


5) Create topic index

Execute:

  $ python3 build_index.py --limit {LIMIT} --root {PASSWORD}

The script generates another OrientDB database, vorindex, which contains an inverted index of the topics and their corresponding keywords. The weights of the edges are calculated from how frequently each word appears in each topic.
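The weighting itself is plain term frequency. A minimal sketch of the idea, using hypothetical topic data rather than the project's actual code path:

  from collections import Counter

  def index_weights(topic_keywords):
      # topic_keywords maps a topic name to the list of keywords extracted from it.
      # Returns {topic: {keyword: weight}} where the weight is the keyword's
      # frequency within that topic, i.e. the edge weights described above.
      return {topic: Counter(words) for topic, words in topic_keywords.items()}

  print(index_weights({"Set cover problem": ["set", "cover", "set", "problem"]}))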



Licence

The project is licensed under the GNU General Public License v3. All third-party libraries are redistributed under their own licences.



vor-knowledge-graph's Issues

pymongo.errors.CursorNotFound: Cursor not found, cursor id: 22364095374

New link [Set cover problem] HAS => [25th]
['Graz', 'Austria']
Adding : Set cover problem ===> ['Graz', 'Austria']
New link [Set cover problem] HAS => [Graz]
New link [Set cover problem] HAS => [Austria]
['doi']
Adding : Set cover problem ===> ['doi']
New link [Set cover problem] HAS => [doi]
['related', 'media', 'Set', 'Setcover', 'links', 'cover', 'problem', 'Commons', 'coverproblem']
Adding : Set cover problem ===> ['related', 'media', 'Set', 'Setcover', 'links', 'cover', 'problem', 'Commons', 'coverproblem']
New link [Set cover problem] HAS => [related]
New link [Set cover problem] HAS => [media]
New link [Set cover problem] HAS => [Set]
New link [Set cover problem] HAS => [Setcover]
New link [Set cover problem] HAS => [links]
New link [Set cover problem] HAS => [cover]
New link [Set cover problem] HAS => [problem]
New link [Set cover problem] HAS => [Commons]
New link [Set cover problem] HAS => [coverproblem]
['Solutions', 'Set', 'Benchmarks', 'WinnerDetermination', 'NPoptimization', 'Cover', 'Covering', 'HiddenOptimum', 'optimization', 'SetCover', 'CoveringSet', 'Hidden', 'Determinationcompendium', 'Winner', 'Minimum', 'MinimumSet', 'Optimum', 'Determination', 'compendium', 'problems']
Adding : Set cover problem ===> ['Solutions', 'Set', 'Benchmarks', 'WinnerDetermination', 'NPoptimization', 'Cover', 'Covering', 'HiddenOptimum', 'optimization', 'SetCover', 'CoveringSet', 'Hidden', 'Determinationcompendium', 'Winner', 'Minimum', 'MinimumSet', 'Optimum', 'Determination', 'compendium', 'problems']
New link [Set cover problem] HAS => [Solutions]
New link [Set cover problem] HAS => [Set]
New link [Set cover problem] HAS => [Benchmarks]
New link [Set cover problem] HAS => [WinnerDetermination]
New link [Set cover problem] HAS => [NPoptimization]
New link [Set cover problem] HAS => [Cover]
New link [Set cover problem] HAS => [Covering]
New link [Set cover problem] HAS => [HiddenOptimum]
New link [Set cover problem] HAS => [optimization]
New link [Set cover problem] HAS => [SetCover]
New link [Set cover problem] HAS => [CoveringSet]
New link [Set cover problem] HAS => [Hidden]
New link [Set cover problem] HAS => [Determinationcompendium]
New link [Set cover problem] HAS => [Winner]
New link [Set cover problem] HAS => [Minimum]
New link [Set cover problem] HAS => [MinimumSet]
New link [Set cover problem] HAS => [Optimum]
New link [Set cover problem] HAS => [Determination]
New link [Set cover problem] HAS => [compendium]
New link [Set cover problem] HAS => [problems]
Set cover problem processed with 534 nodes.
34 wiki documents processed so far...
Traceback (most recent call last):
  File "build_knowledge.py", line 113, in <module>
    for topic,sentence in iter_topic(crawl_collection,args['start']):
  File "build_knowledge.py", line 41, in iter_topic
    for wiki in crawl_collection.query({'downloaded': True},field=None,skip=start):
  File "/home/mldl/ub16_prj/vor-knowledge-graph/pylib/knowledge/datasource.py", line 21, in query
    for n in query:
  File "/usr/local/lib/python3.5/dist-packages/pymongo/cursor.py", line 1189, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib/python3.5/dist-packages/pymongo/cursor.py", line 1126, in _refresh
    self.__send_message(g)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/cursor.py", line 978, in __send_message
    codec_options=self.__codec_options)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/cursor.py", line 1067, in _unpack_response
    return response.unpack_response(cursor_id, codec_options)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/message.py", line 1418, in unpack_response
    self.raw_response(cursor_id)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/message.py", line 1384, in raw_response
    raise CursorNotFound(msg, 43, errobj)
pymongo.errors.CursorNotFound: Cursor not found, cursor id: 22364095374
mldl@ub1604:/ub16_prj/vor-knowledge-graph$
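For reference, this pymongo error typically means the server-side cursor timed out (or was reaped) while the client was still iterating over a long-running query, which matches the slow per-topic processing in build_knowledge.py. A common workaround (not the project's own fix) is to keep the cursor alive and fetch smaller batches; a minimal sketch against plain pymongo, with the database and collection names as assumptions rather than the project's datasource.py wrapper:

  from pymongo import MongoClient

  coll = MongoClient("localhost", 27017)["vor"]["crawl"]  # names are assumptions
  cursor = coll.find({"downloaded": True}, no_cursor_timeout=True).batch_size(100)
  try:
      for doc in cursor:
          pass  # process each downloaded wiki page here
  finally:
      cursor.close()  # mandatory clean-up when no_cursor_timeout=True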

Cannot visualise graph

When I try to visualise the knowledge graph, I get graph-data.js, but graph-index.js remains empty and the HTML files are not generated.

================================
[Datasource] Processing : { name: 'vor',
mapper: [Function: circularGraphMapper],
output: 'graph-data.js' }

[Connected] to OrientDB [vor].
All nodes retrieved...
Enumerating edges...
Transforming nodes & edges ...
Initialising I/O ...
Serialising graph to JS ...
118 nodes
100 links
Graph HTML is ready in ./HTML/

[Datasource] Processing : { name: 'vorindex',
mapper: [Function: indexGraphMapper],
output: 'graph-index.js' }

[Connected] to OrientDB [vorindex].
All nodes retrieved...
Enumerating edges...
Remapping nodes...
Initialising I/O ...
Serialising graph to JS ...
0 nodes
0 links
Graph HTML is ready in ./HTML/

Can anyone help me with this?
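The log above shows 0 nodes and 0 links for the vorindex datasource, so a quick way to tell whether this is a data problem rather than a visualiser problem is to count the vertices in the vorindex database directly (it is populated by build_index.py in step 5). A minimal sketch using pyorient, with the connection details as assumptions:

  import pyorient

  client = pyorient.OrientDB("localhost", 2424)
  client.connect("root", "{PASSWORD}")
  client.db_open("vorindex", "root", "{PASSWORD}")

  # If this prints 0, build_index.py has not populated the index database yet,
  # which would explain the empty graph-index.js.
  result = client.command("SELECT count(*) AS n FROM V")
  print(result[0].oRecordData["n"], "vertices in vorindex")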
