wikipedia-map's Introduction

Wikipedia Map

A web app for visualizing the connections between Wikipedia pages. Try it at wikipedia.luk.ke.

Screenshot of Wikipedia Map

Usage

Start by entering a topic into the text box, for example Cats. A single “node” will be generated, labeled Cat, which appears as a circle on the graph. Click this node to expand it.

Expanding a node creates a new node for each Wikipedia article linked in the first paragraph of the article you clicked. These new nodes will be connected to the node from which they were expanded. For example, expanding Cat will create eight nodes, including Fur, Mammal, Carnivore, and Domestication, each of which will be connected to Cat. These new nodes can also be expanded in the same way. By continuing to expand nodes, you can build a complex web of related topics.

You can also enter multiple articles to "compare" by pressing Comma, Tab, or Enter after each one you enter.

How it works

API

When you click to expand a node, a request is made to the Wikipedia API to download the full content of the Wikipedia article corresponding to that node. Wikipedia Map uses this data to find the links in the first paragraph of the article.
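
As a rough illustration of such a request, the sketch below builds a URL for the MediaWiki `action=parse` endpoint, which returns an article's rendered HTML. The helper name `buildParseUrl` and the exact parameter set are assumptions for illustration; the app itself may build its requests differently.

```javascript
// Hypothetical helper (not taken from the project): builds a MediaWiki API URL
// that asks for the rendered HTML of one article.
// Assumes the `action=parse` endpoint; the real app may use other parameters.
function buildParseUrl(title, lang = 'en') {
  const params = new URLSearchParams({
    action: 'parse', // render a page
    page: title,     // article title, e.g. "Cat"
    prop: 'text',    // only return the HTML body
    format: 'json',
    redirects: '1',  // follow redirects such as "Cats" to "Cat"
    origin: '*',     // allow anonymous cross-origin requests from a browser
  });
  return `https://${lang}.wikipedia.org/w/api.php?${params}`;
}

console.log(buildParseUrl('Cat'));
```

In a browser, the resulting URL could be passed to `fetch` and the HTML read from the JSON response.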

HTML Parsing

wikipedia_parse.js uses the DOMParser API to parse Wikipedia pages’ HTML (retrieved from calls to Wikipedia's API). The parser looks for the <p> tag corresponding to the first paragraph of the article, then extracts all of the <a> tag links within this paragraph. It then filters the links to include only those which link to other Wikipedia articles.

You can see this in action yourself in your browser’s console. If you have Wikipedia Map open, open your browser’s developer tools and type await getSubPages('Cat'). After a second, you should see an array with the names of other related articles.
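
The filtering step described above might look roughly like the sketch below. It is not the project's actual code: the function names are hypothetical, and it assumes article links have the form `/wiki/Title` and that a colon in the title marks a non-article namespace.

```javascript
// Hypothetical filter (names are illustrative, not from wikipedia_parse.js).
// Assumes article links look like "/wiki/Title" and that a colon in the title
// marks a non-article namespace (File:, Help:, WP:, ...). This heuristic would
// also reject the rare article whose title contains a colon.
function isArticleLink(href) {
  if (!href.startsWith('/wiki/')) return false;
  const title = href.slice('/wiki/'.length).split('#')[0];
  return title.length > 0 && !title.includes(':');
}

// Turns a list of hrefs into readable, de-duplicated article titles.
function articleTitles(hrefs) {
  const titles = hrefs
    .filter(isArticleLink)
    .map((href) => href.slice('/wiki/'.length).split('#')[0])
    .map((raw) => decodeURIComponent(raw).replace(/_/g, ' '));
  return [...new Set(titles)]; // drop repeated links, keep first-seen order
}

console.log(articleTitles(['/wiki/Mammal', '/wiki/File:Cat.jpg', '#cite_note-1', '/wiki/Fur']));
```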

The graph

The front-end uses vis.js to display the graph. Every time a node is clicked, the app makes an XMLHttpRequest to the Node.js server. The resulting links are added as new nodes, colored according to their distance from the central node (see "Visual" under "Design choices").

Cloning

To use the app locally, simply

git clone https://github.com/controversial/wikipedia-map/

and open index.html in a web browser. No compilation or server is necessary to run the front-end.

Design choices

Functional

Expanding a node creates nodes for each article linked in the first paragraph of the article for the node you expand. I've chosen to use links only from the first paragraph of an article for 2 reasons:

  1. There is usually a manageable number of these links, about 5-10 per page.
  2. These links tend to be more directly relevant to the article than links further down in the page.

Visual

Nodes are lighter in color when they are farther away from the central node. If it took 5 steps to reach Ancient Greek from Penguin, it will be a lighter color than a node like Birding, which only took 2 steps to reach. Thus, a node's color indicates how closely an article is related to the central topic.
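
One way to get this effect is to mix a base color toward white in proportion to the node's distance. The sketch below is only an illustration: the function name, base color, and step size are assumptions, not values from the app.

```javascript
// Hypothetical color helper: mixes a base RGB color toward white as the
// node's distance (in expansion steps) from the central node grows.
// The base color and step size are illustrative assumptions.
function colorForLevel(level, base = [3, 169, 244], step = 0.15) {
  const t = Math.min(level * step, 0.9); // cap so distant nodes never turn pure white
  const mixed = base.map((c) => Math.round(c + (255 - c) * t));
  return `rgb(${mixed.join(', ')})`;
}

console.log(colorForLevel(2)); // e.g. Birding, two steps away
console.log(colorForLevel(5)); // e.g. Ancient Greek, five steps away
```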

Hovering the mouse over a node will highlight the path back to the central node. This is not necessarily the shortest path back; it is the path that you took to reach the node.
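
Because the traceback follows the path you actually took, it can be found by having each node remember the node it was expanded from. The sketch below assumes a plain `parents` lookup object, which is a hypothetical structure rather than the app's real data model.

```javascript
// Hypothetical traceback: each node id maps to the id of the node it was
// expanded from, so the path to the central node is found by following
// those parent links until a node with no parent is reached.
function tracePath(nodeId, parents) {
  const path = [];
  const seen = new Set(); // guards against accidental cycles in the parent map
  let current = nodeId;
  while (current !== undefined && !seen.has(current)) {
    path.push(current);
    seen.add(current);
    current = parents[current];
  }
  return path;
}

// Penguin is the start node here, so it has no entry of its own.
const exampleParents = { Birding: 'Bird', Bird: 'Penguin' };
console.log(tracePath('Birding', exampleParents));
```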

Roadmap

Stuff I'd like to implement soon(ish)

Interface

  • Build a GUI
    • Change input method to something other than prompt
    • Allow starting anew without refreshing page
    • Create small info button that explains the project, controls, etc.
      • Render this README into the help dialog
      • The area with the network should contain instructions when it is blank
      • Create a more thorough help dialog explaining controls, etc. which also includes the README
    • Add a "Random Article" button
    • Create a better help menu that pops up when a user first visits.
    • Make the tour better
      • Show users how to expand and trace back nodes. To do this, create a floating invisible div over a start node. Then, pin the Shepherd step to this div.
      • Don't allow users to advance to the next step until they've followed the instruction (entering articles, pressing Go)
      • Disappear the info box when the tour is started
  • Allow inputting of multiple starts
    • Build an interface for this
  • Implement saving + sharing
    • Saving a graph
    • Loading a graph from an id
    • Loading a graph from a URL parameter
    • Implement sharing UI

Interaction

  • Hovering over a node will show a traceback of how you arrived at that node, kind of like breadcrumbs
  • Mobile optimization: Implement a separate set of controls for touch devices
  • On both desktop and mobile, double-click (or tap) a node to open the corresponding Wikipedia page in a new tab
  • Improve efficiency of highlighting the nodes

Technical

  • .gitignore-ify the libraries directory, no reason for it to be in here when I didn't write that stuff
  • Remove dependence on some external libraries:
    • jQuery
    • wordwrap
    • tinycolor
  • Move JavaScript to separate files from HTML
  • Make API requests asynchronous
  • Add some kind of build system to make building local copies and contributing easier

Stuff it might be nice to implement sometime in the far future

  • Autocomplete names of Wikipedia articles in the top bar
  • Make the size of nodes reflect the number of backlinks
  • Support for other languages
  • Support for other MediaWiki wikis

Credits

This project is powered by Wikipedia, whose wealth of information makes this project possible.

The presentation of the graph is powered by vis.js.

wikipedia-map's People

Contributors

benbryant0, controversial, lukaskollmer

wikipedia-map's Issues

remove all nodes with only one connection

It would be nice if there was a way to simplify a graph by removing all nodes with only one connection. This would allow you to build a complex, meaningful network fairly quickly (explore a few things, find some connections, then delete the guff).

It would also be nice to be able to manually delete selected nodes somehow.
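
A minimal sketch of the requested pruning, assuming the graph is available as plain arrays of nodes (`{ id }`) and edges (`{ from, to }`). The function name and the `keep` parameter are hypothetical.

```javascript
// Hypothetical pruning pass over plain arrays of nodes ({ id }) and edges
// ({ from, to }). Removes every node that has exactly one connection, along
// with that connection. Ids listed in `keep` (e.g. start nodes) are never removed.
function pruneLeaves(nodes, edges, keep = []) {
  // Count how many edge endpoints touch each node.
  const degree = new Map(nodes.map((n) => [n.id, 0]));
  for (const { from, to } of edges) {
    degree.set(from, (degree.get(from) || 0) + 1);
    degree.set(to, (degree.get(to) || 0) + 1);
  }
  const protectedIds = new Set(keep);
  const removed = new Set(
    nodes
      .filter((n) => degree.get(n.id) === 1 && !protectedIds.has(n.id))
      .map((n) => n.id)
  );
  return {
    nodes: nodes.filter((n) => !removed.has(n.id)),
    edges: edges.filter((e) => !removed.has(e.from) && !removed.has(e.to)),
  };
}
```

This is a single pass: nodes that only become leaves after pruning are kept, so the function would need to be called again to remove them too.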

Doesn't work

Hi, I'd like to see your page in action but unfortunately it doesn't seem to work.
I type an article then press go but it remains in the "tour" section.

Start flask server

Hi, I am running Flask on Python 2.7. Is the code for the local server gone? I can't find api/api.py anymore :(

Regards

executing `api/api.py`

Hi,
I get the error below when I try to run python api.py to start the Flask server.

Traceback (most recent call last):
  File "api.py", line 10, in <module>
    from wikipedia_parse import *
  File "C:\WikiMap\wikipedia-map-master\api\wikipedia_parse.py", line 164
    print is_article(":Cows"), is_article("WP:UA") # Test if it's an article
    ^
SyntaxError: invalid syntax

Thank you in advance for your help

"Geographic Coordinates System" marked as only link for many pages about places

Recently it seems the structure of many pages has changed such that the box indicating the coordinates of a place is contained within the first direct p descendant of .mw-parser-output. This causes "Geographic Coordinate System" to be marked as the only link from all of these articles.
screen shot 2018-04-30 at 10 49 43 am
In this image, highlighted p node contains the coordinates information, but is structurally the first p node that is a direct child of .mw-parser-output

Get more links for selected nodes

The current script gets the links from the first paragraph, but this is sometimes not particularly useful. For example, Dog only returns "Carl Linnaeus" (this might be a bug though, because the first paragraph of https://en.wikipedia.org/wiki/Dog has more links than that).

It would be good to be able to (optionally) use more paragraphs to rip links from, so that nodes with weak first paragraphs can be expanded.

Also, I wonder if it wouldn't be better to use the first 3 paragraphs by default. I have a local copy that gets the first three, and it seems to capture a much more representative set of links.

Duplicate Nodes.

Problem

Sometimes, in a graph, the same node is shown twice. In this graph, you can see that J.K. Rowling is shown twice, once with a space between initials, and once without; both "J. K. Rowling" and "J.K. Rowling".

Cause

This is because one page links to https://en.m.wikipedia.org/wiki/J._K._Rowling, while the other links to https://en.m.wikipedia.org/wiki/J.K._Rowling. These both redirect to the same page, but are different URLs. Therefore, wikipedia_parse.py, which only looks at the last segment of the URL, interprets them differently.

Possible solutions

One option is to look at the actual title of each page after following the link, by calling get_page_name on each node that is added. This would be very slow, so a faster method is preferable.

Some cases could be solved by simply storing a lowercased version of page titles with spaces removed as node IDs, and using the full title for node labels. However, this still would not resolve things like Cat vs Cats, which go to the same page but might be linked differently.
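
The second idea could be sketched as follows. The function name is hypothetical, and stripping underscores as well as spaces is an added assumption, made because URL segments use underscores where titles use spaces.

```javascript
// Hypothetical ID normaliser for the second idea: lowercase the title and
// strip spaces. Underscores are stripped too (an assumption, since URL
// segments such as "J._K._Rowling" use them in place of spaces).
// The original title would still be used as the node label.
function nodeIdFor(title) {
  return title.toLowerCase().replace(/[\s_]+/g, '');
}

console.log(nodeIdFor('J. K. Rowling') === nodeIdFor('J.K._Rowling')); // same node
console.log(nodeIdFor('Cat') === nodeIdFor('Cats')); // still treated as different
```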

Feature: Add next terms

It would be great if it was possible to add a term after initial search without losing connections - e.g. for brainstorming purposes.

Great tool, btw - really appreciate

Lag in traceback

In very large networks, the traceback can sometimes be very slow. The whole network pauses during slow tracebacks.

A hackish solution could be to call traceBack asynchronously from a setTimeout call, which might not freeze the network in the same way. However, this would address the symptoms rather than the problem, and not actually improve the speed.

A much better solution would be to increase speed by reducing the number of iterations that are made through the traceback nodes. Right now, 6 iterations are made:

  1. Iterate through parents to identify traceback nodes
  2. Iterate through identified nodes to adjust color
  3. Iterate through identified nodes again inside vis.DataSet.update
  4. Iterate through parents to identify traceback edges
  5. Iterate through identified edges to adjust color
  6. Iterate through the edges again inside vis.DataSet.update

Looking into the code for vis.DataSet.update, it appears that commit dfc633e was made in error. This added two more iterations to the list, further slowing down the traceback, rather than speeding it up.

To bring this down to one loop, traceBack, getTraceBackNodes, and getTraceBackEdges could be merged into a single function with one iteration. nodes.update() and edges.update() could be called once for each item as they are identified and modified. This could bring the total loops made through the same data during a traceBack down to one.
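
A sketch of what the merged function might look like is given below. It is a variant of the proposal: rather than calling `update()` per item, it walks the parent chain once and collects the node and edge changes into two arrays. The `parents` lookup, the `${parent}-${child}` edge id convention, and the highlight color are all assumptions for illustration, not the app's real data model.

```javascript
// Hypothetical single-pass traceback. Assumes each node id maps to its parent
// id in `parents`, and that the edge joining them has the id `${parent}-${child}`.
// Node and edge changes are collected in the same walk up the parent chain.
function traceBackOnce(nodeId, parents, highlight = '#f57c00') {
  const nodeUpdates = [];
  const edgeUpdates = [];
  const seen = new Set(); // guards against cycles in the parent map
  let current = nodeId;
  while (current !== undefined && !seen.has(current)) {
    seen.add(current);
    nodeUpdates.push({ id: current, color: highlight });
    const parent = parents[current];
    if (parent !== undefined) {
      edgeUpdates.push({ id: `${parent}-${current}`, color: highlight });
    }
    current = parent;
  }
  return { nodeUpdates, edgeUpdates };
}
```

The two arrays could then be handed to `nodes.update(nodeUpdates)` and `edges.update(edgeUpdates)`, since a vis.js `DataSet` accepts either a single item or an array. That leaves one walk through the parents plus one batched update per data set.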

Show All Links of full Article as Nodes

Hello,
My colleagues and I regularly play a game where the goal is to find the shortest path between two Wikipedia articles. I found this software to enable a checking function. Therefore, I have two questions:

Is it possible to extract all links within an article using this script?
Is it possible to display all links as nodes?

More minor unicode problems

There are still some minor issues with unicode:

  • Typing unicode characters in the top bar will return an error
  • When get_page_name results in unicode characters, they're left out of the node title:

Unicode problems with random button

I've got the random button working for special characters like , and and., but it still doesn't work well for very strange characters. For example, there's a severe display bug with Wikipedia page titles like "Dąbrowice, Gmina Maków." In the text box, it displays with an HTML character encoding (the second character displays as &#261;, with the full text of D&#261;browice, Gmina Mak�w), and on the node it displays like

screen shot 2016-03-06 at 4 23 55 pm

This is likely partly Python's fault and partly JavaScript's. I'll try to fix it.

Direct linking to graphs

Coming from erabug/wikigraph#2 and fedwiki/wiki#63 we know that linking to certain states of the graph would be interesting.

Similar to what CoGraph allows, but by using URL fragments known from @fedwiki lineups.

If I searched for Space and Time, they'd automatically be added to the URL and therefore create a stable view onto the data. Those nodes should be expanded by default on load.
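
A minimal sketch of how start articles might be stored in and read back from a URL fragment. The fragment format (comma-separated, percent-encoded titles) and the function names are assumptions for illustration only.

```javascript
// Hypothetical fragment format: "#Space,Time". Each title is percent-encoded,
// so titles that themselves contain commas survive a round trip.
function toFragment(titles) {
  return '#' + titles.map((t) => encodeURIComponent(t)).join(',');
}

function fromFragment(fragment) {
  const body = fragment.replace(/^#/, '');
  if (body === '') return [];
  return body.split(',').map((part) => decodeURIComponent(part));
}

console.log(toFragment(['Space', 'Time']));
console.log(fromFragment('#Space,Time'));
```

In a browser, the fragment could be written to `window.location.hash` after each search and read back on page load to decide which nodes to create and expand.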

Data Source for the Map

Hi,

Can you please let me know the data source which you are using in the Flask Server API?

Thanks,

Node shows edge to self

When you enter facebook as one of your search terms, the main topic node will show an edge to itself
