elsewhere's Issues

Adding http request headers

A couple of sites, like GitHub, have started rejecting requests that have no HTTP headers. Could we add this to the options?

// the HTTP headers to use when requesting
// resources from the internet
httpHeaders: {
  'Accept': '*/*',
  'Cache-Control': 'no-cache',
  'Connection': 'keep-alive',
  'Pragma': 'no-cache',
  'User-Agent': 'node.js'
}

And where we use the request object, add the following code change:

    var requestObj = {
      uri: this.url,
      headers: options.httpHeaders
    };

    request(requestObj, function (requestErrors, response, body) {
      // ....
    });

I am happy to make this change?
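
For reference, a minimal sketch of how a per-call headers option could fall back to those defaults (the defaultHeaders name and the fetch wrapper are illustrative, not the existing API):

    var request = require('request');

    // assumed defaults; the real values would live in the options module
    var defaultHeaders = {
      'Accept': '*/*',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
      'Pragma': 'no-cache',
      'User-Agent': 'node.js'
    };

    function fetch(url, options, callback) {
      var requestObj = {
        uri: url,
        // use the caller's headers when supplied, otherwise the defaults
        headers: (options && options.httpHeaders) || defaultHeaders
      };
      request(requestObj, callback);
    }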

Support cached responses for any URL within a crawled network

Aspirational feature, i.e. not too important right now.

Summary: Support cached responses to a URL's network when the URL is already known as a secondary URL in a previously cached response.

The crawling required to discover a rel=me network of URLs is intensive and takes some time, so we want to avoid excessive processing as much as possible.

Currently, we cache the network based only on the originally requested URL, e.g. example.com. However, we should also index the data by every other URL in the network, e.g. if a request is later made to foo.com, and foo.com is already known to be part of example.com's network, then the previously cached data should be served (although, this time, foo.com will be considered the original URL and so foo.com will not be included in the URL list, but example.com will be).
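
A rough sketch of the indexing, assuming a simple in-memory cache keyed by URL (the cache object and function names here are hypothetical, not the existing caching layer):

    var cache = {};

    // store the cached network under every URL it contains, so a later
    // request for any member URL hits the cache
    function cacheNetwork(requestedUrl, network) {
      [requestedUrl].concat(network.urls || []).forEach(function (u) {
        cache[u] = network;
      });
    }

    // serving it for a secondary URL then only means re-labelling which
    // URL is treated as the 'original' in the response
    function getCachedNetwork(url) {
      return cache[url] || null;
    }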

useCache option

I am finding it very useful in other projects to have a useCache option that applies to the individual request, so I can switch caching on and off for testing purposes. It should not be set globally, as that would affect other users of the API.

i.e. { useCache: false }

Could we add this? Again, I'm happy to put it into the code.
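
The per-request check being suggested would look something like this (getGraph, cache and crawl are placeholder names, not the existing internals):

    // useCache defaults to true; pass { useCache: false } to bypass the
    // cache for a single request without affecting other callers
    function getGraph(url, options, callback) {
      var useCache = !options || options.useCache !== false;
      if (useCache && cache[url]) {
        return callback(null, cache[url]);
      }
      crawl(url, options, function (err, graph) {
        if (!err) cache[url] = graph;
        callback(err, graph);
      });
    }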

iframes

If you take a look at the source of this personal site, http://www.benjaminparry.me, you'll see that it just contains a frame that then displays his actual site.

<frame src="http://dl.dropbox.com/u/3242878/benjaminparrydotme/index.html" name="domainmonsterMain" />

This seems like a tragic missed opportunity as the actual site is full of lots of juicy rel=me's.

Not sure how or whether to handle this kind of thing. Suggestions?
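
One possible approach, sketched with cheerio (which the develop-cheerio branch already uses): if a fetched page is nothing but a frame or iframe wrapper, follow its src and parse that document instead. The wrapper detection below is only an assumption about how such pages could be recognised.

    var cheerio = require('cheerio');

    // return the src of a frame/iframe if the page looks like a bare
    // wrapper around another document, otherwise null
    function frameTarget(html) {
      var $ = cheerio.load(html);
      var frames = $('frame[src], iframe[src]');
      var hasOwnRelMe = $('a[rel~="me"], link[rel~="me"]').length > 0;
      if (frames.length === 1 && !hasOwnRelMe) {
        return frames.first().attr('src');
      }
      return null;
    }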

Mitigate URL shorteners

So Twitter have started using their t.co url shortener on their profile links. At the moment Elsewhere is oblivious to this and any other shortener; it'll just treat the shortened url like it's the actual url. This is a problem.

What it basically means is that your website, instead of being example.com, is identified as t.co/24rkwdfj. Now when Elsewhere is validating links, it can't find any link to example.com, it can only see t.co/24rkwdfj and since nowhere else links to that, it won't treat it as being a valid resource.

In order to mitigate this we need to make Elsewhere aware of when it is resolving redirects (I'm not exactly sure how they're handled at the moment).

The solution proposed is that we identify sites by their actual url i.e. the url that the shortener resolves to. We will still however keep track of the urls that are used as part of any redirects to the resolved url.

The end result, aside from the fixed urls, is a slight modification to each resource returned in the response.

{
  "results": [
    {
      "url": "http://chrisnewtn.com",
      "title": "Chris Newton",
      "favicon": "http://chrisnewtn.com/favicon.ico",
      "outboundLinks": {
          "verified": [ ... ],
          "unverified": [ ]
      },
      "inboundCount": {
        "verified": 4,
        "unverified": 0
      },
      "verified": true,
      // new bit
      "urlAliases": [
        "http://t.co/vV5BWNxil2"
      ]
    }
  ],
  "query": "http://chrisnewtn.com",
  "created": "2012-10-12T16:30:57.270Z",
  "crawled": 9,
  "verified": 9
}

This urlAliases property contains all the other urls that Elsewhere has encountered identifying the resource, just in case it's useful.
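
With the request module the final URL after any redirects is available on the response, so the shortened url can be recorded as an alias rather than used as the identity. A hedged sketch (the addAlias bookkeeping is hypothetical):

    var request = require('request');

    function fetchResolvingRedirects(requestedUrl, callback) {
      request({ uri: requestedUrl, followRedirect: true }, function (err, response, body) {
        if (err) return callback(err);
        // the URL the redirect chain finally resolved to
        var resolvedUrl = response.request.uri.href;
        if (resolvedUrl !== requestedUrl) {
          // identify the resource by resolvedUrl; keep requestedUrl as an alias
          addAlias(resolvedUrl, requestedUrl);
        }
        callback(null, resolvedUrl, body);
      });
    }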

Crawl opengraph data?

I've been noticing these tags in the source of some of the pages:

<meta property="og:site_name" content="Flickr" />

Would it be worth crawling this data?
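
If it is worth doing, the tags are easy to pick up with cheerio; a minimal sketch:

    var cheerio = require('cheerio');

    // collect og:* meta properties, e.g. { site_name: 'Flickr', title: '...' }
    function openGraphData(html) {
      var $ = cheerio.load(html);
      var data = {};
      $('meta[property^="og:"]').each(function () {
        var key = $(this).attr('property').slice(3); // strip the 'og:' prefix
        data[key] = $(this).attr('content');
      });
      return data;
    }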

Allow url encoded request url as query parameter

e.g.

nodesocialgraph.com/example.com
nodesocialgraph.com/?q=example.com
nodesocialgraph.com/?q=http%xx%xx%xxexample.com
nodesocialgraph.com/?q=https%xx%xx%xxexample.com
nodesocialgraph.com/?q=https%xx%xx%xxexample.com&strict
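
Handling both the path form and the ?q= form (encoded or not) could look roughly like this on the server side; the parameter names follow the examples above, the rest is a sketch:

    var url = require('url');

    function queryFromRequest(req) {
      var parsed = url.parse(req.url, true);
      // ?q=... takes precedence; otherwise fall back to the path form /example.com
      var raw = parsed.query.q || parsed.pathname.slice(1);
      // decode %3A%2F%2F etc. so an encoded url behaves the same as a plain one
      var target = decodeURIComponent(raw);
      var strict = 'strict' in parsed.query;
      return { url: target, strict: strict };
    }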

Support a limit on the crawl depth

E.g. by setting a crawl depth of 10, the spider will only follow a maximum chain of 10 URLs before it gives up on finding a valid reciprocal rel=me URL and abandons the chain.

The crawl counter increments whenever a link is followed. When the counter reaches the crawl limit, that branch of the network is abandoned. Each time a valid reciprocal link in the requested network is found, the crawl counter is reset to 0. When all branches have been explored to the extent of the crawl limit, the response data is cached and served.

The crawl limit prevents an unreciprocated link triggering a huge search chain, while still permitting a reasonable amount of crawling of the network, in order to capture its spread.

E.g. example.com may link to foo.com, and foo.com to bar.com, and bar.com back to example.com - here, a crawl limit of 3 will be sufficient to discover the network.

The parameter should be adjustable via a config.json file on the server.
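
A sketch of that counter logic (fetchLinks and isVerifiedReciprocal stand in for the existing scraper and validator; the crawl limit itself would come from config.json):

    // follow a chain of links, abandoning the branch once `depth` links
    // have been followed without finding a verified reciprocal rel=me link
    function followChain(fromUrl, crawlLimit, depth) {
      if (depth >= crawlLimit) return; // give up on this branch
      fetchLinks(fromUrl, function (err, links) {
        if (err) return;
        links.forEach(function (link) {
          if (isVerifiedReciprocal(link)) {
            followChain(link, crawlLimit, 0); // reset the counter
          } else {
            followChain(link, crawlLimit, depth + 1);
          }
        });
      });
    }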

Grapher sharing resources between requests

The grapher needs heavily modifying in order to avoid it sharing resources between HTTP requests.

For example, the confirmedLinks array is built up over every request. It should instead be fresh for each request. Fix it!
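
The usual fix is to scope that state to the request rather than the module; a hedged sketch:

    // instead of a module-level confirmedLinks shared by every caller,
    // create the state when a graph request starts and pass it along
    function createCrawlState() {
      return {
        confirmedLinks: [],
        pendingUrls: []
      };
    }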

Differences between graph method and JSON response

When looking at #29, #32 and similar previous issues, it occurs to me that there is now a great disparity between the graph you get after calling the graph method on the server (the raw, unmodified graph object) and the JSON you get from querying it (the crafted object literal).

Really I'd like to return the object literal that forms the JSON on the server too, as this would eliminate the differences between the two responses; plus, I don't think people need the full graph object in the response anyway.

Just to clarify: all the current manipulation of the response (getting rid of the deeper links, only showing verified pages, etc.) is presently done in the toJSON method.

add rel=author

Google have started using rel=author on pages authored by an individual. The links are set up to point back to a profile page.

It may be interesting to add rel=author parsing for the first url given to the API. It could be used as a starting point for a graph.

I would not recommend parsing beyond the first URL provided to the API.
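
Limited to that first URL, the parsing itself would be much like the existing rel=me handling; a cheerio sketch:

    var cheerio = require('cheerio');

    // rel=author links on the first page only, to seed the graph
    function authorLinks(html) {
      var $ = cheerio.load(html);
      return $('a[rel~="author"], link[rel~="author"]').map(function () {
        return $(this).attr('href');
      }).get();
    }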

After verification, compare the path depth and discard longer paths from the response

(A more complete description of #4).

Calculate path depth:

  • Initially, ignore 'www' and the 'http(s)' protocol.
  • Treat the root domain, e.g. example.com (or www.example.com) as a path depth of zero.
  • For each 'file' or sub-directory, e.g. example.com/foo or example.com/bar/, increment the path depth by one. (For the moment, we should ignore trailing slashes, as some links to example.com/foo are meant to be to example.com/foo/ and vice versa)
  • For each sub-domain (except 'www'), increment the path depth by one.
  • Ignore query parameters.

This results in a number that represents the path depth of the URL. The response should only include those URLs that have the shortest path depth for a given domain. E.g. if the following URLs are included in the list:

then the first two should be included in the response (they both have a path depth of 2), but the third and fourth should not (they have a path depth of 3).

Finally, once the crawling has completed, and the links are verified, then remove any URLs that represent the same resource as another URL in the list. This should be done by favouring HTTPS over HTTP and favouring shorter URLs over longer ones:

Note that these comparisons are only made after the crawling and verification routines. They are not made before crawling and so do not impact on the decision whether to crawl a URL or not (that depends only on the crawl depth - see #24)
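
A sketch of the path depth calculation following the rules above (the sub-domain counting assumes a simple two-level registered domain such as example.com):

    var url = require('url');

    function pathDepth(href) {
      // tolerate urls given without a protocol
      var parsed = url.parse(/^https?:\/\//.test(href) ? href : 'http://' + href);

      // 'files' and sub-directories, ignoring trailing slashes and the query string
      var depth = parsed.pathname.split('/').filter(function (segment) {
        return segment.length > 0;
      }).length;

      // each sub-domain (except 'www') adds one
      var hostParts = parsed.hostname.replace(/^www\./, '').split('.');
      depth += Math.max(hostParts.length - 2, 0);

      return depth;
    }

So pathDepth('example.com') is 0, pathDepth('http://example.com/foo/') is 1, and pathDepth('http://foo.example.com/bar/baz') is 3.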

1st request can be unpredictable

Alterations to the validation code, together with the non-crawling of links on non-validated pages, have made the first graph of a url highly unpredictable, e.g. returning only a few results, or even unvalidated links in strict mode.

Subsequent requests made once the graph has been built always return exactly the same data i.e. the correct data.

Response object undefined

Just got this error trying to graph http://premasagar.com/:

/home/cnewton/Development/l4rp/elsewhere/lib/scraper.js:58
            logger.warn('http error: '  + response.statusCode 
                                                  ^
TypeError: Cannot read property 'statusCode' of undefined
    at Request._callback (/home/cnewton/Development/l4rp/elsewhere/lib/scraper.js:58:51)
    at Request.init.self.callback (/home/cnewton/Development/l4rp/elsewhere/node_modules/request/main.js:120:22)
    at Request.EventEmitter.emit (events.js:88:17)
    at ClientRequest.Request.init.self.clientErrorHandler (/home/cnewton/Development/l4rp/elsewhere/node_modules/request/main.js:222:10)
    at ClientRequest.EventEmitter.emit (events.js:88:17)
    at CleartextStream.socketCloseListener (http.js:1314:9)
    at CleartextStream.EventEmitter.emit (events.js:115:20)
    at SecurePair.destroy (tls.js:907:22)
    at process.startup.processNextTick.process._tickCallback (node.js:244:9)

Seems that in certain situations the response object comes back undefined. I'm going to try to replicate/fix it after food, but otherwise I'm loving the changes so far.
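
A guard in the request callback would at least stop the crash; a minimal sketch of what the handler around scraper.js line 58 could check (the callback naming is an assumption and the error handling is simplified):

    request(requestObj, function (requestErrors, response, body) {
      if (requestErrors || !response) {
        // TLS/socket errors can fire the callback with no response object at all
        logger.warn('http error: ' + (requestErrors ? requestErrors.message : 'no response'));
        return callback(requestErrors || new Error('no response'));
      }
      if (response.statusCode !== 200) {
        logger.warn('http error: ' + response.statusCode);
        return callback(new Error('http status ' + response.statusCode));
      }
      // ...carry on parsing body as before
    });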

Versioning schema

Implementing #48 made me think. When it comes to updating the version number, as #48 is a change that will break anything dependent on Elsewhere, should it go from 0.0.3 to 0.1.0 or to 0.0.4?

For that matter, are we using semantic versioning? If so, I need to read up on it!

It's probably too early in development to be worrying about this stuff, but I thought I'd broach the subject and get the ball rolling on it before it becomes a problem.

Error on accessing via ?url= query parameter

In the develop-cheerio branch.

Request:
GET http://localhost:8888/?url=chrisnewtn.com

Console:

info: -----------------------------------------------
info: Elsewhere started - with url: chrisnewtn.com
info: parsing: chrisnewtn.com
warn: TypeError: Cannot read property 'statusCode' of undefined
info: total pages 1
info: total fetched 0
info: total errors 0
info: total pages outside limits 0
info: total html request time: 0ms
info: total time taken: 4ms

Response:

{
"results": [],
"query": "chrisnewtn.com",
"created": "2012-10-12T23:04:46.411Z",
"crawled": 1,
"verified": true
}

Do something meaningful with non-HTML resources, e.g. Atom/RSS feeds

Options:

  • Exclude all non-HTML resources
  • Include non-HTML resources, but don't attempt to verify them
  • Include non-HTML resources, and add functionality to verify known types like Atom and RSS (e.g. use <link rel="alternate">)

Include the MIME type in the item object in the response, and/or indicate whether the resource was actually parsed.
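
Whichever option is chosen, the content type is available on the response headers, so the branching is straightforward; a sketch:

    // classify a response so non-HTML resources can be skipped,
    // left unverified, or handed to a feed-specific verifier
    function classifyResponse(response) {
      var type = (response.headers['content-type'] || '').toLowerCase();
      if (type.indexOf('text/html') !== -1) return 'html';
      if (type.indexOf('atom+xml') !== -1) return 'atom';
      if (type.indexOf('rss+xml') !== -1) return 'rss';
      return 'other';
    }

The returned value (or the raw content-type header) could then be recorded on the item object as the MIME type mentioned above.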

elsewhere and bundleDependencies

Hi,

Using elsewhere in the elsewhere-mapper-site, which is an express project, works with local bundleDependencies, but not when deployed to jit.su: http://elsewhere-mapper.jit.su/example01.html

The above could be another jsdom deployment problem.

The package.json contains one other bundleDependencies package, which is working fine.

Any ideas? Are you going to add the project to npm?

Glenn

Add domain limiter

If a rel=me link also contains rel=next or rel=prev then ignore it

Even in strict mode, links should be followed, waiting for the loop to complete

e.g. entering http://lanyrd.com/profile/premasagar/ will link to http://twitter.com/premasagar which links to http://premasagar.com which links back to http://lanyrd.com/profile/premasagar/ and http://twitter.com/premasagar - verifying all three as belonging to each other.

A chain / tree depth limit should be in place to prevent the server crawling too much.
