elsewhere's Issues

Adding http request headers

A couple of sites, like GitHub, have started rejecting requests that have no HTTP headers. Could we add this to the options?

// the HTTP headers to use when requesting
// resources from the internet
httpHeaders: {
  'Accept': '*/*',
  'Cache-Control': 'no-cache',
  'Connection': 'keep-alive',
  'Pragma': 'no-cache',
  'User-Agent': 'node.js'
}

And where we use the request object, add the following code change:

    var requestObj = {
      uri: this.url,
      headers: options.httpHeaders
    };

    request(requestObj, function (requestErrors, response, body) {
      // ....
    });

I am happy to make this change?
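
For reference, a minimal sketch of how a per-call headers option could fall back to those defaults (the defaultHeaders name and the fetch wrapper are illustrative, not the existing API):

    var request = require('request');

    // assumed defaults; the real values would live in the options module
    var defaultHeaders = {
      'Accept': '*/*',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
      'Pragma': 'no-cache',
      'User-Agent': 'node.js'
    };

    function fetch(url, options, callback) {
      var requestObj = {
        uri: url,
        // use the caller's headers when supplied, otherwise the defaults
        headers: (options && options.httpHeaders) || defaultHeaders
      };
      request(requestObj, callback);
    }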

Support cached responses for any URL within a crawled network

Aspirational feature, i.e. not too important right now.

Summary: Support cached responses to a URL's network when the URL is already known as a secondary URL in a previously cached response.

The crawling required to discover a rel=me network of URLs is intensive and takes some time, so we want to avoid excessive processing as much as possible.

Currently, we cache the network based only on the originally requested URL, e.g. example.com. However, we should also index the data by every other URL in the network, e.g. if a request is later made to foo.com, and foo.com is already known to be part of example.com's network, then the previously cached data should be served (although, this time, foo.com will be considered the original URL and so foo.com will not be included in the URL list, but example.com will be).
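
A rough sketch of the indexing, assuming a simple in-memory cache keyed by URL (the cache object and function names here are hypothetical, not the existing caching layer):

    var cache = {};

    // store the cached network under every URL it contains, so a later
    // request for any member URL hits the cache
    function cacheNetwork(requestedUrl, network) {
      [requestedUrl].concat(network.urls || []).forEach(function (u) {
        cache[u] = network;
      });
    }

    // serving it for a secondary URL then only means re-labelling which
    // URL is treated as the 'original' in the response
    function getCachedNetwork(url) {
      return cache[url] || null;
    }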

useCache option

I am finding it very useful in other projects to have a useCache option that applies to the individual request, so I can switch caching on and off for testing purposes. It should not be set globally, as that would affect other users of the API.

i.e. { useCache: false }

Could we add this? Again, I'm happy to put it into the code.
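
The per-request check being suggested would look something like this (getGraph, cache and crawl are placeholder names, not the existing internals):

    // useCache defaults to true; pass { useCache: false } to bypass the
    // cache for a single request without affecting other callers
    function getGraph(url, options, callback) {
      var useCache = !options || options.useCache !== false;
      if (useCache && cache[url]) {
        return callback(null, cache[url]);
      }
      crawl(url, options, function (err, graph) {
        if (!err) cache[url] = graph;
        callback(err, graph);
      });
    }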

iframes

If you take a look at the source of this personal site, http://www.benjaminparry.me, you'll see that it just contains a frame that then displays his actual site.

<frame src="http://dl.dropbox.com/u/3242878/benjaminparrydotme/index.html" name="domainmonsterMain" />

This seems like a tragic missed opportunity as the actual site is full of lots of juicy rel=me's.

Not sure how or whether to handle this kind of thing. Suggestions?
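
One possible approach, sketched with cheerio (which the develop-cheerio branch already uses): if a fetched page is nothing but a frame or iframe wrapper, follow its src and parse that document instead. The wrapper detection below is only an assumption about how such pages could be recognised.

    var cheerio = require('cheerio');

    // return the src of a frame/iframe if the page looks like a bare
    // wrapper around another document, otherwise null
    function frameTarget(html) {
      var $ = cheerio.load(html);
      var frames = $('frame[src], iframe[src]');
      var hasOwnRelMe = $('a[rel~="me"], link[rel~="me"]').length > 0;
      if (frames.length === 1 && !hasOwnRelMe) {
        return frames.first().attr('src');
      }
      return null;
    }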

Mitigate URL shorteners

So Twitter have started using their t.co url shortener on their profile links. At the moment Elsewhere is oblivious to this and any other shortener; it'll just treat the shortened url like it's the actual url. This is a problem.

What it basically means is that your website, instead of being example.com, is identified as t.co/24rkwdfj. Now when Elsewhere is validating links, it can't find any link to example.com, it can only see t.co/24rkwdfj and since nowhere else links to that, it won't treat it as being a valid resource.

In order to mitigate this we need to make Elsewhere aware of when it is resolving redirects (I'm not exactly sure how they're handled at the moment).

The solution proposed is that we identify sites by their actual url i.e. the url that the shortener resolves to. We will still however keep track of the urls that are used as part of any redirects to the resolved url.

The end result, aside from the fixed urls, is a slight modification to each resource returned in the response.

{
  "results": [
    {
      "url": "http://chrisnewtn.com",
      "title": "Chris Newton",
      "favicon": "http://chrisnewtn.com/favicon.ico",
      "outboundLinks": {
          "verified": [ ... ],
          "unverified": [ ]
      },
      "inboundCount": {
        "verified": 4,
        "unverified": 0
      },
      "verified": true,
      // new bit
      "urlAliases": [
        "http://t.co/vV5BWNxil2"
      ]
    }
  ],
  "query": "http://chrisnewtn.com",
  "created": "2012-10-12T16:30:57.270Z",
  "crawled": 9,
  "verified": 9
}

This urlAliases property contains all the other urls that Elsewhere has encountered identifying the resource, just in case it's useful.
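
With the request module the final URL after any redirects is available on the response, so the shortened url can be recorded as an alias rather than used as the identity. A hedged sketch (the addAlias bookkeeping is hypothetical):

    var request = require('request');

    function fetchResolvingRedirects(requestedUrl, callback) {
      request({ uri: requestedUrl, followRedirect: true }, function (err, response, body) {
        if (err) return callback(err);
        // the URL the redirect chain finally resolved to
        var resolvedUrl = response.request.uri.href;
        if (resolvedUrl !== requestedUrl) {
          // identify the resource by resolvedUrl; keep requestedUrl as an alias
          addAlias(resolvedUrl, requestedUrl);
        }
        callback(null, resolvedUrl, body);
      });
    }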

Crawl opengraph data?

I've been noticing these tags in the source of some of the pages:

<meta property="og:site_name" content="Flickr" />

Would it be worth crawling this data?
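
If it is worth doing, the tags are easy to pick up with cheerio; a minimal sketch:

    var cheerio = require('cheerio');

    // collect og:* meta properties, e.g. { site_name: 'Flickr', title: '...' }
    function openGraphData(html) {
      var $ = cheerio.load(html);
      var data = {};
      $('meta[property^="og:"]').each(function () {
        var key = $(this).attr('property').slice(3); // strip the 'og:' prefix
        data[key] = $(this).attr('content');
      });
      return data;
    }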

Allow url encoded request url as query parameter

e.g.

nodesocialgraph.com/example.com
nodesocialgraph.com/?q=example.com
nodesocialgraph.com/?q=http%xx%xx%xxexample.com
nodesocialgraph.com/?q=https%xx%xx%xxexample.com
nodesocialgraph.com/?q=https%xx%xx%xxexample.com&strict
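
Handling both the path form and the ?q= form (encoded or not) could look roughly like this on the server side; the parameter names follow the examples above, the rest is a sketch:

    var url = require('url');

    function queryFromRequest(req) {
      var parsed = url.parse(req.url, true);
      // ?q=... takes precedence; otherwise fall back to the path form /example.com
      var raw = parsed.query.q || parsed.pathname.slice(1);
      // decode %3A%2F%2F etc. so an encoded url behaves the same as a plain one
      var target = decodeURIComponent(raw);
      var strict = 'strict' in parsed.query;
      return { url: target, strict: strict };
    }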

Support a limit on the crawl depth

E.g. by setting a crawl depth of 10, the spider will only follow a maximum chain of 10 URLs before it gives up on finding a valid reciprocal rel=me URL and abandons the chain.

The crawl counter increments whenever a link is followed. When the counter reaches the crawl limit, that branch of the network is abandoned. Each time a valid reciprocal link in the requested network is found, the crawl counter is reset to 0. When all branches have been explored to the extent of the crawl limit, the response data is cached and served.

The crawl limit prevents an unreciprocated link triggering a huge search chain, while still permitting a reasonable amount of crawling of the network, in order to capture its spread.

E.g. example.com may link to foo.com, and foo.com to bar.com, and bar.com back to example.com - here, a crawl limit of 3 will be sufficient to discover the network.

The parameter should be adjustable via a config.json file on the server.
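
A sketch of that counter logic (fetchLinks and isVerifiedReciprocal stand in for the existing scraper and validator; the crawl limit itself would come from config.json):

    // follow a chain of links, abandoning the branch once `depth` links
    // have been followed without finding a verified reciprocal rel=me link
    function followChain(fromUrl, crawlLimit, depth) {
      if (depth >= crawlLimit) return; // give up on this branch
      fetchLinks(fromUrl, function (err, links) {
        if (err) return;
        links.forEach(function (link) {
          if (isVerifiedReciprocal(link)) {
            followChain(link, crawlLimit, 0); // reset the counter
          } else {
            followChain(link, crawlLimit, depth + 1);
          }
        });
      });
    }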

Grapher sharing resources between requests

The grapher needs heavily modifying in order to avoid it sharing resources between HTTP requests.

For example, the confirmedLinks array is built up over every request. It should instead be fresh for each request. Fix it!
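
The usual fix is to scope that state to the request rather than the module; a hedged sketch:

    // instead of a module-level confirmedLinks shared by every caller,
    // create the state when a graph request starts and pass it along
    function createCrawlState() {
      return {
        confirmedLinks: [],
        pendingUrls: []
      };
    }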

Differences between graph method and JSON response

When looking at #29, #32 and similar previous issues, it occurs to me that there is now a great disparity between the graph you get after calling the graph method on the server (the raw, unmodified graph object) and the JSON you get from querying it (the crafted object literal).

Really I'd like to return the object literal that forms the JSON on the server too, as this would eliminate the differences between the two responses; plus, I don't think people need the full graph object in the response anyway.

Just to clarify: all the current manipulation of the response (getting rid of the deeper links, only showing verified pages, etc.) is presently done in the toJSON method.

add rel=author

Google have started using rel=author on pages authored by an individual. The links are set up to point back to a profile page.

It may be interesting to add rel=author parsing for the first url given to the API. It could be used as a starting point for a graph.

I would not recommend parsing beyond the first URL provided to the API.
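
Limited to that first URL, the parsing itself would be much like the existing rel=me handling; a cheerio sketch:

    var cheerio = require('cheerio');

    // rel=author links on the first page only, to seed the graph
    function authorLinks(html) {
      var $ = cheerio.load(html);
      return $('a[rel~="author"], link[rel~="author"]').map(function () {
        return $(this).attr('href');
      }).get();
    }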

After verification, compare the path depth and discard longer paths from the response

(A more complete description of #4).

Calculate path depth:

  • Initially, ignore 'www' and the 'http(s)' protocol.
  • Treat the root domain, e.g. example.com (or www.example.com) as a path depth of zero.
  • For each 'file' or sub-directory, e.g. example.com/foo or example.com/bar/, increment the path depth by one. (For the moment, we should ignore trailing slashes, as some links to example.com/foo are meant to be to example.com/foo/ and vice versa)
  • For each sub-domain (except 'www'), increment the path depth by one.
  • Ignore query parameters.

This results in a number that represents the path depth of the URL. The response should only include those URLs that have the shortest path depth for a given domain. E.g. if the following URLs are included in the list:

then the first two should be included in the response (they both have a path depth of 2), but the third and fourth should not (they have a path depth of 3).

Finally, once the crawling has completed, and the links are verified, then remove any URLs that represent the same resource as another URL in the list. This should be done by favouring HTTPS over HTTP and favouring shorter URLs over longer ones:

Note that these comparisons are only made after the crawling and verification routines. They are not made before crawling and so do not impact on the decision whether to crawl a URL or not (that depends only on the crawl depth - see #24)
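
A sketch of the path depth calculation following the rules above (the sub-domain counting assumes a simple two-level registered domain such as example.com):

    var url = require('url');

    function pathDepth(href) {
      // tolerate urls given without a protocol
      var parsed = url.parse(/^https?:\/\//.test(href) ? href : 'http://' + href);

      // 'files' and sub-directories, ignoring trailing slashes and the query string
      var depth = parsed.pathname.split('/').filter(function (segment) {
        return segment.length > 0;
      }).length;

      // each sub-domain (except 'www') adds one
      var hostParts = parsed.hostname.replace(/^www\./, '').split('.');
      depth += Math.max(hostParts.length - 2, 0);

      return depth;
    }

So pathDepth('example.com') is 0, pathDepth('http://example.com/foo/') is 1, and pathDepth('http://foo.example.com/bar/baz') is 3.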

1st request can be unpredictable

Alterations to the validation code, together with the non-crawling of links on non-validated pages, have made the first graph of a url highly unpredictable, e.g. returning only a few results, or even unvalidated links in strict mode.

Subsequent requests made once the graph has been built always return exactly the same data i.e. the correct data.

Response object undefined

Just got this error trying to graph http://premasagar.com/:

/home/cnewton/Development/l4rp/elsewhere/lib/scraper.js:58
            logger.warn('http error: '  + response.statusCode 
                                                  ^
TypeError: Cannot read property 'statusCode' of undefined
    at Request._callback (/home/cnewton/Development/l4rp/elsewhere/lib/scraper.js:58:51)
    at Request.init.self.callback (/home/cnewton/Development/l4rp/elsewhere/node_modules/request/main.js:120:22)
    at Request.EventEmitter.emit (events.js:88:17)
    at ClientRequest.Request.init.self.clientErrorHandler (/home/cnewton/Development/l4rp/elsewhere/node_modules/request/main.js:222:10)
    at ClientRequest.EventEmitter.emit (events.js:88:17)
    at CleartextStream.socketCloseListener (http.js:1314:9)
    at CleartextStream.EventEmitter.emit (events.js:115:20)
    at SecurePair.destroy (tls.js:907:22)
    at process.startup.processNextTick.process._tickCallback (node.js:244:9)

Seems that in certain situations the response object comes back undefined. I'm going to try to replicate/fix it after food, but otherwise I'm loving the changes so far.
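
A guard in the request callback would at least stop the crash; a minimal sketch of what the handler around scraper.js line 58 could check (the callback naming is an assumption and the error handling is simplified):

    request(requestObj, function (requestErrors, response, body) {
      if (requestErrors || !response) {
        // TLS/socket errors can fire the callback with no response object at all
        logger.warn('http error: ' + (requestErrors ? requestErrors.message : 'no response'));
        return callback(requestErrors || new Error('no response'));
      }
      if (response.statusCode !== 200) {
        logger.warn('http error: ' + response.statusCode);
        return callback(new Error('http status ' + response.statusCode));
      }
      // ...carry on parsing body as before
    });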

Versioning schema

Implementing #48 made me think. When it comes to updating the version number, as #48 is a change that will break anything dependent on Elsewhere, should it go from 0.0.3 to 0.1.0 or to 0.0.4?

For that matter, are we using semantic versioning? If so, I need to read up on it!

It's probably too early in development to be worrying about this stuff, but I thought I'd broach the subject and get the ball rolling on it before it becomes a problem.

Error on accessing via ?url= query parameter

In the develop-cheerio branch.

Request:
GET http://localhost:8888/?url=chrisnewtn.com

Console:

info: -----------------------------------------------
info: Elsewhere started - with url: chrisnewtn.com
info: parsing: chrisnewtn.com
warn: TypeError: Cannot read property 'statusCode' of undefined
info: total pages 1
info: total fetched 0
info: total errors 0
info: total pages outside limits 0
info: total html request time: 0ms
info: total time taken: 4ms

Response:

{
"results": [],
"query": "chrisnewtn.com",
"created": "2012-10-12T23:04:46.411Z",
"crawled": 1,
"verified": true
}

Do something meaningful with non-HTML resources, e.g. Atom/RSS feeds

Options:

  • Exclude all non-HTML resources
  • Include non-HTML resources, but don't attempt to verify them
  • Include non-HTML resources, and add functionality to verify known types like Atom and RSS (e.g. use <link rel="alternate">)

Include the MIME type in the item object in the response, and/or indicate whether the resource was actually parsed.
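
Whichever option is chosen, the content type is available on the response headers, so the branching is straightforward; a sketch:

    // classify a response so non-HTML resources can be skipped,
    // left unverified, or handed to a feed-specific verifier
    function classifyResponse(response) {
      var type = (response.headers['content-type'] || '').toLowerCase();
      if (type.indexOf('text/html') !== -1) return 'html';
      if (type.indexOf('atom+xml') !== -1) return 'atom';
      if (type.indexOf('rss+xml') !== -1) return 'rss';
      return 'other';
    }

The returned value (or the raw content-type header) could then be recorded on the item object as the MIME type mentioned above.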

elsewhere and bundleDependencies

Hi,

Using elsewhere in the elsewhere-mapper-site, which is an express project, works with local bundleDependencies, but not when deployed to jit.su: http://elsewhere-mapper.jit.su/example01.html

The above could be another jsdom deployment problem.

The package.json contains one other bundleDependencies package, which is working fine.

Any ideas? Are you going to add the project to npm?

Glenn

Add domain limiter

If a rel=me link also contains rel=next or rel=prev then ignore it

Even in strict mode, links should be followed, waiting for the loop to complete

e.g. entering http://lanyrd.com/profile/premasagar/ will link to http://twitter.com/premasagar which links to http://premasagar.com which links back to http://lanyrd.com/profile/premasagar/ and http://twitter.com/premasagar - verifying all three as belonging to each other.

A chain / tree depth limit should be in place to prevent the server crawling too much.
