
sitemapper's Introduction

Sitemap-parser


Parse a sitemap's XML to get all the URLs for your crawler.

Version 2

Installation

npm install sitemapper --save

Simple Example

const Sitemapper = require('sitemapper');

const sitemap = new Sitemapper();

sitemap.fetch('https://wp.seantburke.com/sitemap.xml').then(function(sites) {
  console.log(sites);
});

Examples in ES6

import Sitemapper from 'sitemapper';

(async () => {
  const Google = new Sitemapper({
    url: 'https://www.google.com/work/sitemap.xml',
    timeout: 15000, // 15 seconds
  });

  try {
    const { sites } = await Google.fetch();
    console.log(sites);
  } catch (error) {
    console.log(error);
  }
})();

// or

const sitemapper = new Sitemapper();
sitemapper.timeout = 5000;

sitemapper.fetch('https://wp.seantburke.com/sitemap.xml')
  .then(({ url, sites }) => console.log(`url:${url}`, 'sites:', sites))
  .catch(error => console.log(error));

Options

You can pass options to the Sitemapper object when instantiating it.

  • requestHeaders: (Object) - Additional Request Headers (e.g. User-Agent)
  • timeout: (Number) - Maximum timeout in ms for a single URL. Default: 15000 (15 seconds)
  • url: (String) - Sitemap URL to crawl
  • debug: (Boolean) - Enables/Disables debug console logging. Default: False
  • concurrency: (Number) - Sets the maximum number of concurrent sitemap crawling threads. Default: 10
  • retries: (Number) - Sets the maximum number of retries to attempt in case of an error response (e.g. 404 or Timeout). Default: 0
  • rejectUnauthorized: (Boolean) - If true, it will throw on invalid certificates, such as expired or self-signed ones. Default: True
  • lastmod: (Number) - Timestamp of the minimum lastmod value allowed for returned urls
  • field: (Object) - An object of fields to be returned from the sitemap. For example: { loc: true, lastmod: true, changefreq: true, priority: true }. Leaving a field out has the same effect as setting it to false. If not specified, sitemapper defaults to returning the 'classic' array of URLs.

For example:
const sitemapper = new Sitemapper({
  url: 'https://art-works.community/sitemap.xml',
  rejectUnauthorized: true,
  timeout: 15000,
  requestHeaders: {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'
  }
});

An example using all available options:

const sitemapper = new Sitemapper({
  url: 'https://art-works.community/sitemap.xml',
  timeout: 15000,
  requestHeaders: {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'
  },
  debug: true,
  concurrency: 2,
  retries: 1,
});

Examples in ES5

var Sitemapper = require('sitemapper');

var Google = new Sitemapper({
  url: 'https://www.google.com/work/sitemap.xml',
  timeout: 15000 //15 seconds
});

Google.fetch()
  .then(function (data) {
    console.log(data);
  })
  .catch(function (error) {
    console.log(error);
  });


// or


var sitemapper = new Sitemapper();

sitemapper.timeout = 5000;
sitemapper.fetch('https://wp.seantburke.com/sitemap.xml')
  .then(function (data) {
    console.log(data);
  })
  .catch(function (error) {
    console.log(error);
  });

Version 1

npm install sitemapper@1 --save

Simple Example

var Sitemapper = require('sitemapper');

var sitemapper = new Sitemapper();

sitemapper.getSites('https://wp.seantburke.com/sitemap.xml', function(err, sites) {
    if (!err) {
     console.log(sites);
    }
});

sitemapper's People

Contributors

achimkoellner, actions-user, alexm92, bsq-panagiotis, dependabot[bot], esamattis, james-mckinnon, jasonaibrahim, mmeinzer, oudmane, phanect, saschanowak, seantomburke, terry-au, tijevlam, y16ra, zijua


sitemapper's Issues

Url of sitemap file is not updated if redirection

In the case of a redirection on the sitemap URL, the returned url is not updated with the new location.

For example, http://cocon.se/sitemap.xml is a 301 redirect to http://cocon.se/sitemap_index.xml:

const Sitemapper = require('sitemapper');

const sitemap = new Sitemapper({
  url: 'http://cocon.se/sitemap.xml',
  timeout: 5000
});

sitemap.fetch()
  .then(({ url, sites }) => console.log(`url:${url}`, 'sites:', sites))
  .catch((error) => console.warn(error));

Display

url:http://cocon.se/sitemap.xml sites:
Array(124) ["http://cocon.se", "http://cocon.se/actus/vlc-2015.html", ....

url should be equal to http://cocon.se/sitemap_index.xml

Version bump to 2.1.14?

Thanks for merging PR #26. Looks like I forgot to bump the version. Can a new release get published to the npm repo?

Support for Basic Authentication with multiple sitemap files

I need to parse a sitemap on a site protected by a basic authentication. If there is just one sitemap file and I pass basic auth parameters in the url like http://{user}:{pass}@example.com, it works great. But if sitemap.xml is just an index of sitemap files, then those files will not be parsed and an empty result set is returned.

It would be great to have a basic auth parameter in Sitemapper class that would be applied to all sitemap files.
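In the meantime, one possible workaround (assuming the documented requestHeaders option is forwarded to requests for nested sitemap files, which is exactly what this issue puts in question) is to send credentials as an Authorization header rather than embedding them in the URL. A minimal sketch of building that header, with placeholder credentials:

```javascript
// Sketch: build a Basic auth header value that could be passed via the
// documented `requestHeaders` option. `user` and `pass` are placeholders.
function basicAuthHeader(user, pass) {
  // Basic auth is "Basic " + base64("user:pass") per RFC 7617.
  return 'Basic ' + Buffer.from(`${user}:${pass}`).toString('base64');
}

// Hypothetical usage:
// const sitemap = new Sitemapper({
//   url: 'https://example.com/sitemap.xml',
//   requestHeaders: { Authorization: basicAuthHeader('user', 'pass') },
// });
```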

requester.cancel is not a function

On timeouts, site mapper fails to cancel.

See:

/node_modules/sitemapper/lib/assets/sitemapper.js:142

        requester.cancel();
                  ^^
TypeError: requester.cancel is not a function
    at Timeout._onTimeout (/node_modules/sitemapper/lib/assets/sitemapper.js:142:19)

    at listOnTimeout (internal/timers.js:549:17)

    at processTimers (internal/timers.js:492:7)

Throttling when parsing multiple sitemaps

Is there an option available to add an artificial delay (throttling) between requests to avoid getting blocked by firewalls?
I couldn't find any mention of this feature in the documentation.
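There appears to be no built-in delay option; setting concurrency: 1 should serialize requests but does not pause between them. A workaround is to fetch sub-sitemaps yourself with a sleep between requests. A minimal sketch of that approach (fetchOne is a placeholder for whatever fetches a single sitemap):

```javascript
// Sketch: a promise-based sleep, awaited between requests as a simple
// client-side throttle.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch a list of sitemap URLs one at a time, pausing between requests.
async function fetchThrottled(urls, fetchOne, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchOne(url));
    await sleep(delayMs); // artificial delay to stay under rate limits
  }
  return results;
}
```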

Add an option search in nested SiteMaps

Handling an empty sitemap file

What if the sitemap URL and file are valid, but it contains no URLs, like below?

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
</urlset>

It freezes in special cases.

It does not work for hubspot.com because node_modules/sitemapper/lib/sitemapper.js, line 243
checks just
if (data && data.urlset)
instead of
if (data && data.urlset && data.urlset.url)
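A defensive guard along the lines the reporter suggests (a sketch, not the library's actual code) would treat a urlset with no url entries as an empty result instead of hanging:

```javascript
// Sketch: extract site URLs from a parsed sitemap object, guarding
// against an empty <urlset> (no <url> children) as suggested above.
// Assumes an xml2js-style shape where values are wrapped in arrays.
function sitesFromParsed(data) {
  if (data && data.urlset && data.urlset.url) {
    return data.urlset.url.map((site) => site.loc && site.loc[0]);
  }
  // Empty or unrecognized sitemap: return no URLs rather than freezing.
  return [];
}
```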

Sitemapper version 1.0.3 runs twice

var sitemap = require('sitemapper');

sitemap.getSites('http://wp.seantburke.com/sitemap.xml', function(err, sites) {
    if(!err) {
        console.log('done');
    } else {
        console.log('error');
    }
});

When using sitemapper version 1.0.3, this snippet of code prints "done" twice, and only prints "done" once when using version 1.0.1.

async generator support

I'd like to get urls one by one for lower RAM consumption.

Something like this using async generator:

  for await (const url of sitemapper.fetch()) {
    console.log(url);
  }

It'll work well in an environment with limited memory (e.g. AWS Lambda)
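The requested API might look like the following sketch of the proposed interface (not the library's current one; fetchUrls is a placeholder for whatever fetches one sitemap's entries, so the example stays self-contained):

```javascript
// Sketch of a streaming interface: an async generator that yields URLs
// one at a time instead of buffering the whole array in memory.
async function* iterateSites(sitemapUrls, fetchUrls) {
  for (const sitemapUrl of sitemapUrls) {
    const urls = await fetchUrls(sitemapUrl);
    for (const url of urls) {
      yield url; // the caller holds at most one URL at a time
    }
  }
}
```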

Sub-sitemaps as object

In certain multilingual websites, we utilize a sitemap containing an index of sub-sitemaps. For example:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://mysite.com/en/sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://mysite.com/pt-br/sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://mysite.com/es/sitemap.xml</loc>
    </sitemap>
</sitemapindex>

Is there a method to access these sub-sitemaps individually rather than retrieving all links collectively?
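The library does not appear to expose such a method, but the loc entries of a sitemap index are easy to extract yourself once you have the XML as a string. A regex-based sketch (adequate for well-formed indexes; a real XML parser is the safer choice in general):

```javascript
// Sketch: pull the sub-sitemap locations out of a <sitemapindex> document.
function subSitemapLocs(xml) {
  const locs = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = re.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}
```

Each sub-sitemap URL can then be fetched individually with its own Sitemapper instance.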

Support for gzipped urls?

Hi Sean,

I am trying to use the sitemapper with this site:

https://www.imot.bg/sitemap/index.xml

As you can see, the URLs inside are gzipped.

Can we extend sitemapper to understand these and extract the contents?

I tried adding this:

var body = response.body;
if (response.headers['content-encoding'] && response.headers['content-encoding'].toLowerCase().indexOf('gzip') > -1) {
  body = zlib.gunzipSync(body);
}
return xmlParse(body);

to the parse() method.

But I do not know how to build the package so I can get a Node.js-supported output file. Can you help me with this? I want to make it work.

Kind Regards,
Lyubomir

Support robots.txt Sitemaps (plural!) discovery

The robots.txt standard allows for declaring the location of sitemaps (plural!), e.g. for https://www.nytimes.com/robots.txt:

# ....
User-Agent: omgili
Disallow: /

User-agent: ia_archiver
Disallow: /

Sitemap: https://www.nytimes.com/sitemaps/new/news.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/collections.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/video.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/cooking.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/recipe-collects.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/regions.xml
Sitemap: https://www.nytimes.com/sitemaps/new/best-sellers.xml
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
Sitemap: https://www.nytimes.com/elections/2018/sitemap
Sitemap: https://www.nytimes.com/wirecutter/sitemapindex.xml

It would be great if sitemapper accepted robots.txt URLs and transitively returned all declared Sitemap URLs.
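Extracting the Sitemap: lines themselves is straightforward; a sketch of just the discovery step (crawling each discovered sitemap would then proceed as usual):

```javascript
// Sketch: collect every "Sitemap:" declaration from a robots.txt body.
// The field name is matched case-insensitively, per robots.txt convention.
function sitemapsFromRobots(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*sitemap:\s*(\S+)/i))
    .filter(Boolean)
    .map((match) => match[1]);
}
```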

Add proxy support

It would be great to have proxy support.

I can see there is already support via the "got" npm package; the proxy options just need to be exposed.

Check if Sitemap is Valid?

Hi,

Thanks for the great library; I think it's the only Node.js sitemap parser that is up to date. I was wondering, is there any way to check whether a sitemap has valid syntax?
I basically need a validator for sitemaps, and I am sure this parser already applies rules from the sitemap schema. So does it throw any errors on detecting invalid syntax, or is it forgiving?

I am happy to help and submit a PR if that can be implemented using the code already in place.

Kind Regards
