Comments (12)

pypt commented on August 23, 2024

0.3 released.

from ultimate-sitemap-parser.

pypt commented on August 23, 2024

Hey @kienli, thank you for the kind words!

Interesting idea, although I suspect that most websites do post a link to their sitemap in robots.txt. In addition to /sitemap.xml[.gz], some CMSes possibly provide their sitemaps at very predictable paths (think www.joomla-website.com/index.php?module=sitemap or something like that), so we could take those into account too.

However, I think it would make sense to test a few websites before implementing this feature, e.g. a test script could fetch /robots.txt and /sitemap.xml[.gz] for every URL in the sample list and see how many "shadow" sitemaps it is able to find.
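
A minimal sketch of the URL-building step of such a test script, assuming the sample is a plain list of homepage URLs. `PROBE_PATHS` and `probe_urls` are illustrative names, not part of ultimate-sitemap-parser, and the actual fetching is left out so the snippet runs without network access.

```python
from urllib.parse import urljoin

# Paths named in the comment above; "[.gz]" is expanded into two candidates.
PROBE_PATHS = ("/robots.txt", "/sitemap.xml", "/sitemap.xml.gz")


def probe_urls(homepage_url: str) -> list[str]:
    """Return the absolute URLs a test script would fetch for one website."""
    # A leading "/" makes urljoin resolve each path against the site root,
    # regardless of any path in the homepage URL itself.
    return [urljoin(homepage_url, path) for path in PROBE_PATHS]


if __name__ == "__main__":
    for url in probe_urls("https://example.com/news/"):
        print(url)
```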

Would you be able to run such a test? I can provide you with a sample of 1000 / 10,000 / 100,000 news website URLs for sample data (we at the Media Cloud project work with those).

kienli commented on August 23, 2024

@pypt good idea! I wanted to do the same, to scan a big sample of websites to understand how heavily we can/should rely on robots.txt for big/medium/small websites in our project and to see the percentage of websites without robots.txt/sitemap.xml.

The idea came from an internal test: ultimate_sitemap_parser failed to scan two of our websites (for different reasons, though). One of the reasons was a missing robots.txt, even though the website had a big sitemap index at a predictable path.

I can run such a test.

pypt commented on August 23, 2024

Cool, thanks! So, do you need a list of website URLs for your tests or do you have your own?

kienli commented on August 23, 2024

I would like to start with your list of 100,000 websites, if possible. How can I get it? My email address is in my profile, or you can post a link here.

kienli commented on August 23, 2024

I did a small test on 7335 websites from this dataset. During the first round, I looked for robots.txt on each website and for a sitemap entry in it.

If a sitemap was present in robots.txt, I saved its path.
If a sitemap was not present, during the second round I iterated over the saved paths for the given website, hoping to guess the sitemap location, and saved the result.
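
The two rounds could be sketched roughly as follows. This is not the script that produced the numbers in this comment; the function names are hypothetical, and the sketch only covers parsing and ranking, with downloading left out.

```python
from collections import Counter
from urllib.parse import urlsplit


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Round one: collect the values of all "Sitemap:" lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the "https:" in the value survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def rank_sitemap_paths(sitemap_urls: list[str]) -> list[str]:
    """Round two: order the saved URL paths by how often they occurred."""
    counts = Counter(urlsplit(url).path for url in sitemap_urls)
    return [path for path, _ in counts.most_common()]
```

The ranked paths can then be tried, most frequent first, on websites where round one found nothing.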

Here is what I found so far.

Of the 7335 websites, 6014 responded with 200 or 403 on the homepage.

19.7% (1182 websites out of 6014) don't have robots.txt.
80.3% (4832 websites) have it.

Of those with robots.txt:
63% (3043 out of 4832) don't have a sitemap in it.
37% (1789 out of 4832) have a sitemap in it.

In about 59% of these cases (1061 out of 1789), the sitemap is located at /sitemap.xml.

For the second round, I took the 1182 websites that don't have robots.txt and tried to guess the sitemap location based on the sitemap locations collected in the first round.

The script guessed the sitemap for 285 of the 1182 websites (24%).

/sitemap.xml was used by 43% of these websites (123 out of 285).

Other variations:
/sitemap_index.xml - 24 times
/.sitemap.xml - 9 times
/sitemap - 8 times
/admin/config/search/xmlsitemap - 8 times
/sitemap/sitemap-index.xml - 4 times
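
The percentages in this comment can be recomputed from the raw counts with a few lines of Python. The `share` helper and the labels are only illustrative; the counts are the ones reported above.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# (part, whole) pairs taken from the counts reported in this comment.
REPORTED_COUNTS = {
    "no robots.txt": (1182, 6014),
    "has robots.txt": (4832, 6014),
    "robots.txt without sitemap": (3043, 4832),
    "robots.txt with sitemap": (1789, 4832),
    "sitemap listed at /sitemap.xml": (1061, 1789),
    "sitemap guessed without robots.txt": (285, 1182),
    "guessed at /sitemap.xml": (123, 285),
}

if __name__ == "__main__":
    for label, (part, whole) in REPORTED_COUNTS.items():
        print(f"{label}: {share(part, whole)}% ({part} of {whole})")
```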

kienli commented on August 23, 2024

It makes sense to request /sitemap.xml at least once if the sitemap is not listed in robots.txt or if robots.txt is missing. That can increase the chances of finding the sitemap in general.
I also found that some websites redirect from /sitemap.xml to the real sitemap location, e.g. the Yoast SEO plugin for WordPress does this.

pypt commented on August 23, 2024

Very cool, thank you Alex! Yes, I agree that it's worth it to blindly try /sitemap.xml (and other similar paths) without robots.txt being present at all.

pypt commented on August 23, 2024

@kienli, if by any chance you're a student at some sort of a university, you can consider implementing this task as a GSoC 2019 project:

https://docs.google.com/document/d/1GGbGtFOMS07dog4yzglY5hZCDc41ZQjY1RqRKOlW0B4/edit?usp=sharing

https://cyber.harvard.edu/gsoc/MediaCloud

https://summerofcode.withgoogle.com/organizations/5825827049046016/

kienli commented on August 23, 2024

@pypt thanks for the suggestion. I'm not a student, but I think all the mentioned ideas are cool, and I hope to contribute to some of them if possible.

pypt commented on August 23, 2024

Thanks again @kienli for your initial research on the issue. I've added support for trying out a few extra URL paths to find sitemaps not published in robots.txt:

https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser/blob/develop/usp/tree.py#L13-L22

Will release an updated version soon.

kienli commented on August 23, 2024

That's awesome. Thanks a lot!
