ultimate-sitemap-parser's Introduction

Website sitemap parser for Python 3.5+.

Installation

pip install ultimate-sitemap-parser

Usage

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on the website; see the reference of AbstractSitemap subclasses.

If you'd like to just list all the pages found in all of the sitemaps within the website, consider using the all_pages() method:

# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

The all_pages() method returns an iterator yielding SitemapPage objects; see the reference of SitemapPage.
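
For example, to collect just the page URLs into a list (a minimal sketch; SitemapPage objects expose a url attribute):

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')

# Each SitemapPage carries the page URL (among other fields).
urls = [page.url for page in tree.all_pages()]
print(len(urls), 'pages found')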

ultimate-sitemap-parser's People

Contributors

pypt, tgrandje

ultimate-sitemap-parser's Issues

Library interferes with application logging configuration

Because of this:

log = create_logger(__name__)

and this:

if not self.__l.handlers:
    formatter = logging.Formatter(
        fmt='%(asctime)s %(levelname)s %(name)s [%(process)d/%(threadName)s]: %(message)s'
    )
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    self.__l.addHandler(handler)

this library almost always ends up configuring the calling application's logging in undesirable ways. It's usually not appropriate to unconditionally print logging to the screen. Can you please remove the code (shown above) that adds handlers when no handlers already exist?

Also, this module unnecessarily overrides all of the built-in logging functionality. It could just use the logging module directly (e.g. _LOGGING = logging.getLogger(__name__)), but that's just a side comment.

I'm currently having to do this to squash your output:

import logging

import usp.fetch_parse
import usp.helpers

logging.getLogger('usp.fetch_parse').handlers = []
logging.getLogger('usp.helpers').handlers = []

Unable to gunzip response

I used Python code like that from the example:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)

And it raised error:

2023-03-11 12:45:00,197 ERROR usp.helpers [26140/MainThread]: Unable to gunzip response <usp.web_client.requests_client.RequestsWebClientSuccessResponse object at 0x00000244E6BAF280>, maybe it's a non-gzipped sitemap: Unable to gunzip data: Not a gzipped file (b' <')
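
A tolerant decompression step on the fetching side would avoid this class of error. A minimal sketch (the helper name is hypothetical): only gunzip when the payload actually starts with the gzip magic bytes, and otherwise hand the data through unchanged:

import gzip

GZIP_MAGIC = b'\x1f\x8b'

def maybe_gunzip(data: bytes) -> bytes:
    # Servers sometimes serve plain XML under a .gz URL or with a gzip
    # Content-Type; sniff the magic number instead of trusting metadata.
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data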

Prevent XML parser from parsing gzipped XMLs that it's unable to decompress

2018-11-26 12:59:27,847 INFO mediawords.util.sitemap.fetchers [194712/MainThread]: Fetching level 1 sitemap from https://www.iberlibro.com/sitemap.bdp31.xml.gz...
2018-11-26 12:59:27,848 INFO mediawords.util.sitemap.helpers [194712/MainThread]: Fetching URL https://www.iberlibro.com/sitemap.bdp31.xml.gz...
2018-11-26 12:59:28,433 ERROR mediawords.util.sitemap.helpers [194712/MainThread]: Unable to gunzip response <mediawords.util.web.user_agent.response.response.Response object at 0x7f3485abfcc8>: Unable to gunzip data: Not a gzipped file (b'<?')
2018-11-26 12:59:28,437 INFO mediawords.util.sitemap.fetchers [194712/MainThread]: Parsing sitemap from URL https://www.iberlibro.com/sitemap.bdp31.xml.gz...
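
A sketch of the guard the title proposes (helper name hypothetical): when decompression fails, either fall back to treating the payload as text or mark the sitemap invalid, but never hand undecodable bytes to the XML parser:

import gzip

def payload_for_parsing(raw: bytes):
    try:
        return gzip.decompress(raw)
    except OSError:
        pass  # not actually gzipped despite the .gz name
    try:
        raw.decode('utf-8-sig')
        return raw  # mislabelled plain XML/text; parse it as-is
    except UnicodeDecodeError:
        return None  # neither gzip nor text: caller should mark the sitemap invalid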

Error in request causes total crash

Because of a timeout error on a single sitemap (one that does not exist), the entire script crashes, so no other sitemaps are tried and an error is raised:

ERROR [2019-07-26 07:36:22,600 sitemap_scanner: 24] HTTPSConnectionPool(host='dutchitchannel.nl', port=443): Read timed out. (read timeout=60)
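
Until the library handles this internally, a per-site guard on the caller's side keeps one dead host from aborting a whole batch. A sketch (the exception caught is deliberately broad):

from usp.tree import sitemap_tree_for_homepage

sites = ['https://example.com/', 'https://dutchitchannel.nl/']
trees = {}
for site in sites:
    try:
        trees[site] = sitemap_tree_for_homepage(site)
    except Exception as exc:  # broad on purpose: one bad host must not stop the rest
        print('Skipping %s: %s' % (site, exc))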

BOM removal doesn't seem to work properly

2018-11-23 21:38:36,966 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Fetching level 0 sitemap from http://test.de/robots.txt...
2018-11-23 21:38:36,966 INFO mediawords.util.sitemap.helpers [194713/MainThread]: Fetching URL http://test.de/robots.txt...
2018-11-23 21:38:37,684 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Parsing sitemap from URL http://test.de/robots.txt...
2018-11-23 21:38:37,688 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Fetching level 0 sitemap from https://www.test.de/sitemap.xml...
2018-11-23 21:38:37,689 INFO mediawords.util.sitemap.helpers [194713/MainThread]: Fetching URL https://www.test.de/sitemap.xml...
2018-11-23 21:38:38,050 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Parsing sitemap from URL https://www.test.de/sitemap.xml...
2018-11-23 21:38:38,060 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Fetching level 1 sitemap from https://www.test.de/sitemap.ashx?ressort=altersvorsorge-rente...
2018-11-23 21:38:38,060 INFO mediawords.util.sitemap.helpers [194713/MainThread]: Fetching URL https://www.test.de/sitemap.ashx?ressort=altersvorsorge-rente...
2018-11-23 21:38:45,793 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Parsing sitemap from URL https://www.test.de/sitemap.ashx?ressort=altersvorsorge-rente...
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <U+FEFF><?xml version="1.0" encoding="utf-8"?> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <url> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <loc>https://www.test.de/Gefoerderte-Altersvorsorge-Keine-staatlichen-Zulagen-verschenken-5398366-0/</loc> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <lastmod>2018-11-13T01:00:00+00:00</lastmod> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <changefreq>daily</changefreq> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <priority>0.5</priority> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL </url> doesn't look like an URL, skipping
<...>
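
Decoding the fetched body with Python's BOM-aware codec would sidestep this; a minimal sketch:

def decode_sitemap_body(raw: bytes) -> str:
    # 'utf-8-sig' strips a leading U+FEFF byte-order mark if present, so the
    # parser sees '<?xml ...' rather than '\ufeff<?xml ...' on the first line.
    return raw.decode('utf-8-sig', errors='replace')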

This site is not working => "set()" as result

Probably this site has a strange format, or did I call something wrong?

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://hls-dhs-dss.ch')
print(tree.all_pages())

The result:

2019-07-11 18:45:00,533 WARNING usp.tree [2344/MainThread]: Assuming that the homepage of https://hls-dhs-dss.ch is https://hls-dhs-dss.ch/
2019-07-11 18:45:00,534 INFO usp.fetchers [2344/MainThread]: Fetching level 0 sitemap from https://hls-dhs-dss.ch/robots.txt...
2019-07-11 18:45:00,534 INFO usp.helpers [2344/MainThread]: Fetching URL https://hls-dhs-dss.ch/robots.txt...
2019-07-11 18:45:00,821 INFO usp.fetchers [2344/MainThread]: Parsing sitemap from URL https://hls-dhs-dss.ch/robots.txt...
set()

Reading the robots.txt manually, I know there are two layers of sitemap.xml files.

Invalid time stamp cannot be handled

from usp.tree import sitemap_tree_for_homepage
sitemap_tree_for_homepage('https://www.lobinc.ca/')

Maybe the parser should fail more gracefully if a date can't be converted. The sitemap seems fine otherwise: https://www.lobinc.ca/sitemap.xml

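A lenient lastmod parser is one way to fail gracefully here. A sketch using python-dateutil (a common choice for flexible date parsing, not necessarily what the library does internally):

from dateutil import parser as date_parser

def parse_lastmod(value):
    # Return None for a malformed date instead of raising and killing the run.
    try:
        return date_parser.parse(value)
    except (TypeError, ValueError, OverflowError):
        return None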

SSL Certificate error fix?

I was testing this package for a web crawler I was building, but at times it gives the error below. Is there an argument I have to pass, or is this a bug?

_IndexWebsiteSitemap(url=https://www.crummy.com/, sub_sitemaps=[InvalidSitemap(url=https://www.crummy.com/robots.txt, reason=Unable to fetch sitemap from https://www.crummy.com/robots.txt: HTTPSConnectionPool(host='www.crummy.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (ssl.c:1131)'))))])

What I am trying is:

from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage("https://www.crummy.com")
print(tree)
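
Until the client exposes a certificate-verification switch, one heavy-handed workaround is to monkeypatch requests process-wide. This is only a sketch, and it disables TLS verification for everything in the process, so use it solely against hosts you trust:

import requests
import urllib3

# Silence the warning that requests/urllib3 emit for unverified HTTPS.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

_original_request = requests.Session.request

def _unverified_request(self, *args, **kwargs):
    kwargs['verify'] = False  # skip TLS certificate verification
    return _original_request(self, *args, **kwargs)

requests.Session.request = _unverified_request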

If `Content-Type` header is set, verify it's the expected one

For example, if the Content-Type for /robots.txt is text/html (and not text/plain), this usually means that the file is missing (and a 404 page was returned instead), so there's no need to attempt to parse it.

The same goes for XML files, plain text sitemaps, and gzipped XML / text sitemaps.
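
A sketch of the proposed check against a requests-style response object (the per-format expected types are illustrative, and an absent header should prove nothing):

EXPECTED_TYPES = {
    'robots': {'text/plain'},
    'xml': {'text/xml', 'application/xml', 'application/rss+xml', 'application/atom+xml'},
    'gzip': {'application/gzip', 'application/x-gzip', 'application/octet-stream'},
}

def content_type_matches(response, kind: str) -> bool:
    header = response.headers.get('Content-Type')
    if not header:
        return True  # nothing declared; don't reject on missing information
    mime = header.split(';')[0].strip().lower()  # drop any "; charset=..." suffix
    return mime in EXPECTED_TYPES[kind]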

Don't refetch sitemaps that were already fetched

2019-07-19 14:26:46,279 INFO mediawords.util.sitemap.media [95859/MainThread]: Fetching sitemap pages for media ID 10 (https://globalvoices.org/)...
2019-07-19 14:26:46,282 INFO usp.fetch_parse [95859/MainThread]: Fetching level 0 sitemap from https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,282 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,800 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,801 INFO usp.fetch_parse [95859/MainThread]: Fetching level 0 sitemap from https://globalvoices.org/sitemap.xml...
2019-07-19 14:26:46,802 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap.xml...
2019-07-19 14:26:46,963 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap.xml...
2019-07-19 14:26:46,979 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:26:46,979 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:26:47,407 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:26:47,413 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:26:47,414 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:26:48,998 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
<...>
2019-07-19 14:54:06,695 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:54:06,696 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:54:06,869 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:54:06,871 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:54:06,872 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:54:07,033 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
<...>
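
The fix amounts to remembering which sitemap URLs were already fetched during the current run. A minimal sketch (names hypothetical):

class OnceFetcher:
    """Wraps a fetch callable and skips URLs already seen in this run."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._seen = set()

    def __call__(self, url):
        if url in self._seen:
            return None  # already fetched; the caller can reuse the earlier result
        self._seen.add(url)
        return self._fetch(url)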

Reduce recursivity level for sitemap fetcher

10 levels deep is probably too much:

2018-11-26 13:11:19,139 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,428 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Parsing sitemap from URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Fetching level 8 sitemap from https://www.juiceplus.com/il/en/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/il/en/franchise/sitemap.xml...
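
A depth cap in the fetcher would bound this. A sketch (the cap value and the names are illustrative):

MAX_SITEMAP_DEPTH = 5  # illustrative; ten levels of nesting is rarely legitimate

def crawl(url, fetch_children, depth=0):
    # Depth-capped walk over nested sitemap indexes: stop descending once the
    # cap is reached instead of following pathological chains indefinitely.
    if depth >= MAX_SITEMAP_DEPTH:
        return []
    urls = [url]
    for child in fetch_children(url):
        urls.extend(crawl(child, fetch_children, depth + 1))
    return urls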

RecursionError - maximum recursion depth exceeded while calling a Python object

I am having issues with parsing some of the URLs.
The code raises a RecursionError (traceback attached as an image in the original issue).
In the code sample I added two URLs causing this issue.

from usp.tree import sitemap_tree_for_homepage
url1='https://infirmaryhealth.org/'
t = sitemap_tree_for_homepage(url1)
url2='https://www.kdhmadison.org/'
t = sitemap_tree_for_homepage(url2)

This scenario happens when robots.txt contains itself as a Sitemap entry, i.e. a Sitemap: line pointing back to robots.txt.

Puts URLs in lowercase and, as a result, the URLs are no longer valid

Hello,
So here's a little issue.
Basically, USP lowercases all URLs, and as a result, if a URL contains uppercase characters, it can no longer be found.

Here's an example:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.distriartisan.fr')

The sitemap URLs look like this:

https://www.distriartisan.fr/media/sitemap/sitemapProduitsAll_1.xml

However, in the logs they appear like this:

2023-06-02 12:42:12,823 INFO usp.fetch_parse [7776/MainThread]: Parsing sitemap from URL https://www.distriartisan.fr/media/sitemap/sitemapproduitsall_1.xml...
2023-06-02 12:42:12,826 ERROR usp.fetch_parse [7776/MainThread]: Parsing sitemap from URL https://www.distriartisan.fr/media/sitemap/sitemapproduitsall_1.xml failed: Unsupported root element 'html'.
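
For context: per RFC 3986, only the scheme and host of a URL are case-insensitive; the path and query are not, so normalization must leave them alone. A sketch:

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Lowercase only the case-insensitive parts of the URL.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))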

ModuleNotFoundError: No module named 'http.client'; 'http' is not a package

Code:

from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage("https://www.nytimes.com/")
for page in tree.all_pages():
    print(page)

Error (note: the traceback shows the script's own ezweb\utils\http.py being imported in place of the standard library http package, which is what breaks the import):

Traceback (most recent call last):
  File "c:/Users/Raven/Desktop/CODE/GitHub/ezweb/src/ezweb/utils/sitemap.py", line 1, in <module>
    from usp.tree import sitemap_tree_for_homepage
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\tree.py", line 6, in <module>
    from .fetch_parse import SitemapFetcher
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\fetch_parse.py", line 11, in <module>
    from .helpers import (
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\helpers.py", line 15, in <module>
    from .web_client.abstract_client import (
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\web_client\abstract_client.py", line 4, in <module>
    from http import HTTPStatus
  File "c:\Users\Raven\Desktop\CODE\GitHub\ezweb\src\ezweb\utils\http.py", line 1, in <module>
    import requests
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\__init__.py", line 43, in <module>
    import urllib3
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\__init__.py", line 11, in <module>
    from . import exceptions
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\exceptions.py", line 3, in <module>
    from .packages.six.moves.http_client import IncompleteRead as httplib_IncompleteRead
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\packages\six.py", line 199, in load_module
    mod = mod._resolve()
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\packages\six.py", line 113, in _resolve
    return _import_module(self.mod)
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\packages\six.py", line 82, in _import_module
    __import__(name)
ModuleNotFoundError: No module named 'http.client'; 'http' is not a package

Can't use own logging handlers / Can't propagate logger

Hi,

I'm having some trouble changing usp's logging behavior. By default it outputs to stdout. It should be easy enough to use my own logger and handlers, with propagation to the root logger.

The only way I've figured out so far is to hack logging within a context manager, using the following code:

import logging
import os.path, pkgutil
import usp.tree

logger = logging.getLogger('')
logger.setLevel(logging.DEBUG)
logger.propagate = True

handler = logging.FileHandler('/tmp/test.log', mode='a')
formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
handler.setFormatter(formatter)
handler.setLevel(logging.DEBUG)
logger.addHandler(handler)

class CustomLogger():

  def __enter__(self):
    pkgpath = os.path.dirname(usp.tree.__file__)
    modules = [module.name for module in pkgutil.iter_modules([pkgpath])]
    
    for module in modules:
      l = logging.getLogger('usp.' + module)
      l.handlers = logger.handlers
      
    # Disable DEBUG for usp submodules
    logging.disable(logging.DEBUG)

  def __exit__(self, *args):
    logging.disable(logging.NOTSET)


with CustomLogger():
  page = "https://www.github.com/"
  tree = usp.tree.sitemap_tree_for_homepage(page)

It ain't pretty, but it works. Any hints on how to do this correctly?

How to set timeout properly

Hi team, thanks for this great lib! I have a couple of websites on my list that seem to be down; however, usp doesn't time out until about two hours later. How can I set the timeout to some shorter value for unresponsive websites? I did see the set_timeout parameter on the web client class, but I'm not exactly sure how to initialize a web client with a proper timeout set.

Here is how far I have come:

from usp.tree import sitemap_tree_for_homepage
from usp.web_client.requests_client import AbstractWebClient
from usp.web_client.requests_client import RequestsWebClient

RequestsWebClient.set_timeout(RequestsWebClient, timeout=10)
RequestsWebClient.set_max_response_data_length(RequestsWebClient, max_response_data_length=None)
web_client = RequestsWebClient

sitemap_tree_for_homepage('https://garfieldre2.org/', web_client=web_client)

which results in an error (screenshot attached in the original issue).

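For reference, the intended usage is to instantiate the client and call its methods on the instance. A sketch (that set_timeout takes seconds is an assumption, based on the parameter mentioned above):

from usp.tree import sitemap_tree_for_homepage
from usp.web_client.requests_client import RequestsWebClient

web_client = RequestsWebClient()  # instantiate; don't pass the class itself
web_client.set_timeout(10)        # give up on unresponsive hosts after 10 seconds

tree = sitemap_tree_for_homepage('https://garfieldre2.org/', web_client=web_client)
print(tree)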

Disable logging?

Is there a way to turn off logging? A single call to sitemap_tree_for_homepage spits out over 20 messages into the console, most of which are of no concern.
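
Not built in, but the module loggers can be silenced from the caller's side. A sketch (it inspects logging's registry of already-created loggers, so import the library first):

import logging

import usp.tree  # creates the 'usp.*' module loggers as a side effect

# Raise the level on every logger in the 'usp' namespace.
for name in list(logging.root.manager.loggerDict):
    if name == 'usp' or name.startswith('usp.'):
        logging.getLogger(name).setLevel(logging.CRITICAL)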

Convert SitemapNewsStory to dict

usp.objects.page.SitemapNewsStory has no __dict__ attribute because it uses __slots__. Therefore, vars() and __dict__ don't work. It would be nice to have an option to convert the attributes to a dict.

In [39]: a[-1].__dict__()                                                                                                                                                                                   
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-ab06281650cf> in <module>
----> 1 a[-1].__dict__()

AttributeError: 'SitemapPage' object has no attribute '__dict__'

In [40]: vars(a[-1])                                                                                                                                                                                        
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-ada32e179a17> in <module>
----> 1 vars(a[-1])

TypeError: vars() argument must have __dict__ attribute

Thanks for this lib. Works well.
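
In the meantime, a generic converter can walk __slots__ up the class hierarchy, handling the name mangling that private slot names undergo. A sketch (it assumes __slots__ holds a sequence of names, the common case):

def slots_to_dict(obj):
    # Slot-based instances have no per-instance __dict__, so collect the
    # value of every slot defined anywhere in the MRO instead.
    out = {}
    for cls in type(obj).__mro__:
        for slot in getattr(cls, '__slots__', ()):
            name = slot
            if slot.startswith('__') and not slot.endswith('__'):
                # Private slots like '__url' are stored under '_ClassName__url'.
                name = '_%s%s' % (cls.__name__.lstrip('_'), slot)
            if hasattr(obj, name):
                out[slot.lstrip('_')] = getattr(obj, name)
    return out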

Some sitemaps don't get fetched fully

2019-07-19 14:48:41,974 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.200705.xml...
2019-07-19 14:48:59,852 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.200705.xml...
2019-07-19 14:48:59,932 ERROR usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.200705.xml failed: no element found: line 4651, column 26
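
One plausible cause (not confirmed in the issue) is the web client's response-size cap truncating large sitemap files mid-document, which would produce exactly this "no element found" failure partway in. The client's set_max_response_data_length (mentioned in the timeout issue above) should lift the cap; a sketch:

from usp.tree import sitemap_tree_for_homepage
from usp.web_client.requests_client import RequestsWebClient

client = RequestsWebClient()
client.set_max_response_data_length(None)  # None = no cap (assumption)
tree = sitemap_tree_for_homepage('https://globalvoices.org/', web_client=client)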

End of Support

Since this project has been left by the wayside for a good chunk of time, the Mediacloud team has decided to formally end our support for ultimate-sitemap-parser.
As there are several active forks of this project, we wanted to give some lead time to figure out whether any community members might be interested in assuming ownership of this repository and the associated package on PyPI. This thread can be a place to figure out those details.
Otherwise, at the beginning of September 2024, this repository will be archived.

Authentication Method for Secured Sites?

Hi,
First of all, thank you for building such a fantastic library for discovering sitemaps on a website. There is an enhancement, though, that I would like to propose.

I have noticed that for one of the sites I tried, the server asks for authentication via HttpNtlmAuth.

I was wondering if such arguments could be passed along with the URL to handle the authentication?

thanks

Detection of sitemap if it's not present in robots.txt

First of all, thanks for this great package. I like it; it works better than any of my own implementations of sitemap parsers. However, I see features which could be added, and I would rather contribute to this project than invent something else.
It would be cool to detect sitemap.xml in other locations, not only via robots.txt. It would increase the chances of finding and parsing the sitemap.
Why? Because robots.txt is an optional file, and it's also optional to put a sitemap declaration inside robots.txt. But we can guess: sitemap.xml is often placed at https://www.example.com/sitemap.xml

What do you think?
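
A sketch of the proposed fallback: probe a few conventional locations when robots.txt declares no sitemaps (the candidate paths are guesses by convention, not a spec):

import requests

CANDIDATE_PATHS = ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz']

def guess_sitemaps(homepage: str):
    found = []
    for path in CANDIDATE_PATHS:
        url = homepage.rstrip('/') + path
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            if resp.status_code == 200:
                found.append(url)
        except requests.RequestException:
            pass  # unreachable or erroring candidates are simply skipped
    return found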

Provide a simple mechanism to parse raw sitemap content

I had to use XMLSitemapParser directly in order to accomplish this. It's something of a hack, since the project seems geared to work only with HTTP URLs, and the parser classes always want both URLs and HTTP client objects. However, the URLs are only used for logging, and the client objects are only used in very narrow cases (which will never apply to me/us). So requiring HTTP seems like an arbitrary requirement most of the time.

You might just add general support for "file:" schemes and resolve both issues (the validation that only allows HTTP URLs, and keeping us from having to use the internal classes directly, since they don't appear to have been meant to be used that way).
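
For reference, here is roughly what the workaround looks like. This is a sketch against the version I used; XMLSitemapParser lives in usp.fetch_parse, but its constructor arguments are an assumption on my part and may differ between releases:

from usp.fetch_parse import XMLSitemapParser

with open('sitemap.xml', encoding='utf-8-sig') as f:
    content = f.read()

# The URL is only used for logging, and no web client is needed for a plain
# (non-index) sitemap, hence web_client=None.
parser = XMLSitemapParser(
    url='file:///sitemap.xml',
    content=content,
    recursion_level=0,
    web_client=None,
)
sitemap = parser.sitemap()

for page in sitemap.all_pages():
    print(page)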
