ultimate-sitemap-parser's Introduction

Website sitemap parser for Python 3.5+.

Installation

pip install ultimate-sitemap-parser

Usage

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on the website; see the reference of AbstractSitemap subclasses.

If you'd like to just list all the pages found in all of the sitemaps within the website, consider using the all_pages() method:

# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

The all_pages() method returns an iterator yielding SitemapPage objects; see the reference of SitemapPage.
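
For example, to collect just the page URLs into a list (a minimal sketch; SitemapPage objects expose a url attribute):

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')

# Each SitemapPage carries the page URL (among other fields).
urls = [page.url for page in tree.all_pages()]
print(len(urls), 'pages found')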

ultimate-sitemap-parser's People

Contributors

pypt, tgrandje

ultimate-sitemap-parser's Issues

Library interferes with application logging configuration

Because of this:

log = create_logger(__name__)

and this:

if not self.__l.handlers:
    formatter = logging.Formatter(
        fmt='%(asctime)s %(levelname)s %(name)s [%(process)d/%(threadName)s]: %(message)s'
    )
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    self.__l.addHandler(handler)

this library almost always ends up configuring the calling application's logging in undesirable ways. It's usually not appropriate to unconditionally print logging to the screen. Can you please remove the code (shown above) that adds handlers when no handlers already exist?

Also, this module unnecessarily overrides all of the built-in logging functionality. It could just use the logging module directly (e.g. _LOGGING = logging.getLogger(__name__)), but that's just a side comment.

I'm currently having to do this to squash your output:

import logging

import usp.fetch_parse
import usp.helpers

logging.getLogger('usp.fetch_parse').handlers = []
logging.getLogger('usp.helpers').handlers = []

Unable to gunzip response

I used Python code like that from the example:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)

And it raised error:

2023-03-11 12:45:00,197 ERROR usp.helpers [26140/MainThread]: Unable to gunzip response <usp.web_client.requests_client.RequestsWebClientSuccessResponse object at 0x00000244E6BAF280>, maybe it's a non-gzipped sitemap: Unable to gunzip data: Not a gzipped file (b' <')
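
A tolerant decompression step on the fetching side would avoid this class of error. A minimal sketch (the helper name is hypothetical): only gunzip when the payload actually starts with the gzip magic bytes, and otherwise hand the data through unchanged:

import gzip

GZIP_MAGIC = b'\x1f\x8b'

def maybe_gunzip(data: bytes) -> bytes:
    # Servers sometimes serve plain XML under a .gz URL or with a gzip
    # Content-Type; sniff the magic number instead of trusting metadata.
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data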

Prevent XML parser from parsing gzipped XMLs that it's unable to decompress

2018-11-26 12:59:27,847 INFO mediawords.util.sitemap.fetchers [194712/MainThread]: Fetching level 1 sitemap from https://www.iberlibro.com/sitemap.bdp31.xml.gz...
2018-11-26 12:59:27,848 INFO mediawords.util.sitemap.helpers [194712/MainThread]: Fetching URL https://www.iberlibro.com/sitemap.bdp31.xml.gz...
2018-11-26 12:59:28,433 ERROR mediawords.util.sitemap.helpers [194712/MainThread]: Unable to gunzip response <mediawords.util.web.user_agent.response.response.Response object at 0x7f3485abfcc8>: Unable to gunzip data: Not a gzipped file (b'<?')
2018-11-26 12:59:28,437 INFO mediawords.util.sitemap.fetchers [194712/MainThread]: Parsing sitemap from URL https://www.iberlibro.com/sitemap.bdp31.xml.gz...
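
A sketch of the guard the title proposes (helper name hypothetical): when decompression fails, either fall back to treating the payload as text or mark the sitemap invalid, but never hand undecodable bytes to the XML parser:

import gzip

def payload_for_parsing(raw: bytes):
    try:
        return gzip.decompress(raw)
    except OSError:
        pass  # not actually gzipped despite the .gz name
    try:
        raw.decode('utf-8-sig')
        return raw  # mislabelled plain XML/text; parse it as-is
    except UnicodeDecodeError:
        return None  # neither gzip nor text: caller should mark the sitemap invalid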

Error in request causes total crash

Because of a timeout error on a single sitemap (one that does not exist), the entire script crashes, so no other sitemaps are tried and an error is raised:

ERROR [2019-07-26 07:36:22,600 sitemap_scanner: 24] HTTPSConnectionPool(host='dutchitchannel.nl', port=443): Read timed out. (read timeout=60)
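
Until the library handles this internally, a per-site guard on the caller's side keeps one dead host from aborting a whole batch. A sketch (the exception caught is deliberately broad):

from usp.tree import sitemap_tree_for_homepage

sites = ['https://example.com/', 'https://dutchitchannel.nl/']
trees = {}
for site in sites:
    try:
        trees[site] = sitemap_tree_for_homepage(site)
    except Exception as exc:  # broad on purpose: one bad host must not stop the rest
        print('Skipping %s: %s' % (site, exc))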

BOM removal doesn't seem to work properly

2018-11-23 21:38:36,966 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Fetching level 0 sitemap from http://test.de/robots.txt...
2018-11-23 21:38:36,966 INFO mediawords.util.sitemap.helpers [194713/MainThread]: Fetching URL http://test.de/robots.txt...
2018-11-23 21:38:37,684 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Parsing sitemap from URL http://test.de/robots.txt...
2018-11-23 21:38:37,688 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Fetching level 0 sitemap from https://www.test.de/sitemap.xml...
2018-11-23 21:38:37,689 INFO mediawords.util.sitemap.helpers [194713/MainThread]: Fetching URL https://www.test.de/sitemap.xml...
2018-11-23 21:38:38,050 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Parsing sitemap from URL https://www.test.de/sitemap.xml...
2018-11-23 21:38:38,060 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Fetching level 1 sitemap from https://www.test.de/sitemap.ashx?ressort=altersvorsorge-rente...
2018-11-23 21:38:38,060 INFO mediawords.util.sitemap.helpers [194713/MainThread]: Fetching URL https://www.test.de/sitemap.ashx?ressort=altersvorsorge-rente...
2018-11-23 21:38:45,793 INFO mediawords.util.sitemap.fetchers [194713/MainThread]: Parsing sitemap from URL https://www.test.de/sitemap.ashx?ressort=altersvorsorge-rente...
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <U+FEFF><?xml version="1.0" encoding="utf-8"?> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <url> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <loc>https://www.test.de/Gefoerderte-Altersvorsorge-Keine-staatlichen-Zulagen-verschenken-5398366-0/</loc> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <lastmod>2018-11-13T01:00:00+00:00</lastmod> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <changefreq>daily</changefreq> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL <priority>0.5</priority> doesn't look like an URL, skipping
2018-11-23 21:38:45,796 WARNING mediawords.util.sitemap.fetchers [194713/MainThread]: Story URL </url> doesn't look like an URL, skipping
<...>
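
Decoding the fetched body with Python's BOM-aware codec would sidestep this; a minimal sketch:

def decode_sitemap_body(raw: bytes) -> str:
    # 'utf-8-sig' strips a leading U+FEFF byte-order mark if present, so the
    # parser sees '<?xml ...' rather than '\ufeff<?xml ...' on the first line.
    return raw.decode('utf-8-sig', errors='replace')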

This site is not working => "set()" as result

Probably this site has a strange format, or did I call something wrong?

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://hls-dhs-dss.ch')
print(tree.all_pages())

The result:

2019-07-11 18:45:00,533 WARNING usp.tree [2344/MainThread]: Assuming that the homepage of https://hls-dhs-dss.ch is https://hls-dhs-dss.ch/
2019-07-11 18:45:00,534 INFO usp.fetchers [2344/MainThread]: Fetching level 0 sitemap from https://hls-dhs-dss.ch/robots.txt...
2019-07-11 18:45:00,534 INFO usp.helpers [2344/MainThread]: Fetching URL https://hls-dhs-dss.ch/robots.txt...
2019-07-11 18:45:00,821 INFO usp.fetchers [2344/MainThread]: Parsing sitemap from URL https://hls-dhs-dss.ch/robots.txt...
set()

Reading the robots.txt manually, I know there are two layers of sitemap.xml files.

Invalid time stamp cannot be handled

from usp.tree import sitemap_tree_for_homepage
sitemap_tree_for_homepage('https://www.lobinc.ca/')

Maybe the parser should fail more gracefully if a date can't be converted. The sitemap seems fine otherwise: https://www.lobinc.ca/sitemap.xml

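A lenient lastmod parser is one way to fail gracefully here. A sketch using python-dateutil (a common choice for flexible date parsing, not necessarily what the library does internally):

from dateutil import parser as date_parser

def parse_lastmod(value):
    # Return None for a malformed date instead of raising and killing the run.
    try:
        return date_parser.parse(value)
    except (TypeError, ValueError, OverflowError):
        return None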

SSL Certificate error fix?

I was testing this package for a web crawler I was building, but at times it gives the error below. Is there an argument I have to pass, or is this a bug?

_IndexWebsiteSitemap(url=https://www.crummy.com/, sub_sitemaps=[InvalidSitemap(url=https://www.crummy.com/robots.txt, reason=Unable to fetch sitemap from https://www.crummy.com/robots.txt: HTTPSConnectionPool(host='www.crummy.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (ssl.c:1131)'))))])

What I am trying is:

from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage("https://www.crummy.com")
print(tree)
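
Until the client exposes a certificate-verification switch, one heavy-handed workaround is to monkeypatch requests process-wide. This is only a sketch, and it disables TLS verification for everything in the process, so use it solely against hosts you trust:

import requests
import urllib3

# Silence the warning that requests/urllib3 emit for unverified HTTPS.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

_original_request = requests.Session.request

def _unverified_request(self, *args, **kwargs):
    kwargs['verify'] = False  # skip TLS certificate verification
    return _original_request(self, *args, **kwargs)

requests.Session.request = _unverified_request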

If `Content-Type` header is set, verify it's the expected one

For example, if the Content-Type for /robots.txt is text/html (and not text/plain), this usually means that the file is missing (and a 404 page was returned instead), so there's no need to attempt to parse it.

The same goes for XML files, plain text sitemaps, and gzipped XML / text sitemaps.
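
A sketch of the proposed check against a requests-style response object (the per-format expected types are illustrative, and an absent header should prove nothing):

EXPECTED_TYPES = {
    'robots': {'text/plain'},
    'xml': {'text/xml', 'application/xml', 'application/rss+xml', 'application/atom+xml'},
    'gzip': {'application/gzip', 'application/x-gzip', 'application/octet-stream'},
}

def content_type_matches(response, kind: str) -> bool:
    header = response.headers.get('Content-Type')
    if not header:
        return True  # nothing declared; don't reject on missing information
    mime = header.split(';')[0].strip().lower()  # drop any "; charset=..." suffix
    return mime in EXPECTED_TYPES[kind]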

Don't refetch sitemaps that were already fetched

2019-07-19 14:26:46,279 INFO mediawords.util.sitemap.media [95859/MainThread]: Fetching sitemap pages for media ID 10 (https://globalvoices.org/)...
2019-07-19 14:26:46,282 INFO usp.fetch_parse [95859/MainThread]: Fetching level 0 sitemap from https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,282 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,800 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/robots.txt...
2019-07-19 14:26:46,801 INFO usp.fetch_parse [95859/MainThread]: Fetching level 0 sitemap from https://globalvoices.org/sitemap.xml...
2019-07-19 14:26:46,802 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap.xml...
2019-07-19 14:26:46,963 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap.xml...
2019-07-19 14:26:46,979 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:26:46,979 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:26:47,407 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:26:47,413 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:26:47,414 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:26:48,998 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
<...>
2019-07-19 14:54:06,695 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:54:06,696 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:54:06,869 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-home.xml...
2019-07-19 14:54:06,871 INFO usp.fetch_parse [95859/MainThread]: Fetching level 1 sitemap from https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:54:06,872 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
2019-07-19 14:54:07,033 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.201907.xml...
<...>
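
The fix amounts to remembering which sitemap URLs were already fetched during the current run. A minimal sketch (names hypothetical):

class OnceFetcher:
    """Wraps a fetch callable and skips URLs already seen in this run."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._seen = set()

    def __call__(self, url):
        if url in self._seen:
            return None  # already fetched; the caller can reuse the earlier result
        self._seen.add(url)
        return self._fetch(url)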

Reduce recursivity level for sitemap fetcher

10 levels deep is probably too much:

2018-11-26 13:11:19,139 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,428 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Parsing sitemap from URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Fetching level 8 sitemap from https://www.juiceplus.com/il/en/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/il/en/franchise/sitemap.xml...
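
A depth cap in the fetcher would bound this. A sketch (the cap value and the names are illustrative):

MAX_SITEMAP_DEPTH = 5  # illustrative; ten levels of nesting is rarely legitimate

def crawl(url, fetch_children, depth=0):
    # Depth-capped walk over nested sitemap indexes: stop descending once the
    # cap is reached instead of following pathological chains indefinitely.
    if depth >= MAX_SITEMAP_DEPTH:
        return []
    urls = [url]
    for child in fetch_children(url):
        urls.extend(crawl(child, fetch_children, depth + 1))
    return urls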

RecursionError - maximum recursion depth exceeded while calling a Python object

I am having issues with parsing some of the URLs.
The code raises a RecursionError (traceback attached as an image in the original issue).
In the code sample I added two URLs causing this issue.

from usp.tree import sitemap_tree_for_homepage
url1='https://infirmaryhealth.org/'
t = sitemap_tree_for_homepage(url1)
url2='https://www.kdhmadison.org/'
t = sitemap_tree_for_homepage(url2)

This scenario happens when robots.txt contains itself as a Sitemap entry, i.e. a Sitemap: line pointing back to robots.txt.

Puts URLs in lowercase and, as a result, the URLs are no longer valid

Hello,
So here's a little issue.
Basically, USP lowercases all URLs, and as a result, if a URL contains uppercase characters, it can no longer be found.

Here's an example:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.distriartisan.fr')

The sitemap URLs look like this:

https://www.distriartisan.fr/media/sitemap/sitemapProduitsAll_1.xml

However, in the logs they appear like this:

2023-06-02 12:42:12,823 INFO usp.fetch_parse [7776/MainThread]: Parsing sitemap from URL https://www.distriartisan.fr/media/sitemap/sitemapproduitsall_1.xml...
2023-06-02 12:42:12,826 ERROR usp.fetch_parse [7776/MainThread]: Parsing sitemap from URL https://www.distriartisan.fr/media/sitemap/sitemapproduitsall_1.xml failed: Unsupported root element 'html'.
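
For context: per RFC 3986, only the scheme and host of a URL are case-insensitive; the path and query are not, so normalization must leave them alone. A sketch:

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Lowercase only the case-insensitive parts of the URL.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))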

ModuleNotFoundError: No module named 'http.client'; 'http' is not a package

Code:

from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage("https://www.nytimes.com/")
for page in tree.all_pages():
    print(page)

Error (note: the traceback shows the script's own ezweb\utils\http.py being imported in place of the standard library http package, which is what breaks the import):

Traceback (most recent call last):
  File "c:/Users/Raven/Desktop/CODE/GitHub/ezweb/src/ezweb/utils/sitemap.py", line 1, in <module>
    from usp.tree import sitemap_tree_for_homepage
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\tree.py", line 6, in <module>
    from .fetch_parse import SitemapFetcher
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\fetch_parse.py", line 11, in <module>
    from .helpers import (
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\helpers.py", line 15, in <module>
    from .web_client.abstract_client import (
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\usp\web_client\abstract_client.py", line 4, in <module>
    from http import HTTPStatus
  File "c:\Users\Raven\Desktop\CODE\GitHub\ezweb\src\ezweb\utils\http.py", line 1, in <module>
    import requests
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\__init__.py", line 43, in <module>
    import urllib3
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\__init__.py", line 11, in <module>
    from . import exceptions
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\exceptions.py", line 3, in <module>
    from .packages.six.moves.http_client import IncompleteRead as httplib_IncompleteRead
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\packages\six.py", line 199, in load_module
    mod = mod._resolve()
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\packages\six.py", line 113, in _resolve
    return _import_module(self.mod)
  File "C:\Users\Raven\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\packages\six.py", line 82, in _import_module
    __import__(name)
ModuleNotFoundError: No module named 'http.client'; 'http' is not a package

Can't use own logging handlers / Can't propagate logger

Hi,

I'm having some trouble changing usp's logging behavior. By default it outputs to stdout. It should be easy enough to use my own logger and handlers, with propagation to the root logger.

The only way I've figured out so far is to hack logging within a context manager, using the following code:

import logging
import os.path, pkgutil
import usp.tree

logger = logging.getLogger('')
logger.setLevel(logging.DEBUG)
logger.propagate = True

handler = logging.FileHandler('/tmp/test.log', mode='a')
formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
handler.setFormatter(formatter)
handler.setLevel(logging.DEBUG)
logger.addHandler(handler)

class CustomLogger():

  def __enter__(self):
    pkgpath = os.path.dirname(usp.tree.__file__)
    modules = [module.name for module in pkgutil.iter_modules([pkgpath])]
    
    for module in modules:
      l = logging.getLogger('usp.' + module)
      l.handlers = logger.handlers
      
    # Disable DEBUG for usp submodules
    logging.disable(logging.DEBUG)

  def __exit__(self, *args):
    logging.disable(logging.NOTSET)


with CustomLogger():
  page = "https://www.github.com/"
  tree = usp.tree.sitemap_tree_for_homepage(page)

It ain't pretty, but it works. Any hints on how to do this correctly?

How to set timeout properly

Hi team, thanks for this great lib! I have a couple of websites on my list that seem to be down; however, usp doesn't time out until about two hours later. How can I set the timeout to some shorter value for unresponsive websites? I did see the set_timeout parameter on the web client class, but I'm not exactly sure how to initialize a web client with a proper timeout set.

Here is how far I have come:

from usp.tree import sitemap_tree_for_homepage
from usp.web_client.requests_client import AbstractWebClient
from usp.web_client.requests_client import RequestsWebClient

RequestsWebClient.set_timeout(RequestsWebClient, timeout=10)
RequestsWebClient.set_max_response_data_length(RequestsWebClient, max_response_data_length=None)
web_client = RequestsWebClient

sitemap_tree_for_homepage('https://garfieldre2.org/', web_client=web_client)

which results in an error (screenshot attached in the original issue).

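For reference, the intended usage is to instantiate the client and call its methods on the instance. A sketch (that set_timeout takes seconds is an assumption, based on the parameter mentioned above):

from usp.tree import sitemap_tree_for_homepage
from usp.web_client.requests_client import RequestsWebClient

web_client = RequestsWebClient()  # instantiate; don't pass the class itself
web_client.set_timeout(10)        # give up on unresponsive hosts after 10 seconds

tree = sitemap_tree_for_homepage('https://garfieldre2.org/', web_client=web_client)
print(tree)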

Disable logging?

Is there a way to turn off logging? A single call to sitemap_tree_for_homepage spits out over 20 messages into the console, most of which are of no concern.
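
Not built in, but the module loggers can be silenced from the caller's side. A sketch (it inspects logging's registry of already-created loggers, so import the library first):

import logging

import usp.tree  # creates the 'usp.*' module loggers as a side effect

# Raise the level on every logger in the 'usp' namespace.
for name in list(logging.root.manager.loggerDict):
    if name == 'usp' or name.startswith('usp.'):
        logging.getLogger(name).setLevel(logging.CRITICAL)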

Convert SitemapNewsStory to dict

usp.objects.page.SitemapNewsStory has no __dict__ attribute because it uses __slots__. Therefore, vars() and __dict__ don't work. It would be nice to have an option to convert the attributes to a dict.

In [39]: a[-1].__dict__()                                                                                                                                                                                   
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-ab06281650cf> in <module>
----> 1 a[-1].__dict__()

AttributeError: 'SitemapPage' object has no attribute '__dict__'

In [40]: vars(a[-1])                                                                                                                                                                                        
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-ada32e179a17> in <module>
----> 1 vars(a[-1])

TypeError: vars() argument must have __dict__ attribute

Thanks for this lib. Works well.
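
In the meantime, a generic converter can walk __slots__ up the class hierarchy, handling the name mangling that private slot names undergo. A sketch (it assumes __slots__ holds a sequence of names, the common case):

def slots_to_dict(obj):
    # Slot-based instances have no per-instance __dict__, so collect the
    # value of every slot defined anywhere in the MRO instead.
    out = {}
    for cls in type(obj).__mro__:
        for slot in getattr(cls, '__slots__', ()):
            name = slot
            if slot.startswith('__') and not slot.endswith('__'):
                # Private slots like '__url' are stored under '_ClassName__url'.
                name = '_%s%s' % (cls.__name__.lstrip('_'), slot)
            if hasattr(obj, name):
                out[slot.lstrip('_')] = getattr(obj, name)
    return out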

Some sitemaps don't get fetched fully

2019-07-19 14:48:41,974 INFO usp.helpers [95859/MainThread]: Fetching URL https://globalvoices.org/sitemap-posttype-post.200705.xml...
2019-07-19 14:48:59,852 INFO usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.200705.xml...
2019-07-19 14:48:59,932 ERROR usp.fetch_parse [95859/MainThread]: Parsing sitemap from URL https://globalvoices.org/sitemap-posttype-post.200705.xml failed: no element found: line 4651, column 26
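
One plausible cause (not confirmed in the issue) is the web client's response-size cap truncating large sitemap files mid-document, which would produce exactly this "no element found" failure partway in. The client's set_max_response_data_length (mentioned in the timeout issue above) should lift the cap; a sketch:

from usp.tree import sitemap_tree_for_homepage
from usp.web_client.requests_client import RequestsWebClient

client = RequestsWebClient()
client.set_max_response_data_length(None)  # None = no cap (assumption)
tree = sitemap_tree_for_homepage('https://globalvoices.org/', web_client=client)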

End of Support

Since this project has been left by the wayside for a good chunk of time, the Mediacloud team has decided to formally end our support for ultimate-sitemap-parser.
As there are several active forks of this project, we wanted to give some lead time to figure out whether any community members might be interested in assuming ownership of this repository and the associated package on PyPI. This thread can be a place to figure out those details.
Otherwise, at the beginning of September 2024, this repository will be archived.

Authentication Method for Secured Sites?

Hi,
First of all, thank you for building such a fantastic library for discovering sitemaps on a website. There is an enhancement, though, that I would like to propose.

I have noticed that for one of the sites I tried, the server asks for authentication via HttpNtlmAuth.

I was wondering if such arguments could be passed along with the URL to handle the authentication?

thanks

Detection of sitemap if it's not present in robots.txt

First of all, thanks for this great package. I like it; it works better than any of my own implementations of sitemap parsers. However, I see features which could be added, and I would rather contribute to this project than invent something else.
It would be cool to detect sitemap.xml in other locations, not only via robots.txt. It would increase the chances of finding and parsing the sitemap.
Why? Because robots.txt is an optional file, and it's also optional to put a sitemap declaration inside robots.txt. But we can guess: sitemap.xml is often placed at https://www.example.com/sitemap.xml

What do you think?
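
A sketch of the proposed fallback: probe a few conventional locations when robots.txt declares no sitemaps (the candidate paths are guesses by convention, not a spec):

import requests

CANDIDATE_PATHS = ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz']

def guess_sitemaps(homepage: str):
    found = []
    for path in CANDIDATE_PATHS:
        url = homepage.rstrip('/') + path
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            if resp.status_code == 200:
                found.append(url)
        except requests.RequestException:
            pass  # unreachable or erroring candidates are simply skipped
    return found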

Provide a simple mechanism to parse raw sitemap content

I had to use XMLSitemapParser directly in order to accomplish this. It's something of a hack, since the project seems geared to work only with HTTP URLs, and the parser classes always want both URLs and HTTP client objects. However, the URLs are only used for logging, and the client objects are only used in very narrow cases (which will never apply to me/us). So requiring HTTP seems like an arbitrary requirement most of the time.

You might just add general support for "file:" schemes and resolve both issues (the validation that only allows HTTP URLs, and keeping us from having to use the internal classes directly, since they don't appear to have been meant to be used that way).
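
For reference, here is roughly what the workaround looks like. This is a sketch against the version I used; XMLSitemapParser lives in usp.fetch_parse, but its constructor arguments are an assumption on my part and may differ between releases:

from usp.fetch_parse import XMLSitemapParser

with open('sitemap.xml', encoding='utf-8-sig') as f:
    content = f.read()

# The URL is only used for logging, and no web client is needed for a plain
# (non-index) sitemap, hence web_client=None.
parser = XMLSitemapParser(
    url='file:///sitemap.xml',
    content=content,
    recursion_level=0,
    web_client=None,
)
sitemap = parser.sitemap()

for page in sitemap.all_pages():
    print(page)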
