Hi, more broken links: https://github.com/LibraryCarpen…

Links about lc-webscraping

RichardPBerry commented on May 24, 2024

Links

from lc-webscraping.

Comments (4)

libcce commented on May 24, 2024

Does it make sense to change the links now since the pages will ultimately be updated with new information? Or should we link to a particular snapshot in time using the Wayback Machine, for example: https://web.archive.org/web/20170715183551/http://www.ontla.on.ca/web/members/members_current.do?locale=en

RichardPBerry commented on May 24, 2024

I think it would be worth updating the link to the Wayback version, because it future-proofs against both the dissolution of the parliament and website changes. In fact, it's probably a great example of one of the perils of web scraping, in that your code can break if the website updates!

RichardPBerry commented on May 24, 2024

Actually, I just tried to scrape the IA page with Scrapy and received a "Forbidden by robots.txt" error. Currently the robots.txt is set to deny all user agents bar ia_archiver. I guess this might require someone from LC contacting IA to gain permission to scrape for training purposes?
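
For anyone wanting to reproduce that check without running a spider, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` text is a hypothetical file written to match the description above (everything denied except `ia_archiver`), not a copy of the real one, and `is_allowed` is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modelled on the description in this comment:
# ia_archiver may fetch everything, every other user agent is denied.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


SNAPSHOT = (
    "https://web.archive.org/web/20170715183551/"
    "http://www.ontla.on.ca/web/members/members_current.do?locale=en"
)

if __name__ == "__main__":
    for agent in ("ia_archiver", "Scrapy"):
        print(agent, is_allowed(ROBOTS_TXT, agent, SNAPSHOT))
```

Under these rules only `ia_archiver` is permitted, which would be consistent with Scrapy refusing the request when it is set to obey robots.txt.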

pansapiens commented on May 24, 2024

I've hit similar issues in episode 3 (PR https://github.com/LibraryCarpentry/lc-webscraping/pull/29).

Another issue with Internet Archive snapshots is that an extra <div> is injected, which changes the page structure and often the XPath generated by the Scraper browser plugin. The workaround for this is to use the id_ variant of the IA URL, which gives the original unmodified page (e.g., http://webarchive.parliament.uk/20150218214039/http://www.parliament.uk/mps-lords-and-offices/mps/ vs. http://webarchive.parliament.uk/20150218214039id_/http://www.parliament.uk/mps-lords-and-offices/mps/ ). Unfortunately this results in broken links to CSS and images, making the page look broken, but the content can still be reliably scraped using Scraper.
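
If the lesson ends up with several snapshot links, the rewrite to the id_ form could be scripted. This is only a sketch: `to_raw_snapshot_url` is a hypothetical helper, and it assumes the snapshot timestamp is a 14-digit path segment, as in the two URLs above.

```python
import re

# Assumption: the snapshot timestamp is a 14-digit path segment
# (YYYYMMDDhhmmss), as in the example URLs in this comment.
_TIMESTAMP_SEGMENT = re.compile(r"/(\d{14})/")


def to_raw_snapshot_url(snapshot_url: str) -> str:
    """Append 'id_' to the first timestamp segment of a snapshot URL.

    URLs that already use the id_ form, or that contain no 14-digit
    timestamp segment, are returned unchanged.
    """
    return _TIMESTAMP_SEGMENT.sub(r"/\1id_/", snapshot_url, count=1)


if __name__ == "__main__":
    print(to_raw_snapshot_url(
        "http://webarchive.parliament.uk/20150218214039/"
        "http://www.parliament.uk/mps-lords-and-offices/mps/"
    ))
```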

With regard to robots.txt - Scrapy can be configured to ignore this (https://www.simplified.guide/scrapy/ignore-robots), but it could be a controversial workaround with regard to the ethics of web scraping.
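
For reference, the switch in question is Scrapy's `ROBOTSTXT_OBEY` setting. The fragment below is a sketch of what a project's `settings.py` might contain; whether the lesson should actually ship with this is exactly the ethical question raised above.

```python
# Hypothetical fragment of a Scrapy project's settings.py.
# ROBOTSTXT_OBEY is a real Scrapy setting; projects generated by
# `scrapy startproject` set it to True in their settings.py.
# Setting it to False makes Scrapy skip the robots.txt check entirely,
# so only do this where you have permission to scrape the site.
ROBOTSTXT_OBEY = False
```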

I think we do need to use stable snapshots of pages hosted somewhere - it may be that the Internet Archive isn't the solution to that however.
