carpentries-incubator / lc-webscraping
Introduction to web scraping
Home Page: https://carpentries-incubator.github.io/lc-webscraping/
License: Other
Discussion on what to include/what to exclude is here: data-lessons/library-webscraping-DEPRECATED#7, data-lessons/library-webscraping-DEPRECATED#11, data-lessons/library-webscraping-DEPRECATED#12, data-lessons/library-webscraping-DEPRECATED#8, data-lessons/library-webscraping-DEPRECATED#36, and data-lessons/library-webscraping-DEPRECATED#13.
Compared to the other lessons, this web scraping lesson team seems to be greatly understaffed and could probably use some helping hands. As this is an alpha lesson, a lot of issues and pull requests are yet to come.
In case you are looking for a maintainer, I would be glad to help.
We may wish to update the lesson page to advertise for additional maintainers. That might encourage some of the new LCAG members (recruited in 2021) to express their interest in joining the team.
Best Regards,
Annajiat
This afternoon, I had 3h (including a 10 min break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and I am not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, (possibly) academics, etc. from Sydney universities. I did not get anything in the way of a survey, but I hope to ask the ResBaz organisers to email students for their comments.
There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.
read more: data-lessons/library-webscraping-DEPRECATED#41
Also germane to this is this issue: data-lessons/library-webscraping-DEPRECATED#30
I may be wrong, but in the Visual Scraping tutorial I suspect that the value for the child selector listed as the second child under the "The data for each resolution" section should be url rather than symbol, as in the child selector above it.
Similarly, in the answer to the "add new selectors for date and title" challenge, the second answer should be id:title, not id:date.
(PS to authors: a pleasure to work through this lesson for the May 10/11 2018 sprint!)
On my Mac, typing spyder3 in the command line results in an error; spyder works, though.
Accompanies PR #56. This section includes multiple FIXME headings, which lead me to assume it wasn't meant to be in the public version of the lesson.
Maintainers should determine whether to add corrected/complete information on some/all of these topics or remove them entirely. Merging the PR would remove it from the published lesson. At that point this issue can be used to discuss longer-term solutions.
FIXME: add more XPath functions such as concat() and normalize-space().
FIXME: mention XPath Checker for Firefox
FIXME: Firefox sometimes cleans up the HTML of a page before displaying it, meaning that the DOM tree we can access through the console might not reflect the actual source code. <tbody> elements are typically not reliable. The Scrapy documentation has more on the topic.
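If the FIXME items above ever get written up, the functions could be illustrated along these lines. This is a minimal sketch using lxml (the library the lesson's Python episode relies on); the markup is invented for illustration.

```python
# Sketch of the XPath functions the FIXME mentions, using lxml.
# The markup below is invented for illustration.
from lxml import etree

tree = etree.fromstring(
    "<table><tr><td>  S/RES/2336   (2016) </td></tr></table>"
)

# normalize-space() trims the ends and collapses internal whitespace runs.
cell = tree.xpath("normalize-space(//td)")
print(cell)  # -> S/RES/2336 (2016)

# concat() joins strings, e.g. to build a label around a node's text.
label = tree.xpath("concat('Resolution: ', normalize-space(//td))")
print(label)  # -> Resolution: S/RES/2336 (2016)

# Browsers often insert a <tbody> that is absent from the raw source,
# so //table//tr is safer than //table/tbody/tr when scraping the source.
rows = tree.xpath("//table//tr")
print(len(rows))  # -> 1
```

The same queries work unchanged in Scrapy selectors, since both are backed by libxml2's XPath 1.0 engine.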
I installed the Scraper Chrome extension. In Episode 3, under 'Scrape similar', there is a line: "Alternatively, the “Scrape similar” option can also be accessed from the Scraper extension icon:". When I use the extension from the icon, I get an error: "Frames are not supported at the moment. Please open the frame in a new tab or window and try scraping again." If I use the right-click page option, there is no error.
I'm using Windows 10, Chrome 86, Scraper 1.7
It may be worth rewriting the lesson to use Bash, see for example
We should probably figure out how to work https://www.youtube.com/watch?v=BxV14h0kFs0&t=0s into the lesson.
Discussion on this: data-lessons/library-webscraping-DEPRECATED#48
Hi folks,
Last year, I reworked this lesson (https://github.com/resbazSQL/lc-webscraping) as a way of integrating it with the SWC capstone "excel to database." (https://github.com/resbazSQL/capstone-novice-spreadsheet-biblio) My pull request back was rightly rejected for being entirely too large. While I've had "todo: break lesson into commits" on my todo list for the last year, I suppose it's worth noting that the reworked (and taught) lesson is available for other folk (including those working on instructor checkouts) to mine text from.
Here is an incomplete listing of changes:
I hope it's useful to folks who want to find text to potentially address issues they find. It's unlikely that I'll have time in the next few months to break my edits into a series of commits for proper staging back into main.
Hello! The following message was sent by a community member to the Carpentries HQ inbox, passed on to the LC advisory committee, and has hopefully ended up here, the best place to consider this suggestion.
Hi,
Are you able to please update something on your website?
You are linking to a dead page from the anchor text "This case study" here: https://librarycarpentry.org/lc-webscraping/05-conclusion/
The link is going to this page: http://naelshiab.com/members-parliament-web-scraping/
I found a good working replacement for you here: https://prowebscraper.com/blog/data-mining-examples/
I think it's the best suitable replacement for you. Hope this helps!
In the "The data for each resolution" section, the screenshot preview of the list of symbols and urls that are children of the year 2016 parent page has URLs that appear to be different from those embedded in the current version of the UN Security Council page. Whereas the lesson shows URLs of the structure http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2336(2016), the parent page now has http://undocs.org/S/RES/2336(2016).
Note that both links work, so this doesn't affect anything other than what a user might expect to see in a data preview or export table.
I am in Episode 4. When I got to the line "and we can use the export functions to either create a Google Spreadsheet with the results", I clicked the Scraper button labelled 'Export to Google Docs...', chose my Google account, and then saw this error:
"Sign in with Google temporarily disabled for this app
This app has not been verified yet by Google in order to use Google Sign In."
Is anyone else having this problem?
Discussion is here: data-lessons/library-webscraping-DEPRECATED#16
In 02-xpath.md, under References, there is a link to the XPath Cheatsheet (a PDF of a md file) which does not work for a few reasons:
- The repository is lc-webscraping, not library-webscraping.
- Linking into the _extras folder does not work like that.
I don't know how to make the link work, other than using the download button from github.com.
On my version of Safari (11.1), you must first turn on the "Develop" menu (in Preferences), then navigate to Develop > Show JavaScript Console, and then click on the "Console" tab.
Specifically in the examples immediately following the "Select the 'introduction' title" section:
Similar to Issue #6, I needed to add the "article" tag to make the $x("html/body/div/blockquote") query work (so I changed it to $x("/html/body/div/article/blockquote")).
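For anyone hitting the same thing, the effect is easy to reproduce offline. A hedged sketch with invented markup, using lxml in place of the browser's $x() console helper:

```python
# Why the extra "article" step was needed: an absolute XPath names every
# element on the path, so a skipped <article> makes the match fail.
# Markup invented for illustration; lxml stands in for $x() here.
from lxml import etree

doc = etree.fromstring(
    "<html><body><div><article><blockquote>Quote"
    "</blockquote></article></div></body></html>"
)

print(len(doc.xpath("/html/body/div/blockquote")))          # -> 0 (misses <article>)
print(len(doc.xpath("/html/body/div/article/blockquote")))  # -> 1
print(len(doc.xpath("//blockquote")))                       # -> 1 ("//" matches at any depth)
```

This is also why lessons pinned to a live page's exact structure break so easily: a descendant search with "//" survives small layout changes that kill an absolute path.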
The link here did not resolve for me: https://github.com/LibraryCarpentry/lc-webscraping/blame/4239bf5b7aee9cad855d49e3bbd15e3a0870cf58/_episodes/03-manual-scraping.md#L127
I think it should be this?
https://www.ola.org/en/members/current?locale=en
Hi,
More broken links:
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L81
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L82
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L198
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L220
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L221
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L288
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L303
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L337
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L365
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L405
Given that the Canadian parliament is currently dissolved, there is also not much information under current members, so it might be better to point here: https://www.ola.org/en/members/parliament-41
@zkamvar
We need to update the style for this lesson so that jump lists (anchors for different sections) become automatically available for different levels of headings.
Some updates were done in #29 to deal with changes to the Canadian Parliament webpages for Episode 3, but the mailing-address-list link for the Custom XPath Queries section now forwards to a much less scrape-friendly version of the list (also with different members) that doesn't match any of the directions.
Either the directions and screenshots need to be updated to match the new page, or an archived version of the old page can be used.
For the former, I don't know how to cleanly pull out the relevant information, because it's not wrapped in its own tags but is just element text inside larger elements.
For the latter, I was able to adapt the XPath when using an archive.org capture by changing //body/div[1]/div/ul to //div[4]/div/div/ul, and then the rest of the commands worked. But you'd probably need to add something explaining web archiving and why we're doing it.
Either way, the lesson is currently broken beginning at Custom XPath Queries.
Not adding a PR, both because I don't have time to develop the explanation/information on web archiving and because I can't solve it the other way.
This lesson has now been migrated to the Library Carpentry organisation. Work (updates, issues) should happen on THIS repo.
All the issues that need resolving are still at the old location, however: https://github.com/data-lessons/library-webscraping-DEPRECATED/issues
Find issues to resolve there and fix them here.
Using this page as the closest approximation to http://www.ontla.on.ca/web/members/members_current.do, it seems that the page structure has changed significantly from the version shown in episode 4 (starting here).
One possibility may be to use a selector like //tbody/tr//a/@href, but unfortunately that won't show the contains function well. Another option might be to show selecting every second row using a selector like //*[contains(concat(" ", normalize-space(@class), " "), " even ")].
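Both candidate selectors can be tried offline. This is a sketch against an invented table (the real legislature page will differ), using lxml:

```python
# The concat/normalize-space idiom matches a class *token*, so it still
# works when a row carries several classes. Table invented for illustration.
from lxml import etree

page = etree.fromstring(
    '<table><tbody>'
    '<tr class="odd"><td><a href="/members/1">Member 1</a></td></tr>'
    '<tr class="even highlight"><td><a href="/members/2">Member 2</a></td></tr>'
    '<tr class="odd"><td><a href="/members/3">Member 3</a></td></tr>'
    '<tr class="row even"><td><a href="/members/4">Member 4</a></td></tr>'
    '</tbody></table>'
)

# Every link in the table body:
hrefs = page.xpath("//tbody/tr//a/@href")
print(hrefs)  # -> ['/members/1', '/members/2', '/members/3', '/members/4']

# Only rows whose class list contains the token "even", even when other
# classes are present on the same row:
even_rows = page.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " even ")]'
)
print(len(even_rows))  # -> 2
```

The padding with spaces is what prevents a naive contains(@class, "even") from also matching classes like "uneven", which is why the longer form is worth teaching despite its bulk.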
I'd be interested to know why all the changes (arguable improvements) to https://github.com/data-lessons/library-webscraping were not transferred here, which seems to have adopted, and then edited, an older version. Has there been consideration of bringing over any beneficial changes from data-lessons?
Apart from anything else, I think the Scraper tool used here is much less powerful than the state of the art in visual scraping systems: its limitation to single-page scrapes is especially problematic. Then getting students to understand site maps, fetching, etc, in order to write up a Python scraper, when they've previously only done single-page visual scraping, is a big jump for students who are not at home with coding.
In episode 2 (XPath), there's an error in the walk-through of the challenge that involves selecting the challenge box:

| // | and select the parent node of that h2 element |

should be:

| .. | and select the parent node of that h2 element |
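The difference between the two steps is easy to verify offline. A sketch with invented markup, using lxml:

```python
# ".." is the XPath step that moves from a matched node to its parent;
# "//" instead starts a new descendant search. Markup invented.
from lxml import etree

box = etree.fromstring(
    '<blockquote class="challenge">'
    '<h2>Select this challenge box</h2>'
    '<p>Challenge text.</p>'
    '</blockquote>'
)

h2 = box.xpath("//h2")[0]
parent = h2.xpath("..")[0]  # one step up from the h2
print(parent.tag)           # -> blockquote
```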
Setup link broken in web scraping Episode 4, "Web scraping using Python: requests and lxml". The broken Setup link is under "Introducing Requests and lxml".
The current site for the UK House of Commons has changed significantly from the example in "Manual Scraping using the Scraper extension". The nicely structured data has been replaced with a collection of <div> elements that is much harder to scrape automatically.
"Web scraping" is used throughout this lesson but the title in the LC lesson directory is "Webscraping". Please update for consistency.
If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.
To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:
When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).
Setup link broken in Visual Scraping Using Browser Extensions. Directly below "Why we chose the Web Scraper extension".
https://librarycarpentry.github.io/lc-webscraping/03-visual-scraping/
I needed the "article" tag in order to make this work: $x("/html/body/div/article/h1[1]")