carpentries-incubator / lc-webscraping
Introduction to web scraping
Home Page: https://carpentries-incubator.github.io/lc-webscraping/
License: Other
Discussion on what to include/what to exclude is here: data-lessons/library-webscraping-DEPRECATED#7, data-lessons/library-webscraping-DEPRECATED#11, data-lessons/library-webscraping-DEPRECATED#12, data-lessons/library-webscraping-DEPRECATED#8, data-lessons/library-webscraping-DEPRECATED#36, and data-lessons/library-webscraping-DEPRECATED#13.
Compared to the other lessons, this web scraping lesson team seems to be greatly understaffed and could probably use some helping hands. As this is an alpha lesson, a lot of issues and pull requests are yet to come.
In case you are looking for a maintainer, I would be glad to help.
We may wish to update the lesson page to advertise for additional maintainers. That might encourage some of the new LCAG members (recruited in 2021) to express their interest in joining the team.
Best Regards,
Annajiat
This afternoon, I had 3h (including a 10 min break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and I am not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, (possibly) academics, etc. from Sydney universities. I did not get anything in the way of a survey, but I hope to ask the ResBaz organisers to email students for their comments.
There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.
read more: data-lessons/library-webscraping-DEPRECATED#41
Also germane to this is this issue: data-lessons/library-webscraping-DEPRECATED#30
I may be wrong, but in the Visual Scraping tutorial I suspect that the value for the child selector listed as the second child under the "The data for each resolution" section should be url rather than symbol, as in the child selector above it.
Similarly, in the answer to the "add new selectors for date and title" challenge, the second answer should be id:title, not id:date.
(PS to authors: a pleasure to work through this lesson for the May 10/11 2018 sprint!)
On my Mac, typing spyder3 in the command line results in an error; spyder works, though.
Accompanies PR #56. This section includes multiple FIXME headings, which lead me to assume it wasn't meant to be in the public version of the lesson.
Maintainers should determine whether to add corrected/complete information on some/all of these topics or remove them entirely. Merging the PR would remove it from the published lesson. At that point this issue can be used to discuss longer-term solutions.
FIXME: add more XPath functions such as concat() and normalize-space().
FIXME: mention XPath Checker for Firefox
FIXME: Firefox sometimes cleans up the HTML of a page before displaying it, meaning that the DOM tree we can access through the console might not reflect the actual source code. <tbody> elements are typically not reliable. The Scrapy documentation has more on the topic.
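If the FIXME items above ever get written up, the functions could be illustrated along these lines. This is a minimal sketch using lxml (the library the lesson's Python episode relies on); the markup is invented for illustration.

```python
# Sketch of the XPath functions the FIXME mentions, using lxml.
# The markup below is invented for illustration.
from lxml import etree

tree = etree.fromstring(
    "<table><tr><td>  S/RES/2336   (2016) </td></tr></table>"
)

# normalize-space() trims the ends and collapses internal whitespace runs.
cell = tree.xpath("normalize-space(//td)")
print(cell)  # -> S/RES/2336 (2016)

# concat() joins strings, e.g. to build a label around a node's text.
label = tree.xpath("concat('Resolution: ', normalize-space(//td))")
print(label)  # -> Resolution: S/RES/2336 (2016)

# Browsers often insert a <tbody> that is absent from the raw source,
# so //table//tr is safer than //table/tbody/tr when scraping the source.
rows = tree.xpath("//table//tr")
print(len(rows))  # -> 1
```

The same queries work unchanged in Scrapy selectors, since both are backed by libxml2's XPath 1.0 engine.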
I installed the Scraper Chrome extension. In Episode 3, under 'Scrape similar', there is a line: "Alternatively, the “Scrape similar” option can also be accessed from the Scraper extension icon:". When I use the extension from the icon, I get an error: "Frames are not supported at the moment. Please open the frame in a new tab or window and try scraping again." If I use the right-click page option, there is no error.
I'm using Windows 10, Chrome 86, Scraper 1.7
It may be worth rewriting the lesson to use Bash, see for example
We should probably figure out how to work https://www.youtube.com/watch?v=BxV14h0kFs0&t=0s into the lesson.
Discussion on this: data-lessons/library-webscraping-DEPRECATED#48
Hi folks,
Last year, I reworked this lesson (https://github.com/resbazSQL/lc-webscraping) as a way of integrating it with the SWC capstone "excel to database." (https://github.com/resbazSQL/capstone-novice-spreadsheet-biblio) My pull request back was rightly rejected for being entirely too large. While I've had "todo: break lesson into commits" on my todo list for the last year, I suppose it's worth noting that the reworked (and taught) lesson is available for other folk (including those working on instructor checkouts) to mine text from.
Here is an incomplete listing of changes:
I hope it's useful to folks who want to find text to potentially address issues they find. It's unlikely that I'll have time in the next few months to break my edits into a series of commits for proper staging back into main.
Hello! The following message was sent by a community member to the Carpentries HQ inbox, passed on to the LC advisory committee, and has hopefully ended up here, the best place to consider this suggestion.
Hi,
Are you able to please update something on your website?
You are linking to a dead page from the anchor text "This case study" here: https://librarycarpentry.org/lc-webscraping/05-conclusion/
The link is going to this page: http://naelshiab.com/members-parliament-web-scraping/
I found a good working replacement for you here: https://prowebscraper.com/blog/data-mining-examples/
I think it's the best suitable replacement for you. Hope this helps!
In the "The data for each resolution" section, the screenshot preview of the list of symbols and urls that are children of the year 2016 parent page has URLs that appear to be different from those embedded in the current version of the UN Security Council page. Whereas the lesson shows URLs of the structure http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2336(2016), the parent page now has http://undocs.org/S/RES/2336(2016).
Note that both links work, so this doesn't affect anything other than what a user might expect to see in a data preview or export table.
I am in Episode 4. When I got to the line "and we can use the export functions to either create a Google Spreadsheet with the results", I clicked the Scraper button labelled 'Export to Google Docs...', chose my Google account, and then saw this error:
"Sign in with Google temporarily disabled for this app
This app has not been verified yet by Google in order to use Google Sign In."
Is anyone else having this problem?
Discussion is here: data-lessons/library-webscraping-DEPRECATED#16
In 02-xpath.md, under References, there is a link to the XPath Cheatsheet (a PDF of a md file) which does not work for a few reasons:
- The repository is lc-webscraping, not library-webscraping.
- Linking into the _extras folder does not work like that.
I don't know how to make the link work, other than using the download button from github.com.
On my version of Safari (11.1), you must first turn on the "Develop" menu (in Preferences), then navigate to Develop > Show JavaScript Console, and then click on the "Console" tab.
Specifically in the examples immediately following the "Select the 'introduction' title" section:
Similar to Issue #6, I needed to add the "article" tag to make the $x("html/body/div/blockquote") query work (so I changed it to $x("/html/body/div/article/blockquote")).
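For anyone hitting the same thing, the effect is easy to reproduce offline. A hedged sketch with invented markup, using lxml in place of the browser's $x() console helper:

```python
# Why the extra "article" step was needed: an absolute XPath names every
# element on the path, so a skipped <article> makes the match fail.
# Markup invented for illustration; lxml stands in for $x() here.
from lxml import etree

doc = etree.fromstring(
    "<html><body><div><article><blockquote>Quote"
    "</blockquote></article></div></body></html>"
)

print(len(doc.xpath("/html/body/div/blockquote")))          # -> 0 (misses <article>)
print(len(doc.xpath("/html/body/div/article/blockquote")))  # -> 1
print(len(doc.xpath("//blockquote")))                       # -> 1 ("//" matches at any depth)
```

This is also why lessons pinned to a live page's exact structure break so easily: a descendant search with "//" survives small layout changes that kill an absolute path.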
The link here did not resolve for me: https://github.com/LibraryCarpentry/lc-webscraping/blame/4239bf5b7aee9cad855d49e3bbd15e3a0870cf58/_episodes/03-manual-scraping.md#L127
I think it should be this?
https://www.ola.org/en/members/current?locale=en
Hi,
More broken links:
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L81
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L82
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L198
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L220
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L221
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L288
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L303
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L337
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L365
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L405
Given that the Canadian parliament is currently dissolved, there is also not much information under current members, so it might be better to point here: https://www.ola.org/en/members/parliament-41
@zkamvar
We need to update the style for this lesson so that jump lists (anchors for different sections) become automatically available for different levels of headings.
Some updates were done in #29 to deal with changes to the Canadian Parliament webpages for Episode 3, but the mailing-address-list link for the Custom XPath Queries section now forwards to a much less scrape-friendly version of the list (also with different members) that doesn't match any of the directions.
Either the directions and screenshots need to be updated to match the new page, or an archived version of the old page can be used.
For the former, I don't know how to cleanly pull out the relevant information, because it's not wrapped in its own tags but is just element text inside larger elements.
For the latter, I was able to adapt the XPath when using an archive.org capture by changing //body/div[1]/div/ul to //div[4]/div/div/ul, and then the rest of the commands worked. But you'd probably need to add something explaining web archiving and why we're doing it.
Either way, the lesson is currently broken beginning at Custom XPath Queries.
Not adding a PR, both because I don't have time to develop the explanation/information on web archiving and because I can't solve it the other way.
This lesson has now been migrated to the Library Carpentry organisation. Work (updates, issues) should happen on THIS repo.
All the issues that need resolving are still at the old location, however: https://github.com/data-lessons/library-webscraping-DEPRECATED/issues
Find issues to resolve there and fix them here.
Using this page as the closest approximation to http://www.ontla.on.ca/web/members/members_current.do, it seems that the page structure has changed significantly from the version shown in episode 4 (starting here).
One possibility may be to use a selector like //tbody/tr//a/@href, but unfortunately that won't show the contains function well. Another option might be to show selecting every second row using a selector like //*[contains(concat(" ", normalize-space(@class), " "), " even ")].
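Both candidate selectors can be tried offline. This is a sketch against an invented table (the real legislature page will differ), using lxml:

```python
# The concat/normalize-space idiom matches a class *token*, so it still
# works when a row carries several classes. Table invented for illustration.
from lxml import etree

page = etree.fromstring(
    '<table><tbody>'
    '<tr class="odd"><td><a href="/members/1">Member 1</a></td></tr>'
    '<tr class="even highlight"><td><a href="/members/2">Member 2</a></td></tr>'
    '<tr class="odd"><td><a href="/members/3">Member 3</a></td></tr>'
    '<tr class="row even"><td><a href="/members/4">Member 4</a></td></tr>'
    '</tbody></table>'
)

# Every link in the table body:
hrefs = page.xpath("//tbody/tr//a/@href")
print(hrefs)  # -> ['/members/1', '/members/2', '/members/3', '/members/4']

# Only rows whose class list contains the token "even", even when other
# classes are present on the same row:
even_rows = page.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " even ")]'
)
print(len(even_rows))  # -> 2
```

The padding with spaces is what prevents a naive contains(@class, "even") from also matching classes like "uneven", which is why the longer form is worth teaching despite its bulk.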
I'd be interested to know why all the changes (arguable improvements) to https://github.com/data-lessons/library-webscraping were not transferred here, which seems to have adopted, and then edited, an older version. Has there been consideration of bringing over any beneficial changes from data-lessons?
Apart from anything else, I think the Scraper tool used here is much less powerful than the state of the art in visual scraping systems: its limitation to single-page scrapes is especially problematic. Then getting students to understand site maps, fetching, etc, in order to write up a Python scraper, when they've previously only done single-page visual scraping, is a big jump for students who are not at home with coding.
In episode 2 (XPath), there's an error in the walk-through of the challenge that involves selecting the challenge box:

| // | and select the parent node of that h2 element |

should be:

| .. | and select the parent node of that h2 element |
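The difference between the two steps is easy to verify offline. A sketch with invented markup, using lxml:

```python
# ".." is the XPath step that moves from a matched node to its parent;
# "//" instead starts a new descendant search. Markup invented.
from lxml import etree

box = etree.fromstring(
    '<blockquote class="challenge">'
    '<h2>Select this challenge box</h2>'
    '<p>Challenge text.</p>'
    '</blockquote>'
)

h2 = box.xpath("//h2")[0]
parent = h2.xpath("..")[0]  # one step up from the h2
print(parent.tag)           # -> blockquote
```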
Setup link broken in web scraping Episode 4, "Web scraping using Python: requests and lxml". The broken Setup link is under "Introducing Requests and lxml".
The current site for the UK House of Commons has changed significantly from the example in "Manual Scraping using the Scraper extension". The nicely structured data has been replaced with a collection of <div> elements that is much harder to scrape automatically.
"Web scraping" is used throughout this lesson but the title in the LC lesson directory is "Webscraping". Please update for consistency.
If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.
To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:
When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).
Setup link broken in Visual Scraping Using Browser Extensions. Directly below "Why we chose the Web Scraper extension".
https://librarycarpentry.github.io/lc-webscraping/03-visual-scraping/
I needed the "article" tag in order to make this work: $x("/html/body/div/article/h1[1]")