
carpentries-incubator / lc-webscraping

37 stars, 15 watchers, 27 forks, 6.92 MB

Introduction to web scraping

Home Page: https://carpentries-incubator.github.io/lc-webscraping/

License: Other

Languages: Makefile 3.46%, HTML 38.61%, CSS 4.31%, JavaScript 1.03%, R 4.05%, Python 43.34%, Shell 0.22%, Ruby 0.23%, SCSS 4.75%
Topics: carpentries, lesson, python, webscraping, scraping, web-scraping, english, programming, alpha

lc-webscraping's People

Contributors

abbycabs, brandoncurtis, cmacdonell, erinbecker, evanwill, fmichonneau, gvwilson, ianlee1521, ishandahal, jcoliver, jduckles, jpallen, jsta, katrinleinweber, kimpham54, malramsay64, mawds, maxim-belkin, mr-c, naught101, neon-ninja, pbanaszkiewicz, pipitone, rgaiacs, synesthesiam, timtomch, tobyhodges, twitwi, wclose, weaverbel


lc-webscraping's Issues

Lesson Maintainer Recruitment

Hi @JoshuaDull @timtomch

Compared to the other lessons, this web scraping lesson's team seems greatly understaffed and could probably use some helping hands. Since this is an alpha lesson, many issues and pull requests are still to come.

In case you are looking for a maintainer, I would be glad to help.

We may wish to update the lesson page to advertise for additional maintainers. That might attract some of the new LCAG members (recruited in 2021) and allow them to express their interest in joining the team.

Best Regards,
Annajiat

Reflections after teaching it

This afternoon, I had three hours (including a 10-minute break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and I am not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, ?academics, etc. from Sydney universities. I did not get anything in the way of a survey, but I hope to ask the ResBaz organisers to email students for their comments.

There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.

read more: data-lessons/library-webscraping-DEPRECATED#41

Also germane to this is this issue: data-lessons/library-webscraping-DEPRECATED#30

ID setting in Visual Scraping resolution children

I may be wrong, but in the Visual Scraping tutorial I suspect that the value for the child selector listed as the second child under the "The data for each resolution" section should be url rather than symbol (the value used in the child selector above it).
Similarly, in the answer to the "add new selectors for date and title" challenge, the second answer should be id:title not id:date.
(PS to authors: a pleasure to work through this lesson for the May 10/11 2018 sprint!)

Deal with content from Additions heading in episode 02

Accompanies PR #56. This section includes multiple FIXME notes, which lead me to assume it wasn't meant to be in the public version of the lesson.

Maintainers should determine whether to add corrected/complete information on some/all of these topics or remove them entirely. Merging the PR would remove it from the published lesson. At that point this issue can be used to discuss longer-term solutions.

Additions

FIXME: add more XPath functions such as concat() and normalize-space().
FIXME: mention XPath Checker for Firefox.
FIXME: Firefox sometimes cleans up the HTML of a page before displaying it, meaning that the DOM tree we can access through the console might not reflect the actual source code. <tbody> elements are typically not reliable. The Scrapy documentation has more on the topic.
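
To make the first and third FIXMEs concrete, here is a minimal sketch (not from the lesson) of what concat() and normalize-space() do, and why <tbody> can mislead, using Python with lxml on a made-up HTML snippet:

    from lxml import html

    snippet = """
    <html><body>
      <table>
        <tr><td>  Alice </td><td>Ottawa</td></tr>
        <tr><td>Bob</td><td> Toronto </td></tr>
      </table>
    </body></html>
    """
    tree = html.fromstring(snippet)

    # normalize-space() trims leading/trailing whitespace and collapses runs of spaces.
    print(tree.xpath("//tr[1]/td[1]/text()"))            # ['  Alice ']
    print(tree.xpath("normalize-space(//tr[1]/td[1])"))  # Alice

    # concat() joins strings inside the XPath expression itself.
    print(tree.xpath(
        'concat(normalize-space(//tr[1]/td[1]), ", ", normalize-space(//tr[1]/td[2]))'
    ))  # Alice, Ottawa

    # Browsers insert <tbody> into the DOM even when the source HTML has none, so an
    # XPath copied from the console (e.g. //table/tbody/tr) may match nothing in the
    # raw source a scraper sees; //table//tr is the safer form.
    print(tree.xpath("//table/tbody/tr"))   # [] for this snippet (no <tbody> in source)
    print(len(tree.xpath("//table//tr")))   # 2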

Scraper extension icon gives error

I installed the Scraper Chrome extension. In Episode 3, under 'Scrape similar', there is a line: "Alternatively, the “Scrape similar” option can also be accessed from the Scraper extension icon:". When I use the extension from the icon, I get an error: "Frames are not supported at the moment. Please open the frame in a new tab or window and try scraping again." If I use the right-click page option, there is no error.
I'm using Windows 10, Chrome 86, and Scraper 1.7.

Rework of lesson available for mining into the original

Hi folks,

Last year, I reworked this lesson (https://github.com/resbazSQL/lc-webscraping) as a way of integrating it with the SWC capstone "excel to database" (https://github.com/resbazSQL/capstone-novice-spreadsheet-biblio). My pull request back was rightly rejected for being entirely too large. While I've had "todo: break lesson into commits" on my todo list for the last year, it's worth noting that the reworked (and taught) lesson is available for other folks (including those working on instructor checkouts) to mine text from.

Here is an incomplete listing of changes:

  • Lessons reworked to use perma.cc datasources, so that when the parent pages change, the lesson doesn't break
  • Tried to incorporate repeating themes into the lesson flow (browser extension, then console, then Scrapy) to reinforce learning
  • Reduced emphasis on hand-crafting XPaths in the browser console
  • Made the Scrapy output flow into the excel-to-database lesson
  • Made the pages refer to multiple countries, to reduce the single-country political focus

I hope it's useful to folks who want to find text to potentially address issues they find. It's unlikely that I'll have time in the next few months to break my edits into a series of commits for proper staging back into main.

update link suggestion for case study Ep5 conclusion

Hello! The following message was sent by a community member to the Carpentries HQ inbox, passed on to the LC advisory committee, and has hopefully ended up here, the best place to consider this suggestion.


Hi,

Are you able to please update something on your website?

You are linking to a dead link over anchor text: "This case study" here: https://librarycarpentry.org/lc-webscraping/05-conclusion/

The link is going to this page: http://naelshiab.com/members-parliament-web-scraping/

I found a good working replacement for you here: https://prowebscraper.com/blog/data-mining-examples/

I think it's the best suitable replacement for you. Hope this helps!

Change to underlying target page URLs for Visual Scraping lesson

In the "The data for each resolution" section, the screenshot preview of the list of symbols and urls that are children of the year 2016 parent page has URLs that appear to be different from those embedded in the current version of the UN Security Council page. Whereas the lesson shows URLs of the structure http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2336(2016), the parent page now has http://undocs.org/S/RES/2336(2016).
Note that both links work, so this doesn't affect anything other than what a user might expect to see in a data preview or export table.

Scraper export to Google Drive disabled

I am in Episode 4. When I got to the line "and we can use the export functions to either create a Google Spreadsheet with the results", I clicked the Scraper button labelled 'Export to Google Docs...'.
I chose my Google account and then saw this error:

"Sign in with Google temporarily disabled for this app
This app has not been verified yet by Google in order to use Google Sign In."

Is anyone else having this problem?

Dead link to XPath Cheatsheet in Ep2

In 02-xpath.md, under References, there is a link to the XPath Cheatsheet (a PDF of a Markdown file) which does not work, for a couple of reasons:

  1. The lesson name should be lc-webscraping, not library-webscraping.
  2. The link into the _extras folder does not work like that.

I don't know how to make the link work, other than by using the download button on github.com.

Safari developer links changed

On my version of Safari (11.1), you must first turn on the "Develop" menu (in Preferences), then navigate to Develop > Show JavaScript Console, and then click the "Console" tab.


Adjust lesson or links so directions match content in Episode 3

Some updates were made in #29 to deal with changes to the Canadian Parliament webpages for Episode 3, but the link that lists the mailing addresses, used in the Custom XPath Queries section, now forwards to a much less scrape-friendly version of the list (also with different members) that doesn't match any of the directions.

Either the directions and screenshots need to be updated to match the new page, or an archived version of the old page can be used.

For the former, I don't know how to cleanly pull out the relevant information, because it's not wrapped in its own tags but is just element text inside larger sections.

For the latter, I was able to adapt the XPath when using an archive.org capture by changing //body/div[1]/div/ul to //div[4]/div/div/ul, and then the rest of the commands worked (see the sketch below). But you'd probably need to add something explaining web archiving and why we're doing it.

Either way, the lesson is currently broken beginning at Custom XPath Queries.

I'm not adding a PR, both because I don't have time to develop the explanation of web archiving and because I can't solve it the other way.
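
For anyone picking this up, here is a rough sketch of the archive.org workaround in Python with requests and lxml; the capture URL is a placeholder to be replaced with an actual Wayback Machine snapshot of the old directory page:

    import requests
    from lxml import html

    # Placeholder, not a real snapshot: substitute a Wayback Machine capture of the
    # old mailing-addresses page before running this.
    ARCHIVED_PAGE = "https://web.archive.org/web/TIMESTAMP/ORIGINAL_PAGE_URL"

    response = requests.get(ARCHIVED_PAGE)
    tree = html.fromstring(response.content)

    # Original lesson XPath (no longer matches): //body/div[1]/div/ul
    # Adapted XPath that worked against the archive.org capture:
    for ul in tree.xpath("//div[4]/div/div/ul"):
        for li in ul.xpath("./li"):
            print(li.text_content().strip())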

Legislative Assembly of Ontario Page Structure changes

Using this page as the closest approximation to http://www.ontla.on.ca/web/members/members_current.do, it seems that the page structure has changed significantly from the version shown in Episode 4 (starting here).

One possibility may be to use a selector like //tbody/tr//a/@href, but unfortunately that won't show the contains function well. Another option might be to show selecting every second row using a selector like //*[contains(concat(" ", normalize-space(@class), " "), " even ")] (see the sketch below).
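
To illustrate the trade-off, here is a small Python and lxml sketch that runs both selectors against a simplified stand-in table; the markup below is invented and the real page will differ:

    from lxml import html

    page = """
    <html><body><table><tbody>
      <tr class="odd"><td><a href="/members/1">Member One</a></td></tr>
      <tr class="even"><td><a href="/members/2">Member Two</a></td></tr>
      <tr class="odd"><td><a href="/members/3">Member Three</a></td></tr>
      <tr class="even"><td><a href="/members/4">Member Four</a></td></tr>
    </tbody></table></body></html>
    """
    tree = html.fromstring(page)

    # Option 1: grab every member link (simple, but does not demonstrate contains()).
    print(tree.xpath("//tbody/tr//a/@href"))
    # ['/members/1', '/members/2', '/members/3', '/members/4']

    # Option 2: select every second row via its class, which does exercise
    # contains() and normalize-space().
    even_rows = tree.xpath(
        '//*[contains(concat(" ", normalize-space(@class), " "), " even ")]'
    )
    print([row.text_content().strip() for row in even_rows])
    # ['Member Two', 'Member Four']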

This is very different from the data-lessons version

I'd be interested to know why all the changes (arguably improvements) to https://github.com/data-lessons/library-webscraping were not transferred here; this repository seems to have adopted, and then edited, an older version. Has there been any consideration of bringing over the beneficial changes from data-lessons?

Apart from anything else, I think the Scraper tool used here is much less powerful than the state of the art in visual scraping systems: its limitation to single-page scrapes is especially problematic. Then getting students to understand site maps, fetching, etc., in order to write a Python scraper, when they've previously only done single-page visual scraping, is a big jump for students who are not at home with coding.

Changes to UK Members of Parliament

The current site for the UK House of Commons has changed significantly from the example in "Manual scraping using the Scraper extension". The nicely structured data has been replaced with a collection of <div> elements that is much harder to scrape automatically (see the sketch below).
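
For illustration only, this is roughly what scraping that kind of div-based markup involves in Python with lxml; the class names and structure below are invented and will not match the real House of Commons page:

    from lxml import html

    # Invented markup standing in for the new div-based member cards.
    page = """
    <html><body>
      <div class="card"><div class="card-name">A. Member</div><div class="card-party">Example Party</div></div>
      <div class="card"><div class="card-name">B. Member</div><div class="card-party">Another Party</div></div>
    </body></html>
    """
    tree = html.fromstring(page)

    # Without table rows, each field has to be picked out by its CSS class.
    for card in tree.xpath('//div[@class="card"]'):
        name = card.xpath('string(.//div[@class="card-name"])').strip()
        party = card.xpath('string(.//div[@class="card-party"])').strip()
        print(name, "|", party)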

June 2019 Lesson Release checklist

If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.

To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:

  • Example code chunks run as expected
  • Challenges / exercises run as expected
  • Challenge / exercise solutions are correct
  • Call out boxes (exercises, discussions, tips, etc) render correctly
  • A schedule appears on the lesson homepage (e.g. not “00:00”)
  • Each episode includes learning objectives
  • Each episode includes questions
  • Each episode includes key points
  • Setup instructions are up-to-date, correct, clear, and complete
  • File structure is clean (e.g. delete deprecated files, ensure filenames are consistent)
  • Some Instructor notes are provided
  • Lesson links work as expected

When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).
