cascadia.jl's Introduction

Cascadia

A CSS Selector library in Julia.

Inspired by, and mostly a direct translation of, the Cascadia CSS Selector library, written in Go, by @andybalholm.

This package depends on the Gumbo.jl package by @porterjamesj, which is a Julia wrapper around Google's Gumbo HTML parser library.

Usage

Usage is simple. Parse an HTML string into a document with Gumbo, create a Selector from a string, and call eachmatch to get the nodes in the document that match the selector. The sel"<selector string>" string macro is shorthand for calling Selector. eachmatch returns an array of the matching elements: an empty array if nothing matches, and a one-element array for a unique match. To test whether a selector matches, check the length of the array.

using Cascadia
using Gumbo

n = parsehtml("<p id=\"foo\"><p id=\"bar\">")
s = Selector("#foo")
sm = sel"#foo"
eachmatch(s, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
#  Gumbo.HTMLElement{:p}

eachmatch(sm, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
#  Gumbo.HTMLElement{:p}

Note: The top-level matching function was renamed from matchall (Julia v0.6) to eachmatch (Julia v0.7 and higher) to reflect the same change in Julia Base.
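As a small illustration of the "check the length" advice, the sketch below runs one selector that matches and one that does not against the same document. The `#nope` selector and the variable names are made up for the example.

```julia
using Cascadia, Gumbo

doc = parsehtml("<p id=\"foo\">first</p><p id=\"bar\">second</p>")

hits = eachmatch(sel"#foo", doc.root)     # one element carries id="foo"
misses = eachmatch(sel"#nope", doc.root)  # no element carries id="nope"

# An empty result is an ordinary empty array, so guard before indexing.
if !isempty(hits)
    println("matched: ", nodeText(hits[1]))
end
println("number of misses: ", length(misses))
```

Guarding with isempty (or length) before indexing avoids a BoundsError when a selector matches nothing.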

Webscraping Example

The primary use case for this library is web scraping: the automatic extraction of information from HTML pages. As an example, consider the following code, which returns a list of questions that have been tagged with julia-lang on StackOverflow.

using Cascadia, Gumbo, HTTP

r = HTTP.get("http://stackoverflow.com/questions/tagged/julia-lang")
h = parsehtml(String(r.body))

qs = eachmatch(Selector(".question-summary"),h.root)

println("StackOverflow Julia Questions (votes  answered?  url)")

for q in qs
    votes = nodeText(eachmatch(Selector(".votes .vote-count-post "), q)[1])
    answered = length(eachmatch(Selector(".status.answered"), q)) > 0 || length(eachmatch(Selector(".status.answered-accepted"), q)) > 0
    href = eachmatch(Selector(".question-hyperlink"), q)[1].attributes["href"]
    println("$votes  $answered  http://stackoverflow.com$href")
end

This code produces the following output:

StackOverflow Julia Questions (votes  answered?  url)

0  false  http://stackoverflow.com/questions/59361325/how-to-get-a-rolling-window-regression-in-julia
0  true  http://stackoverflow.com/questions/59356818/how-i-translate-python-code-into-julia-code
-2  false  http://stackoverflow.com/questions/59354720/how-to-fix-this-error-in-julia-throws-same-error-for-all-packages-not-found-i
-1  true  http://stackoverflow.com/questions/59354407/julia-package-for-geocoding
1  false  http://stackoverflow.com/questions/59350631/jupyter-lab-precompile-error-for-kernel-1-0-after-adding-kernel-1-3
0  true  http://stackoverflow.com/questions/59348461/genie-framework-does-not-install-under-julia-1-2
...
2  true  http://stackoverflow.com/questions/59300202/julia-package-install-fail-with-please-specify-by-known-name-uuid
2  false  http://stackoverflow.com/questions/59297379/how-do-i-transfer-my-packages-after-installing-a-new-julia-version

Note that this returns the elements on the first page of the query results. Getting the values from subsequent pages is left as an exercise for the reader.
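The loop above indexes each result with [1], which throws if a selector matches nothing (for instance, if the page markup changes). A more defensive variant might look like the sketch below. The HTML fragment is hypothetical and only imitates the class names used above, and firstmatch and summarize are illustrative helpers, not part of Cascadia.

```julia
using Cascadia, Gumbo

# Hypothetical markup that imitates the class names used in the loop above.
html = """
<div class="question-summary">
  <div class="votes"><span class="vote-count-post">3</span></div>
  <div class="status answered"></div>
  <a class="question-hyperlink" href="/questions/1/example">Example</a>
</div>
<div class="question-summary">
  <a class="question-hyperlink" href="/questions/2/no-votes">No votes element</a>
</div>
"""

# Illustrative helper: first match, or `nothing` when there is none.
function firstmatch(selector, node)
    m = eachmatch(selector, node)
    return isempty(m) ? nothing : first(m)
end

# Illustrative helper: collect the fields of one question block,
# using `missing` for anything that was not found.
function summarize(q)
    v = firstmatch(sel".vote-count-post", q)
    a = firstmatch(sel".question-hyperlink", q)
    votes = v === nothing ? missing : nodeText(v)
    href = a === nothing ? missing : a.attributes["href"]
    answered = !isempty(eachmatch(sel".status.answered, .status.answered-accepted", q))
    return (votes = votes, answered = answered, href = href)
end

doc = parsehtml(html)
rows = [summarize(q) for q in eachmatch(sel".question-summary", doc.root)]
foreach(println, rows)
```

Returning missing for absent fields keeps one malformed entry from aborting the whole scrape.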

Current Status

This library should work with almost all CSS selectors, with the exception of CSS pseudo-elements, which are not yet supported. Please raise an issue if you find any other selector that doesn't work.

Specifically, the following selector types are tested, and known to work.

Selector
address
*
#foo
li#t1
*#t4
.t1
p.t1
div.teST
.t1.fail
p.t1.t2
p[title]
address[title="foo"]
[title ~= foo]
[title~="hello world"]
[lang|="en"]
[title^="foo"]
[title$="bar"]
[title*="bar"]
.t1:not(.t2)
div:not(.t1)
li:nth-child(odd)
li:nth-child(even)
li:nth-child(-n+2)
li:nth-child(3n+1)
li:nth-last-child(odd)
li:nth-last-child(even)
li:nth-last-child(-n+2)
li:nth-last-child(3n+1)
span:first-child
span:last-child
p:nth-of-type(2)
p:nth-last-of-type(2)
p:last-of-type
p:first-of-type
p:only-child
p:only-of-type
:empty
div p
div table p
div > p
p ~ p
p + p
li, p
p +/*This is a comment*/ p
p:contains("that wraps")
p:containsOwn("that wraps")
:containsOwn("inner")
p:containsOwn("block")
div:has(#p1)
div:has(:containsOwn("2"))
body :has(:containsOwn("2"))
body :haschild(:containsOwn("2"))
p:matches([\d])
p:matches([a-z])
p:matches([a-zA-Z])
p:matches([^\d])
p:matches(^(0|a))
p:matches(^\d+$)
p:not(:matches(^\d+$))
div :matchesOwn(^\d+$)
[href#=(fina)]:not([href#=(\/\/[^\/]+untrusted)])
[href#=(^https:\/\/[^\/]*\/?news)]
:input
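To show how a few of the selector types listed above behave in practice, here is a minimal sketch that counts matches in a small fragment. The fragment and the nmatches helper are invented for the example.

```julia
using Cascadia, Gumbo

# Hypothetical fragment, written only to exercise a few selector types.
html = """
<div id="main">
  <p class="t1" title="foo bar">one</p>
  <p class="t1 t2" lang="en-US">two</p>
  <span>three</span>
</div>
<p>outside</p>
"""
root = parsehtml(html).root

# Number of elements under `root` that match the selector string `s`.
nmatches(s) = length(eachmatch(Selector(s), root))

selectors = [
    "div > p",             # child combinator
    "p.t1:not(.t2)",       # class plus negation
    "[title~=\"bar\"]",    # whitespace-separated attribute word
    "[lang|=\"en\"]",      # language-prefix attribute match
    "p:contains(\"two\")", # text containment
    "p, span",             # selector group
]

for s in selectors
    println(rpad(s, 22), nmatches(s))
end
```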

cascadia.jl's Issues

ERROR: type Response has no field data

As shown below, on Julia 0.7 (Windows 7, 64-bit):

using Cascadia, Gumbo, HTTP

r = HTTP.get("http://stackoverflow.com/questions/tagged/julia-lang")
h = parsehtml(convert(String, r.data))
ERROR: type Response has no field data
Stacktrace:
 [1] getproperty(::Any, ::Symbol) at .\sysimg.jl:18
 [2] top-level scope at none:0
julia> r.data
ERROR: type Response has no field data
Stacktrace:
 [1] getproperty(::Any, ::Symbol) at .\sysimg.jl:18

Thanks, Paul
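For comparison, the web scraping example in the README reads the response through r.body and converts it with String(r.body). The sketch below shows that conversion without a network call; the byte vector is only a stand-in for r.body, which is assumed here to hold the raw response bytes.

```julia
using Cascadia, Gumbo

# Stand-in for `r.body`: the raw bytes of an HTML response (assumption for this sketch).
body = Vector{UInt8}("<p id=\"foo\">hi</p>")

# Convert the bytes to a String before handing them to Gumbo.
h = parsehtml(String(body))

found = eachmatch(sel"#foo", h.root)
println(length(found))
```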

Cascadia never finishes eachmatch execution

julia> using Gumbo, Cascadia

julia> s = Selector("""div[data-pseudo-content="Unternehmen"]+div""")
Selector(Cascadia.var"#55#56"{Selector,Selector,Bool}(Selector(Cascadia.var"#25#26"{Selector,Selector}(Selector(Cascadia.var"#5#6"(Core.Box("div"))), Selector(Cascadia.var"#7#8"{Cascadia.var"#11#12"{String}}(Cascadia.var"#11#12"{String}("Unternehmen"), Core.Box("data-pseudo-content"))))), Selector(Cascadia.var"#5#6"(Core.Box("div"))), true))

julia> begin
           root = parsehtml("""<tr _ngcontent-udu-c357="" class="table-body table-smallFont ng-tns-c357-96 ng-star-inserted"><td _ngcontent-udu-c357="" class="ng-tns-c357-96 ng-star-inserted"><div _ngcontent-udu-c357="" appdbtooltip="" class="ng-tns-c357-96" style="cursor: default; opacity: 0.4;"><span _ngcontent-udu-c357="" class="k-icon k-i-user ng-tns-c357-96"></span></div></td><!----><!----><td _ngcontent-udu-c357="" class="ng-tns-c357-96 kp-link ng-star-inserted">Mr Name</td><!----><td _ngcontent-udu-c357="" class="ng-tns-c357-96">Geschäftsführer <div _ngcontent-udu-c357="" class="ng-tns-c357-96 ng-star-inserted">der <span _ngcontent-udu-c357="" style="font-style: italic;" class="ng-tns-c357-96">Proud GmbH</span></div><!----></td><!----><td _ngcontent-udu-c357="" class="ng-tns-c357-96 ng-star-inserted">58</td><!----></tr>
           """)
           eachmatch(s, root.root)
       end

This code never stops execution for me and I have no clue why. I'd consider it a bug but maybe it is just my fault.

Tag ?

It may be nice to create a tag with Gumbo 0.8 compat. Currently, Cascadia 0.4.0 is checked out if Gumbo 0.8 is already loaded, since Cascadia 0.4.0 (still with a REQUIRE file) does not have any version bounds, while Cascadia 0.5.0 has

[compat]
Gumbo = "0.5, 0.7"

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

eachmatch returning empty array despite valid selector

I am trying to scrape some weather data from Weather Underground. According to the documentation, .class selectors should work, right? Why am I still getting an empty HTMLNode array with the following code:

using Cascadia
using Gumbo
using HTTP

function getWindChill(speed::Int64)

    url = "https://www.wunderground.com/weather/us/ma/boston";

    #get and parse html
    res = HTTP.get(url);
    html = parsehtml(String(res.body));

    #select the html element from DOM using cascadia and eachmatch
    s = Selector(".wu-value wu-value-to")
    temp = eachmatch(s, html.root)

    temp

    #calculation for wind chill based on temperature and speed
end

getWindChill(10)
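One thing worth checking in the selector itself: in CSS, a space is the descendant combinator, so ".wu-value wu-value-to" looks for a <wu-value-to> element inside an element with class wu-value. To match a single element carrying both classes, the classes are chained without a space. The sketch below shows the difference on a made-up fragment; the real page markup may differ, and this does not rule out other causes such as content that is filled in by JavaScript after the page loads.

```julia
using Cascadia, Gumbo

# Hypothetical fragment; the markup on the real page may differ.
doc = parsehtml("""<span class="wu-value wu-value-to">41</span>""")

# Descendant selector: a <wu-value-to> tag inside an element with class "wu-value".
descendant = eachmatch(Selector(".wu-value wu-value-to"), doc.root)

# Compound selector: one element that has both classes.
compound = eachmatch(Selector(".wu-value.wu-value-to"), doc.root)

println(length(descendant))
println(length(compound))
```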

:nth-child seemingly selects identical siblings

I have managed to reduce the situation to the following:

using Gumbo
using Cascadia

tree = parsehtml("""
<a></a>
<a class="xyz"></a>
<a></a>
""")
eachmatch(sel"a:nth-child(1)", tree.root)

This gives me the following result:

2-element Array{HTMLNode,1}:
 HTMLElement{:a}:<a></a>


 HTMLElement{:a}:<a></a>

Adding more children that are identical(ish) to the actual first sibling results in even more elements in the final array. Correct me if I’m wrong, but that shouldn’t be the case. In addition :nth-child(1) is not special, this also happens for other values of the pseudo-selector.

I’m on Julia 1.5.3, which reports the following:

]status Gumbo Cascadia
Status `~/.julia/environments/v1.5/Project.toml`
  [54eefc05] Cascadia v1.0.1
  [708ec375] Gumbo v0.8.0
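Until the cause is understood, one possible workaround is to select with the plain selector and then filter by position among the parent's element children, comparing nodes by identity (===). This is only a sketch: nth_child_matches is a hypothetical helper, not part of Cascadia, and it makes no claim about why :nth-child behaves as reported.

```julia
using Gumbo
using Cascadia

tree = parsehtml("""
<a></a>
<a class="xyz"></a>
<a></a>
""")

# Hypothetical helper: keep only matches that are the n-th element child
# of their parent, using object identity rather than == to locate them.
function nth_child_matches(selector, root, n)
    filter(eachmatch(selector, root)) do el
        el.parent isa HTMLElement || return false
        siblings = [c for c in el.parent.children if c isa HTMLElement]
        findfirst(c -> c === el, siblings) == n
    end
end

first_links = nth_child_matches(sel"a", tree.root, 1)
println(length(first_links))
```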
