puerkitobio / goquery
A little like that j-thing, only in Go.
License: BSD 3-Clause "New" or "Revised" License
What if I have already downloaded an HTML file locally and want to parse it?
Based on the main example, I tried:
if doc, e = goquery.NewDocument("file://somefile.html"); e != nil {
	log.Fatal(e)
}
But it gives:
2014/06/20 17:05:04 Get file://somefile.html: unsupported protocol scheme "file"
exit status 1
Would it be possible to match by attribute?
Is there an option I can pass so that an error is returned if no response has arrived within a given number of seconds?
This isn't an issue; it's an idea/feature request. It'd be cool if there were a way to convert a goquery.Document type to an io.Reader type. The NewDocumentFromReader function does the reverse. A NewReaderFromDocument function would be cool too.
Hey there! Nice job with this package.
I'm a newcomer to Go, so I thought I'd humbly submit the example I'd use to show others. It's just easier for complete beginners to copy and paste this example into a test.go file and simply run it.
package main

import (
	"fmt"
	"log"

	gq "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	var doc *gq.Document
	var e error
	if doc, e = gq.NewDocument("http://metalsucks.net"); e != nil {
		log.Fatal(e)
	}
	doc.Find(".reviews-wrap article .review-rhs").Each(func(i int, s *gq.Selection) {
		band := s.Find("h3").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}
Hi,
I have an HTML element:
[input onkeypress="if(event.keyCode == 13){processHash('Search')}" class="jq-zoho-search-input" type="text" id="searchInputBox" accesskey="f" title="Search jQuery" name="search"]
Is there a way to get all attributes of the selected element? I am expecting to get:
onkeypress="if(event.keyCode == 13){processHash('Search')}"
class="jq-zoho-search-input"
type="text"
id="searchInputBox"
accesskey="f"
title="Search jQuery"
name="search"
I know that there is the Attr function, but I want all of them in one list: I select an element, run a function (GetAttributes, for example), and get the list of its attributes.
Is this possible with the current version of GoQuery? I can't find it.
It depends on Go's experimental html package, which must be installed so that it can be imported as "code.google.com/p/go.net/html". See this tutorial on how to install it accordingly: http://code.google.com/p/go-wiki/wiki/InstallingExp
I'm new to Go, so I'm not having much luck here.
You refer to the https://code.google.com/p/go-wiki/wiki/InstallingExp page for installing the experimental libraries; however, I think this page has changed since you wrote your instructions. The example given on that page does not work either.
I think that the HTML package may have been moved out of experimental? I'm not familiar enough with Go to know. Either way, I cannot get goquery installed.
sed -i 's|"code\.google\.com/p/go\.|"golang.org/x/|' $(find . -name '*.go')
How do we handle the new path for go.html?
https://groups.google.com/forum/#!topic/golang-nuts/eD8dh3T9yyA
On Windows XP (i386), go get fetches old code that does not have NewDocumentFromReader().
Can you give an example for this question?
I am using goquery to scan existing HTML files. These aren't created from a response, and I don't want to expose go.net/html to my application.
A NewFromString or NewFromReader would be great... or just a simple New:
func New(src io.Reader) (d *Document, e error) {
	// Parse the HTML into nodes
	root, e := html.Parse(src)
	if e != nil {
		return
	}
	// Create and fill the document
	d = newDocument(root, nil)
	return
}
BTW nice library, thank you very much.
Can I remove a selection's child selection?
e.g. ":first", ":eq(3)", ":lt", ":gt", and so on
Sorry for the newb question. I'm learning Go and trying to extract the MP4 from a Vine link. I'm trying to follow the example and extract the "src" attribute from the "video" HTML tag, but I get this error when running:
# command-line-arguments
./main.go:17: multiple-value s.Find("video").Attr() in single-value context
I was wondering if I need to select the attribute a different way?
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func getMP4URL() {
	doc, err := goquery.NewDocument("https://vine.co/v/MlWtKgwh7WY")
	if err != nil {
		log.Fatal(err)
	}
	doc.Find(".vine-video-container").Each(func(i int, s *goquery.Selection) {
		mp4 := s.Find("video").Attr("src")
		fmt.Printf("MP4 %d: %s\n", i, mp4)
	})
}

func main() {
	getMP4URL()
}
Hi,
In the README, in the sentence "GoQuery's complete godoc reference documentation can be found here", please change the destination URL from:
http://go.pkgdoc.org/github.com/puerkitobio/goquery
To:
http://go.pkgdoc.org/github.com/PuerkitoBio/goquery
because the link currently points to a location where no documentation can be found.
Please :)
This actually is a question; I was wondering,
Now in go.net package:
https://groups.google.com/forum/?fromgroups=#!topic/golang-nuts/Qq5hTQyPuLg
I'm trying to build this package inside a container:
Step 8 : RUN go get github.com/PuerkitoBio/goquery ---> Running in e81d2861928f
# github.com/PuerkitoBio/goquery
gopath/src/github.com/PuerkitoBio/goquery/filter.go:116: cannot use sel.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
gopath/src/github.com/PuerkitoBio/goquery/filter.go:116: cannot use cs.Filter(sel.Nodes) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in return argument
gopath/src/github.com/PuerkitoBio/goquery/filter.go:120: cannot use s.Get(0) (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
gopath/src/github.com/PuerkitoBio/goquery/query.go:20: cannot use s.Nodes[0] (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
gopath/src/github.com/PuerkitoBio/goquery/query.go:22: cannot use s.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
gopath/src/github.com/PuerkitoBio/goquery/traversal.go:105: cannot use n (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
gopath/src/github.com/PuerkitoBio/goquery/traversal.go:385: cannot use c (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to sel.MatchAll
gopath/src/github.com/PuerkitoBio/goquery/traversal.go:385: cannot use sel.MatchAll(c) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in append
2014/11/06 15:01:48 The command [/bin/sh -c go get github.com/PuerkitoBio/goquery] returned a non-zero code: 2
Wondering if anyone can give me some insight?
When I execute go get github.com/PuerkitoBio/goquery, I get an error that says: imports exp/html: unrecognized import path "exp/html".
When the HTML head includes charset=gbk, Text() returns incorrect text.
I'm trying to split a string using the &nbsp; entity. This works with Jsoup for Java, but here it gets changed to a normal space, making what I want to do impossible.
Does Selection not have a Remove method?
Thank you for your work :)
Hi, I ran into a problem when parsing this website:
<title>Saturday Night Live: The Best of Chris Kattan (2004)</title>
The Go code looks like this:
for _, n := range doc.Find("body").Children().Not("style").Not("script").Nodes {
	buf.WriteString(getNodeText(n))
}
What I finally get is only the "Back to Movie index". I don't quite understand why.
doc.Find("dl[class='brand_tree'] dd ul li").Each(func(index int, s *goquery.Selection) {
	brandLiId, exists := s.Attr("id")
	if exists == false {
		return // End "Each" ?
	}
	//......
})
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x459f16]
goroutine 5196 [running]:
github.com/PuerkitoBio/goquery.func·008(0x0, 0x0, 0x0, 0x0, 0x0, ...)
/Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:383 +0x46
github.com/PuerkitoBio/goquery.mapNodes(0xc2091358f0, 0x1, 0x1, 0x2afca5fda5f8, 0x0, ...)
/Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:532 +0x8f
github.com/PuerkitoBio/goquery.findWithSelector(0xc2091358f0, 0x1, 0x1, 0x718dd0, 0x25, ...)
/Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:389 +0x8d
github.com/PuerkitoBio/goquery.(*Selection).Find(0xc208006660, 0x718dd0, 0x25, 0xc2000b3000)
/Users/jinke/golang/src/github.com/PuerkitoBio/goquery/traversal.go:27 +0x45
main.qiche4sListSpider(0xc2000aca40, 0x2, 0xc20487e400, 0x32, 0x1512, ...)
/Users/jinke/golang/src/cds_spider/price/main/bitauto.go:195 +0x406
main.func·005(0x1, 0xc2007fe180)
/Users/jinke/golang/src/cds_spider/price/main/bitauto.go:165 +0x20e
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc20125a480, 0x2afca5fdadb8, 0x25)
/Users/jinke/golang/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0xf7
main.qiche4sSpider(0xc2000aca40, 0x1447, 0x6b80f0, 0x0, 0x1512, ...)
/Users/jinke/golang/src/cds_spider/price/main/bitauto.go:166 +0x5e5
main.func·008(0xc2000aca40, 0x1447, 0x6b80f0, 0x0, 0x1512, ...)
/Users/jinke/golang/src/cds_spider/price/main/bitauto.go:263 +0xb6
created by cds_spider/price/frame.(*Frame).Start
/Users/jinke/golang/src/cds_spider/price/frame/frame.go:116 +0x29c
code :
193 root, _ := html.Parse(res.Body)
194 document := goquery.NewDocumentFromNode(root)
195 selections := document.Find("div[class='lm_subprice_blc'] table tr")
196 if selections.Size() == 0 {
197 //.....
198 return
199 }
Could "root, _ := html.Parse(res.Body)" be what is wrong here?
update:
root, err := html.Parse(res.Body)
if err != nil {
//.......
return
}
I had a look at your tests and you're only testing updating attributes. I confirmed this with:
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	doc, _ := goquery.NewDocumentFromReader(strings.NewReader(`
<html>
<body>
<input id="foo" placeholder="123"/>
</body>
</html>
`))
	doc.Find("#foo").SetAttr("value", "bar").SetAttr("placeholder", "456")
	out, _ := doc.Html()
	fmt.Print(out)
}
$ go run example.go
<html><head></head><body>
<input id="foo" placeholder="456"/>
</body></html>
e.g.
...
doc.Find("img[src]")
...
< img src="assets/images/gallery/thumb-1.jpg" alt="150x150"/> is ok
< img alt="150x150" src="assets/images/gallery/thumb-1.jpg" /> will not match
If I want to modify a document, then RemoveAttr and RemoveNode are needed.
uri = "http://www.google.com"
doc, e := goquery.NewDocument(uri)
if e != nil {
beego.Error(e)
}
form := doc.Find("input").Each(func(j int, input *goquery.Selection) {
println(input.Html())
})
The input is nil. Help!
Hi,
I can't solve one thing.
I have this selection:
[h1 class="title"]
Some kind of title [a href="url" class="encore"][span class="comm red"][/span][/a]
[/h1]
I select the element:
sel := dom.Find("h1[class=title]")
then
sel = sel.Not("a")
After this I expect the "a" element to be removed along with all its children.
Then I call
html, _ = sel.Html()
fmt.Println(html)
and I get the title with the "a" element still in it.
I am probably doing something wrong; in my case I need to remove elements from my selection.
Thanks
Bug
root@go-hacking:~/gocrawl# go get github.com/PuerkitoBio/goquery
/usr/lib/go/src/pkg/code.google.com/p/go.net/html/token.go:289: undefined: io.ErrNoProgress
OS:
Linux go-hacking 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux
I saw this: #20 for NewDocumentFromReader. I could also really use NewDocumentFromString... I have the unfortunate responsibility of "cleaning" some HTML that lives in an XML pseudo-RSS feed and re-saving it. So by the time I get down to the parts where I'd need to manipulate the HTML, I'm within a loop of strings.
After executing go get github.com/PuerkitoBio/goquery, I got the following prompt:
abort: code.google.com certificate error: certificate is for *.googleusercontent.com, *.blogspot.com, *.bp.blogspot.com, *.commondatastorage.googleapis.com, *.doubleclickusercontent.com, *.ggpht.com, *.googledrive.com, *.googlesyndication.com, *.sandbox.googleusercontent.com, *.storage.googleapis.com, blogspot.com, bp.blogspot.com, commondatastorage.googleapis.com, doubleclickusercontent.com, ggpht.com, googledrive.com, googleusercontent.com, static.panoramio.com.storage.googleapis.com, storage.googleapis.com
(configure hostfingerprint 70:03:d2:44:35:d0:d4:64:85:f0:3e:c8:15:9c:4d:e7:59:91:50:0d or use --insecure to connect insecurely)
package github.com/PuerkitoBio/goquery
imports code.google.com/p/cascadia: exit status 255
I'm a newbie to Golang. Could anyone tell me how to fix it? Thanks!
The exp/html API must have changed, since cascadia relies on functions that don't exist anymore.
Here are some terminal printouts for your pleasure:
$ go get github.com/PuerkitoBio/goquery
# code.google.com/p/cascadia
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:18: n.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:30: n.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:70: n.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:257: n.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:274: n.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:354: parent.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:399: parent.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:419: n.FirstChild undefined (type *html.Node has no field or method FirstChild)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:465: n.PrevSibling undefined (type *html.Node has no field or method PrevSibling)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:468: n.PrevSibling undefined (type *html.Node has no field or method PrevSibling)
/usr/local/go/src/pkg/code.google.com/p/cascadia/selector.go:468: too many errors
...
Can we have a way of configuring the User-Agent?
I've been running into problems recently because of it :(
I see NextUntil and NextFilteredUntil, which are awesome. Is there a way to select the next elements while they match a certain selector? For example, I know I need to select a bunch of <p> elements but I don't know what the next non-p tag will be, so I don't know what to put in for "until".
If there isn't a way to do this currently, would you like me to try implementing this and submitting a pull request?
How do I use the library? Can you give more examples that demonstrate it?
Are the docs outdated?
Running an Each statement such as the following
r.Doc.Find("form").Each(func(i int, form *goquery.Selection) {
})
gives me undefined: goquery.Selection.
It seems to return *goquery.Node instead, which means that I can't call .Find() on the form object.
What am I overlooking?
Say the HTML document I'm parsing has the attribute class="1post". Is there a way to use the Find() function for this class? If I run doc.Find(".1post"), I get this error:
panic: expected identifier, found 1 instead
This might be a cascadia issue. Do you know of a way around it?
When I run $ go run main.go
I get this:
# command-line-arguments
./main.go:5: imported and not used: "github.com/PuerkitoBio/goquery"
./main.go:9: undefined: NewDocument
./main.go:13: undefined: doc
The file main.go is the example.
I adjusted it so that it can run standalone: I added package main and the corresponding import (import "github.com/PuerkitoBio/goquery").
Before this I ran go get github.com/PuerkitoBio/goquery
Hello, I'm using your package as a testing library in my project, but as of today my builds have started to fail. The CI logs print the following error:
# github.com/PuerkitoBio/goquery
../../PuerkitoBio/goquery/filter.go:116: cannot use sel.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
../../PuerkitoBio/goquery/filter.go:116: cannot use cs.Filter(sel.Nodes) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in return argument
../../PuerkitoBio/goquery/filter.go:120: cannot use s.Get(0) (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
../../PuerkitoBio/goquery/query.go:20: cannot use s.Nodes[0] (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
../../PuerkitoBio/goquery/query.go:22: cannot use s.Nodes (type []*"code.google.com/p/go.net/html".Node) as type []*"golang.org/x/net/html".Node in argument to cs.Filter
../../PuerkitoBio/goquery/traversal.go:105: cannot use n (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to cs.Match
../../PuerkitoBio/goquery/traversal.go:385: cannot use c (type *"code.google.com/p/go.net/html".Node) as type *"golang.org/x/net/html".Node in argument to sel.MatchAll
../../PuerkitoBio/goquery/traversal.go:385: cannot use sel.MatchAll(c) (type []*"golang.org/x/net/html".Node) as type []*"code.google.com/p/go.net/html".Node in append
FAIL github.com/9uuso/vertigo [build failed]
I tried using different Go versions, but the command seems to fail on at least Go 1.2, 1.3 and 1.3.1.
doc.Html() returns the string wrapped in the html structure. How can I get just the content as a string, without the html tag, head tag, and so on?
Can I use Select{nil, nil, nil} ?
First I would like to thank you for your code, it eases my life 👍
The issue I ran into
The source page is as follows:
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Test page</title>
</head>
<body>
<h1>This is the Test page for a crawler</h1>
<p>Before getting the Admission of.</p>
</body>
</html>
doc.Find("body").Children().Not("style").Not("script").Text()
gave me the result:
This is the Test page for a crawlerBefore getting the Admission of
Why are "crawler" and "Before" not separated? I don't think windows-1252 is the problem. Maybe something is wrong? I have not read through the source code yet.
like
<input type="hidden" name="size" value="S">
tks
Can it run JS and handle the DOM or BOM? Ajax is important for Web 2.0.