advancedlogic / goose
Html Content / Article Extractor in Golang
License: Apache License 2.0
Extracting content from the schema.org articleBody property (http://schema.org/Article) doesn't work.
For example, try this URL in your example:
http://blog.schema.org/2014/09/schemaorg-v191-offerprice-documentation.html
GoOse won't find and extract any content.
# github.com/advancedlogic/GoOse
../github.com/advancedlogic/GoOse/extractor.go:216: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/extractor.go:234: cannot use tags (type set.Interface) as type *set.Set in return argument: need type assertion
../github.com/advancedlogic/GoOse/extractor.go:250: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/stopwords.go:20: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/stopwords.go:20: cannot use set.New() (type set.Interface) as type *set.Set in assignment: need type assertion
../github.com/advancedlogic/GoOse/stopwords.go:69: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/stopwords.go:85: cannot use stopWords (type set.Interface) as type *set.Set in assignment: need type assertion
../github.com/advancedlogic/GoOse/videos.go:28: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/videos.go:28: cannot use set.New() (type set.Interface) as type *set.Set in field value: need type assertion
../github.com/advancedlogic/GoOse/videos.go:29: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/videos.go:29: too many errors
Can this library extract dates? It seems article{} has a field for it, but maybe it's not supported yet?
Some examples of how dates appear in meta:
<meta content="2014-06-16T06:01:15.750Z" property="article:published_time"/>
<meta content="2014-06-16T06:01:15.750Z" property="article:modified_time"/>
<span>
<time class="articlepublishdate" datetime="2015-09-04" itemprop="datePublished">4 September, 2015
</span>
<meta itemprop="datePublished" content="2009-05-08">
Here is my test URL :
http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html
The code :
package main

import (
	"github.com/advancedlogic/GoOse"
)

func main() {
	g := goose.New()
	article := g.ExtractFromUrl("http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html")
	println("title : ", article.Title)
	println("content : ", article.CleanedText[0:150])
}
The output :
title : 2 Indicted in George Washington Bridge Case; Ally of Christie Pleads Guilty
content : Continue reading the main story
Continue reading the main story
After a 16-month federal investigation into the George Washington Bridge lane closin
The source from NY Times contains the following :
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
<span class="sharetools-label visually-hidden">Share This Page</span>
<div class="ad sharetools-inline-article-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
<div id="MiddleLeft" class="ad middle-left-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
</div>
Could we think of a regexp that would remove text when classes contain "hidden" ?
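One possible alternative to a regexp over the raw HTML is to split the class attribute into tokens and compare each against a small deny list. This avoids matching substrings of unrelated class names. The sketch below is only an illustration: hasHiddenClass and the hiddenTokens list are hypothetical names and not part of GoOse, and the caller is assumed to have the class attribute value already.

```go
package main

import (
	"fmt"
	"strings"
)

// hiddenTokens are class names treated as "not visible to the reader".
// The list is an assumption based on the NY Times markup quoted above.
var hiddenTokens = map[string]bool{
	"hidden":          true,
	"visually-hidden": true,
}

// hasHiddenClass reports whether any whitespace-separated token of a
// class attribute value is in hiddenTokens (case-insensitive).
func hasHiddenClass(classAttr string) bool {
	for _, token := range strings.Fields(classAttr) {
		if hiddenTokens[strings.ToLower(token)] {
			return true
		}
	}
	return false
}

func main() {
	for _, class := range []string{
		"visually-hidden skip-to-text-link",
		"ad middle-left-ad hidden nocontent robots-nocontent",
		"story-body-text",
	} {
		fmt.Printf("%q hidden=%v\n", class, hasHiddenClass(class))
	}
}
```

A node whose class attribute passes this check could then be dropped before the text is collected. Whether that is safe for every site is an open question, since some pages use such classes for content that is revealed later.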
Would it be possible to introduce a new method alongside Crawl, or change the existing one, so that an error is returned as the last return value rather than panicking? It'd be nice to obtain the error at any stage of the crawl (fetching, parsing, etc.) so it can be handled accordingly.
An error occurred when running go get github.com/advancedlogic/GoOse:
module declares its path as: github.com/fino-digital/GoOse
but was required as: github.com/advancedlogic/GoOse
The module declaration in go.mod is:
module github.com/fino-digital/GoOse
Is it supposed to be github.com/advancedlogic/GoOse instead?
Go version: 1.13
for example, try this url: http://www.alibaba.com/product-detail/all-purpose-custom-display-stand-all_60134717814.html
cleaned result includes:
<img src=
"http://g01.s.alicdn.com/kf/HTB1HjVWGVXXXXa1XXXXq6xXFXXXo/222119091/HTB1HjVWGVXXXXa1XXXXq6xXFXXXo.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"750"><img src=
"http://g03.s.alicdn.com/kf/HTB1RWaYGFXXXXc.aXXXq6xXFXXXA/222119091/HTB1RWaYGFXXXXc.aXXXq6xXFXXXA.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"750"><img src=
"http://g04.s.alicdn.com/kf/HTB19NqTGFXXXXb8aXXXq6xXFXXXJ/222119091/HTB19NqTGFXXXXb8aXXXq6xXFXXXJ.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"350"><img src="http://g02.s.alicdn.com/kf/HTB1zZq0GFXXXXXPapXXq6xXFXXXn/222119091/HTB1zZq0GFXXXXXPapXXq6xXFXXXn.jpg"
style="border: 0px currentColor;" ori-width="750" ori-height="350"
alt="alibaba stand up paper display stand"><img src="http://g02.s.alicdn.com/kf/HTB12Ou1GFXXXXcMaXXXq6xXFXXX4/222119091/HTB12Ou1GFXXXXcMaXXXq6xXFXXX4.jpg"
style="border: 0px currentColor;" ori-width="750" ori-height="350"
alt="alibaba stand up paper display stand"><img src=
"http://g04.s.alicdn.com/kf/HTB1dy6tGFXXXXcmXVXXq6xXFXXXP/222119091/HTB1dy6tGFXXXXcmXVXXq6xXFXXXP.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"350"><img src="http://g02.s.alicdn.com/kf/HTB1X_vxGFXXXXc6aXXXq6xXFXXXW/222119091/HTB1X_vxGFXXXXc6aXXXq6xXFXXXW.jpg"
style="border: 0px currentColor;" ori-width="750" ori-height="350"
alt="alibaba stand up paper display stand">
go get github.com/advancedlogic/GoOse
fails with:
package code.google.com/p/cascadia: unable to detect version control system for code.google.com/ path
The cascadia package has been moved to https://github.com/andybalholm/cascadia
I was digging through the struct for article and noticed there wasn't an entry for author. Any plans for this?
I'm seeing panics in certain situations at line 66 of crawler.go.
I made a fork and will see if I can fix it or work around it, but any thoughts are appreciated.
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x444624]
goroutine 198 [running]:
io/ioutil.func·002()
/root/.gvm/gos/go1.4rc2/src/io/ioutil/ioutil.go:30 +0x103
bytes.(*Buffer).ReadFrom(0xc20c2b1340, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/.gvm/gos/go1.4rc2/src/bytes/buffer.go:169 +0x254
io/ioutil.readAll(0x0, 0x0, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/.gvm/gos/go1.4rc2/src/io/ioutil/ioutil.go:33 +0x1b0
io/ioutil.ReadAll(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/.gvm/gos/go1.4rc2/src/io/ioutil/ioutil.go:42 +0x68
github.com/marketmuse/GoOse.Crawler.Crawl(0x7dd470, 0x0, 0x1194, 0x101, 0x7ec390, 0x2, 0x825570, 0x10, 0x8255b0, 0x11, ...)
/root/.gvm/pkgsets/go1.4rc2/global/src/github.com/marketmuse/GoOse/crawler.go:66 +0xf53
github.com/marketmuse/GoOse.Goose.ExtractFromRawHtml(0x7dd470, 0x0, 0x1194, 0x101, 0x7ec390, 0x2, 0x825570, 0x10, 0x8255b0, 0x11, ...)
/root/.gvm/pkgsets/go1.4rc2/global/src/github.com/marketmuse/GoOse/goose.go:21 +0x12b
For example:
The "true" finalURL is https://medium.com/@thelatemercutio/waltonchain-partners-mobius-a-protocol-for-the-future-of-blockchain-86d9a1d417a0 however.
Is there any way to fix this so that "FinalURL" holds the true value, e.g. by specifying the maximum number of redirects?
Moved, maybe to golang.org/x/net/html/charset
It looks like the fork of goquery fixes some import paths from google code to github, but the upstream has already fixed those as well, among other things.
Current code sets a Content-Type header of application/json to request HTML. It should use text/html.
This causes some URL requests to fail when servers pay attention to this header and don't have any way of serving JSON.
Opening a PR.
Hi,
The stopword list and language detection code are rather useful on their own. Would you consider moving them into their own library, so that they can be used more easily by code that doesn't necessarily deal with HTML?
I am using GoOse to extract content from some websites, but if the URL looks like "http://domain.com/?cat=food" the resulting text is empty. This only happens with this kind of URL; a URL of the form "http://domain.com/cat/food" works fine. Any comments?
Is there anything blocking switching back to upstream goquery?
Line 4 in 9df1c62
The field PublishDate seems to return nil on every page.
I am getting this error. Why do I need to pass two arguments if I already have the raw HTML in a variable?
not enough arguments in call to g.ExtractFromRawHtml
article.CleanedText is always empty
I tried to use it on some of my medium articles and it returned nothing. I tested on other medium articles and noticed that it does not work on any of them.
I already use goquery in my project. It would be really nice to be able to use all of GoOse's features while avoiding extra document parsing, since I already parse the document before calling GoOse.
Hey there!
I belong to an open source security research community, and a member (@akincibor) has found an issue, but doesn’t know the best way to disclose it.
If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.
Thank you for your consideration, and I look forward to hearing from you!
(cc @huntr-helper)
If you try crawling e.g. these URLs you get strange results, but nothing from the actual article text:
The library https://github.com/fatih/set was updated recently and GoOse is trying to use an old version (resulting in the dependency not being installed at all).
The gopkg.in/fatih/set.v0 version in Gopkg.toml probably needs an update (or a switch to the GitHub version: github.com/fatih/set).
Here is the error log:
go version go1.10.3 linux/amd64
# github.com/advancedlogic/GoOse
../../advancedlogic/GoOse/extractor.go:253:17: not enough arguments in call to set.New
have ()
want (set.SetType)
../../advancedlogic/GoOse/extractor.go:270:2: cannot use tags (type set.Interface) as type *set.Set in return argument: need type assertion
../../advancedlogic/GoOse/extractor.go:371:24: not enough arguments in call to set.New
have ()
want (set.SetType)
../../advancedlogic/GoOse/stopwords.go:24:25: cannot use set.New() (type set.Interface) as type *set.Set in assignment: need type assertion
../../advancedlogic/GoOse/stopwords.go:24:34: not enough arguments in call to set.New
have ()
want (set.SetType)
../../advancedlogic/GoOse/stopwords.go:74:22: not enough arguments in call to set.New
have ()
want (set.SetType)
../../advancedlogic/GoOse/stopwords.go:90:15: cannot use stopWords (type set.Interface) as type *set.Set in assignment: need type assertion
../../advancedlogic/GoOse/videos.go:30:13: cannot use set.New() (type set.Interface) as type *set.Set in field value: need type assertion
../../advancedlogic/GoOse/videos.go:30:22: not enough arguments in call to set.New
have ()
want (set.SetType)
../../advancedlogic/GoOse/videos.go:31:22: not enough arguments in call to set.New
have ()
want (set.SetType)
../../advancedlogic/GoOse/videos.go:31:22: too many errors