goose's People

Contributors

advancedlogic, akibalogh, bkaradzic, codelingobot, cutd, dependabot[bot], dhowden, dotpot, ejamesc, jamesbloomer, jaytaylor, jeffail, lemoussel, marcosinger, nicolaasuni, pedox, profpatsch, quipo, rahal, rubenn, shaneiseminger, sisteamnik, syou6162, thesoenke, truongsinh, urandom, wkornewald

goose's Issues

Error when trying to install

# github.com/advancedlogic/GoOse
../github.com/advancedlogic/GoOse/extractor.go:216: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/extractor.go:234: cannot use tags (type set.Interface) as type *set.Set in return argument: need type assertion
../github.com/advancedlogic/GoOse/extractor.go:250: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/stopwords.go:20: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/stopwords.go:20: cannot use set.New() (type set.Interface) as type *set.Set in assignment: need type assertion
../github.com/advancedlogic/GoOse/stopwords.go:69: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/stopwords.go:85: cannot use stopWords (type set.Interface) as type *set.Set in assignment: need type assertion
../github.com/advancedlogic/GoOse/videos.go:28: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/videos.go:28: cannot use set.New() (type set.Interface) as type *set.Set in field value: need type assertion
../github.com/advancedlogic/GoOse/videos.go:29: not enough arguments in call to set.New
../github.com/advancedlogic/GoOse/videos.go:29: too many errors

Date extraction support

Can this library extract dates? The article struct seems to have a field for it, but maybe it isn't supported yet?

Some examples of how dates appear in meta:

<meta content="2014-06-16T06:01:15.750Z" property="article:published_time"/>
<meta content="2014-06-16T06:01:15.750Z" property="article:modified_time"/>
<span>
<time class="articlepublishdate" datetime="2015-09-04" itemprop="datePublished">4 September, 2015</time>
</span>
<meta itemprop="datePublished" content="2009-05-08">
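
As a rough illustration of what date support could involve, here is a minimal sketch that parses the attribute values shown above once they have been pulled out of the markup. The helper name parsePublishedDate and the layout list are assumptions made for this example; they are not part of GoOse.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// dateLayouts lists the layouts tried in order, most specific first.
// Both entries correspond to values quoted in this issue.
var dateLayouts = []string{
	time.RFC3339, // e.g. 2014-06-16T06:01:15.750Z (fractional seconds are accepted when parsing)
	"2006-01-02", // e.g. 2015-09-04
}

// parsePublishedDate is a hypothetical helper: it takes the raw value of a
// content or datetime attribute and returns the first successful parse.
func parsePublishedDate(value string) (time.Time, error) {
	value = strings.TrimSpace(value)
	for _, layout := range dateLayouts {
		if t, err := time.Parse(layout, value); err == nil {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized date format: %q", value)
}
```

Human-readable text such as "4 September, 2015" is deliberately not handled here; the machine-readable attribute is usually the safer source.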

Extract hidden text from NY Times

Here is my test URL:
http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html

The code:

package main

import (
    "github.com/advancedlogic/GoOse"
)

func main() {
    g := goose.New()
    article := g.ExtractFromUrl("http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html")
    println("title : ", article.Title)
    println("content : ", article.CleanedText[0:150])
}

The output:

title :  2 Indicted in George Washington Bridge Case; Ally of Christie Pleads Guilty
content :  Continue reading the main story

Continue reading the main story

After a 16-month federal investigation into the George Washington Bridge lane closin

The NY Times page source contains the following:

<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
<span class="sharetools-label visually-hidden">Share This Page</span>
<div class="ad sharetools-inline-article-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
<div id="MiddleLeft" class="ad middle-left-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
</div>

Could we come up with a regexp that removes text from elements whose class list contains "hidden"?
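
One possible shape for such a check, as a sketch only: the pattern below tests a class attribute value for a token that is "hidden" or ends in "-hidden". The names hiddenClass and hasHiddenClass are made up for this example, and wiring the check into GoOse's cleaner is left open.

```go
package main

import "regexp"

// hiddenClass matches a class attribute value in which one
// whitespace-separated token is "hidden" or ends in "-hidden"
// (for example "visually-hidden").
var hiddenClass = regexp.MustCompile(`(?:^|\s)(?:[\w-]+-)?hidden(?:\s|$)`)

// hasHiddenClass reports whether the given class attribute value marks the
// element as hidden according to the pattern above.
func hasHiddenClass(classAttr string) bool {
	return hiddenClass.MatchString(classAttr)
}
```

Matching whole tokens rather than the bare substring avoids false positives on class names such as "hiddenfield" or "unhidden".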

Avoiding panics

Would it be possible to introduce a new method alongside Crawl, or change the existing one, so that an error is returned as the last return value rather than panicking? It would be nice to obtain the error at any stage of the crawl (fetching, parsing, etc.) so it can be handled accordingly.
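
Until the library itself returns errors, a caller can wrap the call and turn a panic into an error. The sketch below shows that general recover pattern; extractSafely and its func() string parameter are placeholders for this example, not GoOse's API.

```go
package main

import "fmt"

// extractSafely runs extract and converts any panic it raises into an error,
// so the caller can handle failures without crashing.
func extractSafely(extract func() string) (text string, err error) {
	defer func() {
		if r := recover(); r != nil {
			// The named result err is set here, after the panic unwinds.
			err = fmt.Errorf("extraction panicked: %v", r)
		}
	}()
	return extract(), nil
}
```

This only reports that something failed; it cannot tell fetching errors from parsing errors, which is why returning errors from inside the library would still be preferable.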

Error when running go get

An error occurred when running go get github.com/advancedlogic/GoOse:

module declares its path as: github.com/fino-digital/GoOse
but was required as: github.com/advancedlogic/GoOse

The module declaration in go.mod is:

module github.com/fino-digital/GoOse

Is it supposed to be:

github.com/advancedlogic/GoOse

instead?

Go version: 1.13

img tags not removed if tags contain \n formatting

For example, try this URL: http://www.alibaba.com/product-detail/all-purpose-custom-display-stand-all_60134717814.html

The cleaned result includes:

<img src=
"http://g01.s.alicdn.com/kf/HTB1HjVWGVXXXXa1XXXXq6xXFXXXo/222119091/HTB1HjVWGVXXXXa1XXXXq6xXFXXXo.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"750"><img src=
"http://g03.s.alicdn.com/kf/HTB1RWaYGFXXXXc.aXXXq6xXFXXXA/222119091/HTB1RWaYGFXXXXc.aXXXq6xXFXXXA.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"750"><img src=
"http://g04.s.alicdn.com/kf/HTB19NqTGFXXXXb8aXXXq6xXFXXXJ/222119091/HTB19NqTGFXXXXb8aXXXq6xXFXXXJ.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"350"><img src="http://g02.s.alicdn.com/kf/HTB1zZq0GFXXXXXPapXXq6xXFXXXn/222119091/HTB1zZq0GFXXXXXPapXXq6xXFXXXn.jpg"
style="border: 0px currentColor;" ori-width="750" ori-height="350"
alt="alibaba stand up paper display stand"><img src="http://g02.s.alicdn.com/kf/HTB12Ou1GFXXXXcMaXXXq6xXFXXX4/222119091/HTB12Ou1GFXXXXcMaXXXq6xXFXXX4.jpg"
style="border: 0px currentColor;" ori-width="750" ori-height="350"
alt="alibaba stand up paper display stand"><img src=
"http://g04.s.alicdn.com/kf/HTB1dy6tGFXXXXcmXVXXq6xXFXXXP/222119091/HTB1dy6tGFXXXXcmXVXXq6xXFXXXP.jpg"
alt=
"all purpose custom display stand ,all kinds of metal display stand ,alibaba stand up paper display stand"
style="border: 0px currentColor;" ori-width="750" ori-height=
"350"><img src="http://g02.s.alicdn.com/kf/HTB1X_vxGFXXXXc6aXXXq6xXFXXXW/222119091/HTB1X_vxGFXXXXc6aXXXq6xXFXXXW.jpg"
style="border: 0px currentColor;" ori-width="750" ori-height="350"
alt="alibaba stand up paper display stand">
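
A likely cause, though this is an assumption and not confirmed against GoOse's source, is a pattern that relies on ".", which does not match newlines in Go regular expressions by default. A minimal sketch of a newline-tolerant alternative, with the hypothetical names imgTag and stripImgTags:

```go
package main

import "regexp"

// imgTag matches an <img ...> tag. The class [^>] matches any character
// except '>', including newlines, so attributes split across lines are
// still covered.
var imgTag = regexp.MustCompile(`(?i)<img\b[^>]*>`)

// stripImgTags removes every <img> tag from the given markup.
func stripImgTags(html string) string {
	return imgTag.ReplaceAllString(html, "")
}
```

A regexp like this will stop early if an attribute value itself contains ">"; removing the nodes through an HTML parser would be more robust.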

article author info?

I was digging through the struct for article and noticed there wasn't an entry for author. Any plans for this?

charset panics

I'm seeing panics in certain situations at line 66 of crawler.go.
I made a fork and will see if I can fix it or work around it. Any thoughts are appreciated.

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x444624]

goroutine 198 [running]:
io/ioutil.func·002()
        /root/.gvm/gos/go1.4rc2/src/io/ioutil/ioutil.go:30 +0x103
bytes.(*Buffer).ReadFrom(0xc20c2b1340, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/.gvm/gos/go1.4rc2/src/bytes/buffer.go:169 +0x254
io/ioutil.readAll(0x0, 0x0, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/.gvm/gos/go1.4rc2/src/io/ioutil/ioutil.go:33 +0x1b0
io/ioutil.ReadAll(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/.gvm/gos/go1.4rc2/src/io/ioutil/ioutil.go:42 +0x68
github.com/marketmuse/GoOse.Crawler.Crawl(0x7dd470, 0x0, 0x1194, 0x101, 0x7ec390, 0x2, 0x825570, 0x10, 0x8255b0, 0x11, ...)
        /root/.gvm/pkgsets/go1.4rc2/global/src/github.com/marketmuse/GoOse/crawler.go:66 +0xf53
github.com/marketmuse/GoOse.Goose.ExtractFromRawHtml(0x7dd470, 0x0, 0x1194, 0x101, 0x7ec390, 0x2, 0x825570, 0x10, 0x8255b0, 0x11, ...)
        /root/.gvm/pkgsets/go1.4rc2/global/src/github.com/marketmuse/GoOse/goose.go:21 +0x12b
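
The trace shows ioutil.ReadAll being called with all-zero arguments, which is consistent with a nil reader being passed in. Assuming that is the cause, a guard along the following lines would turn the panic into an ordinary error. readBody and errNilBody are names invented for this sketch.

```go
package main

import (
	"errors"
	"io"
)

// errNilBody is returned when no reader is available to read from.
var errNilBody = errors.New("nil body reader")

// readBody reads everything from r, but returns an error instead of
// panicking when r is a nil interface value.
func readBody(r io.Reader) ([]byte, error) {
	if r == nil {
		return nil, errNilBody
	}
	return io.ReadAll(r)
}
```

io.ReadAll requires Go 1.16 or later; on older toolchains such as the one in the trace, ioutil.ReadAll is the equivalent call.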

Wrong Content-Type in request

Current code sets a Content-Type header of application/json to request HTML. It should use text/html.

This causes some URL requests to fail when servers pay attention to this header and don't have any way of serving JSON.

Opening a PR.
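
For reference, a sketch of a request that asks for HTML, using only net/http. It sets Accept, which is the header a client normally uses to state the response type it wants; whether the PR takes this route or changes Content-Type as described above is not stated here. The helper name newHTMLRequest is an assumption.

```go
package main

import "net/http"

// newHTMLRequest builds a GET request that advertises text/html as the
// desired response type. No Content-Type is set because the request has
// no body.
func newHTMLRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "text/html")
	return req, nil
}
```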

cannot extract from raw html

I am getting the error below. Why do I need to pass two arguments when I already have the raw HTML in a variable?
not enough arguments in call to g.ExtractFromRawHtml

Does not work on medium articles

I tried to use it on some of my Medium articles and it returned nothing. I tested it on other Medium articles and found that it does not work on any of them.

Who to contact for security issues

Hey there!

I belong to an open source security research community, and a member (@akincibor) has found an issue, but doesn’t know the best way to disclose it.

If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

Thank you for your consideration, and I look forward to hearing from you!

(cc @huntr-helper)

Installation error due to dependency "https://github.com/fatih/set"

The lib "https://github.com/fatih/set" was updated recently and GoOse is trying to use an old version (resulting in the dependency not being installed at all).

Gopkg.toml probably needs an update for the gopkg.in/fatih/set.v0 version (or a switch to the GitHub version: github.com/fatih/set).

Here is the error log:

go version go1.10.3 linux/amd64

# github.com/advancedlogic/GoOse
../../advancedlogic/GoOse/extractor.go:253:17: not enough arguments in call to set.New
    have ()
    want (set.SetType)
../../advancedlogic/GoOse/extractor.go:270:2: cannot use tags (type set.Interface) as type *set.Set in return argument: need type assertion
../../advancedlogic/GoOse/extractor.go:371:24: not enough arguments in call to set.New
    have ()
    want (set.SetType)
../../advancedlogic/GoOse/stopwords.go:24:25: cannot use set.New() (type set.Interface) as type *set.Set in assignment: need type assertion
../../advancedlogic/GoOse/stopwords.go:24:34: not enough arguments in call to set.New
    have ()
    want (set.SetType)
../../advancedlogic/GoOse/stopwords.go:74:22: not enough arguments in call to set.New
    have ()
    want (set.SetType)
../../advancedlogic/GoOse/stopwords.go:90:15: cannot use stopWords (type set.Interface) as type *set.Set in assignment: need type assertion
../../advancedlogic/GoOse/videos.go:30:13: cannot use set.New() (type set.Interface) as type *set.Set in field value: need type assertion
../../advancedlogic/GoOse/videos.go:30:22: not enough arguments in call to set.New
    have ()
    want (set.SetType)
../../advancedlogic/GoOse/videos.go:31:22: not enough arguments in call to set.New
    have ()
    want (set.SetType)
../../advancedlogic/GoOse/videos.go:31:22: too many errors
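
The two error kinds in this log point at two call-site changes: the constructor now wants a set.SetType argument, and it returns set.Interface, so code that needs a *set.Set has to use a type assertion. The sketch below reproduces that shape with local stand-in types so it compiles on its own. SetType, Interface, Set, New and newStopWords are all mock definitions written for this example, not the fatih/set package, and the method set is an assumption.

```go
package main

// SetType, Interface and Set are local stand-ins that mirror the names in
// the compiler output above. They are NOT the real fatih/set types.
type SetType int

const (
	ThreadSafe SetType = iota
	NonThreadSafe
)

// Interface is the value the constructor returns.
type Interface interface {
	Add(items ...interface{})
	Has(item interface{}) bool
}

// Set is the concrete type the old call sites expected.
type Set struct {
	items map[interface{}]struct{}
}

func (s *Set) Add(items ...interface{}) {
	for _, it := range items {
		s.items[it] = struct{}{}
	}
}

func (s *Set) Has(item interface{}) bool {
	_, ok := s.items[item]
	return ok
}

// New takes a SetType and returns Interface, like the signature the
// compiler reports ("want (set.SetType)").
func New(t SetType) Interface {
	return &Set{items: make(map[interface{}]struct{})}
}

// newStopWords shows both call-site changes: pass a SetType to New, and
// type-assert the result where a *Set is required.
func newStopWords(words ...string) (*Set, bool) {
	s, ok := New(ThreadSafe).(*Set)
	if !ok {
		return nil, false
	}
	for _, w := range words {
		s.Add(w)
	}
	return s, true
}
```

Changing the affected fields and return types to the interface type would avoid the assertions altogether; pinning the dependency to the older release is the other option raised above.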
