
Comments (6)

grangier commented on June 20, 2024

Hello,

First of all, why use a screenshot to report an issue? There is no way to copy/paste the URL.
Regarding the NYT, this seems to be a cookie issue. Even curl is not able to retrieve the raw HTML:

curl -I "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html"
HTTP/1.1 303 See Other
Date: Sun, 18 Aug 2013 14:17:10 GMT
Server: Apache
Set-Cookie: RMID=007f01000c0c5210d766001c; Expires=Mon, 18 Aug 2014 14:17:10 GMT; Path=/; Domain=.nytimes.com;
Vary: Host
Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html&OQ=_rQ3D0&OP=15f69d57Q2FQ3Cg_ZQ3C.tZQ3CsssQ3CjZ!0Q3CzgMQ3AQ7EggZAQ3CART@Q3CRFQ3CTFQ3CsgQ7E0zQ3C!Q60zz0ccQ5CQ3AZQ3C_Q7EcQ3AQ3AQ7BQ7EcQ7CetQ7CQ7BQ3AQ7COQ5CQ600czQ7CZgQ7CQ3AsQ5CtQ7CcQ7Dt_ZQ3AQ7C0cQ5CzcQ7EQ3A3jZ!0
Connection: close
Content-Type: text/plain

Regarding Gizmodo, thanks to Google I found the URL: http://gizmodo.com/the-gear-and-apps-you-need-to-survive-the-next-semester-1141460933

It seems that the data structure of the HTML page is too complicated for goose.

from python-goose.

MojoJolo commented on June 20, 2024

Hi, sorry about using a screenshot. This is my first time reporting an issue. Will take note of it. Thanks for the reply.

Is there any recommendation or fallback for extracting those kinds of websites?


grangier commented on June 20, 2024

The Gizmodo issue should be fixed in the latest HEAD:

>>> url = "http://gizmodo.com/the-gear-and-apps-you-need-to-survive-the-next-semester-1141460933"
>>> import goose
>>> g = goose.Goose()
>>> a = g.extract(url=url)
>>> a.cleaned_text[:150]
u"Okay, this is it. Back to school, again. Whether it's your first college semester or you can see graduation on the horizon, these tools will make the "


grangier commented on June 20, 2024

For the NYT, the issue seems to be cookie handling. I guess commit 4d1ccaf does not help with cookie handling.

At the moment, the only way to extract NYT content is to use the raw_html method:

>>> import urllib2
>>> import goose
>>> 
>>> # fetch html
... url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp"
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response = opener.open(url)
>>> raw_html = response.read()
>>> 
>>> # goose
... g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs thousands of Islamist supporters of the ousted president, Mohamed Morsi, braced for a crackdown by the military-imposed government, a senior European diplomat, Bernardino Le\xf3n, told the Islamists of \u201cindications\u201d from the leadership that within hours it would free two imprisoned opposition leaders. In turn, the Islamists had agreed to reduce the size of two protest camps by about half.\n\nAn hour passed, and nothing happened. Another hour passed, and still no one had been released.\n\nThe Americans heightened the pressure. Two senators visiting Cairo, John McCain of Arizona and Lindsey Graham of South Carolina, met with Gen. Abdul-Fattah el-Sisi, the officer who ousted Mr. Morsi and appointed the new government, and the interim prime minister, Hazem el-Beblawi, and pushed for the release of the two prisoners. But the Egyptians brushed them off.\n\n\u201cYou could tell people were itching for a fight,\u201d Mr. Graham recalled in an interview. \u201cThe prime minister was a disaster. He kept preaching to me: \u2018You can\u2019t negotiate with these people. They\u2019ve got to get out of the streets and respect the rule of law.\u2019 I said: \u2018Mr. Prime Minister, it\u2019s pretty hard for you to lecture anyone on the rule of law. How many votes did you get? Oh, yeah, you didn\u2019t have an election.\u2019\xa0\u201d\n\nGeneral Sisi, Mr. Graham said, seemed \u201ca little bit intoxicated by power.\u201d\n\nThe senators walked out that day, Aug. 6, gloomy and convinced that a violent showdown was looming. 
But the diplomats still held out hope, believing they had persuaded Egypt\u2019s government at least not to declare the talks a failure.\n\nThe next morning, the government issued a statement declaring that diplomatic efforts had been exhausted and blaming the Islamists for any casualties from the coming crackdown. A week later, Egyptian forces opened a ferocious assault that so far has killed more than 1,000 protesters.\n\nAll of the efforts of the United States government, all the cajoling, the veiled threats, the high-level envoys from Washington and the 17 personal phone calls by Defense Secretary Chuck Hagel, failed to forestall the worst political bloodletting in modern Egyptian history. The generals in Cairo felt free to ignore the Americans first on the prisoner release and then on the statement, in a cold-eyed calculation that they would not pay a significant cost \u2014 a conclusion bolstered when President Obama responded by canceling a joint military exercise but not $1.5 billion in annual aid.\n\nThe violent crackdown has left Mr. Obama in a no-win position: risk a partnership that has been the bedrock of Middle East peace for 35 years, or stand by while longtime allies try to hold on to power by mowing down opponents. From one side, the Israelis, Saudis and other Arab allies have lobbied him to go easy on the generals in the interest of thwarting what they see as the larger and more insidious Islamist threat. From the other, an unusual mix of conservatives and liberals has urged him to stand more forcefully against the sort of autocracy that has been a staple of Egyptian life for decades.\n\nFor now the administration has decided to keep the close relationship with the Egyptian military fundamentally unchanged. But the death toll is climbing, the streets are descending into chaos, and the government and the Islamists are vowing to escalate. 
It is unclear if the military\u2019s new government can reimpose a version of the old order now that the public believes street protests have toppled two leaders in less than three years, or if, after winning democratic elections, the Islamists will ever again compliantly retreat.\n\nAs Mr. Obama acknowledged in a statement on Thursday, the American response turns not only on humanitarian values but also on national interests. A country consumed by civil strife may no longer function as a stabilizing ally in a volatile region.'
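
For readers on Python 3, where urllib2 was split into urllib.request and http.cookiejar, the same cookie-aware fetch might look roughly like the sketch below. The helper names build_cookie_opener and fetch_html are hypothetical and not part of goose; the returned string is what would be handed to g.extract(raw_html=...) as in the session above.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores cookies and resends them.

    Hypothetical helper, not part of goose. It mirrors
    urllib2.build_opener(urllib2.HTTPCookieProcessor()) from the thread.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_html(url, opener, timeout=10):
    """Fetch a page with the given opener and return it as text.

    Redirects are followed by urllib; cookies set along the way are kept
    in the opener's jar, which is what the login redirect relies on.
    """
    with opener.open(url, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the server names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call would then be `opener, jar = build_cookie_opener()` followed by `raw_html = fetch_html(url, opener)`. Whether a given site still serves the article after the cookie round trip depends on the site, so this is only a starting point.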


grangier commented on June 20, 2024

I am closing this issue. I opened a ticket for cookie handling: #35


MojoJolo commented on June 20, 2024

Thanks!

