A government site with non-compliant html? can't be...
Mor - great work locating the problem. As for the solution: as mentioned here, html5lib is very slow. Maybe there is a way around this. Did you ask in the beautifulsoup google group? Another way might be downloading the html, adding the missing </tr> with some regex and text replacing, and then parsing with lxml.
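A rough, stdlib-only sketch of the regex/text-replacement idea (the broken markup and `drop_orphan_tr` are invented for illustration; a real fix would have to track nesting more carefully):

```python
import re

# Invented snippet of the broken page: a </tr> with no opening <tr>,
# which makes lxml close all open tags and drop the rest of the page.
broken = '<body></tr><table class="highLines"><tr><td>plan 1</td></tr></table></body>'

def drop_orphan_tr(html):
    """Drop closing </tr> tags that have no matching opening <tr>."""
    out, depth = [], 0
    for token in re.split(r"(</?tr[^>]*>)", html):
        if re.fullmatch(r"<tr[^>]*>", token):
            depth += 1
        elif token == "</tr>":
            if depth == 0:
                continue  # orphan closer: skip it
            depth -= 1
        out.append(token)
    return "".join(out)

fixed = drop_orphan_tr(broken)
# The orphan </tr> is gone; the matched pair inside the table survives,
# so the cleaned text can then be handed to the fast lxml parser.
```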
Twitter:@alonisser https://twitter.com/alonisser
LinkedIn Profile http://www.linkedin.com/in/alonisser
Facebook https://www.facebook.com/alonisser
Tech blog: 4p-tech.co.il/blog
Personal blog: degeladom.wordpress.com
Tel: 972-54-6734469
On Mon, Dec 9, 2013 at 1:27 AM, florpor [email protected] wrote:
I was playing around with the scraping code and couldn't figure out why it's not working.
Apparently mmi in their IturTabot page currently serve HTML with a closing tr tag which does not have an opening one, so when beautifulsoup parses the html with lxml as the parser it reaches this point and from there on just closes the open tags it has, leaving the information tables out of the returned object.
Then, when a table of class highLines is searched for, we just take the first result - but there are no results.
The error in heroku logs:

2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: checking gush 30727
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true
2013-12-08T23:09:11.482607+00:00 app[scheduler.7645]: HTML new, inserting data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: [2013-12-08 23:09] ERROR: horse: IndexError: list index out of range
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: Traceback (most recent call last):
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/.heroku/python/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     rv = job.perform()
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/.heroku/python/lib/python2.7/site-packages/rq/job.py", line 328, in perform
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     self._result = self.func(*self.args, **self.kwargs)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/tools/scrapelib.py", line 139, in scrape_gush
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     data = extract_data(html)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/tools/scrapelib.py", line 49, in extract_data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     table = s("table", "highLines")[0]
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: IndexError: list index out of range
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: [2013-12-08 23:09] DEBUG: horse: Invoking exception handler <bound method Worker.move_to_failed_queue of <rq.worker.Worker object at 0x19c7290>>
2013-12-08T23:09:11.518779+00:00 app[scheduler.7645]: [2013-12-08 23:09] WARNING: horse: Moving job to failed queue.
2013-12-08T23:09:11.537331+00:00 app[scheduler.7645]: [2013-12-08 23:09] INFO: worker: *** Listening on high, default, low...

Seems like a solution could be changing the parser BeautifulSoup uses from lxml to html5lib. It seems to work for me so far, still looking into that. Right now though - no new data is being fetched.
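The failing call is the last frame of the traceback. A minimal sketch of guarding that lookup (`find_high_lines_table` is a hypothetical helper, not the project's code; it uses bs4's built-in "html.parser" backend so the snippet stands alone, while the fix actually being proposed here is passing "html5lib" instead of "lxml"):

```python
from bs4 import BeautifulSoup

def find_high_lines_table(html, parser="html.parser"):
    # Hypothetical helper. The fix under discussion is to call
    # BeautifulSoup(html, "html5lib") instead of "lxml"; "html.parser"
    # is used here only so the sketch needs no extra packages.
    s = BeautifulSoup(html, parser)
    tables = s("table", "highLines")  # same lookup as extract_data, line 49
    return tables[0] if tables else None  # None instead of an IndexError

page = '<table class="highLines"><tr><td>plan 1</td></tr></table>'
assert find_high_lines_table(page) is not None
assert find_high_lines_table("<p>error page</p>") is None  # no crash on error pages
```

Returning None (and logging) would let the worker skip a bad gush instead of landing the job in the failed queue.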
from opentaba-server.
are you certain that's the issue? i've noticed that the URLs have been returning the .asp file source - seems like a MIME issue on their servers - instead of the HTML.
I see it both on the URLs we use (e.g. http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true ) and when trying to use the MMI site in the browser (i.e. http://mmi.gov.il/IturTabot/taba1.asp ) - I assumed that was what's stopping the scraping, but since I only discovered it on Saturday I thought they might fix it in a couple of days.. apparently not
I guess I picked a bad gush from the log and now I can't get the logs again without my computer...
anyway, I'm sure that some URLs work:
http://mmi.gov.il/IturTabot/taba2.asp?Gush=360&fromTaba1=true
(it's one in Ashkelon I think)
I'll look into the Jerusalem gushim tomorrow night...
@alonisser - yes html5lib is much slower than lxml, but it's much more flexible and format-error tolerant. considering that it takes 5-10 seconds just to connect to the mmi web server I think it shouldn't bother us and that it's better than regexing this one error which might appear again in other parts of the page later on and require more fixing.
still gotta make sure this is really what's happening on prod. will let you know
I've seen that too and I believe it only happens on some gushim pages and indicates a server-side crash. Not sure what we can do about it.
As for lxml vs html5lib, I believe lxml has some kind of "html" mode which should be more tolerant. I'm not sure if it can be used and if it makes any difference (perhaps it is the mode which is used by BeautifulSoup in the first place). In any case I agree with @florpor that it will most likely be a negligible performance impact, since it is not an on-line process and it is marginal compared to the time it takes to get the data from MMI servers.
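For what it's worth, the tolerant mode lxml exposes directly is `lxml.etree.HTMLParser`, whose error recovery is on by default (the XML parser would simply reject this input). Whether BeautifulSoup's lxml tree builder ends up behaving the same way on this page is exactly the open question; this sketch only shows the standalone lxml API on an invented fragment:

```python
from lxml import etree

# Invented fragment with the orphan </tr> described in this issue.
broken = '<body></tr><table class="highLines"><tr><td>plan</td></tr></table></body>'

# recover=True is already the default for HTMLParser; spelled out for emphasis.
parser = etree.HTMLParser(recover=True)
root = etree.fromstring(broken, parser=parser)

# Parsing does not raise; whether the table survives the stray tag
# on the real page is what still needs checking.
tables = root.findall(".//table")
```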
I agree performance here is at a much lower priority than handling the HTML.
@florpor are you certain the issue is caused by the missing <tr>? I get the same error when trying to scrape gush 30027 - which returns the bad .asp I mentioned above - so I assumed it's just because the HTML output didn't have the table the code is looking for.
Did you try downloading the gush 360 HTML and seeing if the code parses it?
I think our main problem is that we are doing exploratory debugging instead of writing proper granular unit tests for the parser. I guess that if we did write those, with granular types of malformed html, mime, etc, we would already have the answer as to what goes wrong and could solve it / write a try/except around it. Not blaming anyone (as you know, testing is part of my responsibility in this project), I just think we won't know for sure without this.
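A sketch of what such granular tests could look like with the stdlib unittest module. The `extract_data` below is a dummy stand-in (the real tests would import `tools.scrapelib.extract_data` and feed it saved fixture files):

```python
import unittest

def extract_data(html):
    # Dummy stand-in for tools.scrapelib.extract_data, mimicking the
    # observed behavior: no highLines table -> IndexError.
    if "highLines" not in html:
        raise IndexError("list index out of range")
    return [{"plan": "dummy"}]

class ParserMalformedHtmlTests(unittest.TestCase):
    def test_orphan_closing_tr(self):
        # fixture with the stray </tr> reported in this issue
        html = '</tr><table class="highLines"><tr><td>plan</td></tr></table>'
        self.assertTrue(extract_data(html))

    def test_raw_asp_source_instead_of_html(self):
        # MMI sometimes serves the .asp source instead of rendered HTML
        with self.assertRaises(IndexError):
            extract_data("<% Response.Write(...) %>")

if __name__ == "__main__":
    unittest.main()
```

One fixture file per known failure mode (stray tags, raw .asp source, empty response) would pin down exactly which case breaks the parser.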
Completely agree - we should have a full test suite for the parser, and
have it run once a day (or however often we'll be parsing). Another great
task for the Hackathon, which with the snow we're having now I'm pretty
sure I won't be attending ;)
Sorry for the downtime..
So yeah, it's kind of hard proving that it actually happens on our production, because the gushim I got from the logs are all either duplicate htmls (not updated since the last scrape), which is checked before the parsing, or they give the index error because the site returns an error page (got asp code, nice job mmi!). I tried about 30 of them before I gave up.
On my system I have a slightly different lxml version than in the requirements.txt file (mine is 3.2.4 as opposed to 3.2.3), but I just ran the (slightly modified) code against a gush that is a duplicate according to the heroku logs (scraped already - number 30649) and I do get the error with lxml and not with html5lib.
Agreed about the tests. We could really start at the hackathon.
ok. @florpor - can you compile some URLs, with or without problems, so we can download them to build the specific test cases? We don't need the full website, just specific cases of malformed HTML.
apparently it only happens on my system... the reason is still unknown.
@alonisser started writing some parse tests and already merged them. i think this bug can be closed.
we also found out that the mmi site is crashing on every gush that has more than 10 plans; mor opened an "Issue" with them. we still need to find out whether there are plans that do appear on mmi and don't crash the site, but don't appear in our scrapers..