
Comments (11)

alonisser commented on September 16, 2024

A government site with non-compliant HTML? Can't be...

Mor - great work locating the problem. As for the solution:

As mentioned here, html5lib is very slow. Maybe there is a way around this. Did you ask in the beautifulsoup Google group? Another way might be downloading the HTML, adding the missing </tr> with some regex and text replacement, and then parsing with lxml.
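
A blunt sketch of that pre-clean idea - dropping the orphan closing </tr> instead of adding an opener, which amounts to the same repair. The helper name and the tag-balancing heuristic here are illustrative only, not project code:

import re
from bs4 import BeautifulSoup

TR_TAG = re.compile(r"</?tr\b[^>]*>", re.IGNORECASE)

def drop_orphan_tr_closers(html):
    """Strip </tr> tags that never had a matching <tr> (blunt text surgery)."""
    depth, last, parts = 0, 0, []
    for m in TR_TAG.finditer(html):
        if m.group().startswith("</"):
            if depth == 0:
                # Orphan closer: keep the text before it, skip the tag itself.
                parts.append(html[last:m.start()])
                last = m.end()
                continue
            depth -= 1
        else:
            depth += 1
    parts.append(html[last:])
    return "".join(parts)

# Usage: pre-clean, then keep using the fast lxml parser.
broken = "<table></tr><tr><td>plan</td></tr></table>"
soup = BeautifulSoup(drop_orphan_tr_closers(broken), "lxml")
print(soup.find("td"))  # <td>plan</td> - the row survives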


On Mon, Dec 9, 2013 at 1:27 AM, florpor [email protected] wrote:

I was playing around with the scraping code and couldn't figure out why it's not working.
Apparently mmi, on their IturTabot page, currently serve HTML with a closing tr tag which does not have an opening one, so when beautifulsoup parses the HTML with lxml as the parser it reaches this point and from there on just closes the open tags it has, leaving the information tables out of the returned object.
Then, when a table of class highLines is searched for, we just take the first result, but there are no results.
The error in the Heroku logs:

2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: checking gush 30727
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true
2013-12-08T23:09:11.482607+00:00 app[scheduler.7645]: HTML new, inserting data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: [2013-12-08 23:09] ERROR: horse: IndexError: list index out of range
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: Traceback (most recent call last):
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/.heroku/python/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: rv = job.perform()
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/.heroku/python/lib/python2.7/site-packages/rq/job.py", line 328, in perform
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: self._result = self.func(*self.args, **self.kwargs)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/tools/scrapelib.py", line 139, in scrape_gush
2013-12-08T23:09:11.537331+00:00 app[scheduler.7645]: [2013-12-08 23:09] INFO: worker: *** Listening on high, default, low...
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: data = extract_data(html)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/tools/scrapelib.py", line 49, in extract_data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: table = s("table", "highLines")[0]
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: IndexError: list index out of range
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]:
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: [2013-12-08 23:09] DEBUG: horse: Invoking exception handler <bound method Worker.move_to_failed_queue of <rq.worker.Worker object at 0x19c7290>>
2013-12-08T23:09:11.518779+00:00 app[scheduler.7645]: [2013-12-08 23:09] WARNING: horse: Moving job to failed queue.

Seems like a solution could be changing the parser beautifulsoup uses from
lxml to html5lib. It seems to work for me so far, still looking into that.
Right now though - no new data is being fetched.
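
Concretely, the proposed change is a one-argument swap where the soup is built. A minimal sketch, assuming extract_data builds the soup straight from the fetched HTML as the traceback suggests; the guard at the end is an addition, not current code:

from bs4 import BeautifulSoup

def extract_data(html):
    # lxml stops recovering at the orphan </tr> and drops the rest of
    # the document, so the lookup below comes back empty:
    #   s = BeautifulSoup(html, "lxml")
    # html5lib parses the way browsers do and keeps the data tables:
    s = BeautifulSoup(html, "html5lib")

    tables = s("table", "highLines")  # find_all shorthand, as in scrapelib.py line 49
    if not tables:
        # Fail with a clear message instead of the bare IndexError above,
        # e.g. when MMI returns an error page with no tables at all.
        raise ValueError("no highLines table in page")
    table = tables[0]
    # ... the rest of the extraction continues as before ...
    return table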




niryariv commented on September 16, 2024

Are you certain that's the issue? I've noticed that the URLs have been returning the .asp file source - seems like a MIME issue on their servers - instead of the HTML.

I see it both on the URLs we use (e.g. http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true) and when trying to use the MMI site in the browser (i.e. http://mmi.gov.il/IturTabot/taba1.asp). I assumed that was what's stopping the scraping, but since I only discovered it on Saturday I thought they might fix it in a couple of days... apparently not.


florpor commented on September 16, 2024

I guess I picked a bad gush from the log, and now I can't get the logs again without my computer...
Anyways, I'm sure that some URLs work:
http://mmi.gov.il/IturTabot/taba2.asp?Gush=360&fromTaba1=true
(it's one in Ashkelon I think)
I'll look into the Jerusalem gushim tomorrow night...

@alonisser - yes, html5lib is much slower than lxml, but it's much more flexible and tolerant of format errors. Considering that it takes 5-10 seconds just to connect to the MMI web server, I think it shouldn't bother us, and it's better than regexing this one error, which might appear again in other parts of the page later on and require more fixing.

Still gotta make sure this is really what's happening on prod. Will let you know.


shevron commented on September 16, 2024

I've seen that too, and I believe it only happens on some gushim pages and indicates a server-side crash. Not sure what we can do about it.

As for lxml vs html5lib, I believe lxml has some kind of "html" mode which should be more tolerant; I'm not sure if it can be used and if it makes any difference (perhaps it is the mode which is used by BeautifulSoup in the first place). In any case I agree with @florpor that it will most likely be a negligible performance impact, since it is not an on-line process and it is marginal compared to the time it takes to get the data from MMI servers.
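
For what it's worth, a quick way to check that: lxml's tolerant mode is its dedicated HTML parser (lxml.html, i.e. etree.HTMLParser with recover=True), and as far as I know that is also the backend BeautifulSoup's "lxml" builder uses, so it may not behave any differently here. A standalone sketch - the sample markup is made up:

from lxml import etree, html

broken = "<html><body></tr><table class='highLines'><tr><td>x</td></tr></table></body></html>"

# Recovering HTML parser - lxml's lenient mode.
parser = etree.HTMLParser(recover=True)
tree = html.fromstring(broken, parser=parser)

# Shows what survives recovery of the orphan </tr>:
print(tree.xpath("//table[@class='highLines']"))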


niryariv commented on September 16, 2024

I agree performance here is a much lower priority than handling the HTML.

@florpor are you certain the issue is caused by the missing <tr>? I get the same error when trying to scrape gush 30027 - which returns the bad .asp I mentioned above - so I assumed it's just because the HTML output didn't have the table the code is looking for.

Did you try downloading gush 360 HTML and seeing if the code parses it?


alonisser commented on September 16, 2024

I think our main problem is that we are doing exploratory debugging instead of writing proper granular unit tests for the parser. I guess that if we had written those, with granular types of malformed HTML, MIME issues, etc., we would already have the answer to what goes wrong and could solve it / write a try/except around it. Not blaming anyone (as you know, testing is part of my responsibility in this project); I just think we won't know for sure without this.
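
For example, a sketch of the kind of granular test I mean - the fixture filenames are hypothetical; only the tools.scrapelib import is taken from the traceback above:

import unittest

from tools.scrapelib import extract_data  # path as seen in the traceback

class ExtractDataTest(unittest.TestCase):
    def test_orphan_tr_closer(self):
        # Saved copy of an MMI page containing a </tr> with no opener.
        with open("tests/fixtures/orphan_tr.html") as f:
            self.assertTrue(extract_data(f.read()))

    def test_asp_error_page(self):
        # MMI sometimes returns raw ASP source instead of HTML; the
        # parser should fail loudly, not with a bare IndexError.
        with open("tests/fixtures/error_page.asp") as f:
            self.assertRaises(Exception, extract_data, f.read())

if __name__ == "__main__":
    unittest.main()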


niryariv commented on September 16, 2024

Completely agree - we should have a full test suite for the parser, and have it run once a day (or however often we'll be parsing). Another great task for the Hackathon, which, with the snow we're having now, I'm pretty sure I won't be attending ;)


florpor commented on September 16, 2024

Sorry for the downtime...
So yeah, it's kinda hard proving that it actually happens on our production, because the gushim I got from the logs are all either duplicate HTMLs (not updated since the last scrape), which is checked before the parsing, or they give the IndexError because the site returns an error page (got ASP code, nice job MMI!). I tried about 30 of them before I gave up.
On my system I have a slightly different lxml version than in the requirements.txt file (mine is 3.2.4 as opposed to 3.2.3), but I just ran the (slightly modified) code against a gush that is a duplicate according to the Heroku logs (already scraped - number 30649), and I do get the error with lxml and not with html5lib.
Agreed about the tests. We could really start at the hackathon.


alonisser commented on September 16, 2024

OK. @florpor - can you compile some URLs, with or without problems, so we can download them to build the specific use cases? We don't need the full website, just specific cases of malformed HTML.


florpor commented on September 16, 2024

Apparently it only happens on my system... the reason is still unknown.
@alonisser started writing some parse tests and already merged them. I think this bug can be closed.


alonisser commented on September 16, 2024

We also found out that the MMI site is crashing on every gush that has more than 10 plans; Mor opened an "Issue" with them. We still need to find out whether there are plans that do appear on MMI and don't crash the site, but don't appear in our scrapers...
