niryariv / opentaba-server
License: BSD 3-Clause "New" or "Revised" License
As a good open project, we should have a license.
There seems to be a difference between the data from the current scraping and the data returned by the JSON API on Heroku.
Enables us to utilize all of the taba information, not only Jerusalem where we have a map (bringing in Tel Aviv and the rest of Gush Dan).
Algorithm:
maybe the collection should be reversed (a table of gush: [email subscribers]) - see the sketch below
More parts:
I think this can be the killer app for open taba
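A minimal sketch of the reversed collection idea above (gush -> [email subscribers]), assuming pymongo; the `subscriptions` collection and its field names are made up for illustration, not an existing schema:

```python
# Sketch only: one document per gush, holding its subscriber emails.
from pymongo import MongoClient

db = MongoClient()['opentaba']

def subscribe(email, gush_id):
    # append the email to the gush's subscriber list, creating the doc if needed
    db.subscriptions.update(
        {'gush_id': gush_id},
        {'$addToSet': {'emails': email}},
        upsert=True,
    )

def subscribers_for(gush_id):
    # everyone to alert when this gush gets a new plan
    doc = db.subscriptions.find_one({'gush_id': gush_id})
    return doc['emails'] if doc else []
```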
Currently the data model is gush-centered: as it works now, we scrape one gush at a time and store all of that gush's plans in the DB.
That's a result of the original sprint to get the code working, and it isn't clean. It's already causing some issues: e.g., when a plan exists in several gushim it is stored several times. Since MMI has a weird protocol where a plan appears in ALL gushim at certain stages, we have plans duplicated ~500 times, requiring the whole blacklisting system, etc.
The idea here is to switch to a plan-centric model: the plan_id is the index, each plan appears once, and it has an array of the gushim it belongs to.
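To illustrate, here's a minimal sketch of a plan-centric upsert, assuming pymongo; plan_id, gushim, status and essence are illustrative field names, not the final schema:

```python
# Sketch only: one document per plan, with an array of the gushim it belongs to.
from pymongo import MongoClient

db = MongoClient()['opentaba']

def save_plan(plan, gush_id):
    # scraping the same plan under another gush only appends to its gushim
    # array instead of duplicating the whole plan document
    db.plans.update(
        {'plan_id': plan['plan_id']},
        {
            '$set': {'status': plan['status'], 'essence': plan['essence']},
            '$addToSet': {'gushim': gush_id},
        },
        upsert=True,
    )
```

This would make the blacklisting workaround unnecessary, since a plan appearing in many gushim is just one document with a long gushim array.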
On some local installations it would be very nice to allow scraping without requiring the Redis queue, and it should be easy to achieve: add an option to run the scraper in-process, without Redis.
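A minimal sketch of what that option could look like, assuming the existing scrape_gush() in tools/scrapelib.py; the --no-queue flag and this CLI shape are hypothetical, the real entry point may look different:

```python
# Sketch only: run scraping either in-process or via the Redis/rq queue.
import argparse
from tools.scrapelib import scrape_gush

parser = argparse.ArgumentParser()
parser.add_argument('--no-queue', action='store_true',
                    help='scrape in-process instead of enqueueing to Redis/rq')
parser.add_argument('gushim', nargs='+')
args = parser.parse_args()

if args.no_queue:
    for gush_id in args.gushim:
        scrape_gush(gush_id)                # synchronous, no worker.py or Redis needed
else:
    # imported lazily so a Redis-less install doesn't need redis/rq at all
    from redis import Redis
    from rq import Queue
    q = Queue(connection=Redis())
    for gush_id in args.gushim:
        q.enqueue(scrape_gush, gush_id)     # normal path: worker.py picks these up
```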
I noticed that when clicking some plan links from opentaba.info, I get an error message on the MMI site instead of the expected plan details page.
For example, go to http://opentaba.info/#/gush/30321 and click the 2nd and 3rd plans (you can try others) - some will open the MMI site with an error message. This is the URL clicked (generated by opentaba):
http://mmi.gov.il/IturTabot/taba4.asp?kod=3000&MsTochnit=%u05DE%u05D9/994
I assume this happens due to a broken link we generate, but I could be wrong. Maybe it's just a bug in the MMI site (wouldn't be surprised).
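One thing worth checking (just a guess on my part, not verified): the MsTochnit parameter above uses non-standard %uXXXX escapes for the Hebrew characters, while a standard URL would percent-encode the UTF-8 bytes. A quick diagnostic sketch (Python 2.7; the plan number מי/994 is decoded from the %u escapes above):

```python
# -*- coding: utf-8 -*-
# Compare standard UTF-8 percent-encoding with the %uXXXX form in the link.
from urllib import quote

plan_number = u'מי/994'
print(quote(plan_number.encode('utf-8'), safe='/'))
# -> %D7%9E%D7%99/994, versus %u05DE%u05D9/994 in the generated link above
```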
Frozen-Flask is an extension that converts the output of a Flask app into static files. It knows how to access the routes (including parameter-based ones) - sounds like just the ticket for us to generate the static files we'll later push to git.
@alonisser @oreniko @florpor @shevron - anyone feel like playing with this a bit?
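For reference, here's a minimal Frozen-Flask sketch. It assumes our Flask app object is importable from app.py as `app`, and the /gush/<gush_id>/plans.json route is a guess - adjust to the real route names:

```python
# freeze.py - render the Flask routes into static files (sketch, untested)
from flask_frozen import Freezer
from app import app          # assumes the Flask app lives in app.py

freezer = Freezer(app)

@freezer.register_generator
def plan_urls():
    # Frozen-Flask accepts plain URL strings for parameterized routes;
    # in the real version this list would come from the gushim in the DB
    for gush_id in ('30727', '30321'):
        yield '/gush/%s/plans.json' % gush_id

if __name__ == '__main__':
    freezer.freeze()         # writes the static files into build/
```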
Currently we're filtering out plans with > x gushim for the map display, which makes sense, but in the feed we should show all of them and link to the details page.
@shevron maybe you can help (since it's connected to your code):
When I run the tests locally, while preparing the setup steps, I get this error when I run python worker.py:
No handlers could be found for logger "tools.scrapelib"
[2014-01-06 21:45] ERROR: horse: SystemExit: 1
Traceback (most recent call last):
  File "/home/alon/.virtualenvs/opentaba-server/local/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
    rv = job.perform()
  File "/home/alon/.virtualenvs/opentaba-server/local/lib/python2.7/site-packages/rq/job.py", line 328, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/home/alon/Projects/opentaba-server/tools/scrapelib.py", line 132, in scrape_gush
    html = get_gush_html(gush_id)
  File "/home/alon/Projects/opentaba-server/tools/scrapelib.py", line 32, in get_gush_html
    exit(1)
  File "/home/alon/.virtualenvs/opentaba-server/lib/python2.7/site.py", line 403, in __call__
    raise SystemExit(code)
SystemExit: 1
Since no scraping is done, some of the local tests also fail. On Travis CI everything works fine, so I'm not sure what the issue is. Any ideas? Directions?
create_db
clean_db
scrape_py
As a basis for a specific alerts service for subscribers, the API should support:
/feed/cityname
/feed/gush_number
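Here's a rough sketch of what this could look like, assuming Flask and pymongo; the city, gushim and day fields are guesses about the schema, not the real one:

```python
# Sketch only: one feed route serving both city names and gush numbers.
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient()['opentaba']

@app.route('/feed/<identifier>')
def feed(identifier):
    if identifier.isdigit():
        # /feed/<gush_number>: plans belonging to a single gush
        query = {'gushim': identifier}
    else:
        # /feed/<cityname>: plans belonging to any gush mapped to that city
        gush_ids = [g['gush_id'] for g in db.gushim.find({'city': identifier})]
        query = {'gushim': {'$in': gush_ids}}
    plans = db.plans.find(query, {'_id': False}).sort('day', -1)
    return jsonify(plans=list(plans))
```

A real version would probably return Atom/RSS rather than JSON, so dlvr.it and feed readers can consume it.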
Currently, the code prints out messages using print() in various places, including tools/scrapelib.py and other "library" files.
Replacing the print() calls with Python's standard logging module will allow us to easily redirect messages to files or STDERR, control verbosity, and automatically add timestamps and origin to messages.
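A minimal sketch of what this could look like; the module-level logger in tools/scrapelib.py and the handler/format choices here are only an example (this would also get rid of the "No handlers could be found for logger" warnings once a handler is configured):

```python
# Sketch only: logging-based output instead of print().
import logging

log = logging.getLogger('tools.scrapelib')

def setup_logging(verbose=False):
    handler = logging.StreamHandler()            # goes to STDERR by default
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(name)s %(levelname)s %(message)s'))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if verbose else logging.INFO)

# then, instead of: print 'checking gush %s' % gush_id
# use:             log.info('checking gush %s', gush_id)
```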
Feel free to close this if everyone else thinks this is silly, but I'm bothered by the non-standard code style in this project. If it's fine by everyone, I would suggest standardizing on PEP8. Actually fixing the code should be quite easy; I can do it automatically.
see this:
https://groups.google.com/forum/#!msg/hasadna11/6Os5O8GA8Ks/02UOfYrZYVcJ
@niryariv maybe yair should prepare this?
The feed is > 512KB (since we read a lot of plans and then remove the blacklisted ones, reminiscent of old scraper times), so dlvr.it doesn't read it.
In many Gushim in the DB the status
field has an extra whitespace at the end:
{
_id: ObjectId("50ec3cb8c11698000730ed0b"),
status: "פרסום תוקף ברשומות ",
essence: "תוספת קומות לבניין קיים",
...
}
Most likely a .strip() call is missing from https://github.com/niryariv/opentaba-server/blob/master/tools/scrapelib.py#L74, but I'm noting this here since there are no tests for this and I know @florpor is working on an entirely new parser. We should make sure (and add tests) that gush fields are properly scraped, including whitespace stripping.
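A quick sketch of the stripping plus a test; clean_plan is a hypothetical helper, not an existing function in the codebase:

```python
# -*- coding: utf-8 -*-
# Sketch only: strip stray whitespace from scraped string fields before saving.

def clean_plan(plan):
    return dict((k, v.strip() if hasattr(v, 'strip') else v)
                for k, v in plan.items())

def test_status_is_stripped():
    plan = {'status': u'פרסום תוקף ברשומות ', 'essence': u'תוספת קומות לבניין קיים'}
    assert clean_plan(plan)['status'] == u'פרסום תוקף ברשומות'
```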
@florpor scraped a more complete map than what we have.
The point of this ticket isn't just to import the map, but to build an automated way of adding a map to the DB - in particular extracting all the Gush IDs, which currently reside both in the DB and in a file on the server.
The result of this should be an "import_map.py" utility, which takes a gushim map (perhaps a URL to that map in the client github repo) and adds it to the project on the server side (and client, if needed).
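A rough sketch of what import_map.py could look like, assuming the map is GeoJSON with each feature carrying its gush ID in a "Name" property; the property name, collection name and map URL are all assumptions:

```python
# import_map.py - sketch only, run as: python import_map.py <map-url>
import json
import sys
import urllib2
from pymongo import MongoClient

def import_map(url):
    data = json.load(urllib2.urlopen(url))
    gush_ids = set(str(f['properties']['Name']) for f in data.get('features', []))
    db = MongoClient()['opentaba']
    for gush_id in sorted(gush_ids):
        # upsert so re-running the import with a newer map is safe
        db.gushim.update({'gush_id': gush_id}, {'$set': {'gush_id': gush_id}}, upsert=True)
    print('imported %d gushim' % len(gush_ids))

if __name__ == '__main__':
    import_map(sys.argv[1])
```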
Apparently the nightly "scrape -g all" commands accumulate on Redis and the memory doesn't get cleared.
When the 5MB limit of the free plan is reached, this stops the server from collecting new data properly.
The interim hack is to reinstall the Redis To Go add-on every few months, but obviously there should be a way to actually fix this in the code.
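A sketch of one direction (not verified against our Redis To Go setup, and the exact rq API depends on the version we're pinned to): don't keep finished job results in Redis, and periodically empty the failed queue.

```python
# Sketch only: keep rq from piling up job results and failed jobs in Redis.
from redis import Redis
from rq import Queue, get_failed_queue

conn = Redis()
q = Queue(connection=conn)

# result_ttl=0 discards the job's return value immediately instead of
# keeping it in Redis for the default 500 seconds
q.enqueue_call(func='tools.scrapelib.scrape_gush', args=('30727',), result_ttl=0)

# failed jobs also pile up; clear them out after inspecting them
get_failed_queue(connection=conn).empty()
```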
Create a CONTRIBUTING.md with the necessary information.
When this file exists in the project's root, it appears as guidelines before someone forks or opens a pull request against the project.
This will allow automating them with Heroku's cron.
@shevron @alonisser @florpor
I just discovered MMI switched to a new format: http://mmi.gov.il/IturTabot2/taba1.aspx
The data passes via JSON, which theoretically should make our job easier; in reality it's the usual clusterfuck of government + ASPX.
I haven't figured out a way to get the clean JSON (or the data at all) yet; will keep at it. Posting here in case one of you has some free time to hack on it in the meantime.
I was playing around with the scraping code and couldn't figure out why it's not working.
Apparently MMI's IturTabot page currently serves HTML with a closing tr tag that has no opening one, so when BeautifulSoup parses the HTML with lxml as the parser, it reaches this point and from there on just closes the open tags it has, leaving the information tables out of the returned object.
Then, when a table of class highLines is searched for, we just take the first result, but there are no results.
The error in heroku logs:
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: checking gush 30727
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true
2013-12-08T23:09:11.482607+00:00 app[scheduler.7645]: HTML new, inserting data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: [2013-12-08 23:09] ERROR: horse: IndexError: list index out of range
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: Traceback (most recent call last):
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/.heroku/python/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: rv = job.perform()
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/.heroku/python/lib/python2.7/site-packages/rq/job.py", line 328, in perform
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: self._result = self.func(*self.args, **self.kwargs)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/tools/scrapelib.py", line 139, in scrape_gush
2013-12-08T23:09:11.537331+00:00 app[scheduler.7645]: [2013-12-08 23:09] INFO: worker: *** Listening on high, default, low...
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: data = extract_data(html)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/tools/scrapelib.py", line 49, in extract_data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: table = s("table", "highLines")[0]
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: IndexError: list index out of range
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]:
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: [2013-12-08 23:09] DEBUG: horse: Invoking exception handler <bound method Worker.move_to_failed_queue of <rq.worker.Worker object at 0x19c7290>>
2013-12-08T23:09:11.518779+00:00 app[scheduler.7645]: [2013-12-08 23:09] WARNING: horse: Moving job to failed queue.
Seems like a solution could be changing the parser BeautifulSoup uses from lxml to html5lib. It seems to work for me so far; still looking into it.
Right now, though, no new data is being fetched.
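A minimal illustration of the proposed fix: let BeautifulSoup use html5lib, which tolerates the stray closing tr tag, instead of lxml. Requires `pip install html5lib`; the real extract_data() differs, this is just a sketch:

```python
# Sketch only: parse the MMI page with html5lib and fail loudly if the
# highLines table is missing, instead of raising IndexError in the worker.
from bs4 import BeautifulSoup

def find_plans_table(html):
    s = BeautifulSoup(html, 'html5lib')   # was: BeautifulSoup(html, 'lxml')
    tables = s('table', 'highLines')
    if not tables:
        raise ValueError('no "highLines" table found - has the MMI page changed?')
    return tables[0]
```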
(note to self)
Some solutions in this blog post, including pinging from the New Relic free plan on Heroku, using a Heroku one-up dyno, and other solutions.
I figure we should require that each new PR passes the tests and if necessary adds its own tests. Unfortunately as a Python testing noob I'm not sure how to do this myself, let alone explain to new contributors :)
@alonisser could you add some basic info on this to CONTRIBUTING.md? Just some URLs to info on that would be fine.
(Also a personal issue of mine: when running nosetests I get ERROR: Failure: ImportError (No module named pymongo), though it is installed in my venv. Any pointers on what I'm doing wrong?)
.noserc file to specify where the package to be tested is
pip install coverage
replaces #60
Current response:
[
    {
        "gush_id": "28046",
        "last_checked_at": {
            "$date": 1390647702698
        }
    },
    ...
]
Suggested:
[
    {
        "gush_id": "28046",
        "plan_dates": [1551411, 123123131, 1231231],
        "<status x>": 3,
        "<status y>": 4
    },
    ...
]
status_x, status_y are plan statuses like ״תוכנית בהפקדה״ ("plan in deposit"), etc. (GitHub's Hebrew support is not great). The keys are determined by the plan statuses and are not known in advance.
plan_dates should only include plans that are 5 years old or younger.
@shevron what do u think?
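A rough sketch of how the suggested response could be built, assuming a pymongo db where plans have gushim, status and a millisecond day timestamp; all field names are assumptions about the schema:

```python
# Sketch only: per-gush summary with status counts and recent plan dates.
import time
from collections import Counter
from pymongo import MongoClient

db = MongoClient()['opentaba']
FIVE_YEARS_MS = 5 * 365 * 24 * 60 * 60 * 1000

def gush_summary(gush_id):
    cutoff = int(time.time() * 1000) - FIVE_YEARS_MS
    statuses = Counter()
    plan_dates = []
    for plan in db.plans.find({'gushim': gush_id}):
        statuses[plan['status']] += 1        # keys are the raw status strings
        if plan.get('day', 0) >= cutoff:
            plan_dates.append(plan['day'])   # only plans 5 years old or younger
    summary = {'gush_id': gush_id, 'plan_dates': sorted(plan_dates)}
    summary.update(statuses)
    return summary
```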
We need a system for handling street addresses that can't be automatically geocoded:
Currently we scrape all the plan info from the gush results list page. If we start scraping the plan page (e.g. this) we'll be able to get the plan's street address, not just the gush ID.
This could be useful for a number of things.
The downside is that this adds a lot of requests to the scraping process. The trick is to apply it just to relevant plans - e.g. only plans from 2011 onwards - and to make sure we don't run a request for plans whose address was already scraped.
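A sketch of the guard logic only; fetch_plan_page and parse_address are hypothetical helpers (the plan-page HTML still needs to be reverse engineered), and the 2011 cutoff and field names are assumptions:

```python
# Sketch only: request the plan details page just for recent plans that
# don't already have an address.
def maybe_scrape_address(db, plan):
    if plan.get('address'):
        return                                  # already scraped, skip the extra request
    if plan.get('year', 0) < 2011:
        return                                  # only bother with recent plans
    html = fetch_plan_page(plan['plan_id'])     # hypothetical: fetch the plan details page
    plan['address'] = parse_address(html)       # hypothetical: pull the street address
    db.plans.save(plan)
```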