niryariv / opentaba-server
License: BSD 3-Clause "New" or "Revised" License
As a good open project, we should have a license.
There seems to be a difference between the data from the current scraping and the data returned by the JSON API on Heroku.
Enables us to utilize all of the taba information, not only Jerusalem where we have a map (bringing in Tel Aviv and the rest of Gush Dan).
Algorithm:
maybe the collection should be reversed (a table of gush: [email subscribers]) - see the sketch below
More parts:
I think this can be the killer app for open taba
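A minimal sketch of the reversed collection idea above (gush -> [email subscribers]), assuming pymongo; the `subscriptions` collection and its field names are made up for illustration, not an existing schema:

```python
# Sketch only: one document per gush, holding its subscriber emails.
from pymongo import MongoClient

db = MongoClient()['opentaba']

def subscribe(email, gush_id):
    # append the email to the gush's subscriber list, creating the doc if needed
    db.subscriptions.update(
        {'gush_id': gush_id},
        {'$addToSet': {'emails': email}},
        upsert=True,
    )

def subscribers_for(gush_id):
    # everyone to alert when this gush gets a new plan
    doc = db.subscriptions.find_one({'gush_id': gush_id})
    return doc['emails'] if doc else []
```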
Currently the data model is gush-centered: as it works now, we scrape one gush at a time and store all of that gush's plans in the DB.
That's a result of the original sprint to get the code working, and it isn't clean. It's already causing some issues: e.g., when a plan exists in several gushim it is stored several times. Since MMI has a weird protocol where a plan appears in ALL gushim at certain stages, we have plans duplicated ~500 times, requiring the whole blacklisting system, etc.
The idea here is to switch to a plan-centric model: the plan_id is the index, each plan appears once, and it has an array of the gushim it belongs to.
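To illustrate, here's a minimal sketch of a plan-centric upsert, assuming pymongo; plan_id, gushim, status and essence are illustrative field names, not the final schema:

```python
# Sketch only: one document per plan, with an array of the gushim it belongs to.
from pymongo import MongoClient

db = MongoClient()['opentaba']

def save_plan(plan, gush_id):
    # scraping the same plan under another gush only appends to its gushim
    # array instead of duplicating the whole plan document
    db.plans.update(
        {'plan_id': plan['plan_id']},
        {
            '$set': {'status': plan['status'], 'essence': plan['essence']},
            '$addToSet': {'gushim': gush_id},
        },
        upsert=True,
    )
```

This would make the blacklisting workaround unnecessary, since a plan appearing in many gushim is just one document with a long gushim array.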
On some local installations it would be very nice to allow scraping without requiring the Redis queue, and it should be easy to achieve: add an option to run the scraper in-process, without Redis.
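A minimal sketch of what that option could look like, assuming the existing scrape_gush() in tools/scrapelib.py; the --no-queue flag and this CLI shape are hypothetical, the real entry point may look different:

```python
# Sketch only: run scraping either in-process or via the Redis/rq queue.
import argparse
from tools.scrapelib import scrape_gush

parser = argparse.ArgumentParser()
parser.add_argument('--no-queue', action='store_true',
                    help='scrape in-process instead of enqueueing to Redis/rq')
parser.add_argument('gushim', nargs='+')
args = parser.parse_args()

if args.no_queue:
    for gush_id in args.gushim:
        scrape_gush(gush_id)                # synchronous, no worker.py or Redis needed
else:
    # imported lazily so a Redis-less install doesn't need redis/rq at all
    from redis import Redis
    from rq import Queue
    q = Queue(connection=Redis())
    for gush_id in args.gushim:
        q.enqueue(scrape_gush, gush_id)     # normal path: worker.py picks these up
```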
I noticed that when clicking some plan links from opentaba.info, I get an error message on the MMI site instead of the expected plan details page.
For example, go to http://opentaba.info/#/gush/30321 and click the 2nd and 3rd plans (you can try others) - some will open the MMI site with an error message. This is the URL clicked (generated by opentaba):
http://mmi.gov.il/IturTabot/taba4.asp?kod=3000&MsTochnit=%u05DE%u05D9/994
I assume this happens due to a broken link we generate, but I could be wrong. Maybe it's just a bug in the MMI site (wouldn't be surprised).
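One thing worth checking (just a guess on my part, not verified): the MsTochnit parameter above uses non-standard %uXXXX escapes for the Hebrew characters, while a standard URL would percent-encode the UTF-8 bytes. A quick diagnostic sketch (Python 2.7; the plan number מי/994 is decoded from the %u escapes above):

```python
# -*- coding: utf-8 -*-
# Compare standard UTF-8 percent-encoding with the %uXXXX form in the link.
from urllib import quote

plan_number = u'מי/994'
print(quote(plan_number.encode('utf-8'), safe='/'))
# -> %D7%9E%D7%99/994, versus %u05DE%u05D9/994 in the generated link above
```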
Frozen-Flask is an extension that converts the output of a Flask app into static files. It knows how to access the routes (including parameter-based ones) - sounds like just the ticket for us to generate the static files we'll later push to git.
@alonisser @oreniko @florpor @shevron - anyone feel like playing with this a bit?
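For reference, here's a minimal Frozen-Flask sketch. It assumes our Flask app object is importable from app.py as `app`, and the /gush/<gush_id>/plans.json route is a guess - adjust to the real route names:

```python
# freeze.py - render the Flask routes into static files (sketch, untested)
from flask_frozen import Freezer
from app import app          # assumes the Flask app lives in app.py

freezer = Freezer(app)

@freezer.register_generator
def plan_urls():
    # Frozen-Flask accepts plain URL strings for parameterized routes;
    # in the real version this list would come from the gushim in the DB
    for gush_id in ('30727', '30321'):
        yield '/gush/%s/plans.json' % gush_id

if __name__ == '__main__':
    freezer.freeze()         # writes the static files into build/
```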
Currently we're filtering out plans with > x gushim for the map display, which makes sense, but in the feed we should show all of them and link to the details page.
@shevron maybe you can help (since it's connected to your code):
When I run the tests locally, while preparing the setup steps, I get this error when I run python worker.py:
No handlers could be found for logger "tools.scrapelib"
[2014-01-06 21:45] ERROR: horse: SystemExit: 1
Traceback (most recent call last):
  File "/home/alon/.virtualenvs/opentaba-server/local/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
    rv = job.perform()
  File "/home/alon/.virtualenvs/opentaba-server/local/lib/python2.7/site-packages/rq/job.py", line 328, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/home/alon/Projects/opentaba-server/tools/scrapelib.py", line 132, in scrape_gush
    html = get_gush_html(gush_id)
  File "/home/alon/Projects/opentaba-server/tools/scrapelib.py", line 32, in get_gush_html
    exit(1)
  File "/home/alon/.virtualenvs/opentaba-server/lib/python2.7/site.py", line 403, in __call__
    raise SystemExit(code)
SystemExit: 1
Since no scraping is done, some of the local tests also fail. On Travis CI everything works fine, so I'm not sure what the issue is. Any ideas? Directions?
create_db
clean_db
scrape_py
As a basis for a specific alerts service for subscribers, the API should support:
/feed/cityname
/feed/gush_number
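Here's a rough sketch of what this could look like, assuming Flask and pymongo; the city, gushim and day fields are guesses about the schema, not the real one:

```python
# Sketch only: one feed route serving both city names and gush numbers.
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient()['opentaba']

@app.route('/feed/<identifier>')
def feed(identifier):
    if identifier.isdigit():
        # /feed/<gush_number>: plans belonging to a single gush
        query = {'gushim': identifier}
    else:
        # /feed/<cityname>: plans belonging to any gush mapped to that city
        gush_ids = [g['gush_id'] for g in db.gushim.find({'city': identifier})]
        query = {'gushim': {'$in': gush_ids}}
    plans = db.plans.find(query, {'_id': False}).sort('day', -1)
    return jsonify(plans=list(plans))
```

A real version would probably return Atom/RSS rather than JSON, so dlvr.it and feed readers can consume it.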
Currently, the code prints out messages using print() in various places, including tools/scrapelib.py and other "library" files.
Replacing the print() calls with Python's standard logging module will allow us to easily redirect messages to files or STDERR, control verbosity, and automatically add timestamps and origin to messages.
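A minimal sketch of what this could look like; the module-level logger in tools/scrapelib.py and the handler/format choices here are only an example (this would also get rid of the "No handlers could be found for logger" warnings once a handler is configured):

```python
# Sketch only: logging-based output instead of print().
import logging

log = logging.getLogger('tools.scrapelib')

def setup_logging(verbose=False):
    handler = logging.StreamHandler()            # goes to STDERR by default
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(name)s %(levelname)s %(message)s'))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if verbose else logging.INFO)

# then, instead of: print 'checking gush %s' % gush_id
# use:             log.info('checking gush %s', gush_id)
```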
Feel free to close this if everyone else thinks this is silly, but I'm bothered by the non-standard code style in this project. If it's fine by everyone, I would suggest standardizing on PEP8. Actually fixing the code should be quite easy; I can do it automatically.
see this:
https://groups.google.com/forum/#!msg/hasadna11/6Os5O8GA8Ks/02UOfYrZYVcJ
@niryariv maybe yair should prepare this?
The feed is > 512KB (since we read a lot of plans and then remove the blacklisted ones, reminiscent of old scraper times), so dlvr.it doesn't read it.
In many Gushim in the DB the status
field has an extra whitespace at the end:
{
_id: ObjectId("50ec3cb8c11698000730ed0b"),
status: "פרסום תוקף ברשומות ",
essence: "תוספת קומות לבניין קיים",
...
}
Most likely a .strip() call is missing from https://github.com/niryariv/opentaba-server/blob/master/tools/scrapelib.py#L74, but I'm noting this here since there are no tests for this and I know @florpor is working on an entirely new parser. We should make sure (and add tests) that gush fields are properly scraped, including whitespace stripping.
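A quick sketch of the stripping plus a test; clean_plan is a hypothetical helper, not an existing function in the codebase:

```python
# -*- coding: utf-8 -*-
# Sketch only: strip stray whitespace from scraped string fields before saving.

def clean_plan(plan):
    return dict((k, v.strip() if hasattr(v, 'strip') else v)
                for k, v in plan.items())

def test_status_is_stripped():
    plan = {'status': u'פרסום תוקף ברשומות ', 'essence': u'תוספת קומות לבניין קיים'}
    assert clean_plan(plan)['status'] == u'פרסום תוקף ברשומות'
```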
@florpor scraped a more complete map than what we have.
The point of this ticket isn't just to import the map, but to build an automated way of adding a map to the DB - in particular extracting all the Gush IDs, which currently reside both in the DB and in a file on the server.
The result of this should be an "import_map.py" utility, which takes a gushim map (perhaps a URL to that map in the client github repo) and adds it to the project on the server side (and client, if needed).
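A rough sketch of what import_map.py could look like, assuming the map is GeoJSON with each feature carrying its gush ID in a "Name" property; the property name, collection name and map URL are all assumptions:

```python
# import_map.py - sketch only, run as: python import_map.py <map-url>
import json
import sys
import urllib2
from pymongo import MongoClient

def import_map(url):
    data = json.load(urllib2.urlopen(url))
    gush_ids = set(str(f['properties']['Name']) for f in data.get('features', []))
    db = MongoClient()['opentaba']
    for gush_id in sorted(gush_ids):
        # upsert so re-running the import with a newer map is safe
        db.gushim.update({'gush_id': gush_id}, {'$set': {'gush_id': gush_id}}, upsert=True)
    print('imported %d gushim' % len(gush_ids))

if __name__ == '__main__':
    import_map(sys.argv[1])
```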
Apparently the nightly "scrape -g all" commands accumulate on Redis and the memory doesn't get cleared.
When the 5MB limit of the free plan is reached, this stops the server from collecting new data properly.
The interim hack is to reinstall the Redis To Go add-on every few months, but obviously there should be a way to actually fix this in the code.
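A sketch of one direction (not verified against our Redis To Go setup, and the exact rq API depends on the version we're pinned to): don't keep finished job results in Redis, and periodically empty the failed queue.

```python
# Sketch only: keep rq from piling up job results and failed jobs in Redis.
from redis import Redis
from rq import Queue, get_failed_queue

conn = Redis()
q = Queue(connection=conn)

# result_ttl=0 discards the job's return value immediately instead of
# keeping it in Redis for the default 500 seconds
q.enqueue_call(func='tools.scrapelib.scrape_gush', args=('30727',), result_ttl=0)

# failed jobs also pile up; clear them out after inspecting them
get_failed_queue(connection=conn).empty()
```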
Create a CONTRIBUTING.md with the necessary information.
When this file exists in the project's root, it appears as guidelines before someone forks or opens a pull request against the project.
This will allow automating them with Heroku's cron.
@shevron @alonisser @florpor
I just discovered MMI switched to a new format: http://mmi.gov.il/IturTabot2/taba1.aspx
The data passes via JSON, which theoretically should make our job easier; in reality it's the usual clusterfuck of government + ASPX.
I haven't figured out a way to get the clean JSON (or the data at all) yet; will keep at it. Posting here in case one of you has some free time to hack on it in the meantime.
I was playing around with the scraping code and couldn't figure out why it's not working.
Apparently MMI's IturTabot page currently serves HTML with a closing tr tag that has no opening one, so when BeautifulSoup parses the HTML with lxml as the parser, it reaches this point and from there on just closes the open tags it has, leaving the information tables out of the returned object.
Then, when a table of class highLines is searched for, we just take the first result, but there are no results.
The error in heroku logs:
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: checking gush 30727
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true
2013-12-08T23:09:11.482607+00:00 app[scheduler.7645]: HTML new, inserting data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: [2013-12-08 23:09] ERROR: horse: IndexError: list index out of range
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: Traceback (most recent call last):
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/.heroku/python/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: rv = job.perform()
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/.heroku/python/lib/python2.7/site-packages/rq/job.py", line 328, in perform
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: self._result = self.func(*self.args, **self.kwargs)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/tools/scrapelib.py", line 139, in scrape_gush
2013-12-08T23:09:11.537331+00:00 app[scheduler.7645]: [2013-12-08 23:09] INFO: worker: *** Listening on high, default, low...
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: data = extract_data(html)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: File "/app/tools/scrapelib.py", line 49, in extract_data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: table = s("table", "highLines")[0]
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: IndexError: list index out of range
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]:
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: [2013-12-08 23:09] DEBUG: horse: Invoking exception handler <bound method Worker.move_to_failed_queue of <rq.worker.Worker object at 0x19c7290>>
2013-12-08T23:09:11.518779+00:00 app[scheduler.7645]: [2013-12-08 23:09] WARNING: horse: Moving job to failed queue.
Seems like a solution could be changing the parser BeautifulSoup uses from lxml to html5lib. It seems to work for me so far; still looking into it.
Right now, though, no new data is being fetched.
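A minimal illustration of the proposed fix: let BeautifulSoup use html5lib, which tolerates the stray closing tr tag, instead of lxml. Requires `pip install html5lib`; the real extract_data() differs, this is just a sketch:

```python
# Sketch only: parse the MMI page with html5lib and fail loudly if the
# highLines table is missing, instead of raising IndexError in the worker.
from bs4 import BeautifulSoup

def find_plans_table(html):
    s = BeautifulSoup(html, 'html5lib')   # was: BeautifulSoup(html, 'lxml')
    tables = s('table', 'highLines')
    if not tables:
        raise ValueError('no "highLines" table found - has the MMI page changed?')
    return tables[0]
```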
(note to self)
Some solutions in this blog post, including pinging from the New Relic free plan on Heroku, using a Heroku one-up dyno, and other solutions.
I figure we should require that each new PR passes the tests and if necessary adds its own tests. Unfortunately as a Python testing noob I'm not sure how to do this myself, let alone explain to new contributors :)
@alonisser could you add some basic info on this to CONTRIBUTING.md? Just some URLs to info on that would be fine.
(Also a personal issue of mine: when running nosetests I get ERROR: Failure: ImportError (No module named pymongo), though it is installed in my venv. Any pointers on what I'm doing wrong?)
.noserc file to specify where the package to be tested is
pip install coverage
replaces #60
Current response:
[
    {
        "gush_id": "28046",
        "last_checked_at": {
            "$date": 1390647702698
        }
    },
    ...
]
Suggested:
[
    {
        "gush_id": "28046",
        "plan_dates": [1551411, 123123131, 1231231],
        "<status x>": 3,
        "<status y>": 4
    },
    ...
]
status_x, status_y are plan statuses like ״תוכנית בהפקדה״ ("plan in deposit"), etc. (GitHub's Hebrew support is not great). The keys are determined by the plan statuses and are not known in advance.
plan_dates should only include plans that are 5 years old or younger.
@shevron what do u think?
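A rough sketch of how the suggested response could be built, assuming a pymongo db where plans have gushim, status and a millisecond day timestamp; all field names are assumptions about the schema:

```python
# Sketch only: per-gush summary with status counts and recent plan dates.
import time
from collections import Counter
from pymongo import MongoClient

db = MongoClient()['opentaba']
FIVE_YEARS_MS = 5 * 365 * 24 * 60 * 60 * 1000

def gush_summary(gush_id):
    cutoff = int(time.time() * 1000) - FIVE_YEARS_MS
    statuses = Counter()
    plan_dates = []
    for plan in db.plans.find({'gushim': gush_id}):
        statuses[plan['status']] += 1        # keys are the raw status strings
        if plan.get('day', 0) >= cutoff:
            plan_dates.append(plan['day'])   # only plans 5 years old or younger
    summary = {'gush_id': gush_id, 'plan_dates': sorted(plan_dates)}
    summary.update(statuses)
    return summary
```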
We need a system for handling street addresses that can't be automatically geocoded:
Currently we scrape all the plan info from the gush results list page. If we start scraping the plan page (e.g. this) we'll be able to get the plan's street address, not just the gush ID.
This could be useful for a number of things.
The downside is that this adds a lot of requests to the scraping process. The trick is to apply it just to relevant plans - e.g. only plans from 2011 onwards - and to make sure we don't run a request for plans whose address was already scraped.
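A sketch of the guard logic only; fetch_plan_page and parse_address are hypothetical helpers (the plan-page HTML still needs to be reverse engineered), and the 2011 cutoff and field names are assumptions:

```python
# Sketch only: request the plan details page just for recent plans that
# don't already have an address.
def maybe_scrape_address(db, plan):
    if plan.get('address'):
        return                                  # already scraped, skip the extra request
    if plan.get('year', 0) < 2011:
        return                                  # only bother with recent plans
    html = fetch_plan_page(plan['plan_id'])     # hypothetical: fetch the plan details page
    plan['address'] = parse_address(html)       # hypothetical: pull the street address
    db.plans.save(plan)
```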