opentaba-server's People

Contributors

alonisser, florpor, nathanil, niryariv, orenkishon, shevron

opentaba-server's Issues

Add license

As a good open source project, we should have a license.

  1. Decide which license to use: MIT, BSD, GPLv2, or other?
  2. Add a LICENSE.md file to the project root

Data refreshment at heroku

There seems to be a difference between the data from the current scraping and the data returned by the JSON API on Heroku.

Feature: add email subscription for taba changes for a specific address

This enables us to utilize all of the taba information, not only Jerusalem where we have a map (bringing in Tel Aviv and the rest of Gush Dan).

Algorithm:

  1. Add an "update-time" time field to the gush model.
  2. Add a subscribers collection, including: name / email / [subscribed gushim - an array of {gush_id: update-time}].
  3. When someone subscribes, they receive an email with the current taba and the update-time field is set to the current date (or the date of the last DB update).
  4. The same email can't subscribe to more than 3 gushim (to prevent abuse or commercial use).
  5. After every DB update (scrape), a job runs over all email addresses, checks each email/gush_id's update-time against the last update of that gush, and if it is older than the last update, sends the latest update.
  6. To begin with we could send all the relevant plans on every update; after that, figure out how to send only updated plans.

Maybe the collection should be reversed (a table of gush: [email subscribers]).
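A minimal sketch of what the subscribers collection and the post-scrape notification job could look like, assuming the MongoDB/pymongo (2.x API) setup the project already uses; the 'subscribers' collection, its fields, and the plan lookup are assumptions for illustration, not existing code:

# Sketch only, assuming the pymongo 2.x API used elsewhere in the project.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()['opentaba']   # DB name is an assumption

def subscribe(email, gush_ids):
    """Subscribe an email address to up to 3 gushim."""
    if len(gush_ids) > 3:
        raise ValueError('an email may subscribe to at most 3 gushim')
    now = datetime.utcnow()
    db.subscribers.update(
        {'email': email},
        {'$set': {'gushim': [{'gush_id': g, 'update_time': now} for g in gush_ids]}},
        upsert=True)

def notify_all(send_email):
    """After a scrape, send each subscriber the gushim updated since their last email."""
    for sub in db.subscribers.find():
        for entry in sub['gushim']:
            gush = db.gushim.find_one({'gush_id': entry['gush_id']})
            if gush and gush.get('update_time') and gush['update_time'] > entry['update_time']:
                plans = list(db.plans.find({'gush_id': entry['gush_id']}))
                send_email(sub['email'], entry['gush_id'], plans)
                entry['update_time'] = gush['update_time']
        db.subscribers.save(sub)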

More parts:

  • Client side: add a subscription modal with some email validation scheme (a regex + maybe TFA with another service).
  • An email-sending lib/service.
  • HTML email design.

I think this can be the killer app for open taba

Switch to a plan-centered instead of gush-centered data model

Currently the data model is gush-centered: as it works now, we scrape one gush at a time and then store all of that gush's plans in the DB.

That's a result of the original sprint to get the code working, and it isn't clean. It's already causing some issues: e.g., when a plan exists in several gushim it is stored several times. Since MMI has a weird protocol where a plan appears in ALL gushim at certain stages, we have plans duplicated ~500 times, requiring the whole blacklisting system etc.

The idea here is to switch to a plan-centric model: the plan_id is the index, each plan appears once, and it has an array of the gushim it belongs to.
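For illustration only, a plan-centric document and upsert could look roughly like this (field names are assumptions based on fields mentioned in other issues here; pymongo 2.x style update):

# -*- coding: utf-8 -*-
# Sketch: a plan-centric document keyed by the plan number, with the gushim
# it belongs to stored as an array on the plan itself.
from pymongo import MongoClient

db = MongoClient()['opentaba']   # DB name is an assumption

plan = {
    'number': u'מי/994',
    'status': u'פרסום תוקף ברשומות',
    'essence': u'תוספת קומות לבניין קיים',
}

# A plan scraped from several gushim is stored once; each gush it is seen in
# is just added to its gushim array.
db.plans.update(
    {'number': plan['number']},
    {'$set': plan, '$addToSet': {'gushim': {'$each': ['30321', '30727']}}},
    upsert=True)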

Add option to scrape without using Redis queue

On some local installations it would be very convenient to allow scraping without requiring the Redis queue, and it should also be easy to achieve. It would be nice to add an option to run the scraper without Redis.
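A hedged sketch of what such an option could look like: a flag that runs scrape_gush in-process instead of enqueueing it on the rq/Redis queue. The CLI shape is an assumption; only scrape_gush (whose exact signature may differ) and the rq Queue API are taken from the existing code and libraries.

# Sketch: run the scrape directly when --no-queue is passed, otherwise enqueue
# it on the Redis-backed rq queue as today.
import argparse

from tools.scrapelib import scrape_gush

parser = argparse.ArgumentParser()
parser.add_argument('-g', '--gush', required=True)
parser.add_argument('--no-queue', action='store_true',
                    help='run the scrape in-process, without Redis/rq')
args = parser.parse_args()

if args.no_queue:
    scrape_gush(args.gush)
else:
    from redis import Redis
    from rq import Queue
    Queue(connection=Redis()).enqueue(scrape_gush, args.gush)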

Error in MMI site when clicking some plan links

I noticed that when clicking some plan links from opentaba.info I get an error message in the MMI site instead of the expected plan details page.

For example, go to http://opentaba.info/#/gush/30321 and click the 2nd and 3rd plans (you can try others); some will open the MMI site with an error message. This is the clicked URL (generated by opentaba):

http://mmi.gov.il/IturTabot/taba4.asp?kod=3000&MsTochnit=%u05DE%u05D9/994 

I assume this happens due to a broken link we generate, but I could be wrong. Maybe it's just a bug in the MMI site (I wouldn't be surprised).

don't filter blacklist in feed

Currently we're filtering out plans with more than X gushim for the map display, which makes sense, but in the feed we should show all of them and link to the details page.

Problem with local testing and logger

@shevron maybe you can help (since it's connected to your code):

When I run the tests locally, during the setup steps I get this error when I run python worker.py:

No handlers could be found for logger "tools.scrapelib"
[2014-01-06 21:45] ERROR: horse: SystemExit: 1
Traceback (most recent call last):
  File "/home/alon/.virtualenvs/opentaba-server/local/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
    rv = job.perform()
  File "/home/alon/.virtualenvs/opentaba-server/local/lib/python2.7/site-packages/rq/job.py", line 328, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/home/alon/Projects/opentaba-server/tools/scrapelib.py", line 132, in scrape_gush
    html = get_gush_html(gush_id)
  File "/home/alon/Projects/opentaba-server/tools/scrapelib.py", line 32, in get_gush_html
    exit(1)
  File "/home/alon/.virtualenvs/opentaba-server/lib/python2.7/site.py", line 403, in __call__
    raise SystemExit(code)
SystemExit: 1

Since no scraping is done, some of the local tests also fail. On Travis CI everything works fine, so I'm not sure what the issue is. Any ideas? Directions?
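For what it's worth, the "No handlers could be found for logger" line is Python 2's standard warning when a logger emits a record before any handler has been configured; it is separate from the SystemExit that kills the job. A minimal sketch of silencing the warning, assuming worker.py is the right entry point to do it in (that placement is a judgment call):

# Configure a root handler before tools.scrapelib's logger is used, so its
# messages go to STDERR instead of being dropped with the "No handlers" warning.
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(name)s %(levelname)s: %(message)s')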

Replace debug prints in scrapelib.py and other libraries with proper logging

Currently, the code prints out messages using print() in various places, including tools/scrapelib.py and other "library" files.

Replacing print() calls with Python's logging module will allow us to easily redirect messages to files or STDERR, control verbosity, and automatically add timestamps and origin to messages.
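A sketch of the intended pattern in a library module such as tools/scrapelib.py (the message and function shown are illustrative, not the actual code):

import logging

# One logger per library module, named after the module's import path.
logger = logging.getLogger(__name__)

def scrape_gush(gush_id):
    # was: print "checking gush", gush_id
    logger.info('checking gush %s', gush_id)
    # ... rest of the scrape unchanged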

Standardize Python code style, preferably PEP8

Feel free to close this if everyone else thinks this is silly, but I'm bothered by the non-standard code style in this project. If it's fine by everyone, I would suggest standardizing on PEP8. Actually fixing the code should be quite easy; I can do it automatically.

fix atom feed

The feed is > 512KB (since we read a lot of plans and only then remove the blacklisted ones, reminiscent of old scraper times), so dlvr.it doesn't read it.

Gush status field is not stripped

In many gushim in the DB, the status field has trailing whitespace:

{
  _id: ObjectId("50ec3cb8c11698000730ed0b"),
  status: "פרסום תוקף ברשומות ",
  essence: "תוספת קומות לבניין קיים",
  ...
}

Most likely a .strip() call is missing from https://github.com/niryariv/opentaba-server/blob/master/tools/scrapelib.py#L74, but I'm noting this here as there are no tests for this and I know @florpor is working on an entirely new parser. We should make sure (+ add tests) that gush fields are properly scraped, including whitespace stripping.
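A hedged sketch of the fix plus a regression test; the helper name and exact call site are assumptions, and the real extraction code in tools/scrapelib.py may differ:

# -*- coding: utf-8 -*-

def clean_field(value):
    """Strip surrounding whitespace from a scraped cell value."""
    return value.strip() if value is not None else value

def test_status_is_stripped():
    assert clean_field(u"פרסום תוקף ברשומות ") == u"פרסום תוקף ברשומות"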

Import new Jerusalem gushim

@florpor scraped a more complete map than what we have.

The point of this ticket isn't just to import the map, but to build an automated way of adding a map to the DB - in particular extracting all the Gush IDs, which currently reside both in the DB and in a file on the server.

The result of this should be an "import_map.py" utility, which takes a gushim map (perhaps a URL to that map in the client GitHub repo) and adds it to the project on the server side (and the client, if needed).
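A rough sketch of what import_map.py could do, assuming the map is GeoJSON with a gush-number property per feature; the property name, DB layout, and URL handling are all assumptions:

# Sketch: fetch a gushim map and make sure every gush id in it exists in the DB.
import json
import urllib2

from pymongo import MongoClient

def import_map(url):
    db = MongoClient()['opentaba']                    # DB name is an assumption
    features = json.load(urllib2.urlopen(url))['features']
    for feature in features:
        gush_id = str(feature['properties']['Gush'])  # property name is an assumption
        if not db.gushim.find_one({'gush_id': gush_id}):
            db.gushim.insert({'gush_id': gush_id})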

fix redis max memory issue

Apparently the nightly "scrape -g all" jobs accumulate in Redis and the memory doesn't get cleared.

When the 5MB limit of the free plan is reached, this stops the server from collecting new data properly.

The interim hack is to reinstall Redis To Go every few months, but obviously there should be a way to actually fix this in the code.
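One hedged possibility, if the growth comes from rq keeping finished-job results around: enqueue the scrape jobs with result_ttl=0 so results are discarded as soon as a job finishes. result_ttl is a real rq enqueue parameter; whether it is the actual cause of the memory growth here is an assumption.

from redis import Redis
from rq import Queue

from tools.scrapelib import scrape_gush

q = Queue(connection=Redis())
# Discard the job result immediately instead of letting it sit in Redis.
q.enqueue(scrape_gush, '30727', result_ttl=0)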

parse MMI's new format

@shevron @alonisser @florpor
I just discovered MMI switched to a new format: http://mmi.gov.il/IturTabot2/taba1.aspx

The data is passed as JSON, which theoretically should make our job easier; in reality it's the usual clusterfuck of government + ASPX.

I haven't figured out a way to get the clean JSON (or the data at all) yet; I'll keep at it. Posting here in case one of you has some free time to hack on it meanwhile.

Add "gush near me" information to gushim

Not sure how to implement this.

  1. The first step should be "bordering" gushim - polygons that have a common side (maybe MongoDB's spatial queries can do that for us? @niryariv); see the sketch after this list.
  2. The second step is taking into consideration things like the size of the gush, etc.
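A hedged sketch of step 1, assuming gush polygons were stored as GeoJSON geometries with a 2dsphere index. The 'geometry' field and collection layout are assumptions; note that $geoIntersects also matches polygons that only touch at a point, so some filtering might still be needed.

from pymongo import MongoClient

db = MongoClient()['opentaba']                       # DB name is an assumption
db.gushim.create_index([('geometry', '2dsphere')])

def bordering_gushim(gush_id):
    """Return ids of gushim whose polygon touches or intersects the given gush."""
    gush = db.gushim.find_one({'gush_id': gush_id})
    cursor = db.gushim.find({
        'gush_id': {'$ne': gush_id},
        'geometry': {'$geoIntersects': {'$geometry': gush['geometry']}},
    })
    return [g['gush_id'] for g in cursor]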

maybe @oreniko can take on this issue?

Scraping does not work because mmi html is invalid

I was playing around with the scraping code and couldn't figure out why it's not working.
Apparently MMI's IturTabot page currently serves HTML with a closing tr tag that has no opening one, so when BeautifulSoup parses the HTML with lxml as the parser it reaches that point and from there on just closes the open tags it has, leaving the information tables out of the returned object.
Then, when a table of class highLines is searched for, we just take the first result - but there are no results.
The error in the Heroku logs:

2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: checking gush 30727
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true
2013-12-08T23:09:11.482607+00:00 app[scheduler.7645]: HTML new, inserting data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: [2013-12-08 23:09] ERROR: horse: IndexError: list index out of range
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: Traceback (most recent call last):
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/.heroku/python/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     rv = job.perform()
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/.heroku/python/lib/python2.7/site-packages/rq/job.py", line 328, in perform
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     self._result = self.func(*self.args, **self.kwargs)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/tools/scrapelib.py", line 139, in scrape_gush
2013-12-08T23:09:11.537331+00:00 app[scheduler.7645]: [2013-12-08 23:09] INFO: worker: *** Listening on high, default, low...
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     data = extract_data(html)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/tools/scrapelib.py", line 49, in extract_data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     table = s("table", "highLines")[0]
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: IndexError: list index out of range
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: 
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: [2013-12-08 23:09] DEBUG: horse: Invoking exception handler <bound method Worker.move_to_failed_queue of <rq.worker.Worker object at 0x19c7290>>
2013-12-08T23:09:11.518779+00:00 app[scheduler.7645]: [2013-12-08 23:09] WARNING: horse: Moving job to failed queue.

It seems a solution could be changing the parser BeautifulSoup uses from lxml to html5lib. It seems to work for me so far; still looking into that.
Right now, though, no new data is being fetched.
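A hedged sketch of the proposed change in extract_data (only the parser swap and a guard are shown; the rest of the extraction is unchanged and elided):

from bs4 import BeautifulSoup

def extract_data(html):
    # html5lib repairs the stray closing </tr> instead of truncating the
    # document the way lxml does.
    s = BeautifulSoup(html, 'html5lib')
    tables = s('table', 'highLines')
    if not tables:
        # Fail loudly instead of the bare IndexError seen in the logs above.
        raise ValueError('no highLines table found in gush page')
    table = tables[0]
    # ... the existing row-by-row extraction of `table` continues unchanged
    return table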

post-new scraper checklist

(note to self)

  1. change API URL back to the main site
  2. set scheduler back to late night, consolidate to one task

add explanation on test to CONTRIBUTING.md

I figure we should require that each new PR passes the tests and, if necessary, adds its own tests. Unfortunately, as a Python testing noob, I'm not sure how to do this myself, let alone explain it to new contributors :)

@alonisser could you add some basic info on this to CONTRIBUTING.md? Just some URLs to info on that would be fine.

(also a personal issue of mine: when running nosetests I get ERROR: Failure: ImportError (No module named pymongo) - though it is installed in my venv. Any pointers on what I'm doing wrong?)

draft for new gushim.json API call

current response:

[
  {
    "gush_id": "28046",
    "last_checked_at": {
      "$date": 1390647702698
    }
  },
..
]

Suggested:

[
  {
    "gush_id": "28046",
    "plan_dates": [1551411, 123123131, 1231231]
    "<status x>": 3,
    "<status y>": 4,
  },
...
]

status_x, status_y are statuses like ״תוכנית בהפקדה״ etc. (GitHub's Hebrew support is not great). The keys are determined by the plan statuses, so they are not known in advance.

plan_dates should only include plans 5 years or younger.
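A hedged sketch of building one entry of the suggested response. How plan dates and statuses are stored, and whether the status counts should also be limited to the last 5 years, are open questions; the names here are assumptions.

import time
from collections import Counter

FIVE_YEARS_MS = 5 * 365 * 24 * 3600 * 1000   # matches the millisecond $date values above

def gush_summary(gush_id, plans):
    """Build one element of the proposed gushim.json response."""
    now_ms = int(time.time() * 1000)
    recent = [p for p in plans if now_ms - p['date'] <= FIVE_YEARS_MS]
    summary = {'gush_id': gush_id,
               'plan_dates': [p['date'] for p in recent]}
    summary.update(Counter(p['status'] for p in plans))   # keys are the raw status strings
    return summary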

@shevron what do u think?

Failed geocoding handler

We need a system for handling street addresses that can't be automatically geocoded:

  1. On geocoding failure, send an email to a human handler (for Jerusalem, [email protected]).
  2. Create a simple UI allowing the handler to change the address, check if it geocodes OK, and repeat if necessary.
  3. Once the handler is done, use the supplied address (or let the handler mark it "un-geocode-able" so it won't be sent again).
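A rough sketch of the failure hook for step 1; geocode(), send_email(), and the failed_geocodes collection are placeholders for whatever geocoding service, mailer, and storage end up being used:

def geocode_or_flag(db, plan_number, address, geocode, send_email, handler_email):
    """Geocode an address; on failure, record it and notify the human handler."""
    result = geocode(address)
    if result is not None:
        return result
    db.failed_geocodes.insert({'plan': plan_number, 'address': address,
                               'resolved': False})
    send_email(handler_email, 'geocoding failed',
               'could not geocode "%s" for plan %s' % (address, plan_number))
    return None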

scrape plan address

Currently we scrape all the plan info from the gush results list page. If we start scraping the plan page (e.g. this) we'll be able to get the plan's street address, not just the gush ID.

this could be useful for a number of things:

  1. displaying plans on the map in their exact (or near exact) location - in the longer run perhaps even getting rid of the gushim map requirement
  2. adding the plan address to the feed - making the feed a lot more interesting for the casual user

The downside is that this adds a lot of requests to the scraping process. The trick is to apply it only to relevant plans - e.g. only plans from 2011 onwards - and to make sure we don't run a request for plans whose address was already scraped.
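A hedged sketch of that gating logic; the 'address' and 'year' fields are assumptions about how plans are stored:

def plans_needing_address(db, since_year=2011):
    """Yield plans recent enough to care about whose address was never scraped."""
    for plan in db.plans.find({'address': {'$exists': False}}):
        if plan.get('year', 0) >= since_year:
            yield plan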
