
Comments (21)

knaaptime commented on June 15, 2024

this is also why dash is a good option

from geosnap.

suhanmappingideas commented on June 15, 2024

The performance issue of OSNAP is not because of CGI. I ran the code below as a plain Python script (no CGI) on Tobler with a command like: python osnap_example_noCGI.py and measured the duration. It took 20 seconds to get the response. Please note that the code below does not contain any CGI, so the slowdown has nothing to do with CGI.

============================================================
#!/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/bin/python
# -*- coding: UTF-8 -*-

import json
import os

import numpy
import pandas as pd

# point quilt at the local package store before importing osnap
os.environ['QUILT_PACKAGE_DIRS'] = "/home/suhan/public_html/cgi-bin/osnap/quilt_packages"

import osnap

Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year == 1990]['p_hispanic_persons']
print(out_data)

========================================================
Also, when I ran the code below (now it is CGI) on the web, it took 20 seconds, too. This time I ran it like this: http://osnap.cloud/~suhan/osnap/python/osnap_example.py

===================================================================
#!/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/bin/python
# -*- coding: UTF-8 -*-

# enable debugging
import cgitb
cgitb.enable()

import json
import os

import numpy
import pandas as pd

# point quilt at the local package store before importing osnap
os.environ['QUILT_PACKAGE_DIRS'] = "/home/suhan/public_html/cgi-bin/osnap/quilt_packages"

import osnap

print("Content-type: text/html\n\n")

Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year == 1990]['p_hispanic_persons']
print(out_data)

===============================================================

On the other hand, my server-side program, which includes CGI, takes less than 1 second.
Try here: http://sarasen.asuscomm.com/LNE/python/getLNAnalysis.py?&year=1990&state=01%20IL&metro=10700&codes=1-03%20hispXX

My server-side program that uses CGI does not slow down when the data are queried from the web. So the performance issue of OSNAP is not because of CGI. I guess the delay comes from the pandas DataFrame (not sure).

I know that Django is faster than CGI and newer than CGI. But the difference between the two should be on the order of 1 second per query. A 1-second improvement on every query might be a big deal, but I don't think that changing CGI to Django is the top priority for improving the performance of OSNAP, because the problem does not come from CGI.


sjsrey commented on June 15, 2024

If I run things locally, I'm seeing like 15 seconds (not minutes):

(screenshot of the local timing output, 2019-03-11)

But let's touch base on the issue during today's scrum.


suhanmappingideas commented on June 15, 2024

Sorry, I was trying to write 20 seconds, not minutes. Yes, it took 20 seconds. 20 seconds is a long time for users to wait to see a small choropleth map, and when we visualize the map, the visualization time will be added on top. My version (not osnap) with CGI takes less than 1 second. Let's talk about this issue today.


knaaptime commented on June 15, 2024

my primary motivation for raising this issue was actually the first bullet

it is a script-based interface, so each time a script is called it spawns a new Python interpreter (which means programs can't really be dynamic)

with the CGI interface, each interaction with osnap is written as an individual script and executed as its own Python process. So there's no way to persist user state or hold data in memory for the user to explore with different visualizations, etc.

We might be able to get around some of that by writing out temp files and such, but that has its own performance cost, and why reinvent the wheel? So it probably makes sense to move to a more dynamic framework anyway, and hopefully we should get a small speed boost for free
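The no-persistence point is easy to demonstrate: under CGI each request is a fresh interpreter, so any module-level cache is rebuilt from scratch every time. A minimal stand-in, using two throwaway subprocesses in place of two CGI requests (the CACHE dict and the script body are invented for illustration):

```python
import subprocess
import sys
import textwrap

# Under CGI, every request runs a script like this in a brand-new
# interpreter, so module-level state (a cached GeoDataFrame, say)
# never survives from one request to the next.
script = textwrap.dedent("""
    CACHE = {}
    CACHE['tracts'] = 'expensive load happens here'  # paid on EVERY run
    print(len(CACHE))
""")

# Two "requests" = two separate processes: the second one starts cold.
out1 = subprocess.run([sys.executable, "-c", script],
                      capture_output=True, text=True).stdout
out2 = subprocess.run([sys.executable, "-c", script],
                      capture_output=True, text=True).stdout
assert out1 == out2 == "1\n"  # the cache was rebuilt from scratch both times
```

A long-running server process, by contrast, would populate that cache once and reuse it for every subsequent request.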


knaaptime commented on June 15, 2024

the performance difference between osnap and the custom implementation comes from the fact that osnap has to convert WKB into a GeoDataFrame each time a new Community is instantiated--which is expensive. The custom webapp has GeoJSON sitting ready in a database, so it just needs to run a quick filter.

There are some straightforward ways we can speed up that process so that osnap will run faster on the web. But this is likely to always be a problem with the CGI interface, because you can't persist the geometries in memory, so you need that expensive conversion each time you make a CGI call.
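The expensive step described here (WKB bytes to geometry, repeated for every row in the tract table) looks roughly like this in miniature; the coordinates and variable names below are invented for illustration:

```python
from shapely import wkb
from shapely.geometry import Point

# One point round-tripped through hex-encoded WKB, the storage format
# discussed above. osnap applies a conversion like this to every row of
# the national tract table, which is what makes instantiating a
# Community expensive at scale.
hexed = Point(-86.2, 34.3).wkb_hex   # geometry -> hex WKB string
geom = wkb.loads(hexed, hex=True)    # hex WKB -> geometry (slow, row by row)
assert (geom.x, geom.y) == (-86.2, 34.3)
```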


suhanmappingideas commented on June 15, 2024

I agree that we would be better off using Django for the things we develop in the future. At this point, OSNAP (as a Python script) should run locally at a reasonable speed. Whether it is the first run or not, every query should take less than or around 1 second. Once we achieve that, I can move on to Django or whatever we want. Developing a Django version with the current OSNAP does not make sense to me, because we won't be able to use it. I am not sure the architecture of OSNAP can ever beat the speed of a relational database.

Another solution is to make a batch program (in Python) that pulls data from OSNAP and automatically creates a relational database with the current schema. But that is different from our original plan.


knaaptime commented on June 15, 2024

Another solution is to make a batch program (in Python) that pulls data from OSNAP and automatically creates a relational database with the current schema. But that is different from our original plan.

this is essentially what is happening now. When you call osnap with CGI, it reads in a parquet file of all the tracts in the US, converts them into a GeoDataFrame, then runs a query to filter the data you selected. The queries execute in well under a second if the geoms are already in memory--and again, that's something we can enable easily. But creating the spatial database and loading the geometries into memory takes a few seconds, and with the CGI structure we will never get around that bottleneck, because it has to happen on every call.

I am not sure the architecture of OSNAP can ever beat the speed of a relational database

we will never beat the speed of a relational database. But most users will not have an RDBMS at their disposal, and the web is only one interface to osnap, so we need to make sure the web front end can sit on top of more general data structures. Otherwise we are forced to write a ton of custom code for a section of the package that users rarely touch.
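The "fast once loaded" part of this claim is easy to check with a toy table; the column names and sizes below are invented stand-ins for the national tract file:

```python
import time
import pandas as pd

# A stand-in for the national tract table that osnap reads from parquet
# (geoid/cbsafips column names are invented for illustration).
n = 200_000
tracts = pd.DataFrame({
    "geoid": [f"{i:011d}" for i in range(n)],
    "cbsafips": ["10700" if i % 400 == 0 else "99999" for i in range(n)],
})

# The filter itself -- the only work left once the table is in memory --
# runs in a fraction of a second even on 200k rows.
t0 = time.perf_counter()
albertville = tracts[tracts.cbsafips == "10700"]
elapsed = time.perf_counter() - t0

assert len(albertville) == 500
assert elapsed < 1.0
```

Building `tracts` in the first place is the analogue of the expensive load; only the filter is repeated per query.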


suhanmappingideas commented on June 15, 2024

Maybe I didn't get your point; I feel like we are talking about different things. But the program that I ran has nothing to do with CGI. I did not even import cgi. I ran the code below locally and it took 20 seconds. My understanding is that, regardless of CGI or Django, it takes 20 seconds for this small query.

============================================================
#!/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/bin/python
# -*- coding: UTF-8 -*-

import json
import os

import numpy
import pandas as pd

# point quilt at the local package store before importing osnap
os.environ['QUILT_PACKAGE_DIRS'] = "/home/suhan/public_html/cgi-bin/osnap/quilt_packages"

import osnap

Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year == 1990]['p_hispanic_persons']
print(out_data)

========================================================


knaaptime commented on June 15, 2024

right now you're calling a single Python script, and the processing time is happening mostly in this line:

Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")

which does two things:

  1. reads the spatial data from a file and converts it to a dataframe -- this part takes time
  2. queries and joins two dataframes to select the data for Albertville (FIPS 10700) -- this part is fast

we can make the instantiation of Community faster by speeding up step 1. The way we do that is to load all the geometries into memory when you first import osnap. Then, each time a user instantiates a Community, they're really only executing step 2, which takes less than a second.

But you will always need to perform step 1 at least once. So with CGI (or in your current implementation, where you're running a single Python script, which is the same thing) that speedup never materializes, because there's no way to load the geometries ahead of the user querying the data.
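One way to get the "load once, then only run step 2" behavior in a long-lived process is a module-level cache. This is a minimal sketch with a sleep standing in for the real parquet-to-GeoDataFrame I/O; the function names are invented, and note that under CGI the process (and its cache) dies after every request:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1)
def load_tracts():
    """Step 1: the expensive one-time read. A sleep stands in for the
    multi-second parquet -> GeoDataFrame conversion."""
    time.sleep(0.2)
    return {"10700": "Albertville geometries"}

def community(cbsafips):
    """Step 2: the cheap filter -- fast once the data is in memory."""
    return load_tracts()[cbsafips]

# The first call pays the load cost; later calls hit the in-memory cache.
t0 = time.perf_counter(); community("10700"); first = time.perf_counter() - t0
t0 = time.perf_counter(); community("10700"); second = time.perf_counter() - t0
assert first >= 0.2
assert second < first
```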


suhanmappingideas commented on June 15, 2024

I see your point. Does it mean that the user has to wait about 20 seconds the first time, even if the user only wants to see a small choropleth map?


knaaptime commented on June 15, 2024

to do anything related to spatial data (including plotting it), osnap will always need to load the data into memory once, which takes ~15 seconds. That could essentially be hidden from the user by ensuring that when they access the web platform, there's already a Python kernel ready with the geoms in memory.

(it's also worth pointing out that a lot of this discussion becomes moot if we allow users to upload and use their own data, in which case these I/O bottlenecks become unavoidable...)


suhanmappingideas commented on June 15, 2024

I think the reason I got confused was that I had the impression Eli was saying the performance problem comes from CGI. Now I clearly see that the performance problem arises when I run the code below as a Python script. (Again, it has nothing to do with CGI.) We know that OSNAP data is cached after the first run in Jupyter, but that is not the case when we run OSNAP locally as a Python script (see the code below). Please let me know if caching it is possible.

So I have three concerns now.

  1. I think there will be users who want to run OSNAP from a Python script on their desktop. In that case, the user will have to wait more than 20 seconds for every query, since the data is not cached after the first one.

  2. Django might provide the effect that Jupyter does: the data might be cached after the first run under Django, as it is under Jupyter. I think it is worth trying Django if we have enough time. My worry is that I am 50:50 on it; Jupyter may turn out to be the only environment in which OSNAP data can be cached.

  3. If web developers must use Django in order to use OSNAP (since OSNAP data must be cached for reasonable performance), they will not be able to add OSNAP modules to code that is already written on a CGI platform. So OSNAP will not be widely adopted by web developers.

In terms of solutions, I was thinking that we could develop another version of OSNAP that utilizes a relational database. One motivation is that the current architecture of OSNAP will find it hard to beat the speed of a relational database. The way to use OSNAP would be the same between the two versions, but the relational-database version would query the data from the relational database. I was imagining a library that could be used just like OSNAP, with the same call such as osnap.data.Community(source="ltdb", cbsafips="10700"), but that brings the data from PostGIS.

The downside of the relational-database version is that users would have to install a database on their machine (either desktop or server). But I think this version could be widely used by web developers; there will be some who want to add an OSNAP module to a system already written on a platform other than Django. On the other hand, the current version of OSNAP will be widely used by Jupyter users, without a database.

But the downside of this idea is that whenever we change the OSNAP library, we would need to do the same thing in two different versions, and it will take a lot of time to develop the relational-database version.

==Code (this code is not in the CGI bin)=======================================
import timeit

import json
import os

import numpy

# point quilt at the local package store before importing osnap
os.environ['QUILT_PACKAGE_DIRS'] = "/var/www/quilt_packages"

t1 = timeit.default_timer()
import osnap
import pandas as pd

Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year == 1990]['p_hispanic_persons']
t2 = timeit.default_timer()

print(out_data)
print(t2 - t1)
==The result======================================================
(osnap) suhan@tobler:~/public_html/Ex$ python osnap_example_noCGI.py
/home/suhan/public_html/osnap/osnap/osnap/data/data.py:44: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['geometry'] = df.wkb.apply(lambda x: wkb.loads(x, hex=True))
/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/lib/python3.7/site-packages/pandas/core/frame.py:3697: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
errors=errors)
/home/suhan/public_html/osnap/osnap/osnap/data/data.py:41: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['geometry'] = df.wkt.apply(wkt.loads)
geoid
01095030100 1.0
01095030201 1.0
01095030202 1.0
01095030300 0.0
01095030401 0.0
01095030402 0.0
01095030500 0.0
01095030600 0.0
01095030701 0.0
01095030702 0.0
01095030801 1.0
01095030802 1.0
01095030902 1.0
01095030903 1.0
01095030904 1.0
01095031000 1.0
01095031100 0.0
01095031200 0.0
Name: p_hispanic_persons, dtype: float64
20.08569398149848


sjsrey commented on June 15, 2024

rdb vs pandas


suhanmappingideas commented on June 15, 2024

More concerns:

If users (mostly web developers) must use Django to use OSNAP, they will not be able to add OSNAP to code that is already written on a CGI framework. For example, I am representative of many users: I originally hoped to simply drop OSNAP into my current framework (which is all Python script), but in reality I am now in the position of needing to replace the whole framework with Django (and I am not sure Django will even be a solution). It would be really nice for users like me to add OSNAP to their current framework without performance issues. In addition, even if I somehow manage to build a cached version on Django, it would require me to load OSNAP asynchronously at the beginning of the website load, which is not easy to achieve and would require many changes on both the client side and the server side. If I do not make this change, the user will experience a 20-second wait whenever they try to see the map for the first time on the web.

I think there will be many users (among non-web developers) who want to add OSNAP to a Python script and run it (not in Jupyter) on their desktop. In this case, the users will have to wait 20 seconds for every query.

My overall impression is that OSNAP is optimized for Jupyter users, and it is really easy to use in the Jupyter environment because users can run it without installing a database (which is a big merit). But this framework is not so friendly for Python users who run their code as scripts, either on desktops or servers, because of the 20-second wait on every query.


knaaptime commented on June 15, 2024

to clarify a couple points:

  1. I don't mean to suggest that we need to go with Django--I'm sure there's a newer, sexier option. I just think it's important to keep in mind that we will eventually need to move away from CGI. It's just not flexible enough for the dashboard-like interface we need (which is why dash is another important option to consider)--but it is useful for planning out the calls to osnap the web interface needs to make and the kinds of structures osnap will return that need to be styled in the browser.

  2. it's important to be clear that the bottleneck in osnap is not from querying data (the link above shows pandas is actually considerably faster than postgres). The slowdown comes from reading data from flat files into memory. osnap is an analytics library, not a data repository, so a small slowdown when reading a huge data file into memory is acceptable--especially because we can optimize that part elsewhere in the library, and because users working with other data will have a delay when they upload new data anyway (keep in mind we can't provide direct access to the LTDB data, even in a webapp). In short, there are several reasons I don't think we should be fixated on the 15-second load time while we are prototyping the web interface. I'm much more concerned about choosing an infrastructure that gives us the best layout and navigation tools and has the interactivity that CGI lacks.


knaaptime commented on June 15, 2024

also, web developers can consume osnap however they like. Just because we're building a visualization front end using a particular library doesn't mean other developers couldn't build a CGI-based app using osnap.


suhanmappingideas commented on June 15, 2024

Eli. Thank you for your clarification.

  1. I agree that we need to move away from CGI for the programs we will develop in the future.
  2. I understand the performance bottleneck comes from reading data. Also, my understanding is that visualizing LTDB data in maps and charts on the web is fine as long as we do not distribute the data out of our database; in the web app version, we do not need to ask users to upload LTDB data.
    I still prefer to start working on the replacement of the OSNAP modules once the performance issue is resolved, because we do not know whether Django or another framework will be needed to tackle the performance bottleneck.

Eli and Serge
The core issue is that I do not know how to cache the LTDB data in OSNAP. The caching never happens when I run OSNAP as a Python script. Without resolving that, it will take at least 20 seconds for every query, since the whole LTDB dataset is read for every request. It would be really nice if someone could come up with a good solution, but at this point I do not know how long it will take me to find one; since I do not know the architecture of OSNAP, it is hard for me to come up with a solution quickly. Even though the performance of pandas is generally better than Postgres, I do not see that benefit in OSNAP, because OSNAP currently spends 20 seconds reading the whole dataset for every query in a Python script.

If you think that I should still work on replacing the current modules that query LTDB data with the OSNAP modules without resolving the performance issue, I can move on to it as soon as I wrap up the client-side programming. (That will still be a few weeks from now, because it is more efficient to complete what I am doing before moving to the next thread.)

But the programs I can make at this point will take at least 20 seconds plus visualization time for every query, since OSNAP will read the whole LTDB dataset each time. (Again, I currently don't know how to cache the LTDB data at the beginning of the program.) As a web developer, I personally feel this would be building something that will never be used because of its slowness. In that case, I would rather invest my time in other threads than develop a server-side program that is much slower than the current one; honestly, I do not have much motivation to work on it as it is. That is also why I would like to start replacing the current modules with OSNAP once I see that the LTDB data can be cached from a Python script. But you may have a different perspective, which could work out (hopefully), so I am also okay doing whatever you would like to do.

I think I have explained the current bottleneck of OSNAP when it runs as a Python script, and what will happen after we replace the current server-side program with OSNAP: it will take at least 15 seconds per query unless we come up with a good way to cache the LTDB data at the beginning. Even so, I can still add OSNAP to my current server-side program if that is what you would like; it can be done around the March deadline. Moving from CGI to Django will add more time (and we don't know whether Django is a solution).

Otherwise, I can explore some solutions that enable caching the LTDB data, but in that case I do not know whether I will find a good one; I might not, even after a few weeks of exploration. Anyway, the decision is up to you.


knaaptime commented on June 15, 2024

I think the root of the problem is that there's no way to cache the data when working with osnap through the CGI interface. With CGI, each Python script executes independently, so there's no way to keep the geometries in memory where they would be available to different Python processes. One thing we can do to improve this is provide functions that allow osnap to pull data from different backends. Right now, when you instantiate a Community, it automatically pulls tract geometries from a parquet file, which is fast enough for desktop analysis but leaves room for improvement in the browser. We could, for example, add an argument to Community that accepts postgres credentials and grabs geoms from a database, to try to speed up the data ingestion/conversion.

A better way to address this pain point, I would argue, is to use a different web framework that keeps a Python process alive across user sessions (rather than spawning a new process each time a user executes a script). We could still provide functionality to pull geoms from a database, but if we aren't running each Python process independently, it will greatly improve the application's performance (and generally make things easier on our end, because we can persist objects in memory to re-query, build additional plots, etc.).
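A rough sketch of the "keep one Python process alive" idea, using only the standard library rather than any particular framework (the endpoint shape, TRACTS dict, and handler name are invented; a real implementation would use Django, dash, or similar):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Loaded ONCE, when the server process starts, then shared by every
# request -- this dict stands in for the expensive GeoDataFrame.
TRACTS = {"10700": {"p_hispanic_persons": [1.0, 0.0, 1.0]}}

class CommunityHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /?metro=10700 -> filter the in-memory data, no reload
        fips = parse_qs(urlparse(self.path).query).get("metro", [""])[0]
        body = json.dumps(TRACTS.get(fips, {})).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# To serve: HTTPServer(("127.0.0.1", 8000), CommunityHandler).serve_forever()
```

Because the process outlives individual requests, the load cost is paid once at startup instead of on every call, which is exactly what CGI cannot offer.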

But the important point, I think, is that we just shouldn't be too bothered about speed at this stage. There are several opportunities to improve performance, but right now we can make the most progress by understanding how a user will interact with osnap on the web (i.e. what the UI needs to look like and what it needs to accomplish) and whether we want to support certain kinds of workflows (like critical plotting functions) that aren't yet implemented. What matters most right now is getting the infrastructure in place and understanding how osnap hooks into it. Once we have that, we can work on optimizing the connections between osnap and the web.


suhanmappingideas commented on June 15, 2024

My finding is that the LTDB data is never cached when we run OSNAP as a Python script, regardless of desktop or web. Even on the desktop, OSNAP run as a Python script does not cache the LTDB data. Jupyter is the only platform where the LTDB data can be cached.


knaaptime commented on June 15, 2024

In the current implementation the data are never cached, even in Jupyter. Instead, each time a user instantiates a Community, the raw data are read in from a flat file.

This is an intentional design decision that increases processing time but decreases memory overhead. It is not required, and there are many ways we could cache the shape data in memory outside of jupyter. We can go over this more in the meeting today

