Comments (21)
this is also why dash is a good option
from geosnap.
The performance issue of OSNAP is not caused by CGI. I ran the python script below (no CGI) on Tobler with the command python osnap_example_noCGI.py and measured the duration: it took 20 seconds to get the response. Please note that the code below contains no CGI at all, so it has nothing to do with CGI.
============================================================
#!/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/bin/python
# -*- coding: utf-8 -*-
#enable debugging
import numpy
import json
import os
os.environ['QUILT_PACKAGE_DIRS']= "/home/suhan/public_html/cgi-bin/osnap/quilt_packages"
import osnap
import pandas as pd
Albertville = osnap.data.Community(source="ltdb",cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year==1990]['p_hispanic_persons']
print(out_data)
========================================================
Also, when I ran the code below (this time with CGI) on the web, it also took 20 seconds. This time I ran it like this: http://osnap.cloud/~suhan/osnap/python/osnap_example.py
===================================================================
#!/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/bin/python
# -*- coding: utf-8 -*-
#enable debugging
import cgitb
cgitb.enable()
import numpy
import json
import os
os.environ['QUILT_PACKAGE_DIRS']= "/home/suhan/public_html/cgi-bin/osnap/quilt_packages"
import osnap
import pandas as pd
print("Content-type: text/html\n\n")
Albertville = osnap.data.Community(source="ltdb",cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year==1990]['p_hispanic_persons']
print(out_data)
===============================================================
On the other hand, my server-side program, which uses CGI, takes less than 1 second.
Try here: http://sarasen.asuscomm.com/LNE/python/getLNAnalysis.py?&year=1990&state=01%20IL&metro=10700&codes=1-03%20hispXX
My CGI-based server-side program does not slow down when the data are queried from the web. So the performance issue of OSNAP is not caused by CGI. I suspect the delay comes from building the pandas DataFrame (not sure).
I know that Django is faster than CGI and newer than CGI, but the difference between the two should be on the order of 1 second per query. A 1-second improvement per query might be a big deal, but I don't think switching from CGI to Django is the top priority for improving OSNAP's performance, because the problem does not come from CGI.
from geosnap.
If I run things locally, I'm seeing something like 15 seconds (not minutes).
But let's touch base on the issue during today's scrum.
from geosnap.
Sorry, I meant to write 20 seconds, not minutes. Yes, it took 20 seconds, and 20 seconds is a long time for users to wait just to see a small choropleth map. On top of that, visualization time will be added when we render the map. My version (not osnap) with CGI takes less than 1 second. Let's talk about this issue today.
from geosnap.
my primary motivation for raising this issue was actually the first bullet
it is a script-based interface, so each time a script is called it spawns a new python interpreter (which means programs can't really be dynamic)
with the CGI interface, each interaction with osnap is written as an individual script and executed as its own python process. So there's no persisting the user state or holding data in memory for the user to explore with different visualizations etc.
We might be able to get around some of that by writing out temp files and such, but that has its own performance cost, and why reinvent the wheel? So it probably makes sense to move to a more dynamic framework anyway, and hopefully we should get a small speed boost for free
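The temp-file workaround mentioned above could look roughly like the sketch below: cache the parsed geometries to disk so later CGI calls can skip the expensive parse. This is a pure-stdlib illustration, not osnap's actual code; parse_geometries is a hypothetical stand-in for the WKB-to-GeoDataFrame conversion, and the cache path is arbitrary.

```python
import os
import pickle
import tempfile

# hypothetical cache location shared by successive CGI processes
CACHE_PATH = os.path.join(tempfile.gettempdir(), "osnap_geoms.pkl")

def parse_geometries():
    # stand-in for the expensive WKB -> GeoDataFrame conversion
    return {"01095030100": (34.3, -86.2), "01095030201": (34.4, -86.1)}

def load_geometries():
    # reuse the pickled result if a previous call already parsed the data
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    geoms = parse_geometries()
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(geoms, f)
    return geoms
```

Even then, every request still pays the cost of unpickling the file, which is the "performance cost of its own" mentioned above.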
from geosnap.
the performance difference between osnap and the custom implementation comes from the fact that osnap has to convert WKB into a GeoDataFrame each time a new Community is instantiated, which is expensive. The custom webapp has GeoJSON sitting ready in a database, so it just needs to run a quick filter.
There are some straightforward ways we can speed up that process so that osnap will run faster on the web. But this is likely to always be a problem with the CGI interface, because you can't persist the geometries in memory. So you need that expensive conversion each time you make a CGI call
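For contrast, inside a single long-lived python process the expensive conversion is trivial to memoize. The sketch below uses a hypothetical load_community_geoms stand-in (not osnap's API) to show the pattern that CGI can never benefit from, since every request gets a fresh interpreter with an empty cache.

```python
from functools import lru_cache

PARSE_COUNT = {"n": 0}  # track how often the expensive step actually runs

@lru_cache(maxsize=None)
def load_community_geoms(cbsafips):
    # stand-in for the expensive WKB -> GeoDataFrame conversion;
    # in a long-lived process this body runs once per unique argument
    PARSE_COUNT["n"] += 1
    return ("geoms for", cbsafips)

load_community_geoms("10700")
load_community_geoms("10700")  # second call is served from the in-memory cache
```

Under CGI, both calls would land in separate processes and the cache would be empty each time.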
from geosnap.
I agree that we are better off using Django for the things we develop in the future. At this point, OSNAP (run as a python script) should run locally at a reasonable speed: whether or not it is the first query, every query should take around 1 second or less. Once we achieve that, I can move on to Django or whatever we want. Developing a Django version on top of the current OSNAP does not make sense to me, because we won't be able to use it. I am not sure the architecture of OSNAP can ever beat the speed of a relational database.
Another solution is to write a batch program (in python) that pulls data from OSNAP and automatically creates a relational database with the current database schema. But that is different from our original plan.
from geosnap.
Another solution is to write a batch program (in python) that pulls data from OSNAP and automatically creates a relational database with the current database schema. But that is different from our original plan.
this is essentially what is happening now. When you call osnap with CGI, it reads in a parquet file of all the tracts in the US and converts them into a GeoDataFrame, then runs a query to filter the data you selected. The queries execute in well under a second if the geoms are already in memory, and again, that's something we can do easily. But creating the spatial database and loading the geometries into memory takes a few seconds. With the CGI structure, we will never get around that bottleneck, because it needs to happen with each call.
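The load-versus-query split can be made concrete by timing the two steps separately. This is a toy sketch with stand-in functions, not osnap's real loader; the point is only that the one-time load dominates while the per-request filter is cheap.

```python
import timeit

def load_step():
    # stand-in for reading the parquet file and converting WKB (the slow, one-time part)
    return {str(i): i % 50 for i in range(200_000)}

def query_step(data):
    # stand-in for the per-request filter (the fast part)
    return [k for k, v in data.items() if v == 7]

data = load_step()
t_load = timeit.timeit(load_step, number=1)
t_query = timeit.timeit(lambda: query_step(data), number=1)
print(f"load:  {t_load:.4f}s")
print(f"query: {t_query:.4f}s")
```

With CGI, both timings are paid on every request; with a persistent process, only the second one is.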
I am not sure the architecture of OSNAP can ever beat the speed of a relational database
we will never beat the speed of a relational database. But most users will not have an RDBMS at their disposal, and the web is only one interface to osnap, so we need to make sure the web front-end can sit on top of more general data structures. Otherwise we are forced to write a ton of custom code for a section of the package that users rarely touch
from geosnap.
Maybe I didn't get your point; I feel like we are talking about different things. But the program I ran has nothing to do with CGI. I did not even import cgi. I ran the code below locally and it took 20 seconds. My understanding is that, regardless of CGI or Django, this small query takes 20 seconds.
============================================================
#!/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/bin/python
# -*- coding: utf-8 -*-
#enable debugging
import numpy
import json
import os
os.environ['QUILT_PACKAGE_DIRS']= "/home/suhan/public_html/cgi-bin/osnap/quilt_packages"
import osnap
import pandas as pd
Albertville = osnap.data.Community(source="ltdb",cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year==1990]['p_hispanic_persons']
print(out_data)
========================================================
from geosnap.
right now you're calling a single python script, and the processing time is happening mostly in this line
Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")
which does two things:
- read the spatial data from a file and convert it to a dataframe -- this part takes time
- query and join two dataframes to select the data for Albertville (fips 10700) -- this part is fast
we can make the instantiation of Community faster by speeding up step one. The way we speed that up is to load all the geometries into memory when you first import osnap. Then, each time a user instantiates a Community, they're really only executing step 2, which takes less than a second.
But you will always need to perform step 1 at least once. So with CGI (or in your current implementation where you're running a single python script, which is the same thing) that speedup never materializes, because there's no way to load the geometries ahead of the user querying the data
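The load-once idea can be sketched as a module-level cache, so step 1 runs at most once per python process and every later query pays only the cheap step 2 filter. The module layout and the _load_tracts stand-in below are hypothetical, not osnap's real internals.

```python
# geodata.py -- hypothetical module sketch: the expensive load happens once
# per process, lazily on first use, and is shared by all subsequent queries
_TRACTS = None

def _load_tracts():
    # stand-in for reading the parquet file and converting WKB geometries
    return [
        {"geoid": "01095030100", "msa": "10700", "year": 1990},
        {"geoid": "17031010100", "msa": "16980", "year": 1990},
    ]

def get_tracts(msa):
    global _TRACTS
    if _TRACTS is None:           # step 1: expensive load, once per process
        _TRACTS = _load_tracts()
    return [t for t in _TRACTS if t["msa"] == msa]  # step 2: fast filter
```

In a CGI deployment the process dies after each request, so _TRACTS is always None on entry and the pattern buys nothing; in a persistent process it pays off immediately.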
from geosnap.
I see your point. Does it mean that the user has to wait about 20 seconds the first time, even if they just want to see a small choropleth map?
from geosnap.
to do anything related to spatial data (including plotting it) osnap will always need to load the data into memory once, which takes ~15 seconds. That could essentially be hidden from the user by ensuring that when they access the web platform, there's already a python kernel ready with the geoms in memory.
(it's also worth pointing out that a lot of this discussion becomes moot in the case of allowing users to upload/use their own data, in which case these I/O bottlenecks become unavoidable...)
from geosnap.
I think the reason I got confused was that I had the impression Eli was saying the performance problem comes from CGI. Now I clearly see that the performance problem arises when I run the code below as a python script (again, it has nothing to do with CGI). We know that OSNAP data is cached after the first run in Jupyter, but that is not the case when we run OSNAP locally as a python script (please see the code below). Please let me know if caching is possible there.
So I have three concerns now.
- I think there will be users who want to run OSNAP from a python script on their desktop. In this case, the user will have to wait more than 20 seconds for every query, since the data is not cached after the first one.
- Django might provide the same effect that Jupyter does: the data might stay cached after the first run inside the Django process. I think it is worth trying Django if we have enough time. My worry is that I am 50:50 on it; Jupyter may turn out to be the only environment where OSNAP data can be cached.
- If web developers must use Django in order to use OSNAP (since OSNAP data must be cached for reasonable performance), they will not be able to add OSNAP modules to code that is already written on a CGI platform. So OSNAP will not be widely adopted by web developers.
In terms of solutions, I was thinking we could develop another version of OSNAP that uses a relational database. One motivation is that the current architecture of OSNAP will have a hard time beating the speed of a relational database. The way OSNAP is used would be the same in both versions, but the relational-database version would query its data from the database. I was imagining a library used exactly like OSNAP, with the same calls such as osnap.data.Community(source="ltdb", cbsafips="10700"), but pulling the data from PostGIS.
The downside of the relational-database version is that users would have to install a database on their machine (desktop or server, either works), but I think this version could be widely used by web developers. There will be some web developers who want to add an OSNAP module to a system already written on a platform other than Django. Meanwhile, the current version of OSNAP would be widely used by Jupyter users, without a database.
The downside of this idea is that whenever we change the OSNAP library, we need to make the same change in both versions, and developing the relational-database version will take a lot of time.
==Code (this code is not run from cgi-bin)=======================================
import timeit
import numpy
import json
import os
os.environ['QUILT_PACKAGE_DIRS'] = "/var/www/quilt_packages"
t1 = timeit.default_timer()
import osnap
import pandas as pd
Albertville = osnap.data.Community(source="ltdb", cbsafips="10700")
AlbertvilleDataframe = Albertville.census
out_data = Albertville.census[Albertville.census.year == 1990]['p_hispanic_persons']
t2 = timeit.default_timer()
print(out_data)
print(t2 - t1)  # elapsed: import osnap + Community instantiation + query
==The result======================================================
(osnap) suhan@tobler:~/public_html/Ex$ python osnap_example_noCGI.py
/home/suhan/public_html/osnap/osnap/osnap/data/data.py:44: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['geometry'] = df.wkb.apply(lambda x: wkb.loads(x, hex=True))
/home/suhan/public_html/cgi-bin/anaconda3/envs/osnap/lib/python3.7/site-packages/pandas/core/frame.py:3697: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
errors=errors)
/home/suhan/public_html/osnap/osnap/osnap/data/data.py:41: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['geometry'] = df.wkt.apply(wkt.loads)
geoid
01095030100 1.0
01095030201 1.0
01095030202 1.0
01095030300 0.0
01095030401 0.0
01095030402 0.0
01095030500 0.0
01095030600 0.0
01095030701 0.0
01095030702 0.0
01095030801 1.0
01095030802 1.0
01095030902 1.0
01095030903 1.0
01095030904 1.0
01095031000 1.0
01095031100 0.0
01095031200 0.0
Name: p_hispanic_persons, dtype: float64
20.08569398149848
from geosnap.
More concerns:
If users (mostly web developers) must use Django to use OSNAP, they will not be able to add OSNAP to code that is already written on a CGI framework. I am one representative of many such users: I originally hoped to simply drop OSNAP into my current framework (which is all python scripts), but in reality I am now in a situation where I would need to replace the whole framework with Django (and I am not sure Django would even be a solution). It would be really nice for users like me to be able to add OSNAP to our current frameworks without performance issues. In addition, even if I somehow manage to build a cached version on Django, it would require loading OSNAP asynchronously at the beginning of the website load, which is not easy to achieve and would require many changes to both client-side and server-side code. If I do not make that change, users will experience a 20-second wait whenever they try to see a map for the first time on the web.
I think there will be many users (among non-web developers) who want to add OSNAP to a python script and run it (not in Jupyter) on their desktop. In this case, they will have to wait 20 seconds for every query.
My overall impression is that OSNAP is optimized for Jupyter users: it is really easy to use in the Jupyter environment because users can run it without installing a database (which is a big merit). But the framework is not so friendly for Python users who run their code as scripts, on desktops or servers, because of the 20-second wait on every query.
from geosnap.
to clarify a couple points:
- I don't mean to suggest that we need to go with django; I'm sure there's a newer, sexier option. I just think it's important that we keep in mind that we will eventually need to move away from CGI. It's just not flexible enough for the dashboard-like interface we need (which is why dash is another important option to consider). CGI is useful, though, for planning out the calls the web interface needs to make to osnap and the kinds of structures osnap will return that need to be styled in the browser.
- it's important to be clear that the bottleneck in osnap is not from querying data (the link above shows pandas is actually considerably faster than postgres). The slowdown comes from reading data from flat files into memory. osnap is an analytics library, not a data repository, so a small slowdown when reading a huge data file into memory is acceptable, especially because we can optimize that part elsewhere in the library, and because users working with other data will have a delay when they upload new data anyway (keep in mind we can't provide direct access to the LTDB data, even in a webapp). In short, there are several reasons I don't think we should be fixated on the 15-second load time while we are prototyping the web interface. I'm much more concerned about choosing an infrastructure that gives us the best layout & navigation tools, and has the interactivity that CGI lacks
from geosnap.
also, web developers can consume osnap however they like. Just because we're building a visualization front end using a particular library doesn't mean other developers couldn't build a CGI-based app using osnap
from geosnap.
Eli, thank you for your clarification.
- I agree that we need to move away from CGI for the programs we will develop in the future.
- I understand the performance bottleneck comes from reading data. My understanding is also that visualizing LTDB data in maps and charts on the web is fine as long as we do not distribute the data outside our database, and that in the web app version we do not need to ask users to upload LTDB data.
I still prefer to start working on the replacement of the OSNAP modules once the performance issue is resolved, because we do not know whether Django or another framework will be needed to tackle the performance bottleneck.
Eli and Serge
The core issue is that I do not know how to cache the LTDB data in OSNAP. The caching never happens when I run OSNAP as a python script. Without resolving that, every query will take at least 20 seconds, since the whole LTDB dataset is read for every query request. It would be really nice if someone could come up with a good solution, but at this point I do not know how long it will take me to find one; since I do not know the architecture of OSNAP, it is hard for me to come up with a solution quickly. Even though the performance of pandas is generally better than Postgres, I do not see that benefit of pandas in OSNAP, because OSNAP currently spends 20 seconds reading the whole dataset for every query in a python script.
If you think I should still replace the current modules that query LTDB data with the OSNAP modules without resolving the performance issue, I can move on to it as soon as I wrap up the client-side programming. (That will still be a few weeks from now, because it is more efficient to complete what I am doing before moving on to the next thread.)
But any program I build at this point will take at least 20 seconds plus visualization time for every query, since OSNAP will read the whole LTDB dataset every time (again, I currently don't know how to cache the LTDB data at the beginning of the program). As a web developer, I personally feel this would be like building something that will never be used because of its slowness, and that it would be better to invest my time in other threads than to develop a server-side program that is much slower than the current one. For that reason, I honestly do not have much motivation to work on it as it is, and it is why I would like to start replacing the current modules with OSNAP once I see that the LTDB data can be cached from a python script. But you may have a different perspective, which can probably work out (hopefully), so I am also okay doing whatever you would like to do.
I think I have explained the current bottleneck of OSNAP when it runs as a python script, and what will happen after we replace the current server-side program with OSNAP: it will take at least 15 seconds per query unless we come up with a good way to cache the LTDB data up front. Even so, I can still add OSNAP to my current server-side program if that is what you would like; it can be done around the March deadline. Moving from CGI to Django will add more time (and we don't know whether Django can be a solution).
Otherwise, I can explore solutions for caching the LTDB data, but I do not know whether I will find a good one, even after a few weeks of exploration. Anyway, the decision is up to you.
from geosnap.
I think the root of the problem is that there's no way to cache the data when working with osnap through the CGI interface. With CGI, each python script executes independently, so there's no way to cache the geometries in memory such that they are available to different python processes. One thing we can do to improve this is provide functions that allow osnap to pull data from different backends. Right now, when you instantiate a Community, it automatically pulls tract geometries from a parquet file, which is fast enough for desktop analysis but leaves room for improvement in the browser. We could, for example, add an argument to Community that accepts postgres credentials and grabs the geoms from a database, to try to speed up the data ingestion/conversion.
A better way to address this pain point, I would argue, is to use a different web framework that keeps a python process alive across user sessions (rather than spawning a new process each time a user executes a script). We could still provide functionality to pull geoms from a database, but if we aren't running each python process independently, it will greatly improve the application's performance (and generally make things easier on our end, because we can persist objects in memory to re-query, build additional plots, etc.)
But the important point, I think, is that we just shouldn't be too bothered about speed at this stage. There are several opportunities to improve performance, but right now we can make the most progress by understanding how a user will interact with osnap on the web (i.e., what the UI needs to look like and what it needs to accomplish) and whether we want to support certain kinds of workflows (like critical plotting functions) that aren't yet implemented. What matters most right now is getting the infrastructure in place and understanding how osnap hooks into it. Once we have that in place, we can work on optimizing the connections between osnap and the web
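The pluggable-backend idea could look roughly like the sketch below. The Community class and loader names here are illustrative only, not osnap's actual API, and the postgres branch is faked rather than opening a real database connection.

```python
def load_from_parquet(cbsafips):
    # stand-in for the current default: read the flat file, convert WKB (slow)
    return {"cbsafips": cbsafips, "source": "parquet"}

def load_from_postgres(cbsafips, dsn):
    # stand-in for a database backend holding ready-made geometries (fast)
    return {"cbsafips": cbsafips, "source": "postgres", "dsn": dsn}

class Community:
    # hypothetical constructor accepting a pluggable geometry backend
    def __init__(self, cbsafips, backend="parquet", dsn=None):
        if backend == "postgres":
            self.geoms = load_from_postgres(cbsafips, dsn)
        else:
            self.geoms = load_from_parquet(cbsafips)

# desktop users keep the zero-setup default; a webapp can opt into the database
c = Community("10700", backend="postgres", dsn="postgresql://localhost/tracts")
```

The point of the dispatch is that desktop users need no database while a web deployment can swap in a faster source without changing the calling code.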
from geosnap.
My finding is that the LTDB data is never cached when we run OSNAP as a python script, regardless of desktop or web. Even on the desktop, OSNAP run as a python script does not cache the LTDB data. Jupyter is the only platform where the LTDB data stays cached.
from geosnap.
In the current implementation the data are never cached, even in jupyter. Instead, each time a user instantiates a Community, the raw data are read in from a flat file.
This is an intentional design decision that increases processing time but decreases memory overhead. It is not required, and there are many ways we could cache the shape data in memory outside of jupyter. We can go over this more in the meeting today
from geosnap.