While working on the slow CSV download, I discovered another area where we could possi

Increasing parse_tick_list() performance about mpv HOT 5 CLOSED

zach-wahrer commented on August 11, 2024

Increasing parse_tick_list() performance

Comments (5)

commented on August 11, 2024

I opened a new branch to work on this issue. This commit changes from using the native CSV operations of Python to using Pandas dataframes. In addition to being more performant, this also cleaned up my messy processing code.

Unfortunately, this change breaks some of the testing suite. I updated one of the variables in mp_api_response.py to make one of the tests pass, but I'm still getting the following errors when running pytest:

============================================= test session starts ==============================================
platform linux -- Python 3.7.6, pytest-5.3.2, py-1.8.1, pluggy-0.13.1 -- 
collected 13 items                                                                                             

app/tests/test_endpoints.py::test_index PASSED                                                           [  7%]
app/tests/test_errors.py::TestErrorHandlers::test_404 PASSED                                             [ 15%]
app/tests/test_errors.py::TestErrorHandlers::test_mountain_project_api_exception PASSED                  [ 23%]
app/tests/test_errors.py::TestErrorHandlers::test_request_exception PASSED                               [ 30%]
app/tests/test_errors.py::TestErrorHandlers::test_database_exception FAILED                              [ 38%]
app/tests/test_mpv_helpers.py::TestDatabaseHelpers::test_connect PASSED                                  [ 46%]
app/tests/test_mpv_helpers.py::TestDatabaseHelpers::test_failed_db_connection PASSED                     [ 53%]
app/tests/test_mpv_helpers.py::TestMountainProjectHandler::test_mp_api_user_data PASSED                  [ 61%]
app/tests/test_mpv_helpers.py::TestMountainProjectHandler::test_mp_api_user_ticks PASSED                 [ 69%]
app/tests/test_mpv_helpers.py::TestMountainProjectHandler::test_mp_dev_env_user_data PASSED              [ 76%]
app/tests/test_mpv_helpers.py::TestMountainProjectHandler::test_mp_dev_env_parse_user_data PASSED        [ 84%]
app/tests/test_mpv_helpers.py::TestMountainProjectHandler::test_mp_dev_env_ticks PASSED                  [ 92%]
app/tests/test_mpv_helpers.py::TestMountainProjectHandler::test_mp_dev_env_parse_ticks FAILED            [100%]

=================================================== FAILURES ===================================================
__________________________________ TestErrorHandlers.test_database_exception ___________________________________

self = <app.tests.test_errors.TestErrorHandlers object at 0x7f70d9460250>, app = <Flask 'app'>

    def test_database_exception(self, app: pytest.fixture) -> None:
        """
        Confirms the correct response status and safe error message of DatabaseException. The test config settings
        do not have database credentials, therefore an error will be raised when the attempting to process test user
        data.
        """
        client = app.test_client()
>       response = client.post("/data", data={'test': 'yes'})

app/tests/test_errors.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/werkzeug/test.py:1039: in post
    return self.open(*args, **kw)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/testing.py:227: in open
    follow_redirects=follow_redirects,
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/werkzeug/test.py:993: in open
    response = self.run_wsgi_app(environ.copy(), buffered=buffered)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/werkzeug/test.py:884: in run_wsgi_app
    rv = run_wsgi_app(self.application, environ, buffered=buffered)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/werkzeug/test.py:1119: in run_wsgi_app
    app_rv = app(environ, start_response)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:2463: in __call__
    return self.wsgi_app(environ, start_response)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:2449: in wsgi_app
    response = self.handle_exception(e)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:1866: in handle_exception
    reraise(exc_type, exc_value, tb)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/_compat.py:39: in reraise
    raise value
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:2446: in wsgi_app
    response = self.full_dispatch_request()
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:1951: in full_dispatch_request
    rv = self.handle_user_exception(e)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:1820: in handle_user_exception
    reraise(exc_type, exc_value, tb)
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/_compat.py:39: in reraise
    raise value
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:1949: in full_dispatch_request
    rv = self.dispatch_request()
../../../anaconda3/envs/mpv/lib/python3.7/site-packages/flask/app.py:1935: in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
app/__init__.py:64: in data
    csv = api.parse_tick_list(dev_env=dev_env)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <app.helpers.mountain_project.MountainProjectHandler object at 0x7f70d94d4f50>, dev_env = True

    def parse_tick_list(self, dev_env: bool = False) -> Dict:
        """Parse the request data into a Pandas dataframe to clean."""
        if dev_env:
            with open(_DEV_TEST_TICKS) as ticklist:
                tick_list_file = list(csv.reader(ticklist, delimiter=','))
        else:
            try:
                tick_list_file = self.api_data.get("tick_list").content.decode("utf-8")
    
            except (AttributeError, UnicodeDecodeError) as e:
                raise MPAPIException
    
            columns = ["Date", "Route", "Pitches", "Style",
                       "Lead Style", "Route Type", "Length", "Rating Code"]
            df = pd.read_csv(io.StringIO(tick_list_file),
                             usecols=columns, na_filter=False)
    
>       return {"status": 0, "data": df.values.tolist()}
E       UnboundLocalError: local variable 'df' referenced before assignment

app/helpers/mountain_project.py:54: UnboundLocalError
____________________________ TestMountainProjectHandler.test_mp_dev_env_parse_ticks ____________________________

self = <app.tests.test_mpv_helpers.TestMountainProjectHandler object at 0x7f70d914acd0>

    def test_mp_dev_env_parse_ticks(self) -> None:
        """Ensure when dev_env=True that the processed test_ticks.csv file is the output of parse_tick_list()"""
>       data = self.api_dev.parse_tick_list(dev_env=True)

app/tests/test_mpv_helpers.py:102: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <app.helpers.mountain_project.MountainProjectHandler object at 0x7f70d914aed0>, dev_env = True

    def parse_tick_list(self, dev_env: bool = False) -> Dict:
        """Parse the request data into a Pandas dataframe to clean."""
        if dev_env:
            with open(_DEV_TEST_TICKS) as ticklist:
                tick_list_file = list(csv.reader(ticklist, delimiter=','))
        else:
            try:
                tick_list_file = self.api_data.get("tick_list").content.decode("utf-8")
    
            except (AttributeError, UnicodeDecodeError) as e:
                raise MPAPIException
    
            columns = ["Date", "Route", "Pitches", "Style",
                       "Lead Style", "Route Type", "Length", "Rating Code"]
            df = pd.read_csv(io.StringIO(tick_list_file),
                             usecols=columns, na_filter=False)
    
>       return {"status": 0, "data": df.values.tolist()}
E       UnboundLocalError: local variable 'df' referenced before assignment

app/helpers/mountain_project.py:54: UnboundLocalError
=============================================== warnings summary ===============================================
/home/zachtheclimber/anaconda3/envs/mpv/lib/python3.7/site-packages/flask_wtf/recaptcha/widgets.py:5
  /home/zachtheclimber/anaconda3/envs/mpv/lib/python3.7/site-packages/flask_wtf/recaptcha/widgets.py:5: DeprecationWarning: The import 'werkzeug.url_encode' is deprecated and will be removed in Werkzeug 1.0. Use 'from werkzeug.urls import url_encode' instead.
    from werkzeug import url_encode

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=================================== 2 failed, 11 passed, 1 warning in 1.37s ====================================

I find the UnboundLocalError: local variable 'df' referenced before assignment line to be odd, as this error doesn't throw when running MPV normally.

@benjpalmer : Any ideas on why this is happening?

commented on August 11, 2024

The next slowest aspect of parse_user_data() is tick_list_file = self.api_data.get("tick_list").content.decode("utf-8") as evidenced by:

Get ticklist Took:  3.5762786865234375e-06
Decode Took:  5.98429274559021
Dataframe Took:  0.024965286254882812
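For anyone wanting to reproduce per-step numbers like these, here is a minimal sketch of one way to collect them. The helper name `timed` and the sample payload are hypothetical, not taken from the MPV code base; the idea is only to time each step separately so the slow one stands out.

```python
import time
from typing import Any, Callable, Tuple


def timed(label: str, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func(*args, **kwargs), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label} Took:  {elapsed}")
    return result, elapsed


# Hypothetical stand-in for the downloaded tick list bytes.
raw = ("Date,Route\n" + "2020-01-01,Example Route\n" * 1000).encode("utf-8")

# Time only the decode step, independent of how the bytes were obtained.
text, decode_seconds = timed("Decode", raw.decode, "utf-8")
```

Timing the attribute access that produces the bytes and the `decode()` call as two separate steps, rather than as one chained expression, makes it easier to tell which of the two is actually slow.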

I spent some time trying to decrease the decoding time, using several different methods (including io.BytesIO() instead of io.StringIO(), and checking whether the Requests module could give us the tick list content without encoding/decoding). Everything produced times similar to the existing code.

I feel like multithreaded decoding (if such a thing exists) would provide better performance, but I'm unsure how to implement it, and preliminary Google searches came up empty.

If anyone has ideas/solutions, I'd be happy to hear them!

benjpalmer commented on August 11, 2024

@zachtheclimber Regarding this line from your comment:

I find the UnboundLocalError: local variable 'df' referenced before assignment line to be odd, as this error doesn't throw when running MPV normally.

I believe you are getting that error because parse_tick_list() always returns that dataframe object, but the object is only created when dev_env is False. I think you just have a small indentation error.

This might work for you:

    def parse_tick_list(self, dev_env: bool = False) -> Dict:
        """Parse the request data into a Pandas dataframe to clean."""
        if dev_env:
            with open(_DEV_TEST_TICKS) as ticklist:
                tick_list_file = list(csv.reader(ticklist, delimiter=','))
        else:
            try:
                tick_list_file = self.api_data.get("tick_list").content.decode("utf-8")
            except (AttributeError, UnicodeDecodeError) as e:
                raise MPAPIException

        columns = ["Date", "Route", "Pitches", "Style",
                   "Lead Style", "Route Type", "Length", "Rating Code"]
        df = pd.read_csv(io.StringIO(tick_list_file),
                         usecols=columns, na_filter=False)
        return {"status": 0, "data": df.values.tolist()}

Notice that at the bottom of the function, after the except block, the remaining lines have been un-indented, so the dataframe object is now created whether you're in a dev env or not.
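To illustrate the general mechanism, here is a minimal, self-contained sketch of the same mistake outside of MPV. The function and variable names are made up for the example: a local variable that is only assigned on one branch raises UnboundLocalError when the other branch is taken.

```python
def build_value(use_shortcut: bool) -> str:
    """Return an upper-cased string; 'value' is only assigned in the else branch."""
    if use_shortcut:
        source = "from shortcut"
    else:
        source = "from full path"
        value = source.upper()  # indented too far: skipped when use_shortcut is True
    return value


# The else branch assigns 'value', so this call works.
print(build_value(False))

# The if branch never assigns 'value', so returning it fails.
try:
    build_value(True)
except UnboundLocalError as error:
    print(f"UnboundLocalError: {error}")
```

This is also why the error only shows up on some code paths: the function works as long as the branch that assigns the variable is the one being executed.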

I'm admittedly not very familiar with Pandas, but I would suspect you might also want to be careful and catch any errors that the last few lines might raise.

Sorry I haven't been contributing as much lately. I am very much still interested, just busy at the moment. I am happy to help out however I can! Nice work on identifying all of the performance issues!!

commented on August 11, 2024

@benjpalmer Awesome! Thank you so much. I wasn't thinking about how the df = ... code wasn't running when in dev_env. I had it stuck in my head it was something related to the testing, since the code worked, but I never tried it in dev mode.

I updated the dev_env block to use a dataframe as well, just to make it consistent with the rest of the function, since pandas can open csv files directly. I also added in relevant error catching in the except block.
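A rough sketch of what that kind of change can look like is below. It is not the actual commit: the standalone function signature, the `raw_csv` and `dev_csv_path` parameters, and the stand-in `MPAPIException` class are assumptions made so the example runs on its own. Passing the file path straight to `pd.read_csv()` also avoids handing a list of rows to `io.StringIO()`, which only accepts a string.

```python
import io
from typing import Dict, Optional

import pandas as pd

COLUMNS = ["Date", "Route", "Pitches", "Style",
           "Lead Style", "Route Type", "Length", "Rating Code"]


class MPAPIException(Exception):
    """Stand-in for the project's exception of the same name."""


def parse_tick_list(raw_csv: Optional[bytes] = None,
                    dev_csv_path: Optional[str] = None) -> Dict:
    """Parse tick list CSV data into a dataframe, from a file path or raw bytes."""
    try:
        if dev_csv_path is not None:
            # pandas can read a CSV file directly from a path.
            source = dev_csv_path
        else:
            source = io.StringIO(raw_csv.decode("utf-8"))
        # Created on both paths, so 'df' is always bound before the return.
        df = pd.read_csv(source, usecols=COLUMNS, na_filter=False)
    except (AttributeError, UnicodeDecodeError, ValueError, OSError) as e:
        # ValueError covers pandas parsing problems such as missing columns;
        # OSError covers a missing or unreadable file.
        raise MPAPIException from e
    return {"status": 0, "data": df.values.tolist()}
```

Note that numeric columns come back as numbers rather than strings, which is the kind of difference that can require adjusting expected test data.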

All tests pass (after slightly modifying the input data, because of how the dataframe doesn't quote single digit ints) and MPV visibly functions as expected.

No worries on not having time to contribute. I just work on this when able as well. I've been applying for SWE jobs as well as still studying, so I've been busy too. I appreciate all the work you've put into this so far! Thanks for pointing out that my changes should have error catching as well. That is something I need to get in the mindset of.

I'm gonna leave the improve-ticklist-funcs branch open for a while to work on the slow decoding, but if I (or we) can't figure anything out within the next few weeks, I might just merge it and open a new one to work on the issue.

commented on August 11, 2024

Dug into this a bit more today.

It seems that while adding stream=True has decreased the runtime of the _mp_generic_request() function, it is still taking the same amount of time to download the file (the function just passes the mp_request variable on before the file is completely downloaded, at least from how I understand it).
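The effect can be reproduced with a small local experiment. This is only a sketch: it uses the standard library (`http.server` and `urllib`) as a stand-in for Requests, and the handler, delay, and payload are invented for the demo. The server sends the headers right away and the body after a pause, so the call that returns the response object looks fast while the wait moves to the point where the body is read. With Requests, `stream=True` defers the body download in a similar way until `.content` is accessed.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"Date,Route\n2020-01-01,Example Route\n"
DELAY_SECONDS = 0.3  # artificial pause standing in for a slow download


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()         # headers are sent immediately
        time.sleep(DELAY_SECONDS)  # the body only arrives after the pause
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def measure():
    """Return (seconds until response object, seconds to read body, decoded body)."""
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/ticks.csv"
    # Bypass any proxy settings so the request goes straight to the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        start = time.perf_counter()
        response = opener.open(url, timeout=5)  # returns once headers arrive
        request_time = time.perf_counter() - start
        try:
            start = time.perf_counter()
            text = response.read().decode("utf-8")  # blocks until the body arrives
            read_time = time.perf_counter() - start
        finally:
            response.close()
    finally:
        server.shutdown()
        server.server_close()
    return request_time, read_time, text


if __name__ == "__main__":
    request_time, read_time, _ = measure()
    print(f"Request Took:  {request_time}")
    print(f"Read + decode Took:  {read_time}")
```

In a setup like this, a timer wrapped around the read-and-decode step ends up including the download wait, which is consistent with the decode step appearing slow in the earlier measurements.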

And I thought the decode was causing the load time, but it appears to actually still just be the HTTP request resolving. So back to square one. I'm going to close this issue and incorporate the previous improvements into master.

Reopening issue 30 to address the file download speed.
