
Comments (4)

fredshone commented on June 1, 2024

to be pedantic - where does the casting occur:

  1. elara.input reads the schedule input from xml, so the indexes in elara at this point will be strings, I'm pretty sure (NOT CHECKED THOUGH).
  2. I like string ids.
  3. elara.event_handlers does its thing and outputs a big csv of required counts. But I still expect the output indices to be string type (NOT CHECKED). Therefore I do not expect pandas to have done any scientific notation. Therefore I expect the csv indices to be strings.
  4. elara.benchmark reads in the bm data as a dict from json. So they presumably get nice integers if we want or strings if we prefer - I gather you are using strings.
  5. However, elara.benchmark reads in the csv of sim results (from elara.event_handlers) with pd.read_csv and somehow gets something that isn't string???
  6. One of my above steps must be wrong.

For your solution, you can more simply specify the datatype for individual columns using `pd.read_csv(path, dtype={"id": str})`, I think. Based on my logic above I am happy for you to do this as I am expecting a str regardless. But my logic is wrong somewhere so who knows.
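A minimal sketch of that suggestion, assuming a CSV shaped like the sim results (an id column followed by hourly counts). The column name `stop_id` and the sample rows are hypothetical, made up for illustration; they are not taken from the elara test data. The id column is read as `str` first and only then set as the index, so no numeric inference is applied to it.

```python
import io

import pandas as pd

# Hypothetical stand-in for the csv written by the event handlers:
# one id column, then one column per hour (only hours 0 and 1 here).
csv_text = (
    "stop_id,0,1\n"
    "12117299242561318912,3,4\n"
    "12117299242561318912,1,2\n"
)

# Force the id column to str so the large id is kept exactly as written.
df = pd.read_csv(io.StringIO(csv_text), dtype={"stop_id": str})
df = df.set_index("stop_id")

# Same aggregation step as in the benchmark code: sum counts per id.
totals = df.groupby(df.index).sum()
print(totals)
```

With the index held as strings, a lookup such as `totals.loc["12117299242561318912"]` uses the id exactly as it appears in the file.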

Ultimately happy for you to make changes if the tests still pass.

If we were to be very careful we could add in a test that represents this problem. But we would have to build new test data and so on. So maybe not unless you are twiddling your thumbs.

I will also talk to Kasia about avoiding massive integers and being consistent with type, ideally strings.

from elara.

Georgea75 commented on June 1, 2024

Hey, a response to your points:

  1. Yes, this is my understanding as well.
  2. Yes, at this point the CSV has the correct IDs.
  3. Yes, this is correct and the IDs are valid when read from the benchmark.
  4. Correct, this is where the error occurs:
    results_df = pd.read_csv(path, index_col=0)
    results_df = results_df.groupby(results_df.index).sum()
    results_df = results_df[[str(h) for h in range(24)]]
    results_df.index = results_df.index.map(str)
    results_df.index.name = 'stop_id'

What is happening, line by line:
results_df = pd.read_csv(path, index_col=0)
loads the index as numpy.float64. At this point Python displays the value as 1.211729924256132e+19. The type is inferred as float because the data in the csv takes the unquoted form 12117299242561318912. Maybe writing it quoted, as "12117299242561318912", would fix this?

The next line of interest is
results_df.index = results_df.index.map(str)
which converts the numpy.float64 index to str, producing "1.211729924256132e+19".

Therefore it finds no matches, as "1.211729924256132e+19" != "12117299242561318912".
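The mismatch can be sketched without pandas at all. This is only an illustration of the float-to-string round trip, assuming the id has already been parsed as a float; the variable names are made up, and the exact digits printed may differ from the display quoted above.

```python
# The id exactly as it is written in the csv.
original_id = "12117299242561318912"

# What a float-typed index would hold for that id.
as_float = float(original_id)

# Converting the float back to str gives scientific notation,
# not the original run of digits.
round_tripped = str(as_float)

print(round_tripped)
print(round_tripped == original_id)
```

Any key built this way cannot match the string ids read from the benchmark json.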

I have changed my solution to use dtype={0:str} as you suggested :) Very happy to write a test as well. But would we have to make a test to cover this issue for each benchmark?

Action
Do you want me to make a branch that changes all cases of read_csv on the csv dump to use dtype={0:str}, as this issue may occur for every benchmark type? Or should I just make the change for PTInteraction? Or can I just make the changes on the current new-zealand-branch? (At this point this branch would cover several things: adding each of the four new nz-benchmarks as well as this load change.)


fredshone commented on June 1, 2024

awesome explanation.

I have trouble reproducing, eg:

In [96]: df = pd.DataFrame(["99999999999999999999999999999999999"]*5, columns=["a"])

In [97]: df
Out[97]:
                                     a
0  99999999999999999999999999999999999
1  99999999999999999999999999999999999
2  99999999999999999999999999999999999
3  99999999999999999999999999999999999
4  99999999999999999999999999999999999

In [98]: df.a
Out[98]:
0    99999999999999999999999999999999999
1    99999999999999999999999999999999999
2    99999999999999999999999999999999999
3    99999999999999999999999999999999999
4    99999999999999999999999999999999999
Name: a, dtype: object

In [99]: df.to_csv(path)

In [100]: df = pd.read_csv(path)

In [101]: df.a
Out[101]:
0    99999999999999999999999999999999999
1    99999999999999999999999999999999999
2    99999999999999999999999999999999999
3    99999999999999999999999999999999999
4    99999999999999999999999999999999999
Name: a, dtype: object

But I trust you and I like strings so please:

  1. force string index - please do this for every bm - I think it's a good test
  2. don't worry about new tests
  3. happy for you to include on your NZ branch


Georgea75 commented on June 1, 2024

Excellent, I will make the changes today
