
Comments (7)

mrocklin commented on August 14, 2024

@seibert any thoughts on fast string operations with Numba?

Regarding pandas Series: if we know that all of the rows line up, then it's best to delay creating the Series for as long as possible. The map method might not be a big help; I suspect that a list comprehension is about as fast in many cases. Delaying the use of Series would also help with the eventual desire to support straight NumPy.

from fastparquet.

seibert commented on August 14, 2024

The problem with strings in the above case is that Numba is basically not equipped to deal with a container that holds Python objects when generating code in nopython mode. I'm not sure what we would do here.


seibert commented on August 14, 2024

@sklam might have an idea for a hack that would allow us to jump back into Python mode to fetch the pointer to the string data for a given element.


martindurant commented on August 14, 2024

Presumably numpy does this internally when you do astype("S") on an object array?

Is it this call? https://docs.python.org/2/c-api/string.html#c.PyString_AsStringAndSize

Cython is able to do this, apparently: http://stackoverflow.com/questions/17511309/fast-string-array-cython
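
For reference, a rough pure-Python sketch of the layout that a fixed-width bytes conversion such as astype("S") produces: every element is encoded and NUL-padded to the length of the longest one. This only illustrates the resulting buffer, not how NumPy or Cython do it internally; the helper name `to_fixed_width` and the UTF-8 encoding are assumptions made for the example.

```python
def to_fixed_width(strings, encoding="utf-8"):
    """Encode strings into one contiguous buffer of equal-width, NUL-padded items.

    Hypothetical helper for illustration only; returns (buffer, item_width).
    """
    encoded = [s.encode(encoding) for s in strings]
    # The item width is the length in bytes of the longest encoded element.
    width = max((len(b) for b in encoded), default=0)
    # Pad each item on the right with NUL bytes so all items have the same size.
    buf = b"".join(b.ljust(width, b"\x00") for b in encoded)
    return buf, width


buf, width = to_fixed_width(["a", "bcd", "ef"])
```

The per-element encode step is the part that needs access to each Python object, which is the piece that is hard to express in nopython mode.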


martindurant commented on August 14, 2024
  • For 8- and 16-bit ints (signed or not), it may be faster, and is certainly more compact, to use the bit-packing encoding (whose output will look just like tostring()) rather than expanding to 32 bits; on read, the final datatype may well be 8- or 16-bit anyway.
    Similarly, dictionary codes are always stored with bit-packing, and rounding the bit width up to 8 should be much faster than using the minimum number of bits per value. PR #39
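
A minimal sketch of the point about bit widths, assuming LSB-first packing as used in Parquet's bit-packed runs. The function name `bit_pack` is hypothetical and this is not fastparquet's implementation; it only shows that at a width of 8 the packed output is the raw byte string of the values, so no bit shuffling is needed, while a minimal width such as 3 requires shifting and masking for every value.

```python
def bit_pack(values, width):
    """Pack non-negative ints into bytes using `width` bits each, LSB first.

    Hypothetical helper for illustration; signed values would first need to
    be reinterpreted as unsigned.
    """
    mask = (1 << width) - 1
    buffer = 0   # bits accumulated but not yet written out
    nbits = 0    # number of valid bits currently held in `buffer`
    out = bytearray()
    for v in values:
        if v < 0 or v > mask:
            raise ValueError(f"{v} does not fit in {width} bits")
        buffer |= v << nbits
        nbits += width
        # Flush every complete byte.
        while nbits >= 8:
            out.append(buffer & 0xFF)
            buffer >>= 8
            nbits -= 8
    if nbits:
        # Flush the final partial byte, zero-padded in the high bits.
        out.append(buffer & 0xFF)
    return bytes(out)


codes = list(range(8))
packed_min = bit_pack(codes, 3)   # minimal width: 3 bytes, values straddle bytes
packed_8 = bit_pack(codes, 8)     # width 8: one byte per value
```

With width 8 (or 16), the result is byte-for-byte what a plain buffer copy of the values would give, which is why rounding up can trade a little space for a much cheaper encode and decode.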


martindurant commented on August 14, 2024
  • Writing or reading an optional column should not take significant additional time if there are no actual nulls. PR #33
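
One way to picture the fast path, as a sketch only: for a flat optional column, if the number of stored values equals the number of definition levels, there are no nulls and the values can be returned without any per-row work. The function `assemble_optional` and its arguments are hypothetical and are not taken from fastparquet.

```python
def assemble_optional(values, def_levels, max_level=1):
    """Expand the non-null values of a flat optional column to full length.

    Hypothetical helper for illustration. `def_levels` has one entry per row;
    a row is non-null when its level equals `max_level`.
    """
    if len(values) == len(def_levels):
        # Fast path: every row has a value, so no null handling is needed.
        return list(values)
    # Slow path: walk the levels and insert None wherever a row is null.
    it = iter(values)
    return [next(it) if level == max_level else None for level in def_levels]


no_nulls = assemble_optional([1, 2, 3], [1, 1, 1])
with_nulls = assemble_optional([1, 3], [1, 0, 1])
```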


martindurant commented on August 14, 2024

The variable string encoding will remain outstanding, but with good fixed string and categorical options, it is less pressing.
Everything else here is complete, so closing this.

