Comments (7)

TristanBilot commented on July 28, 2024

Hello there,
Thank you for using bqfetch!

  1. To fix ERROR: Could not open requirements file:, you should create the requirements file locally before installing the dependencies. Create a requirements.txt file in your Google Colab environment and paste into it the contents of the requirements file from the repo.
  2. The import error is due to the way you are importing the module. Until now, the only way to import a class from bqfetch was from bqfetch.bqfetch import BigQueryFetcher. I created a new release where it is now possible to import more intuitively with from bqfetch import BigQueryFetcher.
    To fix this error, you can either use the legacy method with from bqfetch.bqfetch import .. or download and install the 1.1.0 release locally to use the new import system (see the sketch below).
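
A minimal sketch of both fixes in a Colab cell (the requirements contents are elided here; copy them from the repo):

# Fix 1: recreate requirements.txt locally in Colab, then install from it.
with open('requirements.txt', 'w') as f:
    f.write('...')  # paste the actual contents of the repo's requirements.txt here
# then, in a Colab cell:  !pip install -r requirements.txt

# Fix 2: the two import styles.
from bqfetch.bqfetch import BigQueryFetcher  # legacy path, works on every release
# from bqfetch import BigQueryFetcher       # new path, requires release 1.1.0+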

marinaperezs commented on July 28, 2024

Thank you so much for your quick answer!!

I need to read a table with 122,602,779 rows as a dataframe in Google Colab, so that I can use it with machine learning algorithms. Which parallel_backend do you recommend: billiard, joblib, or multiprocessing? And do you recommend fetching by number of chunks instead of by chunk size? What size / how many chunks?

Thank you!!

TristanBilot commented on July 28, 2024

Before dividing your big dataset into chunks of relatively small size with one of the functions available in bqfetch, did you verify that your data are independent? That is, that it is possible to divide the whole dataset into multiple batches that can each be trained on independently.
If the data are independent, then you can leverage this module.

Next, for the backend, I recommend using the default one, so there is no need to set the parallel_backend argument for the moment. If this backend leads to issues, then try one of the other available backends (see the example below).
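
For example (a minimal sketch; fetcher and chunk are defined as in the code shared later in this thread, and the alternative backend names are taken from the question above):

# Default backend (recommended): simply omit parallel_backend.
df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)

# Only if the default causes problems, set an alternative backend explicitly:
df = fetcher.fetch(chunk, nb_cores=-1, parallel_backend='joblib', verbose=True)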

Then, for the fetching, I recommend fetching by chunk size, as this lets you easily manage your memory and avoid memory overflows in your Colab environment. However, you need to specify an index column in your dataset on which the table can be partitioned. I give some examples of index columns in the README.
If it is hard for you to find a good index column, then try fetching by number of chunks, estimating the chunk size from the size of your dataset and your available memory. You can also start with a small chunk size and increase it progressively until you hit an overflow error (see the sketch below).
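
A minimal sketch of the "start small and grow" approach, reusing the calls shared later in this thread (the table name, key path, and index column are placeholders):

from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("my_project", "my_dataset", "my_table")      # placeholders
fetcher = BigQueryFetcher('/path/to/service_account.json', table)  # placeholder path

# Partition on an index column; start with small ~1 GB chunks and
# raise the size only while memory stays under control.
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=1, verbose=True)

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
    # ... process df, then free it before fetching the next chunk
    del df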

Do not hesitate to ask if you have any other questions!

marinaperezs commented on July 28, 2024

What do you mean by my data being independent? What I want to do is read a BigQuery table as a dataframe in Python, but it seems to be too big: if I connect to BQ and query the table directly, it doesn't work. So my idea was to read it "by parts", maybe download the parts as CSVs and put them back together, so that I can reconstruct my table and load it into Python.

I don't know if this is possible with bqfetch, but if not, do you know any other way to load a big BQ table into Python?

Also, I just got this error; I don't know if it has something to do with BQ or Colab:
[Screenshot of the error, 2023-05-08 19:25]

TristanBilot commented on July 28, 2024

I understand that you want to split your dataset into multiple parts; however, there is no point in reconstructing the whole dataframe, as you will still end up with a memory overflow: the dataframe is too large to fit on your machine. What you can do instead is run the training loop on parts of the data rather than on the whole dataframe, as in mini-batch training.

To summarize:

  • ❌ Fetch all the data => train: this leads to memory overflow, as you have too much data to process at once.
  • ✅ Fetch chunk 1/n of the data => train, ..., fetch chunk n/n of the data => train: this is feasible, but each row in your dataset should be independent, meaning that running the training loop on all the data at once or on many small batches should produce approximately the same results. If your rows are not independent, then you need to provide a proper index column. A sketch of this pattern follows the list.
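
A minimal sketch of the ✅ pattern, assuming a scikit-learn model trained incrementally with partial_fit (the table name, key path, 'label' column, and class list are placeholders):

from sklearn.linear_model import SGDClassifier
from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("my_project", "my_dataset", "my_table")      # placeholders
fetcher = BigQueryFetcher('/path/to/service_account.json', table)
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)

model = SGDClassifier()
for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
    X, y = df.drop(columns=['label']), df['label']  # 'label' is a placeholder
    model.partial_fit(X, y, classes=[0, 1])         # train on this chunk only
    del df                                          # free memory before the next chunk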

TristanBilot commented on July 28, 2024

Your error comes from the BigQuery API. Can you provide the full code you used to fetch? I think your chunk size is too big: the fetch took more than 10 min and raised a timeout error.

marinaperezs commented on July 28, 2024

I'll think about what you said and will probably come back with questions hahah, thank you!!!

This is my code:

from bqfetch.bqfetch import BigQueryFetcher
from bqfetch.bqfetch import BigQueryTable

# Table to fetch: (project, dataset, table).
table = BigQueryTable("gcpinv-230419-ifym1p4zudsrz4zu", "prueba_lstm", "results")
# Authenticate with the service-account key file.
fetcher = BigQueryFetcher('/content/drive/MyDrive/TFG mio/gcpinv-230419-ifym1p4zudsrz4zu-699b17de7880.json', table)
# Partition the table on 'row_num' into chunks of ~2 GB each.
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)

# Fetch each chunk as a dataframe using all available cores.
for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
[Screenshot of the error, 2023-05-08 19:43]
