Comments (7)

TristanBilot commented on July 28, 2024

Hello there,
Thank you for using bqfetch!

  1. To fix ERROR: Could not open requirements file:, you should create the requirements file locally before installing the dependencies. Create a requirements.txt file in your Google Colab environment and paste into it the contents of the requirements file from the repo.
  2. The import error is due to the way you are importing the module. Until now, the only way to import a class from bqfetch was from bqfetch.bqfetch import BigQueryFetcher. I created a new release where it is now possible to import more intuitively with from bqfetch import BigQueryFetcher.
    To fix this error, you can either use the legacy method with from bqfetch.bqfetch import .. or download and install the 1.1.0 release locally to use the new import system (see the sketch below).
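
A minimal sketch of both fixes in a Colab cell (the requirements contents are elided here; copy them from the repo):

# Fix 1: recreate requirements.txt locally in Colab, then install from it.
with open('requirements.txt', 'w') as f:
    f.write('...')  # paste the actual contents of the repo's requirements.txt here
# then, in a Colab cell:  !pip install -r requirements.txt

# Fix 2: the two import styles.
from bqfetch.bqfetch import BigQueryFetcher  # legacy path, works on every release
# from bqfetch import BigQueryFetcher       # new path, requires release 1.1.0+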

marinaperezs commented on July 28, 2024

Thank you so much for your quick answer!!

I need to read a table with 122,602,779 rows as a dataframe in Google Colab, so that I can use it with machine learning algorithms. Which parallel_backend do you recommend: billiard, joblib, or multiprocessing? And do you recommend fetching by number of chunks instead of by chunk size? What size / how many chunks?

Thank you!!

TristanBilot commented on July 28, 2024

Before dividing your big dataset into chunks of relatively small size with one of the functions available in bqfetch, did you verify that your data are independent? That is, that it is possible to divide the whole dataset into multiple batches that can each be trained on independently.
If the data are independent, then you can leverage this module.

Next, for the backend, I recommend using the default one, so there is no need to set the parallel_backend argument for the moment. If this backend leads to issues, then try one of the other available backends (see the example below).
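
For example (a minimal sketch; fetcher and chunk are defined as in the code shared later in this thread, and the alternative backend names are taken from the question above):

# Default backend (recommended): simply omit parallel_backend.
df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)

# Only if the default causes problems, set an alternative backend explicitly:
df = fetcher.fetch(chunk, nb_cores=-1, parallel_backend='joblib', verbose=True)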

Then, for the fetching, I recommend fetching by chunk size, as this lets you easily manage your memory and avoid memory overflows in your Colab environment. However, you need to specify an index column in your dataset on which the table can be partitioned. I give some examples of index columns in the README.
If it is hard for you to find a good index column, then try fetching by number of chunks, estimating the chunk size from the size of your dataset and your available memory. You can also start with a small chunk size and increase it progressively until you hit an overflow error (see the sketch below).
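
A minimal sketch of the "start small and grow" approach, reusing the calls shared later in this thread (the table name, key path, and index column are placeholders):

from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("my_project", "my_dataset", "my_table")      # placeholders
fetcher = BigQueryFetcher('/path/to/service_account.json', table)  # placeholder path

# Partition on an index column; start with small ~1 GB chunks and
# raise the size only while memory stays under control.
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=1, verbose=True)

for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
    # ... process df, then free it before fetching the next chunk
    del df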

Do not hesitate to ask if you have any other questions!

marinaperezs commented on July 28, 2024

What do you mean by my data being independent? What I want to do is read a BigQuery table as a dataframe in Python, but it seems to be too big: if I connect to BQ and query the table directly, it doesn't work. So my idea was to read it "by parts", maybe download the parts as CSVs and put them back together, so that I can reconstruct my table and load it into Python.

I don't know if this is possible with bqfetch, but if not, do you know any other way to load a big BQ table into Python?

Also, I just got this error; I don't know if it has something to do with BQ or Colab:
[Screenshot of the error, 2023-05-08 19:25]

TristanBilot commented on July 28, 2024

I understand that you want to split your dataset into multiple parts; however, there is no point in reconstructing the whole dataframe, as you will still end up with a memory overflow: the dataframe is too large to fit on your machine. What you can do instead is run the training loop on parts of the data rather than on the whole dataframe, as in mini-batch training.

To summarize:

  • ❌ Fetch all the data => train: this leads to memory overflow, as you have too much data to process at once.
  • ✅ Fetch chunk 1/n of the data => train, ..., fetch chunk n/n of the data => train: this is feasible, but each row in your dataset should be independent, meaning that running the training loop on all the data at once or on many small batches should produce approximately the same results. If your rows are not independent, then you need to provide a proper index column. A sketch of this pattern follows the list.
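
A minimal sketch of the ✅ pattern, assuming a scikit-learn model trained incrementally with partial_fit (the table name, key path, 'label' column, and class list are placeholders):

from sklearn.linear_model import SGDClassifier
from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable("my_project", "my_dataset", "my_table")      # placeholders
fetcher = BigQueryFetcher('/path/to/service_account.json', table)
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)

model = SGDClassifier()
for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
    X, y = df.drop(columns=['label']), df['label']  # 'label' is a placeholder
    model.partial_fit(X, y, classes=[0, 1])         # train on this chunk only
    del df                                          # free memory before the next chunk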

TristanBilot commented on July 28, 2024

Your error comes from the BigQuery API. Can you provide the full code you used to fetch? I think your chunk size is too big: the fetch took more than 10 min and raised a timeout error.

marinaperezs commented on July 28, 2024

I'll think about what you said and will probably come back with questions hahah, thank you!!!

This is my code:

from bqfetch.bqfetch import BigQueryFetcher
from bqfetch.bqfetch import BigQueryTable

# Table to fetch: (project, dataset, table).
table = BigQueryTable("gcpinv-230419-ifym1p4zudsrz4zu", "prueba_lstm", "results")
# Authenticate with the service-account key file.
fetcher = BigQueryFetcher('/content/drive/MyDrive/TFG mio/gcpinv-230419-ifym1p4zudsrz4zu-699b17de7880.json', table)
# Partition the table on 'row_num' into chunks of ~2 GB each.
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)

# Fetch each chunk as a dataframe using all available cores.
for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
[Screenshot of the error, 2023-05-08 19:43]
