Comments (7)
Hello there,
Thank you for using bqfetch!
- To fix `ERROR: Could not open requirements file:`, you should create the requirements file locally to install the dependencies. Please create the requirements.txt file in your Google Colab and paste in the content of the requirements file present in the repo.
- The import error is due to the way you are importing the module. Until now, the only way to import a class from bqfetch was `from bqfetch.bqfetch import BigQueryFetcher`. I created a new release where it is now possible to import more intuitively using `from bqfetch import BigQueryFetcher`.
To fix this error, you can thus use the legacy method with `from bqfetch.bqfetch import ..` or download and install the 1.1.0 release locally to use the new import system.
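For reference, the two import styles side by side (a minimal sketch; the legacy path works on every version, the flat import only from 1.1.0 onwards):

# Legacy import path, available in all releases:
from bqfetch.bqfetch import BigQueryFetcher
# Flat import path, available from release 1.1.0 onwards:
from bqfetch import BigQueryFetcher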
Thank you so much for your quick answer!!
I need to read a table that has 122,602,779 rows as a dataframe in Google Colab to be able to use it with machine learning algorithms. Which `parallel_backend` do you recommend: billiard, joblib, or multiprocessing? And do you recommend fetching by number of chunks instead of by chunk size? What chunk size or number of chunks?
Thank you!!
Before dividing your big dataset into chunks of relatively small size using one of the functions available in bqfetch, did you verify that your data are independent? That is, is it possible to divide the whole dataset into multiple batches that can be trained on independently?
If so, then you can leverage this module.
Next, for the backend, I recommend using the default backend, so no need to set the `parallel_backend` argument for the moment. If this backend leads to issues, then try one of the other available backends.
Then, for the fetching, I recommend fetching by chunk size, as you can easily manage your memory and avoid memory overflows in your Colab environment. However, you need to specify an index column on which your dataset can be partitioned. I give some examples of index columns in the README.
If it is hard for you to find a nice index column, then try fetching by number of chunks, estimating the chunk size from the size of your dataset and your available memory. You can also start with a small chunk size and increase it gradually until you hit an overflow error.
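A rough sketch of both strategies (the table, path, and column names are placeholders; `nb_chunks` is my assumption for the fetch-by-number-of-chunks keyword, so check the README for the exact name):

from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable

table = BigQueryTable('my_project', 'my_dataset', 'my_table')
fetcher = BigQueryFetcher('/path/to/service_account.json', table)

# Strategy 1: fetch by chunk size, partitioning on an index column;
# 2 GB per chunk keeps each fetch inside Colab's available memory.
chunks = fetcher.chunks('index_column', by_chunk_size_in_GB=2, verbose=True)

# Strategy 2: fetch by number of chunks, when no nice index column exists.
# NOTE: 'nb_chunks' is an assumed keyword name, to be checked in the README.
# chunks = fetcher.chunks('index_column', nb_chunks=10, verbose=True)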
Do not hesitate to ask if you have any other questions!
What do you mean by my data being independent? What I wanted to do is read a BigQuery table as a dataframe in Python, but it seems to be too big, and if I do it by connecting to BQ and querying the table it doesn't work. So my idea was to read it "by parts", maybe download the parts as CSV and put them back together, so that I can reconstruct my table and load it into Python.
I don't know if this is possible with bqfetch, but if not, do you know any other way to load a big BQ table into Python?
Also, I just got this error, and I don't know if it has something to do with BQ or Colab:
I understand that you want to split your dataset into multiple parts; however, there is no use in reconstructing the whole dataframe, as you will still end up with a memory overflow because the dataframe is too big to fit on your machine. What you can do to deal with this large table is run the training loop on the parts of the data instead of on the whole dataframe, as in mini-batch training.
To summarize:
- ❌ Fetch all the data => train: this leads to memory overflow, as you have too much data to process.
- ✅ Fetch chunk 1/n => train, ..., fetch chunk n/n => train: this is feasible (see the sketch below), but each row in your dataset should be independent, meaning that running the training loop on all the data at once or on many small batches should produce approximately the same results. If your rows are not independent, then you need to provide a proper index column.
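A minimal sketch of that second pattern, assuming independent rows and a model that supports incremental training (scikit-learn's SGDClassifier and the 'label' column are illustrative, not part of bqfetch):

from bqfetch.bqfetch import BigQueryFetcher, BigQueryTable
from sklearn.linear_model import SGDClassifier

table = BigQueryTable('my_project', 'my_dataset', 'my_table')
fetcher = BigQueryFetcher('/path/to/service_account.json', table)
chunks = fetcher.chunks('index_column', by_chunk_size_in_GB=2, verbose=True)

model = SGDClassifier()
classes = [0, 1]  # partial_fit needs all labels declared on the first call

# Fetch chunk 1/n => train, ..., fetch chunk n/n => train:
for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)
    X, y = df.drop(columns=['label']), df['label']  # 'label' is illustrative
    model.partial_fit(X, y, classes=classes)
    del df  # free this chunk's memory before fetching the next one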
Your error is about the BigQuery API; can you provide the full code you used to fetch? I think your chunk size is too big, so the fetching took more than 10 minutes and raised a timeout error.
I'll think about what you said and probably come back with questions hahah, thank you!!!
This is my code:
from bqfetch.bqfetch import BigQueryFetcher
from bqfetch.bqfetch import BigQueryTable

# Target table: project, dataset, table name.
table = BigQueryTable("gcpinv-230419-ifym1p4zudsrz4zu", "prueba_lstm", "results")
# Authenticate with the service-account key stored on Drive.
fetcher = BigQueryFetcher('/content/drive/MyDrive/TFG mio/gcpinv-230419-ifym1p4zudsrz4zu-699b17de7880.json', table)
# Partition the table on 'row_num' into ~2 GB chunks.
chunks = fetcher.chunks('row_num', by_chunk_size_in_GB=2, verbose=True)
# Fetch each chunk as a dataframe, using all available cores.
for chunk in chunks:
    df = fetcher.fetch(chunk, nb_cores=-1, verbose=True)