Comments (16)

computron commented on June 18, 2024

@kylebystrom can you look into this? It has to do with git-lfs.

from matminer.

kylebystrom commented on June 18, 2024

kylebystrom commented on June 18, 2024

Did you try git lfs clone? It's possible that everyone needs to have git lfs to interact with the repo now. Alternatively, it might work to git remote add the repo and then pull it. I understand that both of these solutions are inconvenient and may not work, so it's totally fine if you want to get rid of the datasets file and the config files associated with lfs for the time being. I am looking for a better solution as we speak though.

shyuep commented on June 18, 2024

Yes. git lfs is automatic with new versions of git, but I tried both git lfs clone and git clone. It seems the object does not exist on the server, so it does not seem to be an issue with git itself.

kylebystrom commented on June 18, 2024

As a temporary workaround, you can do:
git lfs install --skip-smudge
git clone ...
git lfs install --force
This should let you clone the repo, but won't give you the dataset files, just the pointer files. I'm still working on a permanent fix but this should let you clone and work on anything that doesn't involve the lfs files!
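After a clone with smudging skipped, it can be useful to confirm which files are still pointers rather than real data. The following is a minimal sketch, assuming the datasets are CSV files; the function names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and not part of matminer or git-lfs. It relies only on the fact that a Git LFS pointer is a small text file whose first line starts with the spec version URL.

```python
from pathlib import Path

# First line of a Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer rather than real data."""
    path = Path(path)
    # Pointer files are tiny text files; anything larger is treated as real data.
    if not path.is_file() or path.stat().st_size > 1024:
        return False
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(directory, pattern="*.csv"):
    """List files under `directory` matching `pattern` that are still LFS pointers."""
    return sorted(p for p in Path(directory).rglob(pattern) if is_lfs_pointer(p))
```

Calling `find_lfs_pointers(".")` from the root of the clone would list every CSV that was not downloaded. Note that this only tells you a file is a pointer; it cannot tell you whether the object it points to actually exists on the server.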

kylebystrom commented on June 18, 2024

Okay, I figured out the issue. It turns out that git-lfs does not support pushing lfs files to public forks, but via some bug or other error I managed to push pointer files to the fork without actually putting the large files themselves on the server. I'm trying to learn more about this, but it appears that I cannot actually handle the datasets on my own. What I could do is send the files to @computron so he can track them with lfs.
For the time being, though, we should remove the pointer files in the datasets directory (git rm *.csv or git lfs rm *.csv, not sure which) because they are junk pointers.
Sorry for the inconvenience, everyone! I have no idea how I managed to push junk pointer files without getting the lfs error message I was supposed to, but I'll be more careful now.

computron commented on June 18, 2024

Hi @kylebystrom, yes the files should be on the main repo and not the fork. I am just going to give you direct push access to matminer (master) temporarily so you can get it set up.

shyuep commented on June 18, 2024

Hmmm.... this is a rather big bug if forkers cannot push git lfs files.

kylebystrom commented on June 18, 2024

Okay, it appears I fixed it now. As a warning, there are some known issues with git lfs on some clients where one is asked repeatedly for login info while downloading lfs files. The git lfs crew recommends login caching for this. @shyuep you're right that it does seem like a pretty big issue that they don't have that kind of compatibility, but it's a git lfs issue and not something I can change.

If someone wants to test cloning on their system as a double check, that would be great, but I can assure you that I can push, clone, and pull with lfs successfully. Sorry again for the issue!

kylebystrom commented on June 18, 2024

And thanks @computron for help resolving the issue. Do I need to remove myself from direct push access or do you do that?

computron commented on June 18, 2024

Ok, I checked and it works for me.

I am wondering whether, apart from maybe one or two small data sets for playing around, the more official place for data sets should be a different repo like matminer-datasets. That way, someone downloading the code for the data mining tools doesn't have to also download a dozen data sets, particularly if they want some kind of lightweight installation for an analytics server.

kylebystrom commented on June 18, 2024

It probably depends on how many datasets we want. If the three I added are all we want, then it might be best to just keep them in matminer (my math might be off, but I think the datasets are currently 1/3 of the size of the repo). If we want all the datasets we can find, then it might get overwhelming to keep them all in one repo. Let me know if you want help making those sorts of changes; I'll make sure to have my lfs act together next time around.

WardLT commented on June 18, 2024

kylebystrom commented on June 18, 2024

Good question. Some other Python packages, like sklearn, have some datasets contained in the Python package itself. However, those datasets probably aren't that big, and if the "datasets" module is written so that there are convenience methods for loading stuff from the datasets in the data publishing service, it might be a nicer interface to just do it that way.
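To make the "convenience methods" idea concrete, here is a minimal sketch of what such a loader could look like. It assumes one CSV file per dataset in a local directory; the name `load_dataset` and that layout are hypothetical, not matminer's actual API. A version backed by a data publishing service could keep the same signature and swap the local file read for a download.

```python
import csv
from pathlib import Path


def load_dataset(name, data_dir):
    """Load `<data_dir>/<name>.csv` and return its rows as a list of dicts.

    The function name and the one-CSV-per-dataset layout are hypothetical.
    """
    path = Path(data_dir) / f"{name}.csv"
    if not path.is_file():
        # List what is available so a typo in `name` is easy to spot.
        available = sorted(p.stem for p in Path(data_dir).glob("*.csv"))
        raise FileNotFoundError(
            f"No dataset named {name!r} in {data_dir}; available: {available}"
        )
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

With this shape, callers only ever deal in dataset names, so where the data physically lives (in the package, in a separate repo, or on a remote service) can change later without breaking user code.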

computron commented on June 18, 2024

Ok, let's keep the 3 that we have in the repo. These will be sample data sets that people can use to get started; e.g., we'll design Jupyter notebooks around these. As Kyle said, it's similar to how scikit-learn has a few example data sets so that people can get started quickly.

In terms of repo size, the data sets currently there are ~9 MB, which is not too bad. The example Jupyter notebooks are actually a bigger problem, weighing in at 32 MB.

@kylebystrom If there are going to be more datasets in the future let's coordinate putting them somewhere else.
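A size comparison like the one above (data sets versus notebooks) can be reproduced with a short script. This is a rough sketch rather than the method used to get the numbers in this thread; the function name `size_by_extension` is hypothetical, and it measures working-tree files only, so LFS files that are still pointers will count as a few hundred bytes each.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(root):
    """Return total bytes per file extension under `root`, skipping `.git`."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        # Ignore git's own object store so only checked-out files are counted.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            totals[path.suffix or "(none)"] += path.stat().st_size
    return dict(totals)
```

Sorting the result by value and dividing by 1e6 gives a per-extension breakdown in MB, which makes it easy to see whether `.csv` or `.ipynb` files dominate the checkout.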

shyuep commented on June 18, 2024

I wouldn't bother with a separate repo until the total data set size reaches the order of a few hundred MB or GB and beyond. For anything less, the admin effort is not worth it.
