
Comments (11)

crysoberil commented on September 22, 2024

Since I have not heard back from anyone, I wrote a script that uses selenium to populate the downloads file from the webpage.

import argparse
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options


CO3D_WEBPAGE_URL = "https://ai.facebook.com/datasets/co3d-downloads/"


def fetch_url_by_span_text(driver, query_text):
    # Find the <span> containing the query text and return the href of its parent <a>.
    text_elem = driver.find_element(By.XPATH, "//span[contains(text(),'{}')]".format(query_text))
    a_elem = text_elem.find_element(By.XPATH, "..")
    return a_elem.get_attribute("href")


def get_category_ids(driver):
    cur_list_url = fetch_url_by_span_text(driver, "Download all links")
    response = requests.get(cur_list_url)
    # Skip the header line and ignore any trailing blank lines.
    lines = [line for line in response.text.split('\n')[1:] if line.strip()]
    return [line.split()[0].strip() for line in lines]


def get_co3d_urls(page_path):
    options = Options()
    options.add_argument("-headless")
    # Private browsing avoids reusing cached (possibly expired) links.
    options.set_preference("browser.privatebrowsing.autostart", True)
    with webdriver.Firefox(options=options) as driver:
        driver.get(page_path)
        time.sleep(1)  # Some delay to let the webpage populate
        category_ids = get_category_ids(driver)
        return [(category_id, fetch_url_by_span_text(driver, category_id))
                for category_id in category_ids]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--download_files_list", type=str, required=False,
                        help="Where the downloadable list will be generated",
                        default="./downloadpaths.txt")
    args = parser.parse_args()
    co3d_item_urls = get_co3d_urls(CO3D_WEBPAGE_URL)
    with open(args.download_files_list, 'w') as f_out:
        f_out.write("file_name\tcdn_link\n")
        f_out.write('\n'.join("{}\t{}".format(item, url)
                              for item, url in co3d_item_urls))
                

from co3d.

pwais commented on September 22, 2024

+1

There is also a ZeroDivisionError in download_dataset.py (line 138) if the download fails like this.


davnov134 commented on September 22, 2024

Thanks for releasing this useful dataset. I was trying to download the data following the CDN links found in the text file, but for the URLs I get "URL signature expired" error from any browser and any machine I try it from. How do I solve this?

Hi, thanks for the interest in our dataset and sorry for being late with the response due to some of us being on summer holiday.

The CDN links expire once every few days and the link text file has to be re-downloaded. Make sure to download a fresh list of links whenever you start the download. This should fix the problem.

Indeed the solution using selenium seems to do the latter automatically. Thanks for the code!
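A lighter-weight alternative to the Selenium approach could be to check whether an already-downloaded link list is still fresh before launching a long download. A minimal standard-library sketch; the function names are hypothetical, it assumes the tab-separated file_name/cdn_link format described in this thread, and it assumes expired signatures come back as HTTP errors rather than 200s:

```python
import urllib.error
import urllib.request


def parse_links_file(text):
    # Parse the tab-separated "file_name\tcdn_link" list (header on the first line),
    # skipping blank lines, into (name, url) pairs.
    lines = [line for line in text.split('\n')[1:] if line.strip()]
    return [tuple(line.split('\t')[:2]) for line in lines]


def find_stale_links(pairs):
    # Probe each CDN URL with a HEAD request; an expired signature typically
    # surfaces as an HTTP error (e.g. 403) instead of a successful response.
    stale = []
    for name, url in pairs:
        request = urllib.request.Request(url, method="HEAD")
        try:
            urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            stale.append((name, err.code))
    return stale
```

If find_stale_links returns anything, a fresh link list should be fetched before starting the download.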


stalkerrush commented on September 22, 2024

Hi, I tried to use the script to download the dataset with the CDN links copied from your website, but I also ran into the ZeroDivisionError.
[screenshot of the ZeroDivisionError traceback]
Do you know what could be the reason for this?


crysoberil commented on September 22, 2024

Thanks for releasing this useful dataset. I was trying to download the data following the CDN links found in the text file, but for the URLs I get "URL signature expired" error from any browser and any machine I try it from. How do I solve this?

Hi, thanks for the interest in our dataset and sorry for being late with the response due to some of us being on summer holiday.

The CDN links expire once every few days and the link text file has to be re-downloaded. Make sure to download a fresh list of links whenever you start the download. This should fix the problem.

Indeed the solution using selenium seems to do the latter automatically. Thanks for the code!

I just downloaded the text file again and retried the URLs, and I still get the "URL signature expired" error: the URLs within the text file are already expired, and the ZeroDivisionError in the Python script happens because of this too. That's why I wrote the script above. It requires Selenium to work, but it generates a fresh text file, which should allow one to download the dataset without hitting these errors.


pwais commented on September 22, 2024

@davnov134 Happy Summer Holiday! The bugs here are:

  1. The website generating the 51-line links text file appears to be broken: the URLs in it are all expired. Perhaps it is serving a stale static file and/or there is a caching problem. There are multiple reproductions in this thread.
  2. The download script has a ZeroDivisionError bug that triggers when one or more of the files cannot be downloaded. Also reproduced several times.

Edit: huh it seems the manual download links may have also expired now (i.e. the 50 links on https://ai.facebook.com/datasets/co3d-downloads/ ). I haven't seen that happen before.


crysoberil commented on September 22, 2024

@davnov134 Happy Summer Holiday! The bugs here are:
Edit: huh it seems the manual download links may have also expired now (i.e. the 50 links on https://ai.facebook.com/datasets/co3d-downloads/ ). I haven't seen that happen before.

@pwais The links on the page do work for me still. Perhaps the problem on your end happens due to webpage caching by your browser? Do the links work if you load the page in incognito?


pwais commented on September 22, 2024

Hmm, looks like I had some connection issues. The download script still doesn't work for me, though. The Selenium script does help a lot!


davnov134 commented on September 22, 2024

@pwais, I just downloaded a fresh link list file and launched the download without issues.
If you make sure that you are using a fresh set of links (i.e. do a no-cache refresh of the links page; in Chrome on Mac this is Cmd+Shift+R), do you still encounter the ZeroDivisionError?

The Selenium solution is very nice, but it introduces too heavy a dependency to be supported officially, so I'd rather make sure the problem cannot be solved in a simpler manner.


pwais commented on September 22, 2024

Agreed that Selenium is a heavy requirement, but the links seem to expire from time to time nonetheless. I was able to download using wget --continue, which was handy because the downloads did fail occasionally. The ZeroDivisionError remains: when a download returns zero bytes (i.e. it fails), the cited exception hides the real issue of the link being broken. The division in question only feeds the progress bar, and a broken progress bar is irrelevant if the file cannot be downloaded at all. If the response is zero bytes, perhaps just raise a ValueError?
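A minimal guard along those lines might look like this; report_progress and the message text are hypothetical, since I haven't matched them against download_dataset.py:

```python
def report_progress(downloaded_bytes, total_bytes):
    # Guard the progress-bar division: a zero-byte total means the CDN link
    # is broken (e.g. an expired signature), not that the download is at 0%.
    if total_bytes == 0:
        raise ValueError(
            "Received 0 bytes for this file; the CDN link has likely expired. "
            "Re-download a fresh link list and retry."
        )
    return 100.0 * downloaded_bytes / total_bytes
```

This turns a confusing ZeroDivisionError into an actionable message about the expired link.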

@davnov134 The paper says "The CO3D collection effort still continues at a steady pace of ∼500 videos per week which we plan to release in the near future." Do you intend to version the dataset and/or provide the new videos? I think what most people would want here is an experience similar to rsync or aws s3 sync: partial data is not re-downloaded, and new data can be downloaded easily too. (Note that the existing download script always starts from scratch, which doesn't scale well for a dataset this size; I had to resume multiple times due to network issues, and I never saw better than 50 MByte/sec download speed.) awscli is a healthy multi-platform client, but I can understand why Facebook might not want to depend on it and/or publish via an S3-compatible server-side solution.

At any rate, thanks for this amazing dataset! I wish there were a more straightforward way for distributing stuff like this, COCO, imagenet, etc...


pwais commented on September 22, 2024

@davnov134 Thanks for the recent fix. I still can't download the whole dataset, though :( It eventually times out, and the download script doesn't allow resumes (it tends to wipe everything downloaded so far). I have fiber internet, so I don't think the problem is that my connection is too slow.

  1. Will the dataset be available via one big download (e.g. as ImageNet was) or BitTorrent or something? The current distribution method doesn't seem to work. Once upon a time Facebook had Wirehog (https://en.wikipedia.org/wiki/Wirehog); maybe they can revive that?

  2. Again, the paper says "The CO3D collection effort still continues at a steady pace of ∼500 videos per week which we plan to release in the near future." Do you intend to version the dataset and/or provide the new videos?
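On the resume point: if the CDN honors HTTP Range requests (which is also what wget --continue relies on), resumable downloads can be sketched with the standard library alone. make_range_header and resume_download are hypothetical names, not part of the official download script:

```python
import os
import urllib.request


def make_range_header(existing_bytes):
    # Ask the server for only the bytes we are still missing.
    return {"Range": "bytes=%d-" % existing_bytes} if existing_bytes > 0 else {}


def resume_download(url, out_path, chunk_size=1 << 20):
    # Append to a partial file if the server honors the Range request
    # (HTTP 206 Partial Content); otherwise start over from the beginning.
    existing = os.path.getsize(out_path) if os.path.exists(out_path) else 0
    request = urllib.request.Request(url, headers=make_range_header(existing))
    with urllib.request.urlopen(request) as response:
        mode = "ab" if response.status == 206 else "wb"
        with open(out_path, mode) as f_out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                f_out.write(chunk)
```

A 206 response means the server resumed from the requested offset; any other status falls back to a full re-download, which matches wget --continue's behavior.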

