ckanext-s3filestore's People

Contributors

amercader, blazhovsky, brew, bstutsky, chris-randall-qol, dependabot-preview[bot], dumyan, duttonw, goranmaxim, mbocevski, orihoch, rosswebsterwork, thrawnca, tino097, visar, zoranpandovski

ckanext-s3filestore's Issues

Running unit tests

Hi everybody, are the instructions for running the unit tests currently supposed to work? On Python 3 they fail for me because of a dependency on Pylons, which I don't think supports Python 3. When I tried a Python 2 Docker container instead, I ran into several missing packages.
I just wanted to ask whether this is supposed to work, or whether running the tests is currently broken.

Add query string etag to redirects to s3 bucket where pre-signed urls are not in use (ensuring file freshness)

Caching is great for user-facing performance, but it has major drawbacks when combined with xloader and similar systems.

Unless the filename changes, the previous file is returned until the cache expires. By then the xloader job has already run and skipped rebuilding the table.

To fix this, the proposal is to add a query string which contains the last modified date to the 302 redirect on public s3 access.

Why has this not occurred with signed urls before?
Internally, when ckanext-s3filestore generates a signed url for a resource, it stores the url in redis to speed up the user experience. The entry is cleared when the file is updated, so the next file retrieval re-populates the redis key/value store.

Note: We cannot do the same with CloudFront due to costs associated with invalidations. After the first 1000 cache invalidations per month they then charge $0.005 USD per invalidation which can add up quite quickly.
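To put a number on "add up quite quickly", here is a small worked calculation. It uses only the figures quoted in the note above (1000 free invalidations per month, then $0.005 USD each); the function name is made up for illustration, and actual CloudFront billing rules may differ.

```python
from decimal import Decimal

# Figures taken from the note above; treat them as assumptions, not a price list.
FREE_INVALIDATIONS_PER_MONTH = 1000
PRICE_PER_EXTRA_INVALIDATION = Decimal("0.005")  # USD


def monthly_invalidation_cost(invalidations):
    """Return the estimated monthly cost in USD (hypothetical helper)."""
    # Only invalidations beyond the free allowance are charged.
    billable = max(0, invalidations - FREE_INVALIDATIONS_PER_MONTH)
    return billable * PRICE_PER_EXTRA_INVALIDATION


# Example: one invalidation per resource update, 51000 updates in a month.
print(monthly_invalidation_cost(51000))
```

With these figures, 51000 invalidations in a month would leave 50000 billable ones, i.e. 250 USD.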

Solution:
Inside https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L270 we already have metadata for the file, such as Last-Modified or the etag. For simplicity, let us go with the etag, as it won't change until the file has been altered and won't interfere with our cache system performance.

So around https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L296, for public_read files, add ?etag=${etag}
This does not need to be done for pre-signed urls.

info on etags for s3: https://teppen.io/2018/06/23/aws_s3_etags/
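A minimal sketch of the proposed change, not the extension's actual code: the helper name `append_etag` and the example URL are made up for illustration. It assumes the etag arrives wrapped in double quotes, as S3 reports it, and that the public URL may or may not already carry a query string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def append_etag(url, etag):
    """Return url with an etag query parameter added (hypothetical helper)."""
    # S3 reports ETags wrapped in double quotes; strip them for a cleaner URL.
    clean_etag = etag.strip('"')
    parts = urlsplit(url)
    # Keep existing query parameters, replacing any earlier etag value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "etag"
    ]
    query.append(("etag", clean_etag))
    return urlunsplit(parts._replace(query=urlencode(query)))


public_url = "https://bucket.example.com/resources/abc/data.csv"
print(append_etag(public_url, '"9b2cf535f27731c974343645a3985328"'))
```

Because the etag only changes when the object changes, the redirect target stays cacheable until the file is replaced, at which point the new query string bypasses the stale cache entry.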

Error Trying to Download Image (from Minio)

Hello,

I successfully uploaded a small dataset containing one image (14 KB) through (a compiled) CKAN (v2.9.4) to Minio (.exe version RELEASE.2021-11-24T23-19-33Z) using (compiled-develop) s3filestore (qld-gov-au/ckanext-s3filestore tag: 0.7.3).

But attempting to download the image file from Minio via CKAN using s3filestore doesn't work, and CKAN complains (in the console) that a bytes-like object is required, not 'str' (running under Python 3.6.8):

2021-12-03 03:10:00,850 ERROR [ckan.config.middleware.flask_app] a bytes-like object is required, not 'str'
Traceback (most recent call last):
  File "/usr/lib/ckan/default/lib64/python3.6/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/ckan/default/lib64/python3.6/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/lib/ckan/default/src/ckanext-s3filestore/ckanext/s3filestore/views/resource.py", line 69, in resource_download
    url = upload.get_signed_url_to_key(key_path)
  File "/usr/lib/ckan/default/src/ckanext-s3filestore/ckanext/s3filestore/uploader.py", line 277, in get_signed_url_to_key
    if cache_url and is_public_read != _is_presigned_url(cache_url):
  File "/usr/lib/ckan/default/src/ckanext-s3filestore/ckanext/s3filestore/uploader.py", line 97, in _is_presigned_url
    parts = url.split('?')
TypeError: a bytes-like object is required, not 'str'
2021-12-03 03:10:00,923 INFO [ckan.config.middleware.flask_app] 500 /dataset/a75dbe92-1c80-4d44-82bb-212c5beebd56/resource/1c4e3745-b308-4f40-b1a6-b3ce01f453aa/download/MYIMAGE.png render time 0.245 seconds

Do others get the same behavior?
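The traceback is consistent with `cache_url` being `bytes` rather than `str`: on Python 3, calling `.split('?')` on a bytes object raises exactly this TypeError. One plausible cause, stated here as an assumption, is that the cached url is read back from redis as bytes (the Python redis client returns bytes unless `decode_responses` is enabled). A rough sketch of a workaround follows; `to_text` and `split_url_query` are hypothetical helpers, not functions from the extension.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def split_url_query(url):
    """Split a URL on '?' whether it was passed as str or bytes."""
    return to_text(url).split("?")


# Simulates a value read from a cache that hands back bytes.
cached = b"https://minio.example.com/bucket/MYIMAGE.png?X-Amz-Signature=abc"

try:
    cached.split("?")  # reproduces the error from the traceback
except TypeError as error:
    print("without decoding:", error)

print("with decoding:", split_url_query(cached))
```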

Question: behaviour with old artefacts in minio/s3 against a resource

Hi @dvnicolasdh and @aruneko,

With the move to have the s3 bucket objects set to public for 'public' datasets, there is now the possibility of deep-linked search results that point to a particular file instead of using CKAN's redirect download feature. We have tried to stop Funnelback and similar from caching the end result, but those attempts either don't work or remove the file from results entirely.

This was not an issue when we used signed urls with a 1 hour ttl, although that had the side effect of creating bad search results for search engines which cached the 1 hour ttl link.

On our s3 bucket we have versioning enabled, so we can recover overwritten files if a publisher overwrites one by accident.

Problem:
What should we do with objects still in the s3 bucket that are not the main linked file? In the past, with file store storage, we had many requests during the year to delete old files off disk because publishers did not want them visible any more.

Solutions:

  1. Leave all previous items public when the dataset is 'public'
    Pros: no change to the code base.
    Cons: deep links still reach the old file, which the publisher may not want.

  2. Ensure other files in the resource's s3 bucket location are marked as private, and only the object matching the url in the CKAN resource filename is public
    Pros: removes no-longer-valid files from being visible.
    Cons: old files can't be downloaded. There is a workaround by going to /<resource_id>/orig_download/ and, similarly for non-s3 objects, /<resource_id>/fs_download/

  3. When a file is uploaded, try to delete the old file if the filename is different (if the IUploader interface is what we have in the qld-gov-au/ckan branch, i.e. https://github.com/qld-gov-au/ckan/blob/qgov-master-2.8.8/ckan/plugins/interfaces.py#L1729 ). We already have this plugged in on resource download if IUploader matches: https://github.com/qld-gov-au/ckan/blob/qgov-master-2.8.8/ckan/logic/action/delete.py#L189 https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L421

I think it would be good to support both behaviours, with number 2 as the default and an optional config option to leave previous objects public.
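To make option 2 and the opt-out concrete, here is a rough sketch of the decision logic only, for a resource in a public dataset. The function name, the `keep_previous_public` flag and the example keys are hypothetical; applying the resulting ACLs to the bucket is left out.

```python
def plan_object_acls(keys, current_key, keep_previous_public=False):
    """Map each object key under a resource prefix to the canned ACL it should get.

    Hypothetical helper: keep_previous_public stands in for the optional
    config option discussed above.
    """
    acls = {}
    for key in keys:
        if key == current_key or keep_previous_public:
            # The file the CKAN resource points at always stays public.
            acls[key] = "public-read"
        else:
            # Older uploads are hidden unless the site opts out.
            acls[key] = "private"
    return acls


keys = ["resources/1c4e/report-2020.csv", "resources/1c4e/report-2021.csv"]
print(plan_object_acls(keys, current_key="resources/1c4e/report-2021.csv"))
```

Keeping the decision separate from the S3 calls means the default (option 2) and the opt-out (option 1) differ by a single flag.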

What do you both think?

cc: @ThrawnCA @chris-randall-osssio

XLSX files are not detected correctly

Hi all, XLSX files are detected as Zip files when ckanext-s3filestore is active. The reason is that the extension only looks at the first 512 bytes of the uploaded file: https://github.com/qld-gov-au/ckanext-s3filestore/blob/cf0c5bd/ckanext/s3filestore/uploader.py#L545

The part that differentiates an XLSX from a regular Zip comes later (exactly where depends on the file):

In [1]: import magic
In [2]: mime = magic.Magic(mime=True)
In [3]: f = open("empty.xlsx", "rb")
In [4]: mime.from_buffer(f.read(512))
Out[4]: 'application/zip'

In [5]: f.seek(0)
Out[5]: 0

In [6]: mime.from_buffer(f.read(1000))
Out[6]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

Since the number of bytes you have to read before python-magic determines it's an XLSX changes with XLSX size/complexity, you probably have to pass the whole file to be sure it works reliably.

By default, CKAN looks at the file extension to determine the mimetype (config ckan.mimetype_guess = file_ext); alternatively it reads the whole file (ckan.mimetype_guess = file_contents): https://github.com/ckan/ckan/blob/0a596b8/ckan/lib/uploader.py#L274

What do you think is the best way to fix this? Behave the same way as CKAN does, or stick to magic.from_buffer but read the whole file? Performance-wise that probably won't do any harm. Or use magic.from_file/from_descriptor?
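For comparison, here is a sketch of the extension-first approach using only the standard library. It is not CKAN's implementation: `guess_mimetype` and its `sniff_contents` callback are hypothetical names, and the callback is where a content-based check such as python-magic over the whole file could be plugged in.

```python
import mimetypes

XLSX_MIMETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Register the type explicitly so the result does not depend on the
# system's mime.types file.
mimetypes.add_type(XLSX_MIMETYPE, ".xlsx")


def guess_mimetype(filename, sniff_contents=None):
    """Guess by file extension first, then fall back to a content sniffer.

    Hypothetical helper: sniff_contents is an optional zero-argument callable
    that returns a mimetype string.
    """
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed is None and sniff_contents is not None:
        guessed = sniff_contents()
    return guessed


print(guess_mimetype("empty.xlsx"))
```

Guessing by extension avoids the 512-byte question altogether, at the cost of trusting the filename.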

Suggestions to improve the documentation.

Here are some suggestions to improve the documentation.

Our CKAN installation runs on the Amazon AWS cloud infrastructure. The main CKAN application runs in a Fargate/ECS container. Some installation instructions were missing for this case, but we found that it was enough to follow these steps:

  1. git clone the s3filestore repo
  2. pip install boto3 && python setup.py install
  3. add s3filestore to the ckan.plugins list

We encountered an issue with a self-signed certificate. Our "solution" was to edit the source code and insert verify=False in the get_s3_resource and get_s3_client methods. We will address this properly later.
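As a possible alternative to `verify=False`, boto3's `client` and `resource` calls also accept a path to a CA bundle in the `verify` argument. The sketch below only builds the keyword arguments; `build_s3_client_kwargs`, the endpoint and the bundle path are hypothetical, and wiring this into get_s3_resource and get_s3_client is not shown.

```python
def build_s3_client_kwargs(endpoint_url, ca_bundle_path=None, verify_tls=True):
    """Build keyword arguments for boto3.client("s3", ...) (hypothetical helper)."""
    kwargs = {"endpoint_url": endpoint_url}
    if ca_bundle_path:
        # Trust the self-signed certificate by pointing at its CA bundle.
        kwargs["verify"] = ca_bundle_path
    elif not verify_tls:
        # Last resort, equivalent to the verify=False edit described above.
        kwargs["verify"] = False
    return kwargs


kwargs = build_s3_client_kwargs(
    "https://s3.internal.example.com",
    ca_bundle_path="/etc/ssl/certs/internal-ca.pem",
)
print(kwargs)
# With boto3 installed, the client would then be created as:
# client = boto3.client("s3", **kwargs)
```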

Another issue we faced was that our company policy prohibits ACLs on S3 buckets, so they are disabled. Consequently, updating the ACL after a file upload causes errors. We had to modify the source code to prevent this.
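One way this could be made configurable, sketched under assumptions: the `acls_enabled` flag and the helper name are hypothetical and do not exist in the extension. The idea is simply to leave the `ACL` entry out of the upload arguments when the bucket has ACLs disabled.

```python
def build_upload_extra_args(content_type, acl="public-read", acls_enabled=True):
    """Build the ExtraArgs dict for a boto3 upload call (hypothetical helper)."""
    extra_args = {"ContentType": content_type}
    if acls_enabled:
        # Only send an ACL when the bucket accepts them.
        extra_args["ACL"] = acl
    return extra_args


print(build_upload_extra_args("text/csv", acls_enabled=False))
# With boto3 installed, this would be passed along the lines of:
# client.upload_fileobj(fileobj, bucket_name, key, ExtraArgs=extra_args)
```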
