qld-gov-au / ckanext-s3filestore
Use Amazon S3 as a filestore for CKAN
License: GNU Affero General Public License v3.0
Hi everybody, are the instructions for running the unit tests currently supposed to work? On Python 3 it fails for me due to a dependency on Pylons, which I don't think supports Python 3. Trying with a Python 2 Docker container, I was running into several missing packages.
Just wanted to ask whether it's supposed to work, or whether running the tests is currently broken.
It has been found that caching is great for user performance but has major drawbacks when combined with xloader and similar systems.
Unless the filename is changed, the previous file will be returned until the cache has expired, but by then the xloader job has already run and skipped rebuilding the table.
To fix this, the proposal is to add a query string which contains the last modified date to the 302 redirect on public s3 access.
Why has this not occurred with signed urls before?
Internally, when ckanext-s3filestore generates a signed url for a resource, it stores the url in redis to speed up the user experience. This entry is cleared when the file is updated, so that the next file retrieval re-populates the redis key/value store.
Note: We cannot do the same with CloudFront due to costs associated with invalidations. After the first 1000 cache invalidations per month they then charge $0.005 USD per invalidation which can add up quite quickly.
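To make the cost concern concrete, here is a small worked calculation using only the figures quoted in the note above (1000 free invalidations per month, then $0.005 USD each). The function name is made up for illustration, and current AWS pricing may differ from the quoted numbers.

```python
from decimal import Decimal

# Figures as quoted in the note above; check current AWS pricing before relying on them.
FREE_INVALIDATIONS_PER_MONTH = 1000
PRICE_PER_INVALIDATION_USD = Decimal("0.005")


def monthly_invalidation_cost(invalidations):
    """Return the monthly cost in USD for a given number of invalidations."""
    billable = max(0, invalidations - FREE_INVALIDATIONS_PER_MONTH)
    return billable * PRICE_PER_INVALIDATION_USD


# One invalidation per resource update, 51,000 updates in a month:
print(monthly_invalidation_cost(51000))
```

Decimal is used here only to avoid floating-point rounding in the currency arithmetic.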
Solution:
Inside https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L270 we already have metadata of the file such as Last-Modified or etag. For simplicity, let us go with etag as it won't change till the file has been altered and won't interfere with our cache system performance.
So, around https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L296, for public_read files, add ?etag=${etag} to the url.
This does not need to be done for pre-signed urls.
info on etags for s3: https://teppen.io/2018/06/23/aws_s3_etags/
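A minimal sketch of the proposal, assuming the public url and the object's ETag are already at hand. The helper name add_etag_param is hypothetical and is not part of the extension; it only shows how the query string could be attached using the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_etag_param(url, etag):
    """Append ?etag=<etag> to a public url (hypothetical helper, not the extension's code)."""
    # S3 returns ETag values wrapped in double quotes; strip them for the query string.
    clean_etag = etag.strip('"')
    parts = urlsplit(url)
    # Keep any existing query parameters and add the etag at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("etag", clean_etag))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_etag_param("https://example-bucket.s3.amazonaws.com/resources/abc/data.csv", '"d41d8cd9"'))
```

Because the ETag only changes when the object content changes, caches keep serving the same url until a new file is uploaded, at which point the redirect target differs and the cache is bypassed.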
Hello,
Successfully uploaded a small DATASET containing one image (14KB) through (a compiled) CKAN (v 2.9.4) to Minio (.exe version RELEASE.2021-11-24T23-19-33Z) using (compiled-develop) s3filestore (qld-gov-au/ckanext-s3filestore tag: 0.7.3).
But attempting to download the image file from Minio via CKAN using s3filestore doesn't work, and ckan complains (in the console) about: bytes-like object being read as "str".... (running under Python 3.6.8)
2021-12-03 03:10:00,850 ERROR [ckan.config.middleware.flask_app] a bytes-like object is required, not 'str'
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib64/python3.6/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/lib/ckan/default/lib64/python3.6/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/lib/ckan/default/src/ckanext-s3filestore/ckanext/s3filestore/views/resource.py", line 69, in resource_download
url = upload.get_signed_url_to_key(key_path)
File "/usr/lib/ckan/default/src/ckanext-s3filestore/ckanext/s3filestore/uploader.py", line 277, in get_signed_url_to_key
if cache_url and is_public_read != _is_presigned_url(cache_url):
File "/usr/lib/ckan/default/src/ckanext-s3filestore/ckanext/s3filestore/uploader.py", line 97, in _is_presigned_url
parts = url.split('?')
TypeError: a bytes-like object is required, not 'str'
2021-12-03 03:10:00,923 INFO [ckan.config.middleware.flask_app] 500 /dataset/a75dbe92-1c80-4d44-82bb-212c5beebd56/resource/1c4e3745-b308-4f40-b1a6-b3ce01f453aa/download/MYIMAGE.png render time 0.245 seconds
Do others get the same behavior ?
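The traceback suggests the cached url comes back from redis as bytes under Python 3, so url.split('?') is called on a bytes object with a str argument. A possible workaround, sketched under that assumption, is to decode the cached value before inspecting it. The function below is an illustration only: the real _is_presigned_url may check something different, and the signature parameter names are my assumption about what a presigned S3 url contains.

```python
from urllib.parse import parse_qs, urlsplit


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes (e.g. read from redis)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def is_presigned_url(url):
    """Illustrative check: does the url carry an S3 signature query parameter?"""
    query = parse_qs(urlsplit(to_text(url)).query)
    # "X-Amz-Signature" is used by Signature Version 4, "Signature" by Version 2.
    return "X-Amz-Signature" in query or "Signature" in query


print(is_presigned_url(b"https://minio.local/bucket/key.png?X-Amz-Signature=abc"))
print(is_presigned_url("https://minio.local/bucket/key.png"))
```

Another option, if the redis client is created by the extension, would be to have the client decode responses itself, but I have not checked how the connection is set up.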
Hi @dvnicolasdh and @aruneko,
With the move to have the s3 bucket objects set to public for 'public' datasets, there is now the possibility of deep-linked search results that point directly to a particular file instead of going through ckan's redirect download feature. We have tried to stop funnelback and similar from caching the end result, but those attempts don't work (or remove the file entirely).
This was not an issue when we used signed urls with a 1 hour ttl, although that had its own side effect: search engines cached the 1 hour ttl link, which produced bad search results.
On our s3 bucket we have versioning enabled so that we can recover overwritten files if a publisher needs one back.
Problem:
What should we do with objects still in the s3 bucket that are not the main linked file? In the past, with file store storage, we had many requests during the year to delete old files off disk, as publishers did not want them visible any more.
Solutions:
1. Leave all previous items public when the dataset is 'public'.
Pros: no change to the code base.
Cons: deep links still reach old files which the publisher may not want visible.
2. Ensure other files in the resource's s3 bucket location are marked as private, and only the file matching the url in the ckan resource is public.
Pros: removes non-valid files from being visible.
Cons: old files can't be downloaded. There is a workaround by going to /<resource_id>/orig_download/ and, similarly for non-s3 objects, /<resource_id>/fs_download/
3. When a file is uploaded, try to delete the previous file if the filename is different (if the IUploader interface is what we have in the qld-gov-au/ckan branch, i.e. https://github.com/qld-gov-au/ckan/blob/qgov-master-2.8.8/ckan/plugins/interfaces.py#L1729 ). We already have this plugged in on resource download if IUploader matches https://github.com/qld-gov-au/ckan/blob/qgov-master-2.8.8/ckan/logic/action/delete.py#L189 https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L421
I think it would be good to support both, defaulting to option 2, with an optional config option that leaves previous objects public.
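As a rough sketch of how the second option could behave, assuming we can list the object keys under a resource's prefix: only the key matching the current resource url stays public, unless a flag keeps the old behaviour. The function name, the keep_old_public flag and the key layout are all hypothetical and only illustrate the decision logic, not any existing code.

```python
def plan_acls(keys, current_key, keep_old_public=False):
    """Map each object key to the ACL it should have for a public dataset.

    keys: all object keys under the resource's prefix (assumed to be listed elsewhere).
    current_key: the key matching the url stored on the ckan resource.
    keep_old_public: hypothetical config flag that keeps previous objects public.
    """
    plan = {}
    for key in keys:
        if key == current_key or keep_old_public:
            plan[key] = "public-read"
        else:
            plan[key] = "private"
    return plan


keys = ["resources/abc/old.csv", "resources/abc/new.csv"]
print(plan_acls(keys, "resources/abc/new.csv"))
```

Applying the plan would then be one ACL update per key whose current ACL differs, which keeps the number of S3 calls small.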
What do you both think?
cc: @ThrawnCA @chris-randall-osssio
Hi all, XLSX files are detected as Zip files when ckanext-s3filestore is active. The reason is that the extension only looks at the first 512 bytes of the uploaded file: https://github.com/qld-gov-au/ckanext-s3filestore/blob/cf0c5bd/ckanext/s3filestore/uploader.py#L545
The part that differentiates an XLSX from a regular Zip comes later (how much later depends on the file as well):
In [1]: import magic
In [2]: mime = magic.Magic(mime=True)
In [3]: f = open("empty.xlsx", "rb")
In [4]: mime.from_buffer(f.read(512))
Out[4]: 'application/zip'
In [5]: f.seek(0)
Out[5]: 0
In [6]: mime.from_buffer(f.read(1000))
Out[6]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
Since the number of bytes you have to read before python-magic determines it's an XLSX changes with XLSX size/complexity, you probably have to pass the whole file to be sure it works reliably.
CKAN by default tries to look at the file extension to determine the mimetype (config ckan.mimetype_guess = file_ext), or reads the whole file (ckan.mimetype_guess = file_contents): https://github.com/ckan/ckan/blob/0a596b8/ckan/lib/uploader.py#L274
What do you think is the best way to fix this? To behave the same way as CKAN does, or stick to magic.from_buffer but read the whole file? Performance-wise that probably won't do any harm. Or use magic.from_file/from_descriptor?
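To illustrate the first alternative, here is a sketch that guesses from the file extension first and only falls back to content sniffing on the whole payload when the extension gives no answer. The function name and the sniff parameter are hypothetical; with python-magic installed, sniff could be something like lambda data: magic.from_buffer(data, mime=True), but the sketch itself only uses the standard library.

```python
import mimetypes


def guess_mimetype(filename, data, sniff=None):
    """Guess a mimetype from the extension, falling back to sniffing all of data.

    sniff: optional callable taking the full bytes and returning a mimetype string
    (hypothetical hook, e.g. a wrapper around python-magic).
    """
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed:
        return guessed
    if sniff is not None:
        # Pass the whole payload, not just the first 512 bytes.
        return sniff(data)
    return None


print(guess_mimetype("photo.png", b""))
print(guess_mimetype("noextension", b"abc", sniff=lambda data: "application/octet-stream"))
```

Whether the extension or the content should win when both are available is a design choice; this sketch simply mirrors the order of the two CKAN settings mentioned above.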
Here are some suggestions to improve the documentation.
Our CKAN installation was done on the Amazon AWS cloud infrastructure. The main CKAN application runs in a Fargate/ECS container. In this case, there were some missing instructions on how to proceed with the installation, but we realized that it was enough to follow the following steps:
We encountered an issue with a self-signed certificate. Our "solution" was to edit the source code and insert verify=False in the get_s3_resource and get_s3_client methods. We will address this properly later.
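One way this could be made configurable rather than patched in, sketched with made-up option names (example.s3.ca_bundle and example.s3.verify are not real settings of the extension): boto3's client and resource constructors accept a verify argument that is either False or a path to a CA bundle, so pointing it at the self-signed certificate's bundle avoids turning verification off.

```python
def s3_connection_kwargs(config):
    """Build the verify keyword for boto3 from a config dict (option names are hypothetical)."""
    ca_bundle = config.get("example.s3.ca_bundle")
    if ca_bundle:
        # Preferred: keep verification on, but trust the given CA bundle.
        return {"verify": ca_bundle}
    if str(config.get("example.s3.verify", "true")).lower() == "false":
        # Last resort: disable certificate verification entirely.
        return {"verify": False}
    # Default: let boto3 verify against its normal CA store.
    return {}


# Usage sketch (requires boto3):
#   client = boto3.client("s3", **s3_connection_kwargs(config))
print(s3_connection_kwargs({"example.s3.ca_bundle": "/etc/ssl/certs/internal-ca.pem"}))
```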
Another issue we faced was that our company policy prohibits ACL in S3 buckets, so it is disabled. Consequently, updating ACL after file upload causes errors. We had to modify the source code to prevent this.
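A similar sketch for the ACL case, assuming the upload goes through boto3 with an ExtraArgs dictionary: the ACL key is simply left out when a flag says the bucket has ACLs disabled. The function name and the acl_enabled flag are hypothetical and do not correspond to an existing option.

```python
def upload_extra_args(content_type, is_public, acl_enabled=True):
    """Build an ExtraArgs dict for an S3 upload, omitting ACL when ACLs are disabled."""
    extra_args = {"ContentType": content_type}
    if acl_enabled:
        extra_args["ACL"] = "public-read" if is_public else "private"
    return extra_args


# Usage sketch (requires boto3):
#   client.upload_fileobj(fileobj, bucket, key, ExtraArgs=upload_extra_args("text/csv", True, acl_enabled=False))
print(upload_extra_args("text/csv", True, acl_enabled=False))
```

With ACLs disabled, public access would have to come from a bucket policy instead, which is outside the scope of this sketch.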