Comments (8)
cc @menzenski since it sounds like they might be taking over some of the maintainer duties based on #28 (comment) 😄
from tap-spreadsheets-anywhere.
@menzenski plugin inheritance is one of the solutions I thought of, so we could add the connection string to the config and treat it like a normal password. I don't have any objections to that solution.
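A minimal sketch of what "treat it like a normal password" could look like, assuming a hypothetical config key named `azure_storage_connection_string` (the tap's real setting name may differ) with a fallback to the existing environment variable:

```python
import os
from typing import Mapping, Optional

# Hypothetical config key; the tap's real setting name may differ.
CONFIG_KEY = "azure_storage_connection_string"
ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def resolve_connection_string(
    config: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer the value from the tap config, then fall back to the environment."""
    environ = os.environ if environ is None else environ
    value = config.get(CONFIG_KEY) or environ.get(ENV_VAR)
    if not value:
        raise KeyError(
            f"Set '{CONFIG_KEY}' in the tap config or export {ENV_VAR}"
        )
    return value
```

With plugin inheritance, each inherited plugin could then carry its own value for this setting, so two storage accounts would not have to share one environment variable.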
from tap-spreadsheets-anywhere.
Regarding your other question about the additional parameter:
I think my personal preference would be for something like this:
def get_streamreader(uri, universal_newlines=True, newline='', open_mode='r'):
    transport_params = None
    if uri.startswith('azure://'):
        connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
        transport_params = {
            'client': BlobServiceClient.from_connection_string(connect_str),
        }
    streamreader = smart_open.open(uri, open_mode, newline=newline, errors='surrogateescape', transport_params=transport_params)
    if not universal_newlines and isinstance(streamreader, StreamReader):
        return monkey_patch_streamreader(streamreader)
    return streamreader
This would keep the single reference to smart_open.open and would also keep the monkeypatching for Azure (though to be honest I don't know if that's required).
from tap-spreadsheets-anywhere.
Thanks @pnadolny13 - @ets has indeed made me a collaborator on this repository although I haven't really taken any actions yet in that capacity.
I'll preface this comment with a caveat that I don't personally have any Azure experience. My own cloud experience is really AWS-only.
In Meltano, this works well by adding it to the .env file. But if anyone wants to read from multiple Azure storage accounts, this will be hard to configure.
My understanding is that this is something that's generally true of Meltano plugins - separating environment variables in this way isn't possible. (I don't believe I could do that with AWS S3 either).
Is there a need to connect to multiple Azure storage accounts in the same invocation of the same tap? Or could you use plugin inheritance to run the tap once for each storage account, providing different environment variables to each run?
from tap-spreadsheets-anywhere.
I would do this @radbrt
def get_streamreader(uri, universal_newlines=True, newline='', open_mode='r'):
    # Each entry is a zero-argument lambda, so a client is only built for the matching scheme.
    kwarg_dispatch = {
        "azure": lambda: {
            "transport_params": {
                "client": BlobServiceClient.from_connection_string(
                    os.environ['AZURE_STORAGE_CONNECTION_STRING'],
                )
            }
        },
        # smart_open uses the gs:// scheme for Google Cloud Storage, so the key is "gs".
        "gs": lambda: {
            "transport_params": {
                "client": storage.Client.from_service_account_json(
                    os.environ['GOOGLE_APPLICATION_CREDENTIALS'],
                    # We can add more nuanced transport params here
                )
            }
        },
        # Adding support for more is intuitive...
    }
    SCHEME_SEP = "://"
    # Unknown schemes (and plain local paths) fall back to no extra kwargs.
    kwargs = kwarg_dispatch.get(uri.split(SCHEME_SEP, 1)[0], lambda: {})()
    streamreader = smart_open.open(uri, open_mode, newline=newline, errors='surrogateescape', **kwargs)
    if not universal_newlines and isinstance(streamreader, StreamReader):
        return monkey_patch_streamreader(streamreader)
    return streamreader
EDIT: the lambdas are evaluated lazily, so nothing in the Google Cloud Storage entry is evaluated if the scheme resolves to azure.
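To illustrate the lazy evaluation without any cloud libraries installed, here is a small sketch in which the real client builders are replaced by stand-in factories that record when they run. The names `make_factory`, `kwargs_for`, and `calls` are hypothetical and only exist for this example:

```python
from typing import Any, Callable, Dict, List

calls: List[str] = []  # records which factories actually ran


def make_factory(name: str) -> Callable[[], Dict[str, Any]]:
    """Return a zero-argument factory standing in for a real client builder."""
    def factory() -> Dict[str, Any]:
        calls.append(name)
        return {"transport_params": {"client": f"<{name} client>"}}
    return factory


kwarg_dispatch: Dict[str, Callable[[], Dict[str, Any]]] = {
    "azure": make_factory("azure"),
    "gs": make_factory("gs"),  # smart_open's scheme for Google Cloud Storage
}


def kwargs_for(uri: str) -> Dict[str, Any]:
    """Look up the factory by URI scheme and call only that one."""
    scheme = uri.split("://", 1)[0]
    return kwarg_dispatch.get(scheme, lambda: {})()


print(kwargs_for("azure://container/file.csv"))
print(calls)  # only the azure factory has been called
```

A path with no recognised scheme, such as a local file, falls through to the empty default, so no credentials are needed for it.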
from tap-spreadsheets-anywhere.
Or as an FP 1-liner if you go full-tilt 😆 same thing.
def get_streamreader(uri: str, universal_newlines: bool = True, newline: str = "", open_mode: str = "r"):
    # Patch only when universal newlines are disabled and the reader is a StreamReader;
    # otherwise return the reader unchanged (same behaviour as the multi-line version).
    return (lambda rdr: monkey_patch_streamreader(rdr) if not universal_newlines and isinstance(rdr, StreamReader)
            else rdr)(
        smart_open.open(uri, open_mode, newline=newline, errors="surrogateescape", **{
            "azure": lambda: {
                "transport_params": {
                    "client": BlobServiceClient.from_connection_string(
                        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
                    )
                }
            },
            # smart_open uses the gs:// scheme for Google Cloud Storage, so the key is "gs".
            "gs": lambda: {
                "transport_params": {
                    "client": storage.Client.from_service_account_json(
                        os.environ["GOOGLE_APPLICATION_CREDENTIALS"],
                        # We can add more nuanced transport params here
                    )
                }
            },
            # Adding support for more is intuitive...
        }.get(uri.split("://", 1)[0], lambda: {})())
    )
from tap-spreadsheets-anywhere.
@radbrt I'd encourage you to go ahead and open a PR for this.
from tap-spreadsheets-anywhere.
@menzenski I absolutely plan to, there were a lot of good ideas here. Will probably have time this weekend.
from tap-spreadsheets-anywhere.