wjohnson / pyapacheatlas
A Python package to help work with the Apache Atlas REST APIs
Home Page: https://wjohnson.github.io/pyapacheatlas-docs/latest/
License: MIT License
Enable a rough "discovery" of possible candidates for assets that need to be tagged with a glossary term.
This should be possible through the advanced search and by parsing the glossary terms.
pyapacheatlas lets you upload entities with classifications specified. It would be useful to be able to define new classifications directly from the Excel template.
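For context, a minimal sketch of how a new classification can be defined programmatically today, assuming ClassificationTypeDef from pyapacheatlas.core.typedef and an already-authenticated client; the Excel template would generate an equivalent upload:

```python
# Sketch: define and upload a classification type def before tagging entities.
# ClassificationTypeDef usage is an assumption based on the library's typedef module.
from pyapacheatlas.core.typedef import ClassificationTypeDef

manual_import = ClassificationTypeDef(
    name="manual_import",
    description="Entities imported by hand via pyapacheatlas",
)

# upload_typedefs takes a dict keyed by def category.
client.upload_typedefs({"classificationDefs": [manual_import.to_json()]})
```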
After #47 it should include the ability to handle multiple inputs or outputs in the spreadsheet.
If there's an N/A in one row and it's being defined in another row, assume the N/A and WARN on the output.
PEP 479 indicates that raising StopIteration inside a generator is bad behavior, and Python 3.7+ converts it into a RuntimeError.
Need to replace this StopIteration with a plain return when the inner function completes in AtlasClient.search_entities.
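A minimal sketch of the fix, assuming a paging generator shaped like AtlasClient.search_entities (fetch_page and the loop structure are illustrative):

```python
# Sketch: end a paging generator with `return`, never `raise StopIteration`.
def search_entities(query, fetch_page):
    offset = 0
    while True:
        page = fetch_page(query, offset)
        if not page:
            # Raising StopIteration here would surface as a RuntimeError
            # under PEP 479 (Python 3.7+); a bare return ends the generator.
            return
        for result in page:
            yield result
        offset += len(page)
```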
Hi, I have a customer who also wants to be able to manage the owner and expert through the API, and to assign them during the creation of custom ones.
The Azure Data Catalog provides a CSV glossary term upload with the following fields. The goal of this issue would be to develop a similar offering via the excel template and replicate the features.
Columns of CSV / Excel File:
The dynamic attribute should be attached to an attributes property:
{
    "attributes": {
        "termTemplateName": {
            "extraAttributeName": ""
        }
    }
}
This is accomplished through the search API and requires paging through the results.
The goal would be to extract every entity and enable users to essentially "back up" their data catalog, but also potentially relocate their data catalog by uploading the results of this extraction.
Need to consider the upload process as well. Presumably you have to replace the guids when pushing to the new catalog, since entity upload requires a negative number as the guid.
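A rough sketch of the extraction loop, assuming an existing client; the search-hit field name ("id") is an assumption and may need adjusting to the actual search payload:

```python
# Sketch: page through every search result and pull the full entity bodies.
backup = []
for hit in client.search_entities("*"):
    detail = client.get_entity(guid=hit["id"])
    backup.extend(detail.get("entities", []))

# On re-upload, swap real guids for negative placeholders, since entity
# upload expects negative guids for entities that do not exist yet.
# (Guid references inside relationship attributes would also need remapping.)
for i, entity in enumerate(backup):
    entity["guid"] = str(-(i + 1))
```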
Hello,
Would your samples/excel/excel_bulk_entities_upload.py work for updating existing columns? I am trying to find a way to bulk update columns that have already been scanned in via the GUI. We want to add additional information to the columns, mainly descriptions and glossary links.
I am trying to test it by updating a single column (adding a description to it). Below is what I have in the spreadsheet.
typeName | name | qualifiedName | classifications | [Relationship] table | type | description
---|---|---|---|---|---|---
mssql_column | my_column | mssql://XXXXXXXXXXX:XXXXXXX/MSSQLSERVER/XXXXX/XXXXX/my_table#my_column | | pyapacheatlas://my_table | smallint | testing
Running this gives me the following error --
KeyError: 'The entity pyapacheatlas://my_table should be listed before mssql://XXXXXXXXXXX:XXXXXXX/MSSQLSERVER/XXXXX/XXXXX/my_table#my_column.'
I am not sure how to interpret this. Any help is greatly appreciated. Thank you.
Hi,
I am trying to run the code and create a sample entity, but I am getting the following error. I have checked the credentials and everything seems fine.
Traceback (most recent call last):
File "c:\Users\shkh\Purview.py", line 105, in
batch=[output01, input01, process]
File "C:\Users\shkh\AppData\Roaming\Python\Python36\site-packages\pyapacheatlas\core\client.py", line 927, in upload_entities
headers=self.authentication.get_authentication_headers()
File "C:\Users\shkh\AppData\Roaming\Python\Python36\site-packages\pyapacheatlas\auth\serviceprincipal.py", line 58, in get_authentication_headers
self._set_access_token()
File "C:\Users\shkh\AppData\Roaming\Python\Python36\site-packages\pyapacheatlas\auth\serviceprincipal.py", line 48, in _set_access_token
self.expiration = datetime.fromtimestamp(int(authJson["expires_in"]))
OSError: [Errno 22] Invalid argument
Create an excel reader function that supports upload of entities without needing column or table level lineage.
Currently, the Columns and Tables tabs expect you to be creating source and targets.
A new tab should be added to the template to support BulkEntities and the Columns and Tables tabs should be renamed to ColumnsLineage and TablesLineage as defaults.
BulkEntities should be able to automatically take column headers as the attributes. If a cell is empty, it will not add that attribute to the entity.
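For illustration, the proposed tab might be consumed like this; the sheet name and parse_bulk_entities method follow this issue's proposal, so treat the exact names as assumptions:

```python
# Sketch: read a BulkEntities tab from the Excel template into entity dicts.
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

config = ExcelConfiguration(bulk_entity_sheet="BulkEntities")
reader = ExcelReader(config)
entities = reader.parse_bulk_entities("./template.xlsx")
# Empty cells would simply be skipped rather than written as empty attributes.
```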
Currently, pyapacheatlas uploads entity classifications with the propagation attribute activated.
This is not convenient for all use cases. For instance, one might like to add a classification such as "manual_import" to differentiate, when browsing the catalog, the entities imported with pyapacheatlas from those populated automatically. Currently, when uploading related entities with this classification, one ends up with a series of propagated classifications reading "manual_import manual_import manual_import..." as many times as there are relationships (which can be more than 10 in my case).
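A minimal sketch of the behavior being asked for, assuming AtlasClassification exposes a propagate flag mirroring the Atlas classification model (treat the exact signature as an assumption):

```python
# Sketch: attach a classification with propagation turned off.
from pyapacheatlas.core import AtlasClassification, AtlasEntity

entity = AtlasEntity(
    name="exampledataset",
    typeName="DataSet",
    qualified_name="pyapacheatlas://dataset",
    guid=-1,
    classifications=[
        AtlasClassification("manual_import", propagate=False).to_json()
    ],
)
```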
ADC Gen 1 glossary terms should be importable into Purview!
Add a migration sample for ADC Gen1 that:
When the client is created with PurviewClient(), the is_purview client attribute is incorrectly set to False.
This causes search_entities() to throw RuntimeWarning: You're using a Purview only feature on a non-purview endpoint:
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

auth = ServicePrincipalAuthentication(
    tenant_id="...",
    client_id="...",
    client_secret="..."
)

client = PurviewClient(
    account_name="my-purview-account-name",
    authentication=auth
)

print('client.is_purview:', client.is_purview)
# >> False

for i in client.search_entities('totemove'):
    print(i)
# >> ...python3.9/site-packages/pyapacheatlas/core/util.py:18:
# >> RuntimeWarning: You're using a Purview only feature on a non-purview endpoint.
# >> warnings.warn(
The output is ok, despite this warning.
Workaround: Set client.is_purview = True after client creation.
Any of the additional "required" attributes should be okay if they're not part of the type since they are ignored.
When you create a column lineage entity, you have a dependencyType attribute that is either SIMPLE or EXPRESSION. If you have an EXPRESSION value then you would also see an expression attribute. That expression attribute would contain the code used to create that field.
If you go to re-run the parse_lineages method with existing entities (based on type and qualified name) and remove the transformation value for a given column lineage, you end up with a SIMPLE dependencyType but still have a value in the expression attribute.
Instead, the default for expression should be set to null. However, this may break other scenarios where we want to omit null values. There may have to be a compromise: an empty string value or an NA value instead?
Congratulations and many thanks for this great tool!
The samples provided are very useful, but I cannot figure out how the custom attributes and relationships are passed to the Atlas API.
For instance, the script samples/excel/excel_bulk_entities_upload.py produces an excel BulkEntities sheet with two additional columns: "[Relationship] table", and "type".
The corresponding information is visible in the dict output by excel_reader.parse_bulk_entities(), but I cannot find it in the result of client.upload_entities() that also gets printed to the console (see below). How are the "[Relationship] table" and "type" attributes passed to Apache Atlas in this case?
I would really need to understand this to grasp exactly what kind of related objects I can pass to the catalog API with pyapacheatlas.
runfile('C:/Users/FBEDECARRA/Desktop/Tests Apache Atlas/sample_bulk_upload.py', wdir='C:/Users/FBEDECARRA/Desktop/Tests Apache Atlas')
{
"mutatedEntities": {
"CREATE": [
{
"typeName": "DataSet",
"attributes": {
"qualifiedName": "pyapacheatlas://dataset",
"name": "exampledataset"
},
"guid": "f24c4f22-c5e3-4776-a630-41e533b47099",
"status": "ACTIVE",
"displayText": "exampledataset",
"classificationNames": [],
"classifications": [],
"meaningNames": [],
"meanings": [],
"isIncomplete": false,
"labels": []
},
{
"typeName": "hive_table",
"attributes": {
"createTime": 0,
"qualifiedName": "pyapacheatlas://hivetable01",
"name": "hivetable01"
},
"guid": "46efb945-281d-497b-8334-92c668fb8d5b",
"status": "ACTIVE",
"displayText": "hivetable01",
"classificationNames": [],
"classifications": [],
"meaningNames": [],
"meanings": [],
"isIncomplete": false,
"labels": []
},
{
"typeName": "hive_column",
"attributes": {
"qualifiedName": "pyapacheatlas://hivetable01#colA",
"name": "columnA"
},
"guid": "195d4775-69f0-48fe-b63c-88c0e30066fa",
"status": "ACTIVE",
"displayText": "columnA",
"classificationNames": [],
"classifications": [],
"meaningNames": [],
"meanings": [],
"isIncomplete": false,
"labels": []
},
{
"typeName": "hive_column",
"attributes": {
"qualifiedName": "pyapacheatlas://hivetable01#colB",
"name": "columnB"
},
"guid": "f43b8f63-63da-4c82-b5f5-2b09c0418e67",
"status": "ACTIVE",
"displayText": "columnB",
"classificationNames": [],
"classifications": [],
"meaningNames": [],
"meanings": [],
"isIncomplete": false,
"labels": []
},
{
"typeName": "hive_column",
"attributes": {
"qualifiedName": "pyapacheatlas://hivetable01#colC",
"name": "columnC"
},
"guid": "f1650ead-6b7e-4dce-aa2b-03ddb18ebca3",
"status": "ACTIVE",
"displayText": "columnC",
"classificationNames": [],
"classifications": [],
"meaningNames": [],
"meanings": [],
"isIncomplete": false,
"labels": []
}
]
},
"guidAssignments": {
"-1005": "f1650ead-6b7e-4dce-aa2b-03ddb18ebca3",
"-1004": "f43b8f63-63da-4c82-b5f5-2b09c0418e67",
"-1001": "f24c4f22-c5e3-4776-a630-41e533b47099",
"-1003": "195d4775-69f0-48fe-b63c-88c0e30066fa",
"-1002": "46efb945-281d-497b-8334-92c668fb8d5b"
}
}
Completed bulk upload successfully!
Search for hivetable01 to see your results.
Related to #31 in the sense that a table has a columns relationship attribute.
This is the appropriate syntax and needs to be fixed in the column lineage reader.
column_mapping = [
    {
        "ColumnMapping": [
            {"Source": "AddressType", "Sink": "address"},
            {"Source": "CustomerId", "Sink": "cust_id"}
        ],
        "DatasetMapping": {
            "Source": custAddr.qualifiedName, "Sink": customer.qualifiedName
        }
    },
    {
        "ColumnMapping": [
            {"Source": "total_emp", "Sink": "cust_id"},
            {"Source": "description", "Sink": "username"}
        ],
        "DatasetMapping": {
            "Source": sample.qualifiedName, "Sink": customer.qualifiedName
        }
    }
]
After completing #29 and merging #32 , there is a potential need to connect relationship attributes to an uploaded entity. For example, you might upload several tables and columns. However, those columns would be unattached entities and have no relationships.
There needs to be something like (Relationship) attributeX in the BulkEntities tab or Target (Relationship) attributeY in the Lineages tabs.
Others may want to implement readers for different formats.
For example, you may want to create a JSON reader or a DelimitedFile reader that implements the same standard methods to parse the results.
This will result in merging:
This would be a breaking change for the samples.
There should be a PurviewClient that accepts an account_name attribute and fills in the endpoint_url for you.
The PurviewClient should be used as a test to warn when...
Given an excel spreadsheet with column headers, generate an entity based on the column headers as attributes.
The goal would be to quickly generate the type and have it be hand edited to modify the results.
Stretch goal should be to allow for entities (the rows of the spreadsheet) to be created for that entity type.
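A minimal sketch of the idea, assuming EntityTypeDef and AtlasAttributeDef from pyapacheatlas.core.typedef; the header list and type name are illustrative:

```python
# Sketch: turn spreadsheet column headers into attribute defs on a new type.
from pyapacheatlas.core.typedef import AtlasAttributeDef, EntityTypeDef

headers = ["name", "qualifiedName", "costCenter", "steward"]
generated_type = EntityTypeDef(
    name="generated_spreadsheet_type",
    superTypes=["DataSet"],
    attributeDefs=[AtlasAttributeDef(name=h).to_json() for h in headers],
)
# The resulting def could then be hand edited before uploading via upload_typedefs.
```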
Right now, the Excel files are only smart enough to include classifications (which might need to be made into an optional field).
By including glossary terms, this would support bulk updates to entities that can't be currently done in Purview.
Implementation should look at adding a meanings special header that supports multiple semi-colon delimited terms that get mapped as relationship attributes.
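A sketch of parsing that proposed header (the cell value is illustrative):

```python
# Sketch: split the proposed semi-colon delimited "meanings" cell into terms.
cell = "PII;Finance;Customer Data"
terms = [t.strip() for t in cell.split(";") if t.strip()]
# Each term would then be attached via the "meanings" relationship attribute.
```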
Hi,
I've run the excel_custom_table_column_lineage.py sample and it works fine. Below is a picture of the lineage tab from the perspective of DestTable01.
However when I try the same exact type of lineage setup using some of my MSSQL tables, I get the following --
Three of the four some_adf_job entities are of type MS SQL Column Lineage. I don't want those showing. I only want the process entity showing like how it is in the demo. Also in the demo you can search for the columns on the left, but I can't do that here.
Any idea on what I could be doing wrong? I uploaded the missing MSSQL typedefs using the column_lineage_scaffold template beforehand.
Support the following REST endpoints with AtlasClient methods to round out the supported features:
/v2/entity/bulk/classification (POST)
/v2/entity/guid/{guid}/classifications (GET | POST | PUT)
/v2/entity/guid/{guid}/classification/{classificationName} (DELETE | GET)
/v2/types/classificationdef/guid/{guid} (GET) (Already supported)
/v2/types/classificationdef/name/{name} (GET) (Already supported)
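For the first endpoint, a rough sketch of what a helper could look like; the payload follows the Atlas ClassificationAssociateRequest shape, but the helper itself is an assumption, not the library's current API:

```python
# Sketch: associate one classification with many entities in a single call.
import requests

def classify_bulk(endpoint_url, headers, classification_name, guids):
    payload = {
        "classification": {"typeName": classification_name},
        "entityGuids": guids,
    }
    resp = requests.post(
        f"{endpoint_url}/v2/entity/bulk/classification",
        json=payload,
        headers=headers,
    )
    resp.raise_for_status()
    return resp.status_code
```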
Add a switch to the import process (or the PurviewClient's authentication?) so that the user can signal "My SP has admin-granted permissions to call the Graph". In that case, the package will know it doesn't have to ask for interactive login. It can use the SP to call the Graph straight away. This would enable a scenario where the package can be used in a fully automated environment.
A CLI would help with using PyApacheAtlas as part of a tool chain and would handle simple, recurring tasks such as:
Hello,
I'm seeing the following error when running databricks_catalog_dataframe.py in Databricks:
TypeError: 'EntityTypeDef' object is not subscriptable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1278818675318490> in <module>
78 "relationshipDefs":[spark_column_to_df_relationship]
79 },
---> 80 force_update=True)
81 print(typedef_results)
82
/databricks/python/lib/python3.7/site-packages/pyapacheatlas/core/client.py in upload_typedefs(self, typedefs, force_update, **kwargs)
840 new_types[cat] = []
841 for t in typelist:
--> 842 if t["name"] in types_from_client[cat]:
843 existing_types[cat].append(t)
844 else:
TypeError: 'EntityTypeDef' object is not subscriptable
Hello,
I am trying to retrieve data about the synapse (azure_sql_dw) instance I have set up in our Purview catalog. However, when I try to get that information via get_entity, it returns an error.
Here is the code --
azure_sql_dw = client.get_entity(guid='c39966cb-fbe1-4394-9b44-1d3bbafeb38e')
Here is the error --
HTTPError: 500 Server Error: Internal Server Error for url: https://XXXXXXXXXXXX.catalog.purview.azure.com/api/atlas/v2/entity/bulk?guid=c39966cb-fbe1-4394-9b44-1d3bbafeb38e
Any idea on what could be causing this? Calls I make to table or column objects work fine.
Thanks,
Zack
Change the package so that it looks at the experts and owners input. If the values look like guids, then proceed as before. If they look like email addresses, force the user to login interactively and then the package will use the Graph API to translate the email addresses to guids on the user's behalf.
This should only occur in the PurviewClient and only applies to entity uploads and glossary term uploads. This is already handled in the terms/import CSV route developed in #77.
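A sketch of the input sniffing described above; the regex and helper names are illustrative, and graph_lookup stands in for the Graph API call:

```python
# Sketch: pass guids through unchanged, translate email-shaped values via Graph.
import re

GUID_RE = re.compile(r"^[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")

def resolve_contacts(values, graph_lookup):
    # graph_lookup is an assumed callable mapping an email to an AAD object id.
    return [v if GUID_RE.match(v) else graph_lookup(v) for v in values]
```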
Currently, upload_entities only supports a dictionary or a list of dictionaries. It should also handle a single AtlasEntity or a list of AtlasEntities. If the batch is a dictionary with an "entities": [] key, then assume the caller is passing in a list of dicts already, since they know the format.
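A sketch of the proposed normalization; the helper name is illustrative, not part of the library:

```python
# Sketch: accept AtlasEntity objects, lists, or the raw dict format.
from pyapacheatlas.core import AtlasEntity

def normalize_batch(batch):
    if isinstance(batch, dict) and "entities" in batch:
        return batch["entities"]  # caller already knows the format
    if isinstance(batch, AtlasEntity):
        batch = [batch]
    return [
        e.to_json() if isinstance(e, AtlasEntity) else e
        for e in batch
    ]
```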
Knock out the LineageREST section!
Purview ONLY Support
GET /atlas/v2/lineage/{guid}/next/
GET /atlas/v2/lineage/{guid}
Purview Limitation
GET /v2/lineage/uniqueAttribute/type/{typeName}
Hi,
I'm trying to build a function that takes two entities and creates a process between them using the AtlasProcess class. My problem is that I need to create a new guid and be sure that it is not already assigned to one of my assets in Purview. Is there a function that creates a new guid, given the ones that already exist in Purview?
Thank you,
Edoardo
As major methods like AtlasClient.upload_entities take on the role of converting objects into JSON, so should AtlasProcess.
Three areas require changes:
__init__ should handle the inputs and outputs attributes.
set_outputs ...
set_inputs ...
In each case, it should allow an AtlasEntity and execute the to_json(minimum=True) method for you.
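A sketch of the proposed ergonomics (this is the issue's proposal, not current behavior): AtlasEntity objects are handed straight to AtlasProcess, which would call to_json(minimum=True) internally.

```python
# Sketch: pass AtlasEntity objects directly instead of pre-converting them.
from pyapacheatlas.core import AtlasEntity, AtlasProcess

input01 = AtlasEntity("input01", "DataSet", "pyapacheatlas://input01", guid=-100)
output01 = AtlasEntity("output01", "DataSet", "pyapacheatlas://output01", guid=-101)

process = AtlasProcess(
    name="sample_process",
    typeName="Process",
    qualified_name="pyapacheatlas://sample_process",
    guid=-102,
    inputs=[input01],    # converted internally instead of input01.to_json(minimum=True)
    outputs=[output01],
)
```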
upload_typedefs currently accepts a typedef parameter that can take in different values.
I think it would be better if it had arguments for the required keys: "classificationDefs", "entityDefs", "enumDefs", "relationshipDefs", "structDefs". That way you don't have to construct the dict yourself.
The arguments should accept a list of either AtlasTypeDefs (and converts them into dicts) or dicts.
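A sketch of the proposed keyword-argument form; the signature is this issue's proposal, not the current API, and the typedef variables are illustrative:

```python
# Sketch: pass typedefs by category instead of hand-building the dict.
results = client.upload_typedefs(
    entityDefs=[column_lineage_type],              # illustrative AtlasTypeDef or dict
    relationshipDefs=[column_to_table_relationship],
    force_update=False,
)
```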
@properties to get and set attribute defs in whole
@properties to get and set relationship attribute defs in whole
@properties to get and set input/output in whole
@properties to get and set relationship attributes in whole

There are two sorts of headers that work!
The currently supported version looks like this:
{
    "guid": -1,
    "typeName": "",
    "qualifiedName": ""
}
However, if you don't provide a guid, the to_json(minimum=True) should specify:
{
"typeName": "type",
"uniqueAttributes": {
"qualifiedName": "qualified name"
}
}
This could help avoid having to upload the entity as part of the batch.
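A sketch of the proposed behavior; the guid check is this issue's proposal, not current library behavior:

```python
# Sketch: emit the uniqueAttributes header form when no guid was provided.
def to_json_minimum(entity):
    if entity.guid is not None:
        return {
            "guid": entity.guid,
            "typeName": entity.typeName,
            "qualifiedName": entity.attributes["qualifiedName"],
        }
    return {
        "typeName": entity.typeName,
        "uniqueAttributes": {"qualifiedName": entity.attributes["qualifiedName"]},
    }
```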
As far as I understand, Purview offers a very limited API for searches when compared to the original Apache Atlas. One example: there is no v2/search/basic in Purview, but there is in Atlas.
In light of this, did you mean this instead in the README.md?
Search (the only search available for Azure Purview advanced search)
And as a side question, do you know if the original Atlas API is still accessible somehow?
Allow experts and owners to be imported by putting the object IDs into the Excel sheet. This is enough to get the ball rolling. It's the easiest solution, and it gives a path for users who are desperate for a solution. It also separates the basic API and import parsing problem from the more complicated "Graph authentication" problem.
In order to make working with uploads easier, the client.upload_typedefs force_update parameter should be smarter.
Currently, it simply does a POST request if False and a PUT request if True. However, doing a PUT request on a type def that does not exist breaks the entire upload, and doing a POST request for a type def that already exists breaks the upload as well.
A better solution is to look up each type def by name and category (entity, relationship, classification) and see if it exists. If it exists, then use the PUT request.
However, there may be dependencies between types and there may be issues in updating a type that will conflict against the existing entities.
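A rough sketch of the name-lookup approach, using the raw Atlas types endpoints (paths per the Atlas REST docs); the helpers are illustrative and ignore the dependency-ordering concern above:

```python
# Sketch: bucket type defs into create (POST) vs update (PUT) by existence.
import requests

def typedef_exists(endpoint_url, headers, name):
    # GET /v2/types/typedef/name/{name} returns 200 when the def exists.
    resp = requests.get(f"{endpoint_url}/v2/types/typedef/name/{name}", headers=headers)
    return resp.status_code == 200

def upsert_typedefs(endpoint_url, headers, typedefs):
    new, existing = {}, {}
    for category, defs in typedefs.items():
        for d in defs:
            bucket = existing if typedef_exists(endpoint_url, headers, d["name"]) else new
            bucket.setdefault(category, []).append(d)
    results = []
    if new:       # POST creates defs that do not exist yet
        results.append(requests.post(f"{endpoint_url}/v2/types/typedefs", json=new, headers=headers))
    if existing:  # PUT updates defs that already exist
        results.append(requests.put(f"{endpoint_url}/v2/types/typedefs", json=existing, headers=headers))
    return results
```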
Need to test:
Hi again. Is it possible to add a classification and/or glossary term to columns using the excel_bulk_entities_upload method? I see the sample has a classifications column. I have tried populating it with an existing classification, and it runs without error, but nothing shows up in the interface for the column. Other fields like description/data_type update fine. Thanks.
Hi,
I am trying to figure out how to add column-specific lineage. I have run the excel_custom_table_column_lineage sample but am not seeing any lineage in the interface. The demo tables and columns are uploaded, but I do not see a lineage tab. Are there any changes I need to make to the sample code besides entering the authentication information?
The excel_update_lineage_upload sample works fine for me but this only shows table lineage.
Thank you,
Zack