
tap-hubspot's Introduction

tap-hubspot

This is a Singer tap that produces JSON-formatted data following the Singer spec.

This tap pulls raw data from HubSpot's REST API, outputs the schema for each resource, and incrementally pulls data based on the input state.

Configuration

This tap requires a config.json that specifies details regarding OAuth 2.0 authentication, a cutoff date for syncing historical data, an optional request_timeout parameter controlling how long each request waits for a response, and an optional flag that controls collection of anonymous usage metrics. See config.sample.json for an example. You may specify an API key instead of OAuth parameters for development purposes, as detailed below.

To run tap-hubspot with the configuration file, use this command:

› tap-hubspot -c my-config.json

API Key Authentication (for development)

As an alternative to OAuth 2.0 authentication during development, you may specify an API key (HAPIKEY) to authenticate with the HubSpot API. This should be used only for low-volume development work, as the HubSpot API Usage Guidelines specify that integrations should use OAuth for authentication.

To use an API key, include a hapikey configuration variable in your config.json and set it to the value of your HubSpot API key. Any OAuth authentication parameters in your config.json will be ignored if this key is present!
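For illustration, a minimal API-key config.json might look like the following. The exact set of supported keys should be checked against config.sample.json; the values shown here are placeholders:

```json
{
  "hapikey": "your-hubspot-api-key",
  "start_date": "2017-01-01T00:00:00Z",
  "request_timeout": 300,
  "disable_collection": false
}
```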


Copyright © 2017 Stitch


tap-hubspot's Issues

Support Hubspot feedback submissions data.

Support HubSpot Feedback Submissions, part of the CRM component.

In HubSpot, feedback submissions store information submitted to a feedback survey. Surveys in HubSpot include Net Promoter Score (NPS), Customer Satisfaction (CSAT), Customer Effort Score (CES), and custom surveys. Using the feedback submission endpoints, you can retrieve submission data about your feedback surveys.

Information regarding this API is available here: https://developers.hubspot.com/docs/api/crm/feedback-submissions

Merged Contacts not updating correctly. Specifically canonical-vid is not being updated.

Merged 3000+ contacts yesterday via Merge contact API.

The next day I reviewed the stitch data and the contacts.canonical-vid has not been updated.

The contact in hubspot is merged.
The hubspot getcontact api shows they are merged with correct canonical-vid pointing to the other.
The stitch contacts__merged-vids and contacts__merge-audits tables both reflect that the merge occurred.

But none of the contact rows in the stitch contacts table have the new canonical-vid. On all the records that were involved in the merge, canonical-vid = vid.

The expected result is:

Original state:
  Merge parent: vid = 1000, canonical-vid = 1000
  Merge child:  vid = 2000, canonical-vid = 2000

Merged state:
  Merge parent: vid = 1000, canonical-vid = 1000
  Merge child:  vid = 2000, canonical-vid = 1000

AttributeError: 'list' object has no attribute 'upper'

I am piping tap-hubspot into target-bigquery. After setting everything up and running the tap and target in separate virtual environments, the error above is raised. After reviewing, I believe this is because an array of arrays is extracted. Below is an example of the data I extracted into CSV.

Any solution to solve this problem?

[{'deleted-changed-timestamp': '1970-01-01T00:00:00.000000Z', 'saved-at-timestamp': '2020-03-26T06:05:56.130000Z', 'vid': 301, 'identities': [{'type': 'EMAIL', 'value': '[email protected]', 'timestamp': '2020-03-26T06:05:56.104000Z'}, {'type': 'LEAD_GUID', 'value': 'c985e285-2c5a-40c7-a81f-bd4679bcc881', 'timestamp': '2020-03-26T06:05:56.125000Z'}]}]

Refactor for loop

In _sync_contact_vids(), the for loop uses data.items() but drops the key and only uses the value.

Change this to use data.values() instead
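As a sketch of the suggested change (data here is a stand-in dict; the real loop lives in _sync_contact_vids):

```python
# Stand-in data; in the tap this is the API response keyed by vid.
data = {"101": {"vid": 101}, "102": {"vid": 102}}

# Before: data.items() binds a key that is immediately discarded.
records = []
for _key, value in data.items():
    records.append(value)

# After: data.values() iterates only what the loop actually uses.
records = list(data.values())
```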

--properties is not deprecated, --catalog doesn't work

This is pretty minor, but easy to overlook.

When using the --catalog param with a catalog.json file, I get the notice below:

  INFO No properties were selected

When using --properties, it works; however, --properties is marked as deprecated by --help:

  -p PROPERTIES, --properties PROPERTIES
                        Property selections: DEPRECATED, Please use --catalog
                        instead
  --catalog CATALOG     Catalog file

List API: filters schema change

Hello everyone,
First of all, thanks for doing great work supporting this tap.

I'd like to start publicly tracking or at least discuss the changes in how HubSpot is going to present list filters, e.g. contact_lists filters.
There are a couple of articles promising to deliver these changes on July 26, 2021:
[1] List API changes announcement
[2] New recursive filterBranch format description

Currently, the contact_lists.json schema defines a filters property.
But the API call /contacts/v1/lists now returns an empty array as a value of this property.
It returns a new property though: ilsFilterBranch, which is a string containing seemingly valid JSON, as described in article [2].

My questions:

  1. Is there any ongoing activity around restoring v1 API behaviour?
  2. What's your take on how Singer should address this issue?
  3. How to define a recursive schema like this?

Thank you!

ALL deals and companies are synced on every run

Hubspot's modified deals and companies endpoints cannot return more than 10,000 results. As such, the Tap requires full syncs of those endpoints in order to get a complete set of historical data with more than 10K results.

Is there a more efficient way to sync deals and companies?

contact_lists.* to target-bigquery causes OBJECT is not a valid value

target-bigquery throws an OBJECT is not a valid value error when data from tap-hubspot is piped into it.

meltano invoke tap-hubspot | meltano invoke target-bigquery

INFO Appending to t_test-contact_lists_0ca3xxxa20c1
INFO loading t_test-contact_lists_0ca3xx20c1 to BigQuery
ERROR failed to load table t_test-contact_lists_0ca39xx20c1 from file: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/warehxx6/jobs?uploadType=resumable: Invalid value for type: OBJECT is not a valid value
CRITICAL 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/warehxxx06/jobs?uploadType=resumable: Invalid value for type: OBJECT is not a valid value

Data missing in owners and deals API

The list of owners coming through Stitch has missing data. Some owner IDs that are visible in the HubSpot platform, and through using the same API in Postman, don't show up in data synced through Stitch.

The same is true for the deals API. There are one-off deals that don't get extracted but are visible in HubSpot.

Any ideas or has anyone faced anything similar?

deals.json

We are experiencing some issues resulting from the following PR:

remove isDeleted from deals and companies because it is confusing

Removing this field means that we cannot flag deals within the HubSpot recycling bin, so deals in the recycling bin are counted as active deals.

Would it be possible to return it?

Private App support

As HubSpot's documentation says hapikey will be unavailable at the end of November, I'm considering moving from hapikey to a private app access token.

I've tried setting the private app access token in the access_token field in config.json, but tap-hubspot failed because it tried to refresh the access token without OAuth parameters.

Then I tried setting a dummy value for token_expires to prevent it from refreshing the access token (ref), but a type mismatch error occurred because the given value is a str, not an instance of datetime.datetime.

Any good ideas?

Vanilla hubspot sales does not work out of the box with this tap.

The company object in Hubspot has a default property called "hs_analytics_last_touch_converting_campaign".

Out of the box this prevents stitch from loading company data into a postgres instance.

Column name too long properties__hs_analytics_last_touch_converting_campaign__timest (63)
Column name too long properties__hs_analytics_last_touch_converting_campaign__source (63)
Column name too long properties__hs_analytics_first_touch_converting_campaign__sourc (63)
field collision on properties__hs_analytics_first_touch_converting_campaign__source

Calendar tasks not imported

It would be great to support the calendar API too, as it is useful to query a CRM database based on whether a lead has no related task yet but already has a salesperson assigned.

Tickets have a pipeline and pipeline stage, but the related data is not loaded

Tickets have a pipeline and pipeline stage. The current V2 tap will load these as properties__hs_pipeline and properties__hs_pipeline_stage. However, these are just the ids, and the current tap does not store info on the related object. This info is available via the Hubspot pipelines API, using the ticket object type: https://developers.hubspot.com/docs/api/crm/pipelines.

This tap needs to load the pipeline and stage data for tickets into a separate table, similar to what is done for deals, so that the pipeline data can be queried and displayed by label and ordered as per the actual Hubspot stage ordering.

Blog data missing

Hi,

We are missing the Blog data from Hubspot. Why is it not included?

COS Blog API
COS Blog Authors API
COS Blog Comments API
COS Blog Post API
COS Blog Topics API

Is it possible for you to include this?

State not working

If provided the -s (or --state) argument and a valid state file, the tap does not "pick up" from the point last handled by the target, as it should per Step 5 here: https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target. For streams whose replication_method is 'FULL_TABLE', the tap disregards the value written under "bookmarks" in the state file, always starts from the 'start_date' provided in the configuration file, and passes ALL THE RECORDS to the target, even when the value of a record's replication key is NOT more recent than the value written in the state file.

KeyError: 'offset'

Just received an error notification from stitch UI for our beta hubspot integration.
This is all it says: "KeyError: 'offset'"


@iterati Not sure how often the master codeline makes it into the actual stitch jobs... Perhaps something to do with this commit? 512d496

Do I need to blow everything out and reload perhaps?

Save state more frequently while syncing email events

We've had a few reports of tap-hubspot running for six hours without emitting any state messages.

HubSpot's email_event endpoint allows us to specify startTimestamp, endTimestamp, and offset parameters. Currently we save the startTimestamp and endTimestamp values in the state. We advance those two timestamps an hour at a time, so when a job stops, the next one will start no more than an hour back from where the previous one left off. Apparently there are cases where it takes more than six hours to fetch an hour's worth of email events. I'm assuming this is because email events tend to happen in bursts.

We should be able to fix this by saving the offset parameter in the state. I was a little concerned that that wouldn't work because HubSpot's documentation explicitly says that the offset token is not designed to be long-lived. However, I've tested it by hitting the API, getting an offset token, then sleeping a certain amount of time before hitting the API with that token. I double the amount of time it sleeps each time. So far the farthest it's gotten is sleep 16,384 seconds, or about 4.5 hours. That's enough time for us to expect that an offset token is still active between successive jobs.

We should make sure that this change is backwards-compatible, meaning that the next version of tap-hubspot will store the offset token in the state but will also be able to process input state files where the offset is not stored.

We should also make sure that tap-hubspot gracefully handles situations where the offset token is invalid. If we read in a state file that has an offset token, we should first make a request with that token. If the request fails in a way that we can determine is due to an expired offset token, we should repeat the request without the token.

Note that while we've only observed this issue with email_events, it's probably worth storing the offset token in the state for all resource types.
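The resume-and-fallback behavior described above can be sketched as follows. This is an illustration, not the tap's actual code: OffsetExpiredError, fetch_page, and emit are hypothetical stand-ins for however the real client detects a rejected token, fetches a page, and writes records.

```python
class OffsetExpiredError(Exception):
    """Hypothetical: raised when HubSpot reports the offset token is invalid."""

def sync_email_events(state, fetch_page, emit):
    """fetch_page(offset) -> (records, next_offset, has_more)."""
    offset = state.get("email_events_offset")
    retried = False
    while True:
        try:
            records, offset, has_more = fetch_page(offset)
        except OffsetExpiredError:
            if retried:
                raise
            # Repeat the request once without the stale token.
            offset, retried = None, True
            continue
        emit(records)
        # Persist the offset after every page so an interrupted job
        # resumes mid-window instead of re-fetching the whole hour.
        state["email_events_offset"] = offset
        if not has_more:
            return state
```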

Detect contact deletion

For Deals there is an isdeleted flag but for Contacts there doesn't appear to be anything similar. How are we supposed to filter out contacts which have been deleted in Hubspot?

Ensure deals associations are being extracted

Further to a conversation with Stitch, the Associations array on the Deals API is being lost. They requested I raise it as an issue here.

This is the only connection to companies and contacts, making it very important to have for downstream reporting.

Things to do

  • Validate associations are correctly being retrieved in this tap
  • Add unit tests to verify automatically
  • Fix code if required

Contacts don't pass schema validation

$ stream-hubspot sync -c ~/configs/hubspot.json  | persist-stitch sync -c ~/configs/hubspot-gate.json -n
  INFO Refreshing oath token
  INFO Starting sync
  INFO Syncing all contacts
  INFO Grabbed 100 contacts
  INFO Getting details for 100 contacts
  INFO Persisting 100 contacts
  INFO ---- DRY RUN: NOTHING IS BEING PERSISTED TO STITCH ----
  INFO Persisted batch of 0 records to Stitch
Traceback (most recent call last):
  File "/opt/code/stitch-orchestrator/venv/lib/python3.4/site-packages/persist_stitch-0.3.1-py3.4.egg/persist_stitch.py", line 96, in parse_record
  File "/opt/code/stitch-orchestrator/venv/lib/python3.4/site-packages/jsonschema-2.5.1-py3.4.egg/jsonschema/validators.py", line 123, in validate
    raise error
jsonschema.exceptions.ValidationError: '2015-12-15T15:29:57.990999' is not a 'date-time'

Failed validating 'format' in schema['properties']['properties']['properties']['hs_lifecyclestage_opportunity_date']['properties']['value']:
    {'format': 'date-time', 'type': ['null', 'integer']}

On instance['properties']['hs_lifecyclestage_opportunity_date']['value']:
    '2015-12-15T15:29:57.990999'

Switch requests logging to debug.

I'm using tap-hubspot to load a lot of data, and the INFO-level logging for each request is making the process slower.
Can we switch the request logging to debug?

Ready to help and it would be my first issue.

Support for deal history

Currently, the tap only includes the most recent version of deal properties in the versions data. This is not super useful, as it simply matches the data available in the deal entity itself. The Hubspot API supports a parameter to include the full history of a deal's properties, which enables the recreation of a deal pipeline over time.

From HubSpot's API docs:
&includePropertyVersions=true | By default, you will only get data for the most recent version of a property in the "versions" data. If you include this parameter, you will get data for all previous versions.

def sync_deals(STATE, ctx):

Custom Objects support

Hello! I was curious if there were any plans to support Custom Objects in Hubspot? It would be great to have the relevant tables associated with them available in destinations like BigQuery.

Happy to provide more info or context if helpful.

Permit specific property value specification in catalog

Referencing this line of code in has_selected_custom_field:

https://github.com/singer-io/tap-hubspot/blob/master/tap_hubspot/__init__.py#L488

Specifically, the len(x) == 2 conditional.

If all of my requests for property values in the catalog are very specific, like this:

        {
          "breadcrumb": [
            "properties",
            "property_hs_lastmodifieddate",
            "properties",
            "value"
          ],
          "metadata": {
            "selected": true
          }
        },

That conditional results in me getting no properties at all. A very simple fix would be to replace it with len(x) >= 2.
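A simplified stand-in for has_selected_custom_field illustrates the proposed change (the real function walks Singer metadata; the only change here is >= on the breadcrumb length):

```python
def has_selected_custom_field(metadata_entries):
    for entry in metadata_entries:
        breadcrumb = entry.get("breadcrumb", [])
        if (len(breadcrumb) >= 2          # was: len(breadcrumb) == 2
                and breadcrumb[0] == "properties"
                and entry.get("metadata", {}).get("selected")):
            return True
    return False
```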

Include Contact Properties Histories (feature request)

Currently, the tap gets only the current contact properties using the /contacts/v1/contact/vids/batch/ Hubspot API endpoint. By including the query parameter propertyMode=value_and_history it could retrieve the contact property history as well which would be extremely useful for analytics.
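A hypothetical sketch of the request construction (the endpoint path is from the issue; the tap's real sync code differs):

```python
from urllib.parse import urlencode

BASE = "https://api.hubapi.com/contacts/v1/contact/vids/batch/"

def batch_contacts_url(vids, with_history=False):
    params = [("vid", v) for v in vids]
    if with_history:
        # value_and_history returns every prior version of each
        # property, not just the current value.
        params.append(("propertyMode", "value_and_history"))
    return BASE + "?" + urlencode(params)
```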

Unable to remove properties from the deals schema

I am trying to setup a hubspot tap with a postgres target and I keep getting an error about postgres trying to create a table with more than 1600 columns in it.
Even though at the time my schema only had 4 properties in it.

However going through the code it looks like if deals is selected as a stream the schema is automatically applied from the json output of this api call https://api.hubapi.com/properties/v1/deals/properties
See code here: https://github.com/singer-io/tap-hubspot/blob/master/tap_hubspot/__init__.py#L191

Is there a way to modify the computed schema for the deals stream to remove the properties that I don't need? I haven't been able to find a way to do that at the moment.

My deals schema:

{
"streams": [
    {
        "stream": "deals",
        "tap_stream_id": "deals",
        "key_properties": ["dealId"],
        "schema": {
            "type": "object",
            "properties": {
                "portalId": {
                    "type": [
                        "null",
                        "integer"
                    ]
                },
                "dealId": {
                    "type": [
                        "null",
                        "integer"
                    ]
                },
                "dealname": {
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "dealstage": {
                    "type": [
                        "null",
                        "string"
                    ]
                }
            }
        },
        "metadata": [
            {
                "breadcrumb": [ ],
                "metadata": {
                    "selected": true,
                    "table-key-properties": [
                        "dealId"
                    ],
                    "forced-replication-method": "INCREMENTAL",
                    "valid-replication-keys": [
                        "hs_lastmodifieddate"
                    ]
                }
            }
        ]
    }
]}

post stitch import webhook

There is currently no good way to know when a Stitch integration starts or ends. This makes it difficult to schedule additional data processing on the destination.

It would be ideal for stitch to support webhooks that fired when various stages of the integrations occurred.
Example webhooks:
Extraction Start/End
Loading Start/End

With these webhooks, admins of the destination databases are able to do the following:

  • ensure that destination etl's processes are not running during stitch data loads
  • execute post stitch data load sql, code, processes, etc.

An alternative to this approach - or in addition to - would be the ability to have custom sql scripts execute post stitch data loads. The scripts would be maintained on the stitch front-end.

Error when transforming timestamps

I'm getting this error related to the timestamps that hubspot APIs are using

(I tried changing the type in the schema, but it made no difference.)

[screenshot of the error]

sync_contact_lists not checking list 'processing' state prior to sync.

Came across a scenario yesterday where I built a Smart List.
Was excitedly anticipating the data arriving in our destination.

It appears the stitch scheduled sync kicked off and started syncing the newly added smart list - but the smart list was still being 'processed' on the hubspot side. This means the smart list was being loaded with the contacts that met the criteria. This is done via some type of backend process on hubspot - and can take some time to complete.

The end result is that the synced membership on the destination was never fully populated.

The tap should look at the processing attribute and only sync lists where metaData->processing = 'DONE'

https://developers.hubspot.com/docs/methods/lists/get_lists
{
  "lists": [
    {
      "dynamic": true,
      "metaData": {
        "processing": "DONE",
        "size": 161,
        "error": "",
        "lastProcessingStateChangeAt": 1477518645996,
        "lastSizeChangeAt": 1479298304451
      },
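An illustrative filter (not the tap's code) showing the proposed check against the metaData.processing attribute:

```python
def lists_ready_to_sync(lists_response):
    # Only sync lists whose HubSpot-side processing has finished.
    return [lst for lst in lists_response.get("lists", [])
            if lst.get("metaData", {}).get("processing") == "DONE"]
```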

Configuration instructions

This tap requires a few configuration parameters, and generating those requires detailed knowledge of how to bootstrap an app in HubSpot and generate OAuth 2 credentials against that app.

Ideally, the README should include information on the required configs along with a basic setup guide and/or links to the corresponding documentation in HubSpot:
https://developers.hubspot.com/docs/faq/how-do-i-create-an-app-in-hubspot
https://developers.hubspot.com/docs/methods/oauth2/oauth2-overview
https://developers.hubspot.com/docs/methods/oauth2/get-access-and-refresh-tokens

Can't run stream-hubspot outside of source tree

It looks for the schemas relative to the current working directory

/opt/code/stream-freshdesk (master)
$ stream-hubspot sync -c ~/configs/hubspot-ezcater.json
  INFO Refreshing oath token
  INFO Starting sync
  INFO Syncing all contacts
 ERROR Error ocurred during sync. Aborting.
Traceback (most recent call last):
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 645, in main
    persisted_count = do_sync()
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 572, in do_sync
    persisted_count += sync_contacts()
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 238, in sync_contacts
    schema = get_schema('contacts')
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 129, in get_schema
    with open("stream_hubspot/{}.json".format(entity_name)) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'stream_hubspot/contacts.json'
(venv)
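A common fix for this class of bug (a sketch, not the project's actual patch) is to resolve the schema file relative to the module rather than the current working directory:

```python
import os

def get_schema_path(entity_name):
    # __file__ points at the installed module, so this works from any cwd.
    here = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(here, "{}.json".format(entity_name))
```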

Add activityType in Engagements

There is an additional field in Engagements called activityType, which represents "Call or Meeting Type." Can that be included?

Doesn't work with no initial state

Getting an error when trying to run stream-hubspot with no state file:

$ stream-hubspot sync -c ~/configs/hubspot.json  | persist-stitch sync -c ~/configs/hubspot-gate.json -n
  INFO Refreshing oath token
  INFO Starting sync
 ERROR Error ocurred during sync. Aborting.
Traceback (most recent call last):
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 647, in main
    persisted_count = do_sync()
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 574, in do_sync
    persisted_count += sync_contacts()
  File "/opt/code/stream-hubspot/stream_hubspot.py", line 230, in sync_contacts
    days_since_sync = (datetime.datetime.utcnow() - last_sync).days
TypeError: can't subtract offset-naive and offset-aware datetimes
  INFO ---- DRY RUN: NOTHING IS BEING PERSISTED TO STITCH ----
  INFO Persisted batch of 0 records to Stitch

`email_events` schema does not include support for additional properties

HubSpot's Email Events object can include additional properties obsoletedBy and causedBy, though these are not currently included in the schema for email_events.

This page of their documentation has more information on the properties: https://developers.hubspot.com/docs/methods/email/email_events_overview

"obsoletedBy | JSON | The EventId which uniquely identifies the follow-on event which makes this current event obsolete. If not applicable, this property is omitted."

"causedBy | JSON | The EventId which uniquely identifies the event which directly caused this event. If not applicable, this property is omitted."

Hubspot doesn't appear to outline the actual format of the JSON returned for these properties in any of their reference documentation.

Incremental mode duplicates last record (deals)

In case of deals (but I think it is true for other entities as well):
When running the tap with a state file, comparison based on hs_lastmodifieddate is buggy, and the last record of the previous execution is exported again during the next execution.
I think the culprit is in this line :

            if not modified_time or modified_time >= start:
                record = bumble_bee.transform(lift_properties_and_versions(row), schema, mdata)
                singer.write_record("deals", record, catalog.get('stream_alias'), time_extracted=utils.now())

I guess modified_time should be strictly greater than start ( > rather than >= ) so the same record is not written out again.
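A minimal illustration of the bookmark comparison (simplified; the tap compares datetimes, not integers): with >=, a record whose modified time exactly equals the saved bookmark is emitted again on the next run.

```python
def should_emit(modified_time, bookmark, strict=True):
    # Records with no modified time are always emitted, as in the snippet above.
    if modified_time is None:
        return True
    return modified_time > bookmark if strict else modified_time >= bookmark
```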
