EcommerceTools is a Python data science toolkit for ecommerce, marketing science, and technical SEO analysis and modelling, created by Matt Clarke.
License: MIT License
EcommerceTools
EcommerceTools is a data science toolkit for those working in ecommerce, marketing science, and technical SEO, and includes a wide range of features to aid analysis and model building. The package is written in Python, is designed to be used with Pandas, and works within a Jupyter notebook environment or in standalone Python projects.
Installation
You can install EcommerceTools and its dependencies via PyPI by entering pip3 install ecommercetools in your terminal, or !pip3 install ecommercetools within a Jupyter notebook cell.
Transactions
Load sample transaction items data
If you want to get started with the transactions, products, and customers features, you can use the load_sample_data() function to load a set of real-world data. This imports the transaction items from the widely-used Online Retail dataset and reformats it ready for use by EcommerceTools.
The utilities module includes a range of tools that allow you to format data, so it can be used within other EcommerceTools functions. The load_data() function is used to create a Pandas dataframe of formatted transactional item data. When loading your transaction items data, all you need to do is define the column mappings, and the function will reformat the dataframe accordingly.
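As an illustration of the idea (the exact load_data() signature may differ, and the column names below are assumptions), mapping a raw export's columns to the standardised names used downstream is essentially a Pandas rename:

```python
import pandas as pd

# Hypothetical raw export with non-standard column names
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 08:28"],
    "CustomerID": [17850, 17850],
    "Quantity": [6, 2],
    "UnitPrice": [2.55, 1.85],
    "StockCode": ["85123A", "71053"],
})

# Define the column mappings from the source names to standard names
column_mappings = {
    "InvoiceNo": "order_id",
    "InvoiceDate": "order_date",
    "CustomerID": "customer_id",
    "Quantity": "quantity",
    "UnitPrice": "unit_price",
    "StockCode": "sku",
}

transaction_items = raw.rename(columns=column_mappings)
transaction_items["order_date"] = pd.to_datetime(transaction_items["order_date"])
transaction_items["line_price"] = transaction_items["quantity"] * transaction_items["unit_price"]
```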
The get_transactions() function takes the formatted Pandas dataframe of transaction items and returns a Pandas dataframe of aggregated transaction data, which includes features identifying the order number.
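Conceptually, this aggregation groups the line items by order; a minimal sketch (not the package's internals, with assumed column names) looks like this:

```python
import pandas as pd

# Hypothetical formatted transaction items (one row per line item)
items = pd.DataFrame({
    "order_id": ["1001", "1001", "1002"],
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-01-09"]),
    "sku": ["A", "B", "A"],
    "quantity": [2, 1, 3],
    "line_price": [10.0, 5.0, 15.0],
})

# Aggregate to one row per order: distinct SKUs, total items, and revenue
transactions = (
    items.groupby(["order_id", "customer_id", "order_date"], as_index=False)
    .agg(skus=("sku", "nunique"),
         items=("quantity", "sum"),
         revenue=("line_price", "sum"))
)
```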
The RFMH model is an extension of the regular Recency, Frequency, Monetary value (RFM) model that includes an additional parameter, "H", for heterogeneity, which shows the number of unique SKUs purchased by each customer. While typically unassociated with targeting, this value can be very useful for identifying customers who should probably be buying a broader mix of products than they currently are, as well as for spotting those who may have stopped buying certain items.
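A rough sketch of how these four values can be computed with Pandas (illustrative only, not the package's implementation; the column names are assumptions):

```python
import pandas as pd

# Hypothetical transaction items (one row per line item)
items = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_id": ["1001", "1001", "1003", "1002"],
    "order_date": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-02-01", "2021-01-09"]),
    "sku": ["A", "B", "A", "C"],
    "line_price": [10.0, 5.0, 20.0, 15.0],
})

observation_end = pd.Timestamp("2021-03-01")

# R = days since last order, F = number of orders,
# M = total spend, H = number of distinct SKUs purchased
rfmh = items.groupby("customer_id").agg(
    recency=("order_date", lambda d: (observation_end - d.max()).days),
    frequency=("order_id", "nunique"),
    monetary=("line_price", "sum"),
    heterogeneity=("sku", "nunique"),
)
```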
EcommerceTools allows you to predict the AOV, Customer Lifetime Value (CLV) and expected number of orders via the Gamma-Gamma and BG/NBD models from the excellent Lifetimes package. By passing the dataframe of transactions from get_transactions() to the get_customer_predictions() function, EcommerceTools will fit the BG/NBD and Gamma-Gamma models and predict the AOV, order quantity, and CLV for each customer in the defined number of future days after the end of the observation period.
The generate_spintax() function of the advertising module expands a spintax string into its unique combinations:

```python
from ecommercetools import advertising

text = "Fly Reels from {Orvis|Loop|Sage|Airflo|Nautilus} for {trout|salmon|grayling|pike}"
spin = advertising.generate_spintax(text, single=False)
spin
```

```
['Fly Reels from Orvis for trout',
 'Fly Reels from Orvis for salmon',
 'Fly Reels from Orvis for grayling',
 'Fly Reels from Orvis for pike',
 'Fly Reels from Loop for trout',
 'Fly Reels from Loop for salmon',
 'Fly Reels from Loop for grayling',
 'Fly Reels from Loop for pike',
 'Fly Reels from Sage for trout',
 'Fly Reels from Sage for salmon',
 'Fly Reels from Sage for grayling',
 'Fly Reels from Sage for pike',
 'Fly Reels from Airflo for trout',
 'Fly Reels from Airflo for salmon',
 'Fly Reels from Airflo for grayling',
 'Fly Reels from Airflo for pike',
 'Fly Reels from Nautilus for trout',
 'Fly Reels from Nautilus for salmon',
 'Fly Reels from Nautilus for grayling',
 'Fly Reels from Nautilus for pike']
```
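This kind of expansion can be reproduced with the standard library alone; a minimal sketch (not the package's implementation):

```python
import re
from itertools import product

def expand_spintax(text):
    """Expand {a|b|c} groups into every unique combination."""
    # Split the string into fixed text and {option|option} groups
    parts = re.split(r"(\{[^{}]*\})", text)
    choices = [
        part[1:-1].split("|") if part.startswith("{") else [part]
        for part in parts
    ]
    # Take the Cartesian product of all groups and rejoin each combination
    return ["".join(combo) for combo in product(*choices)]

variants = expand_spintax(
    "Fly Reels from {Orvis|Loop|Sage|Airflo|Nautilus} for {trout|salmon|grayling|pike}"
)
```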
The get_sitemaps() function takes the location of a robots.txt file (always stored at the root of a domain), and returns the URLs of any XML sitemaps listed within.
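The parsing step amounts to reading the Sitemap: directives from the robots.txt content; a minimal sketch (network fetch omitted, not the package's implementation):

```python
def parse_sitemaps(robots_txt):
    """Return sitemap URLs declared via Sitemap: directives in robots.txt."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs (https://...) stay intact
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps

# Example robots.txt content (hypothetical)
robots = """User-agent: *
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""
sitemaps = parse_sitemaps(robots)
```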
The get_dataframe() function allows you to download the URLs in an XML sitemap to a Pandas dataframe. If the sitemap contains child sitemaps, each of these will be retrieved. You can save the Pandas dataframe to CSV in the usual way.
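Sitemap XML follows the standard sitemaps.org schema, so the underlying idea can be sketched with the standard library and Pandas (illustrative only; child-sitemap recursion and fetching omitted):

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Example sitemap content (hypothetical URLs)
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2021-01-01</lastmod></url>
  <url><loc>https://example.com/about</loc><lastmod>2021-02-01</lastmod></url>
</urlset>"""

# The sitemaps.org namespace must be given explicitly to find elements
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
rows = [
    {"loc": url.findtext("sm:loc", namespaces=ns),
     "lastmod": url.findtext("sm:lastmod", namespaces=ns)}
    for url in root.findall("sm:url", ns)
]
df = pd.DataFrame(rows)
```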
The get_core_web_vitals() function retrieves the Core Web Vitals metrics for a list of sites from the Google PageSpeed Insights API and returns results in a Pandas dataframe. The function requires a Google PageSpeed Insights API key.
The get_knowledge_graph() function returns the Google Knowledge Graph data for a given search term. This requires the use of a Google Knowledge Graph API key. By default, the function returns output in a Pandas dataframe, but you can pass the output="json" argument if you wish to receive the JSON data back.
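Under the hood, a request to the Knowledge Graph Search API endpoint is built from the search term and key; a sketch of the request construction (the network call itself is omitted, and YOUR_API_KEY is a placeholder):

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key
endpoint = "https://kgsearch.googleapis.com/v1/entities:search"

# Build the query string for the entity search request
params = {"query": "ecommerce", "key": API_KEY, "limit": 10}
url = f"{endpoint}?{urlencode(params)}"
```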
The query_google_search_console() function runs a search query on the Google Search Console API and returns data in a Pandas dataframe. This function requires a JSON client secrets key with access to the Google Search Console API.
The get_indexed_pages() function uses the "site:" prefix to search Google for the number of pages "indexed". This is very approximate and may not be a perfect representation, but it's usually a good guide of site "size" in the absence of other data.
7. Get keyword suggestions from Google Autocomplete
The google_autocomplete() function returns a set of keyword suggestions from Google Autocomplete. The include_expanded=True argument allows you to expand the number of suggestions shown by appending prefixes and suffixes to the search terms.
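Conceptually, expansion means generating extra seed queries around the original term, each of which can then be sent to Autocomplete. A purely illustrative sketch (the real include_expanded behaviour and the prefixes used may differ):

```python
from string import ascii_lowercase

def expand_terms(term):
    """Generate expanded seed queries to broaden autocomplete coverage.

    Illustrative only: the prefixes and suffixes here are assumptions.
    """
    expansions = [term]
    # Append a-z suffixes to surface suggestions for each next letter
    expansions += [f"{term} {letter}" for letter in ascii_lowercase]
    # Prepend common modifier prefixes
    expansions += [f"{prefix} {term}" for prefix in ("best", "how to", "why")]
    return expansions

seeds = expand_terms("fly reels")
```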
The get_serps() function returns a Pandas dataframe containing the Google search engine results for a given search term. Note that this function is not suitable for large-scale scraping and currently includes no features to prevent it from being blocked.
The get_summaries() function of the nlp module takes a Pandas dataframe containing text and returns a machine-generated summary of the content using a Hugging Face Transformers pipeline via PyTorch. To use this feature, first load your Pandas dataframe and import the nlp module from ecommercetools.
Specify the name of the Pandas dataframe, the column containing the text you wish to summarise (e.g. product_description), and a column name in which to store the machine-generated summary. The min_length and max_length arguments control the length of the summary generated, while the do_sample argument controls whether the generated text is sampled, and therefore more varied and unique (do_sample=True), or produced deterministically and kept closer to the source text (do_sample=False).
Since the model used for text summarisation is very large (over 1.2 GB), this function will take some time to download and load the model. Once loaded, summaries are generated within a second or two per piece of text, but it is still advisable to try smaller volumes of data initially.