curationcorp / curation-corpus Goto Github PK

Code for obtaining the Curation Corpus abstractive text summarisation dataset

License: Creative Commons Attribution 4.0 International

Python 100.00%

curation-corpus's Issues

Issues with the open-source dataset

Hi Curation,

I would like to point out that, in its current form, this dataset is almost unusable. The main problem is that users have to scrape the documents from the original websites. Website changes, URL forwarding, paywalls, etc. inevitably cause a lot of web-related errors that show up in the documents and are very difficult to preprocess out. The provided preprocessing script (in its current state) does not do enough to filter out the noise. There are at least two larger problems with this:

Pre-trained models, like BERT, are very sensitive to the integrity of the document, and if there are spurious tokens or ill-formed sentences, the model will not be able to form coherent representations of the document. Seemingly minor things like [gallery ids=\"1318996,1318995,1318988,1318986,1319003\"] being injected into a document can cause the number of wordpieces to explode. This generally wouldn't be a big problem, but BERT (like most other pre-trained encoders) has a maximum sequence length of 512 wordpieces/subwords, so the encoder will not get a chance to see other sentences if a single sentence saturates the input sequence.
Results obtained on this dataset might not be reproducible given that the documents are not released with the original distribution. When I first used the scraper to collect this dataset (around the time when this dataset was published), I was only able to retrieve 39,917 documents, not 40k as is advertised on this repository. If someone were to run the scraper now, the recall may be substantially lower. I understand that there may be licensing issues associated with Curation releasing the documents along with the abstracts, but this is an important point for consideration -- people having different copies of this dataset (along with different content) will not be able to compare results in a scientifically meaningful way.
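
One way to make differing copies at least detectable is to report a fingerprint of the scraped set alongside any results. The sketch below is an assumption for illustration, not something the repository provides: it hashes each document with SHA-256 and combines the per-document digests into one corpus-level digest. The function names and the `{url: text}` layout are hypothetical.

```python
import hashlib


def document_digest(text: str) -> str:
    """Return the SHA-256 hex digest of one document's UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def corpus_fingerprint(documents: dict) -> str:
    """Combine per-document digests into a single digest.

    `documents` maps a URL (or any stable ID) to the scraped text.
    Keys are sorted so that scraping order does not change the result.
    """
    combined = hashlib.sha256()
    for key in sorted(documents):
        line = f"{key}\t{document_digest(documents[key])}\n"
        combined.update(line.encode("utf-8"))
    return combined.hexdigest()


if __name__ == "__main__":
    # Two hypothetical copies: same URLs, but one page came back as a stub.
    copy_a = {"https://example.com/1": "Full article text.",
              "https://example.com/2": "Another article."}
    copy_b = {"https://example.com/2": "Another article.",
              "https://example.com/1": "Paywall stub."}
    print(corpus_fingerprint(copy_a) == corpus_fingerprint(copy_b))  # prints False
```

Two groups whose fingerprints match are working from identical text; a mismatch says the copies differ, and comparing the per-document digests shows which documents are responsible.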

Finally, in the interest of providing constructive criticism, I'll point out some specific issues I saw in the documents that could be addressed by finer-grained quality control. These errors are not cherry-picked -- they come from a random sample of about 100 documents/summaries.

1) HTML artifacts in the middle of a document

[DOCUMENT]
...
[gallery ids=\"1318996,1318995,1318988,1318986,1319003\"]
...

2) Non-document related content appears in the document

[DOCUMENT]
...
For more information, visit www.GoSafr.com. Contacts Elevate Communications, for SafrLucy Muscarella, 617-312-6411cell: [email protected] Humphrey Flynncell: 617-549-1718Marketing and PR [email protected]

3) Tables get squeezed in with the rest of the document -- hard to preprocess out

[DOCUMENT]
...
Crude steel consumption2017*2018'2019f2020f2018'2019'2020' World steel consumption1.7011,7591.7621,7583.50.2-0.2 China7888107947762.8-1.9-2.3 European Union 281721751771791.81.21.0 United States1071111121114.01.0-1.0 India961021081155.36.16.3 Japan75737372-3.10.5-1.8 South Korea59595959-0.1-0.3-0.4 Russia43424242-0.90.30.0 Brazil222323230.71.71.5 Crude steel production20172018'2019f2020f2018'2019'2020' World steel production1.6891,7711.7681.7594.8-0.2-0.5 China8508868618424.2-2.8-2.2 European Union 281681721741752.40.60.9 Japan1051061081091.52.10.8 India1011081151236.56.86.9 United States828690905.44.30.1 Russia717272720.60.20.0 South Korea71717170-0.2-0.3-0.4 Brazil34343434-1.20.30.8 Notes: s estimate f forecast.

4) Document/summary pairs are incorrect

[DOCUMENT]
Published: 9:34am, 18 Jan, 2019Updated: 8:43pm, 18 Jan, 2019

[SUMMARY]
Italian insurance company Generali said it was ready to expand into Asia and Latin America after a restructuring that saw it sell unprofitable operations. A three-year strategy launched in November 2018 included a target of compound earnings per share annual growth of up to 8%. CEO Philippe Donnet said the firm was considering potential acquisitions of a bancassurance provider in Asia, a Central and Eastern European property and casualty insurer and a global health insurer. About three-quarters of Generalis business came from France, Germany and Italy but it already had a presence in 10 Asian markets.

5) Hitting a paywall

[DOCUMENT]
A recent event that received surprisingly little media attention serves as a reminder of a lurking cyber risk that is different in kind and scale than more widely and frequently reported privacy-related data breaches. Want to continue reading?Become a FreePropertyCasualty360 Digital Reader. INCLUDED IN A DIGITAL MEMBERSHIP: All PropertyCasualty360.com news coverage, best practices, and in-depth analysis. Educational webcasts, resources from industry leaders, and informative newsletters. Other award-winning websites including BenefitsPRO.com and ThinkAdvisor.com. Register Now Already have an account? Sign In Now

[SUMMARY]
A recent series of industrial fires in Iran, which have been blamed on hackers, has opened up the question of insuring against physical damage as a result of cyber breaches, according to Alex J Lathrop, a partner at law firmPillsbury Winthrop Shaw Pittman. Rather than relying on cyber insurance, which covers only data breaches, traditional commercial general liability, property and business interruption policies,where exclusions are not clearly indicated, should provide coverage for physical damage sustained during a cyber attack, he said.

6) Spurious client-side access errors

[DOCUMENT]
StackPath Please enable JavaScript This website is using a security service to protect itself from online attacks. The service requires full JavaScript support in order to view the website. Please enable JavaScript on your browser and try again. Reference ID: ad34175b419f6e80a9fe5cd1f5e57ab1

7) Summary in English, document is not

[DOCUMENT]
Regus slandi opnar dag, 24. janar kl. 17.00, formlega njan skrifstofukjarna 3. h Hafnartorgi. ar vera starfrktar 46 skrifstofur, fundarherbergi mismunandi strum og svi fyrir sameiginlega vinnuastu. Allar skrifstofur og starfsstvar eru afhentar viskiptavinum fullbnar me rafmagnsborum, skrifstofustlum og rum nausynlegum bnai, s.s. fjarskiptabnai, nettengingu og fleira segir frttatilkynningu fr flaginu. Skrifstofukjarni samanstendur af einkaskrifstofum, samnttum vinnusvum, setustofum, fundarherbergjum og fjarskrifstofum. tilefni af opnun skrifstofukjarnans hafa au Andrzej Mrozek-Folkierski, yfirmaur vrurunar Regus, og Roz Young, svisstjri Regus Evrpu, komi hinga til lands til a vera vi opnunina. Regus slandi rekur n egar fjra skrifstofukjarna hr landi undir merkjum Regus og Orange Project, rmla 4-6, Sktuvogi, Hfatorgi og Akureyri. Skrifstofukjarninn Hafnartorgi verur fimmti skrifstofukjarni flagsins. Tmas Hilmar Ragnarz, framkvmdastjri og eigandi Regus slandi opnunina vera skref inn framtina me ntmalegri skrifstofuastu. Fyrir utan a a vera vel stasett mibnum bur ntt hsni upp mikinn sveigjanleika, fallegt umhverfi og ga vinnuastu, segir Tmas Hilmar. a er ljst a eigendur fyrirtkja af llum strum og gerum horfa til ess a nta auknari mli sveigjanleika skrifstofurekstri me v a nta jnustu skrifstofukjarna.

[SUMMARY]
Regus has opened its fifth office in Iceland, located on the thirdfloor of Hafnartorg Kvosinn, Reykjavik. The new centre has 46 offices, meeting rooms and shared office space.

8) Lack of sentence separation

[DOCUMENT]
Wells Fargo picks four directors for sales scandal probe -source By Reuters Published: 18:22 EDT, 8 December 2016 | Updated: 18:22 EDT, 8 December 2016 By Dan FreedDec 8 (Reuters) - Wells Fargo & Co Chairman Stephen Sanger and Vice Chair Elizabeth Duke have been named to a four-member committee that will lead an internal investigation into the bank's recent sales scandal, a person familiar with the matter said on Thursday
...

9) Summaries are not well-formed (word separation issues)

[SUMMARY 1]
A diagram by the Bank of International Settlements (BIS) has revealed that China's shadow banking sector is even more indecipherable and complex than itsUS counterpart. In particular BIS pointed to uncertainty about who would befinancially responsible when adebt forequity swap defaulted.New and more complex structured shadow credit intermediation has emerged and quickly reached a large scale, notedBIS, with thiscomplexity packaged and sold through wealth management products. Ifthere were to be afinancial crash, this complexity would make it worse.

[SUMMARY 2]
The FAA has proposed its biggest fine of$1.9magainst SkyPan International, an aerial photography company,for illegally flyingdronesthroughbusy airspace above New York and Chicago. A total of 65 unauthorised flights were recorded over a two year period between 2012 to 2014.SkyPan failed to get a valid Certificate of Waiver for the flights, and 43 of those flights were over a tightly restricted Class B airspace in New York, without permission from air traffic control.

[SUMMARY 3]
The completion of new energy storage schemes with a combined 340.5 MW capacityin China in the first sixmonths of this year will almost equalthe total 389.4 MW capacity of energy storage facilitiesoperational in China at the end of 2017, according to the China Energy Storage Alliance (CNESA).The biggeststorage project built in H1 was actually eight linked lithium-ion battery modules, on one sitein Jiangsu Zhenjiang,addingup to 101 MW/202 MWh grid-connected capacity. The new facility started operations inJuly, CNESA said.

[SUMMARY 4]
Eurex Clearing is working on a clearing model that will enable buy-side members to clear directly, circumventing bank clearing members. Their plans have caused some concern with Europeanregulators over risk from lower-rated counterparties.

[SUMMARY 5]
Knotel is reportedly in talks with Wafra, a New York City-based investment firm owned by the Kuwaiti government pension fund, about an investment that would value the company at around$1.5bn. Wafra is expected to lead the funding round, potentially joined by Singapores sovereign-wealth fund. Knotel has expanded its portfolio of flexible office locations to around 200from 20 in early 2018. The discussion highlights strong interest in the sector from institutional investors.

10) Scraper does not handle errors gracefully

[DOCUMENT]
Exception

[SUMMARY]
Non-US companies have been warned there is an increased risk of enforcement action against them from the US's Securities and Exchange Commission (SEC). The warning came in an article in the Banking Law Journal following a ruling made earlier this year by the Court of Appeals for the Tenth Circuit in SEC v Scoville, which seemingly bolstered the regulator's extraterritorial enforcement authority under the Dodd-Frank Act. \"One risk created is that purely foreign transactions, which arguably influence the price of securities traded on United States exchanges, may be the subject of SEC enforcement actions\", the article noted.
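
Some of the cases above (1, 4, 5, 6 and 10) can be caught with simple heuristics. The following is a minimal sketch under stated assumptions, not the repository's preprocessing script: the function names, the marker strings (copied from the examples above) and the `min_words` threshold are all hypothetical, and the filters are deliberately conservative.

```python
import re

# WordPress-style shortcodes such as [gallery ids="1,2,3"]. The optional
# backslash covers the escaped quotes seen in example 1. A tag with no
# attributes, such as [DOCUMENT], is left alone.
SHORTCODE_RE = re.compile(
    r'\[[a-z_]+(?:\s+[a-z_]+=\\?"[^"\]]*\\?")+\s*\]', re.IGNORECASE
)

# Phrases taken from the paywall and JavaScript-challenge examples above.
BLOCK_MARKERS = (
    "Please enable JavaScript",
    "Want to continue reading?",
    "Already have an account?",
)


def strip_shortcodes(text: str) -> str:
    """Replace each shortcode with a single space."""
    return SHORTCODE_RE.sub(" ", text)


def looks_unusable(text: str, min_words: int = 20) -> bool:
    """Flag documents that are scraper errors, block pages, or stubs."""
    stripped = text.strip()
    if stripped == "Exception":  # the scraper's error placeholder (example 10)
        return True
    if any(marker in stripped for marker in BLOCK_MARKERS):
        return True
    # Very short documents, e.g. only a "Published: ... Updated: ..." line.
    return len(stripped.split()) < min_words
```

Fused tables, missing sentence breaks and language mismatches (cases 3, 7, 8 and 9) are not covered here; they would need something beyond string matching.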

what is the file format of { data_path="../data/private_dataset.file",}

I see that the dataset for fine-tuning is stored at ../data/private_dataset.file, and the code shows that it has at least the columns "text" and "summary". Could you describe the format of this file or provide a small example of it?

cannot compute loss

Hi, can you please update the requirements.txt file with the correct versions required for the BERT summarisation code? I am getting a name error involving the flattened loss in the summary loss function. I suppose this is because of a version mismatch.
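
When reporting a suspected version mismatch like this, it helps to include the versions that are actually installed. The sketch below uses only the standard library's `importlib.metadata`; the helper names are hypothetical, and the package list (`torch`, `transformers`) is taken from the issues on this page rather than from the repository's requirements.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def as_requirement_lines(versions):
    """Render pinned `name==version` lines, skipping absent packages."""
    return [
        f"{name}=={version}"
        for name, version in versions.items()
        if version is not None
    ]


if __name__ == "__main__":
    # Prints one pinned line per installed package, nothing for missing ones.
    for line in as_requirement_lines(installed_versions(["torch", "transformers"])):
        print(line)
```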

Error in _init_weights function in modeling_bertabs.py

Hey there!

I am replicating this code and the paper for a personal project. I am hitting one issue when running:

config = BertAbsConfig(max_pos=args.block_size)
model = BertAbs.from_pretrained(
    'remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization',
    config=config
)

The error I am getting is:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_14334/239094548.py in <cell line: 4>()
      2 
      3 
----> 4 model = BertAbs.from_pretrained(
      5     './model',
      6     config=config

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2377                 offload_index,
   2378                 error_msgs,
-> 2379             ) = cls._load_pretrained_model(
   2380                 model,
   2381                 state_dict,

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/modeling_utils.py in _load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, load_in_8bit)
   2523             )
   2524             for module in uninitialized_modules:
-> 2525                 model._init_weights(module)
   2526 
   2527         # Make sure we are able to load base models as well as derived models (with heads)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/modeling_utils.py in _init_weights(self, module)
   1103         Initialize the weights. This method should be overridden by derived class.
   1104         """
-> 1105         raise NotImplementedError(f"Make sure `_init_weights` is implemented for {self.__class__}")
   1106 
   1107     def tie_weights(self):

NotImplementedError: Make sure `_init_weights` is implemented for <class 'modeling_bertabs.BertAbs'>

I am using the following versions:
pytorch : '1.10.0'
transformers : '4.25.1'

I believe this is a version issue, as this is a very old implementation. Which library versions was this code written for?

Please let me know if there is any fresh implementation of summarization using BERT.
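
The traceback shows the loader calling `model._init_weights(module)` for modules that were not filled from the checkpoint, and the base-class method raising because `BertAbs` does not override it. The pattern can be reproduced without the library. The classes below are stand-ins written for illustration, not the real `transformers` classes, and the `initialized` flag is a hypothetical placeholder; in the real model an override would presumably initialise the `torch.nn` layers instead.

```python
class BaseModelStandIn:
    """Mimics the base-class behaviour visible in the traceback."""

    def _init_weights(self, module):
        raise NotImplementedError(
            f"Make sure `_init_weights` is implemented for {self.__class__}"
        )

    def init_uninitialized(self, modules):
        # Mirrors the loop `for module in uninitialized_modules: ...`
        for module in modules:
            self._init_weights(module)


class WithoutOverride(BaseModelStandIn):
    """Like BertAbs in the report: inherits the raising method."""


class WithOverride(BaseModelStandIn):
    """Defines the hook, so the loader's loop completes."""

    def _init_weights(self, module):
        # Hypothetical placeholder for real weight initialisation.
        module["initialized"] = True
```

Calling `WithoutOverride().init_uninitialized([{}])` raises the same `NotImplementedError` message as in the report, while `WithOverride` completes. Whether adding such a method to `BertAbs` is enough with transformers 4.25.1, or whether an older library version is needed, is not confirmed by anything on this page.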

Dolt version of dataset

Hi Curation,

This is Tim, the CEO of the company that built Dolt and DoltHub. Dolt is git semantics wrapped on top of a SQL database and DoltHub is a place to share those databases. We think this dataset makes a lot of sense on DoltHub.

I took the liberty of importing it (even with the scraped articles):

https://www.dolthub.com/repositories/Liquidata/curation-corpus

We thought Dolt might be an interesting tool for you to check out.

--Tim

device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

How can I change the code to leverage multiple GPUs?
