google / corpuscrawler

Crawler for linguistic corpora

License: Other

Python 100.00%

corpus-linguistics corpus-builder crawling linguistics minority-language

corpuscrawler's Introduction

Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
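
The two politeness rules described above (honouring robots.txt and pausing between requests) can be sketched with Python's standard library. This is only an illustration, not Corpus Crawler's actual code: the class name `PoliteGate`, the `ExampleCrawler` user agent, and the two-second default delay are all assumptions.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_txt, user_agent="ExampleCrawler",
                 min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        # Parse robots.txt rules from text that was already downloaded.
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.min_delay = min_delay  # assumed pause between requests, in seconds
        self.clock = clock
        self.sleep = sleep
        self.last_fetch = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until at least min_delay has passed since the last fetch."""
        if self.last_fetch is not None:
            remaining = self.min_delay - (self.clock() - self.last_fetch)
            if remaining > 0:
                self.sleep(remaining)
        self.last_fetch = self.clock()
```

A caller would check `gate.allowed(url)` and then call `gate.wait_turn()` before each download. The clock and sleep functions are injectable only so that the delay logic can be exercised without actually waiting.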

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

The crawled corpora have been used to compute word frequencies in Unicode’s Unilex project.

Supported Languages

IETF BCP47 Code	Language	Tokens¹
`aai`	Arifama-Miniafia	181K 💾
`aak`	Ankave	194K 💾
`aau`	Abau	313K 💾
`aaz`	Amarasi	308K 💾
`abt`	Ambulas	297K 💾
`aby`	Aneme Wake	233K 💾
`acd`	Gikyode	323K 💾
`ace`	Aceh/Acehnese	817K 💾
`acf`	Saint Lucian Creole French	236K 💾
`ach`	Acoli	178K 💾
`acn`	Achang	232K 💾
`acr`	Achi	239K 💾
`acu`	Achuar-Shiwiar	174K 💾
`ade`	Adele	267K 💾
`adh`	Adhola	166K 💾
`adj`	Adioukrou	233K 💾
`ae`	Avestan	129K 💾
`ae-Latn`	Avestan (Latin)	141K 💾
`aey`	Amele	218K 💾
`agd`	Agarabi	256K 💾
`agg`	Angor	214K 💾
`agm`	Angaataha	238K 💾
`agn`	Agutaynen	234K 💾
`agr`	Aguaruna	149K 💾
`ahk`	Akha	367K 💾
`aia`	Arosi	223K 💾
`akb`	Batak Angkola	220K 💾
`ake`	Akawaio	190K 💾
`akh`	Akha	408K 💾
`akp`	Siwu	191K 💾
`alj`	Alangan	185K 💾
`alp`	Alune	225K 💾
`alt`	Southern Altai	121K 💾
`alz`	Alur	160K 💾
`am`	Amharic	2,170K 💾
`ame`	Yanesha'	221K 💾
`amf`	Hamer-Banna	152K 💾
`amk`	Ambai	229K 💾
`amm`	Ama (Papua New Guinea)	246K 💾
`amn`	Amanab	207K 💾
`amp`	Alamblak	241K 💾
`amr`	Amarakaeri	151K 💾
`amu`	Guerrero Amuzgo	202K 💾
`ann`	Obolo	236K 💾
`anv`	Denya	214K 💾
`aoj`	Mufian	217K 💾
`aom`	Ömie	231K 💾
`aon`	Bumbita Arapesh	294K 💾
`aoz`	Uab Meto	197K 💾
`ape`	Bukiyip	294K 💾
`apr`	Arop-Lokep	373K 💾
`apz`	Safeyoka	235K 💾
`ar`	Arabic	19,593K 💾
`arl`	Arabela	206K 💾
`asg`	Cishingini	270K 💾
`aso`	Dano	290K 💾
`ata`	Pele-Ata	248K 💾
`atb`	Zaiwa	291K 💾
`atg`	Ivbie North-Okpela-Arhe	229K 💾
`atq`	Aralle-Tabulahan	202K 💾
`auy`	Awiyaana	164K 💾
`av`	Avaric	111K 💾
`avn`	Avatime	229K 💾
`avt`	Au	263K 💾
`avu`	Avokaya	391K 💾
`awa`	Awadhi	211K 💾
`awb`	Awa (Papua New Guinea)	179K 💾
`ay`	Aymara	482K 💾
`ayo`	Ayoreo	264K 💾
`az`	Azerbaijani	3,413K 💾
`azg`	San Pedro Amuzgos Amuzgo	271K 💾
`azz`	Highland Puebla Nahuatl	265K 💾
`ba`	Bashkir	666K 💾
`ban`	Balinese	211K 💾
`bao`	Waimaha	232K 💾
`bav`	Vengo	250K 💾
`bba`	Baatonum	792K 💾
`bbb`	Barai	289K 💾
`bbo`	Northern Bobo Madaré	211K 💾
`bbr`	Girawa	245K 💾
`bch`	Bariai	248K 💾
`bcw`	Bana	304K 💾
`bdd`	Bunama	171K 💾
`be`	Belarusian	1,441K 💾
`be-tarask`	Belarusian (Taraškievica)	108,431K 💾
`bef`	Benabena	239K 💾
`bep`	Besoa	204K 💾
`bex`	Jur Modo	254K 💾
`bfd`	Bafut	276K 💾
`bfo`	Malba Birifor	260K 💾
`bg`	Bulgarian	10,597K 💾
`bgr`	Bawm Chin	213K 💾
`bgz`	Banggai	186K 💾
`bhl`	Bimin	324K 💾
`bhw`	Biak	164K 💾
`bi`	Bislama	315K 💾
`bib`	Bissa	243K 💾
`big`	Biangai	229K 💾
`bik`	Central Bikol	183K 💾
`bim`	Bimoba	215K 💾
`biv`	Southern Birifor	221K 💾
`bjr`	Binumarien	226K 💾
`bjv`	Bedjond	268K 💾
`bkl`	Berik	306K 💾
`bku`	Buhid	204K 💾
`bkv`	Bekwarra	244K 💾
`blh`	Kuwaa	259K 💾
`blt-Latn`	Tai Dam (Latin)	262K 💾
`blz`	Balantak	199K 💾
`bm`	Bambara	30K 💾
`bmh`	Kein	253K 💾
`bmq`	Bomu	207K 💾
`bmr`	Muinane	122K 💾
`bmu`	Somba-Siawari	234K 💾
`bmv`	Bum	258K 💾
`bn`	Bangla	7,258K 💾
`bnj`	Eastern Tawbuid	239K 💾
`bnp`	Bola	263K 💾
`bo`	Tibetan	5,642K 💾
`boa`	Bora	133K 💾
`boj`	Anjam	255K 💾
`bon`	Bine	244K 💾
`bov`	Tuwuli	203K 💾
`box`	Buamu	274K 💾
`bpr`	Koronadal Blaan	204K 💾
`bps`	Sarangani Blaan	214K 💾
`bqc`	Boko	567K 💾
`bqj`	Bandial	175K 💾
`bqp`	Busa	162K 💾
`bru`	Eastern Bru	261K 💾
`bs`	Bosnian	8,993K 💾
`bsn`	Barasana-Eduria	225K 💾
`bss`	Akoose	199K 💾
`btd`	Batak Dairi	192K 💾
`bts`	Batak Simalungun	175K 💾
`btt`	Bete-Bendi	266K 💾
`btx`	Batak Karo	189K 💾
`bua`	Buriat	143K 💾
`bud`	Ntcham	207K 💾
`buk`	Bugawac	264K 💾
`bus`	Bokobaru	159K 💾
`bvc`	Baelelea	308K 💾
`bvz`	Bauzi	509K 💾
`bwq`	Southern Bobo Madaré	214K 💾
`bwu`	Buli	285K 💾
`byr`	Baruya	182K 💾
`byx`	Qaqet	387K 💾
`bzh`	Mapos Buang	251K 💾
`bzi`	Bisu	381K 💾
`bzj`	Belize Kriol English	240K 💾
`ca-valencia`	Valencian	24,295K 💾
`caa`	Chortí	307K 💾
`cab`	Garifuna	154K 💾
`cac`	Chuj	244K 💾
`cak`	Kaqchikel	259K 💾
`cap`	Chipaya	154K 💾
`car`	Galibi Carib	160K 💾
`cax`	Chiquitano	149K 💾
`cbc`	Carapana	256K 💾
`cbi`	Chachi	187K 💾
`cbl`	Bualkhaw Chin	210K 💾
`cbr`	Cashibo-Cacataibo	236K 💾
`cbs`	Cashinahua	198K 💾
`cbt`	Chayahuita	150K 💾
`cbv`	Cacua	265K 💾
`cce`	Chopi	204K 💾
`ccp`	Chakma	79K 💾
`cdf`	Chiru	193K 💾
`ce`	Chechen	669K 💾
`ceb`	Cebuano	1,067K 💾
`ceg`	Chamacoco	232K 💾
`cfm`	Falam Chin	438K 💾
`cgc`	Kagayanen	299K 💾
`chj`	Ojitlán Chinantec	305K 💾
`chm`	Mari	132K 💾
`chr`	Cherokee	119K 💾
`chz`	Ozumacín Chinantec	205K 💾
`cjo`	Ashéninka Pajonal	141K 💾
`cjp`	Cabécar	199K 💾
`cjv`	Chuave	286K 💾
`cko`	Anufo	272K 💾
`cle`	Lealao Chinantec	313K 💾
`cme`	Cerma	230K 💾
`cmr`	Mro-Khimi Chin	275K 💾
`cnh`	Hakha Chin	934K 💾
`cni`	Asháninka	122K 💾
`cnk`	Khumi Chin	237K 💾
`cnl`	Lalana Chinantec	308K 💾
`cnt`	Tepetotutla Chinantec	261K 💾
`coe`	Koreguaje	181K 💾
`cof`	Colorado	183K 💾
`cok`	Santa Teresa Cora	230K 💾
`con`	Cofán	151K 💾
`cot`	Caquinte	128K 💾
`crh`	Crimean Tatar	505K 💾
`cs`	Czech	3,141K 💾
`csk`	Jola-Kasa	177K 💾
`cso`	Sochiapam Chinantec	328K 💾
`ctd-Latn`	Tedim Chin (Latin)	852K 💾
`ctu`	Chol	203K 💾
`cub`	Cubeo	220K 💾
`cuc`	Usila Chinantec	278K 💾
`cui`	Cuiba	292K 💾
`cuk`	San Blas Kuna	187K 💾
`cul`	Culina	221K 💾
`cv`	Chuvash	111K 💾
`cwe`	Kwere	144K 💾
`cwt`	Kuwaataay	168K 💾
`cy`	Welsh	11,519K 💾
`cya`	Nopala Chatino	245K 💾
`czt`	Zotung Chin	227K 💾
`da`	Danish	655K 💾
`daa`	Dangaléat	208K 💾
`dad`	Marik	197K 💾
`dah`	Gwahatike	274K 💾
`ddn`	Dendi	210K 💾
`de`	German	46,431K 💾
`ded`	Dedua	146K 💾
`des`	Desano	210K 💾
`dga`	Southern Dagaare	458K 💾
`dgi`	Northern Dagara	257K 💾
`dgz`	Daga	219K 💾
`din`	Southwestern Dinka	196K 💾
`dip`	Northeastern Dinka	193K 💾
`djk`	Eastern Maroon Creole	307K 💾
`dln`	Darlong	776K 💾
`dnw`	Western Dani	254K 💾
`dob`	Dobu	179K 💾
`dop`	Lukpa	226K 💾
`dsh`	Daasanach	211K 💾
`dtb`	Labuk-Kinabatangan Kadazan	248K 💾
`dtp`	Kadazan Dusun	1,038K 💾
`dts`	Toro So Dogon	202K 💾
`due`	Umiray Dumaget Agta	247K 💾
`dug`	Duruma	172K 💾
`duo`	Dupaninan Agta	266K 💾
`dwr`	Dawro	254K 💾
`dww`	Dawawa	208K 💾
`dyi`	Djimini Senoufo	268K 💾
`dyo`	Jola-Fonyi	158K 💾
`dyu`	Dyula	1,156K 💾
`dz`	Dzongkha	61K 💾
`ee`	Ewe	421K 💾
`eka`	Ekajuk	213K 💾
`el`	Greek	5,470K 💾
`emi`	Mussau-Emira	176K 💾
`emp`	Northern Emberá	158K 💾
`enb`	Markweeta	147K 💾
`enq`	Enga	217K 💾
`enx`	Enxet	772K 💾
`eri`	Ogea	269K 💾
`es`	Spanish	32,670K 💾
`ese`	Ese Ejja	226K 💾
`et`	Estonian	3,658K 💾
`eu`	Basque	130K 💾
`ewo`	Ewondo	158K 💾
`eza`	Ezaa	963K 💾
`fa`	Persian	9,114K 💾
`fa-AF`	Dari	7,363K 💾
`faa`	Fasu	238K 💾
`fai`	Faiwol	256K 💾
`fal`	South Fali	198K 💾
`far`	Fataleka	286K 💾
`fi`	Finnish	4,837K 💾
`fil`	Tagalog	184K 💾
`fip`	Fipa	134K 💾
`fit`	Tornedalen Finnish	292K 💾
`fj`	Fijian	257K 💾
`fo`	Faroese	851K 💾
`fon`	Fon	266K 💾
`for`	Fore	169K 💾
`fr`	French	5,488K 💾
`fue`	Borgu Fulfulde	148K 💾
`fuf`	Pular	174K 💾
`fuq`	Central-Eastern Niger Fulfulde	156K 💾
`fuv`	Nigerian Fulfulde	13K 💾
`ga`	Irish	7,587K 💾
`gag`	Gagauz	245K 💾
`gah`	Alekano	210K 💾
`gam`	Kandawo	250K 💾
`gaw`	Nobonob	246K 💾
`gbi`	Galela	288K 💾
`gd`	Scottish Gaelic	17,105K 💾
`gde`	Gude	217K 💾
`gdn`	Umanakaina	306K 💾
`gdr`	Wipi	271K 💾
`gej`	Gen	236K 💾
`gfk`	Patpatar	294K 💾
`ghs`	Guhu-Samane	186K 💾
`gil`	Gilbertese	228K 💾
`gkn`	Gokana	267K 💾
`gmv-Latn`	Gamo (Latin)	127K 💾
`gn`	Guarani	142K 💾
`gnd`	Zulgo-Gemzek	364K 💾
`gng`	Ngangam	219K 💾
`gnw`	Western Bolivian Guaraní	263K 💾
`gof`	Gofa	124K 💾
`gog`	Gogo	173K 💾
`gor`	Gorontalo	211K 💾
`gqr`	Gor	218K 💾
`grb`	Northern Grebo	270K 💾
`grt`	Garo	141K 💾
`gso`	Southwest Gbaya	228K 💾
`gsw-u-sd-chag`	Swiss German (Aargau)	99K 💾
`gsw-u-sd-chbe`	Swiss German (Bern)	73K 💾
`gsw-u-sd-chfr`	Swiss German (Fribourg)	42K 💾
`gu`	Gujarati	702K 💾
`gub`	Guajajára	997K 💾
`guc`	Wayuu	211K 💾
`gud`	Yocoboué Dida	216K 💾
`guh`	Guahibo	204K 💾
`gui`	Eastern Bolivian Guaraní	197K 💾
`gum`	Guambiano	186K 💾
`gun`	Mbyá Guaraní	176K 💾
`guo`	Guayabero	203K 💾
`guq`	Aché	184K 💾
`gur`	Farefare	240K 💾
`gux`	Gourmanchéma	215K 💾
`gv`	Manx Gaelic	152K 💾
`gvc`	Guanano	241K 💾
`gvf`	Golin	276K 💾
`gvl`	Gulay	270K 💾
`gwr`	Gwere	157K 💾
`gym`	Ngäbere	294K 💾
`gyr`	Guarayu	176K 💾
`ha`	Hausa	1,775K 💾
`hae`	Eastern Oromo	163K 💾
`hag`	Hanga	202K 💾
`haw`	Hawaiian	2,221K 💾
`hay`	Haya	112K 💾
`heh`	Hehe	136K 💾
`hi`	Hindi	10,004K 💾
`hif`	Fiji Hindi	204K 💾
`hig`	Kamwe	261K 💾
`hil`	Hiligaynon	208K 💾
`hla`	Halia	273K 💾
`hne`	Chhattisgarhi	207K 💾
`hnn`	Hanunoo	212K 💾
`hns`	Caribbean Hindustani	312K 💾
`ho`	Hiri Motu	240K 💾
`hot`	Hote	222K 💾
`hr`	Croatian	8,188K 💾
`ht`	Haitian	1,101K 💾
`hto`	Minica Huitoto	182K 💾
`hu`	Hungarian	600K 💾
`hub`	Huambisa	160K 💾
`hui`	Huli	232K 💾
`hus`	Huastec	236K 💾
`huu`	Murui Huitoto	165K 💾
`huv`	San Mateo Del Mar Huave	197K 💾
`hvn`	Sabu	312K 💾
`hy`	Armenian	25,972K 💾
`ian`	Iatmul	224K 💾
`iba`	Iban	179K 💾
`icr`	Islander Creole English	248K 💾
`id`	Indonesian	6,634K 💾
`ifa`	Amganad Ifugao	810K 💾
`ifb`	Batad Ifugao	835K 💾
`ife`	Ifè	300K 💾
`ifk`	Tuwali Ifugao	214K 💾
`ifu`	Mayoyao Ifugao	258K 💾
`ify`	Keley-I Kallahan	863K 💾
`ig`	Igbo	13K 💾
`ign`	Ignaciano	161K 💾
`ik`	Inupiaq	96K 💾
`ilo`	Iloko	169K 💾
`imo`	Imbongu	280K 💾
`inb`	Inga	151K 💾
`ino`	Inoke-Yate	236K 💾
`iou`	Tuma-Irumu	225K 💾
`ipi`	Ipili	312K 💾
`iri`	Irigwe	243K 💾
`irk`	Iraqw	184K 💾
`iry`	Iraya	205K 💾
`it`	Italian	13,569K 💾
`itv`	Itawit	242K 💾
`iu`	Inuktitut	98K 💾
`iws`	Sepik Iwam	307K 💾
`izr`	Izere	216K 💾
`izz`	Izii	908K 💾
`ja`	Japanese	2,116K 💾
`jac`	Popti'	221K 💾
`jae`	Yabem	186K 💾
`jam`	Jamaican Creole English	254K 💾
`jbu`	Jukun Takum	264K 💾
`jic`	Tol	285K 💾
`jiv`	Shuar	134K 💾
`jmc`	Machame	150K 💾
`jun`	Juang	178K 💾
`jv`	Javanese	177K 💾
`jvn`	Caribbean Javanese	211K 💾
`ka`	Georgian	4,978K 💾
`kaa`	Kara-Kalpak	135K 💾
`kab-Arab`	Kabyle (Arabic)	715K 💾
`kab-Tfng`	Kabyle (Tifinagh)	1,338K 💾
`kab`	Kabyle	66K 💾
`kac`	Kachin	1,057K 💾
`kao`	Xaasongaxango	205K 💾
`kaq`	Capanahua	164K 💾
`kbh`	Camsá	193K 💾
`kbm`	Iwal	298K 💾
`kbp`	Kabiyè	571K 💾
`kbq`	Kamano	156K 💾
`kbr`	Kafa	147K 💾
`kcg`	Tyap	279K 💾
`kdc`	Kutu	140K 💾
`kdi`	Kumam	195K 💾
`kdj`	Karamojong	163K 💾
`kdn`	Kunda	144K 💾
`kek`	Kekchí	406K 💾
`ken`	Kenyang	200K 💾
`keo`	Kakwa	215K 💾
`ker`	Kera	267K 💾
`kew`	West Kewa	247K 💾
`kez`	Kukele	173K 💾
`kgf`	Kube	175K 💾
`kgr`	Abun	356K 💾
`khz`	Keapara	196K 💾
`kia`	Kim	525K 💾
`kij`	Kilivila	155K 💾
`kj`	Kuanyama	1,474K 💾
`kjb`	Q'anjob'al	263K 💾
`kje`	Kisar	235K 💾
`kjh`	Khakas	128K 💾
`kjs`	East Kewa	251K 💾
`kk`	Kazakh	642K 💾
`kki`	Kagulu	125K 💾
`kkj`	Kako	263K 💾
`kln`	Kalenjin	149K 💾
`km`	Khmer	29,110K 💾
`kma`	Konni	230K 💾
`kmg`	Kâte	127K 💾
`kmo`	Kwoma	213K 💾
`kms`	Kamasau	293K 💾
`kmu`	Kanite	214K 💾
`kn`	Kannada	126K 💾
`kne`	Kankanaey	230K 💾
`knf`	Mankanya	164K 💾
`knj`	Western Kanjobal	1,350K 💾
`knk`	Kuranko	228K 💾
`kno`	Kono	360K 💾
`knv`	Tabo	243K 💾
`kog`	Cogui	189K 💾
`kpf`	Komba	174K 💾
`kpg`	Kapingamarangi	967K 💾
`kpr`	Korafe-Yegha	262K 💾
`kpw`	Kobon	288K 💾
`kpx`	Mountain Koiali	190K 💾
`kpz`	Kupsabiny	166K 💾
`kqc`	Doromu-Koki	209K 💾
`kqe`	Kalagan	241K 💾
`kqp`	Kimré	254K 💾
`kqw`	Kandas	201K 💾
`kqy`	Koorete	156K 💾
`krc`	Karachay-Balkar	132K 💾
`kri`	Krio	256K 💾
`krj`	Kinaray-A	228K 💾
`kru`	Kurukh	182K 💾
`ksd`	Kuanua	228K 💾
`ksr`	Borong	233K 💾
`ktb`	Kambaata	113K 💾
`ktj`	Plapo Krumen	356K 💾
`kto`	Kuot	286K 💾
`ku`	Kurdish	2,479K 💾
`kub`	Kutep	281K 💾
`kud`	‘Auhelawa	167K 💾
`kue`	Kuman (Papua New Guinea)	230K 💾
`kum`	Kumyk	142K 💾
`kup`	Kunimaipa	279K 💾
`kus`	Kusaal	200K 💾
`kv`	Komi	122K 💾
`kvn`	Border Kuna	212K 💾
`kwf`	Kwara'ae	296K 💾
`kwi`	Awa-Cuaiquer	165K 💾
`kwj`	Kwanga	290K 💾
`kxc`	Konso	148K 💾
`kxm`	Northern Khmer	257K 💾
`ky`	Kyrgyz	18,597K 💾
`kyc`	Kyaka	220K 💾
`kyf`	Kouya	215K 💾
`kyg`	Keyagana	190K 💾
`kyq`	Kenga	250K 💾
`kyu`	Western Kayah	466K 💾
`kyz`	Kayabí	324K 💾
`kze`	Kosena	164K 💾
`kzf`	Da'a Kaili	213K 💾
`kzj`	Coastal Kadazan	215K 💾
`la`	Latin	48K 💾
`laj`	Lango	175K 💾
`las`	Lama	235K 💾
`law`	Lauje	262K 💾
`lb`	Luxembourgish	5,173K 💾
`lcm`	Tungag	239K 💾
`lee`	Lyélé	257K 💾
`lef`	Lelemi	211K 💾
`lem`	Nomaande	249K 💾
`leu`	Kara (Papua New Guinea)	255K 💾
`lew`	Ledo Kaili	198K 💾
`lex`	Luang	271K 💾
`lgg`	Lugbara	188K 💾
`lhu`	Lahu	352K 💾
`lia`	West-Central Limba	247K 💾
`lid`	Nyindrou	308K 💾
`lif`	Limbu	138K 💾
`lip`	Sekpele	214K 💾
`lis`	Lisu	304K 💾
`ljp`	Lampung Api	188K 💾
`lln`	Lele	291K 💾
`lme`	Pévé	245K 💾
`lmk`	Lamkang	217K 💾
`lnd`	Lundayeh	670K 💾
`lo`	Lao	4,384K 💾
`lob`	Lobi	192K 💾
`loe`	Saluan	220K 💾
`lok`	Loko	264K 💾
`lon`	Malawi Lomwe	137K 💾
`lsi`	Lashi	1,077K 💾
`lsm`	Saamia	156K 💾
`lt`	Lithuanian	39,575K 💾
`luc`	Aringa	242K 💾
`lus`	Lushai	204K 💾
`lv`	Latvian	1,020K 💾
`lwo`	Luwo	255K 💾
`maa`	San Jerónimo Tecóatl Mazatec	487K 💾
`mad`	Madurese	706K 💾
`mag`	Magahi	193K 💾
`mai`	Maithili	211K 💾
`maj`	Jalapa De Díaz Mazatec	188K 💾
`mak`	Makasar	179K 💾
`mam`	Mam	834K 💾
`maw`	Mampruli	251K 💾
`maz`	Central Mazahua	286K 💾
`mbb`	Western Bukidnon Manobo	278K 💾
`mbc`	Macushi	221K 💾
`mbh`	Mangseng	321K 💾
`mbt`	Matigsalug Manobo	226K 💾
`mca`	Maca	208K 💾
`mcb`	Machiguenga	132K 💾
`mcd`	Sharanahua	200K 💾
`mco`	Coatlán Mixe	217K 💾
`mcp`	Makaa	237K 💾
`mcq`	Ese	158K 💾
`mcu`	Cameroon Mambila	260K 💾
`mda`	Mada	312K 💾
`mdy`	Male	589K 💾
`med`	Melpa	283K 💾
`mee`	Mengen	301K 💾
`mej`	Meyah	323K 💾
`mek`	Mekeo	234K 💾
`men`	Mende	210K 💾
`meq`	Merey	291K 💾
`meu`	Motu	175K 💾
`mfe`	Morisyen	172K 💾
`mfh`	Matal	238K 💾
`mfi`	Wandala	265K 💾
`mfk`	North Mofu	248K 💾
`mfq`	Moba	232K 💾
`mfy`	Mayo	167K 💾
`mfz`	Mabaan	237K 💾
`mg`	Malagasy	1,623K 💾
`mgd`	Moru	192K 💾
`mgh`	Makhuwa-Meetto	150K 💾
`mgo`	Meta'	251K 💾
`mh`	Marshallese	750K 💾
`mhi`	Ma'di	192K 💾
`mhl`	Mauwake	235K 💾
`mhx`	Maru	291K 💾
`mhy`	Ma'anyan	190K 💾
`mi`	Maori	1,504K 💾
`mib`	Atatláhuca Mixtec	263K 💾
`mif`	Mofu-Gudur	283K 💾
`mil`	Peñoles Mixtec	365K 💾
`min`	Minangkabau	242K 💾
`mio`	Pinotepa Nacional Mixtec	288K 💾
`miq`	Mískito	214K 💾
`mit`	Southern Puebla Mixtec	273K 💾
`mk`	Macedonian	10,422K 💾
`mkl`	Mokole	230K 💾
`ml`	Malayalam	118K 💾
`mlh`	Mape	235K 💾
`mlp`	Bargam	297K 💾
`mmo`	Mangga Buang	269K 💾
`mmx`	Madak	271K 💾
`mna`	Mbula	257K 💾
`mnb`	Muna	151K 💾
`mnf`	Mundani	241K 💾
`mnw`	Mon	1,836K 💾
`moa`	Mwan	308K 💾
`mog`	Mongondow	220K 💾
`mop`	Mopán Maya	296K 💾
`mor`	Moro	152K 💾
`mox`	Molima	222K 💾
`mpg`	Marba	210K 💾
`mpm`	Yosondúa Mixtec	336K 💾
`mps`	Dadibi	1,270K 💾
`mpt`	Mian	256K 💾
`mpx`	Misima-Panaeati	227K 💾
`mqb`	Mbuko	302K 💾
`mqj`	Mamasa	164K 💾
`mqn`	Moronene	164K 💾
`mr`	Marathi	16,594K 💾
`mrw`	Maranao	912K 💾
`ms`	Malay	659K 💾
`msm`	Agusan Manobo	225K 💾
`msy`	Aruamu	229K 💾
`mt`	Maltese	3,331K 💾
`mta`	Cotabato Manobo	262K 💾
`mti`	Maiwa (Papua New Guinea)	166K 💾
`mtj`	Moskona	321K 💾
`mto`	Totontepec Mixe	233K 💾
`mtp`	Wichí Lhamtés Nocten	183K 💾
`muh`	Mündü	392K 💾
`mur`	Murle	210K 💾
`mux`	Bo-Ung	363K 💾
`muy`	Muyang	265K 💾
`mva`	Manam	231K 💾
`mvp`	Duri	174K 💾
`mwv`	Mentawai	141K 💾
`mxb`	Tezoatlán Mixtec	281K 💾
`mxt`	Jamiltepec Mixtec	267K 💾
`my`	Burmese	1,007K 💾
`my-t-d0-zawgyi`	Burmese (Zawgyi encoding)	593K 💾
`myb`	Mbay	192K 💾
`myk`	Mamara Senoufo	272K 💾
`myv`	Erzya	143K 💾
`myw`	Muyuw	150K 💾
`myx`	Masaaba	164K 💾
`myy`	Macuna	245K 💾
`mza`	Santa María Zacatepec Mixtec	316K 💾
`mzi`	Ixcatlán Mazatec	190K 💾
`mzk`	Nigeria Mambila	283K 💾
`mzm`	Mumuye	265K 💾
`naf`	Nabak	220K 💾
`nak`	Nakanai	333K 💾
`nan-Latn`	Min Nan Chinese (Latin)	231K 💾
`nas`	Naasioi	168K 💾
`nca`	Iyo	203K 💾
`nch`	Central Huasteca Nahuatl	195K 💾
`ncj`	Northern Puebla Nahuatl	164K 💾
`ncu`	Chumburung	312K 💾
`ndj`	Ndamba	141K 💾
`ndy`	Lutos	216K 💾
`ndz`	Ndogo	350K 💾
`neb`	Toura	326K 💾
`new`	Newari	150K 💾
`nfr`	Nafaanra	233K 💾
`ngp`	Ngulu	149K 💾
`nho`	Takuu	309K 💾
`nhu`	Noone	270K 💾
`nhw`	Western Huasteca Nahuatl	194K 💾
`nhy`	Northern Oaxaca Nahuatl	185K 💾
`nia`	Nias	182K 💾
`nii`	Nii	316K 💾
`nij`	Ngaju	194K 💾
`nim`	Nilamba	117K 💾
`nin`	Ninzo	267K 💾
`nkf`	Inpui Naga	197K 💾
`nko`	Nkonya	168K 💾
`nl`	Dutch	58,357K 💾
`nlc`	Nalca	241K 💾
`nmz`	Nawdm	209K 💾
`nnb`	Nande	127K 💾
`nnq`	Ngindo	137K 💾
`nnw`	Southern Nuni	291K 💾
`noa`	Woun Meu	275K 💾
`nog`	Nogai	104K 💾
`nop`	Numanggang	183K 💾
`not`	Nomatsiguenga	141K 💾
`nou`	Ewage-Notu	266K 💾
`npl`	Southeastern Puebla Nahuatl	148K 💾
`npy`	Napu	192K 💾
`nsn`	Nehan	248K 💾
`nsu`	Sierra Negra Nahuatl	170K 💾
`ntm`	Nateni	229K 💾
`ntp`	Northern Tepehuan	173K 💾
`ntr`	Delo	272K 💾
`nuj`	Nyole	151K 💾
`nus`	Nuer	195K 💾
`nvm`	Namiae	290K 💾
`nwb`	Nyabwa	316K 💾
`nwi`	Southwest Tanna	230K 💾
`ny`	Nyanja	356K 💾
`nyf`	Giryama	169K 💾
`nyn`	Nyankole	120K 💾
`nyo`	Nyoro	120K 💾
`nyy`	Nyakyusa-Ngonde	138K 💾
`nzi`	Nzima	201K 💾
`obo`	Obo Manobo	266K 💾
`oc`	Occitan	2,706K 💾
`oku`	Oku	239K 💾
`okv`	Orokaiva	212K 💾
`old`	Mochi	151K 💾
`ong`	Olo	284K 💾
`opm`	Oksapmin	332K 💾
`or`	Oriya	175K 💾
`os`	Ossetic	135K 💾
`osa`	Osage	3K 💾
`otd`	Ot Danum	187K 💾
`ote`	Mezquital Otomi	251K 💾
`ozm`	Koonzime	267K 💾
`pa`	Punjabi	59,990K 💾
`pab`	Parecís	156K 💾
`pad`	Paumarí	242K 💾
`pag`	Pangasinan	177K 💾
`pah`	Tenharim	268K 💾
`pam`	Pampanga	196K 💾
`pau`	Palauan	255K 💾
`pbc`	Patamona	181K 💾
`pbi`	Parkwa	272K 💾
`pck`	Paite Chin	770K 💾
`pcm`	Nigerian Pidgin	315K 💾
`pez`	Eastern Penan	235K 💾
`pib`	Yine	114K 💾
`pir`	Piratapuyo	229K 💾
`pis`	Pijin	263K 💾
`pjt`	Pitjantjatjara	237K 💾
`pkb`	Pokomo	166K 💾
`pl`	Polish	7,148K 💾
`plw`	Brooke's Point Palawano	203K 💾
`pmf`	Pamona	307K 💾
`pny`	Pinyin	247K 💾
`poh`	Poqomchi'	266K 💾
`poi`	Highland Popoluca	179K 💾
`poy`	Pogolo	147K 💾
`ppk`	Uma	220K 💾
`ppo`	Folopa	258K 💾
`prf`	Paranan	203K 💾
`prk`	Parauk	1,026K 💾
`ps`	Pashto	7,343K 💾
`pss`	Kaulong	326K 💾
`pt`	Portuguese	20,891K 💾
`pt-PT`	Portuguese (Portugal)	666K 💾
`ptp`	Patep	294K 💾
`ptu`	Bambam	194K 💾
`pwg`	Gapapaiwa	208K 💾
`pww`	Pwo Northern Karen	345K 💾
`pxm`	Quetzaltepec Mixé	720K 💾
`qu`	Quechua	580K 💾
`qub`	Huallaga Huánuco Quechua	122K 💾
`quc`	K'iche'	207K 💾
`quf`	Lambayeque Quechua	161K 💾
`quh`	South Bolivian Quechua	623K 💾
`qul`	North Bolivian Quechua	140K 💾
`qup`	Southern Pastaza Quechua	177K 💾
`quw`	Tena Lowland Quichua	116K 💾
`quy`	Ayacucho Quechua	106K 💾
`qvc`	Cajamarca Quechua	166K 💾
`qve`	Eastern Apurímac Quechua	168K 💾
`qvi`	Imbabura Highland Quichua	146K 💾
`qvm`	Margos-Yarowilca-Lauricocha Quechua	132K 💾
`qvn`	North Junín Quechua	139K 💾
`qvo`	Napo Lowland Quechua	117K 💾
`qvs`	San Martín Quechua	153K 💾
`qvw`	Huaylla Wanca Quechua	111K 💾
`qvz`	Northern Pastaza Quichua	157K 💾
`qwh`	Huaylas Ancash Quechua	128K 💾
`qxh`	Panao Huánuco Quechua	123K 💾
`qxl`	Salasaca Highland Quichua	127K 💾
`qxn`	Northern Conchucos Ancash Quechua	150K 💾
`qxo`	Southern Conchucos Ancash Quechua	136K 💾
`qxr`	Cañar Highland Quichua	509K 💾
`rai`	Ramoaaina	273K 💾
`raj`	Malvi	198K 💾
`rav`	Sampang	138K 💾
`rej`	Rejang	178K 💾
`rim`	Nyaturu	151K 💾
`rm-puter`	Romansh (Puter)	1,068K 💾
`rm-rumgr`	Romansh (Grischun)	4,794K 💾
`rm-surmiran`	Romansh (Surmiran)	2,540K 💾
`rm-sursilv`	Romansh (Sursilvan)	11,678K 💾
`rm-sutsilv`	Romansh (Sutsilvan)	1,007K 💾
`rm-vallader`	Romansh (Vallader)	5,560K 💾
`rmc`	Carpathian Romani	170K 💾
`rmo`	Sinte Romani	228K 💾
`rn`	Rundi	120K 💾
`rnl`	Ranglong	221K 💾
`ro`	Romanian	13,962K 💾
`ro-MD`	Moldavian	2,694K 💾
`rom`	Vlax Romani	186K 💾
`roo`	Rotokas	292K 💾
`rro`	Waima	177K 💾
`ru`	Russian	40,987K 💾
`ruf`	Luguru	135K 💾
`rug`	Roviana	956K 💾
`rw`	Kinyarwanda	605K 💾
`rwo`	Rawa	261K 💾
`sab`	Buglere	405K 💾
`sah`	Sakha	2,457K 💾
`sas`	Sasak	196K 💾
`sat`	Santali	149K 💾
`sba`	Ngambay	246K 💾
`sbl`	Botolan Sambal	251K 💾
`sck`	Sadri	189K 💾
`sda`	Toraja-Sa'dan	154K 💾
`seh`	Sena	155K 💾
`sey`	Secoya	163K 💾
`sg`	Sango	265K 💾
`sgb`	Mag-antsi Ayta	233K 💾
`sgw`	Sebat Bet Gurage	116K 💾
`sgz`	Sursurunga	327K 💾
`shk`	Shilluk	189K 💾
`shn`	Shan	1,435K 💾
`shp`	Shipibo-Conibo	169K 💾
`si`	Sinhala	1,046K 💾
`sig`	Paasaal	277K 💾
`sil`	Tumulung Sisaala	256K 💾
`sim`	Mende (Papua New Guinea)	273K 💾
`sja`	Epena	194K 💾
`sk`	Slovak	70,933K 💾
`sl`	Slovenian	10,975K 💾
`sld`	Sissala	206K 💾
`sll`	Salt-Yui	264K 💾
`sm`	Samoan	248K 💾
`smt`	Simte	177K 💾
`sn`	Shona	2,542K 💾
`snc`	Sinaugoro	216K 💾
`snn`	Siona	222K 💾
`snp`	Siane	237K 💾
`snw`	Selee	212K 💾
`sny`	Saniyo-Hiyewe	348K 💾
`so`	Somali	874K 💾
`soq`	Kanasi	213K 💾
`soy`	Miyobe	205K 💾
`spl`	Selepet	244K 💾
`spp`	Supyire Senoufo	251K 💾
`sps`	Saposa	324K 💾
`sq`	Albanian	10,104K 💾
`sr`	Serbian	4,785K 💾
`sr-Latn`	Serbian (Latin)	10,143K 💾
`sri`	Siriano	166K 💾
`srm`	Saramaccan	369K 💾
`srn`	Sranan Tongo	232K 💾
`ssd`	Siroi	210K 💾
`ssg`	Seimat	221K 💾
`ssx`	Samberigi	233K 💾
`stn`	Owa	263K 💾
`su`	Sundanese	172K 💾
`sua`	Sulka	458K 💾
`sue`	Suena	227K 💾
`sur`	Mwaghavul	261K 💾
`sus`	Susu	205K 💾
`suz`	Sunwar	732K 💾
`sv`	Swedish	33,633K 💾
`sw`	Swahili	8,817K 💾
`swp`	Suau	175K 💾
`sxn`	Sangir	209K 💾
`ta`	Tamil	1,413K 💾
`tab`	Tabassaran	132K 💾
`taj`	Eastern Tamang	169K 💾
`tap`	Taabwa	145K 💾
`taq`	Tamasheq	218K 💾
`tav`	Tatuyo	256K 💾
`taw`	Tai	268K 💾
`tbc`	Takia	278K 💾
`tbg`	North Tairora	235K 💾
`tbo`	Tawala	198K 💾
`tby`	Tabaru	226K 💾
`tbz`	Ditammari	692K 💾
`tca`	Ticuna	251K 💾
`tcc`	Datooga	135K 💾
`te`	Telugu	574K 💾
`ted`	Tepo Krumen	346K 💾
`tem`	Timne	190K 💾
`teo`	Teso	118K 💾
`ter`	Tereno	187K 💾
`tfr`	Teribe	228K 💾
`tgo`	Sudest	216K 💾
`tgp`	Tangoa	228K 💾
`thk`	Tharaka	150K 💾
`ti`	Tigrinya	803K 💾
`tif`	Tifal	413K 💾
`tih`	Timugon Murut	879K 💾
`tik`	Tikar	264K 💾
`tim`	Timbe	206K 💾
`tk`	Turkmen	516K 💾
`tlb`	Tobelo	209K 💾
`tlf`	Telefol	422K 💾
`tlj`	Talinga-Bwisi	159K 💾
`tmc`	Tumak	245K 💾
`tna`	Tacana	216K 💾
`tnr`	Ménik	254K 💾
`to`	Tonga	1,214K 💾
`tob`	Toba	229K 💾
`toc`	Coyutla Totonac	218K 💾
`toh`	Gitonga	194K 💾
`top`	Papantla Totonac	168K 💾
`tos`	Highland Totonac	224K 💾
`tpi`	Tok Pisin	8,049K 💾
`tpm`	Tampulma	892K 💾
`tpp`	Pisaflores Tepehua	162K 💾
`tpt`	Tlachichilco Tepehua	173K 💾
`tpz`	Tinputz	370K 💾
`tqo`	Toaripi	215K 💾
`tr`	Turkish	13,846K 💾
`trs`	Chicahuaxtla Triqui	287K 💾
`tsz`	Purepecha	129K 💾
`tt`	Tatar	1,356K 💾
`ttc`	Tektiteko	231K 💾
`tte`	Bwanabwana	198K 💾
`tue`	Tuyuca	141K 💾
`tuf`	Central Tunebo	237K 💾
`twb`	Western Tawbuid	198K 💾
`twu`	Termanu	242K 💾
`txa`	Tombonuo	224K 💾
`txu`	Kayapó	354K 💾
`tyv`	Tuvinian	614K 💾
`tyz`	Tày	260K 💾
`tzh`	Tzeltal	901K 💾
`tzj`	Tz'utujil	245K 💾
`ubr`	Ubir	222K 💾
`ubu`	Umbu-Ungu	308K 💾
`udm`	Udmurt	135K 💾
`udu`	Uduk	287K 💾
`ug`	Uyghur	9,493K 💾
`uk`	Ukrainian	12,921K 💾
`ur`	Urdu	3,622K 💾
`ura`	Urarina	193K 💾
`urb`	Urubú-Kaapor	347K 💾
`urk`	Urak Lawoi'	368K 💾
`ury`	Orya	301K 💾
`usa`	Usarufa	171K 💾
`usp`	Uspanteco	228K 💾
`uvl`	Lote	277K 💾
`uz`	Uzbek	131K 💾
`vag`	Vagla	221K 💾
`vec`	Venetian	2K 💾
`vec-u-sd-itpd`	Venetian (Padua)	813K 💾
`vec-u-sd-itts`	Venetian (Trieste)	12K 💾
`vec-u-sd-itvr`	Venetian (Verona)	16K 💾
`vid`	Vidunda	151K 💾
`viv`	Iduna	220K 💾
`vmw`	Makhuwa	130K 💾
`vun`	Vunjo	141K 💾
`vut`	Vute	206K 💾
`waj`	Waffa	236K 💾
`wap`	Wapishana	193K 💾
`war`	Waray	208K 💾
`way`	Wayana	143K 💾
`wer`	Weri	209K 💾
`wiu`	Wiru	232K 💾
`wlx`	Wali	847K 💾
`wmw`	Mwani	139K 💾
`wnc`	Wantoat	238K 💾
`wnu`	Usan	234K 💾
`wob`	Wè Northern	270K 💾
`wos`	Hanga Hundi	264K 💾
`wrs`	Waris	213K 💾
`wsk`	Waskia	239K 💾
`wuv`	Wuvulu-Aua	187K 💾
`wwa`	Waama	239K 💾
`xal`	Kalmyk	135K 💾
`xav`	Xavánte	440K 💾
`xed`	Hdi	229K 💾
`xla`	Kamula	230K 💾
`xog`	Soga	127K 💾
`xrb`	Eastern Karaboro	286K 💾
`xsb`	Sambal	244K 💾
`xsi`	Sio	319K 💾
`xsm`	Kasem	604K 💾
`xsr`	Sherpa	184K 💾
`xsu`	Sanumá	408K 💾
`xtd`	Diuxi-Tilantongo Mixtec	277K 💾
`xtm`	Magdalena Peñasco Mixtec	335K 💾
`xuo`	Kuo	306K 💾
`yaa`	Yaminahua	204K 💾
`yad`	Yagua	142K 💾
`yal`	Yalunka	203K 💾
`yam`	Yamba	277K 💾
`yaz`	Lokaa	222K 💾
`yby`	Yaweyuha	219K 💾
`ycn`	Yucuna	202K 💾
`yle`	Yele	298K 💾
`yli`	Angguruk Yali	221K 💾
`yml`	Iamalele	245K 💾
`yo`	Yoruba	270K 💾
`yon`	Yongkom	202K 💾
`yrb`	Yareba	184K 💾
`yre`	Yaouré	285K 💾
`yss`	Yessan-Mayo	227K 💾
`yua`	Yucateco	813K 💾
`yuj`	Karkar-Yuri	258K 💾
`yut`	Yopno	227K 💾
`yuw`	Yau (Morobe Province)	243K 💾
`yva`	Yawa	250K 💾
`zaa`	Sierra de Juárez Zapotec	265K 💾
`zad`	Cajonos Zapotec	180K 💾
`zae`	Yareni Zapotec	248K 💾
`zap`	Zapotec	194K 💾
`zas`	Santo Domingo Albarradas Zapotec	184K 💾
`zaw`	Mitla Zapotec	157K 💾
`zca`	Coatecas Altas Zapotec	236K 💾
`zia`	Zia	242K 💾
`ziw`	Zigula	140K 💾
`zlm`	Malay	664K 💾
`zne`	Zande	253K 💾
`zpc`	Choapan Zapotec	208K 💾
`zpi`	Santa María Quiegolani Zapotec	209K 💾
`zpq`	Zoogocho Zapotec	208K 💾
`zpt`	San Vicente Coatlán Zapotec	229K 💾
`zpz`	Texmelucan Zapotec	281K 💾
`zyp`	Zyphe Chin	230K 💾

¹ Downloadable files include counts for each token; to get raw text, run the crawler yourself. For breaking text into words, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO.
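
As a rough illustration of the counting described in the footnote, the sketch below tallies word tokens with Python's `re` module. It is a stand-in and not ICU: the function name `count_tokens` is hypothetical, and a simple letter-run pattern will segment text differently from an ICU word break iterator (for example, for scripts written without spaces or for words containing combining marks). Like the footnote's rule, it leaves out numbers and punctuation.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits, underscores and punctuation are excluded.
# This only approximates ICU's letter/kana/ideograph word statuses.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def count_tokens(text):
    """Return a Counter mapping each letter-only token to its frequency."""
    return Counter(match.group(0) for match in LETTER_RUN.finditer(text))
```

For instance, `count_tokens("the cat saw the dog")["the"]` evaluates to `2`.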

Running the Crawler

./corpuscrawler --language=yo --output=./corpus

corpuscrawler's Issues

Thanlwintimes.com No Longer Available

The site thanlwintimes.com is one of the sites used by Corpus Crawler. However, the site no longer seems to be available; when I visit it, I see an error saying that the domain has expired:

http://thanlwintimes.com/

The site should be replaced with another site using the same encoding.

Add Pali, Mon, and Karen

We've been referred to the following sources for corpora in additional Myanmar-script languages.

Pali (Tri Pitaka) [pi-Mymr]
1. https://tipitaka.org/mymr/
Mon [mnw]
1. http://mon.monnews.org/
2. https://mnw.wikipedia.org/wiki/မုက်လိက်တမ်
Shan [shn] -- already included
1. https://shannews.org/
2. https://shn.wikipedia.org/wiki/ၼႃႈႁူဝ်ႁႅၵ်ႈ
Karen [kar]
1. http://karen.kicnews.org/
2. https://wol.jw.org/ksw/wol/h/r350/lp-kr (Bible, Publications...)

CC @sven-oly

[be-tarask] Add corpus for Belarusian (Taraškievica)

This site is in be-tarask: https://www.svaboda.org/

Add Norwegian language

Norwegian language is currently not present.
It would be great to have it added too.

how to

@brawer how to use this script?

Does not run in python3.7 or python 2.7

$ python2 --version
Python 2.7.16+

$ python3 --version
Python 3.7.2+

$ python3 ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit:      http://listen.bible.is/robots.txt
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_tzh.py", line 21, in crawl
    crawl_bibleis(crawler, out, bible='TZHSBM')
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 776, in crawl_bibleis
    init = crawler.fetch(firsturl)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 136, in fetch
    if not self.is_fetch_allowed_by_robots_txt(url):
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 259, in is_fetch_allowed_by_robots_txt
    checker.parse(robots_txt.decode('utf-8'))
AttributeError: 'str' object has no attribute 'decode'

$ python2 ./corpuscrawler --language tzh --output output-tzh/
Traceback (most recent call last):
  File "./corpuscrawler", line 24, in <module>
    import corpuscrawler.main
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 20, in <module>
    from corpuscrawler import (
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_aaz.py", line 16, in <module>
    from corpuscrawler.util import crawl_bibleis
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 18, in <module>
    from builtins import open, bytes, chr
ImportError: No module named builtins
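
The Python 3 traceback above shows `.decode('utf-8')` being called on a value that is already a `str`. One defensive pattern, sketched here under the assumption that the cached robots.txt content may arrive as either `bytes` or `str`, is to normalise the value before parsing. The helper name `to_text` is hypothetical and is not part of corpuscrawler.

```python
def to_text(data, encoding="utf-8"):
    """Return data as str, decoding only when it is actually bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data
```

With such a helper, the failing call could plausibly become `checker.parse(to_text(robots_txt))`; whether that is the right fix depends on what the cache is meant to store.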

harfbuzz-testing-wikipedia

Hi Sascha,

Nice work! Here's the output of what roozbeh did for HarfBuzz testing by extracting Wikipedia:
https://github.com/behdad/harfbuzz-testing-wikipedia

Don't know if it's of much use. That one included all talk pages of Wikipedia as well, so the word distribution is skewed; for example, the word for "User" is over-represented. Anyway, thought I'd share it here for the record.

Use corpora from Universal Dependencies

The Universal Dependencies project has corpora in a set of languages; consider incorporating them.

what sites are crawled?

I looked through the readme; is there a list of what sites are crawled by this script? Is there documentation for how to add additional sites?

Add (Modern Standard) Arabic language

Is there any work being done regarding any Arabic dialects?

We can start with http://www.dw.com/ar/, which is Modern Standard Arabic. I think MSA is a good start, and we can add regional dialects later.

Please list here any source you think we should add, for MSA or regional dialects.

Shorten project structure

Related to #80. Suggestion: move the core code up so that it is more visible.
The crawlers are kept in their own folder.

Reorganize the project structure from:

corpuscrawler
├─ README.md
├─ LICENSE
├─ LICENSE.md
├─ CONTRIBUTING.md
├─ corpuscrawler
└─ Lib
   └─ corpuscrawler
      ├─ *.py : utilities
      └─ crawl_{iso}.py : crawlers

to:

corpuscrawler
├─ README.md
├─ LICENSE
├─ LICENSE.md
├─ CONTRIBUTING.md
├─ corpuscrawler
├─ *.py : utilities
└─ crawlers
   └─ crawl_{iso}.py : crawlers

Would such changes break any complementary toolchain?

Error when crawling Kaqchikel

Not sure what is going on here:

fran@ipek:~/source/corpuscrawler$ ./corpuscrawler --output ~/corpora/languages/kaqchikel/corpcrawl/ --language cak
Downloading:    http://listen.bible.is/robots.txt
Downloading:    http://listen.bible.is/CAKSBG/Matt/1
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_cak.py", line 21, in crawl
    crawl_bibleis(crawler, out, bible='CAKSBG')
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 718, in crawl_bibleis
    jsonraw = json.loads(content.split('var chaptersByBook = ')[1].split(';\n')[0])
IndexError: list index out of range
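
The `IndexError` above arises because `split(...)[1]` assumes the marker `var chaptersByBook = ` is always present in the downloaded page. A more forgiving extraction might look like the sketch below. The function name `extract_chapters` is hypothetical, and returning `None` on failure is just one possible choice; the real crawler might prefer to log the URL or raise a clearer error.

```python
import json

MARKER = 'var chaptersByBook = '


def extract_chapters(content):
    """Return the parsed chaptersByBook JSON, or None if it cannot be found."""
    parts = content.split(MARKER, 1)
    if len(parts) < 2:
        # Marker missing: the page layout probably changed.
        return None
    raw = parts[1].split(';\n', 1)[0]
    try:
        return json.loads(raw)
    except ValueError:
        # Marker present, but what follows is not valid JSON.
        return None
```

This does not fix the crawl itself: if the page no longer embeds the data, the extraction logic needs to be updated to match the new layout.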

Python 3 compatibility

When attempting to run corpuscrawler in Python 3, I get the error:

ImportError: No module named 'mimetools'

It would be good to support Python 3.

When running in Python 2, I get this other error:

AttributeError: 'module' object has no attribute 'SSLContext'

which apparently happens because that function was backported very recently to Python 2.

[gd] Extend Scottish Gaelic corpus

For Scottish Gaelic, https://dasg.ac.uk/text/ now contains plaintext files, which makes it easier to crawl than before. Some material is multilingual, but it’s already language-tagged with a custom tagging scheme using tags such as <eng> and <gai>. For example, https://dasg.ac.uk/text/68.txt has English sections that are marked up as shown below; a trivial regexp substitution should be able to remove the English sections:

Dh’fhosgail e i; is léugh e:
<eng>The Queen, who is lying very ill, urges your immediate attendance.
(Signed) Eveleyn Marlborough.<gai>
“Ma thig am Prionnsa,” thuirt e, ...
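
The substitution suggested above could be sketched as follows. It assumes that every English section opens with `<eng>` and ends at the next `<gai>`, as in the sample; files that use other tags or leave `<eng>` unclosed would need extra handling. The function name `strip_english` is hypothetical.

```python
import re

# Non-greedy match from <eng> up to the next <gai>, across line breaks.
ENGLISH_SPAN = re.compile(r"<eng>.*?<gai>", re.DOTALL)


def strip_english(text):
    """Remove <eng>...<gai> spans, keeping the surrounding Gaelic text."""
    return ENGLISH_SPAN.sub("", text)
```

Any `<gai>` tags that do not close an English span are left in place by this sketch and would have to be stripped separately.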

/cc @jimregan

[mi] Filter out English text

The crawled Maori language corpus contains some English text. By filtering out every paragraph that contains the string `the`, we’d remove 1902 lines from the corpus, which are mostly English-language quotes. I believe this would be an improvement. @jimregan, would you be fine with this change?

Here’s a few sample paragraphs that would be removed:

Hei tā Whakaruru, “That's what the King believed his mother signed was to give us the means to help. We want to put on the table lands, resources and people.”

The doting parents of Whakatōhea descendant Kayla Imrie have won a free ticket to the Olympics.

Hei tā Kiripatea, “This is a primary model of one of the backyard gardens that we have done and we did it over a weekend.”

Thanks to Volkswagen, they will get a free ticket to watch their daughter perform with the NZ K4 Women's Kayaking crew at Rio.

The first-time Olympian is currently in Portugal training alongside her K4 team, Aimee Fisher, Caitlin Ryan and Jamie Lovett, for the upcoming world cup in Germany this month.

Ko tā Marama Davidson, mema o te Pāti Kākāriki, “We are still in decline when it comes to those who can speak te reo. So I think it's only 3.7% of people in Aotearoa can have a conversation in te reo. We know that over 77% of children are not enrolled in any subject at schools at the moment and we know that half of all schools in Aotearoa have absolutely no student learning te reo or taking a te reo subject.”
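
A sketch of the proposed filter. The issue only says "contains the string the"; matching it as a whole word and ignoring case is an assumption made here, and the function name is hypothetical.

```python
import re

# Assumption: match "the" as a whole word, case-insensitively.
THE = re.compile(r'\bthe\b', re.IGNORECASE)


def drop_english_paragraphs(paragraphs):
    """Keep only the paragraphs that do not contain the word "the"."""
    return [p for p in paragraphs if not THE.search(p)]
```

Note that this drops mixed paragraphs too, such as the quotes introduced by “Hei tā ...” above, which matches what the issue proposes.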

Allow Zawgyi crawling separate from my

In order to crawl my-t-d0-zawgyi.txt, you have to wait for my.txt to finish. It would be nicer if the two files could be crawled in parallel. For example, one could start crawling my.txt, my-t-d0-zawgyi.txt, and shn.txt, and have data after just a few minutes; the data files can keep growing as more pages get crawled.

Use available sentences corpora for Wikipedia (290+ languages)

There are ready-to-download open licence Wikipedia corpora available.

Project introduction	Type	Languages (2024)	Portal all	Language specific	Download link	Comments
Wortschatz by Leipzig	Sentences Monolingual	290+	-	bre	bre 100k sentences (2021)	List of sentences corpora : API reference > https://api.wortschatz-leipzig.de/ws/corpora

Improve readme documentation on how to provide a new crawler

This /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this very valuable project has no actual tutorial.

I don't have the Python and coding knowledge to fix this documentation issue, but I can map the road so it becomes easier for the next person to do so.

Wanted

If a user wants to add a language such as Catalan from Barcelona (ca, cat: missing), what do they need to jump in quickly? What should they provide?

What is the local structure:
- util.py : stores functions used by multiple language crawlers
- main.py : stores the 1000+ crawler calls and runs them all.
- crawl_{iso}.py : stores a language-specific corpus's source URLs and processing functions.
  - crawl_ca_valencia.py
What tools do I have:
- lists of available modules
- API of key functions
What input(s): a Python list of URLs?
What are the classic parts of a crawler function?
What output format: raw text? Is HTML fine because an HTML tag stripper is applied afterwards?
Example of easily hackable base-code.
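
As a starting point for such base code, here is a rough sketch of the shape a language module seems to take, pieced together from the tracebacks and signatures quoted in this thread: a `crawl(crawler)` entry point, `crawler.get_output(language=...)`, `crawler.fetch(url)` and a `.status` attribute on the result. Everything else is an assumption made so the sketch runs on its own: the `StubCrawler` stand-in, the `.content` attribute, the example.org URL, and the naive regex-based tag stripping (the real util.py has its own cleaners).

```python
import io
import re
from collections import namedtuple

# Stand-in for the real Crawler so this sketch is runnable by itself.
# Only get_output(), fetch() and .status are taken from this thread;
# .content and the rest are assumptions.
FetchResult = namedtuple('FetchResult', ['status', 'content'])


class StubCrawler(object):
    def __init__(self, pages):
        self.pages = pages
        self.out = io.StringIO()

    def get_output(self, language=None):
        return self.out

    def fetch(self, url):
        if url in self.pages:
            return FetchResult(200, self.pages[url].encode('utf-8'))
        return FetchResult(404, b'')


def crawl(crawler):
    """Entry point of a hypothetical crawl_ca.py."""
    out = crawler.get_output(language='ca')
    for url in ['https://example.org/ca/article-1']:  # hypothetical source
        _crawl_page(crawler, out, url)


def _crawl_page(crawler, out, url):
    doc = crawler.fetch(url)
    if doc.status != 200:
        return  # skip pages that cannot be fetched
    html = doc.content.decode('utf-8')
    for paragraph in re.findall(r'<p>(.*?)</p>', html, flags=re.DOTALL):
        # Naive tag stripping and whitespace normalisation.
        text = ' '.join(re.sub(r'<[^>]+>', '', paragraph).split())
        if text:
            out.write(text + '\n')
```

The output is plain text, one paragraph per line, in line with the README's statement that the crawler writes plaintext files.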

API (to complete)

Functions defined in util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help with a sub-section or a single item.

Some tools

daterange(start, end): __
urlpath(url): __
urlencode(url): __

Main element

class Crawler(object):
- __init__(self, language, output_dir, cache_dir, crawldelay): __
- get_output(self, language=None): __
- close(self): __
- fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
- fetch_content(self, url, allow_404=False): __
- fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
- is_fetch_allowed_by_robots_txt(self, url): __
- crawl_pngscriptures_org(self, out, language): __
- _find_urls_on_pngscriptures_org(self, language): __
- crawl_abc_net_au(self, out, program_id): __
- crawl_churchio(self, out, bible_id): __
- crawl_aps_dz(self, out, prefix): __
- crawl_sverigesradio(self, out, program_id): __
- crawl_voice_of_america(self, out, host, ignore_ascii=False): __
- set_context(self, context): __

Some crawlers for multi-languages sites

crawl_bbc_news(crawler, out, urlprefix): __
crawl_korero_html(crawler, out, project, genre, filepath): __
write_paragraphs(et, out): __
crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
crawl_sputnik_news(crawler, out, host): __
crawl_udhr(crawler, out, filename): __
crawl_voice_of_nigeria(crawler, out, urlprefix): __
crawl_bibleis(crawler, out, bible): __
crawl_tipitaka(crawler, out, script): __
find_wordpress_urls(crawler, site, **kwargs): __

Some cleaners

unichar(i): __
replace_html_entities(html): __
cleantext(html): __
clean_paragraphs(html): __
extract(before, after, html): __
fixquotes(s): __

Shorter way to do so

In-code comments can do a lot. Pointing to wisely chosen sections helps too. If you have the required know-how, please add comments to a chosen existing crawler and point to it as an in-code tutorial.

@sffc, @brawer: could anyone help with that?

Adding New URLs

Hi,

Can we fetch data from URLs not mentioned in the existing code by adding custom functions? Also, does it not support the English language ('en' is not mentioned anywhere in the list of supported languages)?

Thanks

404 error with Myanmar Zawgyi

I ran ./corpuscrawler --language=my-t-d0-zawgyi --output=./corpus (with python 2.7 on Ubuntu 18.04) and the program crashed while downloading from some url. The output is shown below.

Downloading:    http://thanlwintimes.com/robots.txt
Downloading:    http://thanlwintimes.com/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/5/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/6/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/5/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/6/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/page/2/
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/crawl_my_t_d0_zawgyi.py", line 24, in crawl
    _crawl_than_lwin_times(crawler, out)
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/crawl_my_t_d0_zawgyi.py", line 28, in _crawl_than_lwin_times
    urls = find_wordpress_urls(crawler, 'http://thanlwintimes.com/')
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/util.py", line 767, in find_wordpress_urls
    assert pgdoc.status == 200, (pgdoc.status, pgurl)
AssertionError: (404, u'http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/page/2/')
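
The crash comes from an assertion that every paginated category page returns 200. One possible way to tolerate this, sketched below, is to treat the first non-200 page as the end of the category instead of asserting. The function name and the injected `fetch_status` callable are hypothetical; the body of the real `find_wordpress_urls` is not shown in this report, so whether this policy fits it is an assumption.

```python
def collect_category_pages(fetch_status, category_url, max_pages=1000):
    """Return page URLs for one category, stopping at the first non-200 page.

    fetch_status is any callable mapping a URL to an HTTP status code;
    it stands in for crawler.fetch(url).status.
    """
    pages = [category_url]  # the category's first page is assumed to exist
    for number in range(2, max_pages + 1):
        page_url = '%spage/%d/' % (category_url, number)
        if fetch_status(page_url) != 200:
            break  # ran past the last page (e.g. a 404): stop instead of crashing
        pages.append(page_url)
    return pages
```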

Crawl Pali corpora

It would be nice to add crawlers for https://www.tipitaka.org to Corpus Crawler. This is the Buddhist Tipitaka in the Pali language, written in various scripts. The crawled corpora will be useful for testing Pali transliteration rules, contributed to Unicode CLDR by @mjansche.

Update Zawgyi locale to Qaag

CLDR is proposing the script code Qaag to use for Zawgyi text:

Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration.

corpuscrawler should be updated to use the new script code instead of the -u-s0-zawgyi workaround. Myanmar Tools will need to also be updated to consume the new script code.

Add Wikipedia crawler ? (300+ languages)

A quick search shows you that CorpusCrawler does not crawl or use Wikipedia. I don't know Python but it seems feasible, either from scratch on Wikipedia API (1) or using existing server-side tools (2).

Assess interest

Assess how many Wikipedia languages are not in UNILEX. See unicode-org/unilex#14 .
Assess quality of Wikipedia raw text data in minority languages.
Compare gain to other available public corpora such as Tatoeba (358 languages).

Crawling via API

Load the available list of articles for each Wikipedia, then scrape the pages. If the list is too large, it could be limited to max=n articles.

Given an iso code such as Ndonga's ng :

download List of page titles in main namespace archive (see below)
get the articles into a python list variable (python)
code a crawler in /Lib/corpuscrawler/util.py, following other crawlers as examples, which queries the Wikipedia API, extracts the valuable text, and saves it. (python)
Update relevant crawlers /Lib/corpuscrawler/
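
The API step above might be sketched as follows. It assumes the MediaWiki `action=query` endpoint with `prop=extracts` and `explaintext` (the TextExtracts extension, as deployed on Wikipedia) and the usual `query.pages.<id>.extract` response layout. Both function names are hypothetical, and no network request is made here; the sketch only builds the URL and parses a response body.

```python
import json
from urllib.parse import urlencode


def extract_api_url(lang, title):
    """Build an API URL asking for the plain-text extract of one article."""
    query = urlencode({
        'action': 'query',
        'prop': 'extracts',
        'explaintext': 1,
        'format': 'json',
        'titles': title,
    })
    return 'https://%s.wikipedia.org/w/api.php?%s' % (lang, query)


def plain_text_from_response(body):
    """Pull the non-empty 'extract' strings out of a JSON response body."""
    pages = json.loads(body).get('query', {}).get('pages', {})
    return [page['extract'] for page in pages.values() if page.get('extract')]
```

In a real crawler the URL would presumably be fetched through the existing `crawler.fetch` so that caching, robots.txt checks and the crawl delay still apply.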

Wikipedia API provides text

Various formats available:

format : The format of the output.
- json : Output data in JSON format.
- jsonfm : Output data in JSON format (pretty-print in HTML).
- none : Output nothing.
- php : Output data in serialised PHP format.
- phpfm : Output data in serialised PHP format (pretty-print in HTML).
- rawfm : Output data, including debugging elements, in JSON format (pretty-print in HTML).
- xml : Output data in XML format.
- xmlfm : Output data in XML format (pretty-print in HTML).

List of Wikipedia (~300)

List_of_Wikipedias
List of dumps - Wikipedia and others wiki projects.

List of articles per Wikipedia

For convenience, I use the tiny Ndonga (ng) Wikipedia (8 articles), easier to explore by hand.

For a larger demo, you could also inspect similar URLs with the ISO code of:

Language	Native	iso	Articles
Ndonga	Oshiwambo	ng	8
Inuktitut	ᐃᓄᒃᑎᑐᑦ/inuktitut	iu	514
Samoan	Gagana Samoa	sm	985
Igbo	Igbo	ig	2,085
Central Bikol	Bikol Central	bcl	10,824

Namespaces

On all wikis. See also here

0: (main)
1: Talk:
2: User:
3: User_talk:

Dumps & paths

List of dumps
- /ngwiki/20200220 - manual (change the date)
- /ngwiki/latest - directory
  - /ngwiki-latest-all-titles.gz
  - /ngwiki-latest-all-titles-in-ns0.gz - articles only
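
Reading the article-titles archive listed above could look like this. The dumps.wikimedia.org host, the possible "page_title" header line and the underscore-for-space convention are assumptions based on how these dumps usually look; the function names are made up for illustration, and the download itself is left out.

```python
import gzip


def titles_url(lang):
    """Build the URL of the latest main-namespace title list for one wiki."""
    return ('https://dumps.wikimedia.org/%swiki/latest/'
            '%swiki-latest-all-titles-in-ns0.gz' % (lang, lang))


def parse_titles(gz_bytes):
    """Turn the gzipped title list into a Python list of article titles."""
    lines = gzip.decompress(gz_bytes).decode('utf-8').splitlines()
    # Assumption: the file may start with a "page_title" header line.
    if lines and lines[0] == 'page_title':
        lines = lines[1:]
    # Titles use underscores in place of spaces.
    return [line.replace('_', ' ') for line in lines if line]
```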

Using Wikipedia extractors ?

Hybrid approach

ISO: get the list of all local wiki's iso codes.
Downloads: loop over each language code, download the dump.
Extract: use extractor above, zip each language
Cloud: put text result online.
Crawl: in util.py, code a simple crawler which gets just that .zip, converts it back to txt content, and adds it to the corpora.

cc: @brawer

Portuguese: doubt about the corpus result

I was analyzing the output file and noticed that the text for each "news" item is only the title, the headline, and the first paragraph. Is that correct?
I'm using the crawler for "pt" language.

Undefined names

% flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

./corpuscrawler/Lib/corpuscrawler/crawl_mi.py:62:39: F821 undefined name 'sitemap'
        if pubdate is None: pubdate = sitemap[url]
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_kab.py:53:48: F821 undefined name 'url'
        assert doc.status == 200, (doc.status, url)
                                               ^
./corpuscrawler/Lib/corpuscrawler/crawl_tpi.py:48:48: F821 undefined name 'url'
        assert doc.status == 200, (doc.status, url)
                                               ^
./corpuscrawler/Lib/corpuscrawler/crawl_shn.py:90:30: F821 undefined name 'striptags'
                p = ' '.join(striptags(replace_html_entities(p)).split())
                             ^
./corpuscrawler/Lib/corpuscrawler/crawl_shn.py:90:40: F821 undefined name 'replace_html_entities'
                p = ' '.join(striptags(replace_html_entities(p)).split())
                                       ^
./corpuscrawler/Lib/corpuscrawler/crawl_ga.py:147:39: F821 undefined name 'fetchresult'
        if pubdate is None: pubdate = fetchresult.headers.get('Last-Modified')
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_th.py:25:5: F821 undefined name 'crawl_bibleis'
    crawl_bibleis(crawler, out, bible='THATSV')
    ^
./corpuscrawler/Lib/corpuscrawler/crawl_vec.py:43:48: F821 undefined name 'start_url'
        assert doc.status == 200, (doc.status, start_url)
                                               ^
8     F821 undefined name 'fetchresult'
8

https://flake8.pycqa.org/en/latest/user/error-codes.html

On the flake8 test selection, this PR does not focus on "style violations" (the majority of flake8 error codes that psf/black can autocorrect). Instead, these tests focus on runtime safety and correctness:

E9 tests are about Python syntax errors usually raised because flake8 can not build an Abstract Syntax Tree (AST). Often these issues are a sign of unused code or code that has not been ported to Python 3. These would be compile-time errors in a compiled language but in a dynamic language like Python, they result in the script halting/crashing on the user.
F63 tests are usually about the confusion between identity and equality in Python. Comparing str, bytes, and int literals with is/is not instead of ==/!= is the classic case. These are areas where a == b is True but a is b is False (or vice versa). Python >= 3.8 will raise SyntaxWarnings on these instances.
F7 tests logic errors and syntax errors in type hints
F82 tests are almost always undefined names which are usually a sign of a typo, missing imports, or code that has not been ported to Python 3. These also would be compile-time errors in a compiled language but in Python, a NameError is raised which will halt/crash the script on the user.
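
To illustrate why the F821 findings above can stay hidden for a long time: in an `assert cond, message` statement the message is evaluated only when the assertion fails, so the undefined name surfaces as a NameError on exactly the runs where the error report is needed. A small sketch with hypothetical names (`Doc`, `check`, `outcome`):

```python
from collections import namedtuple

Doc = namedtuple('Doc', ['status'])


def check(doc):
    # 'url' is deliberately undefined, mirroring the flake8 report above.
    assert doc.status == 200, (doc.status, url)  # noqa: F821
    return 'ok'


def outcome(doc):
    """Run check() and describe what happened."""
    try:
        return check(doc)
    except NameError as err:
        return 'NameError: %s' % err
```

With a 200 status the function returns normally; with any other status the intended AssertionError is replaced by a NameError.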

crawler gets hung after downloading a few hits

I am trying to use this crawler to build an Urdu corpus. I am running Ubuntu 18.04 inside a VMWare virtual machine. The crawler will start and successfully download a few links but will eventually get permanently hung up. Nothing happens until I ctrl-c to exit the script. I can kill the script, start it again and it will successfully get the link it got hung up on the previous run, it will then successfully crawl a few more until getting hung up again. The below copied text is an example of what I get when I kill the script with ctrl-c

...(the crawler has successfully downloaded several links so far)...
Downloading: https://www.bbc.com/urdu/entertainment-37527961
Downloading: https://www.bbc.com/urdu/entertainment-37529481
Downloading: https://www.bbc.com/urdu/entertainment-37531642
Downloading: https://www.bbc.com/urdu/entertainment-37532975
^CTraceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/crawl_ur.py", line 22, in crawl
    crawl_bbc_news(crawler, out, urlprefix='/urdu/')
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 475, in crawl_bbc_news
    fetchresult = crawler.fetch(url)
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 150, in fetch
    content = response.read()
  File "/usr/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 597, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 772, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 659, in read
    v = self._sslobj.read(len)
KeyboardInterrupt
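
The traceback shows the script blocked inside `response.read()`, which is consistent with a socket that has no timeout. A possible mitigation, sketched here in Python 3 with a hypothetical `fetch_with_timeout` helper, is to pass a timeout so that each blocking socket operation gives up after a while. Whether the crawler's `fetch` can be changed this way is an assumption, since its code is not shown; the `opener` parameter exists only to make the sketch easy to try without a network.

```python
import urllib.request


def fetch_with_timeout(url, timeout=30.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the request fails or stalls."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError and socket timeouts are both subclasses of OSError.
        print('Skipping %s: %s' % (url, err))
        return None
```

A caller could retry or move on to the next URL when None comes back, instead of hanging until Ctrl-C.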

No module named 'corpuscrawler' error

The script doesn't run with Python 3.

Shows error :

For solving this I have tried changing this:

to: checker.parse(robots_txt) as it is already decoded in Python 3 and it worked for me.
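
For reference, a sketch of robots.txt parsing that works whether the content arrives as bytes or as str. It assumes `checker` is a standard-library `RobotFileParser`, which this report does not confirm; the helper name is hypothetical. Note that `RobotFileParser.parse` expects an iterable of lines, so the text is split first.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Accept robots.txt content as bytes or str and return a parser."""
    if isinstance(robots_txt, bytes):
        robots_txt = robots_txt.decode('utf-8', errors='replace')
    checker = RobotFileParser()
    checker.parse(robots_txt.splitlines())  # parse() takes a list of lines
    return checker
```

The resulting object answers `checker.can_fetch(useragent, url)` as usual.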

After this I am getting error:

When running main.py, I am getting the error:

ModuleNotFoundError: No module named 'corpuscrawler'

Can anybody help solving this?

Documentation > Clarify language codes system in uses

Tiny issue : add a reference to the language code system you use.

I may have missed it.

Rename crawl_taq to crawl_kab

The language which we currently call taq is actually Kabyle (BCP47: kab). Our use of taq came from the website taq.tamurt.info, but their taq stands for the word “taqbaylit,” which means Kabyle in Berber.

Use available corpora for opensubtitles (63 languages)

Research

J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

Gain

The closest available thing to a corpus of natural spoken language.

Links

Portal
- bre.txt.gz -- Breton corpus.
- 60+ languages available.
- List: af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw