corpuscrawler's Introduction

Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
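The two politeness rules described above (honouring robots.txt and pausing between requests) can be sketched with the Python standard library. This is a minimal illustration, not the project's own code: the names `is_fetch_allowed` and `polite_urls`, the two-second default delay, and the example URLs are all assumptions made for this sketch.

```python
import time
import urllib.robotparser


def is_fetch_allowed(robots_txt, url, user_agent='*'):
    """Return True if the robots.txt text permits fetching `url`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_urls(urls, robots_txt, delay_seconds=2.0, sleep=time.sleep):
    """Yield only the allowed URLs, pausing between consecutive ones.

    `delay_seconds` is an assumed value; `sleep` is injectable so the
    pause can be skipped in tests.
    """
    first = True
    for url in urls:
        if not is_fetch_allowed(robots_txt, url):
            continue  # disallowed by the Robots Exclusion Standard
        if not first:
            sleep(delay_seconds)  # keep the load on the site low
        first = False
        yield url


if __name__ == '__main__':
    robots = 'User-agent: *\nDisallow: /private/\n'
    candidates = ['http://example.org/news/1', 'http://example.org/private/x']
    for allowed in polite_urls(candidates, robots, delay_seconds=0.1):
        print(allowed)
```

Passing the robots.txt body in as text keeps the sketch independent of how the file was downloaded or cached.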

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

The crawled corpora have been used to compute word frequencies in Unicode’s Unilex project.

Supported Languages

IETF BCP47 Code Language Tokens¹
aai Arifama-Miniafia 181K 💾
aak Ankave 194K 💾
aau Abau 313K 💾
aaz Amarasi 308K 💾
abt Ambulas 297K 💾
aby Aneme Wake 233K 💾
acd Gikyode 323K 💾
ace Aceh/Acehnese 817K 💾
acf Saint Lucian Creole French 236K 💾
ach Acoli 178K 💾
acn Achang 232K 💾
acr Achi 239K 💾
acu Achuar-Shiwiar 174K 💾
ade Adele 267K 💾
adh Adhola 166K 💾
adj Adioukrou 233K 💾
ae Avestan 129K 💾
ae-Latn Avestan (Latin) 141K 💾
aey Amele 218K 💾
agd Agarabi 256K 💾
agg Angor 214K 💾
agm Angaataha 238K 💾
agn Agutaynen 234K 💾
agr Aguaruna 149K 💾
ahk Akha 367K 💾
aia Arosi 223K 💾
akb Batak Angkola 220K 💾
ake Akawaio 190K 💾
akh Akha 408K 💾
akp Siwu 191K 💾
alj Alangan 185K 💾
alp Alune 225K 💾
alt Southern Altai 121K 💾
alz Alur 160K 💾
am Amharic 2,170K 💾
ame Yanesha' 221K 💾
amf Hamer-Banna 152K 💾
amk Ambai 229K 💾
amm Ama (Papua New Guinea) 246K 💾
amn Amanab 207K 💾
amp Alamblak 241K 💾
amr Amarakaeri 151K 💾
amu Guerrero Amuzgo 202K 💾
ann Obolo 236K 💾
anv Denya 214K 💾
aoj Mufian 217K 💾
aom Ömie 231K 💾
aon Bumbita Arapesh 294K 💾
aoz Uab Meto 197K 💾
ape Bukiyip 294K 💾
apr Arop-Lokep 373K 💾
apz Safeyoka 235K 💾
ar Arabic 19,593K 💾
arl Arabela 206K 💾
asg Cishingini 270K 💾
aso Dano 290K 💾
ata Pele-Ata 248K 💾
atb Zaiwa 291K 💾
atg Ivbie North-Okpela-Arhe 229K 💾
atq Aralle-Tabulahan 202K 💾
auy Awiyaana 164K 💾
av Avaric 111K 💾
avn Avatime 229K 💾
avt Au 263K 💾
avu Avokaya 391K 💾
awa Awadhi 211K 💾
awb Awa (Papua New Guinea) 179K 💾
ay Aymara 482K 💾
ayo Ayoreo 264K 💾
az Azerbaijani 3,413K 💾
azg San Pedro Amuzgos Amuzgo 271K 💾
azz Highland Puebla Nahuatl 265K 💾
ba Bashkir 666K 💾
ban Balinese 211K 💾
bao Waimaha 232K 💾
bav Vengo 250K 💾
bba Baatonum 792K 💾
bbb Barai 289K 💾
bbo Northern Bobo Madaré 211K 💾
bbr Girawa 245K 💾
bch Bariai 248K 💾
bcw Bana 304K 💾
bdd Bunama 171K 💾
be Belarusian 1,441K 💾
be-tarask Belarusian (Taraškievica) 108,431K 💾
bef Benabena 239K 💾
bep Besoa 204K 💾
bex Jur Modo 254K 💾
bfd Bafut 276K 💾
bfo Malba Birifor 260K 💾
bg Bulgarian 10,597K 💾
bgr Bawm Chin 213K 💾
bgz Banggai 186K 💾
bhl Bimin 324K 💾
bhw Biak 164K 💾
bi Bislama 315K 💾
bib Bissa 243K 💾
big Biangai 229K 💾
bik Central Bikol 183K 💾
bim Bimoba 215K 💾
biv Southern Birifor 221K 💾
bjr Binumarien 226K 💾
bjv Bedjond 268K 💾
bkl Berik 306K 💾
bku Buhid 204K 💾
bkv Bekwarra 244K 💾
blh Kuwaa 259K 💾
blt-Latn Tai Dam (Latin) 262K 💾
blz Balantak 199K 💾
bm Bambara 30K 💾
bmh Kein 253K 💾
bmq Bomu 207K 💾
bmr Muinane 122K 💾
bmu Somba-Siawari 234K 💾
bmv Bum 258K 💾
bn Bangla 7,258K 💾
bnj Eastern Tawbuid 239K 💾
bnp Bola 263K 💾
bo Tibetan 5,642K 💾
boa Bora 133K 💾
boj Anjam 255K 💾
bon Bine 244K 💾
bov Tuwuli 203K 💾
box Buamu 274K 💾
bpr Koronadal Blaan 204K 💾
bps Sarangani Blaan 214K 💾
bqc Boko 567K 💾
bqj Bandial 175K 💾
bqp Busa 162K 💾
bru Eastern Bru 261K 💾
bs Bosnian 8,993K 💾
bsn Barasana-Eduria 225K 💾
bss Akoose 199K 💾
btd Batak Dairi 192K 💾
bts Batak Simalungun 175K 💾
btt Bete-Bendi 266K 💾
btx Batak Karo 189K 💾
bua Buriat 143K 💾
bud Ntcham 207K 💾
buk Bugawac 264K 💾
bus Bokobaru 159K 💾
bvc Baelelea 308K 💾
bvz Bauzi 509K 💾
bwq Southern Bobo Madaré 214K 💾
bwu Buli 285K 💾
byr Baruya 182K 💾
byx Qaqet 387K 💾
bzh Mapos Buang 251K 💾
bzi Bisu 381K 💾
bzj Belize Kriol English 240K 💾
ca-valencia Valencian 24,295K 💾
caa Chortí 307K 💾
cab Garifuna 154K 💾
cac Chuj 244K 💾
cak Kaqchikel 259K 💾
cap Chipaya 154K 💾
car Galibi Carib 160K 💾
cax Chiquitano 149K 💾
cbc Carapana 256K 💾
cbi Chachi 187K 💾
cbl Bualkhaw Chin 210K 💾
cbr Cashibo-Cacataibo 236K 💾
cbs Cashinahua 198K 💾
cbt Chayahuita 150K 💾
cbv Cacua 265K 💾
cce Chopi 204K 💾
ccp Chakma 79K 💾
cdf Chiru 193K 💾
ce Chechen 669K 💾
ceb Cebuano 1,067K 💾
ceg Chamacoco 232K 💾
cfm Falam Chin 438K 💾
cgc Kagayanen 299K 💾
chj Ojitlán Chinantec 305K 💾
chm Mari 132K 💾
chr Cherokee 119K 💾
chz Ozumacín Chinantec 205K 💾
cjo Ashéninka Pajonal 141K 💾
cjp Cabécar 199K 💾
cjv Chuave 286K 💾
cko Anufo 272K 💾
cle Lealao Chinantec 313K 💾
cme Cerma 230K 💾
cmr Mro-Khimi Chin 275K 💾
cnh Hakha Chin 934K 💾
cni Asháninka 122K 💾
cnk Khumi Chin 237K 💾
cnl Lalana Chinantec 308K 💾
cnt Tepetotutla Chinantec 261K 💾
coe Koreguaje 181K 💾
cof Colorado 183K 💾
cok Santa Teresa Cora 230K 💾
con Cofán 151K 💾
cot Caquinte 128K 💾
crh Crimean Tatar 505K 💾
cs Czech 3,141K 💾
csk Jola-Kasa 177K 💾
cso Sochiapam Chinantec 328K 💾
ctd-Latn Tedim Chin (Latin) 852K 💾
ctu Chol 203K 💾
cub Cubeo 220K 💾
cuc Usila Chinantec 278K 💾
cui Cuiba 292K 💾
cuk San Blas Kuna 187K 💾
cul Culina 221K 💾
cv Chuvash 111K 💾
cwe Kwere 144K 💾
cwt Kuwaataay 168K 💾
cy Welsh 11,519K 💾
cya Nopala Chatino 245K 💾
czt Zotung Chin 227K 💾
da Danish 655K 💾
daa Dangaléat 208K 💾
dad Marik 197K 💾
dah Gwahatike 274K 💾
ddn Dendi 210K 💾
de German 46,431K 💾
ded Dedua 146K 💾
des Desano 210K 💾
dga Southern Dagaare 458K 💾
dgi Northern Dagara 257K 💾
dgz Daga 219K 💾
din Southwestern Dinka 196K 💾
dip Northeastern Dinka 193K 💾
djk Eastern Maroon Creole 307K 💾
dln Darlong 776K 💾
dnw Western Dani 254K 💾
dob Dobu 179K 💾
dop Lukpa 226K 💾
dsh Daasanach 211K 💾
dtb Labuk-Kinabatangan Kadazan 248K 💾
dtp Kadazan Dusun 1,038K 💾
dts Toro So Dogon 202K 💾
due Umiray Dumaget Agta 247K 💾
dug Duruma 172K 💾
duo Dupaninan Agta 266K 💾
dwr Dawro 254K 💾
dww Dawawa 208K 💾
dyi Djimini Senoufo 268K 💾
dyo Jola-Fonyi 158K 💾
dyu Dyula 1,156K 💾
dz Dzongkha 61K 💾
ee Ewe 421K 💾
eka Ekajuk 213K 💾
el Greek 5,470K 💾
emi Mussau-Emira 176K 💾
emp Northern Emberá 158K 💾
enb Markweeta 147K 💾
enq Enga 217K 💾
enx Enxet 772K 💾
eri Ogea 269K 💾
es Spanish 32,670K 💾
ese Ese Ejja 226K 💾
et Estonian 3,658K 💾
eu Basque 130K 💾
ewo Ewondo 158K 💾
eza Ezaa 963K 💾
fa Persian 9,114K 💾
fa-AF Dari 7,363K 💾
faa Fasu 238K 💾
fai Faiwol 256K 💾
fal South Fali 198K 💾
far Fataleka 286K 💾
fi Finnish 4,837K 💾
fil Tagalog 184K 💾
fip Fipa 134K 💾
fit Tornedalen Finnish 292K 💾
fj Fijian 257K 💾
fo Faroese 851K 💾
fon Fon 266K 💾
for Fore 169K 💾
fr French 5,488K 💾
fue Borgu Fulfulde 148K 💾
fuf Pular 174K 💾
fuq Central-Eastern Niger Fulfulde 156K 💾
fuv Nigerian Fulfulde 13K 💾
ga Irish 7,587K 💾
gag Gagauz 245K 💾
gah Alekano 210K 💾
gam Kandawo 250K 💾
gaw Nobonob 246K 💾
gbi Galela 288K 💾
gd Scottish Gaelic 17,105K 💾
gde Gude 217K 💾
gdn Umanakaina 306K 💾
gdr Wipi 271K 💾
gej Gen 236K 💾
gfk Patpatar 294K 💾
ghs Guhu-Samane 186K 💾
gil Gilbertese 228K 💾
gkn Gokana 267K 💾
gmv-Latn Gamo (Latin) 127K 💾
gn Guarani 142K 💾
gnd Zulgo-Gemzek 364K 💾
gng Ngangam 219K 💾
gnw Western Bolivian Guaraní 263K 💾
gof Gofa 124K 💾
gog Gogo 173K 💾
gor Gorontalo 211K 💾
gqr Gor 218K 💾
grb Northern Grebo 270K 💾
grt Garo 141K 💾
gso Southwest Gbaya 228K 💾
gsw-u-sd-chag Swiss German (Aargau) 99K 💾
gsw-u-sd-chbe Swiss German (Bern) 73K 💾
gsw-u-sd-chfr Swiss German (Fribourg) 42K 💾
gu Gujarati 702K 💾
gub Guajajára 997K 💾
guc Wayuu 211K 💾
gud Yocoboué Dida 216K 💾
guh Guahibo 204K 💾
gui Eastern Bolivian Guaraní 197K 💾
gum Guambiano 186K 💾
gun Mbyá Guaraní 176K 💾
guo Guayabero 203K 💾
guq Aché 184K 💾
gur Farefare 240K 💾
gux Gourmanchéma 215K 💾
gv Manx Gaelic 152K 💾
gvc Guanano 241K 💾
gvf Golin 276K 💾
gvl Gulay 270K 💾
gwr Gwere 157K 💾
gym Ngäbere 294K 💾
gyr Guarayu 176K 💾
ha Hausa 1,775K 💾
hae Eastern Oromo 163K 💾
hag Hanga 202K 💾
haw Hawaiian 2,221K 💾
hay Haya 112K 💾
heh Hehe 136K 💾
hi Hindi 10,004K 💾
hif Fiji Hindi 204K 💾
hig Kamwe 261K 💾
hil Hiligaynon 208K 💾
hla Halia 273K 💾
hne Chhattisgarhi 207K 💾
hnn Hanunoo 212K 💾
hns Caribbean Hindustani 312K 💾
ho Hiri Motu 240K 💾
hot Hote 222K 💾
hr Croatian 8,188K 💾
ht Haitian 1,101K 💾
hto Minica Huitoto 182K 💾
hu Hungarian 600K 💾
hub Huambisa 160K 💾
hui Huli 232K 💾
hus Huastec 236K 💾
huu Murui Huitoto 165K 💾
huv San Mateo Del Mar Huave 197K 💾
hvn Sabu 312K 💾
hy Armenian 25,972K 💾
ian Iatmul 224K 💾
iba Iban 179K 💾
icr Islander Creole English 248K 💾
id Indonesian 6,634K 💾
ifa Amganad Ifugao 810K 💾
ifb Batad Ifugao 835K 💾
ife Ifè 300K 💾
ifk Tuwali Ifugao 214K 💾
ifu Mayoyao Ifugao 258K 💾
ify Keley-I Kallahan 863K 💾
ig Igbo 13K 💾
ign Ignaciano 161K 💾
ik Inupiaq 96K 💾
ilo Iloko 169K 💾
imo Imbongu 280K 💾
inb Inga 151K 💾
ino Inoke-Yate 236K 💾
iou Tuma-Irumu 225K 💾
ipi Ipili 312K 💾
iri Irigwe 243K 💾
irk Iraqw 184K 💾
iry Iraya 205K 💾
it Italian 13,569K 💾
itv Itawit 242K 💾
iu Inuktitut 98K 💾
iws Sepik Iwam 307K 💾
izr Izere 216K 💾
izz Izii 908K 💾
ja Japanese 2,116K 💾
jac Popti' 221K 💾
jae Yabem 186K 💾
jam Jamaican Creole English 254K 💾
jbu Jukun Takum 264K 💾
jic Tol 285K 💾
jiv Shuar 134K 💾
jmc Machame 150K 💾
jun Juang 178K 💾
jv Javanese 177K 💾
jvn Caribbean Javanese 211K 💾
ka Georgian 4,978K 💾
kaa Kara-Kalpak 135K 💾
kab-Arab Kabyle (Arabic) 715K 💾
kab-Tfng Kabyle (Tifinagh) 1,338K 💾
kab Kabyle 66K 💾
kac Kachin 1,057K 💾
kao Xaasongaxango 205K 💾
kaq Capanahua 164K 💾
kbh Camsá 193K 💾
kbm Iwal 298K 💾
kbp Kabiyè 571K 💾
kbq Kamano 156K 💾
kbr Kafa 147K 💾
kcg Tyap 279K 💾
kdc Kutu 140K 💾
kdi Kumam 195K 💾
kdj Karamojong 163K 💾
kdn Kunda 144K 💾
kek Kekchí 406K 💾
ken Kenyang 200K 💾
keo Kakwa 215K 💾
ker Kera 267K 💾
kew West Kewa 247K 💾
kez Kukele 173K 💾
kgf Kube 175K 💾
kgr Abun 356K 💾
khz Keapara 196K 💾
kia Kim 525K 💾
kij Kilivila 155K 💾
kj Kuanyama 1,474K 💾
kjb Q'anjob'al 263K 💾
kje Kisar 235K 💾
kjh Khakas 128K 💾
kjs East Kewa 251K 💾
kk Kazakh 642K 💾
kki Kagulu 125K 💾
kkj Kako 263K 💾
kln Kalenjin 149K 💾
km Khmer 29,110K 💾
kma Konni 230K 💾
kmg Kâte 127K 💾
kmo Kwoma 213K 💾
kms Kamasau 293K 💾
kmu Kanite 214K 💾
kn Kannada 126K 💾
kne Kankanaey 230K 💾
knf Mankanya 164K 💾
knj Western Kanjobal 1,350K 💾
knk Kuranko 228K 💾
kno Kono 360K 💾
knv Tabo 243K 💾
kog Cogui 189K 💾
kpf Komba 174K 💾
kpg Kapingamarangi 967K 💾
kpr Korafe-Yegha 262K 💾
kpw Kobon 288K 💾
kpx Mountain Koiali 190K 💾
kpz Kupsabiny 166K 💾
kqc Doromu-Koki 209K 💾
kqe Kalagan 241K 💾
kqp Kimré 254K 💾
kqw Kandas 201K 💾
kqy Koorete 156K 💾
krc Karachay-Balkar 132K 💾
kri Krio 256K 💾
krj Kinaray-A 228K 💾
kru Kurukh 182K 💾
ksd Kuanua 228K 💾
ksr Borong 233K 💾
ktb Kambaata 113K 💾
ktj Plapo Krumen 356K 💾
kto Kuot 286K 💾
ku Kurdish 2,479K 💾
kub Kutep 281K 💾
kud ‘Auhelawa 167K 💾
kue Kuman (Papua New Guinea) 230K 💾
kum Kumyk 142K 💾
kup Kunimaipa 279K 💾
kus Kusaal 200K 💾
kv Komi 122K 💾
kvn Border Kuna 212K 💾
kwf Kwara'ae 296K 💾
kwi Awa-Cuaiquer 165K 💾
kwj Kwanga 290K 💾
kxc Konso 148K 💾
kxm Northern Khmer 257K 💾
ky Kyrgyz 18,597K 💾
kyc Kyaka 220K 💾
kyf Kouya 215K 💾
kyg Keyagana 190K 💾
kyq Kenga 250K 💾
kyu Western Kayah 466K 💾
kyz Kayabí 324K 💾
kze Kosena 164K 💾
kzf Da'a Kaili 213K 💾
kzj Coastal Kadazan 215K 💾
la Latin 48K 💾
laj Lango 175K 💾
las Lama 235K 💾
law Lauje 262K 💾
lb Luxembourgish 5,173K 💾
lcm Tungag 239K 💾
lee Lyélé 257K 💾
lef Lelemi 211K 💾
lem Nomaande 249K 💾
leu Kara (Papua New Guinea) 255K 💾
lew Ledo Kaili 198K 💾
lex Luang 271K 💾
lgg Lugbara 188K 💾
lhu Lahu 352K 💾
lia West-Central Limba 247K 💾
lid Nyindrou 308K 💾
lif Limbu 138K 💾
lip Sekpele 214K 💾
lis Lisu 304K 💾
ljp Lampung Api 188K 💾
lln Lele 291K 💾
lme Pévé 245K 💾
lmk Lamkang 217K 💾
lnd Lundayeh 670K 💾
lo Lao 4,384K 💾
lob Lobi 192K 💾
loe Saluan 220K 💾
lok Loko 264K 💾
lon Malawi Lomwe 137K 💾
lsi Lashi 1,077K 💾
lsm Saamia 156K 💾
lt Lithuanian 39,575K 💾
luc Aringa 242K 💾
lus Lushai 204K 💾
lv Latvian 1,020K 💾
lwo Luwo 255K 💾
maa San Jerónimo Tecóatl Mazatec 487K 💾
mad Madurese 706K 💾
mag Magahi 193K 💾
mai Maithili 211K 💾
maj Jalapa De Díaz Mazatec 188K 💾
mak Makasar 179K 💾
mam Mam 834K 💾
maw Mampruli 251K 💾
maz Central Mazahua 286K 💾
mbb Western Bukidnon Manobo 278K 💾
mbc Macushi 221K 💾
mbh Mangseng 321K 💾
mbt Matigsalug Manobo 226K 💾
mca Maca 208K 💾
mcb Machiguenga 132K 💾
mcd Sharanahua 200K 💾
mco Coatlán Mixe 217K 💾
mcp Makaa 237K 💾
mcq Ese 158K 💾
mcu Cameroon Mambila 260K 💾
mda Mada 312K 💾
mdy Male 589K 💾
med Melpa 283K 💾
mee Mengen 301K 💾
mej Meyah 323K 💾
mek Mekeo 234K 💾
men Mende 210K 💾
meq Merey 291K 💾
meu Motu 175K 💾
mfe Morisyen 172K 💾
mfh Matal 238K 💾
mfi Wandala 265K 💾
mfk North Mofu 248K 💾
mfq Moba 232K 💾
mfy Mayo 167K 💾
mfz Mabaan 237K 💾
mg Malagasy 1,623K 💾
mgd Moru 192K 💾
mgh Makhuwa-Meetto 150K 💾
mgo Meta' 251K 💾
mh Marshallese 750K 💾
mhi Ma'di 192K 💾
mhl Mauwake 235K 💾
mhx Maru 291K 💾
mhy Ma'anyan 190K 💾
mi Maori 1,504K 💾
mib Atatláhuca Mixtec 263K 💾
mif Mofu-Gudur 283K 💾
mil Peñoles Mixtec 365K 💾
min Minangkabau 242K 💾
mio Pinotepa Nacional Mixtec 288K 💾
miq Mískito 214K 💾
mit Southern Puebla Mixtec 273K 💾
mk Macedonian 10,422K 💾
mkl Mokole 230K 💾
ml Malayalam 118K 💾
mlh Mape 235K 💾
mlp Bargam 297K 💾
mmo Mangga Buang 269K 💾
mmx Madak 271K 💾
mna Mbula 257K 💾
mnb Muna 151K 💾
mnf Mundani 241K 💾
mnw Mon 1,836K 💾
moa Mwan 308K 💾
mog Mongondow 220K 💾
mop Mopán Maya 296K 💾
mor Moro 152K 💾
mox Molima 222K 💾
mpg Marba 210K 💾
mpm Yosondúa Mixtec 336K 💾
mps Dadibi 1,270K 💾
mpt Mian 256K 💾
mpx Misima-Panaeati 227K 💾
mqb Mbuko 302K 💾
mqj Mamasa 164K 💾
mqn Moronene 164K 💾
mr Marathi 16,594K 💾
mrw Maranao 912K 💾
ms Malay 659K 💾
msm Agusan Manobo 225K 💾
msy Aruamu 229K 💾
mt Maltese 3,331K 💾
mta Cotabato Manobo 262K 💾
mti Maiwa (Papua New Guinea) 166K 💾
mtj Moskona 321K 💾
mto Totontepec Mixe 233K 💾
mtp Wichí Lhamtés Nocten 183K 💾
muh Mündü 392K 💾
mur Murle 210K 💾
mux Bo-Ung 363K 💾
muy Muyang 265K 💾
mva Manam 231K 💾
mvp Duri 174K 💾
mwv Mentawai 141K 💾
mxb Tezoatlán Mixtec 281K 💾
mxt Jamiltepec Mixtec 267K 💾
my Burmese 1,007K 💾
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K 💾
myb Mbay 192K 💾
myk Mamara Senoufo 272K 💾
myv Erzya 143K 💾
myw Muyuw 150K 💾
myx Masaaba 164K 💾
myy Macuna 245K 💾
mza Santa María Zacatepec Mixtec 316K 💾
mzi Ixcatlán Mazatec 190K 💾
mzk Nigeria Mambila 283K 💾
mzm Mumuye 265K 💾
naf Nabak 220K 💾
nak Nakanai 333K 💾
nan-Latn Min Nan Chinese (Latin) 231K 💾
nas Naasioi 168K 💾
nca Iyo 203K 💾
nch Central Huasteca Nahuatl 195K 💾
ncj Northern Puebla Nahuatl 164K 💾
ncu Chumburung 312K 💾
ndj Ndamba 141K 💾
ndy Lutos 216K 💾
ndz Ndogo 350K 💾
neb Toura 326K 💾
new Newari 150K 💾
nfr Nafaanra 233K 💾
ngp Ngulu 149K 💾
nho Takuu 309K 💾
nhu Noone 270K 💾
nhw Western Huasteca Nahuatl 194K 💾
nhy Northern Oaxaca Nahuatl 185K 💾
nia Nias 182K 💾
nii Nii 316K 💾
nij Ngaju 194K 💾
nim Nilamba 117K 💾
nin Ninzo 267K 💾
nkf Inpui Naga 197K 💾
nko Nkonya 168K 💾
nl Dutch 58,357K 💾
nlc Nalca 241K 💾
nmz Nawdm 209K 💾
nnb Nande 127K 💾
nnq Ngindo 137K 💾
nnw Southern Nuni 291K 💾
noa Woun Meu 275K 💾
nog Nogai 104K 💾
nop Numanggang 183K 💾
not Nomatsiguenga 141K 💾
nou Ewage-Notu 266K 💾
npl Southeastern Puebla Nahuatl 148K 💾
npy Napu 192K 💾
nsn Nehan 248K 💾
nsu Sierra Negra Nahuatl 170K 💾
ntm Nateni 229K 💾
ntp Northern Tepehuan 173K 💾
ntr Delo 272K 💾
nuj Nyole 151K 💾
nus Nuer 195K 💾
nvm Namiae 290K 💾
nwb Nyabwa 316K 💾
nwi Southwest Tanna 230K 💾
ny Nyanja 356K 💾
nyf Giryama 169K 💾
nyn Nyankole 120K 💾
nyo Nyoro 120K 💾
nyy Nyakyusa-Ngonde 138K 💾
nzi Nzima 201K 💾
obo Obo Manobo 266K 💾
oc Occitan 2,706K 💾
oku Oku 239K 💾
okv Orokaiva 212K 💾
old Mochi 151K 💾
ong Olo 284K 💾
opm Oksapmin 332K 💾
or Oriya 175K 💾
os Ossetic 135K 💾
osa Osage 3K 💾
otd Ot Danum 187K 💾
ote Mezquital Otomi 251K 💾
ozm Koonzime 267K 💾
pa Punjabi 59,990K 💾
pab Parecís 156K 💾
pad Paumarí 242K 💾
pag Pangasinan 177K 💾
pah Tenharim 268K 💾
pam Pampanga 196K 💾
pau Palauan 255K 💾
pbc Patamona 181K 💾
pbi Parkwa 272K 💾
pck Paite Chin 770K 💾
pcm Nigerian Pidgin 315K 💾
pez Eastern Penan 235K 💾
pib Yine 114K 💾
pir Piratapuyo 229K 💾
pis Pijin 263K 💾
pjt Pitjantjatjara 237K 💾
pkb Pokomo 166K 💾
pl Polish 7,148K 💾
plw Brooke's Point Palawano 203K 💾
pmf Pamona 307K 💾
pny Pinyin 247K 💾
poh Poqomchi' 266K 💾
poi Highland Popoluca 179K 💾
poy Pogolo 147K 💾
ppk Uma 220K 💾
ppo Folopa 258K 💾
prf Paranan 203K 💾
prk Parauk 1,026K 💾
ps Pashto 7,343K 💾
pss Kaulong 326K 💾
pt Portuguese 20,891K 💾
pt-PT Portuguese (Portugal) 666K 💾
ptp Patep 294K 💾
ptu Bambam 194K 💾
pwg Gapapaiwa 208K 💾
pww Pwo Northern Karen 345K 💾
pxm Quetzaltepec Mixé 720K 💾
qu Quechua 580K 💾
qub Huallaga Huánuco Quechua 122K 💾
quc K'iche' 207K 💾
quf Lambayeque Quechua 161K 💾
quh South Bolivian Quechua 623K 💾
qul North Bolivian Quechua 140K 💾
qup Southern Pastaza Quechua 177K 💾
quw Tena Lowland Quichua 116K 💾
quy Ayacucho Quechua 106K 💾
qvc Cajamarca Quechua 166K 💾
qve Eastern Apurímac Quechua 168K 💾
qvi Imbabura Highland Quichua 146K 💾
qvm Margos-Yarowilca-Lauricocha Quechua 132K 💾
qvn North Junín Quechua 139K 💾
qvo Napo Lowland Quechua 117K 💾
qvs San Martín Quechua 153K 💾
qvw Huaylla Wanca Quechua 111K 💾
qvz Northern Pastaza Quichua 157K 💾
qwh Huaylas Ancash Quechua 128K 💾
qxh Panao Huánuco Quechua 123K 💾
qxl Salasaca Highland Quichua 127K 💾
qxn Northern Conchucos Ancash Quechua 150K 💾
qxo Southern Conchucos Ancash Quechua 136K 💾
qxr Cañar Highland Quichua 509K 💾
rai Ramoaaina 273K 💾
raj Malvi 198K 💾
rav Sampang 138K 💾
rej Rejang 178K 💾
rim Nyaturu 151K 💾
rm-puter Romansh (Puter) 1,068K 💾
rm-rumgr Romansh (Grischun) 4,794K 💾
rm-surmiran Romansh (Surmiran) 2,540K 💾
rm-sursilv Romansh (Sursilvan) 11,678K 💾
rm-sutsilv Romansh (Sutsilvan) 1,007K 💾
rm-vallader Romansh (Vallader) 5,560K 💾
rmc Carpathian Romani 170K 💾
rmo Sinte Romani 228K 💾
rn Rundi 120K 💾
rnl Ranglong 221K 💾
ro Romanian 13,962K 💾
ro-MD Moldavian 2,694K 💾
rom Vlax Romani 186K 💾
roo Rotokas 292K 💾
rro Waima 177K 💾
ru Russian 40,987K 💾
ruf Luguru 135K 💾
rug Roviana 956K 💾
rw Kinyarwanda 605K 💾
rwo Rawa 261K 💾
sab Buglere 405K 💾
sah Sakha 2,457K 💾
sas Sasak 196K 💾
sat Santali 149K 💾
sba Ngambay 246K 💾
sbl Botolan Sambal 251K 💾
sck Sadri 189K 💾
sda Toraja-Sa'dan 154K 💾
seh Sena 155K 💾
sey Secoya 163K 💾
sg Sango 265K 💾
sgb Mag-antsi Ayta 233K 💾
sgw Sebat Bet Gurage 116K 💾
sgz Sursurunga 327K 💾
shk Shilluk 189K 💾
shn Shan 1,435K 💾
shp Shipibo-Conibo 169K 💾
si Sinhala 1,046K 💾
sig Paasaal 277K 💾
sil Tumulung Sisaala 256K 💾
sim Mende (Papua New Guinea) 273K 💾
sja Epena 194K 💾
sk Slovak 70,933K 💾
sl Slovenian 10,975K 💾
sld Sissala 206K 💾
sll Salt-Yui 264K 💾
sm Samoan 248K 💾
smt Simte 177K 💾
sn Shona 2,542K 💾
snc Sinaugoro 216K 💾
snn Siona 222K 💾
snp Siane 237K 💾
snw Selee 212K 💾
sny Saniyo-Hiyewe 348K 💾
so Somali 874K 💾
soq Kanasi 213K 💾
soy Miyobe 205K 💾
spl Selepet 244K 💾
spp Supyire Senoufo 251K 💾
sps Saposa 324K 💾
sq Albanian 10,104K 💾
sr Serbian 4,785K 💾
sr-Latn Serbian (Latin) 10,143K 💾
sri Siriano 166K 💾
srm Saramaccan 369K 💾
srn Sranan Tongo 232K 💾
ssd Siroi 210K 💾
ssg Seimat 221K 💾
ssx Samberigi 233K 💾
stn Owa 263K 💾
su Sundanese 172K 💾
sua Sulka 458K 💾
sue Suena 227K 💾
sur Mwaghavul 261K 💾
sus Susu 205K 💾
suz Sunwar 732K 💾
sv Swedish 33,633K 💾
sw Swahili 8,817K 💾
swp Suau 175K 💾
sxn Sangir 209K 💾
ta Tamil 1,413K 💾
tab Tabassaran 132K 💾
taj Eastern Tamang 169K 💾
tap Taabwa 145K 💾
taq Tamasheq 218K 💾
tav Tatuyo 256K 💾
taw Tai 268K 💾
tbc Takia 278K 💾
tbg North Tairora 235K 💾
tbo Tawala 198K 💾
tby Tabaru 226K 💾
tbz Ditammari 692K 💾
tca Ticuna 251K 💾
tcc Datooga 135K 💾
te Telugu 574K 💾
ted Tepo Krumen 346K 💾
tem Timne 190K 💾
teo Teso 118K 💾
ter Tereno 187K 💾
tfr Teribe 228K 💾
tgo Sudest 216K 💾
tgp Tangoa 228K 💾
thk Tharaka 150K 💾
ti Tigrinya 803K 💾
tif Tifal 413K 💾
tih Timugon Murut 879K 💾
tik Tikar 264K 💾
tim Timbe 206K 💾
tk Turkmen 516K 💾
tlb Tobelo 209K 💾
tlf Telefol 422K 💾
tlj Talinga-Bwisi 159K 💾
tmc Tumak 245K 💾
tna Tacana 216K 💾
tnr Ménik 254K 💾
to Tonga 1,214K 💾
tob Toba 229K 💾
toc Coyutla Totonac 218K 💾
toh Gitonga 194K 💾
top Papantla Totonac 168K 💾
tos Highland Totonac 224K 💾
tpi Tok Pisin 8,049K 💾
tpm Tampulma 892K 💾
tpp Pisaflores Tepehua 162K 💾
tpt Tlachichilco Tepehua 173K 💾
tpz Tinputz 370K 💾
tqo Toaripi 215K 💾
tr Turkish 13,846K 💾
trs Chicahuaxtla Triqui 287K 💾
tsz Purepecha 129K 💾
tt Tatar 1,356K 💾
ttc Tektiteko 231K 💾
tte Bwanabwana 198K 💾
tue Tuyuca 141K 💾
tuf Central Tunebo 237K 💾
twb Western Tawbuid 198K 💾
twu Termanu 242K 💾
txa Tombonuo 224K 💾
txu Kayapó 354K 💾
tyv Tuvinian 614K 💾
tyz Tày 260K 💾
tzh Tzeltal 901K 💾
tzj Tz'utujil 245K 💾
ubr Ubir 222K 💾
ubu Umbu-Ungu 308K 💾
udm Udmurt 135K 💾
udu Uduk 287K 💾
ug Uyghur 9,493K 💾
uk Ukrainian 12,921K 💾
ur Urdu 3,622K 💾
ura Urarina 193K 💾
urb Urubú-Kaapor 347K 💾
urk Urak Lawoi' 368K 💾
ury Orya 301K 💾
usa Usarufa 171K 💾
usp Uspanteco 228K 💾
uvl Lote 277K 💾
uz Uzbek 131K 💾
vag Vagla 221K 💾
vec Venetian 2K 💾
vec-u-sd-itpd Venetian (Padua) 813K 💾
vec-u-sd-itts Venetian (Trieste) 12K 💾
vec-u-sd-itvr Venetian (Verona) 16K 💾
vid Vidunda 151K 💾
viv Iduna 220K 💾
vmw Makhuwa 130K 💾
vun Vunjo 141K 💾
vut Vute 206K 💾
waj Waffa 236K 💾
wap Wapishana 193K 💾
war Waray 208K 💾
way Wayana 143K 💾
wer Weri 209K 💾
wiu Wiru 232K 💾
wlx Wali 847K 💾
wmw Mwani 139K 💾
wnc Wantoat 238K 💾
wnu Usan 234K 💾
wob Wè Northern 270K 💾
wos Hanga Hundi 264K 💾
wrs Waris 213K 💾
wsk Waskia 239K 💾
wuv Wuvulu-Aua 187K 💾
wwa Waama 239K 💾
xal Kalmyk 135K 💾
xav Xavánte 440K 💾
xed Hdi 229K 💾
xla Kamula 230K 💾
xog Soga 127K 💾
xrb Eastern Karaboro 286K 💾
xsb Sambal 244K 💾
xsi Sio 319K 💾
xsm Kasem 604K 💾
xsr Sherpa 184K 💾
xsu Sanumá 408K 💾
xtd Diuxi-Tilantongo Mixtec 277K 💾
xtm Magdalena Peñasco Mixtec 335K 💾
xuo Kuo 306K 💾
yaa Yaminahua 204K 💾
yad Yagua 142K 💾
yal Yalunka 203K 💾
yam Yamba 277K 💾
yaz Lokaa 222K 💾
yby Yaweyuha 219K 💾
ycn Yucuna 202K 💾
yle Yele 298K 💾
yli Angguruk Yali 221K 💾
yml Iamalele 245K 💾
yo Yoruba 270K 💾
yon Yongkom 202K 💾
yrb Yareba 184K 💾
yre Yaouré 285K 💾
yss Yessan-Mayo 227K 💾
yua Yucateco 813K 💾
yuj Karkar-Yuri 258K 💾
yut Yopno 227K 💾
yuw Yau (Morobe Province) 243K 💾
yva Yawa 250K 💾
zaa Sierra de Juárez Zapotec 265K 💾
zad Cajonos Zapotec 180K 💾
zae Yareni Zapotec 248K 💾
zap Zapotec 194K 💾
zas Santo Domingo Albarradas Zapotec 184K 💾
zaw Mitla Zapotec 157K 💾
zca Coatecas Altas Zapotec 236K 💾
zia Zia 242K 💾
ziw Zigula 140K 💾
zlm Malay 664K 💾
zne Zande 253K 💾
zpc Choapan Zapotec 208K 💾
zpi Santa María Quiegolani Zapotec 209K 💾
zpq Zoogocho Zapotec 208K 💾
zpt San Vicente Coatlán Zapotec 229K 💾
zpz Texmelucan Zapotec 281K 💾
zyp Zyphe Chin 230K 💾

¹ Downloadable files include counts for each token; to get raw text, run the crawler yourself. For breaking text into words, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO.
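For readers who want rough token counts without ICU, the idea can be approximated with the Python standard library. This is a deliberately simplified stand-in and not the ICU word break iterator used for the published counts: it treats every run of letters as one token, so it splits words at apostrophes and combining marks and cannot segment scripts written without spaces. The name `count_tokens` is made up for this sketch.

```python
import re
from collections import Counter

# One or more word characters that are neither digits nor underscores,
# i.e. a run of letters. This only loosely mirrors UBRK_WORD_LETTER.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def count_tokens(text):
    """Count letter-only tokens in `text` (rough approximation, see above)."""
    return Counter(LETTER_RUN.findall(text))


if __name__ == '__main__':
    for token, count in count_tokens('Ko te reo, te reo 2021!').most_common():
        print(token, count)
```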

Running the Crawler

./corpuscrawler --language=yo --output=./corpus

corpuscrawler's People

Contributors

behnam, blackblitz, brawer, cash, cwd24, jimregan, keshan, kshithijiyer, mahalisyarifuddin, sffc, wannaphong

corpuscrawler's Issues

Thanlwintimes.com No Longer Available

The site thanlwintimes.com is one of the sites used by Corpus Crawler. However, the site no longer seems to be available; when I visit it, I see an error saying that the domain has expired:

http://thanlwintimes.com/

The site should be replaced with another site using the same encoding.

Add Pali, Mon, and Karen

We've been referred to the following sources for corpora in additional Myanmar-script languages.

  1. Pali (Tri Pitaka) [pi-Mymr]
    1. https://tipitaka.org/mymr/
  2. Mon [mnw]
    1. http://mon.monnews.org/
    2. https://mnw.wikipedia.org/wiki/မုက်လိက်တမ်
  3. Shan [shn] -- already included
    1. https://shannews.org/
    2. https://shn.wikipedia.org/wiki/ၼႃႈႁူဝ်ႁႅၵ်ႈ
  4. Karen [kar]
    1. http://karen.kicnews.org/
    2. https://wol.jw.org/ksw/wol/h/r350/lp-kr (Bible, Publications...)

CC @sven-oly

Does not run in python3.7 or python 2.7

$ python2 --version
Python 2.7.16+

$ python3 --version
Python 3.7.2+
$ python3 ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit:      http://listen.bible.is/robots.txt
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_tzh.py", line 21, in crawl
    crawl_bibleis(crawler, out, bible='TZHSBM')
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 776, in crawl_bibleis
    init = crawler.fetch(firsturl)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 136, in fetch
    if not self.is_fetch_allowed_by_robots_txt(url):
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 259, in is_fetch_allowed_by_robots_txt
    checker.parse(robots_txt.decode('utf-8'))
AttributeError: 'str' object has no attribute 'decode'

$ python2 ./corpuscrawler --language tzh --output output-tzh/
Traceback (most recent call last):
  File "./corpuscrawler", line 24, in <module>
    import corpuscrawler.main
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 20, in <module>
    from corpuscrawler import (
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_aaz.py", line 16, in <module>
    from corpuscrawler.util import crawl_bibleis
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 18, in <module>
    from builtins import open, bytes, chr
ImportError: No module named builtins
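The Python 3 traceback above fails because `.decode()` is called on a value that is already a `str`. A minimal sketch of a guard, assuming the robots.txt body may arrive as either bytes or text; the helper name `to_text` is hypothetical and not part of util.py:

```python
def to_text(data, encoding='utf-8'):
    """Return `data` as text, decoding only when it is a byte string."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data  # already text, nothing to decode


if __name__ == '__main__':
    print(to_text(b'User-agent: *'))
    print(to_text('User-agent: *'))
```

The Python 2 failure looks like a separate problem: on Python 2 the `builtins` module comes from the third-party `future` package, so it is probably just not installed in that environment.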

harfbuzz-testing-wikipedia

Hi Sascha,

Nice work! Here's the output of what roozbeh did for HarfBuzz testing by extracting Wikipedia:
https://github.com/behdad/harfbuzz-testing-wikipedia

I don't know if it's of much use. That one included all the talk pages of Wikipedia as well, so the word distribution is skewed; for example, the word for "User" is over-represented. Anyway, I thought I'd share it here for the record.

what sites are crawled?

I looked through the readme; is there a list of what sites are crawled by this script? Is there documentation for how to add additional sites?

Add (Modern Standard) Arabic language

Is there any work being done regarding any Arabic dialects?

We can start with http://www.dw.com/ar/, which is Modern Standard Arabic. I think MSA is a good start, and we can add regional dialects later.

Please list here any source you think we should add, for MSA or regional dialects.

Shorten project structure

Related to #80. A suggestion: move the core code up so that it is more visible, and keep the crawlers in their own folder.

  • Reorganize the project structure from:
corpuscrawler
├─ README.md
├─ LICENSE
├─ LICENSE.md
├─ CONTRIBUTING.md
├─ corpuscrawler
└─ Lib
   └─ corpuscrawler
      ├─ *.py : utilities
      └─ crawl_{iso}.py : crawlers

to

corpuscrawler
├─ README.md
├─ LICENSE
├─ LICENSE.md
├─ CONTRIBUTING.md
├─ corpuscrawler
├─ *.py : utilities
└─ crawlers
   └─ crawl_{iso}.py : crawlers

Would such a change break any toolchain that depends on the current layout?

Error when crawling Kaqchikel

Not sure what is going on here:

fran@ipek:~/source/corpuscrawler$ ./corpuscrawler --output ~/corpora/languages/kaqchikel/corpcrawl/ --language cak
Downloading:    http://listen.bible.is/robots.txt
Downloading:    http://listen.bible.is/CAKSBG/Matt/1
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_cak.py", line 21, in crawl
    crawl_bibleis(crawler, out, bible='CAKSBG')
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 718, in crawl_bibleis
    jsonraw = json.loads(content.split('var chaptersByBook = ')[1].split(';\n')[0])
IndexError: list index out of range
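The `IndexError` is raised when the downloaded page does not contain the `var chaptersByBook = ` marker, so `split(...)[1]` has nothing to return. Why the marker is missing for this Bible is not clear from the traceback. A defensive sketch that reports the missing marker instead of crashing; the function name `extract_chapters` and the choice to return `None` are assumptions made for this illustration:

```python
import json

MARKER = 'var chaptersByBook = '


def extract_chapters(content):
    """Return the parsed chaptersByBook data, or None if it is absent."""
    _, found, tail = content.partition(MARKER)
    if not found:
        return None  # marker missing: the page layout is not what we expect
    raw = tail.partition(';\n')[0]
    try:
        return json.loads(raw)
    except ValueError:
        return None  # marker present, but the payload is not valid JSON


if __name__ == '__main__':
    page = 'var chaptersByBook = {"Matt": 28};\n'
    print(extract_chapters(page))
    print(extract_chapters('<html>no chapter data here</html>'))
```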

Python 3 compatibility

When attempting to run corpuscrawler in Python 3, I get the error:

ImportError: No module named 'mimetools'

It would be good to support Python 3.

When running in Python 2, I get this other error:

AttributeError: 'module' object has no attribute 'SSLContext'

which apparently happens because that class was backported to Python 2 only very recently.

[gd] Extend Scottish Gaelic corpus

For Scottish Gaelic, https://dasg.ac.uk/text/ now contains plaintext files, which makes it easier to crawl than before. Some material is multilingual, but it’s already language-tagged with a custom tagging scheme using tags such as <eng> and <gai>. For example, https://dasg.ac.uk/text/68.txt has English sections that are marked up like this; a trivial regexp substitution should be able to remove the English sections:

Dh’fhosgail e i; is léugh e:
<eng>The Queen, who is lying very ill, urges your immediate attendance.
(Signed) Eveleyn Marlborough.<gai>
“Ma thig am Prionnsa,” thuirt e, ...

/cc @jimregan
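The substitution suggested in this issue could look roughly like the sketch below. It assumes that `<eng>` opens an English span and that the next `<gai>` switches back to Gaelic, which is what the quoted sample shows; how other tags or unterminated spans are used in the DASG files is not verified here. The function name `strip_english` is made up for this sketch.

```python
import re

# From <eng> up to and including the next <gai>, or to the end of the
# text if the English span is never closed.
ENGLISH_SPAN = re.compile(r"<eng>.*?(?:<gai>|\Z)", re.DOTALL)


def strip_english(text):
    """Remove <eng>...<gai> spans and any leftover <gai> tags."""
    return ENGLISH_SPAN.sub('', text).replace('<gai>', '')


if __name__ == '__main__':
    sample = (
        'Dh’fhosgail e i; is léugh e:\n'
        '<eng>The Queen, who is lying very ill, urges your immediate attendance.\n'
        '(Signed) Eveleyn Marlborough.<gai>\n'
        '“Ma thig am Prionnsa,” thuirt e, ...'
    )
    print(strip_english(sample))
```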

[mi] Filter out English text

The crawled Maori language corpus contains some English text. By filtering out every paragraph that contains the string "the", we’d remove 1902 lines from the corpus, most of which are English-language quotes. I believe this would be an improvement. @jimregan, would you be fine with this change?

Here are a few sample paragraphs that would be removed:

Hei tā Whakaruru, “That's what the King believed his mother signed was to give us the means to help. We want to put on the table lands, resources and people.”

The doting parents of Whakatōhea descendant Kayla Imrie have won a free ticket to the Olympics.

Hei tā Kiripatea, “This is a primary model of one of the backyard gardens that we have done and we did it over a weekend.”

Thanks to Volkswagen, they will get a free ticket to watch their daughter perform with the NZ K4 Women's Kayaking crew at Rio.

The first-time Olympian is currently in Portugal training alongside her K4 team, Aimee Fisher, Caitlin Ryan and Jamie Lovett, for the upcoming world cup in Germany this month.

Ko tā Marama Davidson, mema o te Pāti Kākāriki, “We are still in decline when it comes to those who can speak te reo. So I think it's only 3.7% of people in Aotearoa can have a conversation in te reo. We know that over 77% of children are not enrolled in any subject at schools at the moment and we know that half of all schools in Aotearoa have absolutely no student learning te reo or taking a te reo subject.”
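A sketch of the proposed filter, for discussion. It matches "the" as a whole word and ignores case, which is an assumption on my part: the proposal only says "contains the string", and a plain substring test would behave differently. Mixed paragraphs, such as the quoted ones that open in Maori and continue in English, are dropped as a whole. The function name `drop_english_paragraphs` is hypothetical.

```python
import re

# Whole-word, case-insensitive match for the English article "the".
ENGLISH_MARKER = re.compile(r"\bthe\b", re.IGNORECASE)


def drop_english_paragraphs(paragraphs):
    """Keep only the paragraphs that do not contain the word "the"."""
    return [p for p in paragraphs if not ENGLISH_MARKER.search(p)]


if __name__ == '__main__':
    sample = [
        'The doting parents of Whakatōhea descendant Kayla Imrie have won '
        'a free ticket to the Olympics.',
        'Kia ora koutou katoa.',
    ]
    for paragraph in drop_english_paragraphs(sample):
        print(paragraph)
```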

Allow Zawgyi crawling separate from my

In order to crawl my-t-d0-zawgyi.txt, you have to wait for my.txt to finish. It would be nicer if the two files could be crawled in parallel. For example, one could start crawling my.txt, my-t-d0-zawgyi.txt, and shn.txt at the same time and have data after just a few minutes; the data files can keep growing as more pages get crawled.

Improve readme documentation on how to provide a new crawler

This /CONTRIBUTING.md is a license agreement and code of conduct to sign. As far as I can see, this very valuable project has no actual tutorial.

I don't have the Python and coding knowledge to fix this documentation issue, but I can map out the road so that it becomes easier for the next person to do so.

Wanted

If a user wants to add a language such as Catalan from Barcelona (ca, cat: missing), what do they need to jump in quickly? What should they provide?

  • What is the local structure:
    • util.py: stores functions used by multiple language crawlers.
    • main.py: stores the 1000+ crawler calls and runs them all.
    • crawl_{iso}.py: stores the language-specific corpus source URLs and processing functions.
  • What tools do I have?
  • What input(s): a Python list of URLs?
  • What are the classic parts of a crawler function?
  • What output format: raw text? Is HTML fine because an HTML tag stripper is applied afterwards?
  • Example of easily hackable base code.
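As a starting point for such base code, here is a self-contained sketch of the general shape a crawl_{iso}.py module seems to have, judging from the tracebacks in other issues (a crawl(crawler) function that gets an output file and fetches pages). Everything else is an assumption: FakeCrawler, FetchResult, the content attribute, strip_tags and the example.org URLs are hypothetical stand-ins so the example runs without the real util.py.

```python
import io
import re
from collections import namedtuple

# Hypothetical stand-ins for the real Crawler and its fetch result.
FetchResult = namedtuple("FetchResult", ["status", "content"])


class FakeCrawler:
    def __init__(self, pages):
        self.pages = pages            # url -> html, instead of real HTTP
        self.out = io.StringIO()      # instead of a real output file

    def get_output(self, language=None):
        return self.out

    def fetch(self, url):
        if url in self.pages:
            return FetchResult(200, self.pages[url])
        return FetchResult(404, "")


def strip_tags(html):
    """Very rough tag stripper, for illustration only."""
    return " ".join(re.sub(r"<[^>]+>", "", html).split())


def crawl(crawler):
    """Sketch of a language-specific crawl function for Catalan (ca)."""
    out = crawler.get_output(language="ca")
    for url in ["https://example.org/ca/1", "https://example.org/ca/2"]:
        doc = crawler.fetch(url)
        if doc.status != 200:
            continue                  # skip missing pages instead of crashing
        for para in re.findall(r"<p>(.*?)</p>", doc.content, flags=re.DOTALL):
            text = strip_tags(para)
            if text:
                out.write(text + "\n")


crawler = FakeCrawler({"https://example.org/ca/1": "<html><p>Bon <b>dia</b>!</p><p> </p></html>"})
crawl(crawler)
print(crawler.out.getvalue())
```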

API (to complete)

Functions defined in util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help with a sub-section or a single item.

Some tools

  • daterange(start, end): __
  • urlpath(url): __
  • urlencode(url): __

Main element

  • class Crawler(object):
    • __init__(self, language, output_dir, cache_dir, crawldelay): __
    • get_output(self, language=None): __
    • close(self): __
    • fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
    • fetch_content(self, url, allow_404=False): __
    • fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
    • is_fetch_allowed_by_robots_txt(self, url): __
    • crawl_pngscriptures_org(self, out, language): __
    • _find_urls_on_pngscriptures_org(self, language): __
    • crawl_abc_net_au(self, out, program_id): __
    • crawl_churchio(self, out, bible_id): __
    • crawl_aps_dz(self, out, prefix): __
    • crawl_sverigesradio(self, out, program_id): __
    • crawl_voice_of_america(self, out, host, ignore_ascii=False): __
    • set_context(self, context): __

Some crawlers for multi-language sites

  • crawl_bbc_news(crawler, out, urlprefix): __
  • crawl_korero_html(crawler, out, project, genre, filepath): __
  • write_paragraphs(et, out): __
  • crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
  • crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
  • crawl_sputnik_news(crawler, out, host): __
  • crawl_udhr(crawler, out, filename): __
  • crawl_voice_of_nigeria(crawler, out, urlprefix): __
  • crawl_bibleis(crawler, out, bible): __
  • crawl_tipitaka(crawler, out, script): __
  • find_wordpress_urls(crawler, site, **kwargs): __

Some cleaners

  • unichar(i): __
  • replace_html_entities(html): __
  • cleantext(html): __
  • clean_paragraphs(html): __
  • extract(before, after, html): __
  • fixquotes(s): __

Shorter way to do so

In-code comments can do a lot, and so can pointers to wisely chosen sections. If you have the required know-how, please add comments to a chosen existing crawler and point to it as an in-code tutorial.

@sffc, @brawer: could anyone help with that?

Adding New URLs

Hi,

Can we fetch data from URLs not mentioned in the existing code by adding custom functions? Also, is English not supported ('en' is not mentioned anywhere in the list of supported languages)?

Thanks

404 error with Myanmar Zawgyi

I ran ./corpuscrawler --language=my-t-d0-zawgyi --output=./corpus (with python 2.7 on Ubuntu 18.04) and the program crashed while downloading from some url. The output is shown below.

Downloading:    http://thanlwintimes.com/robots.txt
Downloading:    http://thanlwintimes.com/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/5/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/6/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/5/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/6/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/page/2/
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/crawl_my_t_d0_zawgyi.py", line 24, in crawl
    _crawl_than_lwin_times(crawler, out)
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/crawl_my_t_d0_zawgyi.py", line 28, in _crawl_than_lwin_times
    urls = find_wordpress_urls(crawler, 'http://thanlwintimes.com/')
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/util.py", line 767, in find_wordpress_urls
    assert pgdoc.status == 200, (pgdoc.status, pgurl)
AssertionError: (404, u'http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/page/2/')
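The log shows the crawler walking category/…/page/N/ URLs and asserting that every page returns 200. One possible way to make this tolerant is to treat a 404 as "past the last page". The sketch below is not the actual find_wordpress_urls implementation; the function name, the fetch callable returning a (status, html) pair, and max_pages are all hypothetical.

```python
def collect_category_pages(fetch, category_url, max_pages=1000):
    """Fetch page 1, 2, 3, ... of a category until a 404 is returned.

    fetch is any callable taking a URL and returning (status, html).
    """
    pages = []
    for number in range(1, max_pages + 1):
        if number == 1:
            url = category_url
        else:
            url = "%spage/%d/" % (category_url, number)
        status, html = fetch(url)
        if status == 404:
            break                      # ran past the last page: stop quietly
        if status != 200:
            raise RuntimeError("unexpected status %d for %s" % (status, url))
        pages.append((url, html))
    return pages


# Tiny in-memory site with two pages in one category.
site = {
    "http://example.org/category/news/": "<p>page 1</p>",
    "http://example.org/category/news/page/2/": "<p>page 2</p>",
}


def fake_fetch(url):
    return (200, site[url]) if url in site else (404, "")


print(collect_category_pages(fake_fetch, "http://example.org/category/news/"))
```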

Crawl Pali corpora

It would be nice to add crawlers for https://www.tipitaka.org to Corpus Crawler. This is the Buddhist Tipitaka in the Pali language, written in various scripts. The crawled corpora will be useful for testing Pali transliteration rules, contributed to Unicode CLDR by @mjansche.

Update Zawgyi locale to Qaag

CLDR is proposing the script code Qaag to use for Zawgyi text:

Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration.

corpuscrawler should be updated to use the new script code instead of the -u-s0-zawgyi workaround. Myanmar Tools will need to also be updated to consume the new script code.

Add Wikipedia crawler ? (300+ languages)

A quick search shows that Corpus Crawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on the Wikipedia API (1) or using existing server-side tools (2).

Assess interest

  1. Assess how many Wikipedia languages are not in UNILEX. See unicode-org/unilex#14 .
  2. Assess the quality of Wikipedia raw text data in minority languages.
  3. Compare the gain to other available public corpora such as Tatoeba (358 languages).

Crawling via API

By loading the available list of articles per Wikipedia and then scraping the site. If the list is too large, it could be limited to max=n articles.

Given an ISO code such as Ndonga's ng:

  • download the "List of page titles in main namespace" archive (see below)
  • load the article titles into a Python list
  • code a crawler in /Lib/corpuscrawler/util.py, following other crawlers as examples, which queries the Wikipedia API, extracts the valuable text, and saves it (Python)
  • update the relevant crawlers in /Lib/corpuscrawler/
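The API step above could be sketched as follows. This assumes the TextExtracts module (prop=extracts with explaintext) is available on the target wiki, which is the case on Wikipedia as far as I know; the two function names are hypothetical, and no request is actually sent here.

```python
import json
from urllib.parse import urlencode


def extract_url(iso, title):
    """Build a MediaWiki API URL asking for the plain text of one article."""
    params = {
        "action": "query",
        "prop": "extracts",      # provided by the TextExtracts extension
        "explaintext": 1,        # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (iso, urlencode(params))


def extracts_from_response(body):
    """Pull the article texts out of a JSON response body."""
    pages = json.loads(body).get("query", {}).get("pages", {})
    return [page["extract"] for page in pages.values() if page.get("extract")]


print(extract_url("ng", "Main Page"))

# Hand-written response in the shape the API returns, for illustration.
sample_body = json.dumps({
    "query": {"pages": {
        "1": {"pageid": 1, "title": "X", "extract": "Some text."},
        "-1": {"title": "Y", "missing": ""},
    }}
})
print(extracts_from_response(sample_body))
```

A real crawler would fetch these URLs through the crawler's own fetch method so that robots.txt handling and the crawl delay still apply.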

Wikipedia API provides text

Various formats available:

  • format : The format of the output.
    • json : Output data in JSON format.
    • jsonfm : Output data in JSON format (pretty-print in HTML).
    • none : Output nothing.
    • php : Output data in serialised PHP format.
    • phpfm : Output data in serialised PHP format (pretty-print in HTML).
    • rawfm : Output data, including debugging elements, in JSON format (pretty-print in HTML).
    • xml : Output data in XML format.
    • xmlfm : Output data in XML format (pretty-print in HTML).

List of Wikipedia (~300)

List of articles per Wikipedia

For convenience, I use the tiny Ndonga (ng) Wikipedia (8 articles), which is easier to explore by hand.

For a larger demo, you could also inspect similar URLs with the ISO code of:

Language Native iso Articles
Ndonga Oshiwambo ng 8
Inuktitut ᐃᓄᒃᑎᑐᑦ/inuktitut iu 514
Samoan Gagana Samoa sm 985
Igbo Igbo ig 2,085
Central Bikol Bikol Central bcl 10,824

Namespaces

On all wikis. See also here

  • 0: (main)
  • 1: Talk:
  • 2: User:
  • 3: User_talk:

Dumps' & paths

Using Wikipedia extractors ?

Hybrid approach

  • ISO: get the list of all local wikis' ISO codes.
  • Downloads: loop over each language code and download the dump.
  • Extract: use the extractor above, then zip each language.
  • Cloud: put the text result online.
  • Crawl: in util.py, code a simple crawler which fetches just that .zip, converts it back to text content, and adds it to the corpora.
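For the download step, the dump locations can be derived from the language code. The sketch below assumes the usual dumps.wikimedia.org naming (database name = code with hyphens replaced by underscores, plus "wiki"); treat both the pattern and the function name as assumptions to verify against the dump index before relying on them.

```python
def dump_urls(iso):
    """Return likely dump URLs for one Wikipedia language edition."""
    # Assumption: database names use underscores, e.g. zh-min-nan -> zh_min_nanwiki.
    db = iso.replace("-", "_") + "wiki"
    base = "https://dumps.wikimedia.org/%s/latest/%s-latest-" % (db, db)
    return {
        "titles": base + "all-titles-in-ns0.gz",       # page titles, main namespace
        "articles": base + "pages-articles.xml.bz2",   # article wikitext
    }


for code in ["ng", "iu", "sm"]:
    print(code, dump_urls(code)["titles"])
```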

cc: @brawer

Portuguese: doubt about the corpus result

I was analyzing the output file and realized that the text for each news item contains only the title, the headline, and the first paragraph. Is that correct?
I'm using the crawler for the "pt" language.

Undefined names

% flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

./corpuscrawler/Lib/corpuscrawler/crawl_mi.py:62:39: F821 undefined name 'sitemap'
        if pubdate is None: pubdate = sitemap[url]
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_kab.py:53:48: F821 undefined name 'url'
        assert doc.status == 200, (doc.status, url)
                                               ^
./corpuscrawler/Lib/corpuscrawler/crawl_tpi.py:48:48: F821 undefined name 'url'
        assert doc.status == 200, (doc.status, url)
                                               ^
./corpuscrawler/Lib/corpuscrawler/crawl_shn.py:90:30: F821 undefined name 'striptags'
                p = ' '.join(striptags(replace_html_entities(p)).split())
                             ^
./corpuscrawler/Lib/corpuscrawler/crawl_shn.py:90:40: F821 undefined name 'replace_html_entities'
                p = ' '.join(striptags(replace_html_entities(p)).split())
                                       ^
./corpuscrawler/Lib/corpuscrawler/crawl_ga.py:147:39: F821 undefined name 'fetchresult'
        if pubdate is None: pubdate = fetchresult.headers.get('Last-Modified')
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_th.py:25:5: F821 undefined name 'crawl_bibleis'
    crawl_bibleis(crawler, out, bible='THATSV')
    ^
./corpuscrawler/Lib/corpuscrawler/crawl_vec.py:43:48: F821 undefined name 'start_url'
        assert doc.status == 200, (doc.status, start_url)
                                               ^
8     F821 undefined name 'fetchresult'
8

https://flake8.pycqa.org/en/latest/user/error-codes.html

On the flake8 test selection: this PR does not focus on "style violations" (the majority of flake8 error codes, which psf/black can autocorrect). Instead, these tests focus on runtime safety and correctness:

  • E9 tests are about Python syntax errors usually raised because flake8 can not build an Abstract Syntax Tree (AST). Often these issues are a sign of unused code or code that has not been ported to Python 3. These would be compile-time errors in a compiled language but in a dynamic language like Python, they result in the script halting/crashing on the user.
  • F63 tests are usually about the confusion between identity and equality in Python. Comparing str, bytes, and int literals with is/is not instead of ==/!= is the classic case. These are areas where a == b is True but a is b is False (or vice versa). Python >= 3.8 will raise SyntaxWarnings on these instances.
  • F7 tests logic errors and syntax errors in type hints
  • F82 tests are almost always undefined names which are usually a sign of a typo, missing imports, or code that has not been ported to Python 3. These also would be compile-time errors in a compiled language but in Python, a NameError is raised which will halt/crash the script on the user.
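To illustrate why the F821 findings matter, here is a small self-contained sketch modelled on the `assert doc.status == 200, (doc.status, url)` lines reported above. The function names are hypothetical. The undefined name sits in the assertion message, so it is only evaluated when the check fails, which is exactly when a useful error is needed.

```python
def check_status_buggy(status, start_url):
    # 'url' is not defined anywhere: flake8 reports F821 here. The message
    # expression is only evaluated when the assertion fails.
    assert status == 200, (status, url)  # noqa: F821


def check_status_fixed(status, start_url):
    assert status == 200, (status, start_url)


def outcome(check, status):
    """Return the name of the exception raised by check, or None."""
    try:
        check(status, "http://example.org/")
    except Exception as error:
        return type(error).__name__
    return None


print(outcome(check_status_buggy, 200))   # no failure, bug stays hidden
print(outcome(check_status_buggy, 404))   # NameError masks the real problem
print(outcome(check_status_fixed, 404))   # AssertionError with status and URL
```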

crawler gets hung after downloading a few hits

I am trying to use this crawler to build an Urdu corpus. I am running Ubuntu 18.04 inside a VMware virtual machine. The crawler starts and successfully downloads a few links but eventually hangs permanently. Nothing happens until I press ctrl-c to exit the script. I can kill the script and start it again, and it will successfully get the link it hung on in the previous run; it will then crawl a few more before hanging again. The text copied below is an example of what I get when I kill the script with ctrl-c.

...(the crawler has successfully downloaded several links so far)...
Downloading: https://www.bbc.com/urdu/entertainment-37527961
Downloading: https://www.bbc.com/urdu/entertainment-37529481
Downloading: https://www.bbc.com/urdu/entertainment-37531642
Downloading: https://www.bbc.com/urdu/entertainment-37532975
^CTraceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/crawl_ur.py", line 22, in crawl
    crawl_bbc_news(crawler, out, urlprefix='/urdu/')
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 475, in crawl_bbc_news
    fetchresult = crawler.fetch(url)
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 150, in fetch
    content = response.read()
  File "/usr/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 597, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 772, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 659, in read
    v = self._sslobj.read(len)
KeyboardInterrupt
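The traceback ends in a blocking socket read, which suggests the request has no timeout. A possible mitigation, sketched here for Python 3 with urllib.request rather than the Python 2.7 code in the traceback, is to pass a timeout so a stalled read raises instead of hanging. The function name and the 30-second default are assumptions.

```python
import urllib.request


def fetch_with_timeout(url, timeout=30.0):
    """Return the response body, or None if the request fails or stalls.

    The timeout applies to each blocking socket operation (connect, read),
    so a server that stops sending data no longer hangs the crawler.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # Covers URLError, HTTPError, socket timeouts and connection resets.
        return None
```

A caller could retry a URL that returned None a limited number of times, since the report says the same link succeeds on the next run. A server that keeps trickling a few bytes at a time can still take longer than the timeout overall.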

No module named 'corpuscrawler' error

The script doesn't run with Python 3.

Shows error :

1234

For solving this I have tried changing this:

12345

to: checker.parse(robots_txt) as it is already decoded in Python 3 and it worked for me.

After this I am getting error:

123456

  • When running main.py directly, I am getting this error:

ModuleNotFoundError: No module named 'corpuscrawler'

Can anybody help solving this?
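A likely cause, judging from the paths in the tracebacks of other issues (…/Lib/corpuscrawler/main.py), is that the package lives under Lib/, and running main.py directly does not put Lib/ on the module search path. The sketch below shows one way to add it; the function name is hypothetical and this is an assumption about the cause, not a confirmed fix.

```python
import os
import sys


def add_lib_to_path(repo_root):
    """Put <repo_root>/Lib first on sys.path so 'import corpuscrawler' can resolve."""
    lib_dir = os.path.join(os.path.abspath(repo_root), "Lib")
    if lib_dir not in sys.path:
        sys.path.insert(0, lib_dir)
    return lib_dir


# Usage, from a script placed in the repository root:
# add_lib_to_path(os.path.dirname(os.path.abspath(__file__)))
# import corpuscrawler.main
```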

Rename crawl_taq to crawl_kab

The language which we currently call taq is actually Kabyle (BCP47: kab). Our use of taq came from the website taq.tamurt.info, but their taq stands for the word “taqbaylit,” which means Kabyle in Berber.

Use available corpora for opensubtitles (63 languages)

Research

  • J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

Gain

Closest to natural spoken-language corpora.

Links

  • Portal
    • bre.txt.gz -- Breton corpus.
    • 60+ languages available.
    • List: af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
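Reading such a file needs very little code. The sketch below assumes the monolingual download (for example bre.txt.gz) is UTF-8 text with one sentence per line; the function name is hypothetical.

```python
import gzip


def iter_sentences(path):
    """Yield non-empty lines from a gzip-compressed UTF-8 text file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


# Usage, once the file has been downloaded:
# for sentence in iter_sentences("bre.txt.gz"):
#     print(sentence)
```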

There are ready-to-download open licence Wikipedia corpora available.

Project introduction: OpenSubtitles 2016/2018

  • Type: subtitles; parallel sentences; monolingual sentences
  • Languages (2024): 75
  • Portal all: Portal
  • Language specific: br&en
  • Download link: bre (mono)
  • Comments:
    • Source: P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
    • Licence: unclear; "The corpora is made freely available to the research community on the OPUS website" (Lison and Tiedemann, 2016).
