
spider's Introduction

Overview

This is an open source, multi-threaded website crawler written in Python. There is still a lot of work to do, so feel free to help out with development.


Note: This is part of an open source search engine. The purpose of this tool is to gather links only. The analytics, data harvesting, and search algorithms are being created as separate programs.

Links

spider's People

Contributors

buckyroberts, htmllama, keatinge, lwgray, oscarereyes, srinath29, tedmx


spider's Issues

SSL certificate error on urlopen

I'm getting the error below when crawling an https:// website:

<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
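One common workaround, sketched below, is to pass an explicit SSL context to urlopen(). Note that disabling verification is insecure; installing up-to-date CA certificates (for example via the certifi package, or macOS's "Install Certificates.command") is the safer fix.

    import ssl
    from urllib.request import urlopen

    # Build a context that skips certificate verification -- only acceptable if
    # you understand and accept the security risk.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    response = urlopen('https://example.com/', context=context)
    html = response.read()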

The logic that determines whether a to-be-crawled URL belongs to the project's domain needs to be fixed

In the current version, the check that a to-be-crawled URL belongs to the project's domain uses the following code:

if Spider.domain_name not in url:

However, this is not always correct. Consider the URL below:

https://twitter.com/intent/tweet?text=Videos&url=http%3A%2F%2Fmie.umass.edu%2Fvideos&original_referer=

The project's domain name is mie.umass.edu. This URL's domain is certainly not mie.umass.edu, yet the if Spider.domain_name not in url check still treats it as in-domain, because the string "mie.umass.edu" appears inside its query parameters.

This could be fixed by using the get_domain_name or get_sub_domain_name function to extract the URL's actual domain name, instead of relying on a simple in check.
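A minimal sketch of the suggested fix, comparing the parsed hostname instead of doing a substring test (the project's own get_domain_name()/get_sub_domain_name() helpers could be plugged in the same way):

    from urllib.parse import urlparse

    def has_project_domain(url, domain_name):
        # True only if the URL's hostname is the project domain or one of its subdomains
        host = urlparse(url).netloc.lower()
        domain = domain_name.lower()
        return host == domain or host.endswith('.' + domain)

    # The Twitter URL from the example above is now correctly rejected:
    url = ('https://twitter.com/intent/tweet?text=Videos'
           '&url=http%3A%2F%2Fmie.umass.edu%2Fvideos&original_referer=')
    print(has_project_domain(url, 'mie.umass.edu'))   # False

In add_links_to_queue() the check would then become: if not has_project_domain(url, Spider.domain_name): continue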

Decode Error while writing to a file

I got this error when trying to write a file:

'utf-8' codec can't decode byte 0xb7 in position 10239: invalid start byte

I did some searching on Google, and most people suggested using decode("utf-8", "ignore") or decode("utf-8", "replace").

Is there any other way to solve this problem?

Thanks
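For reference, a minimal sketch of the two usual options inside gather_links(), assuming response is the object returned by urlopen(page_url):

    from urllib.request import urlopen

    response = urlopen('https://example.com/')
    html_bytes = response.read()

    # Option 1: keep assuming UTF-8 but tolerate bad bytes
    html_string = html_bytes.decode('utf-8', errors='replace')   # or errors='ignore'

    # Option 2: trust the charset the server declares, falling back to UTF-8
    charset = response.headers.get_content_charset() or 'utf-8'
    html_string = html_bytes.decode(charset, errors='replace')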

Installed PyCharm and ran the project, error "ImportError: No module named queue"

C:\Python27\python.exe C:/Users/user/Downloads/Spider-master/main.py
Traceback (most recent call last):
  File "C:/Users/user/Downloads/Spider-master/main.py", line 2, in <module>
    from queue import Queue
ImportError: No module named queue
Process finished with exit code 1

Then I tried installing the module
Executed command: pip install queue

Error: Could not find a version that satisfies the requirement queue (from versions: )
No matching distribution found for queue

What did I do wrong? Or, Bucky, could you please update the documentation from a new user's perspective on setting up the whole thing?
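The module is named Queue in Python 2 and queue in Python 3; the project targets Python 3, so running main.py with a Python 3 interpreter fixes this. If you must stay on Python 2.7, a small compatibility sketch for the top of main.py:

    try:
        from queue import Queue      # Python 3
    except ImportError:
        from Queue import Queue      # Python 2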

big issue

Hi,
thanks for the code, but it has a big problem: if we are in the middle of crawling and want to stop the process to resume later, there is a chance that we stop exactly in the middle of rewriting the queue file and lose all of that data.
What should we do to fix this?
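One way to avoid losing the queue when the process is stopped mid-write is to write to a temporary file and swap it into place atomically. A sketch of a crash-safer set_to_file(), assuming the same signature and one-URL-per-line format as the helper in general.py:

    import os

    def set_to_file(links, file_name):
        # Write to a temp file first, then atomically replace the real file, so a
        # kill in the middle of the write never leaves a half-empty queue.txt.
        tmp_name = file_name + '.tmp'
        with open(tmp_name, 'w') as f:
            for link in sorted(links):
                f.write(link + '\n')
        os.replace(tmp_name, file_name)   # atomic on the same filesystem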

"RecursionError"

I get this bug at some point in the scraping.
"RecursionError: maximum recursion depth exceeded while calling a Python object"

How can I overcome this issue and let the crawler keep doing its job?
Thanks a lot
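The crawl() and create_jobs() functions in main.py call each other recursively (as a later issue in this list also notes), so a long crawl eventually exhausts Python's call stack. The quick workaround is to raise the interpreter's recursion limit near the top of main.py; the more robust fix would be to rewrite that pair as a plain while-loop so the stack never grows.

    import sys
    sys.setrecursionlimit(100000)   # quick workaround; an iterative loop is the robust fix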

Help (it is not an issue)

Dear engineers,

I am writing from Turkey. I am a student working on a spider project, and I have some problems with it. Could you please help me? I saw your commits and found your spider project for crawling the web.

I need a crawler that works on the web and saves data to txt files, but it must also keep the meta tags. I tried to add this feature to your project, but something always goes wrong.

I want to keep the meta tags for descriptions, and save the descriptions together with the links into the txt folder, because I want to search over the txt folder like a small search engine.

The project has two parts: the first is only a crawler that collects links and descriptions and saves them into the txt folder; the second part is a txt-only search engine, like Google.

Could you help me create this project? Thank you.
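A small sketch (not part of this project) of how the meta description could be collected in the same spirit as link_finder.py, i.e. with an html.parser subclass; the class and attribute names here are made up for illustration:

    from html.parser import HTMLParser

    class MetaFinder(HTMLParser):
        # Collects the content of <meta name="description" content="..."> tags.
        def __init__(self):
            super().__init__()
            self.description = ''

        def handle_starttag(self, tag, attrs):
            if tag == 'meta':
                attrs = dict(attrs)
                if (attrs.get('name') or '').lower() == 'description':
                    self.description = attrs.get('content') or ''

    # Usage: feed it the same html_string the spider already downloads.
    # finder = MetaFinder()
    # finder.feed(html_string)
    # print(finder.description)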

Spider indentation bug?

Maybe it's only on my computer, but at line 67 of the Spider code I get the error "Unindent does not match any outer indentation level".

from urllib.request import urlopen
from link_finder import LinkFinder
from general import *


class Spider:

    # class variables (shared among all instances)
    project_name = ''
    base_url = ''
    domain_name = ''
    queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.queue_file = Spider.project_name + '/queue.txt'
        Spider.crawled_file = Spider.project_name + 'crawled.txt'
        self.boot()
        self.crawl_page('First spider', Spider.base_url)

    @staticmethod
    def boot():
        create_project_dir(Spider.project_name)
        create_data_files(Spider.project_name, Spider.base_url)
        Spider.queue = file_to_set(Spider.queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)

    @staticmethod
    def crawl_page(thread_name, page_url):
        if page_url not in Spider.crawled:
            print(thread_name + ' now crawling ' + page_url)
            print('Queue' + str(len(Spider.queue)) + ' crawled ' + str(len(Spider.crawled)))
            Spider.add_links_to_queue(Spider.gather_link(page_url))
            Spider.queue.remove(page_url)
            Spider.crawled.add(page_url)
            Spider.update_files()

    @staticmethod
    def gather_links(page_url):
        html_string = ''
        try:
            response = urlopen(page_url)
            if response.getheader('Content-Type') == 'text/html':
                html_bytes = response.read()
                html_string = html_bytes.decode("utf-8")
            finder = LinkFinder(Spider.base_url)
            finder.feed(html_string)
        except:
            print('Error: can not crawl page')
            return set()
        return finder.page_links()

    @staticmethod
    def add_links_to_queue(links):
        for url in links:
            if url in Spider.queue:
                continue
            if url in Spider.crawled:
                continue
            if Spider.domain_name not in url:
                continue
           Spider.queue.add(url)   # <- indented one space short of the loop body; this is the "unindent" error

    @staticmethod
    def update_file():
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)

HTTP Error 403: Forbidden

I'm trying to crawl a website that I've crawled before using Bucky's web crawler code from the Python tutorials, which uses BeautifulSoup. However, when I try to crawl the same website using this code, I get Error 403. Please help. The website is HTTPS; is that the problem?

Would using BeautifulSoup help?
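A likely cause is that many sites return 403 for urllib's default user agent. A sketch of sending a browser-like User-Agent, which would replace the bare urlopen(page_url) call inside gather_links() (the exact header value is just an example):

    from urllib.request import Request, urlopen

    req = Request('https://example.com/',
                  headers={'User-Agent': 'Mozilla/5.0 (compatible; Spider/1.0)'})
    response = urlopen(req)
    html = response.read()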

external links

I'm new to Python, so excuse my ignorance...
You know how the spider finds all internal links on a page and stores them in the queue file to be crawled later for more links, and so on. I'm trying to make the spider also save all unique external links in another file called external. Any help?
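A sketch of one way to do it (the external set and external.txt file name are assumptions, not existing project code); in spider.py this logic would extend Spider.add_links_to_queue(), diverting off-domain links instead of discarding them:

    external = set()   # remembers external links already written, to keep them unique

    def add_links_to_queue(links, domain_name, queue, crawled, external_file):
        for url in links:
            if url in queue or url in crawled:
                continue
            if domain_name not in url:
                if url not in external:
                    external.add(url)
                    with open(external_file, 'a') as f:   # separate file for external links
                        f.write(url + '\n')
                continue
            queue.add(url)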

Error in the spider page

My gather_links() is throwing an exception, and it isn't crawling the files. Could anyone solve this error? Maybe it's because of an updated version of Python. I am coding in the PyCharm editor.

UnicodeEncodeError: 'charmap' codec can't encode character

Hey Bucky!
I tried a few things but couldn't get this sorted out. I am trying to crawl links from Zomato. It works fine most of the time, but after a while it stops and throws the error shown in the snapshot below. Could you help me get this sorted out?

[screenshot of the UnicodeEncodeError traceback]
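'charmap' is the default cp1252 codec on Windows, so this usually means the crawler is printing or writing a character that cp1252 cannot represent. A sketch of the usual fixes, assuming the file writing happens in general.py's helpers:

    # 1) Open every data file with an explicit encoding:
    with open('queue.txt', 'w', encoding='utf-8') as f:
        f.write('any unicode text, e.g. café\n')

    # 2) For print() output on the Windows console, set the environment variable
    #    PYTHONIOENCODING=utf-8 before running main.py, or avoid printing raw page text.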

the gather links if statement issue

When the function tests whether the page is an HTML file, response.getheader('Content-Type') often returns text/html; UTF-8, so I changed the if statement to this: if 'text/html' in response.getheader('Content-Type'):
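For reference, the changed check sketched with a guard for a missing header (getheader() returns None when the server sends no Content-Type at all):

    from urllib.request import urlopen

    response = urlopen('https://example.com/')
    content_type = response.getheader('Content-Type') or ''
    if 'text/html' in content_type:          # matches 'text/html; charset=UTF-8' too
        html_string = response.read().decode('utf-8')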

cannot see the response

We cannot see the response from the links. If we get a 404 error, we may want to add that link to a separate file so we can identify broken links on the site.
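A sketch of recording the HTTP status instead of swallowing it, roughly where gather_links() calls urlopen(); the broken_links.txt file name is an assumption:

    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    def open_or_log(page_url, broken_file='broken_links.txt'):
        try:
            return urlopen(page_url)
        except HTTPError as e:                 # 404, 403, 500, ...
            with open(broken_file, 'a') as f:
                f.write('{} {}\n'.format(e.code, page_url))
        except URLError as e:                  # DNS failures, refused connections, ...
            with open(broken_file, 'a') as f:
                f.write('ERR {} ({})\n'.format(page_url, e.reason))
        return None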

Question regarding web crawler

Dear Mr. Bucky Roberts
I follow your channel on YouTube, and I appreciate your useful lessons on Python programming. I wonder if you can guide me on two questions about designing a web crawler. They may be simple and straightforward, but as a beginner I don't know the answers.
1. There are two tools in Python for parsing a web page, Beautiful Soup and HTMLParser. Which one do you recommend?
2. After designing the web crawler in Python, how can we use it on a website? In other words, how can we connect our website to this search engine?

subdomains

Is it possible to choose to crawl only a subdomain?
e.g. "subdomain.domain.com" instead of "domain.com"
Thank you.
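A sketch of one way: compare the full hostname instead of the bare domain, so only the subdomain (and its pages) is accepted; the get_sub_domain_name() helper mentioned in an earlier issue could be used for the same purpose.

    from urllib.parse import urlparse

    def in_subdomain(url, sub_domain='subdomain.domain.com'):
        return urlparse(url).netloc.lower() == sub_domain

    # In Spider.add_links_to_queue(): skip the link unless in_subdomain(url) is True.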

problem in crawling

Where should I put all the links to crawl? In which module? Please help.

html

What should I put in html_string?

html_string = ''

It's in the main module.
Please help.

'super' object has no attribute 'init'
How can I resolve this?
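This error usually means the call was written as super().init() without the double underscores (GitHub formatting often swallows them). A minimal sketch of a LinkFinder constructor in Python 3; the base_url/page_url parameters follow what other issues here describe, and self.links is an assumption:

    from html.parser import HTMLParser

    class LinkFinder(HTMLParser):
        def __init__(self, base_url, page_url):
            super().__init__()        # both pairs of double underscores are required
            self.base_url = base_url
            self.page_url = page_url
            self.links = set()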

Not crawling all links?

I pulled the source code straight down from this repo and decided to test it on Twitter. I edited main.py to suit my needs (just the project name and URL), and it only crawls one single URL. If you go to twitter.com while not logged in, you can see that there are pages under the root domain. Why isn't this working?

Crawling Domain Subdirectory

I need to crawl just a subdirectory of a website.

In "main.py" I changed "HOMEPAGE" to the following:

 HOMEPAGE = 'http://www.example.com/subdirectory/'

After I ran "main.py" I noticed there is a link in "queue.txt" that goes to "www.example.com" (not the subdirectory).

How could I alter the code to only crawl the subdirectory?

Thanks!
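A sketch of one way to restrict the crawl (HOMEPAGE mirrors the constant in main.py): keep the existing domain check and additionally require every queued URL to start with the subdirectory prefix.

    HOMEPAGE = 'http://www.example.com/subdirectory/'

    def in_subdirectory(url):
        return url.startswith(HOMEPAGE)

    # In Spider.add_links_to_queue():
    #     if not in_subdirectory(url):
    #         continue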

convert between python 3.5 -> 2.7

Hello,

When using Python 2.7 and you try super().__init__(), which is perfectly fine in Python 3.5, you get:
TypeError: super() takes at least 1 argument (0 given)
Can someone tell me how to rewrite this so it works in Python 2.7?

TIA,

TLT9116
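In Python 2.7, super() needs explicit arguments, and because Python 2's HTMLParser is an old-style class, calling the parent class directly is the safer spelling (super(LinkFinder, self).__init__() only works with new-style classes). A sketch for link_finder.py under Python 2:

    from HTMLParser import HTMLParser      # the module is named HTMLParser in Python 2

    class LinkFinder(HTMLParser):
        def __init__(self, base_url, page_url):
            HTMLParser.__init__(self)      # instead of Python 3's super().__init__()
            self.base_url = base_url
            self.page_url = page_url
            self.links = set()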

why error report:No module named queue

Why do I get the following error when I run this command?

$ python main.py

Traceback (most recent call last):
  File "main.py", line 2, in <module>
    from queue import Queue
ImportError: No module named queue

What can I do?

New Links to Crawl

Hey Bucky!

In trying to crawl "http://learnpythonthehardway.org/", I found that it was not successfully crawling new pages because the URL it built was incorrect. For example, it was trying to crawl "http://learnpythonthehardway.org/book/ex47.html" using "http://learnpythonthehardway.org/ex47.html", which of course did not exist. I found that changing link_finder.py line 18 from "url = parse.urljoin(self.base_url, value)" to "url = parse.urljoin(self.page_url, value)" solved this issue. I'm super new to Python and programming in general, so hopefully you can let me know if I'm on the right track or doing something wrong. Also, I really appreciate your videos!
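For anyone else hitting this, a quick demonstration of why the change works: relative links have to be joined against the page they appear on, not the site root (the ex46.html page below is just a hypothetical example):

    from urllib import parse

    base_url = 'http://learnpythonthehardway.org/'
    page_url = 'http://learnpythonthehardway.org/book/ex46.html'
    href = 'ex47.html'   # a relative link found on that page

    print(parse.urljoin(base_url, href))   # http://learnpythonthehardway.org/ex47.html       (broken)
    print(parse.urljoin(page_url, href))   # http://learnpythonthehardway.org/book/ex47.html  (correct)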

Error while writing file: set changed size during iteration

Received the following error while running the code:

[screenshot of the "set changed size during iteration" traceback]

I believe the error occurs because each thread tries to access the same file while writing. I also have one doubt about the function below:

def update_files():
    set_to_file(Spider.queue_file, Spider.queue)
    set_to_file(Spider.crawled_file, Spider.crawled)

Is Spider.queue shared by all the threads, or does each thread get its own Spider.queue when the thread is created and Spider.crawl_page() is called?
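Spider.queue and Spider.crawled are class variables, so every thread shares the same two sets; the error appears when one thread iterates a set to write it to disk while another thread adds or removes URLs. A sketch of one way to make the write safe (the lock is an addition, not existing project code): snapshot each set while holding a lock, and have the threads hold the same lock whenever they mutate the sets.

    import threading

    lock = threading.Lock()
    queue, crawled = set(), set()

    def update_files(queue_file, crawled_file):
        # Copy the shared sets under the lock so no other thread mutates them mid-iteration.
        with lock:
            queue_snapshot = set(queue)
            crawled_snapshot = set(crawled)
        with open(queue_file, 'w') as f:
            for url in sorted(queue_snapshot):
                f.write(url + '\n')
        with open(crawled_file, 'w') as f:
            for url in sorted(crawled_snapshot):
                f.write(url + '\n')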

Can't reach the hierarchy

Hello,
I am on thenewboston python tutorial 27 i.e. How to Build a Web Crawler (3/3)
I am stuck at getting the string value of the products on the website:

url = 'https://www.patanjaliayurved.net/latest-products/'+str(page)

The source code on the website for each product inside the product page looks roughly like this:

[HTML snippet: a div with class "product-detail-section" containing an h3 with PRODUCT_NAME_1 and the line "By: COMPANY_NAME"]

The code I am using for this is:

    for item_name in soup.findAll('div', {'class': 'product-detail-section'}):
        print(item_name.string)

Output:
None
None
None...

Another attempt:

    for item_name in soup.findAll('h3'):
        if item_name.parent.name == 'div':
            print(item_name.string)
Output:
PRODUCT_NAME_1
None
Information
Categories
Get In Touch
PRODUCT_NAME_2
None
Information
Categories
Get In Touch...

Here I want ONLY the product names, not 'None', 'Information', etc. as shown above.
Can someone please help resolve this problem?
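.string returns None whenever a tag has nested children, which is why most of the results are None. A sketch using get_text() and restricting the search to headings inside the product block, assuming the same requests + BeautifulSoup setup as the tutorial and that the product name's h3 sits inside the product-detail-section div shown above:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://www.patanjaliayurved.net/latest-products/1')
    soup = BeautifulSoup(page.text, 'html.parser')

    for h3 in soup.select('div.product-detail-section h3'):
        name = h3.get_text(strip=True)     # collects text even from nested tags
        if name:                           # skip empty headings instead of printing None
            print(name)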

"Brute force stop"

Good morning. I was wondering whether you know where I could insert a brute-force stop limit for the number of pages scraped by the crawler.
Thanks a lot.
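A sketch of a hard page cap (MAX_PAGES is an assumption, not an existing setting); in spider.py the check would sit at the top of crawl_page(), so no new pages are processed once the limit is reached:

    MAX_PAGES = 500
    crawled = set()

    def crawl_page(thread_name, page_url):
        if len(crawled) >= MAX_PAGES:
            return                  # brute-force stop: ignore any further pages
        # ... normal crawl logic ...
        crawled.add(page_url)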

Maximum recursion depth exceeded

Hi all!
I tried to use the spider on a site with about 8,000 pages.
While the spider was crawling, at a certain point an error stopped execution, reporting:

RuntimeError: maximum recursion depth exceeded

I worked around this error by adding the following statements to main.py:

    import sys
    sys.setrecursionlimit(100000)

Maximum recursion depth exceeded

I tried crawling one website, but after it printed:

queue1|crawled1
Error: cannot crawl page
1links in the queue
Thread-1crawling set(['set([\'set(["set([\\\'https://www.cracked.com/\\\'])"])\'])'])
queue1|crawled2
Error: cannot crawl page "

it kept calling crawl and create_jobs (I guess they call each other) many times, then from there called file_to_set, and then reported that the recursion depth was exceeded.
How can I fix this?
