
spider's Introduction

Overview

This is an open source, multi-threaded website crawler written in Python. There is still a lot of work to do, so feel free to help out with development.


Note: This is part of an open source search engine. The purpose of this tool is to gather links only. The analytics, data harvesting, and search algorithms are being created as separate programs.

Links

spider's People

Contributors

buckyroberts, htmllama, keatinge, lwgray, oscarereyes, srinath29, tedmx


spider's Issues

SSL certificate error on urlopen

I'm getting the error below when crawling an https:// website:

<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
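One common workaround, sketched below, is to pass an explicit SSL context to urlopen(). Note that disabling verification is insecure; installing up-to-date CA certificates (for example via the certifi package, or macOS's "Install Certificates.command") is the safer fix.

    import ssl
    from urllib.request import urlopen

    # Build a context that skips certificate verification -- only acceptable if
    # you understand and accept the security risk.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    response = urlopen('https://example.com/', context=context)
    html = response.read()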

The logic that determines whether a to-be-crawled URL belongs to the project's domain needs to be fixed

In the current version, the check that a to-be-crawled URL belongs to the project's domain uses the following code:

if Spider.domain_name not in url:

However, this is not always correct. Consider the URL below:

https://twitter.com/intent/tweet?text=Videos&url=http%3A%2F%2Fmie.umass.edu%2Fvideos&original_referer=

The project's domain name is mie.umass.edu. This URL's domain is certainly not mie.umass.edu, yet the if Spider.domain_name not in url check still treats it as in-domain, because the string "mie.umass.edu" appears inside its query parameters.

This could be fixed by using the get_domain_name or get_sub_domain_name function to extract the URL's actual domain name, instead of relying on a simple in check.
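A minimal sketch of the suggested fix, comparing the parsed hostname instead of doing a substring test (the project's own get_domain_name()/get_sub_domain_name() helpers could be plugged in the same way):

    from urllib.parse import urlparse

    def has_project_domain(url, domain_name):
        # True only if the URL's hostname is the project domain or one of its subdomains
        host = urlparse(url).netloc.lower()
        domain = domain_name.lower()
        return host == domain or host.endswith('.' + domain)

    # The Twitter URL from the example above is now correctly rejected:
    url = ('https://twitter.com/intent/tweet?text=Videos'
           '&url=http%3A%2F%2Fmie.umass.edu%2Fvideos&original_referer=')
    print(has_project_domain(url, 'mie.umass.edu'))   # False

In add_links_to_queue() the check would then become: if not has_project_domain(url, Spider.domain_name): continue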

Decode Error while writing to a file

I got this error when trying to write a file:

'utf-8' codec can't decode byte 0xb7 in position 10239: invalid start byte

I did some searching on Google, and most people suggested using decode("utf-8", "ignore") or decode("utf-8", "replace").

Is there any other way to solve this problem?

Thanks
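For reference, a minimal sketch of the two usual options inside gather_links(), assuming response is the object returned by urlopen(page_url):

    from urllib.request import urlopen

    response = urlopen('https://example.com/')
    html_bytes = response.read()

    # Option 1: keep assuming UTF-8 but tolerate bad bytes
    html_string = html_bytes.decode('utf-8', errors='replace')   # or errors='ignore'

    # Option 2: trust the charset the server declares, falling back to UTF-8
    charset = response.headers.get_content_charset() or 'utf-8'
    html_string = html_bytes.decode(charset, errors='replace')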

Installed PyCharm and ran the project, error "ImportError: No module named queue"

C:\Python27\python.exe C:/Users/user/Downloads/Spider-master/main.py
Traceback (most recent call last):
  File "C:/Users/user/Downloads/Spider-master/main.py", line 2, in <module>
    from queue import Queue
ImportError: No module named queue
Process finished with exit code 1

Then I tried installing the module
Executed command: pip install queue

Error: Could not find a version that satisfies the requirement queue (from versions: )
No matching distribution found for queue

What did I do wrong? Or, Bucky, could you please update the documentation from a new user's perspective on setting up the whole thing?
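The module is named Queue in Python 2 and queue in Python 3; the project targets Python 3, so running main.py with a Python 3 interpreter fixes this. If you must stay on Python 2.7, a small compatibility sketch for the top of main.py:

    try:
        from queue import Queue      # Python 3
    except ImportError:
        from Queue import Queue      # Python 2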

big issue

Hi,
thanks for the code, but it has a big problem: if we are in the middle of crawling and want to stop the process to resume later, there is a chance that we stop exactly in the middle of rewriting the queue file and lose all of that data.
What should we do to fix this?
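One way to avoid losing the queue when the process is stopped mid-write is to write to a temporary file and swap it into place atomically. A sketch of a crash-safer set_to_file(), assuming the same signature and one-URL-per-line format as the helper in general.py:

    import os

    def set_to_file(links, file_name):
        # Write to a temp file first, then atomically replace the real file, so a
        # kill in the middle of the write never leaves a half-empty queue.txt.
        tmp_name = file_name + '.tmp'
        with open(tmp_name, 'w') as f:
            for link in sorted(links):
                f.write(link + '\n')
        os.replace(tmp_name, file_name)   # atomic on the same filesystem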

"RecursionError"

I get this bug at some point in the scraping.
"RecursionError: maximum recursion depth exceeded while calling a Python object"

How can I overcome this issue and let the crawler keep doing its job?
Thanks a lot
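The crawl() and create_jobs() functions in main.py call each other recursively (as a later issue in this list also notes), so a long crawl eventually exhausts Python's call stack. The quick workaround is to raise the interpreter's recursion limit near the top of main.py; the more robust fix would be to rewrite that pair as a plain while-loop so the stack never grows.

    import sys
    sys.setrecursionlimit(100000)   # quick workaround; an iterative loop is the robust fix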

Help (it is not an issue)

Dear engineers,

I am writing from Turkey. I am a student working on a spider project, and I have some problems with it. Could you please help me? I saw your commits and found your spider project for crawling the web.

I need a crawler that works on the web and saves data to txt files, but it must also keep the meta tags. I tried to add this feature to your project, but something always goes wrong.

I want to keep the meta tags for descriptions, and save the descriptions together with the links into the txt folder, because I want to search over the txt folder like a small search engine.

The project has two parts: the first is only a crawler that collects links and descriptions and saves them into the txt folder; the second part is a txt-only search engine, like Google.

Could you help me create this project? Thank you.
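A small sketch (not part of this project) of how the meta description could be collected in the same spirit as link_finder.py, i.e. with an html.parser subclass; the class and attribute names here are made up for illustration:

    from html.parser import HTMLParser

    class MetaFinder(HTMLParser):
        # Collects the content of <meta name="description" content="..."> tags.
        def __init__(self):
            super().__init__()
            self.description = ''

        def handle_starttag(self, tag, attrs):
            if tag == 'meta':
                attrs = dict(attrs)
                if (attrs.get('name') or '').lower() == 'description':
                    self.description = attrs.get('content') or ''

    # Usage: feed it the same html_string the spider already downloads.
    # finder = MetaFinder()
    # finder.feed(html_string)
    # print(finder.description)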

Spider indentation bug?

Maybe it's only on my computer, but at line 67 of the Spider code I get the error "Unindent does not match any outer indentation level".

from urllib.request import urlopen
from link_finder import LinkFinder
from general import *


class Spider:

    # class variables (shared among all instances)
    project_name = ''
    base_url = ''
    domain_name = ''
    queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.queue_file = Spider.project_name + '/queue.txt'
        Spider.crawled_file = Spider.project_name + 'crawled.txt'
        self.boot()
        self.crawl_page('First spider', Spider.base_url)

    @staticmethod
    def boot():
        create_project_dir(Spider.project_name)
        create_data_files(Spider.project_name, Spider.base_url)
        Spider.queue = file_to_set(Spider.queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)

    @staticmethod
    def crawl_page(thread_name, page_url):
        if page_url not in Spider.crawled:
            print(thread_name + ' now crawling ' + page_url)
            print('Queue' + str(len(Spider.queue)) + ' crawled ' + str(len(Spider.crawled)))
            Spider.add_links_to_queue(Spider.gather_link(page_url))
            Spider.queue.remove(page_url)
            Spider.crawled.add(page_url)
            Spider.update_files()

    @staticmethod
    def gather_links(page_url):
        html_string = ''
        try:
            response = urlopen(page_url)
            if response.getheader('Content-Type') == 'text/html':
                html_bytes = response.read()
                html_string = html_bytes.decode("utf-8")
            finder = LinkFinder(Spider.base_url)
            finder.feed(html_string)
        except:
            print('Error: can not crawl page')
            return set()
        return finder.page_links()

    @staticmethod
    def add_links_to_queue(links):
        for url in links:
            if url in Spider.queue:
                continue
            if url in Spider.crawled:
                continue
            if Spider.domain_name not in url:
                continue
           Spider.queue.add(url)   # <- indented one space short of the loop body; this is the "unindent" error

    @staticmethod
    def update_file():
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)

HTTP Error 403: Forbidden

I'm trying to crawl a website that I've crawled before using Bucky's web crawler code from the Python tutorials, which uses BeautifulSoup. However, when I try to crawl the same website using this code, I get Error 403. Please help. The website is HTTPS; is that the problem?

Would using BeautifulSoup help?
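A likely cause is that many sites return 403 for urllib's default user agent. A sketch of sending a browser-like User-Agent, which would replace the bare urlopen(page_url) call inside gather_links() (the exact header value is just an example):

    from urllib.request import Request, urlopen

    req = Request('https://example.com/',
                  headers={'User-Agent': 'Mozilla/5.0 (compatible; Spider/1.0)'})
    response = urlopen(req)
    html = response.read()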

external links

I'm new to Python, so excuse my ignorance...
You know how the spider finds all internal links on a page and stores them in the queue file to be crawled later for more links, and so on. I'm trying to make the spider also save all unique external links in another file called external. Any help?
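A sketch of one way to do it (the external set and external.txt file name are assumptions, not existing project code); in spider.py this logic would extend Spider.add_links_to_queue(), diverting off-domain links instead of discarding them:

    external = set()   # remembers external links already written, to keep them unique

    def add_links_to_queue(links, domain_name, queue, crawled, external_file):
        for url in links:
            if url in queue or url in crawled:
                continue
            if domain_name not in url:
                if url not in external:
                    external.add(url)
                    with open(external_file, 'a') as f:   # separate file for external links
                        f.write(url + '\n')
                continue
            queue.add(url)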

Error in the spider page

My gather_links() is throwing an exception, and it isn't crawling the files. Could anyone solve this error? Maybe it's because of an updated version of Python. I am coding in the PyCharm editor.

UnicodeEncodeError: 'charmap' codec can't encode character

Hey Bucky!
I tried a few things but couldn't get this sorted out. I am trying to crawl links from Zomato. It works fine most of the time, but after a while it stops and throws the error shown in the snapshot below. Could you help me get this sorted out?

[screenshot of the UnicodeEncodeError traceback]
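'charmap' is the default cp1252 codec on Windows, so this usually means the crawler is printing or writing a character that cp1252 cannot represent. A sketch of the usual fixes, assuming the file writing happens in general.py's helpers:

    # 1) Open every data file with an explicit encoding:
    with open('queue.txt', 'w', encoding='utf-8') as f:
        f.write('any unicode text, e.g. café\n')

    # 2) For print() output on the Windows console, set the environment variable
    #    PYTHONIOENCODING=utf-8 before running main.py, or avoid printing raw page text.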

the gather links if statement issue

When the function tests whether the page is an HTML file, response.getheader('Content-Type') often returns text/html; UTF-8, so I changed the if statement to this: if 'text/html' in response.getheader('Content-Type'):
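For reference, the changed check sketched with a guard for a missing header (getheader() returns None when the server sends no Content-Type at all):

    from urllib.request import urlopen

    response = urlopen('https://example.com/')
    content_type = response.getheader('Content-Type') or ''
    if 'text/html' in content_type:          # matches 'text/html; charset=UTF-8' too
        html_string = response.read().decode('utf-8')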

cannot see the response

We cannot see the response from the links. If we get a 404 error, we may want to add that link to a separate file so we can identify broken links on the site.
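A sketch of recording the HTTP status instead of swallowing it, roughly where gather_links() calls urlopen(); the broken_links.txt file name is an assumption:

    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    def open_or_log(page_url, broken_file='broken_links.txt'):
        try:
            return urlopen(page_url)
        except HTTPError as e:                 # 404, 403, 500, ...
            with open(broken_file, 'a') as f:
                f.write('{} {}\n'.format(e.code, page_url))
        except URLError as e:                  # DNS failures, refused connections, ...
            with open(broken_file, 'a') as f:
                f.write('ERR {} ({})\n'.format(page_url, e.reason))
        return None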

Question regarding web crawler

Dear Mr. Bucky Roberts
I follow your channel on YouTube, and I appreciate your useful lessons on Python programming. I wonder if you can guide me on two questions about designing a web crawler. They may be simple and straightforward, but as a beginner I don't know the answers.
1. There are two tools in Python for parsing a web page, Beautiful Soup and HTMLParser. Which one do you recommend?
2. After designing the web crawler in Python, how can we use it on a website? In other words, how can we connect our website to this search engine?

subdomains

Is it possible to choose to crawl only a subdomain?
e.g. "subdomain.domain.com" instead of "domain.com"
Thank you.
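A sketch of one way: compare the full hostname instead of the bare domain, so only the subdomain (and its pages) is accepted; the get_sub_domain_name() helper mentioned in an earlier issue could be used for the same purpose.

    from urllib.parse import urlparse

    def in_subdomain(url, sub_domain='subdomain.domain.com'):
        return urlparse(url).netloc.lower() == sub_domain

    # In Spider.add_links_to_queue(): skip the link unless in_subdomain(url) is True.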

problem in crawling

Where should I put all the links to crawl? In which module? Please help.

html

What should I put in html_string?

html_string = ''

It's in the main module.
Please help.

'super' object has no attribute 'init'
How can I resolve this?
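This error usually means the call was written as super().init() without the double underscores (GitHub formatting often swallows them). A minimal sketch of a LinkFinder constructor in Python 3; the base_url/page_url parameters follow what other issues here describe, and self.links is an assumption:

    from html.parser import HTMLParser

    class LinkFinder(HTMLParser):
        def __init__(self, base_url, page_url):
            super().__init__()        # both pairs of double underscores are required
            self.base_url = base_url
            self.page_url = page_url
            self.links = set()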

Not crawling all links?

I pulled the source code straight down from this repo and decided to test it on Twitter. I edited main.py to suit my needs (just the project name and URL), and it only crawls one single URL. If you go to twitter.com while not logged in, you can see that there are pages under the root domain. Why isn't this working?

Crawling Domain Subdirectory

I need to crawl just a subdirectory of a website.

In "main.py" I changed "HOMEPAGE" to the following:

 HOMEPAGE = 'http://www.example.com/subdirectory/'

After I ran "main.py" I noticed there is a link in "queue.txt" that goes to "www.example.com" (not the subdirectory).

How could I alter the code to only crawl the subdirectory?

Thanks!
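A sketch of one way to restrict the crawl (HOMEPAGE mirrors the constant in main.py): keep the existing domain check and additionally require every queued URL to start with the subdirectory prefix.

    HOMEPAGE = 'http://www.example.com/subdirectory/'

    def in_subdirectory(url):
        return url.startswith(HOMEPAGE)

    # In Spider.add_links_to_queue():
    #     if not in_subdirectory(url):
    #         continue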

convert between python 3.5 -> 2.7

Hello,

When using Python 2.7 and you try super().__init__(), which is perfectly fine in Python 3.5, you get:
TypeError: super() takes at least 1 argument (0 given)
Can someone tell me how to rewrite this so it works in Python 2.7?

TIA,

TLT9116
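In Python 2.7, super() needs explicit arguments, and because Python 2's HTMLParser is an old-style class, calling the parent class directly is the safer spelling (super(LinkFinder, self).__init__() only works with new-style classes). A sketch for link_finder.py under Python 2:

    from HTMLParser import HTMLParser      # the module is named HTMLParser in Python 2

    class LinkFinder(HTMLParser):
        def __init__(self, base_url, page_url):
            HTMLParser.__init__(self)      # instead of Python 3's super().__init__()
            self.base_url = base_url
            self.page_url = page_url
            self.links = set()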

why error report:No module named queue

Why do I get the following error when I run this command?

$ python main.py

Traceback (most recent call last):
  File "main.py", line 2, in <module>
    from queue import Queue
ImportError: No module named queue

What can I do?

New Links to Crawl

Hey Bucky!

In trying to crawl "http://learnpythonthehardway.org/", I found that it was not successfully crawling new pages because the URL it built was incorrect. For example, it was trying to crawl "http://learnpythonthehardway.org/book/ex47.html" using "http://learnpythonthehardway.org/ex47.html", which of course did not exist. I found that changing link_finder.py line 18 from "url = parse.urljoin(self.base_url, value)" to "url = parse.urljoin(self.page_url, value)" solved this issue. I'm super new to Python and programming in general, so hopefully you can let me know if I'm on the right track or doing something wrong. Also, I really appreciate your videos!
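For anyone else hitting this, a quick demonstration of why the change works: relative links have to be joined against the page they appear on, not the site root (the ex46.html page below is just a hypothetical example):

    from urllib import parse

    base_url = 'http://learnpythonthehardway.org/'
    page_url = 'http://learnpythonthehardway.org/book/ex46.html'
    href = 'ex47.html'   # a relative link found on that page

    print(parse.urljoin(base_url, href))   # http://learnpythonthehardway.org/ex47.html       (broken)
    print(parse.urljoin(page_url, href))   # http://learnpythonthehardway.org/book/ex47.html  (correct)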

Error while writing file: set changed size during iteration

Received the following error while running the code:

[screenshot of the "set changed size during iteration" traceback]

I believe the error occurs because each thread tries to access the same file while writing. I also have one doubt about the function below:

def update_files():
    set_to_file(Spider.queue_file, Spider.queue)
    set_to_file(Spider.crawled_file, Spider.crawled)

Is Spider.queue shared by all the threads, or does each thread get its own Spider.queue when the thread is created and Spider.crawl_page() is called?
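Spider.queue and Spider.crawled are class variables, so every thread shares the same two sets; the error appears when one thread iterates a set to write it to disk while another thread adds or removes URLs. A sketch of one way to make the write safe (the lock is an addition, not existing project code): snapshot each set while holding a lock, and have the threads hold the same lock whenever they mutate the sets.

    import threading

    lock = threading.Lock()
    queue, crawled = set(), set()

    def update_files(queue_file, crawled_file):
        # Copy the shared sets under the lock so no other thread mutates them mid-iteration.
        with lock:
            queue_snapshot = set(queue)
            crawled_snapshot = set(crawled)
        with open(queue_file, 'w') as f:
            for url in sorted(queue_snapshot):
                f.write(url + '\n')
        with open(crawled_file, 'w') as f:
            for url in sorted(crawled_snapshot):
                f.write(url + '\n')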

Can't reach the hierarchy

Hello,
I am on thenewboston python tutorial 27 i.e. How to Build a Web Crawler (3/3)
I am stuck at getting the string value of the products on the website:

url = 'https://www.patanjaliayurved.net/latest-products/'+str(page)

The source code on the website for each product inside the product page looks roughly like this:

[HTML snippet: a div with class "product-detail-section" containing an h3 with PRODUCT_NAME_1 and the line "By: COMPANY_NAME"]

The code I am using for this is:

    for item_name in soup.findAll('div', {'class': 'product-detail-section'}):
        print(item_name.string)

Output:
None
None
None...

Another attempt:

    for item_name in soup.findAll('h3'):
        if item_name.parent.name == 'div':
            print(item_name.string)
Output:
PRODUCT_NAME_1
None
Information
Categories
Get In Touch
PRODUCT_NAME_2
None
Information
Categories
Get In Touch...

Here I want ONLY the product names, not 'None', 'Information', etc. as shown above.
Can someone please help resolve this problem?
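.string returns None whenever a tag has nested children, which is why most of the results are None. A sketch using get_text() and restricting the search to headings inside the product block, assuming the same requests + BeautifulSoup setup as the tutorial and that the product name's h3 sits inside the product-detail-section div shown above:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://www.patanjaliayurved.net/latest-products/1')
    soup = BeautifulSoup(page.text, 'html.parser')

    for h3 in soup.select('div.product-detail-section h3'):
        name = h3.get_text(strip=True)     # collects text even from nested tags
        if name:                           # skip empty headings instead of printing None
            print(name)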

"Brute force stop"

Good morning. I was wondering whether you know where I could insert a brute-force stop limit for the number of pages scraped by the crawler.
Thanks a lot.
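A sketch of a hard page cap (MAX_PAGES is an assumption, not an existing setting); in spider.py the check would sit at the top of crawl_page(), so no new pages are processed once the limit is reached:

    MAX_PAGES = 500
    crawled = set()

    def crawl_page(thread_name, page_url):
        if len(crawled) >= MAX_PAGES:
            return                  # brute-force stop: ignore any further pages
        # ... normal crawl logic ...
        crawled.add(page_url)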

Maximum recursion depth exceeded

Hi all!
I tried to use the spider on a site with about 8,000 pages.
While the spider was crawling, at a certain point an error stopped execution, reporting:

RuntimeError: maximum recursion depth exceeded

I worked around this error by adding the following statements to main.py:

    import sys
    sys.setrecursionlimit(100000)

Maximum recursion depth exceeded

I tried crawling one website, but after it printed:

queue1|crawled1
Error: cannot crawl page
1links in the queue
Thread-1crawling set(['set([\'set(["set([\\\'https://www.cracked.com/\\\'])"])\'])'])
queue1|crawled2
Error: cannot crawl page "

it kept calling crawl and create_jobs (I guess they call each other) many times, then from there called file_to_set, and then reported that the recursion depth was exceeded.
How can I fix this?
