kijiji-scraper's Introduction

Kijiji-Scraper 3.0.1

Tracks Kijiji ad information and sends out an email when new ads are found.

Install

Manually

git clone https://github.com/CRutkowski/Kijiji-Scraper.git
cd Kijiji-Scraper
python3 setup.py install

Dependencies: requests, BeautifulSoup and PyYAML.
Run pip install requests bs4 pyyaml to install the dependencies manually.

Try out

For instance kijiji --url https://www.kijiji.ca/b-cars-trucks/alberta/tesla-new__used/c174l9003a54a49

Configure

The script must read a configuration file to get the mail server settings. The default config file, config.yaml, is located in ~/.kijiji_scraper/ (MacOS/Linux) or %APPDATA%/.kijiji_scraper (Windows), or directly in the install folder.

  • Use kijiji --init to create the config file and open it with the default text editor, then set the sender, password and receiver fields in the config file.
  • You can specify the Kijiji URLs you wish to scrape at the bottom of the config file. There are a few examples in the config file to show the syntax.
  • Alternatively, you can use --url URLs to set the URLs to scrape and --email to set the recipients' addresses.

Note: If you're using Gmail, go to 'My Account > Sign in & security > Connected apps & sites' and turn "Allow less secure apps" to "On" so the script can sign in to Gmail.

For development and backward compatibility, you can also use the default config.yaml file in the install folder as the config file, but you must then call ./main.py directly rather than the kijiji command.
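
As a rough illustration of the per-user locations listed above, the lookup could be sketched as follows. The helper name default_config_dir and its parameters are hypothetical and not taken from the project; the script's real resolution logic may differ.

```python
import os
import sys
from pathlib import Path


def default_config_dir(platform=sys.platform, env=os.environ, home=None):
    """Return the per-user config folder named in this README (hypothetical helper)."""
    if platform.startswith("win"):
        # Windows: %APPDATA%/.kijiji_scraper
        base = Path(env.get("APPDATA", str(home or Path.home())))
    else:
        # MacOS/Linux: ~/.kijiji_scraper/
        base = Path(home) if home else Path.home()
    return base / ".kijiji_scraper"


if __name__ == "__main__":
    print(default_config_dir())
```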

Usage

To run the script, execute the kijiji command. You can also run python3 ./main.py from the install folder.

% kijiji --help           
usage: kijiji [-h] [--init] [--conf File path] [--url URL [URL ...]]
               [--email Email [Email ...]] [--skipmail] [--all]
               [--ads File path] [--version]

Kijiji scraper: Track ad informations and sends out an email when a new ads
are found

optional arguments:
  -h, --help            show this help message and exit
  --init, --setup       Create config file if doesn't exist and open with
                        default text editor
  --conf File path, -c File path
                        The script * must read a configuration file to set
                        mail server settings *. Default config file
                        config.yalm is located in ~/.kijiji_scraper/
                        (MacOS/Linux), APPDATA/.kijiji_scraper (Windows) or
                        directly in the install folder.
  --url URL [URL ...], -u URL [URL ...]
                        Kijiji seacrh URLs to scrape
  --email Email [Email ...], -e Email [Email ...]
                        Email recepients
  --skipmail, -s        Do not send emails. This is useful for the first time
                        you scrape a Kijiji as the current ads will be indexed
                        and after removing the flag you will only be sent new
                        ads.
  --all, -a             Consider all ads as new, do not load ads.json file
  --ads File path       Load specific ads JSON file. Default file will be
                        store in the config folder
  --version, -V         Print Kijiji-Scraper version

Note: The script stores the current ads in an ads.json file, located by default in the config folder (~/.kijiji_scraper/ or %APPDATA%/.kijiji_scraper). If a ./ads.json file exists, it will be loaded.
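
The roles of ads.json, --skipmail and --all can be pictured with the simplified sketch below. It is an illustration under assumptions, not the project's code: the function names, the ignore_known parameter, and the idea that ads are stored as a dict keyed by ad ID are all hypothetical.

```python
import json
from pathlib import Path


def load_known_ads(path):
    """Load previously seen ads; a missing file means nothing has been seen yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as ads_file:
        return json.load(ads_file)


def find_new_ads(scraped, known, ignore_known=False):
    """Return the scraped ads whose ID is not already known.

    ignore_known=True mimics --all: every scraped ad counts as new.
    """
    if ignore_known:
        return dict(scraped)
    return {ad_id: ad for ad_id, ad in scraped.items() if ad_id not in known}


def save_known_ads(path, known, scraped):
    """Remember everything seen so far so the next run only reports newer ads."""
    merged = {**known, **scraped}
    Path(path).write_text(json.dumps(merged), encoding="utf-8")
    return merged
```

Under this model, a first run with --skipmail would save the current ads without mailing them, so that later runs only report ads posted since.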

How to run the script on set intervals

Windows:

The Windows Task Scheduler can be used to run the script at set intervals.

  1. Create a new task

    • Fill in name and description
  2. Add a trigger

    • Under Settings select Daily
    • Set Repeat task every: to your desired interval, e.g. 5 minutes to run the script every 5 minutes
    • Set for a duration of: to indefinitely
  3. Add an action

    • Action is Start a program
    • Set Program/script to the location of your Python executable, e.g. C:\Users\{username}\AppData\Local\Programs\Python\Python36-32\pythonw.exe (use pythonw.exe to run quietly, with no window)
    • Set Add arguments to main.py
    • Set Start in to the location of the main.py file, e.g. C:\Users\{username}\Documents\Scripts\Kijiji-Scraper\
  4. Under Settings

    • Enable Run task as soon as possible after a scheduled start is missed

Linux and MacOS:

Crontab can be used on Linux and MacOS to run the script at a set interval.
To search for new ads every 5 minutes:

*/5 * * * * kijiji --url URL1 URL2 --email EMAIL1 EMAIL2

Running several search configurations

To avoid concurrent access to the ads JSON file, which can corrupt it, dedicate one ads file to each search:

*/5 * * * * kijiji --url URL1 URL2 --email EMAIL1 EMAIL2 --ads ~/our-ads.json
*/5 * * * * kijiji --url URL3 --email EMAIL3 --ads ~/roberts-ads.json
*/5 * * * * kijiji --url URL4 --email EMAIL4 --ads ~/lauras-ads.json
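
Besides dedicating one ads file to each search, a writer can lower the risk of leaving a half-written file behind by writing to a temporary file and renaming it into place. The sketch below is an assumption about how this could be done, not a description of how the script saves ads today; save_ads_atomically is a hypothetical name.

```python
import json
import os
import tempfile


def save_ads_atomically(path, ads):
    """Write ads as JSON to a temp file in the same folder, then rename it over path."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            json.dump(ads, tmp_file)
        # os.replace swaps the file in a single step, so a reader sees either
        # the old content or the new content, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does not stop two runs that share a file from overwriting each other's results, so separate --ads files remain the simpler choice.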

kijiji-scraper's People

Contributors

adrienpoupa, ardrake, bpjobin, craigho, crutkows, crutkowski, tristanlatr

kijiji-scraper's Issues

ValueError: Extra data

I have encountered this error many times. I had to remove ads.json to fix it, and the error keeps coming back every few days. (I really like this project, BTW.)

**********@ubuntu-upstart:~# kijiji
No config file loaded
Ads file: /root/.kijiji_scraper/ads.json
Traceback (most recent call last):
  File "/usr/local/bin/kijiji", line 9, in <module>
    load_entry_point('kijiji-scraper==3.0.1', 'console_scripts', 'kijiji')()
  File "/usr/local/lib/python3.4/dist-packages/kijiji_scraper-3.0.1-py3.4.egg/kijiji_scraper/launcher.py", line 72, in main
    kijiji_scraper = KijijiScraper(ads_filepath)
  File "/usr/local/lib/python3.4/dist-packages/kijiji_scraper-3.0.1-py3.4.egg/kijiji_scraper/kijiji_scraper.py", line 18, in __init__
    self.load_ads()
  File "/usr/local/lib/python3.4/dist-packages/kijiji_scraper-3.0.1-py3.4.egg/kijiji_scraper/kijiji_scraper.py", line 32, in load_ads
    self.all_ads = json.load(ads_file)
  File "/usr/lib/python3.4/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.4/json/decoder.py", line 346, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 3722674 - line 1 column 3723400 (char 3722673 - 3723399)
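
"Extra data" means the JSON decoder found more content after the first complete JSON value, which is what a file can look like after two writes overlapped or one was interrupted. One possible mitigation, sketched here as an assumption rather than the project's actual behaviour, is to set the damaged file aside and start from an empty ad list instead of crashing. The name load_ads_or_reset is hypothetical.

```python
import json
import os


def load_ads_or_reset(path):
    """Return the ads stored at path, or {} if the file is missing or not valid JSON."""
    if not os.path.exists(path):
        return {}
    try:
        with open(path, encoding="utf-8") as ads_file:
            return json.load(ads_file)
    except ValueError:
        # json.JSONDecodeError subclasses ValueError on Python 3.5+; Python 3.4
        # raises a plain ValueError, as in the traceback above.
        os.replace(path, path + ".corrupt")  # keep the bad file for inspection
        return {}
```

The trade-off is that after a reset every ad currently listed counts as new again, much like running with --all.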

Stopped putting images into the email

I think Kijiji did something different with the images, so the picture doesn't make it into the email anymore.

'Image': '<img alt="Looking for a Saab 99, 4 or 5 door" data-src="https://i.ebayimg.com/00/s/NTY2WDgwMA==/z/k14AAOSwoCdd0qwc/$_35.JPG" src="https://ca.classistatic.com/static/V/8715/img/placeholder-large.png"/>'

"data-src" still points to the actual image, but "src" only to the placeholder image. What goes to the email is the content of "src". So - no pic of the item in the email...

I can't code. But I'll try.

I'll keep googling and hopefully figure it out. But if anyone has a fix - great!
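
One way to approach this, sketched under assumptions, is to read data-src first and fall back to src only when data-src is absent. The example uses only the standard library so it runs on its own; real_image_url is a hypothetical helper, not part of the script. With a BeautifulSoup tag the same idea would be tag.get("data-src") or tag.get("src").

```python
from html.parser import HTMLParser


class _ImgAttrs(HTMLParser):
    """Collect the attributes of the first <img> tag in a snippet."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.attrs is None:
            self.attrs = dict(attrs)


def real_image_url(img_html):
    """Prefer the lazy-loading data-src attribute, falling back to src."""
    parser = _ImgAttrs()
    parser.feed(img_html)
    attrs = parser.attrs or {}
    return attrs.get("data-src") or attrs.get("src")
```

The returned URL could then be used to build the img tag that goes into the email, in place of the placeholder.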

Send email as unencrypted SMTP protocol

Google email stopped working about 2 weeks ago. I replaced the SMTP settings with my ISP's SMTP server on port 25, 567 or 465, but I kept getting this error:

ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1108)

To fix this, I opened the file /kijiji_scraper/email_client.py and replaced this line:
server = smtplib.SMTP_SSL(self.smtp_server, self.smtp_port)

with:
server = smtplib.SMTP(self.smtp_server, self.smtp_port)

After that I could use SMTP servers on port 25.

Thought I should let you guys know
Cheers
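
WRONG_VERSION_NUMBER typically shows up when SMTP_SSL, which starts TLS immediately, talks to a port where the server expects a plain-text greeting first. Rather than hard-coding one class, the connection style could be chosen from the port. The following is a minimal sketch with hypothetical function names, assuming the common convention that port 465 uses implicit TLS and other ports start unencrypted.

```python
import smtplib


def uses_implicit_tls(port):
    """Port 465 is conventionally TLS from the first byte; other ports start in plain text."""
    return port == 465


def open_smtp(host, port, timeout=30):
    """Open an SMTP connection using the encryption style conventional for the port."""
    if uses_implicit_tls(port):
        return smtplib.SMTP_SSL(host, port, timeout=timeout)
    server = smtplib.SMTP(host, port, timeout=timeout)
    server.ehlo()
    if server.has_extn("starttls"):
        # Upgrade to TLS when the server offers it, then greet again.
        server.starttls()
        server.ehlo()
    return server
```

With this approach a server on port 25 that offers no STARTTLS is still reachable, but the session stays unencrypted, as in the workaround above.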

SMTP Error

I have configured the config.yaml properly, however I get the following SMTP error:

Traceback (most recent call last):
  File "./main.py", line 37, in <module>
    email_client.mail_ads(ads, email_title)
  File "/home/brain1/kijiji_3/Kijiji-Scraper-master/kijiji_scraper/email_client.py", line 26, in mail_ads
    server.ehlo()
  File "/usr/lib/python3.6/smtplib.py", line 440, in ehlo
    self.putcmd(self.ehlo_msg, name or self.local_hostname)
  File "/usr/lib/python3.6/smtplib.py", line 367, in putcmd
    self.send(str)
  File "/usr/lib/python3.6/smtplib.py", line 359, in send
    raise SMTPServerDisconnected('please run connect() first')
smtplib.SMTPServerDisconnected: please run connect() first
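
smtplib raises this error when a command is sent through an object that never opened a connection, which is what happens when the server object is created without a host name, for instance because the SMTP server setting is empty or was not loaded. That is a guess at the cause here, not a confirmed diagnosis. The sketch below reproduces the message and shows a small guard; both function names are hypothetical.

```python
import smtplib


def require_smtp_host(host):
    """Fail early with a readable message when no SMTP server is configured."""
    if not host or not str(host).strip():
        raise ValueError("SMTP server is not set; check the mail settings in the config file")
    return str(host).strip()


def reproduce_disconnected_error():
    """Show that an SMTP object built without a host has no open connection."""
    # No host is given, so the constructor does not attempt to connect.
    server = smtplib.SMTP(local_hostname="localhost")
    try:
        server.ehlo()
    except smtplib.SMTPServerDisconnected as error:
        return str(error)
    return None
```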

Ram Leak

The script slowly uses more and more RAM. After a few days it uses more than 1 GB.

Scrape broke recently

My hourly crontab started spitting errors today:

Traceback (most recent call last):
  File "/home/mewse/Kijiji-Scraper/Kijiji-Scraper.py", line 250, in <module>
    main()
  File "/home/mewse/Kijiji-Scraper/Kijiji-Scraper.py", line 247, in main
    scrape(url_to_scrape, old_ad_dict, exclude_list, filename, skip_flag)
  File "/home/mewse/Kijiji-Scraper/Kijiji-Scraper.py", line 171, in scrape
    email_title = soup.find('div', {'class': 'message'}).find('strong').text.strip('"')
AttributeError: 'NoneType' object has no attribute 'find'

I fixed it by removing .strip() from the end of line 171:

170 if not email_title: # If the email title doesn't exist, pull it from the HTML data
171     #email_title = soup.find('div', {'class': 'message'}).find('strong').text.strip('"')
172     email_title = soup.find('div', {'class': 'message'}).find('strong').text
173     email_title = toUpper(email_title)
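
For context, soup.find returns None when nothing matches, so chaining .find('strong') onto it raises this AttributeError whenever the page contains no div with class "message". A more defensive lookup could look like the sketch below. The function name and the fallback title are hypothetical, and soup is expected to be a BeautifulSoup object.

```python
def extract_email_title(soup, fallback="Kijiji ads"):
    """Return the search title from the page, or fallback if the markup is missing."""
    message = soup.find('div', {'class': 'message'})
    # Only look for <strong> when the outer <div> was actually found.
    strong = message.find('strong') if message is not None else None
    if strong is None:
        return fallback
    return strong.text.strip('"')
```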

[Error] Unable to create body for email message

[Error] Unable to create body for email message
Traceback (most recent call last):
  File "kijiji.py", line 250, in <module>
    main()
  File "kijiji.py", line 247, in main
    scrape(url_to_scrape, old_ad_dict, exclude_list, filename, skip_flag)
  File "kijiji.py", line 198, in scrape
    MailAd(ad_dict, email_title) # Send out email with new ads
  File "kijiji.py", line 132, in MailAd
    msg = MIMEText(body, 'html')
  File "/usr/lib/python2.7/email/mime/text.py", line 30, in __init__
    self.set_payload(_text, _charset)
  File "/usr/lib/python2.7/email/message.py", line 226, in set_payload
    self.set_charset(charset)
  File "/usr/lib/python2.7/email/message.py", line 262, in set_charset
    self._payload = self._payload.encode(charset.output_charset)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 451: ordinal not in range(128)
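
The traceback shows MIMEText encoding the body as ASCII, which fails on the right single quotation mark (U+2019) found in the ad text. Passing an explicit charset as the third argument avoids this. A minimal Python 3 sketch follows; build_html_message is a hypothetical name.

```python
from email.mime.text import MIMEText


def build_html_message(body, subject):
    """Build an HTML email part that can carry non-ASCII characters."""
    msg = MIMEText(body, "html", "utf-8")  # explicit charset instead of the ASCII default
    msg["Subject"] = subject
    return msg
```

The same third argument is accepted by MIMEText on Python 2.7, the version shown in the traceback.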

does not pull entire description

If the post contains a lengthy description, the script will not pull all of it, because a "show more" button appears in the Kijiji ad. A possible solution would be to check whether that button exists and then parse the information that appears after it is clicked.

Reset the cache and rerun the program

Hello, I was wondering if there is a way to manually clear the cache and re-run it?
The script always prompts "found 0 new ads" after its first run.

Regards.

More details...

Hi Chase,
Very nice scraper!
I'd like to be able to get all the info for vehicles, as shown in this section:
[image]
Any way you can add this function?

Thanks!
L-P

Found 0 new ads

Hello @CRutkowski!

First of all, I think this is a great scraper! It's going to help me a lot with my apartment search. However, it doesn't work for me. I did everything the README says, but when I run main.py in my console, it keeps saying "Found 0 new ads"!

MacBook-Pro-3:Kijiji-Scraper-master francisduval$ python3 main.py
Scraping: https://www.kijiji.ca/b-quebec/samsung-s8/k0l9001?dc=true
Found 0 new ads

Do you have an idea why?

Thank you!

Run a command option?

Hi,
I just tried out Kijiji-Scraper and it works fine. I was wondering if there's an 'official' way to run a command when a new ad is found? Instead of sending an email, I'd like to run a command instead. This way you could invoke a command to send an SMS, pop-up a notification etc.
Thanks for the app!
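
The README does not describe such an option. Purely as an illustration of what a hook of this kind could look like, the sketch below runs an external command once per new ad. The function name, and the "Title" and "Url" keys of the ad dict, are assumptions and not part of the script.

```python
import subprocess
import sys


def run_on_new_ad(command, ad):
    """Run command (a list of arguments) with the ad title and URL appended."""
    args = list(command) + [ad.get("Title", ""), ad.get("Url", "")]
    # check=False: a failing notifier should not stop the scraper.
    return subprocess.run(args, check=False).returncode


if __name__ == "__main__":
    # Example: use the Python interpreter itself as the notifier command.
    run_on_new_ad(
        [sys.executable, "-c", "import sys; print('New ad:', sys.argv[1])"],
        {"Title": "Example ad", "Url": ""},
    )
```

Passing the arguments as a list avoids shell quoting problems with ad titles that contain spaces or quotes.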

Mileage/Trans Bug

Some ads don't have mileage or transmission info.
The script is not currently able to handle that.

Add ability to iterate through pages of ads?

Currently the script only scrapes the page it is given, even though there could be multiple pages of ads.
New ads are posted on the first page, though, so it may not be worthwhile.

requests.exceptions.ChunkedEncodingError:

Hi CRutkowski, this is what I did:
Step 1: put the Kijiji URL in the config (python3 main.py --setup), then save and close.
Step 2: on Ubuntu, run "python3 main.py".
After about 3 minutes, it shows this error:

http.client.IncompleteRead: IncompleteRead(0 bytes read)

urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

Could you help me? Thank you.
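
A broken connection like this is often transient, so one common mitigation is to retry the request a few times before giving up. The helper below is a generic sketch with hypothetical names; it takes the fetching function and the exception types to retry on as arguments, so it makes no assumption about how the script performs its requests.

```python
import time


def fetch_with_retries(fetch, retry_on, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(); on one of the retry_on exceptions, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay * attempt)  # wait a little longer after each failure
```

With the requests library it could be called as fetch_with_retries(lambda: requests.get(url, timeout=30), requests.exceptions.ChunkedEncodingError), where url is the search page to download.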

KeyError: 'data-ad-id'

Hello,
Great scraper! I have been using it for over a year. Thanks for your work on this.

I had the soup issue that other people have already mentioned. I pulled the fixed code today and tried out....now I am getting this error:

[Okay] Ad database succesfully loaded.
Traceback (most recent call last):
  File "/home/brain1/kijiji/kijiji2.py", line 257, in <module>
    main()
  File "/home/brain1/kijiji/kijiji2.py", line 254, in main
    scrape(url_to_scrape, old_ad_dict, exclude_list, filename, skip_flag)
  File "/home/brain1/kijiji/kijiji2.py", line 185, in scrape
    third_party_ad_ids.append(ad['data-ad-id'])
  File "/home/brain1/.local/lib/python3.5/site-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'data-ad-id'


My local element.py file has not changed in a year:
brain1@tunnel:~/.local/lib/python3.5/site-packages/bs4$ ls -al element.py
-rw-rw-r-- 1 brain1 brain1 68798 Jun 30 2018 element.py

Any ideas?
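
The KeyError comes from indexing an element that has no data-ad-id attribute, which suggests the page markup changed rather than the local bs4 install. One defensive option is to use .get, which BeautifulSoup tags and plain dicts both provide, and to skip elements without the attribute. This is a sketch; collect_ad_ids is a hypothetical name.

```python
def collect_ad_ids(ads):
    """Return the data-ad-id of each ad element, skipping elements without one."""
    ad_ids = []
    for ad in ads:
        ad_id = ad.get('data-ad-id')  # .get returns None instead of raising KeyError
        if ad_id is not None:
            ad_ids.append(ad_id)
    return ad_ids
```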

Stopped working?

Hi there,

The script seems to have stopped working on October 21st around 14:00. Could you confirm that?

I have 2 copies of the script running on 2 different servers (and using 2 different e-mails) and now I'm getting "Found 0 new ads" every time I run the script.

I erased ads.json to clear things up a bit, but it is not getting repopulated with new ads. The script still shows "Found 0 new ads", and the new JSON file contains "{ }".

Is it due to some Kijiji website changes?

Best,
Adam

Remove Log

Remove the log function and all log comments once the script is completely functional.
Currently it's used to find errors while the script is running in the background.
