
predicting-car-price-from-scraped-data's Introduction

Welcome to The Car Connection Dataset ™️

School Project

I scraped 32,000+ cars with 150 specifications from thecarconnection.com and ran multiple analyses with PyTorch, Scikit-Learn, and TensorFlow. These include PCA, fully-connected (dense) neural networks, decision trees, random forests, SVMs, etc.
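To illustrate the kind of analysis described above, here is a minimal scikit-learn sketch of price prediction with a random forest. The column names and values below are hypothetical stand-ins, not the dataset's actual headers:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical toy columns; the real CSV has ~150 specification columns.
df = pd.DataFrame({
    'horsepower':  [110, 290, 180, 240, 140, 300],
    'curb_weight': [2900, 4200, 3300, 3800, 3000, 4400],
    'msrp':        [21000, 45000, 28000, 38000, 23000, 52000],
})

X, y = df[['horsepower', 'curb_weight']], df['msrp']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Fit a random forest regressor and predict prices on the held-out split.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

The same `X`/`y` split works for the other models mentioned (decision trees, SVMs), by swapping the estimator class.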

predicting-car-price-from-scraped-data's People

Contributors

nicolas-gervais


predicting-car-price-from-scraped-data's Issues

Other interesting (!?) data sources

Hello Nicolas, I saw your work on Reddit a few weeks ago, and it's really cool, and I like your approach.

I just finished writing an article on a similar approach to collecting data, but I used the website Turo as the data source.

I want to share it with you because Turo could be an excellent addition to your data sources. As for me, I will definitely add The Car Connection to my data sources 😸.

Regards

run main.py

Hello, thanks for your code. When I run main.py, I hit the following errors:
scrape.py started running.
Traceback (most recent call last):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 92, in <module>
    run(sys.argv[1])
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 78, in run
    a = all_makes()
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 20, in all_makes
    for a in fetch(website, "/new-cars").find_all("a", {"class": "add-zip"}):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 15, in fetch
    'lxml')
  File "D:\python3.6\lib\site-packages\bs4\__init__.py", line 245, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
tag.py started running.
Traceback (most recent call last):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/tag.py", line 69, in <module>
    run()
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/tag.py", line 11, in run
    df = pd.read_csv('specs-and-pics.csv', dtype=str, index_col=0).T
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File specs-and-pics.csv does not exist: 'specs-and-pics.csv'
Traceback (most recent call last):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/save.py", line 15, in <module>
    df = pd.read_csv('id_and_pic_url.csv')
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File id_and_pic_url.csv does not exist: 'id_and_pic_url.csv'
I also have a question about scrape.py:
def fetch(page, addition=''):
    return bs.BeautifulSoup(urlopen(Request(page + addition,
                                            headers={'User-Agent': 'Opera/9.80 (X11; Linux i686; Ub'
                                                     'untu/14.10) Presto/2.12.388 Version/12.16'})).read(),
                            'lxml')
Can you help me? Thank you!
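The `bs4.FeatureNotFound` traceback above usually just means the `lxml` parser is not installed (`pip install lxml` fixes it). A minimal sketch of a fallback, assuming only that BeautifulSoup itself is installed, is to pick the parser at import time:

```python
import bs4

# Prefer lxml if it is installed; otherwise fall back to the stdlib parser,
# which avoids bs4.FeatureNotFound at the cost of slower parsing.
try:
    import lxml  # noqa: F401
    PARSER = 'lxml'
except ImportError:
    PARSER = 'html.parser'

# Same find_all call as in scrape.py's all_makes(), on a tiny sample document.
soup = bs4.BeautifulSoup('<a class="add-zip" href="/acura">Acura</a>', PARSER)
links = soup.find_all('a', {'class': 'add-zip'})
```

Passing `PARSER` instead of the hard-coded `'lxml'` in `fetch` would make the scraper run on either setup.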

Duplicates

Howdy; thanks for the dataset. You mentioned that there are a lot of duplicates. I used the imagehash library (https://github.com/JohannesBuchner/imagehash) to try to find dupes. There are about 32k (a little less than half the dataset), depending on what counts as a dupe. For example, Acura_ILX_2014_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_Fdd.jpg and Acura_ILX_2013_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_onJ.jpg are exact duplicates (found by md5), while Acura_MDX_2017_44_18_290_35_6_77_67_196_19_FWD_7_4_SUV_Jss.jpg and Acura_MDX_2017_44_18_290_35_6_77_67_196_19_FWD_7_4_SUV_CgX.jpg are resized duplicates (found by dhash).

I made a little spreadsheet with the hash/duplicate information (imagehashes.txt) in case anyone downloads the dataset and wants to clean it quickly; the fastest way to use it would be something like this:

import csv
import os

records = []
with open('imagehashes.txt', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    for record in reader:
        record['duplicate'] = record['duplicate'] == 'True'
        records.append(record)

for record in records:
    if record['duplicate']:  # alternately, if record['matches'] == 'md5':
        filename = record['filename']
        print(f"deleting {filename}")
        os.unlink(filename)
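For anyone without the spreadsheet, the exact (md5) duplicates can also be found directly from the image files. A minimal stdlib sketch, where the `images/` directory name is an assumption about where the dataset was unpacked:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_md5_duplicates(directory):
    """Group .jpg files by md5 digest; any group with more than one
    file contains byte-for-byte identical images."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob('*.jpg')):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return {d: names for d, names in groups.items() if len(names) > 1}

# Usage: dupes = find_md5_duplicates('images/')
# then keep names[0] of each group and unlink the rest.
```

This only catches exact copies; resized duplicates still need a perceptual hash like dhash from the imagehash library.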

Which Python version to use?

Hello Nicolas, I saw your repo and I think it is really cool! I tried to run the picture scraper using Python 3.7, and it gives me some errors. May I know which Python version you are using? Thanks!

Scraping high res photos

Hello,

Firstly, thanks for sharing such a comprehensive dataset on car models!

You mentioned in the Reddit post that scraping high-res photos is possible by modifying the scraper. Would you mind sharing the element id or GET request related to the high-res photos, so that I can modify the scraper script accordingly?

Best regards
