
predicting-car-price-from-scraped-data's Introduction

Welcome to The Car Connection Dataset ™️

School Project

I scraped 32,000+ cars with 150 specifications from thecarconnection.com and ran multiple analyses with PyTorch, Scikit-Learn, and TensorFlow. These include PCA, fully-connected (dense) neural networks, decision trees, random forests, SVMs, etc.
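To illustrate the kind of analysis described above, here is a minimal scikit-learn sketch of price prediction with a random forest. The column names and values below are hypothetical stand-ins, not the dataset's actual headers:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical toy columns; the real CSV has ~150 specification columns.
df = pd.DataFrame({
    'horsepower':  [110, 290, 180, 240, 140, 300],
    'curb_weight': [2900, 4200, 3300, 3800, 3000, 4400],
    'msrp':        [21000, 45000, 28000, 38000, 23000, 52000],
})

X, y = df[['horsepower', 'curb_weight']], df['msrp']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Fit a random forest regressor and predict prices on the held-out split.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

The same `X`/`y` split works for the other models mentioned (decision trees, SVMs), by swapping the estimator class.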

predicting-car-price-from-scraped-data's People

Contributors

nicolas-gervais


predicting-car-price-from-scraped-data's Issues

Other interesting (!?) data sources

Hello Nicolas, I saw your work on Reddit a few weeks ago, and it's really cool, and I like your approach.

I just finished writing an article on a similar approach to collecting data, but I used the website Turo as the data source.

I want to share it with you because Turo could be an excellent addition to your data sources. As for me, I will definitely add The Car Connection to my data sources 😸.

Regards

run main.py

Hello, thanks for your code. When I run main.py, I hit the following errors:
scrape.py started running.
Traceback (most recent call last):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 92, in <module>
    run(sys.argv[1])
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 78, in run
    a = all_makes()
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 20, in all_makes
    for a in fetch(website, "/new-cars").find_all("a", {"class": "add-zip"}):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/scrape.py", line 15, in fetch
    'lxml')
  File "D:\python3.6\lib\site-packages\bs4\__init__.py", line 245, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
tag.py started running.
Traceback (most recent call last):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/tag.py", line 69, in <module>
    run()
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/tag.py", line 11, in run
    df = pd.read_csv('specs-and-pics.csv', dtype=str, index_col=0).T
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File specs-and-pics.csv does not exist: 'specs-and-pics.csv'
Traceback (most recent call last):
  File "F:/BONC/vehicle/predicting-car-price-from-scraped-data/picture-scraper/save.py", line 15, in <module>
    df = pd.read_csv('id_and_pic_url.csv')
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "D:\python3.6\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File id_and_pic_url.csv does not exist: 'id_and_pic_url.csv'
I also have a question about scrape.py:
def fetch(page, addition=''):
    return bs.BeautifulSoup(urlopen(Request(page + addition,
                                            headers={'User-Agent': 'Opera/9.80 (X11; Linux i686; Ub'
                                                     'untu/14.10) Presto/2.12.388 Version/12.16'})).read(),
                            'lxml')
Can you help me? Thank you!
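The `bs4.FeatureNotFound` traceback above usually just means the `lxml` parser is not installed (`pip install lxml` fixes it). A minimal sketch of a fallback, assuming only that BeautifulSoup itself is installed, is to pick the parser at import time:

```python
import bs4

# Prefer lxml if it is installed; otherwise fall back to the stdlib parser,
# which avoids bs4.FeatureNotFound at the cost of slower parsing.
try:
    import lxml  # noqa: F401
    PARSER = 'lxml'
except ImportError:
    PARSER = 'html.parser'

# Same find_all call as in scrape.py's all_makes(), on a tiny sample document.
soup = bs4.BeautifulSoup('<a class="add-zip" href="/acura">Acura</a>', PARSER)
links = soup.find_all('a', {'class': 'add-zip'})
```

Passing `PARSER` instead of the hard-coded `'lxml'` in `fetch` would make the scraper run on either setup.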

Duplicates

Howdy; thanks for the dataset. You mentioned that there are a lot of duplicates. I used the imagehash library (https://github.com/JohannesBuchner/imagehash) to try to find dupes. There are about 32k (a little less than half the dataset), depending on what counts as a dupe. For example, Acura_ILX_2014_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_Fdd.jpg and Acura_ILX_2013_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_onJ.jpg are exact duplicates (found by md5), while Acura_MDX_2017_44_18_290_35_6_77_67_196_19_FWD_7_4_SUV_Jss.jpg and Acura_MDX_2017_44_18_290_35_6_77_67_196_19_FWD_7_4_SUV_CgX.jpg are resized duplicates (found by dhash).

I made a little spreadsheet with the hash/duplicate information (imagehashes.txt) in case anyone downloads the dataset and wants to clean it quickly; the fastest way to use it would be something like this:

import csv
import os

records = []
with open('imagehashes.txt', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    for record in reader:
        record['duplicate'] = record['duplicate'] == 'True'
        records.append(record)

for record in records:
    if record['duplicate']:  # alternately, if record['matches'] == 'md5':
        filename = record['filename']
        print(f"deleting {filename}")
        os.unlink(filename)
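For anyone without the spreadsheet, the exact (md5) duplicates can also be found directly from the image files. A minimal stdlib sketch, where the `images/` directory name is an assumption about where the dataset was unpacked:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_md5_duplicates(directory):
    """Group .jpg files by md5 digest; any group with more than one
    file contains byte-for-byte identical images."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob('*.jpg')):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return {d: names for d, names in groups.items() if len(names) > 1}

# Usage: dupes = find_md5_duplicates('images/')
# then keep names[0] of each group and unlink the rest.
```

This only catches exact copies; resized duplicates still need a perceptual hash like dhash from the imagehash library.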

Which Python version to use?

Hello Nicolas, I saw your repo and I think it is really cool! I tried to run the picture scraper using Python 3.7, and it gives me some errors. May I know which Python version you are using? Thanks!

Scraping high res photos

Hello,

Firstly, thanks for sharing such a comprehensive dataset on car models!

You mentioned in the Reddit post that scraping high-res photos is possible by modifying the scraper. Would you mind sharing the element id or GET request related to the high-res photos, so that I can modify the scraper script accordingly?

Best regards
