
funda-scraper's Introduction

Greetings! This is Will 👋

  • 🌷 I'm a Data Scientist / AI Developer from Taiwan, now working in Amsterdam.
  • My passion lies in leveraging data to tackle real-world challenges.
  • I have professional expertise in the financial and healthcare sectors.

funda-scraper's People

Contributors

adrianazoitei, kenmccann, marianasegbravo, utkuarslan5, whchien

funda-scraper's Issues

Changing keep_cols in preprocessing

It sounds like adding the flexibility to alter which columns to keep (in my case, I want date_list) could be useful, e.g. by updating config.yaml before running the scraper.
If you can guide me on how, I'd be happy to implement it!
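
In the meantime, a minimal workaround sketch using only calls shown elsewhere on this page: run with raw_data=True (the raw output keeps all scraped columns, including date_list) and subset the DataFrame yourself. The column list below is illustrative:

from funda_scraper import FundaScraper

scraper = FundaScraper(area="amsterdam", want_to="rent", find_past=False, page_start=1, n_pages=1)
df = scraper.run(raw_data=True)  # raw output retains every scraped column

# Illustrative subset; adjust to whichever columns you want to keep
keep_cols = ["house_id", "city", "price", "date_list"]
df = df[[c for c in keep_cols if c in df.columns]]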

Adding a different starting page

Hey, to make the web scraping more robust, how about adding a starting-page option as well? Basically, instead of starting from page 0 for a specific city, start from page 5, for example.

The implementation would be pretty simple, and I think I can open a PR if you're OK with that.
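
For reference, the getting-started snippet later on this page already passes a page_start argument, which suggests the option exists (or has since landed); a usage sketch:

from funda_scraper import FundaScraper

# Start crawling the search results at page 5 instead of page 1
scraper = FundaScraper(area="amsterdam", want_to="buy", find_past=False, page_start=5, n_pages=3)
df = scraper.run(raw_data=False)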

Multiple neighbourhoods

When I'm browsing funda manually, I use a custom query based on a list of multiple neighbourhoods, all based on the 'CBS Wijken en Buurten' dataset.

Would it be possible to let the area input be a list of neighbourhoods?

Much appreciated! Your script is a lifesaver!
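
Until a list input is supported, a workaround sketch: loop over the neighbourhoods and concatenate the results. The area values below are illustrative placeholders, not verified funda search terms:

import pandas as pd
from funda_scraper import FundaScraper

neighbourhoods = ["amsterdam/centrum", "amsterdam/oud-west"]  # illustrative values

frames = []
for area in neighbourhoods:
    scraper = FundaScraper(area=area, want_to="buy", find_past=False, n_pages=3)
    frames.append(scraper.run(raw_data=False))

df = pd.concat(frames, ignore_index=True)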

package error

[screenshot of the error]

I used the scraper two days ago, and it worked fine, but now it doesn't return anything. Is this issue specific to my machine, or are others also experiencing this problem?

Default example from documentation returns an error

The getting-started code from the documentation:

from funda_scraper import FundaScraper

scraper = FundaScraper(area="amsterdam", want_to="rent", find_past=False, page_start=1, n_pages=3)
df = scraper.run(raw_data=False, save=True, filepath="test.csv", min_price=500, max_price=2000)
df.head()

returns:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/username/PycharmProjects/HouseWatcher/main.py", line 9, in <module>
    df = scraper.run(raw_data=True, save=True, filepath="test.csv")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/funda_scraper/scrape.py", line 280, in run
    self.scrape_pages()
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/funda_scraper/scrape.py", line 245, in scrape_pages
    content = process_map(self.scrape_one_link, self.links, max_workers=pools)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 811, in map
    results = super().map(partial(_process_chunk, fn),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 782, in submit
    self._adjust_process_count()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 741, in _adjust_process_count
    self._spawn_process()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 759, in _spawn_process
    p.start()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 158, in get_preparation_data
    _check_not_importing_main()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 138, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

  0%| | 0/45 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/Users/username/PycharmProjects/HouseWatcher/main.py", line 9, in <module>
    df = scraper.run(raw_data=True, save=True, filepath="test.csv")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/funda_scraper/scrape.py", line 280, in run
    self.scrape_pages()
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/funda_scraper/scrape.py", line 245, in scrape_pages
    content = process_map(self.scrape_one_link, self.links, max_workers=pools)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/PycharmProjects/HouseWatcher/venv/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 597, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

I am running:

#!/usr/local/bin/python

from funda_scraper import FundaScraper

scraper = FundaScraper(area="amsterdam", want_to="buy", find_past=False, page_start=1, n_pages=3, min_price=300000, max_price=500000)
df = scraper.run(raw_data=False, save=True, filepath="huize.csv")

on my MacBook Pro, Python version 3.11.3 (v3.11.3:f3909b8bc8, Apr 4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin

I get the error below at the beginning of phase 2.

*** Phase 2: Start scraping from individual links ***
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/rop/huize.py", line 6, in <module>
    df = scraper.run(raw_data=False, save=True, filepath="huize.csv")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/funda_scraper/scrape.py", line 317, in run
    self.scrape_pages()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/funda_scraper/scrape.py", line 282, in scrape_pages
    content = process_map(self.scrape_one_link, self.links, max_workers=pools)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 816, in map
    results = super().map(partial(_process_chunk, fn),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 787, in submit
    self._adjust_process_count()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 746, in _adjust_process_count
    self._spawn_process()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 764, in _spawn_process
    p.start()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 158, in get_preparation_data
    _check_not_importing_main()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 138, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
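
The RuntimeError message itself describes the fix: on macOS and Windows, Python starts worker processes with spawn, which re-imports the main script, so a top-level scraper.run() call is re-executed inside every worker. Wrapping the entry point in the standard guard stops the recursion; a sketch based on the script above:

#!/usr/local/bin/python

from funda_scraper import FundaScraper


def main():
    scraper = FundaScraper(area="amsterdam", want_to="buy", find_past=False,
                           page_start=1, n_pages=3, min_price=300000, max_price=500000)
    df = scraper.run(raw_data=False, save=True, filepath="huize.csv")
    print(df.head())


if __name__ == "__main__":
    # Worker processes re-import this module on spawn-based platforms;
    # the guard keeps them from re-running the scrape at import time.
    main()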

Valid funda listings not included in output

I noticed that some listings matching my given criteria are not included when raw_data=False. If I set it to True, I do get the listings in the output.

Arguments: want_to="buy", find_past=False, n_pages=99

Some example links that are not in the output but should be (note: as of opening this issue, the houses are still available for purchase):
https://www.funda.nl/en/koop/amsterdam/huis-42138444-maria-austriastraat-853/
https://www.funda.nl/en/koop/amsterdam/huis-88599610-johan-hofmanstraat-273-pp/

I notice that one column is shifted, which is probably a hint. I highlighted it below.

[screenshot: the shifted column highlighted in the output]

Python 3.11.3. Run from a .py script.

Range of prices

How about adding a price range? Right now the scraper tends to pick up a lot of parking spots as well, which is not always desirable. Also, being able to return the maximum number of pages each city has would be a useful addition.
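
For the price part, the documentation example earlier on this page already passes price bounds to run(); a usage sketch:

from funda_scraper import FundaScraper

scraper = FundaScraper(area="amsterdam", want_to="rent", find_past=False, page_start=1, n_pages=3)
# min_price / max_price exclude listings (such as parking spots) outside the range
df = scraper.run(raw_data=False, min_price=500, max_price=2000)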

Pull total number of pages for a query

A nice improvement might be to let BeautifulSoup read the total number of pages for a query, so that instead of setting a fixed number of pages, the complete result set can be collected even when the total number of pages varies. Will investigate this.
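
A rough sketch of the idea, using requests and BeautifulSoup as the rest of the codebase does. The "/p<N>/" path convention for pagination links is an assumption about funda's markup, not a verified selector:

import re

import requests
from bs4 import BeautifulSoup


def get_total_pages(search_url: str, headers: dict) -> int:
    """Return the highest page number linked from the first results page."""
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    pages = [
        int(m.group(1))
        for a in soup.find_all("a", href=True)
        if (m := re.search(r"/p(\d+)/?$", a["href"]))  # assumed pagination pattern
    ]
    return max(pages, default=1)

This would also cover the "List index out of range" report below: clamp the requested count with min(n_pages, get_total_pages(...)) before scraping.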

List index out of range

If I request more pages than exist, for example n_pages=40 while the original funda search has a maximum of 25 pages, I get a "list index out of range" error.

Can this be fixed?

Aiohttp fails for captcha

I've been implementing asyncio with aiohttp instead of multiprocessing, but whatever I try (e.g. sleep, user agent, etc.), it always trips the bot detection, while somehow the requests library can retrieve the pages just fine.
Any ideas?

@staticmethod
async def _get_links_from_one_parent(url: str) -> List[str]:
    """Scrape all the available housing items from one Funda search page."""
    try:
        async with aiohttp.ClientSession(headers=config.header) as session:
            async with session.get(url) as response:
                if response.status != 200:
                    logger.error(f"Failed to fetch {url}: HTTP {response.status}")
                    return []
                response_text = await response.text()

                # Introduce a random delay
                await asyncio.sleep(random.uniform(0.5, 2))

        soup = BeautifulSoup(response_text, "lxml")
        script_tags = soup.find_all("script", {"type": "application/ld+json"})
        if not script_tags:
            logger.warning(f"No script tags found in {url}")
            return []

        json_data = json.loads(script_tags[0].contents[0])
        urls = [item["url"] for item in json_data["itemListElement"]]
        return list(set(urls))

    except Exception as e:
        logger.error(f"Error fetching links from {url}: {e}")
        return []

The updated HTML content you've provided still shows that you're hitting a verification page, not the actual content page you intend to scrape. The presence of phrases like "Je bent bijna op de pagina die je zoekt" ("You are almost on the page you are looking for") and the script for Google reCAPTCHA (grecaptcha.render("fundaCaptchaInput", {...})) suggests that the server is serving an intermediary page to verify that the request comes from a real user, not an automated script.
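
Since requests gets through while aiohttp is served the verification page, one low-effort workaround is to keep the asyncio structure but delegate the HTTP call to requests in a worker thread. A sketch (the headers dict stands in for config.header above); note it only helps as long as requests keeps passing the bot check, which is not guaranteed:

import asyncio

import requests


async def fetch(url: str, headers: dict) -> str:
    # Run the blocking requests call in a thread so the event loop stays free;
    # per this thread, requests' request fingerprint passes where aiohttp's does not.
    response = await asyncio.to_thread(requests.get, url, headers=headers)
    response.raise_for_status()
    return response.text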

neighbourhood name = na

In all cases seen so far, the neighbourhood name in the raw data result is NA; however, the neighbourhood name can be found in the 'zip code' field.
[screenshot: neighbourhood name appearing in the zip code field]
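
A possible post-processing split, assuming the raw field looks like "1234 AB Neighbourhood-Name" (the format is inferred from the screenshot, not verified):

import re


def split_zip_field(raw: str) -> tuple:
    """Split a raw funda zip field into (postcode, neighbourhood)."""
    # Dutch postcodes are four digits plus two uppercase letters, e.g. "1087 GJ"
    match = re.match(r"(\d{4}\s?[A-Z]{2})\s*(.*)", raw.strip())
    if not match:
        return raw, "na"
    postcode, neighbourhood = match.groups()
    return postcode, neighbourhood if neighbourhood else "na"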

Error '3 weken' on feature/funda_update branch

ERROR:

Traceback (most recent call last):
  File "pandas/_libs/tslib.pyx", line 616, in pandas._libs.tslib.array_to_datetime
TypeError: invalid string coercion to datetime for "3 weken" at position 1
...
  File "/opt/homebrew/lib/python3.11/site-packages/dateutil/parser/_parser.py", line 643, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 3 weken present at position 1

CODE TO FIX ERROR:

from datetime import datetime, timedelta
from typing import Union

from dateutil.parser import parse


def clean_list_date(x: str) -> Union[datetime, str]:
    """Transform the date from string to a datetime object."""

    def delta_now(d: int):
        t = timedelta(days=d)
        return datetime.now() - t

    weekdays_dict = {
        "maandag": "Monday",
        "dinsdag": "Tuesday",
        "woensdag": "Wednesday",
        "donderdag": "Thursday",
        "vrijdag": "Friday",
        "zaterdag": "Saturday",
        "zondag": "Sunday",
    }

    try:
        if x.lower() in weekdays_dict:
            # Map a Dutch weekday name to the most recent matching date
            date_string = weekdays_dict.get(x.lower())
            parsed_date = parse(date_string, fuzzy=True)
            delta = datetime.now().weekday() - parsed_date.weekday()
            return delta_now(delta)
        elif (
            x.find("€") != -1
            or x.find("na") != -1
            or x.find("Indefinite duration") != -1
        ):
            return "na"
        elif x.find("month") != -1:
            return delta_now(int(x.split("month")[0].strip()[0]) * 30)
        elif "weken" in x:  # handle the Dutch "X weken" format
            return delta_now(int(x.split(" ")[0]) * 7)
        elif x.find("Today") != -1 or x.find("Vandaag") != -1:
            return delta_now(1)
        elif x.find("day") != -1:
            # Was x.split("month"), which always raised ValueError for "X days"
            return delta_now(int(x.split("day")[0].strip()))
        else:
            return datetime.strptime(x, "%d %B %Y")

    except ValueError:
        return x
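
With this patch, clean_list_date("3 weken"), for example, resolves to a datetime roughly 21 days in the past instead of raising the ParserError shown above.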

ADDITIONALLY

  • The columns room, bedroom, bathroom, has_balcony, and has_garden return 0 for all rows.

Multiprocessing issue

I get this issue when I run it as a .py script

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

    return Popen(process_obj)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

  0%|

Extract all the links, but get a smaller number of results and not from the selected city

Hi Chien, thank you very much for updating the code!

I downloaded the latest version and executed it. My parameters were the following:

scraper = FundaScraper(area="amsterdam", want_to="buy", find_past=False, n_pages=275)

Although the code extracts 4125 links, the CSV output contains only 1615 rows. Of those 1615 data points, around 97% of the postal codes are not from Amsterdam but from other cities.

Do you have any insights on what may be causing this?

Also, when running the code, I encountered an error:

requests.exceptions.SSLError: None: Max retries exceeded with url: /koop/[random listing].

I was able to fix this by adding a delay in the scrape_one_link method:

def scrape_one_link(self, link: str) -> List[str]:
    """Scrape all the features from one house item given a link."""

    # Initialize for each page
    response = requests.get(link, headers=config.header)
    time.sleep(3)  # Add the delay here; adjust the duration as needed.
    soup = BeautifulSoup(response.text, "lxml")

No data scraped (0 houses found in 0 pages)

Hi @whchien ,

Many thanks for your efforts in updating the script. I now no longer receive any error when executing it (see below):

2023-08-06 15:10:35,394 - INFO - *** Phase 1: Fetch all the available links from all pages *** (scrape.py:107)
0%| | 0/276 [00:00<?, ?it/s]
2023-08-06 15:10:35,486 - INFO - *** Got all the urls. 0 houses found in 0 pages. *** (scrape.py:118)
2023-08-06 15:10:35,486 - INFO - *** Phase 2: Start scraping results from individual links *** (scrape.py:183)
0it [00:00, ?it/s]
2023-08-06 15:10:35,493 - INFO - *** All scraping done: 0 results *** (scrape.py:195)
2023-08-06 15:10:35,493 - INFO - *** Cleaning data *** (scrape.py:229)
Empty DataFrame
Columns: [house_id, city, house_type, building_type, price, price_m2, room, bedroom, bathroom, living_area, energy_label, has_balcony, has_garden, zip, address, year_built, house_age, date_list, ym_list, year_list, descrip]
Index: []

My Python script looks as follows:

from funda_scraper import FundaScraper

scraper = FundaScraper(area="amsterdam", want_to="buy", find_past=False, n_pages=3)
df = scraper.run(raw_data=False)
print(df.head())

No clue what's wrong (it does seem to fetch the URLs?).

scraping not progressing

I am running this funda scraper in Python IDLE, and after it starts the second phase, the script freezes.

[screenshot: progress bar stuck at the start of phase 2]

Is this because Funda uses anti-scraping scripts on their website? Or because of something else? Does it still work for you?

Greetings,

Jonathan
