
tgscrape's Introduction

tgscrape

Quick and dirty public Telegram group message scraper

Usage

To dump messages from a public group

$ python3 tgscrape.py <groupname> [minid] [maxid]

Examples

To dump all messages in the group fun_with_friends type:

$ python3 tgscrape.py fun_with_friends

You can specify the message IDs at which to start and stop. For instance, to dump messages with IDs 1000 through 2000, type:

$ python3 tgscrape.py fun_with_friends 1000 2000

If you want to start at message id 1000 and dump all messages after it, just skip the last parameter:

$ python3 tgscrape.py fun_with_friends 1000
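The optional `minid` and `maxid` arguments can be read as an ID range that is open at either end. The following is a minimal sketch of that interpretation, not the script's actual argument handling; the function name `parse_args` and the error messages are assumptions made for illustration.

```python
def parse_args(argv):
    """Split argv into (groupname, minid, maxid); omitted IDs become None."""
    if not 2 <= len(argv) <= 4:
        raise SystemExit("usage: tgscrape.py <groupname> [minid] [maxid]")
    groupname = argv[1]
    # A missing minid means "start from the first message" (assumed).
    minid = int(argv[2]) if len(argv) > 2 else None
    # A missing maxid means "continue to the newest message" (assumed).
    maxid = int(argv[3]) if len(argv) > 3 else None
    if minid is not None and maxid is not None and minid > maxid:
        raise SystemExit("minid must not be greater than maxid")
    return groupname, minid, maxid


print(parse_args(["tgscrape.py", "fun_with_friends", "1000"]))
# prints ('fun_with_friends', 1000, None)
```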

Retrieved messages are stored in JSON format in the conversations folder.
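A dump can also be loaded directly from Python for further processing. This is a sketch under assumptions: the helper `load_dump` is hypothetical, and the file name is assumed to be the lower-cased group name with a `.json` suffix, based on the `./conversations/osmindia.json` path that appears in the tool's output quoted in the issues. The structure of the JSON itself is not documented here, so the function simply returns whatever the file contains.

```python
import json
from pathlib import Path


def load_dump(groupname, folder="conversations"):
    """Return the parsed JSON dump for a group (hypothetical helper)."""
    # Assumed naming scheme: <folder>/<lower-cased groupname>.json
    path = Path(folder) / f"{groupname.lower()}.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_dump("fun_with_friends")` would read `conversations/fun_with_friends.json` relative to the current directory and raise `FileNotFoundError` if no dump exists yet.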

To read and search dumped messages

$ python3 tgscape_cli.py <groupname>

The following is the list and description of available commands:

Commands:
    search <terms>              search words or strings (in quotes) in messages and names
    all                         returns all dumped messages
    last <num>                  returns last <num> messages (default: 10)
    date <date>                 returns all messages for a date (format: YYYY-MM-DD)
    wordcloud                   returns the top 20 words (wordlen > 3)
    exit                        exits the program
    help                        this
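The wordcloud command's description (top 20 words, word length greater than 3) can be illustrated with a small frequency count. This is only a sketch of the idea, not the tool's code: the function `top_words`, the `\w+` tokenization, and the lower-casing are assumptions.

```python
import re
from collections import Counter


def top_words(texts, top=20, min_len=4):
    """Count words of at least min_len characters and return the most common."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: runs of word characters, case-insensitive.
        for word in re.findall(r"\w+", text.lower()):
            if len(word) >= min_len:  # "wordlen > 3"
                counts[word] += 1
    return counts.most_common(top)


print(top_words(["hello world hello", "the cat"]))
# prints [('hello', 2), ('world', 1)]
```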

Examples

If you want to search all messages and names containing either "foo" or "bar", type:

> search foo bar

If you want to search all messages and names containing the string "foo bar" type:

> search "foo bar"
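The difference between the two queries above comes down to how the terms are split: unquoted words are separate terms, while a quoted string is one term. A minimal sketch of that behavior follows, using the standard library's `shlex.split`. The message layout (dicts with `name` and `text` keys) and the case-insensitive substring matching are assumptions, not the tool's documented behavior.

```python
import shlex


def parse_terms(query):
    """Split a query into terms, keeping quoted strings together."""
    # shlex.split('foo bar') gives two terms; shlex.split('"foo bar"') gives one.
    return [term.lower() for term in shlex.split(query)]


def search(messages, query):
    """Return messages whose name or text contains any of the terms."""
    terms = parse_terms(query)
    hits = []
    for msg in messages:
        # "name" and "text" are hypothetical field names.
        fields = (msg.get("name", "").lower(), msg.get("text", "").lower())
        if any(term in field for term in terms for field in fields):
            hits.append(msg)
    return hits


msgs = [
    {"name": "alice", "text": "foo only"},
    {"name": "bob", "text": "a foo bar here"},
    {"name": "carol", "text": "nothing"},
]
print(len(search(msgs, "foo bar")))    # prints 2
print(len(search(msgs, '"foo bar"')))  # prints 1
```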

To read all messages written on March 1st, 2018, type:

> date 2018-03-01
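A date filter of this kind can be sketched by parsing the YYYY-MM-DD argument and comparing it with each message's timestamp. The sketch below makes several assumptions: the helper names are hypothetical, messages are dicts with a `datetime` key, and timestamps are ISO 8601 strings such as `2017-12-03T16:54:34+00:00`, the form shown in the scraper's console output. An empty or malformed date is treated as "no match" instead of raising an error.

```python
from datetime import datetime


def parse_date(value):
    """Parse a YYYY-MM-DD string; return None if it does not match."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


def messages_on(messages, day):
    """Return messages whose timestamp falls on the given day."""
    target = parse_date(day)
    if target is None:
        return []
    hits = []
    for msg in messages:
        # "datetime" is a hypothetical field holding an ISO 8601 timestamp.
        stamp = datetime.fromisoformat(msg["datetime"])
        if stamp.date() == target:
            hits.append(msg)
    return hits


msgs = [
    {"datetime": "2018-03-01T10:00:00+00:00", "text": "first"},
    {"datetime": "2018-03-02T09:30:00+00:00", "text": "second"},
]
print(len(messages_on(msgs, "2018-03-01")))  # prints 1
```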

Requirements

BeautifulSoup4

To install dependencies:

$ pip install -r requirements.txt


tgscrape's Issues

Error: list index out of range

I'm guessing this is happening when the program encounters deleted messages?

To reproduce:

python3 tgscrape.py OSMIndia

Output: (irrelevant parts redacted)

> Telegram Public Groups Scraper

[2017-12-03T16:54:34+00:00] Max N (@... ): |Service message|
... just made one
... now you are admin
... Thanks..
...
... do you know how to use Inkscape?
  
Writing to ./conversations/osmindia.json...
ERROR: list index out of range
Exiting...

It errors out after the 5th or so message. Increasing max_err in config.py doesn't help.

This is happening even when supplying start and end numbers in cases where there was something amiss in between.

This is an awesome project - kudos to the author. We really need to ensure the conversations happening today make their way to the open internet where someone a decade later who doesn't know anyone is able to find relevant material amongst these chats.

Question about output

Hello,
Is it possible to export the results of tgscape_cli.py into a csv file?
Thanks!
Rich
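One possible approach, offered as a sketch only: since the dumps are JSON files, they can be converted to CSV with the standard library, independently of the CLI. The function `dump_to_csv` is hypothetical, and it assumes the dump is a list of flat message dicts, which is not confirmed anywhere in this document. Columns are taken from whatever keys the messages contain.

```python
import csv
import json


def dump_to_csv(json_path, csv_path):
    """Write a JSON dump (assumed: list of dicts) to CSV; return the row count."""
    with open(json_path, encoding="utf-8") as fh:
        messages = json.load(fh)
    # Collect column names in first-seen order across all messages.
    fieldnames = []
    for msg in messages:
        for key in msg:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(messages)
    return len(messages)
```

A call such as `dump_to_csv("conversations/fun_with_friends.json", "fun_with_friends.csv")` would then produce one row per message. If the real dump has a different shape, the loading step needs to be adapted.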

Errors

I am getting 500 errors every now and then

and also this after restart:

ERROR: time data '' does not match format '%Y-%m-%d'

Output JSON is empty.

When I run this:

python .\tgscrape.py "fun_with_friends"

this is what I get:

> Telegram Public Groups Scraper

[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]

Writing to ./conversations/fun_with_friends.json...
ERROR: 0
Exiting...

Is something outdated or so?
