asagi's People

Contributors

eksopl, nattofriends, oohnoitz

asagi's Issues

Possible Issue?

There may be an issue with purging old threads from the queue. It happens when 4chan goes offline for a few hours and then returns: asagi resumes fetching as usual, but the thread count rises and settles at a new, higher number. The counter usually idles at about 160 threads for most boards, so I'm not sure whether the old threads are still in the queue.

[sp 283 0 0 0 0] 22613441: got HTTP status 502
[a 272 14 0 0 0] 67856285: got HTTP status 502
[v 345 0 0 0 0] 145197791: got HTTP status 502

Document everything

Document the everliving shit out of Asagi's source. It should be good enough to generate Javadoc out of it.

Save raw data when necessary

When asagi fails to parse a post or an entire thread, it should save the raw HTML/JSON so that you can either fix the parser or fix the HTML/JSON manually, and then reparse the saved data when asagi restarts.
The same should happen when the DB is dead or busy.
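
A minimal sketch of how such a raw dump could be written, assuming a `<root>/<board>/<threadNum>.<ext>` layout. The `RawDumpStore` class and its layout are hypothetical and not part of asagi:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of asagi: keeps an unparseable payload for later re-parsing.
class RawDumpStore {
    private final Path root;

    RawDumpStore(Path root) {
        this.root = root;
    }

    // Writes the raw HTML/JSON to <root>/<board>/<threadNum>.<ext> and returns the file path.
    Path save(String board, long threadNum, String ext, String rawBody) {
        try {
            Path dir = Files.createDirectories(root.resolve(board));
            Path file = dir.resolve(threadNum + "." + ext);
            Files.write(file, rawBody.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The parser's catch block could call `save(...)` before moving on, and a restart could walk the same directory to retry the stored payloads.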

Error/Panic Log

There should be a log containing all errors and exceptions that occur.

Flags

It seems that moot has added flags on a few boards. It might be nice to archive these for completeness. Since we have the poster_ip field, it might be easier to populate that field with a dummy IP located in the same country. The other option is adding another field, but doing that on large boards isn't fun.

Logging Fatal Exceptions

It seems that fatal exceptions aren't logged into the debug file when asagi crashes. This makes it hard to find and report issues when asagi is restarted and floods the console.
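
One possible approach, sketched here under assumptions, is an uncaught-exception handler that appends the stack trace to a file before the JVM exits. The `FatalLogger` class is hypothetical, and asagi's actual debug-file handling may differ:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical handler, not asagi's own: records fatal exceptions in a log file.
class FatalLogger implements Thread.UncaughtExceptionHandler {
    private final Path logFile;

    FatalLogger(Path logFile) {
        this.logFile = logFile;
    }

    // Renders the thread name and the full stack trace as one log entry.
    static String format(Thread t, Throwable e) {
        StringWriter sw = new StringWriter();
        sw.append("Exception in thread: \"").append(t.getName()).append("\" ");
        e.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        String entry = format(t, e);
        System.err.print(entry);
        try {
            // Append so that earlier crashes are kept across restarts.
            Files.write(logFile, entry.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException io) {
            System.err.println("Could not write to " + logFile + ": " + io);
        }
    }
}
```

It could be installed once at startup with `Thread.setDefaultUncaughtExceptionHandler(new FatalLogger(path))`, where `path` is whatever debug file is in use.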

Exception during 4chan downtime

The dumper died while 4chan was down with the following:

Exception in thread: "Page scanner 1 - mlp" java.lang.IllegalStateException: No match found
    at java.util.regex.Matcher.group(Matcher.java:485)
    at net.easymodo.asagi.Yotsuba.newYotsubaPost(Yotsuba.java:208)
    at net.easymodo.asagi.Yotsuba.parsePost(Yotsuba.java:401)
    at net.easymodo.asagi.Yotsuba.getPage(Yotsuba.java:442)
    at net.easymodo.asagi.Board.content(Board.java:14)
    at net.easymodo.asagi.Dumper$PageScanner.run(Dumper.java:284)
    at java.lang.Thread.run(Thread.java:722)
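
`Matcher.group()` throws `IllegalStateException: No match found` when it is called without a preceding successful match, which is plausibly what happens when an error page is served during downtime. A small sketch of guarding the call follows; the pattern and the `SafeMatch` class are made up for illustration and are not asagi's real regexes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafeMatch {
    // Hypothetical pattern for illustration only; asagi's real patterns differ.
    private static final Pattern POST_NUM = Pattern.compile("id=\"p(\\d+)\"");

    // Returns the captured post number, or null when the input does not match.
    static String findPostNum(String html) {
        Matcher m = POST_NUM.matcher(html);
        // group() is only legal after find() (or matches()) has returned true.
        return m.find() ? m.group(1) : null;
    }
}
```

A caller that receives `null` can skip the page and retry later instead of letting the scanner thread die.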

NullPointerException without rhyme or reason

Every so often, within 10 minutes or so, I get:

Exception in thread: "Topic fetcher #0 - b" java.lang.NullPointerException
    at net.easymodo.asagi.YotsubaJSON.getThread(YotsubaJSON.java:144)
    at net.easymodo.asagi.AbstractDumper$TopicFetcher.run(AbstractDumper.java:288)
    at java.lang.Thread.run(Thread.java:722)
Terminating dumper due to unexpected exception.
Please report this issue if you believe it is a bug.

Like the error reports, I have no dump to post, no possible evidence to support this. Is there something I can do?
I'm running JDK 1.7, if that's any help.

Move post/thread regex patterns to separate file

Move the regexes hardcoded in the source code to a separate regex file.

Perhaps with a script that generates said file from fuuka. Alternatively, make fuuka able to read the regex definitions from the same file.

Integrate pngquant

Altering images arguably goes against the ideology of an archive, but it could save a serious amount of storage: almost any .png can be reduced by about 50% on average, with a nearly imperceptible loss of quality.

What if asagi fetched the image and, if it was a PNG, compressed it and then inserted the new hash into the DB for FF? It looks like something that could be done.

http://pngquant.org/ for more info.

4chan doesn't urlencode email causing fetcher to die

On /vg/ an email appeared that killed the fetcher: #-Û²O2hI%. 4chan either isn't URL-encoding the email or isn't doing it properly. Right now it only applies the htmlspecialchars() function, which merely keeps the value from breaking out of the href="email:".

I have solved this by catching the following exception:

error

Import Saved HTML

This subject was discussed earlier as a joke, but it seems like it might be something reasonable.

For Developers:
It would allow us to test the fetcher against HTML code for debugging purposes. We usually test the fetcher by running it against an entire board to see if it works, but there are always some special cases that we need to test for, and we never know whether the thread we want to test will 404. This feature would allow us to test against saved or modified HTML code to ensure that we catch all of the bugs/issues/problems.

For Maintainers:
It is a bit minor, but it would be nice to be able to import original HTML code with the fetcher itself. The community would often be able to provide maintainers with old threads or missing threads saved manually. Therefore, it would be nice if we had an import feature to add the missing information easily.


I would leave the exact implementation up to you, but my suggestion would be some type of "watch" directory within the folder containing asagi. asagi would monitor it, parse the entire "watch" directory for threads, and import them accordingly.

/home/asagi
`-- watch
    `-- 4chan
        |-- a
        |   1000000.html
        |   1000001.html
        `-- jp
            1000001.html
            1000002.html
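
A rough sketch of how the directory layout above could be scanned, assuming thread files are named `<threadNum>.html` and sit directly inside a board folder. `WatchDirScanner` is a hypothetical name, and the actual import step is left out:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical scanner, not part of asagi.
class WatchDirScanner {
    // Returns sorted "board/threadNum" entries for every <root>/<site>/<board>/<num>.html file.
    static List<String> scan(Path watchRoot) {
        try (Stream<Path> paths = Files.walk(watchRoot)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().matches("\\d+\\.html"))
                    .map(p -> p.getParent().getFileName() + "/"
                            + p.getFileName().toString().replace(".html", ""))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each returned entry identifies a board and thread number that could then be handed to the existing parser. Files that do not match the naming scheme are ignored.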

PostgreSQL support

It should be pretty easy to do this on this end. The heavy lifting involved is pretty much just writing the triggers and procedures for updating _threads, _images and _daily.

Support in fuuka should also be easy enough, given that part of the work is already done.

Also, Connector/J is GPL, so it forces asagi to be GPLv2 encumbered. Other dependencies are Apache and BSD, with one LGPL. But since I am considering asagi to be a fuuka derivative, due to it being a completely non-cleanroom reimplementation of fuuka's dumper side, it's already dual GPLv1/Artistic License, so not much of a loss there.

Die gracefully on fatal errors

Rather than letting the dumper continue in some morbid state missing threads, we should quit altogether once a thread blows up, so the user can acknowledge the dumper died for some reason, usually the fact that they launched asagi with insufficient heap memory (-Xmx flag).

Input JSON settings via STDIN

I need to be able to pass in the JSON settings without creating a file, so that there is no risk of exposing an asagi.json containing the database info on a badly set up server.
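
As a sketch of the reading side only, the settings text could be pulled from any `InputStream`, with `System.in` being one option. `StdinSettings` is a hypothetical helper, and how asagi would select STDIN over a file is left open:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of asagi.
class StdinSettings {
    // Reads the whole stream as UTF-8 text; the result would be handed to the JSON parser.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `StdinSettings.readAll(System.in)` would block until the input is closed, so the settings could be piped in by another process without ever touching the disk.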

Kusaba support

Some kind of support for the HTML that Kusaba clones output.

Ugh.

Why won't people just let some things die.

Exception YotsubaJSON.getAllThreads

Exception in thread: "Threadlist fetcher - q" com.google.gson.JsonSyntaxException: java.lang.NumberFormatException: Expected an int but was 4294967295 at line 1 column 1976
        at com.google.gson.internal.bind.TypeAdapters$7.read(TypeAdapters.java:232)
        at com.google.gson.internal.bind.TypeAdapters$7.read(TypeAdapters.java:222)
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
        at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
        at com.google.gson.internal.bind.ArrayTypeAdapter.read(ArrayTypeAdapter.java:72)
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
        at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
        at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
        at com.google.gson.internal.bind.ArrayTypeAdapter.read(ArrayTypeAdapter.java:72)
        at com.google.gson.Gson.fromJson(Gson.java:795)
        at com.google.gson.Gson.fromJson(Gson.java:761)
        at com.google.gson.Gson.fromJson(Gson.java:710)
        at com.google.gson.Gson.fromJson(Gson.java:682)
        at net.easymodo.asagi.YotsubaJSON.getAllThreads(YotsubaJSON.java:174)
        at net.easymodo.asagi.DumperJSON$BoardPoller.run(DumperJSON.java:46)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NumberFormatException: Expected an int but was 4294967295 at line 1 column 1976
        at com.google.gson.stream.JsonReader.nextInt(JsonReader.java:602)
        at com.google.gson.internal.bind.TypeAdapters$7.read(TypeAdapters.java:230)
        ... 16 more
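
The value 4294967295 is 2^32 - 1, which exceeds `Integer.MAX_VALUE` (2147483647), so it cannot be read as an `int`. Which JSON field carried it is not shown in the trace. One likely remedy is to declare the affected field as `long`. The sketch below only illustrates the range issue; `UnsignedField` is a hypothetical name:

```java
// Hypothetical helper, not part of asagi or Gson.
class UnsignedField {
    // Parses a numeric literal that may exceed Integer.MAX_VALUE but fits in 32 unsigned bits.
    static long parseUnsigned32(String literal) {
        long value = Long.parseLong(literal);
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new NumberFormatException("Out of unsigned 32-bit range: " + literal);
        }
        return value;
    }
}
```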

403 Forbidden?

I've seen the issue crop up on /q/ as well with people using things other than asagi, but I think 4chan now requires accept headers, too?

[e 164 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[h 160 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[s 165 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[e 164 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[h 160 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[s 165 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[e 164 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[h 160 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[s 165 0 0 0 0] threads.json: HTTP error: Forbidden (403)
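
If the guess about required headers is right, setting them explicitly on the request might help. The following is a sketch using `HttpURLConnection`; the header values and the `HeaderedRequest` class are assumptions, not something 4chan documents here:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of asagi.
class HeaderedRequest {
    // Opens (but does not yet connect) a request with explicit Accept and User-Agent headers.
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Assumed values for illustration; adjust to whatever the server expects.
            conn.setRequestProperty("Accept", "application/json, text/html;q=0.9, */*;q=0.8");
            conn.setRequestProperty("User-Agent", "asagi-example/0.0");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

No network traffic happens until the caller reads the response, so the headers can be inspected or changed beforehand.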

Search by ID

I think it'd be a great feature if there was a search by ID function.
Also great work on the imageboard, it looks very nice!

URLDecoder: Incomplete trailing escape (%) pattern

Exception in thread: "Topic fetcher #2 - tg" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
        at java.net.URLDecoder.decode(URLDecoder.java:187)
        at net.easymodo.asagi.WWW.doCleanLink(WWW.java:150)
        at net.easymodo.asagi.YotsubaJSON.cleanLink(YotsubaJSON.java:230)
        at net.easymodo.asagi.YotsubaJSON.makePostFromJson(YotsubaJSON.java:195)
        at net.easymodo.asagi.YotsubaJSON.getThread(YotsubaJSON.java:145)
        at net.easymodo.asagi.Board.content(Board.java:18)
        at net.easymodo.asagi.Dumper$TopicFetcher.run(Dumper.java:442)
        at java.lang.Thread.run(Thread.java:722)
Terminating dumper due to unexpected exception.
Please report this issue if you believe it is a bug.

I wasn't able to track down which post contained the trailing % in the email field. However, I suggest one of the following fixes.
if (link.endsWith("%")) link = link + "25"; (forces URLDecoder to make the last character %)
if (link.endsWith("%")) link = link.substring(0, link.length() - 1); (just trims the trailing %)
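
Both suggestions above only cover a lone `%` at the very end. An input ending in `%2`, or containing `%zz`, would also make `URLDecoder.decode` throw `IllegalArgumentException`. A third option, sketched here with a hypothetical `LenientDecoder` class, is to fall back to the undecoded string whenever decoding fails:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of asagi.
class LenientDecoder {
    // Decodes a percent-encoded link, returning the input unchanged if it is malformed.
    static String decode(String link) {
        try {
            return URLDecoder.decode(link, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // Malformed escape sequence: keep the raw value rather than crash the fetcher.
            return link;
        }
    }
}
```

Keeping the raw value means the archived email field is stored exactly as 4chan served it, at the cost of not being decoded.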

Uncaught exception for malformed post numbers

Exception in thread: "Topic fetcher #0 - vg" java.lang.NumberFormatException: For input string: "106a1746"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.parseInt(Integer.java:527)
    at net.easymodo.asagi.Yotsuba.parsePost(Yotsuba.java:326)
    at net.easymodo.asagi.Yotsuba.getThread(Yotsuba.java:470)
    at net.easymodo.asagi.Board.content(Board.java:18)
    at net.easymodo.asagi.Dumper$TopicFetcher.run(Dumper.java:435)
    at java.lang.Thread.run(Thread.java:722)
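
`Integer.parseInt` rejects any non-digit input such as "106a1746". A minimal sketch of catching that at the parse site, so a single malformed post number does not kill the fetcher thread, could look like this. `PostNumbers` is a hypothetical helper, not asagi code:

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of asagi.
class PostNumbers {
    // Returns the parsed post number, or an empty result when the text is not a valid int.
    static OptionalInt parse(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

The caller can then log and skip the post when the result is empty.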

Build 19

Proxy support

To be able to archive location restricted imageboards.

Time Zones

This is related to Issue #22.

Since one of asagi's goals is to be extended to other imageboards, there should be a setting to specify the time zone of the server being archived. This would ensure that the stored timestamp is accurate.

Also, are we still going to have timestamps all stored in UTC eventually?
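
For illustration, converting a board-local timestamp to UTC with a configurable zone could be sketched with `java.time` (Java 8+). The `BoardClock` name is hypothetical, and "America/New_York" in the usage below is only an example zone:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical helper, not part of asagi.
class BoardClock {
    // Interprets a wall-clock time in the board's zone and returns UTC epoch seconds.
    static long toUtcEpochSeconds(LocalDateTime boardLocal, ZoneId boardZone) {
        return boardLocal.atZone(boardZone).toEpochSecond();
    }
}
```

For example, `BoardClock.toUtcEpochSeconds(LocalDateTime.of(2013, 1, 1, 0, 0), ZoneId.of("America/New_York"))` yields 1357016400, which is 05:00 UTC on the same day.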

Save States

Implement a save-state feature that writes asagi's state when it is closed. On restart, asagi should continue from that saved state to avoid parsing everything all over again.
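
A very small sketch of what such a state file might hold, assuming the state is just the list of thread numbers still being tracked, stored one per line. What asagi would really need to persist is not specified here, and `QueueState` is a hypothetical name:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of asagi.
class QueueState {
    // Writes one thread number per line, replacing any previous state file.
    static void save(Path file, Collection<Long> threadNums) {
        List<String> lines = threadNums.stream().map(String::valueOf).collect(Collectors.toList());
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the thread numbers back in their saved order, skipping blank lines.
    static List<Long> load(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8).stream()
                    .filter(line -> !line.trim().isEmpty())
                    .map(line -> Long.parseLong(line.trim()))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

`save` could run from a JVM shutdown hook and `load` at startup, before the first board poll.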
