antar1011 / onix
The Pokemon Data-Mining Package
Home Page: http://onix.readthedocs.io/en/latest/
License: GNU General Public License v3.0
Battles in non-6v6 and non-singles formats are exempt from being filtered out as "early forfeits." The logic to check this needs to be implemented.
It's not clear to me whether it's compiling the functions to C or somehow running Python from within SQL, but it's definitely worth digging into to make sure that weight-calculation isn't a significant performance drain.
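If this is going through sqlite3's create_function, it's the latter: the Python callable gets invoked once per row from inside the query, with no compilation to C involved. A minimal sketch with the stdlib driver (the weight function here is an illustrative stand-in, not the real calculation):

import math
import sqlite3

def victory_chance(rating, baseline=1630.0):
    # illustrative stand-in for the real weighting calculation
    return (math.erf((rating - baseline) / (130.0 * math.sqrt(2))) + 1.0) / 2.0

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE players (pid TEXT, rating REAL)')
conn.execute("INSERT INTO players VALUES ('alice', 1695.0)")

# register the Python callable as a one-argument SQL function...
conn.create_function('weight', 1, victory_chance)

# ...and call it from SQL -- Python runs once per row
print(conn.execute('SELECT pid, weight(rating) FROM players').fetchall())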
For testing / dev purposes, or just in general, a JsonFileLogReader should be able to pull date information from somewhere besides the folder where the log is stored. Proposed solution: give JsonFileLogReader a date field that it can fall back to if parsing the folder structure doesn't make sense.
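A sketch of that fallback (the date field comes from this proposal; the class body and the YYYY-MM-DD folder convention are illustrative):

import datetime
import os

class JsonFileLogReader(object):  # simplified sketch, not the real class

    def __init__(self, date=None):
        self.date = date  # proposed fallback value

    def parse_date(self, log_path):
        # normally the date comes from the enclosing folder, e.g. .../2016-09-15/
        folder = os.path.basename(os.path.dirname(log_path))
        try:
            return datetime.datetime.strptime(folder, '%Y-%m-%d').date()
        except ValueError:
            if self.date is None:
                raise  # no fallback available, so the original error stands
            return self.date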
Just noting that things are dicts is a cop-out. For all the complex data structures (PS data, DAO data...), document the expected structure. There was an ASCII-based format that I really liked. I'll post an example when I find it again.
Right now moves.json is just move names (js2py couldn't parse the whole file). Now that we've ditched js2py, we can pull the full movedex.
js2py chokes on arrow functions and is slow as balls.
This will work much better.
WTF
The old way of doing stats counting was to double-count Pokemon that appeared on a team twice. Based on the theories behind how we compute usage-based cutoffs, this is wrong: we should be counting the number of teams that have a given Pokemon. For tiers with species clause, this isn't an issue.
The current reporting system has the DAO pull usage for all formes separately, and then the report-generator is responsible for combining formes. This means that the system will not work for metagames without species clause--the DAO needs to be responsible for combining formes.
Ref: http://www.smogon.com/forums/threads/how-should-we-be-counting-for-usage-stats.3582996/
If a player's rating is missing, it's currently* inferred to be 1500±130 (default rating). The old scripts are smarter than that:
Note that this is an "enhancement" not a "feature" because it's not a high priority--I don't think missing ratings are all that common any more (worth checking!), but also the long-term plan is to do player rating calculation locally (not rely on PS).
*again, I'm entering this ticket before the code it refers to has actually been merged to master or even put up in a PR, but yet again, I'm highly confident this code is going to make it in, so entering the ticket now means I won't have to remember later
Create a data store suitable for the 0.1 release that can implement the various sinks and daos. The goal is to stand up something simple to deploy and debug, so it could be a file-based system, but I'm really hoping I can do it with SQLite.
Generate it the same way the accessible_formes resource was generated (might actually want to combine scripts).
Charizard-Mega-Y is not listed among the formes for a Charizard / Charizard-Mega-Y holding a charizarditey.
Say you have two Lv. 36 Charizard-Mega-X. Both are neutral-natured, flawless. The first has 8 EVs in Defense, the second has none. Both would currently be represented as having a dfn stat of 96.
Here's the problem: now let's say you want to determine the dfn stat of the Pokemon when in base forme: the first Pokemon should have a dfn of 73, the second, 72.
Representing the stats the way I am now (no IVs, EVs or nature) means there's no way to tease apart those two cases.
This is a pretty niche case, as all metagames with mega evolutions are played at at least Lv. 50 (where this ceases to be an issue), but deciding not to support it could have implications for other metas further down the road.
utilities.calculate_stats uses the Gen 2+ formula, so stat calculations for RBY metagames will be wrong.
(again, the code that created this issue has yet to be pushed to master, but I'm confident that I'll be sticking with the strategy I'm executing now, so I'd like to make the issue before forgetting about it)
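For reference, the RBY-era formula differs from the modern one for a non-HP stat (per Bulbapedia; a sketch, not the actual utilities.calculate_stats signature):

import math

def modern_stat(base, iv, ev, level, nature_mod=1.0):
    # the modern formula (used from Gen 3 on, per Bulbapedia)
    return int(((2 * base + iv + ev // 4) * level // 100 + 5) * nature_mod)

def rby_stat(base, dv, stat_exp, level):
    # RBY-era formula: DVs (0-15) and Stat Exp instead of IVs/EVs; no natures
    return ((base + dv) * 2 + int(math.ceil(math.sqrt(stat_exp))) // 4) * level // 100 + 5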
As of bd30fa6, I've removed the foreign key constraint from the teams table. This is somewhat dangerous, because I really do want to require that every "sid" in the teams table corresponds to a set in the moveset table, but leaving the constraint in leads to a few headaches, based on the overall architecture. With the constraint in place:
The dangers of not having the constraint in place are:
Really, it's only the last con and the last pro that I care about, and so far the con of having the constraint outweighs the pro, but if I ever decide to combine the sinks, I should definitely revisit this.
Rather than use real or anonymized logs, write a script to generate mock logs that can be used for testing.
Giant irony to this comment, which caused me to rip out and rewrite the entire SQL backend: forme lists in the DTO get ordered alphabetically upon sanitizing.
It's possible that the policy will be that the ordering doesn't matter (so count "Mega-Charizard-X-Mega-Y" and "Mega-Charizard-Y-Mega-X" together), but that shouldn't be baked in.
All the data extracted from LogReader will need to get routed to the appropriate sinks. Define an interface for each one that should be applicable for a file-based backend, RDBMS or NoSQL.
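For illustration, one possible shape for one of these interfaces (method names are hypothetical; the point is that a file-based backend, an RDBMS, or a NoSQL store can all implement the same contract):

class MovesetSink(object):
    """One possible sink interface (illustrative, not the actual definition)."""

    def store_movesets(self, movesets):
        """Persist a batch of parsed movesets."""
        raise NotImplementedError

    def close(self):
        """Flush anything buffered and release the backend."""
        raise NotImplementedError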
Some functions require four or five PS data lookups or resources, which means a metric ton of arguments to keep track of. What would be better would be to pass around one object that contains these lookups as attributes.
ZenHub should automatically pick up GH Projects, and this way people can track my progress without the plugin.
>>> from onix import contexts
>>> ctx = contexts.get_standard_context()
>>> ctx.sanitizer.sanitize('tentaécruel')
'tentaécruel'
Behaves as expected in Python 2.
Might be time to tackle #36 (it's a little premature to put in this issue now, considering the backend branch isn't even in PR yet, much less merged, but since I'm pretty sure I'm going down the ORM route for now, I figure I should put this in before I lose the thread)
As per SQLAlchemy's documentation, ORM operations are slow compared to SQLAlchemy's Expression Language (or executing the SQL directly), so when it comes time to do performance tuning (0.9 milestone), this might be an avenue to pursue. Note that I can still define my tables using the "declarative" idioms--the underlying tables can be accessed via the __table__ attribute.
On the other hand (and the reason I'm not just starting off writing the backend using Core), 12s for 100k rows doesn't actually sound that bad--if I could parse a million logs in a couple of minutes, that would be an enormous improvement on what I have now. Basically what I'm saying is that even doing things the slow ORM way, there's a good chance this won't be a significant bottleneck.
Bottom line: this is something to keep track of, but 50/50 this gets closed as "won't fix"
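As a concrete illustration of that escape hatch, the declarative class definition and Core-level access can coexist (illustrative model, not the actual onix schema):

from sqlalchemy import Column, String, create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Moveset(Base):  # illustrative model, not the actual onix schema
    __tablename__ = 'movesets'
    sid = Column(String, primary_key=True)
    species = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

# skip the ORM's unit-of-work machinery: grab the underlying Table
# and insert through the Expression Language directly
with engine.begin() as conn:
    conn.execute(Moveset.__table__.insert(), [{'sid': 'abc123', 'species': 'zapdos'}])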
appnope is an OSX-only thing
Really, it would be best not to autogenerate those conda env yamls and just keep track of what packages I end up installing. That way we can run tests just by running py.test.
Here's the code on PS:
Tools.prototype.getId = function (text) {
    if (text && text.id) {
        text = text.id;
    } else if (text && text.userid) {
        text = text.userid;
    }
    if (typeof text !== 'string' && typeof text !== 'number') return '';
    return ('' + text).toLowerCase().replace(/[^a-z0-9]+/g, '');
};
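For reference, a rough Python equivalent of that sanitization (a sketch, not onix's actual sanitizer):

import re

def get_id(text):
    # mirrors Tools.prototype.getId: lowercase, then strip everything
    # that isn't a-z or 0-9 (so 'Tentaécruel' -> 'tentacruel')
    if isinstance(text, dict):
        text = text.get('id') or text.get('userid') or text
    if not isinstance(text, (str, int, float)):
        return ''
    return re.sub(r'[^a-z0-9]+', '', str(text).lower())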
Before Onix can be released, we need the documentation to be complete and up-to-date. Use Sphinx to auto-generate the docs, make sure they look good, publish to RTD, and make sure you understand how to handle versioning.
Create a 0.1 release in GitHub and RTD
The ideal way to support this would be to feed the LogReader a generation-specific pokedex. I'll have to see how difficult that would be to generate based on how PS does it.
I think there are some custom metagames that monkey with base stats as well. Again, I should see how PS handles them.
If there's a ladder error, p1rating and p2rating keys get dropped from the logs and there's a ladderError flag.
Define DAO interfaces and create methods for generating plaintext (not the "chaos" JSON yet) detailed "moveset" usage reports like this one.
rptime gives the timestamp of the last rating update (which should be, at earliest, midnight of the current day). format is a key in the dict as well.
This means that conceivably these are not parameters that need to be statically sent to the reader. The question is whether (1) I fully understand how rptime works and (2) whether I want the system to be dependent on those keys being there.
Lines 106-107 in generate_forme_change_rules.py are in reverse order. Maybe write a test?
Set up TravisCI as Onix's automated testing environment. For now, we just need to make sure it works with Python 2.7 and 3.5 on OSX and Linux. In the future, we'll use Travis to test as many Python versions and ecosystems as possible.
And I'd previously decided to make pypy compatibility a requirement.
The workflow I envisioned was to have separate log readers for each metagame, but now I'm thinking that's probably not necessary. Obviously, there need to be different readers for different Contexts, but otherwise, the same LogReader should be able to work with OU and UU and PU and LC (and probably even Balanced Hackmons)
Consult with PS about which unknown values in logs just get ignored (e.g. unknown natures get processed as Hardy/Serious, IIRC) and ignore them rather than raising an error.
pypy can provide amazing speedups compared to regular python. On the other hand, it's incompatible with a lot of C-based packages (e.g. py_mini_racer, ujson and numpy/scipy/pandas, though hopefully I won't need any of those...).
So the question is: how big a bump can I get in performance from using pypy? The current scripts see limited gain. Is python execution even going to be the bottleneck? Will it actually run faster using pypy? Can I achieve similar gains using numba?
I should consider dropping the pypy req for now and revisiting it once the system is in an MVP state, where I can do profiling to answer these questions.
ujson can be massively faster than the native json library, although it's not compatible with all JSON data. Determine whether we can use ujson.
Note that ujson will not be compatible with pypy.
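If it does pan out, the usual pattern is a guarded import so the stdlib remains a fallback:

try:
    import ujson as json  # C-backed, much faster on big log files
except ImportError:
    import json  # stdlib fallback, e.g. when running under pypy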
Important for stuff like ReadTheDocs deployment and cutting releases in general.
Look into versioneer.
I'd like to test that if there's an error in performing an insert, the whole transaction gets rolled back, but right now I can't think of a way to force the sink to generate an invalid insertion expression.
I'm explicitly shying away from numpy/scipy packages (if I decide I want to do statistical modeling, that'll have to go in a different project), so there's little advantage to conda other than "that's what I know." On the other hand, tox is awesome for automatic testing across virtual environments.
Write a test that verifies the correct operation of the entire pipeline, from collection to reporting.
It doesn't have to be a "diabolical" test--just testing normal, easy-to-validate operation is fine.
Right now the only two error-handling options are:
'''
* "raise" : raise an exception if an error is encountered.
* "skip" : silently skip problematic logs
'''
This means that either one bad log ruins the whole batch or that bad logs don't get reported.
"Robust error handling" is currently targeted for the 0.7 release, but in order to get through "a whole month of logs" this will need to be dealt with sooner (0.4 release).
For now, I propose just a "warn" strategy that prints the error message but keeps chugging.
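A minimal sketch of that third strategy (the helper and its arguments are hypothetical):

import warnings

def handle_log_error(error, log_ref, error_handling):
    if error_handling == 'raise':
        raise error
    elif error_handling == 'warn':
        # report the problem but keep chugging through the batch
        warnings.warn('Problem processing {0}: {1}'.format(log_ref, error))
    # 'skip': silently move on to the next log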
Currently the integration test throws the following error:
E sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
(it goes on for quite some time, but it's referring to the giant CASE clause that is the species lookup)
This is a known limitation with SQLite, where one is limited to 999 "?" variables. Ref: https://www.sqlite.org/limits.html
Upon further investigation, I see that the insert expressions are going to have issues as well for any large-scale operations (though this can be mitigated by setting batch_size really low).
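As a preview of the second element of the solution below, here's the difference between the two insert syntaxes (illustrative table, not the actual onix schema; note the rows become dicts):

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine('sqlite://')
metadata = MetaData()
teams = Table('teams', metadata,
              Column('tid', String), Column('idx', Integer), Column('sid', String))
metadata.create_all(engine)

rows = [{'tid': 't1', 'idx': 0, 'sid': 's1'},
        {'tid': 't1', 'idx': 1, 'sid': 's2'}]

with engine.begin() as conn:
    # current syntax: one multi-row VALUES clause -- one bound "?" per cell,
    # so a large batch_size blows through the 999-variable cap
    conn.execute(teams.insert().values(rows))

    # proposed syntax: executemany -- the statement is prepared once and
    # parameters are bound per row, so the cap applies per row, not per batch
    conn.execute(teams.insert(), rows)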
So the solution has two elements:
* Get rid of the giant species-lookup CASE expression (there's an alternative where it only combines formes and handles weird cases like "Mr. Mime" and then the reporting DAO sanitizes the rest, but that adds a lot of complexity).
* Use the .execute(insert(), values) syntax rather than the current .execute(insert().values(values)) syntax. AFAICT, this also necessitates storing the rows as dicts rather than tuples. This is, honestly, a better syntax, as it'll be clearer from the tests what's being tested than how it is now where I'm randomly selecting by index. But it's also going to be a PITA to rewrite all the tests.

Make a function that handles all this nonsense:
import json

from onix import scrapers

try:
    pokedex = json.load(open('.psdata/pokedex.json'))
except IOError:
    pokedex = scrapers.scrape_battle_pokedex()
Should also handle resources that don't get scraped.
Create DAO interface and report-generation method for generating the ranked list of abilities for the Pokemon in a given metagame in a given month at a given weighting-baseline
A LogProcessor is an object to handle--at a high level--the reading of logs and the routing of the parsed data to the appropriate sinks. It also handles any errors that might arise.
A LogProcessor is initialized with the appropriate sinks and references to the logs that need processing, and it's the LogProcessor's responsibility to create the appropriate LogReaders.
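A rough skeleton of that contract (the sink names and method signatures are hypothetical; only LogProcessor, LogReader and the overall routing come from this description):

class LogProcessor(object):
    """Reads logs and routes the parsed data to the appropriate sinks."""

    def __init__(self, moveset_sink, battle_info_sink, log_refs):
        self.moveset_sink = moveset_sink
        self.battle_info_sink = battle_info_sink
        self.log_refs = log_refs

    def process_logs(self, error_handling='raise'):
        for ref in self.log_refs:
            reader = self._get_log_reader(ref)  # pick the right LogReader
            try:
                battle_info, movesets = reader.parse_log(ref)
            except Exception:
                if error_handling == 'raise':
                    raise
                continue  # 'skip': move on to the next log
            self.battle_info_sink.store_battle_info(battle_info)
            self.moveset_sink.store_movesets(movesets)

    def _get_log_reader(self, ref):
        raise NotImplementedError  # dispatch on metagame / log source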
Right now sinks are very much set up to store data when you tell them to, in that:
A better way to structure this would be to have the "store" methods return nothing, define "flush" and "close" methods, and maybe have THOSE return the total number of objects stored during the session.
At this point, the task is trivial, since the only sink implementations are stubs, but if I put this off for a later release (since this doesn't need to be optimized for the 0.1 release), the effort value will likely increase.
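Under the proposed semantics, a buffering sink might look roughly like this (buffer details and the backend call are illustrative):

class BufferedMovesetSink(object):

    def __init__(self, backend, batch_size=1000):
        self.backend = backend
        self.batch_size = batch_size
        self.buffer = []
        self.total_stored = 0

    def store_movesets(self, movesets):
        # "store" just buffers and returns nothing
        self.buffer.extend(movesets)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.backend.write(self.buffer)  # hypothetical backend call
        self.total_stored += len(self.buffer)
        self.buffer = []
        return self.total_stored  # running total for the session

    def close(self):
        return self.flush()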
The method reporting.reports.generate_usage_stats should pull information from a dao.ReportingDao interface and generate a standard Smogon usage stats report like this one.
We want total battles, average weight / team and usage %. We don't need "Raw" or "Real" for now.
Through this process it should become clearer what sorts of data we need the ReportingDao to return.
Note that combining of formes and mapping of formes to printable names is in the scope of this method (read: not the dao).
Create DAO interface and report-generation method for generating the ranked list of items for the Pokemon in a given metagame in a given month at a given weighting-baseline