antar1011 / onix
The Pokemon Data-Mining Package
Home Page: http://onix.readthedocs.io/en/latest/
License: GNU General Public License v3.0
Battles in non-6v6 and non-singles formats are exempt from being filtered out as "early forfeits." The logic to check this needs to be implemented.
It's not clear to me whether it's compiling the functions to C or somehow running Python from within SQL, but it's definitely worth digging into to make sure that weight-calculation isn't a significant performance drain.
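If this is going through sqlite3's create_function, it's the latter: the Python callable gets invoked once per row from inside the query, with no compilation to C involved. A minimal sketch with the stdlib driver (the weight function here is an illustrative stand-in, not the real calculation):

import math
import sqlite3

def victory_chance(rating, baseline=1630.0):
    # illustrative stand-in for the real weighting calculation
    return (math.erf((rating - baseline) / (130.0 * math.sqrt(2))) + 1.0) / 2.0

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE players (pid TEXT, rating REAL)')
conn.execute("INSERT INTO players VALUES ('alice', 1695.0)")

# register the Python callable as a one-argument SQL function...
conn.create_function('weight', 1, victory_chance)

# ...and call it from SQL -- Python runs once per row
print(conn.execute('SELECT pid, weight(rating) FROM players').fetchall())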
For testing / dev purposes, or just in general, a JsonFileLogReader should be able to pull date information from somewhere besides the folder where the log is stored. Proposed solution: give JsonFileLogReader a date field that it can fall back to if parsing the folder structure doesn't make sense.
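A sketch of that fallback (the date field comes from this proposal; the class body and the YYYY-MM-DD folder convention are illustrative):

import datetime
import os

class JsonFileLogReader(object):  # simplified sketch, not the real class

    def __init__(self, date=None):
        self.date = date  # proposed fallback value

    def parse_date(self, log_path):
        # normally the date comes from the enclosing folder, e.g. .../2016-09-15/
        folder = os.path.basename(os.path.dirname(log_path))
        try:
            return datetime.datetime.strptime(folder, '%Y-%m-%d').date()
        except ValueError:
            if self.date is None:
                raise  # no fallback available, so the original error stands
            return self.date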
Just noting that things are dicts is a cop-out. For all the complex data structures (PS data, DAO data...), document the expected structure. There was an ASCII-based format that I really liked. I'll post an example when I find it again.
Right now moves.json is just move names (js2py couldn't parse the whole file). Now that we've ditched js2py, we can pull the full movedex.
js2py chokes on arrow functions and is slow as balls.
This will work much better.
WTF
The old way of doing stats counting was to double-count Pokemon that appeared on a team twice. Based on the theories behind how we compute usage-based cutoffs, this is wrong: we should be counting the number of teams that have a given Pokemon. For tiers with species clause, this isn't an issue.
The current reporting system has the DAO pull usage for all formes separately, and then the report-generator is responsible for combining formes. This means that the system will not work for metagames without species clause--the DAO needs to be responsible for combining formes.
Ref: http://www.smogon.com/forums/threads/how-should-we-be-counting-for-usage-stats.3582996/
If a player's rating is missing, it's currently* inferred to be 1500±130 (default rating). The old scripts are smarter than that:
Note that this is an "enhancement" not a "feature" because it's not a high priority--I don't think missing ratings are all that common any more (worth checking!), but also the long-term plan is to do player rating calculation locally (not rely on PS).
*again, I'm entering this ticket before the code it refers to has actually been merged to master or even put up in a PR, but yet again, I'm highly confident this code is going to make it in, so entering the ticket now means I won't have to remember later
Create a data store suitable for the 0.1 release that can implement the various sinks and daos. The goal is to stand up something simple to deploy and debug, so it could be a file-based system, but I'm really hoping I can do it with SQLite.
Generate it the same way the accessible_formes resource was generated (might actually want to combine scripts).
Charizard-Mega-Y is not listed among the formes for a Charizard / Charizard-Mega-Y holding a charizarditey.
Say you have two Lv. 36 Charizard-Mega-X. Both are neutral-natured, flawless. The first has 8 EVs in Defense, the second has none. Both would currently be represented as having a dfn stat of 96.
Here's the problem: now let's say you want to determine the dfn stat of the Pokemon when in base forme: the first Pokemon should have a dfn of 73, the second, 72.
Representing the stats the way I am now (no IVs, EVs or nature) means there's no way to tease apart those two cases.
This is a pretty niche case, as all metagames with mega evolutions are played at at least Lv. 50 (where this ceases to be an issue), but deciding not to support it could have implications for other metas further down the road.
utilities.calculate_stats uses the Gen 2+ formula, so stat calculations for RBY metagames will be wrong.
(again, the code that created this issue has yet to be pushed to master, but I'm confident that I'll be sticking with the strategy I'm executing now, so I'd like to make the issue before forgetting about it)
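For reference, the RBY-era formula differs from the modern one for a non-HP stat (per Bulbapedia; a sketch, not the actual utilities.calculate_stats signature):

import math

def modern_stat(base, iv, ev, level, nature_mod=1.0):
    # the modern formula (used from Gen 3 on, per Bulbapedia)
    return int(((2 * base + iv + ev // 4) * level // 100 + 5) * nature_mod)

def rby_stat(base, dv, stat_exp, level):
    # RBY-era formula: DVs (0-15) and Stat Exp instead of IVs/EVs; no natures
    return ((base + dv) * 2 + int(math.ceil(math.sqrt(stat_exp))) // 4) * level // 100 + 5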
As of bd30fa6, I've removed the foreign key constraint from the teams table. This is somewhat dangerous, because I really do want to require that every "sid" in the teams table corresponds to a set in the moveset table, but leaving the constraint in leads to a few headaches, based on the overall architecture. With the constraint in place:
The dangers of not having the constraint in place are:
Really, it's only the last con and the last pro that I care about, and so far the con of having the constraint outweighs the pro, but if I ever decide to combine the sinks, I should definitely revisit this.
Rather than use real or anonymized logs, write a script to generate mock logs that can be used for testing.
Giant irony to this comment, which caused me to rip out and rewrite the entire SQL backend: forme lists in the DTO get ordered alphabetically upon sanitizing.
It's possible that the policy will be that the ordering doesn't matter (so count "Mega-Charizard-X-Mega-Y" and "Mega-Charizard-Y-Mega-X" together), but that shouldn't be baked in.
All the data extracted from LogReader will need to get routed to the appropriate sinks. Define an interface for each one that should be applicable for a file-based backend, RDBMS or NoSQL.
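For illustration, one possible shape for one of these interfaces (method names are hypothetical; the point is that a file-based backend, an RDBMS, or a NoSQL store can all implement the same contract):

class MovesetSink(object):
    """One possible sink interface (illustrative, not the actual definition)."""

    def store_movesets(self, movesets):
        """Persist a batch of parsed movesets."""
        raise NotImplementedError

    def close(self):
        """Flush anything buffered and release the backend."""
        raise NotImplementedError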
Some functions require four or five PS data lookups or resources, which means a metric ton of arguments to keep track of. What would be better would be to pass around one object that contains these lookups as attributes.
ZenHub should automatically pick up GH Projects, and this way people can track my progress without the plugin.
>>> from onix import contexts
>>> ctx = contexts.get_standard_context()
>>> ctx.sanitizer.sanitize('tentaécruel')
'tentaécruel'
Behaves as expected in Python 2.
Might be time to tackle #36 (it's a little premature to put in this issue now, considering the backend branch isn't even in PR yet, much less merged, but since I'm pretty sure I'm going down the ORM route for now, I figure I should put this in before I lose the thread)
As per SQLAlchemy's documentation, ORM operations are slow compared to SQLAlchemy's Expression Language (or executing the SQL directly), so when it comes time to do performance tuning (0.9 milestone), this might be an avenue to pursue. Note that I can still define my tables using the "declarative" idioms--the underlying tables can be accessed via the __table__ attribute.
On the other hand (and the reason I'm not just starting off writing the backend using Core), 12s for 100k rows doesn't actually sound that bad--if I could parse a million logs in a couple of minutes, that would be an enormous improvement on what I have now. Basically what I'm saying is that even doing things the slow ORM way, there's a good chance this won't be a significant bottleneck.
Bottom line: this is something to keep track of, but 50/50 this gets closed as "won't fix"
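As a concrete illustration of that escape hatch, the declarative class definition and Core-level access can coexist (illustrative model, not the actual onix schema):

from sqlalchemy import Column, String, create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Moveset(Base):  # illustrative model, not the actual onix schema
    __tablename__ = 'movesets'
    sid = Column(String, primary_key=True)
    species = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

# skip the ORM's unit-of-work machinery: grab the underlying Table
# and insert through the Expression Language directly
with engine.begin() as conn:
    conn.execute(Moveset.__table__.insert(), [{'sid': 'abc123', 'species': 'zapdos'}])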
appnope is an OSX-only thing
Really, it would be best not to autogenerate those conda env yamls and just keep track of what packages I end up installing. That way we can run tests just by running py.test.
Here's the code on PS:
Tools.prototype.getId = function (text) {
    if (text && text.id) {
        text = text.id;
    } else if (text && text.userid) {
        text = text.userid;
    }
    if (typeof text !== 'string' && typeof text !== 'number') return '';
    return ('' + text).toLowerCase().replace(/[^a-z0-9]+/g, '');
};
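For reference, a rough Python equivalent of that sanitization (a sketch, not onix's actual sanitizer):

import re

def get_id(text):
    # mirrors Tools.prototype.getId: lowercase, then strip everything
    # that isn't a-z or 0-9 (so 'Tentaécruel' -> 'tentacruel')
    if isinstance(text, dict):
        text = text.get('id') or text.get('userid') or text
    if not isinstance(text, (str, int, float)):
        return ''
    return re.sub(r'[^a-z0-9]+', '', str(text).lower())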
Before Onix can be released, we need the documentation to be complete and up-to-date. Use Sphinx to auto-generate the docs, make sure they look good, publish to RTD, and make sure you understand how to handle versioning.
Create a 0.1 release in GitHub and RTD
The ideal way to support this would be to feed the LogReader a generation-specific pokedex. I'll have to see how difficult that would be to generate based on how PS does it.
I think there are some custom metagames that monkey with base stats as well. Again, I should see how PS handles them.
If there's a ladder error, p1rating and p2rating keys get dropped from the logs and there's a ladderError flag.
Define DAO interfaces and create methods for generating plaintext (not the "chaos" JSON yet) detailed "moveset" usage reports like this one.
rptime gives the timestamp of the last rating update (which should be, at earliest, midnight of the current day). format is a key in the dict as well.
This means that conceivably these are not parameters that need to be statically sent to the reader. The question is whether (1) I fully understand how rptime works and (2) whether I want the system to be dependent on those keys being there.
Lines 106-107 in generate_forme_change_rules.py are in reverse order. Maybe write a test?
Set up TravisCI as Onix's automated testing environment. For now, we just need to make sure it works with Python 2.7 and 3.5 on OSX and Linux. In the future, we'll use Travis to test as many Python versions and ecosystems as possible.
And I'd previously decided to make pypy compatibility a requirement.
The workflow I envisioned was to have separate log readers for each metagame, but now I'm thinking that's probably not necessary. Obviously, there need to be different readers for different Contexts, but otherwise, the same LogReader should be able to work with OU and UU and PU and LC (and probably even Balanced Hackmons)
Consult with PS about which unknown values in logs just get ignored (e.g. unknown natures get processed as Hardy/Serious, IIRC) and ignore them rather than raising an error.
pypy can provide amazing speedups compared to regular python. On the other hand, it's incompatible with a lot of C-based packages (e.g. py_mini_racer, ujson and numpy/scipy/pandas, though hopefully I won't need any of those...).
So the question is: how big a bump can I get in performance from using pypy? The current scripts see limited gain. Is python execution even going to be the bottleneck? Will it actually run faster using pypy? Can I achieve similar gains using numba?
I should consider dropping the pypy req for now and revisiting it once the system is in an MVP state, where I can do profiling to answer these questions.
ujson can be massively faster than the native json library, although it's not compatible with all JSON data. Determine whether we can use ujson.
Note that ujson will not be compatible with pypy.
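If it does pan out, the usual pattern is a guarded import so the stdlib remains a fallback:

try:
    import ujson as json  # C-backed, much faster on big log files
except ImportError:
    import json  # stdlib fallback, e.g. when running under pypy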
Important for stuff like ReadTheDocs deployment and cutting releases in general.
Look into versioneer.
I'd like to test that if there's an error in performing an insert, the whole transaction gets rolled back, but right now I can't think of a way to force the sink to generate an invalid insertion expression.
I'm explicitly shying away from numpy/scipy packages (if I decide I want to do statistical modeling, that'll have to go in a different project), so there's little advantage to conda other than "that's what I know." On the other hand, tox is awesome for automatic testing across virtual environments.
Write a test that verifies the correct operation of the entire pipeline, from collection to reporting.
It doesn't have to be a "diabolical" test--just testing normal, easy-to-validate operation is fine.
Right now the only two error-handling options are:
'''
* "raise" : raise an exception if an error is encountered.
* "skip" : silently skip problematic logs
'''
This means that either one bad log ruins the whole batch or that bad logs don't get reported.
"Robust error handling" is currently targeted for the 0.7 release, but in order to get through "a whole month of logs" this will need to be dealt with sooner (0.4 release).
For now, I propose just a "warn" strategy that prints the error message but keeps chugging.
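A minimal sketch of that third strategy (the helper and its arguments are hypothetical):

import warnings

def handle_log_error(error, log_ref, error_handling):
    if error_handling == 'raise':
        raise error
    elif error_handling == 'warn':
        # report the problem but keep chugging through the batch
        warnings.warn('Problem processing {0}: {1}'.format(log_ref, error))
    # 'skip': silently move on to the next log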
Currently the integration test throws the following error:
E sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
(it goes on for quite some time, but it's referring to the giant CASE clause that is the species lookup)
This is a known limitation with SQLite, where one is limited to 999 "?" variables. Ref: https://www.sqlite.org/limits.html
Upon further investigation, I see that the insert expressions are going to have issues as well for any large-scale operations (though this can be mitigated by setting batch_size really low).
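As a preview of the second element of the solution below, here's the difference between the two insert syntaxes (illustrative table, not the actual onix schema; note the rows become dicts):

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine('sqlite://')
metadata = MetaData()
teams = Table('teams', metadata,
              Column('tid', String), Column('idx', Integer), Column('sid', String))
metadata.create_all(engine)

rows = [{'tid': 't1', 'idx': 0, 'sid': 's1'},
        {'tid': 't1', 'idx': 1, 'sid': 's2'}]

with engine.begin() as conn:
    # current syntax: one multi-row VALUES clause -- one bound "?" per cell,
    # so a large batch_size blows through the 999-variable cap
    conn.execute(teams.insert().values(rows))

    # proposed syntax: executemany -- the statement is prepared once and
    # parameters are bound per row, so the cap applies per row, not per batch
    conn.execute(teams.insert(), rows)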
So the solution has two elements:
* Get rid of the giant species-lookup CASE expression (there's an alternative where it only combines formes and handles weird cases like "Mr. Mime" and then the reporting DAO sanitizes the rest, but that adds a lot of complexity).
* Use the .execute(insert(), values) syntax rather than the current .execute(insert().values(values)) syntax. AFAICT, this also necessitates storing the rows as dicts rather than tuples. This is, honestly, a better syntax, as it'll be clearer from the tests what's being tested than how it is now where I'm randomly selecting by index. But it's also going to be a PITA to rewrite all the tests.

Make a function that handles all this nonsense:
import json

from onix import scrapers

try:
    pokedex = json.load(open('.psdata/pokedex.json'))
except IOError:
    pokedex = scrapers.scrape_battle_pokedex()
Should also handle resources that don't get scraped.
Create DAO interface and report-generation method for generating the ranked list of abilities for the Pokemon in a given metagame in a given month at a given weighting-baseline
A LogProcessor is an object to handle--at a high level--the reading of logs and the routing of the parsed data to the appropriate sinks. It also handles any errors that might arise.
A LogProcessor is initialized with the appropriate sinks and references to the logs that need processing, and it's the LogProcessor's responsibility to create the appropriate LogReaders.
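A rough skeleton of that contract (the sink names and method signatures are hypothetical; only LogProcessor, LogReader and the overall routing come from this description):

class LogProcessor(object):
    """Reads logs and routes the parsed data to the appropriate sinks."""

    def __init__(self, moveset_sink, battle_info_sink, log_refs):
        self.moveset_sink = moveset_sink
        self.battle_info_sink = battle_info_sink
        self.log_refs = log_refs

    def process_logs(self, error_handling='raise'):
        for ref in self.log_refs:
            reader = self._get_log_reader(ref)  # pick the right LogReader
            try:
                battle_info, movesets = reader.parse_log(ref)
            except Exception:
                if error_handling == 'raise':
                    raise
                continue  # 'skip': move on to the next log
            self.battle_info_sink.store_battle_info(battle_info)
            self.moveset_sink.store_movesets(movesets)

    def _get_log_reader(self, ref):
        raise NotImplementedError  # dispatch on metagame / log source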
Right now sinks are very much set up to store data when you tell them to, in that:
A better way to structure this would be to have the "store" methods return nothing, define "flush" and "close" methods, and maybe have THOSE return the total number of objects stored during the session.
At this point, the task is trivial, since the only sink implementations are stubs, but if I put this off for a later release (since this doesn't need to be optimized for the 0.1 release), the effort value will likely increase.
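Under the proposed semantics, a buffering sink might look roughly like this (buffer details and the backend call are illustrative):

class BufferedMovesetSink(object):

    def __init__(self, backend, batch_size=1000):
        self.backend = backend
        self.batch_size = batch_size
        self.buffer = []
        self.total_stored = 0

    def store_movesets(self, movesets):
        # "store" just buffers and returns nothing
        self.buffer.extend(movesets)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.backend.write(self.buffer)  # hypothetical backend call
        self.total_stored += len(self.buffer)
        self.buffer = []
        return self.total_stored  # running total for the session

    def close(self):
        return self.flush()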
The method reporting.reports.generate_usage_stats should pull information from a dao.ReportingDao interface and generate a standard Smogon usage stats report like this one.
We want total battles, average weight / team and usage %. We don't need "Raw" or "Real" for now.
Through this process it should become clearer what sorts of data we need the ReportingDao to return.
Note that combining of formes and mapping of formes to printable names is in the scope of this method (read: not the dao).
Create DAO interface and report-generation method for generating the ranked list of items for the Pokemon in a given metagame in a given month at a given weighting-baseline