GitHub Scraper is a tool for tracking several repositories within one Google Spreadsheet, making task management and status sharing between teammates easier.
The first table-building iteration can take a lot of time for active repositories. Since the GitHub API returns the total number of issues matched by a filter, we can compute the percentage of processed issues/PRs and show it in the logs.
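A minimal sketch of how that percentage could be computed for log output (the function name is hypothetical, not part of Scraper's current code):

```python
def progress_percent(processed, total):
    """Return completion percentage for log output; guards against total == 0."""
    if total == 0:
        return 100.0
    return round(processed / total * 100, 1)

# Example log line during the first table-building iteration:
# print(f"Processed {processed}/{total} issues ({progress_percent(processed, total)}%)")
```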
Scraper uses the issue number and repository short name as row IDs in internal calculations. This is not very convenient, as users have to keep the Issue and Repository columns in their own tables, which may be unwanted.
Add functionality to move done and closed issues into an archive table to avoid overfilling the existing tables.
A to_be_archived() function should be added to fill_funcs.py. It should determine whether an issue should be archived, and return a list ready to be inserted into the archive sheet.
This feature requires a new constant in config.py. It must have the same structure as the other sheet configurations.
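A sketch of what this could look like; the issue fields, archive criteria, row layout, and the ARCHIVE_SHEET constant name are all assumptions for illustration, not Scraper's actual API:

```python
# fill_funcs.py (sketch)
def to_be_archived(issue):
    """Decide whether an issue belongs in the archive sheet.

    Returns a row (list) ready to be inserted into the archive sheet,
    or None if the issue should stay in its current table.
    """
    if issue.get("state") == "closed" or "done" in issue.get("labels", []):
        return [issue["number"], issue["repo"], issue["title"], issue["state"]]
    return None


# config.py (sketch) -- same shape as the other sheet configurations
ARCHIVE_SHEET = {
    "name": "Archive",
    "columns": ["Issue", "Repository", "Title", "State"],
}
```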
For now issues are inserted monolithically: the whole table is written with a single request. This takes time and forces Scraper to recalculate all of the highlighting, which means tens or hundreds of requests. On top of that, before sending new highlighting requests we have to clear all of the current highlighting.
The best way to speed this up is to perform the operations one by one: first sort the updated backend table and send requests to move the updated rows. Then insert new rows into the backend table, sort it, and send requests to insert the new rows into the spreadsheet. Finally, delete removed rows from the backend table, sort it, and send the row-deletion requests. This will also make the system itself more stable.
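The phased flow above could be sketched roughly like this, with the backend table modelled as a plain list of rows keyed by issue number and the actual request sending stubbed out as a callback (all names are hypothetical):

```python
def update_table(backend, updated, new, deleted, send):
    """Apply updates in three phases, sorting between phases so every
    batch of requests targets a consistently ordered table."""
    key = lambda row: row[0]  # assume column 0 is the sort key

    # 1. apply updates to existing rows, sort, send row-move requests
    ids = [row[0] for row in backend]
    for row in updated:
        backend[ids.index(row[0])] = row
    backend.sort(key=key)
    send("move", updated)

    # 2. insert new rows, sort, send row-insert requests
    backend.extend(new)
    backend.sort(key=key)
    send("insert", new)

    # 3. drop deleted rows, send row-delete requests
    deleted_ids = {row[0] for row in deleted}
    backend[:] = [row for row in backend if row[0] not in deleted_ids]
    send("delete", deleted)
    return backend
```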
Some issues may not need to be tracked by Scraper at all. Implement functionality that gives the user the ability to set rules for ignoring issues. It should be located in fill_funcs.py for easier access, and should be called on every table update.
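A sketch of such an ignore rule; the rule set, function name, and issue fields are assumptions for illustration:

```python
# fill_funcs.py (sketch) -- user-editable ignore rules
IGNORED_LABELS = {"wontfix", "duplicate"}

def to_be_ignored(issue):
    """Return True if Scraper should skip this issue entirely.
    Called on every table update, before the issue is processed."""
    if set(issue.get("labels", [])) & IGNORED_LABELS:
        return True
    if issue.get("title", "").startswith("[ignore]"):
        return True
    return False
```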
For now the spreadsheet name is used only in the config.py file, which is not very convenient. Some people may want to work with the names of their spreadsheets instead of their IDs. Thus, it would be good to implement a name attribute for the Spreadsheet() class, ideally with a setter that renames the spreadsheet on the service.
This is probably caused by using the updated_at time for the since filter. It would be better to use the time of the last table update instead, since on the first update after start Scraper analyzes only open issues. That means the last update date will equal the update time of the last open issue. If any issue was closed after that time, Scraper will add it to the table.
It may be worth tracking pull requests that are not related to any issue. The problem here is how to show them in the table, how to track them, and how to avoid technically mixing them with issues.
UPDATE: issues are now identified by their URLs. Issue and PR numbers never collide with each other, so the PRs-without-related-issues feature can now be implemented.
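Since issues are now identified by their URLs, distinguishing a standalone PR from an issue could be as simple as checking the URL path; GitHub HTML URLs use /pull/ for pull requests and /issues/ for issues (the helper name is hypothetical):

```python
def is_pull_request(html_url):
    """Distinguish PRs from issues by the GitHub URL path."""
    return "/pull/" in html_url
```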
Users should have the opportunity to tweak the conditions under which an issue is deleted from a table (mostly to avoid overfilling). For example, an issue that was closed within three days without a pull request may not be very interesting to table owners.
It would probably be optimal to call the cleanup function on every issue (after it receives all of its data updates). fill_funcs.py is a good place for it. The function itself should return a bool: True to delete the issue, False to let it stay in the table. Internal Scraper code will delete issues marked this way.
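A sketch of such a cleanup predicate, using the closed-within-three-days-without-a-PR example from above; the function name and issue fields are assumptions:

```python
import datetime

def to_be_deleted(issue):
    """Return True to remove the issue from the table, False to keep it."""
    if issue["state"] != "closed":
        return False
    lifetime = issue["closed_at"] - issue["created_at"]
    # closed quickly and never linked to a pull request -> not interesting
    return lifetime <= datetime.timedelta(days=3) and not issue.get("linked_prs")
```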
Speed up Scraper by adding a date filter for issues. Issues should be retrieved sorted by "updated_at", so we can avoid reading and processing issues that were updated long ago. The since filter can also be used to fetch only recently changed issues.
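The query parameters for the GitHub REST issues endpoint could be built like this (a sketch; `since` expects an ISO 8601 timestamp, and the helper name is hypothetical):

```python
def issue_query_params(last_update):
    """Build GitHub API parameters that fetch only recently changed
    issues, newest-updated first, so stale issues are skipped."""
    return {
        "state": "all",
        "sort": "updated",
        "direction": "desc",
        "since": last_update.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```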
For now, relations between issues and PRs can be detected only if both are in the same repository. It makes sense to search for connections across all of the repositories tracked on a single sheet.
The try_match_keywords() function should probably be extended to understand whether an issue was posted in another repository.
For now, Scraper testing means only running it on live data, which can take some time. It would be great to have a set of unit/system tests to easily check Scraper's health when new features arrive.
Tests should be added to a new "tests" folder.
While writing the tests, the coverage package should be used to make sure that functions/objects are completely covered by tests.
In some repositories the number of requests can reach hundreds or even thousands, and processing them on the backend takes time. During this processing, visual artifacts can appear in the spreadsheet (URLs erased, highlighting cleared, etc.), which can look really weird.
It's better to send requests in smaller batches. This practice also reduces the number of lost highlights, since a batch fails as a whole: a failed 300-request batch skips all 300 requests, while failing 1 of 30 batches of 10 requests each loses only 10, saving the other 290 requests.
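Splitting the request list could be a simple helper like this (the batch size of 10 is just the example from above):

```python
def chunked(requests, batch_size=10):
    """Split a flat list of requests into batches so that one failing
    batch loses at most batch_size requests, not the whole table."""
    return [requests[i:i + batch_size]
            for i in range(0, len(requests), batch_size)]
```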
Users should have the ability to set a cell's color while setting its value within a filling function. It would be better to add a new field to the old_issue object for this purpose.
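Inside a filling function this could look roughly as follows; the old_issue structure, the new `colors` field, and the RGB-floats color format (as used by the Google Sheets API) are all assumptions for illustration:

```python
def fill_priority(old_issue, issue):
    """Set a cell's value and, via a hypothetical new `colors` field,
    the background color the cell should be highlighted with."""
    old_issue["Priority"] = issue["priority"]
    # hypothetical new field: per-column cell colors, RGB as floats in [0, 1]
    old_issue.setdefault("colors", {})["Priority"] = {
        "red": 1.0, "green": 0.8, "blue": 0.8,
    }
```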
Functionality to build statistics tables should be implemented, showing how many issues were created/closed, PRs merged, and other actions taken during given periods of time.
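The aggregation behind such a statistics table could be sketched like this, counting (date, action) events per month; the event format and function name are assumptions:

```python
import collections

def build_stats(events):
    """Aggregate (date, action) events into per-month counts, e.g.
    {"2020-01": {"created": 2, "closed": 1}}."""
    stats = collections.defaultdict(collections.Counter)
    for date, action in events:
        stats[date.strftime("%Y-%m")][action] += 1
    return {month: dict(counts) for month, counts in stats.items()}
```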