trusat-backend's People

Contributors

cameronfr, interplanetarychris, kmoneal


trusat-backend's Issues

RDE submissions via the website will only succeed for the first observation.

https://github.com/consensys-space/trusat-backend/blob/d847b0188f81d4bd68bb3e435f50241759d959a4/database.py#L991-L996

As written, this only handles:
Common data block
Date of observation
First record

whereas a submission can contain multiple observation dates and multiple records.

The recommended solution is to update the iod.py IOD/RDE/UK parsing routines to also return an array indicating which lines succeeded and which failed to import.
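A minimal sketch of the shape this could take (the function and parameter names below are illustrative, not the existing iod.py API):

```
def parse_submission_lines(block, parse_line):
    """Parse a multi-line IOD/RDE/UK submission, reporting per-line results.

    parse_line is whichever routine handles a single record; this wrapper
    only adds bookkeeping. Returns (records, results), where results is a
    list of (line_number, ok, message) tuples.
    """
    records, results = [], []
    for line_number, line in enumerate(block.splitlines(), start=1):
        try:
            records.append(parse_line(line))
            results.append((line_number, True, ""))
        except ValueError as err:
            results.append((line_number, False, str(err)))
    return records, results
```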

Encrypt RSA Private Key

To improve security, the RSA private key should be stored as an encrypted file. The key used to decrypt this file should be supplied via an environment variable.
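A rough sketch of one way to do this with the cryptography package (the file path and the environment-variable name are assumptions, not existing configuration):

```
import os
from cryptography.hazmat.primitives import serialization

def load_signing_key(path="private_key.pem"):
    # Environment-variable name is illustrative only.
    passphrase = os.environ["PRIVATE_KEY_PASSPHRASE"].encode()
    with open(path, "rb") as key_file:
        return serialization.load_pem_private_key(key_file.read(), password=passphrase)
```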

Implement code to acknowledge opt_out users/stations

To prevent re-processing of data for stations (users) who have opted out of TruSat, a number of enhancements should be implemented so that their data is no longer processed by TruSat.org and, potentially, other future decentralized implementations.

At present, this would appear to apply to:

  1. SQL Station Table (a place to store the flag per station)
  2. addParsedIOD() - Check for flagged stations before submitting
  3. Any queries which JOIN with the Station table - the station_num will need to be stored, but location data will be nulled out (currently a bug that needs to be filed). This is probably not an issue because there will also be no ParsedIOD records for the station.

If the data is manually removed from the database, this should prevent re-introduction of the data in future import sessions.
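As a sketch of items 1 and 2 (the column name, placeholder style, and helper are assumptions, not existing schema or code):

```
ADD_OPT_OUT_COLUMN = (
    "ALTER TABLE Station ADD COLUMN opt_out TINYINT(1) NOT NULL DEFAULT 0"
)

def station_opted_out(cursor, station_num):
    """Return True if the station has the opt-out flag set."""
    cursor.execute("SELECT opt_out FROM Station WHERE station_num = %s", (station_num,))
    row = cursor.fetchone()
    return bool(row and row[0])

# In addParsedIOD(), records from opted-out stations would be skipped before insert:
#   if station_opted_out(cursor, observation.station):
#       continue
```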

Time formatting should probably happen closer to the user

Currently, we have a few database queries that format timestamps into dates, like 'observation_time': 'September 18, 2016'

I get that we have to serialise to some format to send over JSON, but compared with something like:
'observation_time': '2016-09-18T10:22:44.411Z' the current format seems like a weird choice for several reasons:

  • It loses precision; the time of day is thrown away for no clear reason.
  • IMHO it's an unusual way of writing dates
  • It needlessly introduces English words for months, when numeric dates would be language agnostic and better for a global audience
  • It doesn't sort easily like ISO 8601 would (https://xkcd.com/1179/)
  • It makes it harder to adapt how dates are displayed for a given user's locale

Surely it's better to pass the date in a format that can be automatically formatted back into a JavaScript date, so that the front end can format as it sees fit without back-end changes. (E.g. there's a fashion for using "moments", like "5 hours ago" or "6 months ago".)
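For illustration, one way the back end could serialise timestamps (a sketch, not existing code):

```
from datetime import datetime, timezone

def to_iso_utc(dt: datetime) -> str:
    """Serialise a timestamp as ISO 8601 in UTC, e.g. '2016-09-18T10:22:44.411Z'."""
    return (
        dt.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
```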

read_seesat_mbox.py and read_seesat_hypermail.py stall on import

Following unit test updates, both SeeSat archive import scripts appear to be "paused," in a Ctrl-S terminal way, in the middle of importing. Ctrl-C in the terminal allows them to continue, but it is unclear what the underlying cause is, or whether the Ctrl-C causes data loss.

Debugger sessions have not yet pinpointed the location of the problem.

From terminal feedback with the -V flag:
https://github.com/consensys-space/trusat-backend/blob/257bf4606147dbd927f3c3b9acf453b475c22656/database_tools/read_seesat_mbox.py#L346

...the script appears to consistently "pause" after the following lines:

Found  67 IOD obs in msg: 2015-03-19 23:23:00+01:00 LB Obs 2015 Mar 19
Found 162 IOD obs in msg: 2018-04-21 10:05:40+02:00 LB Obs 2018 Apr 20-21 night
Found   2 IOD obs in msg: 2019-03-27 22:56:50+01:00 Obs 2019 Mar 27 pm
Found   3 IOD obs in msg: 2019-09-20 07:28:25-04:00 slow moving unid seen on Sept 19

Error handling / monitoring on the API may be deficient

There are many places where the API code catches exceptions, then prints details of an error and returns "200 OK" regardless.

I'm not sure how this works with the front-end but my suspicion is that errors will often go undetected. In some cases we probably just show no data rather than an error message. In other cases it may appear that actions were successful when they really failed.

Related to this, do we (i.e. TruSat as an organisation, rather than our users) have a way of detecting and investigating errors that happen during API calls?
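As one illustration of the alternative (the handler structure and function names here are assumptions about server.py, not quotes from it):

```
import json
import logging

def send_json(handler, status, payload):
    """Send a JSON body with an explicit HTTP status code."""
    body = json.dumps(payload).encode()
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.end_headers()
    handler.wfile.write(body)

# In an endpoint, instead of printing the error and returning 200:
#   try:
#       send_json(self, 200, database.selectSomething(args))
#   except Exception:
#       logging.exception("selectSomething failed")
#       send_json(self, 500, {"error": "internal server error"})
```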

Catalog queries return many results per object rather than one per object

Only the /catalog/debris and /catalog/undisclosed queries group results by object ID before returning.

Other /catalog/... queries will return many results per object, and perhaps more than one per IOD, because observations from stations with more than one user are listed once per user rather than once per station.

The intention seems to be to return one result per object?

If we introduce a GROUP BY clause to the other catalog queries, we must take care to select the correct observation details (obs time, user, eth address and obs quality). (Similarly to the "multiple users per station" problem, we must take care not to return many results per object if an object has multiple IODs with the same "latest" timestamp.)
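A sketch of the usual pattern for this, written as it might sit in database.py (column names other than ParsedIOD.station_number are assumptions about the schema):

```
LATEST_OBS_PER_OBJECT = """
    SELECT object_number, MAX(obs_time) AS latest_obs_time
    FROM ParsedIOD
    GROUP BY object_number
"""
# Joining the catalog query against this derived table on
# (object_number, obs_time) yields at most one row per object, provided ties
# on the latest timestamp are also broken deterministically.
```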

The mapping from an IOD to its user (i.e. observer) should be unambiguous

Some station_nums have multiple entries in the Station table (see below). This is unsatisfactory, as it theoretically leaves us unable to link an observation to a user, and when linking observations to users the resulting need for de-duplication makes queries more complex and slower (or, more commonly, where there is currently no de-duplication, it just makes them wrong).

Two possibilities:

  1. A station can only ever have one user. (This seems to be the reality, for now?) We should enforce this in our processes, and also in the database by making station_num a unique key on the Station table (see the sketch below the table). When we know there is only one row per station in the Station table, queries become easier (or, in some cases, the existing query syntax becomes correct rather than bugged :).
  2. A station can have multiple users. In this case, which user should be assumed to have made an observation from a multi-user station? Can we save info about this with the observation? If not, what heuristic should we use to decide on the user? If the heuristic gives us a constant answer for any given station, we may wish to save this answer as a view or in its own table. Existing queries should be updated to use this new table or view, to ensure that observations from the 4 stations below are not duplicated.

Note: some queries attempt a LIMIT 1 operation to choose a single user, but do so in a bugged way: #10.

| station_num | user | initial | latitude | longitude | elevation_m | name | MPC | details | preferred_format | source_url | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8835 | 31 | | 35.1477 | 90.0135 | 75 | IDSat | | | IDSat | COSPAR.TXT | |
| 8835 | 31 | JN | 35.1477 | -90.0135 | 75 | Jim Nix | | | IOD | http://www.satobs.org/seesat/Jan-2002/0276.html | |
| 8935 | 31 | JN | 35.3166 | -89.886 | 94 | Jim Nix | | | IOD | http://www.satobs.org/seesat/Jun-1999/0356.html | Added |
| 8935 | 31 | | 35.3166 | 89.886 | 94 | IDSat | | | IDSat | COSPAR.TXT | |
| 8936 | 31 | JN | 35.1231 | -89.9354 | 90 | Jim Nix | | | IOD | http://www.satobs.org/seesat/Sep-2000/0311.html | Added |
| 8936 | 31 | | 35.2131 | 89.9354 | 90 | IDSat | | | IDSat | COSPAR.TXT | |
| 9127 | 39 | | -27.591 | -153.05 | 19 | IDSat | | | IDSat | COSPAR.TXT | |
| 9127 | 39 | JM | -27.591 | 153.05 | 19 | Jim McManus | | | IOD | http://www.satobs.org/seesat/Aug-2001/0200.html | |
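If we go with option 1, the constraint could look roughly like this (a sketch of assumed DDL, not an existing migration; the duplicate rows above would have to be resolved manually before it can be applied):

```
ENFORCE_ONE_ROW_PER_STATION = (
    "ALTER TABLE Station ADD CONSTRAINT uq_station_num UNIQUE (station_num)"
)
```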

TLE queries/endpoints can return a small number of duplicates

If there is more than one TLE with the joint-max epoch for a given satellite, then the epoch queries will return duplicates.

(On my test DB, there are two such objects.)

We should probably return only one TLE for each object. Not clear how to choose...

For reference, the duplicates can be found with:

```
with max_epochs as (
select satellite_number as satnum, max(epoch) as max_epoch
FROM TLE
GROUP BY satellite_number)
select epoch, satellite_number, count(*) as c
from TLE inner join max_epochs on (epoch = max_epochs.max_epoch and satellite_number = max_epochs.satnum)
group by epoch, satellite_number
order by c desc;
```
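One possible way to make the choice deterministic, sketched as it might appear in database.py (the "id" tie-breaking column is an assumption; substitute the TLE table's real primary key):

```
LATEST_UNIQUE_TLE = """
    SELECT t.*
    FROM TLE t
    JOIN (SELECT satellite_number, MAX(epoch) AS max_epoch
            FROM TLE
           GROUP BY satellite_number) me
      ON t.satellite_number = me.satellite_number
     AND t.epoch = me.max_epoch
   WHERE t.id = (SELECT MAX(t2.id)
                   FROM TLE t2
                  WHERE t2.satellite_number = t.satellite_number
                    AND t2.epoch = t.epoch)
"""
```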

/signup endpoint throwing a CORS error

I attempted to follow the sign-up flow and received the email with my secret. But because the endpoint returned a CORS error, the front end never showed the "success" message telling the user to check their email for further details on how to log in.

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://api.consensys.space:8080/signup. (Reason: CORS request did not succeed).
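A sketch of what the /signup handler likely needs, assuming server.py's existing BaseHTTPRequestHandler-style structure (exact wiring will differ):

```
def add_cors_headers(handler):
    """Headers needed for cross-origin calls from the front end."""
    handler.send_header("Access-Control-Allow-Origin", "*")
    handler.send_header("Access-Control-Allow-Headers", "Content-Type")
    handler.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")

# The preflight request also needs a handler, roughly:
#   def do_OPTIONS(self):
#       self.send_response(204)
#       add_cors_headers(self)
#       self.end_headers()
```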

selectObjectInfo_JSON returns incorrect results

The user count, user name and last seen date all appear to be wrong. (E.g. see this Slack thread).

The query just seems not quite right. GROUP BY is grouping by user but isn't picking the max observation date for each user, so the choice of user ends up being effectively random. And the "user count" seems to reflect the number of observations made by this randomly selected user, rather than the number of distinct users.

I haven't checked the other data.

Code maintainability/robustness improvements/refactors

I'm using this issue to note some misc questions/concerns about some coding patterns in this repo, such as:

General

  • Catch-all except blocks; most should only catch known, anticipated errors, to avoid swallowing something truly unexpected that needs our attention (see the sketch at the end of this issue)
  • except blocks that silently swallow and hide errors.
  • Some functions have no error handling, and it's not clear how they will behave in error cases. This should be documented. The act of documenting it might force us to think about what really happens, and perhaps to change it.
  • More generally, many functions have missing, minimal or incomplete docstrings.

server.py

  • The use of login.txt where environment variables might be more appropriate
  • Lots of duplicate or boilerplate code (e.g. self.send_header('Access-Control-Allow-Origin', '*') is repeated in many places, and all will need to change in the same way if/when we update our CORS policy)

database.py

  • The extensive use of prepared statements and prepared cursors, even where they add little benefit and make the code more verbose
  • Executing multiple queries where one would do (e.g. d710031, b02510e)
  • Use of except where it may be preferable to check for a non-exceptional (e.g. row count = 0) edge case in advance (e.g. d6ab307)
  • Lots of seemingly unnecessary code for "sqllite"; maybe this is needed and I just haven't grasped why?

A lot of this stuff can be fixed as and when the code is updated or refactored for other reasons.
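Two small sketches of the direction these points suggest (the names are placeholders, not existing code, and mysql.connector.Error is an assumption about the driver in use):

```
import os

# Narrow exception handling instead of a bare "except:":
#
#   try:
#       cursor.execute(query, params)
#   except mysql.connector.Error:
#       logging.exception("query failed")
#       raise

# Environment variables instead of login.txt (the variable name is illustrative):
DB_PASSWORD = os.environ.get("TRUSAT_DB_PASSWORD")
```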

Catalog queries could be much faster

Currently, some of these large queries take ~1s. To get times, run pytest -s --durations 20.

Query times could doubtless be improved (rough target: take 90% less time) with some schema and query optimisation.

Catalog queries incorrectly restrict to a single station_number

The catalog queries all make use of the following subquery:

     JOIN (SELECT
                Station.station_num as station_num,
                Station.user as station_user,
                Observer.id as obs_id,
                Observer.eth_addr as eth_addr,
                Observer.name as user_name
                FROM Station,Observer
                WHERE Station.user = Observer.id
                LIMIT 1) Obs ON ParsedIOD.station_number = Obs.station_num

I think the LIMIT 1 here is a bug. It has the effect of limiting the whole catalog result set to a single station number (on my DB, ParsedIOD.station_number = 2420), which I don’t think is what we want. I believe I’ve inadvertently fixed some cases as part of query optimization, but other cases need attention.

We must not use an integer "nonce" as the JWT signing key

Several probable issues with this:

  • the domain of the integers used is far (dozens of orders of magnitude?) too small. It's currently easy for anyone to find valid JWTs for any user by brute force, and anyone reading the code will know this.
  • I am not an expert, but I've not heard of a signing key being called a "nonce" before
  • if we rename the "nonce" to "key", some problems with the way we are generating and storing it, unencrypted, become more obvious. For a start, anyone with read database access can see all keys and (even if we expand the key domain to something more sensible) can trivially log in as any user without being detected.

More generally, it's worth being acutely aware of how difficult security is in JS. Much of the following article does not apply to us, and some of it is out of date, but some of it is relevant and it's a fun read. :)
https://www.nccgroup.trust/us/about-us/newsroom-and-events/blog/2011/august/javascript-cryptography-considered-harmful/
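At minimum, the signing secret should be generated with a cryptographic RNG; how it is stored and rotated is a separate question. A sketch:

```
import secrets

# 256 bits of randomness, instead of a small integer.
jwt_signing_key = secrets.token_bytes(32)

# e.g. with PyJWT:  jwt.encode(payload, jwt_signing_key, algorithm="HS256")
```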

No error handling when sending emails

From a code read, it looks like any failure to send emails to new users will happen silently. send_message prints an error message, but then the API will return 200 and the front-end will think everything is OK.

The email we send says "This email is the only time we can send you this code" so if that's true (is it? what makes it true?) then failed emails could leave users without any access to their accounts.

Queries must use bind variables

For security (https://www.explainxkcd.com/wiki/index.php/327:_Exploits_of_a_Mom). Otherwise malicious users can probably delete all our data.

Perhaps also for performance (in an Oracle database, using bind variables allows query execution plans to be shared between queries that are identical apart from the bind variables, which can significantly speed up execution; not sure about mariadb.)

See https://github.com/consensys-space/sathunt-database/pull/23/commits/9c861dc562dbac15629b54bb0ba64f59feb8c7ac for an example.
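For illustration (the Observer columns here come from the existing catalog subquery; the function itself is a sketch):

```
def find_observer_id(cursor, eth_addr):
    # Vulnerable: user input concatenated straight into SQL.
    #   cursor.execute("SELECT id FROM Observer WHERE eth_addr = '" + eth_addr + "'")
    # Safe: let the driver bind the value (placeholder style depends on the
    # driver; %s is typical for MySQL/MariaDB connectors).
    cursor.execute("SELECT id FROM Observer WHERE eth_addr = %s", (eth_addr,))
    return cursor.fetchone()
```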

Cache API results in the browser

I think the API should return headers that encourage browsers to cache API results, even if it's only for 5mins*. This is pretty trivial to implement, and could greatly improve UX. (It should also reduce database/API load significantly, which isn't an issue right now but could be in the future).

For bonus points, some data (e.g. historical observations) could be safely cached for much longer periods.

* I chose 5mins because I saw user stories like "changes must be seen within 5mins".
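A sketch of the header to add before end_headers() on cacheable GET responses (the handler structure is assumed):

```
def add_cache_headers(handler, max_age_seconds=300):
    """Mark a response as cacheable by the browser for max_age_seconds."""
    # 300s matches the "changes must be seen within 5 mins" user story;
    # long-lived data such as historical observations could use a larger value.
    handler.send_header("Cache-Control", f"public, max-age={max_age_seconds}")
```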

Does `selectUserObjectsObserved_JSON` return the right mix of info?

This was doing the wrong thing so I had to take a best guess at the requirements when rewriting it.

As of my fix:

  • observation_quality refers to the observation made by the selected user.
  • username_last_tracked, time_last_tracked, and address_last_tracked all refer to the most recent observation of the same object by any user.
  • object_secondary_purpose was returning a placeholder string mentioning team members, so I took it out and we now return the empty string until we decide what to do with this field.
  • observation_quality was returning a code like "E", which I updated to a short description like "excellent".

Not 100% clear that I've got those requirements right?

If I have, is it weird that we don't tell the selected user the date/time at which they last observed each object?

User queries do the wrong thing

Similar to #10, the user queries selectUserObservationHistory_JSON, selectObjectUserSightings_JSON, and selectUserObjectsObserved_JSON seem like they probably limit results to a single station (untested).

Some of the queries also seem to return duplicate results or inconsistent info.

This bug may be most visible to our most important users, some of whom seem to use 10 or 11 different stations.
