bcfishobs's Issues

just create tables, not views

pg_featureserv doesn't yet support views and pg_tileserv is a bit fussy about them. Instead of creating the view whse_fish.fiss_fish_obsrvtn_events_vw, just create a table; maybe call it whse_fish.fiss_fish_obsrvtn_events_sp.

output tables - distinct locations or all locations

Currently a point is created for each observation record in whse_fish.fiss_fish_obsrvtn_events_sp, while whse_fish.fiss_falls_events_sp includes only distinct locations. This is fine for current use cases, but standardizing would be valuable. It is probably best to go with what I've done for observations: use the events table (no geoms) for actual processing, and use events_sp for visualizing. It is handy to have every input record easily queryable by its source id.

matching stream observations > 100m from stream

fish_observation_point_id = 466322 on Wardropper Creek (north side of Lake Cowichan) does not get matched to a stream.

The site is <100m from a stream but is ~125m from the non-waterbody portion of the stream. Presumably there is no match via the xref and linear_feature_id.

As we already have a ~98% match rate this may not be a priority, but to fix it we could try:

  • attempt to match sites 100-500m from a stream that do not have exact linear_feature_id matches via blue_line_key, watershed_code etc, and attributes like gazetted_name
  • allow matching stream records to flow lines in waterbodies
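The current and proposed tiers could be sketched roughly as follows; the field names, function, and the final tier are hypothetical (the real matching is done in SQL against FWA):

```python
# Rough sketch of the matching tiers, with the proposed 100-500m
# attribute-based fallback added; inputs are hypothetical.
def match_tier(dist_m: float, lookup_match: bool, names_agree: bool) -> str:
    if dist_m <= 100 and lookup_match:
        return "matched - within 100m; lookup"
    if dist_m <= 100:
        return "matched - within 100m; closest stream"
    if dist_m <= 500 and lookup_match:
        return "matched - 100-500m; lookup"
    if dist_m <= 500 and names_agree:
        return "matched - 100-500m; attributes (proposed)"
    return "unmatched"

# a site like the Wardropper Creek one (~125m, no lookup match) would now
# match if attributes such as gazetted_name agree with the stream's
print(match_tier(125, False, True))  # matched - 100-500m; attributes (proposed)
```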

de-aggregated view is primary product of interest

The aggregated locations are nice and efficient, but most workflows benefit from having one record per observation.
Modify the schema and outputs so that the de-aggregated data is what gets dumped and is readily available to other tools.

improve qa/summary

If records are accidentally dropped by a bad query in 07_create-output-tables.sql, they will not be caught and reported. Modify the QA query to double-check that all records that are matched to a stream make it through the analysis.
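A minimal sketch of the intended check, assuming the counts are pulled from the matched-events table and the final output table (the function name and inputs are illustrative):

```python
def check_counts(n_matched: int, n_output: int) -> None:
    """Raise if matched records were dropped between the matching
    step and the final output tables."""
    if n_output != n_matched:
        raise ValueError(
            f"QA failure: {n_matched} matched records in, "
            f"{n_output} records out"
        )

# in practice both counts would come from SELECT count(*) queries
check_counts(345687, 345687)  # passes silently
```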

write outputs to non-whse schema, if writing to bcfishpass consider moving code to bcfishpass repo

Not a priority, but while I am tidying up and hardening the various dependencies of bcfishpass, I am thinking it would be nice to keep WHSE_FISH and similar DataBC schemas clean (holding only the source data) and generally write derived data to separate schemas.

This repo is generally for bcfishpass-related work; it could be a sub-repository within the bcfishpass repo, and outputs could be written to the bcfishpass schema.

CI/CD

The job could be scheduled and run with GHA, and the result posted to a file somewhere (rather than requiring every user to run the job).

cannot add postgres view to arcgis desktop

Seen on ArcGIS 10.1. Perhaps this isn't an issue for current versions, but it could be valuable to write whse_fish.fiss_fish_obsrvtn_events_vw as a table or maybe a materialized view.

suggest a readme hint that rebuilding from older versions of repo should perhaps remove all targets

It makes sense that a build of fiss_fish_obsrvtn_events_vw via sql/schema.sql does not happen with calls to

$ rm .make/fiss_fish_obsrvtn_pnt_sp
$ make

but it took me a while to clue into what was going on. I think the view wasn't there before because the tables in the db were dumped in from an older db, before the upgrade to the view.

Does it make sense to add "for a full rebuild, run:

rm -d -f .make

" to the Makefile for the make white belts?

increase distance threshold for points with no match via fwa-50k lookup

We are perhaps just a little too conservative with the 100m cutoff for matching observations to streams where there is no lookup-table confirmation of the match. A few valid-seeming observations do not make it into the output.

Perhaps go up to ~150m or 200m in cases where there is no lookup match?

reconfigure for easier refresh

To avoid having to CASCADE if any views etc depend on the data:

  • use views instead of temp tables where practical
  • create tables once, truncate/reload/refresh when re-processing
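A sketch of the create-once / truncate-reload pattern, using sqlite3 as a stand-in for PostgreSQL; because the table is never dropped, a dependent view keeps working across reloads and no CASCADE is needed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, species TEXT)")
# a downstream view that depends on the table
con.execute(
    "CREATE VIEW events_vw AS "
    "SELECT species, count(*) AS n FROM events GROUP BY species"
)

def reload(rows):
    # truncate/reload inside one transaction; the table (and therefore
    # the dependent view) is never dropped
    with con:
        con.execute("DELETE FROM events")
        con.executemany("INSERT INTO events VALUES (?, ?)", rows)

reload([(1, "CO"), (2, "CO"), (3, "ST")])
reload([(1, "CO")])  # re-process: no DROP ... CASCADE required
print(con.execute("SELECT species, n FROM events_vw").fetchall())  # [('CO', 1)]
```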

installation for windows

Incredible tool. I hope to get it running, and am working on installing it on Windows. So far I have done the following. I am not experienced with most of these tools, so please pardon my ignorance and newbieisms.

I did not:

  • get started with a fresh Windows Subsystem for Linux (WSL) (Ubuntu 20.04.1 LTS) - maybe this was my first mistake?!

I did:

  • Install virt env and bcdata as per the instructions at bcfishpass
  • Download wget (actually just found it elsewhere on my system) and install into the same bcfp virt env directory
  • git clone the bcfishobs repo locally https://github.com/smnorris/bcfishobs.git
  • rename 01_load.sh to 01_load.bat and 02_process.sh to 02_process.bat
  • change the line continuation characters in both bat files from \ to ^

When trying to run 01_load.bat from the Anaconda prompt (after cd bcfishobs and within the bcfp virt env), an issue arose at:

  1. (bcfp) C:\scripts\bcfishobs>ogr2ogr -f PostgreSQL "PG:host=$PGHOST user=$PGUSER dbname=$PGDATABASE port=$PGPORT" -lco OVERWRITE=YES -lco SCHEMA=whse_fish -nln wdic_waterbodies_load -nlt NONE whse_fish.wdic_waterbodies.csv

fails with

ERROR 1: PQconnectdb failed.
invalid integer value "$PGPORT" for connection option "port"

tried

export PGOGR='host=localhost user=postgres dbname=postgres password=postgres port=5432'

and

export PGHOST=localhost etc.

as per README.md on bcfishpass, to no avail, with the message

'export' is not recognized as an internal or external command

Below is a screenshot of my env variables. Should they have dollar signs in front of them, perhaps?

[screenshot of environment variables]

  2. Line 45 of 01_load.bat fails at psql -c "CREATE TABLE whse_fish.fiss_obstacles_unpublished (id integer .......... due to unexpected line endings. I could not find a solution other than re-writing the command with no spaces and copying it into the terminal directly. This worked, but is there another way that you know of?

  3. bcdata bc2pg WHSE_FISH.FISS_OBSTACLES_PNT_SP --fid FISH_OBSTACLE_POINT_ID fails with the following error message. Any idea what might be going wrong? Thank you!

(bcfp) C:\scripts\bcfishobs>bcdata bc2pg WHSE_FISH.FISS_OBSTACLES_PNT_SP --fid FISH_OBSTACLE_POINT_ID
Traceback (most recent call last):
  File "c:\users\allan\.conda\envs\bcfp\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\allan\.conda\envs\bcfp\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\allan\.conda\envs\bcfp\Scripts\bcdata.exe\__main__.py", line 7, in <module>
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\bcdata\cli.py", line 350, in bc2pg
    conn = pgdata.connect(db_url)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\pgdata\__init__.py", line 20, in connect
    return Database(url, schema, sql_path=sql_path, multiprocessing=multiprocessing)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\pgdata\database.py", line 41, in __init__
    self.engine = create_engine(url)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\sqlalchemy\engine\__init__.py", line 500, in create_engine
    return strategy.create(*args, **kwargs)
  File "c:\users\allan\.conda\envs\bcfp\lib\site-packages\sqlalchemy\engine\strategies.py", line 56, in create
    plugins = u._instantiate_plugins(kwargs)
AttributeError: 'NoneType' object has no attribute '_instantiate_plugins'

CLI options

Options that could be valuable to add to the CLI:

  • output table name
  • initial matching threshold (currently 1500m)
  • modify the current 100m / 500m thresholds
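Sketched with argparse below; the option names and defaults are illustrative, not existing flags:

```python
import argparse

# hypothetical CLI exposing the thresholds the processing currently hard-codes
parser = argparse.ArgumentParser(prog="bcfishobs")
parser.add_argument("--out-table", default="whse_fish.fiss_fish_obsrvtn_events")
parser.add_argument("--max-distance", type=int, default=1500,
                    help="initial matching threshold (m)")
parser.add_argument("--match-distance", type=int, default=100,
                    help="match distance without lookup confirmation (m)")
parser.add_argument("--match-distance-lookup", type=int, default=500,
                    help="match distance with lookup confirmation (m)")

args = parser.parse_args(["--max-distance", "2000"])
print(args.max_distance, args.match_distance)  # 2000 100
```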

handle observations at the same location

In the output event table, position is the primary key and we insert with ON CONFLICT DO NOTHING; this means records are lost if they are at the same location as another. Points with slightly different source locations can be snapped to exactly the same point on the network.

Fixing #33 would help ensure problems like this are found immediately
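One way to keep all source records, sketched in Python with illustrative column names: aggregate observation ids per snapped position instead of discarding conflicts.

```python
from collections import defaultdict

# two source points snapped to the same position on the network
observations = [
    {"fish_observation_point_id": 101, "blue_line_key": 354090000, "measure": 309.0},
    {"fish_observation_point_id": 102, "blue_line_key": 354090000, "measure": 309.0},
    {"fish_observation_point_id": 103, "blue_line_key": 354090000, "measure": 500.0},
]

# aggregate ids per position rather than ON CONFLICT DO NOTHING
events = defaultdict(list)
for o in observations:
    events[(o["blue_line_key"], o["measure"])].append(o["fish_observation_point_id"])

print(events[(354090000, 309.0)])  # [101, 102] - neither record is lost
```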

minor reorganization for easy updates

Load data that doesn't change in script 1; download BCGW data and process in script 2.
This way, updates can easily be processed via a single script called from a cron job.

slight inconsistencies

Are the (very) minor changes to qa_summary.csv outputs due to imprecise queries, or to changes in the source data?

Increase number of observation data points by incorporating point_type_code = 'Summary'

I have noticed that 01_clean-fishobs.sql filters on o.point_type_code = 'Observation'. This excludes o.point_type_code = 'Summary'.

I am not sure exactly when the submission requirements to fulfill permit conditions changed, but I do know that since 2008 a permit holder need only submit observations as a summary. For this reason, there will be many thousands of observations that are not included in the current bcfishobs outputs.

The requirement to only summarize fish captured by species and life stage is very unfortunate, as so much good information is lost that way, in my opinion. Classifying fish by life stage is likely extremely error-ridden and is very easily done afterwards when each individual fish is input. Nevertheless, the summary does contain an accurate account of how many fish of each species are captured, and is often the only data submitted (example: fish_observation_point_id = 471343).

I am obviously not expecting that you are going to jump on this or change anything at this point but it is something that I will explore in the next couple of months as we look towards densities of fish captured in electrofishing sites and fish observations vs. stream/watershed characteristics.

Btw, this package is a thing of beauty. ⭐

reduce measure precision

Observations at measures such as 309.442m and 309.761m are in essentially the same spot.
Try rounding the values, and perhaps storing downstream_route_measure as an integer.
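For illustration, rounding to the nearest metre as suggested; note the two example measures still land on distinct integers, so only sub-metre noise is dropped:

```python
# round downstream_route_measure to the nearest metre before storing
measures = [309.442, 309.761]
rounded = [round(m) for m in measures]
print(rounded)  # [309, 310]
```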

remove unneeded dependencies

pgdata - unnecessary
psqlcsv - use \copy or, even better, create a table to hold the summary and write the results with a timestamp each time the job is run

WHSE_FISH schema does not exist

mkdir -p .make
mkdir -p data
# wdic waterbodies table for relating 50k waterbodies to fwa
wget -qNP data https://hillcrestgeo.ca/outgoing/public/whse_fish/whse_fish.wdic_waterbodies.csv.zip
  HTTP/1.1 200 OK
  Server: nginx/1.18.0 (Ubuntu)
  Date: Wed, 08 Feb 2023 19:34:14 GMT
  Content-Type: application/zip
  Content-Length: 16294451
  Last-Modified: Sat, 30 Oct 2021 00:26:53 GMT
  Connection: keep-alive
  ETag: "617c914d-f8a233"
  Accept-Ranges: bytes
unzip -qjun -d data data/whse_fish.wdic_waterbodies.csv.zip
# load via ogr because it makes cleaning the input file easy.
ogr2ogr \
	-f PostgreSQL \
	"PG:postgresql://postgres@localhost:5432/bcfp_test" \
	-lco OVERWRITE=YES \
	-lco SCHEMA=whse_fish \
	-nln wdic_waterbodies_load \
	-nlt NONE \
	data/whse_fish.wdic_waterbodies.csv
ERROR 1: ERROR:  schema "whse_fish" does not exist
LINE 1: CREATE TABLE "whse_fish"."wdic_waterbodies_load" ( "ogc_fid"...

load and process data in a temp schema then flush to `bcfishobs` when complete

Recreating the schema and tables on every load also means updating grants for application roles; therefore, errors on refresh currently break any downstream applications.
To fix:

  • create all empty bcfishobs relations on setup (only run once)
  • download and run processing in temp schema
  • flush data to bcfishobs only on successful completion of entire load
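A sketch of the staging-then-flush idea, using sqlite3 as a stand-in (PostgreSQL schemas are simplified to tables here): the long-lived table is created once, and data only moves into it in a single transaction on success.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# created once on setup; in PostgreSQL, grants on this table persist
con.execute("CREATE TABLE bcfishobs_events (id INTEGER PRIMARY KEY)")
# per-run staging area (stand-in for a temp schema)
con.execute("CREATE TABLE staging_events (id INTEGER PRIMARY KEY)")

# download + processing writes to staging only
con.executemany("INSERT INTO staging_events VALUES (?)", [(1,), (2,)])

# flush on successful completion: atomic, so downstream consumers
# never see a half-loaded or dropped table
with con:
    con.execute("DELETE FROM bcfishobs_events")
    con.execute("INSERT INTO bcfishobs_events SELECT * FROM staging_events")

print(con.execute("SELECT count(*) FROM bcfishobs_events").fetchone()[0])  # 2
```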

cascade when dropping view

psycopg2.errors.DependentObjectsStillExist: cannot drop table whse_fish.fiss_fish_obsrvtn_pnt_sp because other objects depend on it
DETAIL:  materialized view bcfishpass.observations_vw depends on table whse_fish.fiss_fish_obsrvtn_pnt_sp
materialized view bcfishpass.freshwater_fish_habitat_accessibility_model_salmon_vw depends on materialized view bcfishpass.observations_vw
materialized view bcfishpass.freshwater_fish_habitat_accessibility_model_steelhead_vw depends on materialized view bcfishpass.observations_vw
materialized view bcfishpass.freshwater_fish_habitat_accessibility_model_observations_vw depends on materialized view bcfishpass.observations_vw
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

generate primary key for falls

fish_obstacle_point_id is only guaranteed to be a unique identifier within a given download; it is not stable between downloads (when new records are loaded).

Generate a stable primary key based on geom and some other combination of columns
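One possible approach, sketched with illustrative columns: hash rounded coordinates plus an attribute, so the id is deterministic across downloads and insensitive to tiny coordinate noise.

```python
import hashlib

def stable_falls_id(x: float, y: float, height_m: float) -> str:
    # round coords so tiny positional jitter between downloads
    # does not change the id
    key = f"{round(x, 1)}:{round(y, 1)}:{height_m}"
    return hashlib.sha1(key.encode()).hexdigest()[:16]

a = stable_falls_id(1190000.04, 450000.04, 5.0)
b = stable_falls_id(1190000.01, 450000.02, 5.0)  # same point within rounding
print(a == b)  # True
```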

output from PG11 and PG12 are slightly different

$ diff qa_match_report_1.csv diffs_pg12/qa_match_report_1.csv
diff --git a/qa_match_report_1.csv b/diffs_pg12/qa_match_report_1.csv
index 38c8cbd..05e45b8 100644
--- a/qa_match_report_1.csv
+++ b/diffs_pg12/qa_match_report_1.csv
@@ -1,11 +1,11 @@
 match_type,n_distinct_events,n_observations
-A. matched - stream; within 100m; lookup,54010,166828
-B. matched - stream; within 100m; closest stream,6668,19294
-C. matched - stream; 100-500m; lookup,4542,30501
-D. matched - waterbody; construction line within 1500m; lookup,11775,113250
-E. matched - waterbody; construction line within 1500m; closest,1360,15814
-TOTAL MATCHED,78355,345687
-F. unmatched - less than 1500m to stream,1740,5348
+A. matched - stream; within 100m; lookup,53916,166813
+B. matched - stream; within 100m; closest stream,6694,19289
+C. matched - stream; 100-500m; lookup,4549,30510
+D. matched - waterbody; construction line within 1500m; lookup,11828,113250
+E. matched - waterbody; construction line within 1500m; closest,1367,15814
+TOTAL MATCHED,78354,345676
+F. unmatched - less than 1500m to stream,1742,5359
 G. unmatched - more than 1500m to stream,103,722
-TOTAL UNMATCHED,1843,6070
-GRAND TOTAL,80198,351757
+TOTAL UNMATCHED,1845,6081
+GRAND TOTAL,80199,351757

bcdata 0.8

the download date is now logged to bcdata.log, not public.bcdata

minor cleanup

Reference a DATABASE_URL variable for all psql commands so that the various $PG* vars don't need to be set

consistent fish_obsrvtn_pnt_distinct_id

instead of using a serial PK, ensure the PK is consistent between runs by deriving it from the stream network reference, like this:

id bigint
     GENERATED ALWAYS AS ((((blue_line_key::bigint + 1) - 354087611) * 10000000) + round(downstream_route_measure::bigint)) STORED PRIMARY KEY
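For illustration, the same key computed in Python (assuming the constant 354087611 is an offset chosen so ids start small): two events on the same blue_line_key differ only in the measure component.

```python
def event_id(blue_line_key: int, downstream_route_measure: float) -> int:
    # mirrors the generated column: offset blue_line_key, shift left by
    # 10^7, add the measure rounded to the nearest metre
    return ((blue_line_key + 1) - 354087611) * 10_000_000 + round(downstream_route_measure)

print(event_id(354087611, 0))      # 10000000
print(event_id(354087611, 100.4))  # 10000100
```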

add FISS obstacles

Use existing logic to also match WHSE_FISH.FISS_OBSTACLES_PNT_SP to FWA features. The obstacles data is already better matched to FWA streams than observations, but there are still records that would benefit from matching via the xref rather than just finding the closest stream.

improve matching of observations to streams based on watershed code

On submission and load to the BCGW, fish observations are snapped to 1:20k streams. However, observations may occur on unmapped streams, and older observations may not have been snapped to 1:50k streams (or not snapped to a stream at all).

Modify the logic for matching observations to streams (not waterbodies) to use watershed codes when the distance from the stream is greater than a small threshold. Review the logic for matching streams to waterbodies to see if using the watershed code would improve the match.
